NEWS, EDITORIALS, REFERENCE
Passing Inline Arguments
Happy Hallowe'en C64 enthusiasts. This post is about passing inline arguments to subroutines in 6502. Hopefully it won't be too spoooky.
The KERNAL and Register Arguments
As limiting as just 3 bytes of arguments might seem, there are a lot of routines one could write in this world that can get by with so few. As mentioned above, the KERNAL uses processor registers exclusively for its arguments. There are times when three bytes aren't enough. But to handle this the KERNAL simply requires you to call one or more preparatory routines first.
The chief example being opening a file for read. In a C program this can be done with a single function, but in the KERNAL you have to supply several arguments. A Logical File Number (1 byte), a device number (1 byte), a secondary address or channel number (1 byte), a pointer to a filename string (2 bytes), plus the length of the string (1 byte). Count 'em up, that's 6 bytes. The KERNAL therefore requires you to call SETLFS with the first three arguments, and then SETNAM with the string pointer and length, before you can call OPEN. (And incidentally, OPEN doesn't take any input arguments.) A bit more complex than C, but not outrageous.
The advantage to processor arguments is that they're fast. Load the .A register with a value, that could be as short as 2 cycles. JSR to a routine that takes .A as its only argument, that takes a standard 6 cycles, but .A is already loaded and ready to use by the code in that routine. I mean, there is virtually no overhead. Especially if .A was already set by a previous routine's return, there is sometimes literally zero overhead in passing that data through to the next subroutine. So, if you can get away with using just 3 bytes of arguments, it's fast, it's really damn fast.
But there are downsides. Registers are effectively little more than in–processor global variables. Global variables which certain instructions operate directly on, and which those instructions require to work at all. An indirect indexed load, for example, can only use the .Y register, and must use the .Y register. So if you use the .Y register to pass in an argument, but the code in the routine needs to do an indirect indexed instruction, the .Y register has to be overwritten and whatever argument was passed in on it has to be saved somewhere so as not to be lost. This makes matters somewhat more complicated.
The other downside is that if your routine is using .A, .X and .Y for holding on to temporary state, such as the index of a loop, but inside the loop you JSR to another routine, it is necessary to know which registers that other routine (and more complicated, any subroutines that it may call, and so on ad infinitum) will use. The KERNAL routines are all very self–contained. They generally don't go off and call other routines, so their register usage is predictable. In fact, the documentation tells you exactly which registers are used (and thus disrupted) by each KERNAL routine. For example, VECTOR, takes .X and .Y as inputs, and returns a value in .X, but it uses .A in the process. So before calling VECTOR if you care about what's currently in .A you'd have to back it up first.
For the accumulator, backing up and restoring is easily done by pushing and pulling from the stack, but to push and pull .X or .Y from the stack it is much more complicated. The NMOS 6502, and the 6510 (NMOS 6502 variant used in the C64 and C128)2 there is no way to push .X and .Y directly to the stack, nor to pull them directly off the stack. You have to first transfer them to .A, then push that. And you similarly have to pull to .A and transfer to .X or .Y. This adds complication, because it involves disrupting .A. In the end, the most efficient way to write 6502 code that uses and worries about the registers for argument passing and local variables, is to write code carefully by hand. Like an art form.
C Language and the Stack
C is a super popular language, and without even considering how much code is still written in C today, most other modern languages are modeled syntactically on C. Don't believe me, the Wikipedia article, List of C-family programming languages, lists over 60 languages in the C family. One of the most recent being Apple's Swift which was first released just in 2014. The original C language began development in the late 1960s and first appeared in 1972.
Assembly is a step above Machine Code, by abstracting absolute and relative addresses with labels, and instructions with mnemonic codes, but the assembly code maintains a strict one–to–one relationship between code typed and instructions output by the assembler. The assembly programmer still has complete control over the precise gyrations the CPU will go through. C is a compiled language, which means by definition the programmer is not the one writing the Machine Code, the compiler is. The compiler is not an artist. It needs simple, reliable rules, that operate on a very local scale to produce predictable output that will work. There is no way that a compiler would go off looking through the code of routines that the current routine will call, to see which registers it will use and thus which are safe for its own use. It's just, too artsy. It's not rigorous enough.
Instead, the way C works is that all arguments are passed on the stack. Not only are all arguments on the stack, but all local variables of a function are also on the stack, and the return value is sent back via the stack too. The big advantage is that every instance of a function being called has its own unique place in memory, and that place is dynamically found and allocated. This makes recursive code really easy. A single function can call itself over and over, deeper and deeper, and each call instance has its own set of variables that don't interfere with the previous instances, because they are each stored further and further into the stack.
Passing arguments is easy too. The declaration of the function (or the function's signature) defines how many bytes of arguments it takes. When the function is called the caller puts that much data onto the stack and the function uses the stack pointer to reference those variables.
So, why doesn't the C64 programmer use C or a C–like solution?
I said it so poetically in my review of World of Commodore 2016, after listening to Bil Herd talk about the development of the C128 and where it was in the market at the time. People back in the 60s and 70s already knew what could be accomplished by a computer with an incredible amount of computing power. Computers in those decades already had 16 and 32 bit buses, megabytes of RAM and complex, timeslicing, multi–user Unix operating systems. The only problem is that these computers cost millions of dollars and were as big as modern cars. But, none–the–less, they had many of the modern computer amenities. What made the 6502 so special was that it was a complete central processing unit, all crammed together into a single IC package that could be sold for just a few dollars a piece, but not that it was a powerful and fully featured CPU. As I said in my article, early home computers really were just toys, in the eyes of the big business computers of the preceding 15 to 20 years.
Take the PDP-11, for example. The PDP-11 was commercially available starting in 1970, contemporaneous with the development of C. It was a very popular 16-bit minicomputer by DEC, that looks like an enormous stand–up freezer with tape reels on the front. Its main processor was in fact the design inspiration for both the Intel x86 family and the Motorola 68K family. And the C language explicitly took advantage of many of its low–level hardware features.
Just as a point of comparison, while the 6502 has 3 general purpose 8-bit registers, the PDP-11 has 8 general purpose 16-bit registers. But the the most important advantage is its addressing modes. The PDP-11 has, among others, 6(!) stack relative addressing modes and efficient support for multiple software stacks in addition to the hardware stack. The 6502 has a single, fixed address hardware stack, a single 8-bit stack pointer, no dedicated features for software stacks, and NO stack addressing modes at all. In fact the 6502 has just two instructions that affect the stack pointer. TSX and TXS. Transfer the Stack Pointer to .X register, and vice versa respectively. What this means is that performing operations directly on data in the stack is very crippling.
It is possible to write in C for the 6502, there is a 6502 C Compiler. But my understanding, from conversations I've had on IRC, is that it works by emulating in software the features of the CPU that are not supported natively. This is a recipe for slow execution. The bottom line is, no matter how convenient C's stack–oriented variables and arguments are, C was not designed for the 6502. And the 6502 is simply not well suited to run C code.
Alternative Argument Passing
Okay, now we know how the KERNAL works and what the limitations are with its techniques. And we also know why it is that the C64 doesn't just standardize on C (or a derivative) like every other modern computer. But we are still left with the problem of how to pass more than 3 bytes worth of arguments to a subroutine.
I can think of at least two ways to work around the limitation and then I'll go into some detail on a third and fairly clever solution. I found these tricks and techniques by reading the Official GEOS Programmer's Reference Guide. It discusses various ways that the GEOS Kernal supports passing arguments to its subroutines.
Zero Page on the 6502 has been described as a suite of pseudo registers. Almost every instruction has a Zero Page addressing mode that can work with data there more quickly than in higher memory pages. But, there is a limited amount of ZP space, just 256 bytes. And $0000 and $0001 are actually taken by the 6510's processor port.
But one trick that GEOS does is reserves 32 bytes of Zero Page as a set of 16, 16-bit registers, which it numbers R0 through R15. After that, it treats them very much the same way we would treat the real registers, and all their limitations described above. I said that one problem with the CPU registers is that they are effectively global variables. And so are the GEOS ZP registers. The advantage is that you've got 16 of them, and they're 16-bits wide. So various routines that need several 16-bit arguments are simply documented as, you put a 16-bit width into R5, you put an 8-bit height into R6, you set a fill pattern code into R3, and a buffer mode in R1, and then you jump to this subroutine and it draws a square and blah blah blah. I'm making up the specifics, but you get the point.
When the subroutine is called, it expects that you have populated the correct global ZP "registers" with suitable values, which it reads and acts upon. It also comes with a suite of Macros that help you set those registers. It's a bit slower than using real registers the way the KERNAL does, and has many of the same limitations, but it greatly expands the space, up to 32 bytes from just 3 bytes. You also have to have a lot of Zero Page dedicated to this solution. This is something GEOS could do because it replaces the C64 KERNAL entirely, and can do whatever it wants with the entire computer's memory space. This is more or less a solution I was able to dream up on my own, but in a less structured way. You can always just shove a few bytes directly into known system working space addresses, and then call a routine that will use those work space addresses. But, there is something that feels a bit dirty about this. What happens if you change the KERNAL and reassign some common work space addresses? Likely, old apps will stop working and will probably crash the system.
An alternative solution that I have come up with on my own, I don't recall seeing this in the GEOS Kernal, is to pass a pointer to a structure. This is the essential behavior in C64 OS of the Toolkit and its view objects. And it is also the way File References work. When calling a subroutine that operates on a file, for example, rather than with the KERNAL having to manually pass 5 different properties of varying 8- or 16-bit widths, the properties are set on a local (or allocated) block of memory. Then a pointer to that block of memory is put in .X and .Y (lo byte, hi byte) and the routine is called. The routine generally needs to write this pointer to zero page, and then use the .Y register as an index in indirect indexed mode to work with those properties.
There is no way around needing to use some global zero page space. As indirect pointers can only be referenced through addresses in zero page. However, moving the responsibility of what free space in ZP to use from the calling routine to the called routine feels much less dirty.
Using structures and passing pointers to structures is a pretty good solution, and it solves a lot of problems. But it isn't the best solution for all problems. Which is why we'll now look in more detail at one more solution I found in the GEOS Programmer's Reference Guide.
Inline Argument Passing
Let's just use the concrete example of what I've run into in C64 OS to understand the issue. C64 OS has a file module, which knows how to work with File Reference structs, and is a light wrapper for working with streams of data. The essential functionality I want is: fopen, fread, fwrite and fclose. Fopen will use a file reference struct to identify the device, partition, path and filename, and also dynamically stores the logical file number which doubles as the status for whether the file is currently open. Using a file reference requires a minimum of a 2–byte pointer. The file can be opened either for read or for write. After which, the pointer to the file reference can be used with fread or fwrite.
If the file was opened for read, for example, then we will want to call fread to actually read some data out of it. The arguments for fread will be, a pointer to the file reference, a pointer to a buffer of memory into which the data should be read, and a 16–bit length of data to read. That's 6 bytes of arguments.
If on the other hand the file was opened for write, then we want to call fwrite to write some data in memory to that file. The arguments will be, a pointer to the file reference, a pointer to a buffer in memory where some data is to be written, and a 16–bit length of data to write. That is also 6 bytes of arguments.
Closing a file reference is pretty straight forward. It only needs the 2 byte pointer to an open file reference. It just reads the logical file number out of the reference struct, closes the file and releases the LFN.
But now let's consider opening the file. If opening for read, not too hard. We need the pointer to the file reference, plus probably 1 byte to indicate the read/write direction. That's only 3 bytes, we could get away with putting the read/write flag in .A and the file ref pointer in .X and .Y.
When it comes time to opening a file to write, though, it becomes a bit more needy. You need the file ref pointer (2 bytes), the read/write flag (1 byte), plus a file type byte (1 byte, USR/SEQ/PRG), and you also need to indicate whether an overwrite should be permitted if the file already exists (1 byte), or if an append to an existing file should happen (1 byte). Naively, that's 6 bytes. Although, you could assign the Read/Write, to 1 bit, the file type to 2 bits and the overwrite and append to 1 bit each, and pack all that into a single byte argument. Dealing with bits makes your code fatter though. And nothing gets around the 6–byte requirements of the fread and fwrite routines.
Some GEOS routines can take inline arguments. Here's how it works:
You do a JSR to a subroutine, and in the calling code, you follow the JSR line with a series of argument data bytes. Effectively, as many as the routine needs. They're called inline arguments because they follow inline with your code.
The routine that accepts inline arguments begins by transferring the Stack Pointer to the .X register. It then uses the stack pointer to find the return address that the JSR pushed on. This address is actually the address where the inline arguments begin. It puts this address into a zero page pointer, where it can use indirect indexed calls to read the arguments (and optionally later overwrite them too if that made sense). But, it must also change the values on the stack to move the return address beyond the end of the inline arguments. When the routine does an RTS, the updated return address will be pulled off the stack and execution will continue just following the inline arguments. It's very clever. It's not always what you need, but it's a really great tool to have in the box if you need it.
Reading about it in the GEOS programmer's reference guide, it sounds a bit complicated to use. So I've written some sample code to show and explain how easy it really is. And how you can abstract the code to reuse it for different routines that take differing numbers of arguments.
Inline Arguments in Practice
Okay, so let's walk through this code and see how it works. We assemble to $0801 which is where a BASIC program should start. And the first include is the basic preamble. This will do a SYS2061, to jump to the first line of our assembly program which starts at line 6. The kernal.s include just defines the names of the kernal routines.
Getting right to the meat and potatoes of what we're showing, line 6 has a jsr getargs, this is an example routine that takes 3 bytes of inline arguments, and will print the three bytes in hexadecimal notation. On lines 7, 8 and 9 you can see that we've supplied three inline bytes of data. These are the arguments, immediately following the JSR instruction. At line 11 we call jsr getargs again, to show that in this case the 3 bytes of inline arguments are in the form of a .text string. It doesn't matter how we inline the arguments, all that matters is that the routine expects there to be 3 bytes and execution is going to continue just after those 3 bytes.
The rts on line 14 is the end of the program. Following this we include our handy inttohex.s. It has a routine tohex that takes a value in the accumulator and returns it as two hexadecimal (0 to F) PETSCII characters in the .X and .Y registers. prnthex is a simple little routine that will print the output of the tohex routine. First it CHROUTs a "$" followed by the two hexadecimal characters and lastly a new line.
And now for the getargs routine and how it works with inline arguments:
I like to comment my routines showing the ins and outs with arrows. In this case I've labeled my inline argument inputs as a1, a2 and a3. Following the right pointing arrow is usually a comment on what the argument holds. I've put in here .byte to indicate the length of the argument. In your calling routine, you could state the inline argument, for example, as a .word, then in the comments mention it's a .word, and you'll have to read two bytes to grab the whole thing.
Lines 5 to 16 do all the magic. The JSR instruction automatically pushes the return address onto the stack. That address is effectively just a 16-bit number. We have to add a number to that address that is equal to the total size of the inline arguments. In this case our args are 3 bytes. So we have to add 3 to that 16-bit return address. But adding 3 to the low byte could overflow that byte, so we have to do a full 16-bit add, using the carry. To begin the 16-bit add we start by clearing the carry, on line 5.
Next, we need to grab the current stack pointer. To do this we use the only instruction the 6502 has for getting the stack pointer, TSX, which transfers it to the .X register. To understand what happens next you have to know how the stack works. The stack consists of the entire first page of memory $0100 to $01FF. But things are pushed onto the stack starting from the top working down. The Stack Pointer is an offset into $01xx and always points to where the next pushed byte will go. Thus, when the stack is empty, the stack pointer is $FF. When the stack is full the stack pointer is $00. With each push, the stack pointer gets decremented, with each pull the stack pointer gets incremented.
After a JSR, whatever the stack pointer is, the return address is at the stack pointer +1, and the stack pointer +2. Here's the thing though, you'd think that once .X holds the stack pointer, you would need to manipulate .X to find the bytes you're after. Instead though, to access bytes relative to the stack pointer you can merely change the absolute start address from $0100 to something bigger or smaller.
Let's say the stack pointer is $F5. If you pushed something onto the stack it would go to $01F5, that means the last two things pushed onto the stack are actually stored at $01F6 and $01F7. Do a TSX, now .X is $F5. You don't need to INX and then read from $0100,X you can simply read $0101,X (which is $0101+$F5) add them together and you get $01F6. Similar to get $01F7 you don't need to INX, just start the absolute address as $0102,X (which is $0102+$F5).
Okay, now we know how we're referencing the address on the stack. Here's the next thing to know. When a JSR happens, the return address is pushed onto the stack high byte first, low byte second. So $0101,X references the low byte, $0102,X references the high byte. Now we're ready to see what happens.
Line 8 grabs the low byte from the stack. Line 9 stores it to a Zero Page address of our choice. It has to be stored somewhere in zero page to be able to read through it as a vector. Once we've stored that low byte, we can add 3 to it and write it straight back into the stack whence it came. Lines 10 and 11.
Adding 3 may have overflown the accumulator, but if it did, the carry is now set. Line 13 grabs the high byte and stores it at $fd, the high byte of the zero page vector. Then we add 0, which includes the carry and completes the full 16-bit add. And we write the new high byte straight back to the stack whence it came. And we're done.
The stack now contains a return address that is 3 bigger than it used to be, just past the end of the inline arguments. And the zeropage addresses $FC and $FD contain a vector that points at the block of inline arguments.
The only thing left to do is read those arguments. Here's one more trick of the JSR return address, it is actually the address of the next instruction... minus 1. But the RTS instruction expects that and returns you to the correct place. So, adding three to the return address was certainly the right thing to do. However, the vector at $FC/$FD actually points to one byte before the inline arguments. Again, no problem, we don't have to waste cycles and memory adding 1 to the vector, we just access the three arguments with offsets 1, 2 and 3, instead of 0, 1, and 2.
Lines 18 to 20, 22 to 24 and 26 to 28 show that we set .Y to the argument offset, then do an LDA indirect indexed through the zero page vector to grab that argument. The jsr prnthex simply prints that argument in hexadecimal which is what this example routine is supposed to do. Note that you don't need to get the arguments in any particular order. You can read any of them whenever it makes sense to in the routine. You can ignore some of them, or even modify and write a new value back to that inline argument memory location. The world's your oyster. You can have up to 255 bytes of inline arguments without needing to modify any of the initial stack manipulation logic.
So you might be asking, well that's neat, but why bother with all that? You could just put a few bytes at the end of your routine to hold the arguments, then load up .X and .Y with a pointer to those bytes and call the routine. The routine then only needs to write .X and .Y to the zero page addresses of its choice and boom you're done. Like this:
And yeah, that works too. But there are downsides. When you put the args below the routine:
- You have to use a label to find them.
- They are separated from the routine call they apply to.
- You have to add code LDX/LDY above routine call.
- If there are many blocks of args for different subroutines, it gets messy fast.
And all the converse are advantages for inline arguments. They don't need a label, because their address is known implicitly from the address of the JSR's return address. They sit together in your code with the JSR they apply to. If you have multiple JSRs each with their own arg blocks it doesn't get progressively messier. And you don't need any extra code on the calling side to set it up.
There is just one downside, in my opinion. The called routine has to have a fair chunk of code to manipulate the stack and set up the vector. Instead of 4 bytes for the end argument block, you need 22 bytes for the inline arguments. And if you had many routines that all use inline args those 22 bytes start adding up. Read on for one last solution to that problem!
A Slightly More Advanced Trick
For end arguments, you need 4 bytes in the called routine to setup the vector. And you need 4 bytes in the calling code to setup the .X and .Y pointer to the arguments (plus a label). So you actually need 8 bytes to "pass" the arguments. That's 8 bytes on top of the byte count of the arguments themselves.
With inline arguments, you need 0 bytes in the calling code, but 22 bytes in the called routine. But, if you're going to use this trick for numerous routines with inline arguments, you can move those 22 bytes into a shared argument–pointer–configuration routine, like this:
Now we've got a new routine called setargptr. It will handle the work of manipulating the stack and setting up the zero page vector. There are some gotchas. First, if we're going to use this routine for multiple routines that accept inline arguments, we need a way to customize how many inline arguments to expect. It can't just be hardcoded at 3.
We pass to this routine in the accumulator the number of inline arguments there shall be. The first thing this routine now does is writes the inline argument count (self–modifying code here) to a new label argc(+1). This overwrites the value to add to the low byte of the return address. The rest of the 16-bit add works as it did before.
The second gotcha is that the real routine, getargs in this case, has to call setargptr. But this call pushes its own return address onto the stack. So the place on the stack where our arguments are is 2 bytes further back. That's easy to deal with though, the new stack absolute offsets are simply $0103 and $0104, instead of $0101 and $0102. That's it.
The only other gotcha is that the setargptr routine now bears the responsibility for which Zero Page addresses to use for the vector. And thus, every routine that uses inline arguments and uses setargptr to manipulate the stack, must all use the same zero page vector. But, depending on your situation, that might be just fine.
In the actual getargs routine now, instead of all that messy code to manipulate the stack we just load the accumulator with the expected number of inline args, and JSR to setargptr. Boom, done. Now, it's 5 bytes per routine that uses inline arguments instead of 8 for end arguments. We actually save 3 bytes per routine. But, the setargptr routine is now 26 bytes. End arguments require 4 bytes per routine, plus 4 bytes per call. So if we write just a few routines with inline arguments, and call those routines a few times. We quickly end up using less total memory than if we used end arguments. But, to be honest, saving some memory is icing on the cake. Inline arguments are just cool.
Thanks for the tip GEOS! GEOS has got a few other tricks I'll probably end up exploring in future posts. Stay tuned.
- At some point in the future, I'd like to post my own version of the C64 KERNAL documentation as a Programming Reference. But, it's a lot of work. For now, you can read about the KERNAL in the C64 Programmer's Reference Guide. Chapter 5: Basic to Machine Language (PDF) [↩]
- The CMOS version of the 6502, the 65c02, includes extra instructions for pushing and pulling the .X and .Y registers directly to and from the stack, and other handy things. Unfortunately, there is no such 65c10 processor. [↩]