NEWS, EDITORIALS, REFERENCE
Implementing Factorial in 6502
Programming in 6502 has been an interesting adventure for me. I dabbled with it back 8 or 10 years ago. But I ultimately settled into coding most of my C64 projects in C. C felt sufficiently low level for my tastes. It was reasonably fast, much much faster than BASIC for example, there feels like a close 1-to-1 relationship between what you type and what eventually comes out the far end of the compiler. And it was still quite a learning curve before I was able to do much of anything useful.
Now that I'm not targeting the SuperCPU, even C feels bloated. The 6502 was just not designed for the sorts of contortions that C tries to put it through. For example, every time a C function is called the arguments are put onto the stack. The C function itself then manipulates those values on the stack as its working local memory space. This is really great for recursion or pre-emptive multitasking, but it pains me to think of what the 6502 must be doing to accomplish the stack referencing.
Once I'd decided that the stock C64 is what I will target for this particular project, I became fully committed to learning to do everything I need to do in 6502 assembly. At first it was a pretty big shock. And to be frank, it's often still shocking. Without the KERNAL and BASIC roms a C64 has almost nothing there upon which to write your code. You are literally writing against the registers of the I/O chips directly, and the only resources at your disposal to get anything done are the 6502/6510's 56 simple instructions. That sounds crazy. Well, how crazy is it? I wanted to try my hand at something both simple and complicated at the same time. Somehow I decided on writing a routine to compute factorial. How hard could that be? #FLW
That's it! Pretty basic. If the input is 1, return 1. But if the input is more than 1 then return the product of this times the factorial of one less than this. Recursion as we know can break our minds, but it is certainly elegant if nothing else. The whole function is only 5 lines, and very short lines at that. But more, the properties of recursion feel like they best capture the essence of factorial as a mathematical concept.
The second problem, hitting the maximum handlable number, essentially crops up because factorial grows really really fast. On the other hand, many algorithms won't grow nearly so fast but still need lots of iterations to get the answer. You certainly don't want to crap out after just 25368 iterations because of a stack overflow. So my next stab at writing this was to use an iterative loop that involves no recursion.
Writing this version of factorial took me just one more minute, approximately. It's not much longer, only longer by one line. It feels slightly less elegant to my eyes, but it is undoubtedly significantly more efficient. When I try to find the maximum number of times it can iterate before crapping out... I'm unable to find that number. What happens is that the computer just starts taking longer and longer to compute for larger and larger input numbers. But they all eventually finish without hitting weird limits like the stack overflowing. Of course, for the last however many iterations they aren't doing much besides computing Infinity * nextNumber.
Implementing Factorial in BASIC 2.0
But let's be even more fair. The Commodore 64 as a commercial product isn't just a 6510 processor on a bus hooked up to some RAM and I/O chips. It is that, but it also includes the KERNAL rom and, more importantly for the purposes of writing a factorial algorithm, the BASIC rom. BASIC was included with the computer precisely because it made writing programs and doing some fairly sophisticated math relatively easy and accessible to the average person.
BASIC, at its core, of course is just a bunch of 6502 code. About 8K of it in fact, permanently etched into a rom. So, anything we could do in BASIC is in a very real way something that can be done with 6502 assembly. But to get there from scratch first you'd have to write BASIC, or significant parts of it. This is not what we're going to be doing in our own 6502 implementation. But, first, let's look at how one could write factorial in BASIC.
Very briefly, a number in BASIC is stored as 5 bytes. 4 bytes are for precision, and store the most significant 32-bits worth of the number. The 5th byte is signed, and represents the offset of the decimal. By convention, within the 4 bytes the decimal is always presumed to fall after the first decimal number, and the 5th byte is adjusted to normalize any value that exceeds 9. So for example, 8 is stored as "8.000000..." with a decimal offset of 0. But 12 is stored as "1.200000..." with a decimal offset of 1. (I'm goin' from memory here, don't flay me if I'm a bit off in the precise details. Instead, leave a comment below!)
This is the reason why BASIC's precision stops after 8 decimal places. Because "868331762" (bearing in mind that by convention the decimal falls immediately after the leftmost 8,) when converted to hex is $33 C1 B0 F2, and that my friends is exactly 4 bytes. Cool stuff.
Implementing Factorial in 6502 Assembly
There is nothing wrong with writing some math algorithm in BASIC. If you were in high school and you had a Commodore 64 and you wanted to use it to help you with your homework, as I used to do, writing a quick little program in BASIC is absolutely ideal. Often much better and much more expressive than a mere calculator. That is the reason it was included with the machine in the first place. Because it adds tremendous utility to a home computer. It turns a bunch of chips into a powerful programmable computational tool. Nothing wrong with that!
However, what I really wanted to know is, what is the minimum amount of code necessary to write an algorithm like factorial, on the bare and essential features of the 6502 CPU. That's a very different question. If all I wanted to do was get the answer, and all I had was a C64, I'd use BASIC. But in 6502, where do we even begin? Well, let me show you.
The essential requirements of factorial are:
- The ability to loop
- The ability to decrement a counter
- The ability to multiply two numbers together, and
- The ability to determine that the answer has been arrived at and that the loop should end.
That's enough to compute factorial. But there is one other step that I consider at least tangentially necessary to complete the picture. You need to be able to see the answer, and so even though human readable output is not strictly part of the computation of factorial, it is close enough that I have to include it. Otherwise, there would be no way of knowing if the answer it arrived at is even right.
Fortunately, looping, decrementing a counter, and comparing to see if a counter has reached 1 and leaving the loop are all standard fare. Those things are super easy on a 6502. But what about multiplying two numbers together? Oops, as it turns out, there are no instructions that support multiplication directly. Not even two 8-bit inputs with an 8-bit output. Nothing. In order to do multiplication of any kind3 it is necessary to dig deep and figure out how multiplication fundamentally works and its relationship to addition and symbol shunting, such as bit shifting and rolling.
Whenever you plan to sit down and write an algorithm (or use someone else's hard work), you have to decide what precision and range you want to get out of it. In BASIC, that decision is effectively made for you, and the choice was made to use 32 bits of precision because that will handle a heck of a lot of normal use cases. One of the powers of writing the algorithm directly in 6502 is that if you wanted 64 or 128 bits of precision, you could do that. It puts the computer through more gyrations, but the possibility is there to do it. It is not possible to get more then 32 bits of precision out of BASIC.
It is safe to say that the multiplication routines for smaller bit widths is simpler than those for larger bit widths. And so two 8-bit inputs and an 8-bit result are definitely the simplest, but the range is so cripplingly limited as to be almost useless. 6! is 720, that's outside the range of an 8-bit result. The range of 16-bit doesn't take us much further. 9! is outside the range of 16-bit. However, as this is an experiment, I want to show how an 8-bit CPU can be used to do math beyond just 8-bits, but without unnecessarily bogging us down with the complexity of going to 32 or 64 bits. But rest assured, the same way the CPU has what it takes (namely thanks to the carry) to be able to extend math operations beyond 8-bit, the principle can be applied to extend to much larger bit widths too.
So here we have a 16-bit multiplication subroutine. It takes a 16-bit multiplier and a 16-bit multiplicand, and puts the product of the two in a 16-bit output. Note of course that there are lots and lots of combinations of two 16-bit inputs that will overflow a 16-bit output. The really safe way to deal with this problem is to use multiplication algorithms that have an output twice the width of the inputs. 16-bit by 16-bit input and a 32-bit output, which this routine doesn't do so bear that in mind.
Next, I did not write this routine. Sadly, it would have been beyond my ability. Instead I pulled it from this forum post on forum.6502.org. I modified it slightly to preserve the original inputs, and to improve clarity (in my personal judgement of what is clear.) There are other great examples of math algorithms with different levels of precision at Codebase64.
I don't want, nor am I well suited, to get into the mathematical details of how this works. I can say with some certainty though, that a multiplication can be acheived as successive additions. For example, if you have 13 times 5, you can initialize a running total as zero, and then in a loop add one of those two arguments to the total, decrement the other argument, and if it's not zero run the loop again. Thus adding 5 to a running total in a loop that loops 13 times. Or, the inverse, add 13 to a running total in a loop that loops 5 times over. There is indeed an addition stage in the algorithm above starting at the CLC inside the loop and ending just before the "skip" label. However, people far smarter than I am discovered a long time ago that there are certain efficiencies to be gained by doing the additions in other ways than merely doing the same addition some large number of times.
If you are multiplying two 16-bit numbers, the naïve way might mean you have to add 572 to the running total 65,000 times! Whereas, in the example above, the loop explicitly only loops 16 times, which is not coincidentally the number of bits in the output. Somehow by rolling the bits in the output, and rolling the bits in one of the arguments, it can be determined whether an addition need take place. Repeat 16 times and the correct answer magically appears. If you're into mathematical theory, you'd probably really enjoy analyzing this routine to death. For me, it is enough to know that it works, and it consists of 30 instructions. Or, 22 instructions if you don't care about corrupting one of the inputs in the process.
There is just one other thing I'd like to note about this routine. The inputs and the output are hardcoded references to what amount to global variables. The use of global variables is typically considered very bad form by modern computing standards. But it seems hard to avoid on the 6502. As it happens these global variables are in zero page, that was done so that all the access to them will be as fast as possible. But, I did test it and everything works just fine when they are not in zero page. Therefore, the product could be computed in a local variable, and could even be returned in the .X and .Y registers. However, these are not big concerns when literally the entire machine is at our disposal for just the one solitary purpose of computing factorial.
Factorial core routine
Now we can get to the core routine of factorial. Remember that the workspace variables for the multiplication have already been defined.
We enter the top of the loop and decrement factor. If it is now zero we leave the loop by branching to done. If not, we load the newly decremented factor into the multiplicand and JSR to mult16. Thus 6 x 5 ends up in product. Next we transfer product to the multiplier and loop back to the top. On the next iteration multiplier is prepopulated by with the result of 6x5 (30) and multiplicand is populated by the next decremented factor 4. Thus 30 x 4 ends up in product, and so on. Finally feels like we're making some progress.
Human readable output
That's all well and good, but how do we know if product contains the right answer unless we can see the output? The final product is just two bytes. But the problem is that those two bytes cannot simply be shoved onto the screen because they will represent some odd C64 Screencodes and we won't be able to interpret them as numbers. What we really want is to convert them to some base of numbers we can read.
Number conversions to and from decimal is very cool, and it's something I'd like to get into in a future post. But conversion from binary to hexadecimal is significantly more straight forward. One byte of binary input always converts into exactly 2 bytes of hexadecimal output. Here's a routine that I keep handy in my own coding repertoire while I'm working on C64 OS. I usually put it in an include file so that I can bring it into my source with a .include "inttohex.s" directive whenever I need it. It came out of a discussion with someone on #c64friends on IRC.
This is a clever little routine. The hexits label is effectively a conversion table. The contents of the string are bytes that the assembler encodes in PETSCII. But the offset into the table is the conversion to the PETSCII representation of that value. So if you had the actual integer 7, and you index 7 bytes into hexits, you arrive at the PETSCII value "7" (which is actually the integer 55, but we never need to know that.) Perfect.
At the calling label "inttohex" you see the documentation style that I've grown accustomed to. It shows the register and a right-pointing arrow to mean input, followed by a description of what it should hold. The registers that are outputs are followed by a left-pointing arrow and then a description of them. So, put an 8-bit integer in the accumulator and JSR inttohex. You get back two PETSCII characters in X and Y, lo and hi nybbles respectively.
To do that we push the accumlator onto the stack to back it up, then shift right 4 times. That takes the hi nybble (4-bits) and shifts it into the lower nybble. The upper nybble is replaced by zeros. Transfer that to the X register to be used as an index into hexits. Grab the hi nybble PETSCII character and stash it in a temp variable. Restore the original accumulator from the stack and using AND zero out the upper nybble leaving only the lower nybble intact. Transfer that to the X register and use it to index into hexits to grab the PETSCII value for the lower nybble. Lastly, stick the low nybble in X, read the high nybble from the temp variable directly into the Y register and return. Bingo bango.
That's a fine way to convert one byte into two PETSCII characters that are an easily readable hexadecimal output. But how do we physically output them? Well, we could try shoving them into screen memory, and that would work. But I'm pretty comfortable at this stage using the KERNAL routines designed to output characters to the screen. In this case, CHROUT ($ffd2).
There are however, two bytes in a 16-bit output. And although the CPU is little endian and wants the lo byte first followed by the hi byte, that's not the order we want to see them in. To make things easy and clear, I wrote a little macro "out16." You pass it the address of the low byte, and it converts the hi byte (low byte + 1) to PETSCII and chrout's the result, hi nybble lo nybble. Then it calls inttohex on the lo byte, and chrout's its result, hi nybble lo nybble. And finally the macro outputs a carriage return so multiple calls to out16 will put the results on separate lines.
Putting it all together
It took a long time, and a lot of effort, but we've got pretty much all the parts together we need to make a complete and running program. Here's what the final source code, altogether, in TurboMacroProX format looks like, including comments, section separators and a little something extra I'll talk about in a moment.
I like to structure my code such that constants appear at the top, followed by the main code body, followed by subroutines that will be used and called by the main code body. I nearly tore my hair out trying to debug the meaning of "label not defined" on the calls to out16. Eventually I realized that the assembler wants macro definitions to appear in the code before anything attempts to use them. I'm not entirely sure why. So, I moved the out16 macro definition to appear just below the constants.
Next, you have to declare where in memory you want your code to assemble to. I chose $0801 because that's where BASIC programs are loaded into, and because it's an easy place that I know how to jump to with a SYS command. This decision may have been the most hair pulling of this experiment. As it happens, when you double click a .PRG file to open it in VICE (which is what I was testing in,) it not only loads it, but it also runs it automatically, by issuing the RUN command. I totally get why it does this. It's so that if you're using VICE with a one file game you can just double click it, VICE will launch, load and run the program for you, and boom you're playing a C64 game in an emulator and you didn't have to know even a single arcane C64 "command."
The problem with RUN is that it attempts to execute my code as though it were a BASIC program, which it's not. The result of this is that BASIC was actually corrupting my code before I had a chance to run it. So, what I would do is double click the file, wait for BASIC to mess it up and return to the ready prompt and then I'd issue an SYS2049 after that. It crashed immediately every time. This was very frustrating until I figured out what was happening.
The secret is to include a tiny prelude to your code, that is itself in fact a valid BASIC program that when run will jump to your real code. This trick is used all the time in C64 games and demos. I didn't have what that prelude would be handy but it's surprisingly easy to figure out. I simply wrote a basic program: 10 SYS2049 Then I saved that to "disk" which in VICE put the .PRG basic program back onto my Mac's desktop.
Opening that file with the excellent Mac utility, Hex Fiend, revealed the following BASIC program bytes:
01 08 0B 08 0A 00 9E 32 30 34 39 00 00 00
The first two bytes are the load address, 01 08, $0801, so we can ignore them, they won't end up in memory. The next 2 bytes are a pointer to the start of the next BASIC line, 0B 08, ($080B). The next two bytes are the 16-bit program line number, 0A 00, ($000A, or 10) and that 9E must be the token for SYS. The next four bytes, 32 30 34 39, reveal very clearly the 2 0 4 9 conveniently visible in the lower nybble of each hex value. And the last 3 bytes 00 00 00 are the basic program terminating sequence. In other words, the basic program will occupy (2 + 2 + 1 + 4 + 3) 12 bytes in memory. Which means if the BASIC prelude starts at $0801 (2049), then my ASM code will start at 2049 + 12 or 2061. Thus we can just change the BASIC prelude so the SYS will go to 2061 instead of 2049 and we get a BASIC prelude like this:
0B 08 0A 00 9E 32 30 36 31 00 00 00
And so that is exactly what we put into the assembler as a sequence of bytes that immediately precedes the start of the main body code. And it works perfectly.
*=$0801 ;Basic Prelude, 10 SYS 2061 .byte $0B,$08,$0A,$00,$9E,$32,$30,$36,$31,$00,$00,$00
And there we have it running in VICE. I added one more #out16 to the body code so we could see the output of the multiplier before beginning to compute its factorial, just so we can see in the screenshot the two values. The first output is 0008 (8 in decimal), and the result is printed as 9D80 (40320 in decimal), and that is the correct answer! Hurray!
You can run that in VICE, or on a real C64 and see that it in fact works. Or analyze it with a Hex Editor. Note that it's hardcoded to compute 8!.
I'll be the first to admit that it took a lot of work, and a lot of effort, and the result is less capable than what I got with BASIC, but it feels good. It feels like we're at rock bottom. There is no code running that we are not explicitly aware of (with the exception of those final chrouts.) It's like painting a picture by applying every single stroke of paint to the canvas by hand.
The other thing I should say before signing off, is that in practice it doesn't feel quite this difficult when writing ASM code for C64 OS. First of all, I have yet to need multiplication even once. And second, after you've been working on your project for a while all the surrounding structure such as where to put constants and macros, what address to load your code to, and even output, become solved problems that you stop thinking about on a day to day basis.
That's it for this post. I consider that worthy of the Technical Deep Dive category.
- I'd like to point out just how incredibly awesome it is that these classic books are available for free at Archive.org. I read the Geos Programmers Reference Guide cover to cover on Archive.org. Although, I have Advanced Machine Language in paperback, and do prefer a paper copy in most cases, sometimes it's nearly impossible to get your hands on a physical copy. [↩]
- Actually, there is one kind of multiplication, and division, that can be done with the built in instructions. Multiplying by two can be accomplished by shifting all the bits left one position. And dividing by two can be done by shifting all the bits right on position. Using the carry shifts can be done on higher bytes that shift in a bit that fell of the edge of a lower byte, thus enabling shifts that are 16-bit, 32-bit, 64-bit and more. [↩]