Debugging Session, Textblit
Happy Valentine's Day, 6502 freaks. Here's a new genre of post. I've tagged this one as Programming Practice, because it gets right into the nitty gritty of a practical problem and the steps taken to fix it. This post includes a 30 minute video of a realtime debugging session. If a picture is worth a thousand words, well then, a video has gotta be worth something.
Late last month I posted my first video update. A video update is kind of cool because as I build out C64 OS, there will be more and more interesting things to see. And pictures are handy, but a video really gives you a flavor for how fast something is loading, or how the mouse is responding and how the screen is refreshing, etc.
In my video update, which was only a couple of minutes long, I held the camera in my left hand and worked the computer and the demo with my right. That's fine for a short clip, but to really get into the weeds you need your hands free and therefore a stand for your camera. I banged together a decent makeshift camera stand out of nothing more than a stiff metal coat–hanger, with a grippy rubber coating. My workspace is under a stairwell, so the ceiling is quite low. Hanging the coat–hanger from the ceiling works perfectly and positions it to point straight at the monitor.
Let's get into this video debugging session, and take a glance into some native coding.
I began this video knowing I had an interesting bug to find. It's interesting because it has a clear visual artifact and involves several low–level code modules that I'm working through for the first time.
Before I started my hunt for this bug, it struck me that this would be a great opportunity to catch the entire process in a video. I spend the first few minutes talking about the general layout of the source code and the convention I use for filename extensions. Then I spend maybe a minute describing the basic problem I'm having, and then we just hop right into using some tools, like DraBrowse64, Turbo Macro Pro, SuperMon64+ and JiffyDOS commands. The goal is to show how one goes about using the tools available on a C64 to hypothesize about the problem, run some tests, modify the code, reassemble and test again.
The video concludes with me being very self–satisfied, having diagnosed the problem, found the bug and fixed it, and showing that the result of the fix has removed the offending visual artifact.
WAIT, hold on. That's not the end of the story...
Shortly after concluding the video I decided I really should run it against a couple of additional test files. I very quickly noticed that I was getting unexpected results. Off video, I ended up going back into the debugging process, and after about 2 hours of testing some things out, I finally realized what the real source of the problem was. The fix shown in the video is completely wrong. It solves the visual problem by fluke, and in fact the entire on–screen display is wrong, symptomatic of the real problem, but it's only subtly wrong in a way that is hard to notice.
So, watch the video, and see how the tools work and the general process of debugging native code. And then read on below for a full description: What went wrong? How did I mis-diagnose the problem? How did my bogus fix look like it solved the problem? And what was the real problem and the real fix?
Now on to the REAL bug
The first thing that went wrong was that I mis–diagnosed what effect a certain kind of problem would produce. This led me down a rabbit hole looking for the problem in the wrong place. I ignored two common sense clues, because they didn't conform with the problem as I had imagined it. And the fake solution itself was made possible by an issue in the data. So let's dig a bit deeper on these.
When I looked at the PETSCII image it looked right, columns line up row after row. But, imagine if you started the base address of screen memory at one address offset from where it was supposed to be. Instead of passing a pointer to $0400, for example, imagine we passed $03FF. The first character would not be displayed on screen ($03FF isn't in screen memory), but then all of the characters in that first row would be offset to the left by one. In all subsequent rows, the character from the leftmost column would appear as the last character in the row above. That's because from the top left going right, memory wraps around to the first column of the next row.
The following image demonstrates the effect. (It's only 20x20, the C64 is 40x25, but the concept is the same.) If things go well, you should see the yellow circle, as on the left. But if you start one memory location short, as you can see in the green on the right, every row would pull left one column. Except, all the data in the first column would wrap up a row and to the opposite side of the screen. And at the end, you'd get one missing byte, represented here by the red square.
I was getting the problem shown by the red square, where data has been read in from off the end of the file. But what I was not seeing was any of the left hand column data wrapping to the right. The image itself looked normal.
The image looking normal led me to think I was simply not reading in 1000 characters of data. I even did the math, and because of my bias towards what I thought was the problem, I still came away thinking I was not reading in enough data.
Here's the thing though, you only see a row wrap around if there are any characters in the leftmost column to start with. As it happens with this image, the word EARTH is supposed to run along the right edge of the screen, and there are around two columns on the left that are just empty. By pulling everything left by one column, it didn't wrap the image like I thought it would. And so, it turns out, it really was pulled left.
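To make the wrap effect concrete, here is a rough Python sketch (a simulation, not C64 OS code). It lays out a 1000-byte image as 40x25 rows, once correctly and once offset by one byte. Both test images are made up for illustration: one has two empty columns on the left, like the EARTH image described above, and the other has content in the leftmost column. A "?" marks the byte read from past the end of the data.

```python
COLS, ROWS = 40, 25  # C64 text screen dimensions

def render(data, start=0):
    """Lay out bytes row by row, as if the screen began `start` bytes into `data`."""
    cells = data[start:start + COLS * ROWS]
    cells = cells + "?" * (COLS * ROWS - len(cells))  # "?" = read past the end
    return [cells[r * COLS:(r + 1) * COLS] for r in range(ROWS)]

# Hypothetical image: two empty columns on the left, a marker in the last column.
blank_left = ("  " + "." * 37 + "E") * ROWS
# Hypothetical image with content in the leftmost column.
full_left = ("L" + "." * 38 + "E") * ROWS

shifted_blank = render(blank_left, start=1)
shifted_full = render(full_left, start=1)

print(repr(shifted_blank[0]))  # ends in a space: looks like a plain left shift
print(repr(shifted_full[0]))   # ends in "L": the left column wrapped up a row
print(repr(shifted_full[-1]))  # ends in "?": one byte short at the very end
```

With the blank left columns, the wrapped characters are spaces, so the only visible symptom is the single bad cell at the bottom right.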
Thinking that I was not reading in enough data, I made a change that would load in one extra byte. This was a mistake. Now I was loading in 1001 bytes, and it's the 1001st byte that got drawn into the red square. Ordinarily this would have failed, because the PETSCII file should contain exactly 1000 bytes. However, I generated it using my web–based PETSCII Art Renderer, and the file actually contains more data than the C64's text screen can hold. That's just a problem with the renderer, and it doesn't matter too much that the file may have an extra row of data or so. However, it conspired to trick me into thinking the fix worked, because there was that extra byte available to load in.
And the REAL fix
Turns out, I really was reading in 1000 bytes. My so–called fix actually would have introduced a new bug that loads in 1 byte more than you asked for.
And, it turns out, the code that specifies the start of the buffer is correct, and it really is looping copying 1000 bytes from the buffer. If that's all correct, then what the heck is wrong?
It's the textblit routine. My goal was to make this copy 1000 bytes from a buffer into screen memory as fast as computationally possible. To do this, I'm using a macro, configregion, that is invoked 4 times and self–mods the output of another macro, drawregion. Let's take a look at how these work, and why they're fast.
You call textblit with a RegPtr to a buffer. It then calls confreg macro 3 times, before setting up a loop around the output of 4 instances of the drawreg macro. Each instance of the drawreg macro starts with a label, regiona, b, c and d. The first line of the drawreg macro outputs an lda $ffff,x. But the confreg macros are designed to self–mod these load addresses.
Because an 8-bit register can only count from 0 to 255, but we need to copy 1000 bytes, the screen is divided into 4 regions that are 250 bytes big each. The 4 drawregion macros have the CPU copy 4 bytes, one from each region, per loop. This parallel copying is faster than doing one region after the next, because it reduces the total number of loops and the required branching logic to perform each iteration. Instead of 1000 iterations, there are only 250. See the post Anatomy of a Koala Viewer to see how the routines in that Koala Viewer program do something similar.
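As a rough illustration of the region idea, here is a Python sketch (a simulation, not the actual macros). One index that fits in 8 bits is shared by four regions of 250 bytes. The inner loop over regions stands in for the four unrolled drawregion copies; the function and variable names are made up for this example.

```python
SCREEN_SIZE = 1000   # 40 columns x 25 rows
REGION_SIZE = 250    # small enough for an 8-bit index
REGIONS = SCREEN_SIZE // REGION_SIZE

def parallel_copy(src):
    """Copy 1000 bytes using one shared index across four regions."""
    dst = [None] * SCREEN_SIZE
    iterations = 0
    for x in range(REGION_SIZE - 1, -1, -1):   # x counts 249 down to 0
        for region in range(REGIONS):          # unrolled on the real machine
            offset = region * REGION_SIZE + x
            dst[offset] = src[offset]
        iterations += 1
    return dst, iterations

source = [i % 256 for i in range(SCREEN_SIZE)]
screen, loops = parallel_copy(source)
print(loops, screen == source)
```

Every one of the 1000 offsets gets written exactly once, while the decrement and branch only run once per pass through the loop.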
That's a pretty cool trick. But there is a problem when branching. We want X to loop from 249 to 0, therefore, we only want to stop when X rolls over from 0 to 255 (or -1 signed). There are generally two ways to do the branching for these kinds of loops:
The first way is faster, but only works if the biggest X value is less than 128. For any value 128 to 255 the 7th bit (b7, aka x000 0000) is set, which is also the negative flag for signed numbers. If you want to loop from 127 or less down to and including 0, you can branch using BPL (branch if positive.) 0 is considered positive simply because the negative flag is not set. You can't use this method if the max value starts off greater than 127 because the negative flag will already be set.
The second way is slower. You can do a CPX #255 at the end of every loop to see if X is exactly one less than 0 and then branch on the compare result. But, that kinda sucks because now you have a whole other instruction to perform on each loop.
The third solution is what I came up with. (Although I'm not claiming credit, this is well picked over territory.) Have the loop go from 250 to 1, and branch using BNE, that will cause the loop to end when X becomes 0. The drawreg macro then modifies the destination address to be one less than the base screen memory address. In other words, the write goes from $03FF,X where X ranges from 250 to 1. The copy goes in reverse, and ends at $0400. And then this is repeated across 4 regions to cover the whole screen.
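The branching behaviour of the first and third approaches can be checked with a small Python model of an 8-bit X register (a simulation only; the helper names are made up). Each function runs the loop body, then a DEX, then the branch, and returns the X values the body actually saw.

```python
def dex(x):
    """Decrement an 8-bit register; return (new value, N flag, Z flag)."""
    x = (x - 1) & 0xFF
    return x, bool(x & 0x80), x == 0

def indices_with_bpl(start):
    """Body, then DEX, then BPL: keep looping while the negative flag is clear."""
    x, visited = start, []
    while True:
        visited.append(x)
        x, negative, _ = dex(x)
        if negative:
            return visited

def indices_with_bne(start):
    """Body, then DEX, then BNE: keep looping while the zero flag is clear."""
    x, visited = start, []
    while True:
        visited.append(x)
        x, _, zero = dex(x)
        if zero:
            return visited

print(len(indices_with_bpl(127)))  # covers 127 down to 0
print(len(indices_with_bpl(249)))  # bails out after a single pass
print(len(indices_with_bne(250)))  # covers 250 down to 1
```

Starting BPL at 249 falls out of the loop immediately, because 248 already has bit 7 set. BNE from 250 gives the full 250 passes, at the price of every index being one higher than the offset it is meant to address.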
That is cool. It's the fastest I can come up with. Ideally there would be no cycles wasted, because this routine is going to be called a lot for compositing screens together.
But there is a problem, and this was the true bug. The LDA address is set exactly as it is passed in, yet the X index is 1 more than it should be. So, the problem was not that my destination addresses were all pulled one to the left, but that the source addresses were all pushed one to the right.
One solution is to do a 16-bit decrement on the RegPtr when it's passed in. This is what I did, and it actually solves the problem.
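Here is a rough Python sketch of the bug and the fix (a simulation, not the real textblit). The destination base sits one below the start of the screen and X runs from 250 down to 1. The only difference between the two runs is whether the source pointer is also moved back by one. The memory layout, addresses and filler value are hypothetical.

```python
SCREEN_SIZE, REGION_SIZE = 1000, 250

def textblit(memory, src_ptr, decrement_source):
    """Copy 1000 bytes from memory[src_ptr:] to a screen list, X running 250..1."""
    screen = [None] * SCREEN_SIZE
    dst_base = -1                          # one below the start of the screen
    src_base = src_ptr - 1 if decrement_source else src_ptr
    for x in range(REGION_SIZE, 0, -1):    # loop ends when X reaches 0
        for region in range(4):
            offset = region * REGION_SIZE + x
            screen[dst_base + offset] = memory[src_base + offset]
    return screen

# Hypothetical memory: a 1000-byte image at address 16, with stray bytes after it.
IMAGE_AT = 16
image = [i % 251 for i in range(SCREEN_SIZE)]
memory = [0xEE] * IMAGE_AT + image + [0xEE] * 16

buggy = textblit(memory, IMAGE_AT, decrement_source=False)
fixed = textblit(memory, IMAGE_AT, decrement_source=True)

print(buggy[:999] == image[1:])  # whole image pulled left by one byte
print(hex(buggy[999]))           # last cell holds a byte from past the image
print(fixed == image)            # decrementing the source pointer lines it up
```

The buggy run reproduces both symptoms from the video: everything shifted by one, and a stray byte in the final screen cell.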
For the Keeners
It's possible that this whole technique is trying to be more clever than it needs to be. It seems to be the case that there is no +1 cycle penalty for crossing a page boundary with STA Absolute,X but there definitely is that penalty with LDA Absolute,X.
By setting the source address back 1, we are most likely moving it off something that is nicely page aligned (the page allocator will give out boundary–aligned pages of memory). What that means is that the majority of LDAs will be crossing a page boundary. All but 36, actually. That's 1000 - 36 = 964 cycles spent on page cross penalties. Plus a few extra cycles are required to do the 16-bit decrement on the initial pointer.
If you use method 2, however, you still can't avoid all page boundary crossings. In fact, there are only 250 fewer page boundary crossings, all of which occur in the first region. After that, the boundaries get crossed regardless. But, still, that's 250 cycles saved. But, then you need to do a CPX 250 times. With an immediate operand, as in CPX #255, each of those costs 2 cycles (it's CPX Absolute that costs 4), or 500 extra cycles.
Therefore, despite the initial penalty of page boundary crossings, in total the trickier technique saves around 250 cycles, or roughly 0.25 milliseconds at 1MHz. Is it worth it? Hell yeah! It only cost me one little bug, and now that it's fixed, it's solid AND bitchin' fast.
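The page-crossing counts can be double-checked with a short Python sketch (a back-of-envelope model, not a cycle-exact emulator). It assumes a page-aligned source buffer at a made-up address, $4000, and counts how many LDA Absolute,X reads land on a different page than the operand address. The cost of the per-loop compare is left as a parameter, since on the 6502 CPX takes 2 cycles with an immediate operand and 4 with an absolute one.

```python
REGION_SIZE, REGIONS = 250, 4

def page_crossings(base, x_values):
    """Count indexed reads whose effective address leaves the operand's page."""
    crossings = 0
    for region in range(REGIONS):
        operand = base + region * REGION_SIZE
        for x in x_values:
            if (operand & 0xFF) + x > 0xFF:
                crossings += 1
    return crossings

BUFFER = 0x4000  # hypothetical page-aligned source buffer

# The trick: operand is buffer - 1, X runs 250 down to 1.
trick = page_crossings(BUFFER - 1, range(250, 0, -1))
# Method 2: operand is the buffer itself, X runs 249 down to 0.
plain = page_crossings(BUFFER, range(249, -1, -1))

def net_cycles_saved(compare_cycles):
    """Cycles the trick saves over method 2 for a given per-loop compare cost."""
    return compare_cycles * REGION_SIZE - (trick - plain)

print(trick, plain, trick - plain)
print(net_cycles_saved(2), net_cycles_saved(4))
```

This ignores the handful of cycles spent on the one-time 16-bit decrement of the pointer.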
One last thought. C64 OS is architected to follow some modern design principles from Model–View–Controller. What is in screen memory is never the only copy of any of the application's data. Keeping the only copy there would violate the separation of model and view. In C64 OS, the screen is just a view. Anything that gets drawn to the screen is drawn from extant models somewhere else in memory.
Every application must have a standard draw routine. And you never call that draw routine manually. The main event loop calls multiple draw routines, in the correct order, and only when necessary. The application must be prepared to redraw the current state of its entire interface at any time.
Besides many advantages this brings that I won't go into here, one major advantage should be scrolling speed. The KERNAL's screen editor scrolls the screen perilously slowly, when compared with how quickly the 1MHz 6510 can copy data from an external buffer into screen memory. If you have an external buffer, and you want to scroll the screen, it is much faster to just change the pointer to the source address by the length of one row, and then copy the whole contents as quickly as possible. What the Screen Editor has to do is copy row 1 into row 0, then copy row 2 into row 1, and repeat for all rows. Then repeat again for all the rows of color data. Ouch.
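The two scrolling strategies can be compared in a small Python sketch (a model of the idea only; it says nothing about real timings, and all names are hypothetical). One version moves the source pointer down by a row and copies a full screen from the buffer. The other shuffles rows upward inside the screen itself, the way the description of the Screen Editor above suggests. Color data is left out.

```python
COLS, ROWS = 40, 25

def blit(buffer, top_row):
    """Copy one screenful out of a taller buffer, starting at the given row."""
    start = top_row * COLS              # move the source pointer by whole rows
    return buffer[start:start + COLS * ROWS]

def scroll_in_place(screen, new_row):
    """Move each row up by one within the screen, then fill the bottom row."""
    for row in range(1, ROWS):
        screen[(row - 1) * COLS:row * COLS] = screen[row * COLS:(row + 1) * COLS]
    screen[(ROWS - 1) * COLS:] = new_row
    return screen

# Hypothetical model: 30 rows of text, each cell holding its own row number.
model = [row for row in range(30) for _ in range(COLS)]

view = blit(model, top_row=0)
scrolled_by_pointer = blit(model, top_row=1)
scrolled_in_place = scroll_in_place(list(view), model[ROWS * COLS:(ROWS + 1) * COLS])

print(scrolled_by_pointer == scrolled_in_place)
```

Both routes end with the same screen contents. The pointer version gets there with one straight copy that a routine like textblit can do at full speed.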