NEWS, EDITORIALS, REFERENCE
Recursive File Copier in BASIC
When I first began work on C64 OS, I was so new to 6502 programming that I just started to write all my code into one gigantic source file. It didn't take long before this was not sustainable, and I had to break the project down into smaller modules.
I'm also pretty sensitive to the risk of losing data. Not just from a harddrive crashing but also just from the sheer stupidity of something like accidentally overwriting a code file with its header. And so I've gotten in the habit, every night at the end of my coding session I backup my files from the CMD HD where I work on them to the uIEC/SD. I have a backups folder on an SD card, and inside that I have YYYY-MM-DD folders. So, I can recover from a previous backup and even review my history, it's a pretty good setup. Also, from the SD card, I periodically copy the backups to another computer.
When everything was in one file, backup was really easy! But as I started to split my project up into many smaller files it started to get a little bit trickier. The way I was doing the copying was to use the built–in JiffyDOS two drive copier. The steps would be something like this:
- Switch to the uIEC/SD
- Switch into the backups folder
- Create and switch into a directory for the current date
- Set the JiffyDOS target drive, (@X10)
- Switch to the CMD HD
- Load the directory (possibly with a pattern)
- Hit Control-W to select all files
- List the directory and use Control-A to deselect some files
- Start the copy with RUN
- Repeat from the CMD directory load with a different pattern
The pattern loads were to get, say, all the header files first. Then later get the source code files, and then the includes. All the while skipping the assembled object files, or other random text files etc. I've devised a single letter file extension system to organize all my project files by type. They are as follows:
- no extension — Executable
- .a — Source ASM
- .s — Includes
- .h — Headers
- .t — Text File
- .o — Object File
- .m — Menu Definition
- .i — Info/Config file
That's quite a few steps, and the amount of work keeps getting greater the more files I add to the project. But then something much worse happened. In the post about Application Loading, I discussed my plans to have C64 OS load applications out of subdirectories. The main system folder will have Services and Applications subfolders. Inside these are folders named for applications. And inside each of these will be the standard layout of main executable, menu definitions file, about info and other app specific resources.
I've started to build out the first app I'll be writing for C64 OS, App Launcher. And instantaneously my old backup system has become paralyzingly laborious. It got to the point where I was starting to dread the process of just doing a simple backup. But, I have to do them every night or risk losing all my code in the case of a disk failure. My first thought was, one of the very first features I should add to the C64 OS File Manager is the ability to recursively copy folders. That's a fine idea, but I know I'm going to have untold many more nights of coding before that will be ready. I need a solution for backup now.
BASIC: The original scripting language
I've yet to see BASIC described as a scripting language, but the more I think about it the more apt a description I think it is. BASIC is often described as a high level language. But there is an important difference between high level languages and scripting languages. Usually the key difference is that scripting languages are interpreted. Well, BASIC is a high level interpreted language. It's effectively a scripting language since before people started to call them that.
I decided to try to write a quick and dirty file copy utility in BASIC. The problem is that I had no idea how to write a file copier. Commodore DOS provides commands for copying files between mechanisms that belong to one device, and are controlled by one device controller. So, for example, the CBM 3040 or 4040 Dual Disk Drive. You can issue a command telling them to copy a file by specifying one filename preceded by the internal drive mechanism number, and then the destination filename preceded by a different drive mech number.
This inter-mechanism file copying was later supported by the CMD HD, and is currently still supported by the uIEC/SD, where the drive mechanism numbers are instead partition numbers of the same physical mechanism. That was a very clever adaptation that maintained backwards compatibility with existing Commodore DOS syntax. But, how do you get data transfered from one serial bus device to a different serial bus device? I want to transfer data from a CMD HD to the uIEC/SD.
I succeeded in making what is probably the slowest file copier that has ever been written for the C64. Yay, I've made history! However, I do know of a way that I could make it slower. About the only thing I know is that reading one byte from one disk and then writing it to a different disk and repeating is unimaginably slow because between each read and each write the computer has to send all the TALK/UNTALK/LISTEN/UNLISTEN overhead. But, the problem is that I just didn't know of a very good way to read many bytes at a time such that I could write many bytes to the destination. Here's what I came up with: (don't bother to try this at home folks.)
Here's how this works. I begin by making a dimension (an array) of string variables with a capacity of 10,000 elements. Next I open a file (drives.h) for read on LF#2, and another file (drives.h2) for write on LF#3. (In the screenshot they're both on the same device, but this can by changed to a different device number, of course.) Next I set a variable k to 0, which is the length of data stored in the dimension d$. BASIC v2 evidently has no way of dynamically getting the length of how much of the dimension has been written to.
UPDATE: April 26, 2018
Now that I'm almost a year older and so much the wiser, I realize there would have been a much better/smarter/faster way of storing the data with BASIC than by using a BASIC dimension. You could poke the values being read to some address, plus the index. For example, as you read bytes from the file, poke the bytes to $c000+i (POKE 49152+i). And then to read them, just peek them back out. The only issue is converting the byte from a BASIC string, and back again.
At line 50 I read a byte, check if we're at the end of the file. Otherwise line 60 effectively pushes the byte onto the end of the dimension, but setting d$(k)=a$ and then incrementing k. As long as k is less than 10000 keep looping grabbing more bytes from the source and pushing them onto the dimension. When the dimension is full we go into another loop that writes all the bytes in the dimension to the output file. This is done by incrementing another variable l from zero until it's equal to k. And then we go back to line 40 to do another 10,000 byte read phase. If we hit the end of the file before filling the dimension, there is another loop at line 100, 110 that will write out the end of the file. Then both files are closed.
Believe it or not, for one file, a small sequential text file, this actually works. So it tricked me into thinking that maybe this code could be part of a larger script that would copy multiple files. That turned out to be bunk. After writing the larger program and watching it start to copy I realized it was going to take like an hour. So instead of waiting staring at the screen I spun around to another computer and started up a game of CREATURES 2 that I just got in the mail from Psytronik. I played for at least 45 minutes, maybe an hour, and when I turned around the utility had only copied about 10 files. Clearly, a file copy utility that is going to take 5 hours or more to finish is nearly useless.
I didn't know how else to read bytes from one file and write them to another file any faster using BASIC. So I went to bed feeling as though I'd failed. I had a solution, but it wasn't exactly a viable solution.
JiffyDOS, you are our friend
The next night, I got to thinking, maybe there is some way to use JiffyDOS's file copying abilities from inside a program. So I scoured the JiffyDOS manual looking for mention of how to use its copier from within a BASIC program. Unfortunately I didn't find anything. (UPDATE: July 4, 2017: Apparently I didn't look hard enough, because it's totally in the manual.)
I thought about it for a bit though, and remembered that when you load a directory on the C64, it loads it as though it were a BASIC program. What happens when you RUN that directory listing? Usually you just get a syntax error. But when you use the JiffyDOS two device file copier, you load the directory, then you use Control-A and Control-W which places asterisks into the BASIC lines just before the quoted file name. And then you RUN the directory listing to start the copy. That got me to thinking, so I wrote a little program to test it out.
It turns out this works brilliantly. Line 10 uses @# to set the default source device, and @X to set the default destination device. A little bit of testing shows these can also be variables, so you can go: DV=9:@#DV or ZZ=11:@XZZ and those work as you'd expect. Very powerful. As you can see in lines 20, 30 and 40, the asterisk is actually a basic extension provided by JiffyDOS. You can even use that one in direct mode. Amazing, all of the years I've been a Commodore user and all the years I've had JiffyDOS, let's call it 20 years, and I never knew you could initiate a one off file transfer using the asterisk. It's super easy, you just issue these in direct mode:
@#8 @X9 *"myfile"
... and boom, that will copy a file called "myfile" from device 8 to device 9. Default partitions and directories are used both for source and destination so you just change those ahead of time to determine where to copy the file to.
The next thing to point out is the file type. PRG, SEQ, USR. As can be deduced by seeing how it works when you load a directory, mark some files for copy, and run to start the copy, the source file type is on each line, following the quoted filename. I experimented a bit with this, and found that you can either use the 3-letter file type, or just the first letter. This is the same as when you're opening a file for write using the BASIC open command, like this:
open2,8,2,"afile,w,seq":rem with the full "seq" type open2,8,2,"afile,w,s":rem works exactly the same with just "s"
But, what does this file type represent? It has nothing to do with the file type of the file being read. It just sets the file type on the copied file. So you could be copying a PRG file, but if you do it as: *"myprgfile" seq then the destination file will have type SEQ. I have no idea what would happen if you tried to copy a REL file and changed its file type. Also, I'll have to play around to see what happens to a PRG's two–byte load address. My guess is that it just ends up being the first two bytes of the SEQ file.
Lastly, being able to use a variable for the filename to copy makes this the perfect copier for use in a BASIC program. (Or should I say BASIC script?)
Making a File Copier
In order to actually put these abilities of JiffyDOS to use, however, we need a way to load in a directory so that we have a set of filenames to loop over and copy.
About 4 or 5 months ago I stumbled upon this article over at pagetable.com. It's a pretty technically accurate description of what's going on with a directory load from disk, but I think his coding style prioritizes smallness of code to a fault. As someone in the comments points out, it's not even compatible with all versions of the C64 that were commercially shipped. Also, I disagree with his general sentiment that the directory load as a BASIC program is a hack. The article also says the drive makes assumptions about what the computer will understand, for example, to invert the text. It's not so much an assumption, (which sounds negative,) as that the two were written for each other. The directory listing contains a PETSCII value that is defined as REVERSE text ON, and the KERNAL's screen editor knows how to correctly interpret PETSCII. </rant>
It did help me however to understand how BASIC data is structured, because the directory is returned to the computer as a BASIC program. And thus LISTing it presents it just as a BASIC program gets listed.
Each line starts with a 2-byte pointer to the memory address where the next line starts. In modern terms the lines of a BASIC program are a singly-linked list. From the address of any line, the computer can get to the next line, but cannot get back to the previous line without starting from the beginning and searching for it by moving forward through the links. The next two bytes of a line are its 16-bit line number. At first, I guessed that when you enter a line while editing a BASIC program, that the interpreter must interpret your line number, hunt for the line number that precedes your line number, and then do some pointer manipulation to insert your new line into the linked-list. I wanted to be sure though, so I tested it out and was surprised by what I found.
I assumed the linked-list nature of the line numbers helped with inserting new lines because rather than moving large chunks of memory around it could just manipulate the pointers. The tests reveal, however, that it actually does move whole chunks of memory to make room for your new line. Although, having a linked list is still useful at execution time to lookup where a line number is. The interpreter would start at the first line and follow the links from line to line looking for the line number to GOTO or GOSUB to. Here's what my tests revealed.
First I entered line 20 (0x14), then I entered line 10 (0x0A). To my surprise when I looked at the memory layout, the complete content of line 10 precedes line 20's content.
14 08 0A 00 99 22 4A 55 53 54 20 41 20 54 45 53 54 22 00 27 08 14 00 99 22 48 45 4C 4C 4F 20 57 4F 52 4C 44 22 00 00 00
I then decided to add in a line 5, and to my continued surprise the complete contents of line 5 were put into memory before line 10. And lines 10 and 20 where shifted up in memory, and the next line pointers were adjusted to match their new locations.
19 08 05 00 99 22 48 4F 57 20 44 4F 45 53 20 49 54 20 57 4F 52 4B 22 00 2C 08 0A 00 99 22 4A 55 53 54 20 41 20 54 45 53 54 22 00 3F 08 14 00 99 22 48 45 4C 4C 4F 20 57 4F 52 4C 44 22 00 00 00
A final test continues to confirm the behaviour. I modified line 5 by making it longer. All the lines that follow it are shifted up, and the next line pointers are adjusted. This feels a bit brutal to me. If you had a really big program and you edit line 1, it would have to shift the entire program and update the pointers for every line. The upside to doing this is that it doesn't need a memory manager to find free space, and memory never gets fragmented. See my discussions on this in my posts about C64 OS's memory manager.
28 08 05 00 99 22 54 48 49 53 20 49 53 20 41 20 56 45 52 59 20 4C 4F 4E 47 20 4C 49 4E 45 20 49 4E 44 45 45 44 22 00 3B 08 0A 00 99 22 4A 55 53 54 20 41 20 54 45 53 54 22 00 4E 08 14 00 99 22 48 45 4C 4C 4F 20 57 4F 52 4C 44 22 00 00 00
The thing to note about how BASIC lines are structured is that the line number and where the line falls in the linked list are only loosely related. If you are actually editing the program and you enter a new line with a new line number, the BASIC interpreter will on–the–fly shift the higher lines up and out of the way, change all the link pointers, and insert your new line into the correct place according to its line number. However, if you save a basic program to disk and then edit the bytes on disk such that the first line in the linked list has a number bigger than the next line in the linked list, when it comes time to run the program or list the program, it doesn't follow the line numbers, it follows the linked list order. Take a look at these screenshots.
I entered the program as shown in the top screen. Line 10 first, as "first line" then line 20 as "second line." Listing it shows 10 comes before 20 as expected. Then I saved it, edited it in HexFiend for Mac, changing just the line numbers but not the linked list order and reloaded it. In the second screen you see the very odd behavior in that when listed, "line 20" lists before "line 10". However, you can see that "first line" and "second line" strings are still in the original order. And when RUN, it prints "first line" "second line". So when executing it's following the link pointers, not the line numbers. That's pretty neat.
This is relevant because this is how the disk's directory is able to use BASIC line numbers to represent the sizes of the files. But the order of the files is in the order that they are found on disk. In other words the disk's directory exploits the separation of line numbers from their line link pointers. Interestingly enough, that also allows two files that have the same size to produce two BASIC lines that have the same line number!
Knowing all of this, here's the program I wrote that loads in a set of file names by parsing a directory listing that is returned from the disk.
Let's walk through it. We want to know both the filename and its associated file type, so I start by making two dimensions, LN$ and LT$, for name and type. Both contain the same 100 element capacity. LI is our dimension index. Line 10 opens a file for load from device 8. Channel 1 is specifically reserved for "load" (as opposed to a mere "read") which is required by the drive's DOS when the filename is "$" to request the directory. Otherwise, if it were a read, the DOS looks for a file that is actually named "$" and doesn't find one. I'm also using a pattern with the directory load, which is optional. On that same line I do a GET#1,A$,B$. This pulls the first two bytes and they are then discarded, these are the two byte load address that are returned first with every load.
Line 20 initializes our filename string a$ to blank, and goes to the subroutine at 1000. The subroutine just pulls 4 bytes and discards them. These are the 2 byte line pointer and the 2 byte file size (or BASIC program line number) which we don't care about in this program. Immediately after trying to grab the 4 bytes the start a line number I check the status to see if we're at the end of the file. This will be true when we reach the end of the listing and is what ends the program.
You'll see a pattern on successive lines of getting bytes, comparing them to something and either proceeding or looping back to get another byte. 30 and 35 pull a byte and if it's not a quote (PETSCII 34, 0x22) loops back to grab another byte until it finds the first quote. 50 and 55 are a loop that pulls a byte, if it's a quote then it's the closing quote and the filename is over, otherwise it's the next character in the filename and is appended to a$. At 60 therefore a$ holds a complete filename. It is pushed onto the dimension LN$ at LI and printed so we see it. Line 70 is a loop that pulls a byte and if it's a space (32, 0x20) loops to pull another byte until it's not a space. This byte is being stored in b$. At 80 b$ therefore holds the first letter of the filetype, (P of PRG, S of SEQ, or U of USR.) This filetype character is put into the dimension LT$ also at LI, and then LI is incremented. Line 90 is a final loop that searches for the end of the line, which is null terminated (ends with a 0 byte). It then returns to 20 which reinitializes the filename string a$, gets the header and checks if we've hit the end of the directory, etc.
When the end is reached the file is closed, and ln$ and lt$ have the same number of elements. ln$ is full of filenames and lt$ has the corresponding file types for those filenames. That's pretty good! It's easy to see from here how you would just loop over the filenames and use the JiffyDOS file copy commands to copy each of those files. That's really good, we're close to a solution.
Recursive File Copier in BASIC
I've broken the final program down into 4 main chunks. The main program, the file copier, the directory loader, and some directory utility routines. Let's look at them one at a time.
DI and DO are device input and device output, 8 and 10 respectively. The subroutine at 5000 is to setup the initial backup folder, we'll come back to this. Next, the program configures what is effectively a "stack". FA$ is a dimension for filenames (FN$ somehow is reserved and using it produces an error when run), FT$ is a dimension for file types. And FC is a dimension used for making the recursion possible. FA$ and FT$ are 999 elements deep, but these could be made larger if necessary. They represent the total number of filenames that can be on the stack at any one time. This is, all the files in the root directory, plus all the files in the current subdirectory, plus all the files in that subdirectory's current subdirectory and so on recursively through everything being copied. FC is a dimension of numbers, where each element tells us how much of the stack is used for this recursive level. It's defined as 15, which means it cannot recurse more than 15 folders deep. This could also be lengthened if necessary.
FI and FJ are Stack Pointers. FI indexes into both FA$ and FT$, while FJ indexes into FC. Line 20 kicks the process off by GOSUB to the file copy routine at line 30. When this finally returns the entire backup is complete, so it prints the "Backup Complete" message and ends.
Not only does the file copying happen in this routine, but the recursive magic all happens here too. FL is initalized to zero. It's the local count of how many files (and folders) are found in the source's present directory. GOSUB 4000 takes us to a directory reading routine much as we have already seen, that loads the files from the present directory, and sets FL to how many it finds. As it finds files, it pushes the filenames (and their file types) on to the global FA$ (and FT$) dimensions, which updates the FI index as it goes.
Line 40 pushes the local file count onto the FC dimension. And if that count is zero, there are no files in this folder so it skips to the end of this routine which will pop that file count off and print a message about this folding being complete and return.
Line 100 sets the source and target devices, this probably doesn't have to be done once for every folder, but it doesn't hurt. 110 sets local variables $F and $T as the filename and file type that are the last element of those stacks. Lines 120 and 130 check to see if the file type is either P or S, and if they are does the JiffyDOS copy command with the current filename and either PRG or SEQ for destination file type. I tried to get those to be resolvable as variables, but for some reason that doesn't work. They have to be typed out. I'm not supporting USR (nor REL, DEL, etc.) because I only ever use PRG and SEQ file types. After copying a file it skips to line 200. This decrements the local file count, and pops the filename and file type off their stacks. If the local file count is still greater than zero it loops back up to line 110 to copy the next local file.
If the file type is neither P nor S, it is assumed to be D for directory. Lines 160 through 180 handle recursing into the subdirectory. GOSUB 5200 is a subroutine that creates the directory based on the current filename on the target device. Then switches into that directory on both source and target devices. The current local file count is backed up to the FC stack. And GOSUB 30 recurses by calling the file copy routine at 30 from within itself. Mind bending recursion occurs, but when it returns we GOSUB 5300 which pulls both source and target devices back up one directory. Then the local file count is restored and this routine continues copying files at this level.
The key to how this recursion works is that FC dimension. Let's say in the root folder it finds 10 files and 3 directories. The last element of FC is thus set to 13, indicating that the last 13 entries in FA$ and FT$ belong to this level. After copying one of those files, that file is popped from FA$/FT$ so now only the last 12 belong to this level. Thus the value at the end of FC is decremented to reflect this. When this level encounters a subdirectory, it creates the directory and moves source and target into it. Then it reads the file count in that folder and pushes a new file count to the end of FC. This tells that routine how many files it needs to process at this level before its over and the remaining files on FA$/FT$ belong to the previous level. It works brilliantly! Gads I love recursion.
This read directory routine is nearly identical to the standalone routine we went over earlier. With a couple of minor exceptions. When a directory is read in, a header flag, H, is set. This causes the first row, the directory header, to not be pushed on the stack as a file to be copied. And the P flag is used as a "phase." Reading a directory is done in two phases, the first gets all files that have any single character extension. This accords with how I've got all my files setup for C64 OS development. After phase 0, it loops back and does a second phase which repeats the whole routine except it changes the directory pattern to just load directory names.
Filenames and types are pushed on to FA$ and FT$ respectively, and FL is incremented for each entry found. And thats it. This routine only ever operates on the current directory of the source device. Setting up that current directory is entirely handled by the previous file copy routine.
This last section is for the three utility routines related to handling directories. 5000 is called once at the very start. It asks the user for the backup date to use, typically I'd put this in the format YYYY-MM-DD, but I don't have to. The current directory on the output device is switched into the backups folder, the new date folder is created and switched into. The source folder is assumed to be already the default, this is a useful assumption because it lets me recursively backup an subset of the C64 OS project, or actually any other folder on any other partition (with the one constraint that it only matches files with a single character extension.)
5200 makes a new directory in the current directory of the target device based on the current filename F$. Then it swithes into that folder on both target and source devices.
5300 tells both the source and target devices to go back up one folder.
And that's it folks! That's the whole thing. A recursive file copy utility, scripted in BASIC, and backending on the JiffyDOS 2 device file copier. One caveat is that you have to have JiffyDOS. But one time when I asked Glenn Holmer on IRC to help me test a drive detection routine I wrote, I asked if he had JiffyDOS. He said he did, then very cheekily added that he also has indoor plumbing. I take his point. If you don't have JiffyDOS yet, what are you waiting for? Go buy the roms now, available from Retro Innovations.
This above is what it looks like as the copier is copying and moving through directories. At the very top you see it make the App Launcher folder. Then reads its files and subfolders. Then it copies 4 files in that folder. Then that folder is complete. But then, its parent folder is also complete, and it steps back up and out of the "Services" folder. Next it makes the "Applications" folder which is sibling to "Services". But Applications is empty so after reading its files and folders it reports that that folder is complete. Then it continues copying all the other files from the root level. It actually works. Cool.
Here's how the code looks, not in screenshots, if you want to copy it and make use of it for yourself.