December 20, 2016 | #12 | Software

A Modern Character Set

I do like PETSCII. The graphics characters allow Commodore 8-bit programmers to produce some pretty stellar images out of only text. However, there are some obvious shortcomings to the standard character set on a C64. Some characters that are entirely standard on PCs and Macs are completely absent on the C64, both from the keyboard and from the default character set in ROM.

This isn't ordinarily a problem; on a C64 we simply don't use these characters. But when trying to make an OS that integrates with the modern internet, it starts to become more of an issue.

The characters are: Backslash (\), Caret (^), Underscore (_), Back Tick (`), Pipe (|), Left and Right Braces ({}), and Tilde (~). These eight characters appear all over the web, and not only in special Unicode character sets. They are part of plain ASCII, which is used in email addresses, web URLs and more. Their ASCII values, respectively, are: 92, 94, 95, 96, and then 123 and 125 for the braces, 124 for the pipe, and 126 for the tilde.

The equivalent characters in PETSCII are:

Character     Code   PETSCII glyph at that code
Backslash     92     British Pound
Caret         94     Up Arrow
Underscore    95     Back Arrow
Back Tick     96     A Graphic: Full Width Dash

Pipe          124    A Graphic: Left Half Hatch
Left Brace    123    A Graphic: Full Height and Width PLUS
Right Brace   125    A Graphic: Looks Like a Pipe
Tilde         126    A Graphic: Full Hatch
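
For readers who prefer to see the table as data, here is a small Python sketch. It only restates the table above; the names `PROBLEM_CHARS` and `describe` are made up for illustration and are not part of C64 OS.

```python
# The eight ASCII characters with no matching glyph in the C64 ROM set, paired
# with the description of what PETSCII shows at the same code (from the table).
PROBLEM_CHARS = {
    "\\": "British pound",
    "^": "up arrow",
    "_": "back arrow",
    "`": "full width dash",
    "|": "left half hatch",
    "{": "full height and width plus",
    "}": "looks like a pipe",
    "~": "full hatch",
}


def describe(text):
    """Return (char, ascii_code, petscii_glyph) for each problem character in text."""
    return [(ch, ord(ch), PROBLEM_CHARS[ch]) for ch in text if ch in PROBLEM_CHARS]


for ch, code, glyph in describe("~/my_file"):
    print(f"{ch!r} (ASCII {code}) would display as: {glyph}")
```

Running it over a URL path or a snippet of code gives a quick feel for how often untranslated text would be affected.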

I have opted to have C64 OS use (primarily) the Uppercase/Lowercase character set. This is because to display and edit text it is necessary to have both upper and lowercase characters. This does limit the set of graphic characters available, but this seems like an acceptable trade-off for having both cases of alphabetic characters. Switching back and forth between the character sets will not be possible. Doing so would seriously break the appearance of text that was intentionally rendered in both cases.

What this means is that when looking at the built-in characters, we really can only look at the second set to see what graphic symbols are available to us. Imagine if we did nothing at all, and simply used the characters as they appear in PETSCII without translation. Where there is supposed to be a caret we'd see an up arrow. That seems like a pretty acceptable alternative. A caret already looks like an up arrow; it's just missing the vertical stem. Where there is supposed to be an underscore, we'd see a left pointing arrow. That, in my mind, is not horrible, but it is certainly less acceptable than the caret.

Things really go south after that. Where there should be a backslash, we would see a British pound symbol. The upside is that it's easy to type. And in certain contexts it wouldn't matter. We would only ever see the pound symbol, and when necessary we'd reproduce it. But in other contexts where, for example, the backslash was meant as a delimiter in human readable text, a pound symbol would look very odd indeed. Back ticks are in the same boat. In some contexts it might not be horrible to see a full width dash. But back ticks are often used as a special form of quotation mark. Seeing full dashes around text might be odd there, although not the end of the world.

The pipe, which is a vertical bar, would appear to us as something that, in my opinion, is quite suitable. It's a vertical strip, aligned to the left of the character cell, but hatched instead of solid. At least it is vaguely pipe-like. Braces and the tilde, however, are a total write-off. The left and right braces ought to match one another as a pair. Instead we have an oversized PLUS, and what looks like... strange, an ordinary pipe! (That is, the pipe in PETSCII is just offset by one place, more on this in a moment.) The tilde would be replaced by a full hatch. That to me is rather unsatisfying.

Out of the eight problematic characters, five are unacceptable as is, two are workable alternatives, and one is on the fence, but perhaps workable. The question is, what should be done about this problem? One solution is to use the VIC-II's ability to use an entirely custom character set. The way this works is that the original set can be copied from ROM into RAM, then a few of the offensive characters can be graphically modified, and some VIC-II registers can be adjusted to tell it where in its 16K memory page to find the character set. It works, it's easy, and it's a complete solution. The only problem is that it is an all or nothing affair. If you wanted to change the appearance of just a single character, and still have access to the rest of the complete character set, you would be required to copy half of the ROM to RAM. The ROM is 4K, and thus you are robbing the system of 2 full kilobytes of RAM, or over 3% of total addressable memory, just for that one character (and its reversed pair). Or in our case, just for those 5 unworkable characters. This is a sizable commitment, so it is worth considering our alternatives.
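
To make the cost concrete, here is a rough Python simulation of the copy-and-patch idea. It is only a sketch under stated assumptions: the zero-filled `bytearray` stands in for the 2K copied out of ROM (real glyph data is not reproduced), `patch_glyph` is a hypothetical helper, screen code 28 for the pound symbol is my assumption from the standard screen code chart, and the underscore bitmap is just one plausible design. On real hardware this would be done in 6502 assembly or BASIC POKEs.

```python
CHARSET_ROM_SIZE = 4096                  # both character sets together, per the text
HALF_SET_SIZE = CHARSET_ROM_SIZE // 2    # one set: 256 glyphs x 8 bytes each
ADDRESSABLE = 64 * 1024                  # total addressable memory on a C64

# Stand-in for the half of the ROM copied into RAM (contents are placeholder zeros).
charset = bytearray(HALF_SET_SIZE)


def patch_glyph(charset, screen_code, rows):
    """Overwrite one 8x8 glyph and its reversed twin at screen_code + 128."""
    if len(rows) != 8 or not 0 <= screen_code < 128:
        raise ValueError("need 8 rows and a screen code in 0..127")
    for i, row in enumerate(rows):
        charset[screen_code * 8 + i] = row
        # The reversed glyph is the same bitmap with every bit inverted.
        charset[(screen_code + 128) * 8 + i] = row ^ 0xFF


# Hypothetical underscore design: seven empty rows and one solid bottom row.
UNDERSCORE = [0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF]
POUND_SCREEN_CODE = 28  # assumption: the pound sign's slot in the screen code chart

patch_glyph(charset, POUND_SCREEN_CODE, UNDERSCORE)
print(f"RAM used: {len(charset)} bytes = {len(charset) / ADDRESSABLE:.3%} of 64K")
```

Only 16 of the 2048 bytes actually change, which is the all or nothing problem in a nutshell.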

Showing an alternative character set programmed from BASIC

The image above shows the concept at least. The character between the two square brackets is the British pound symbol in the ROM. While in a real context this should have been made into a backslash, for demonstration purposes it was turned into an underscore. At the bottom of the screen you can see a sample URL typed out to show that the underscore looks beautifully at home on a C64.

There is, however, a whole suite of graphical characters that already exists in the uppercase/lowercase character set. Furthermore, C64 OS does not have provisions for typing the graphical characters that are not part of ASCII. These characters of course will be used for bits and pieces of the text-based UI. There is one other consideration: text which comes from the internet will have to be put through the ASC2PET routine to get a PETSCII version. And before PETSCII text can be rendered to the screen it has to be put through the PET2SCR routine.

What if we were to use PET2SCR to reassign the PETSCII values that correspond with ASCII symbols we want, to draw certain PETSCII graphic characters instead? Here's what I would suggest and what it would look like.
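
Here is a rough Python sketch of that idea, not the actual C64 OS routines (which would be 6502 assembly). The functions `asc2pet` and `pet2scr` borrow the routine names from the text, but their bodies are my assumptions: a simple case swap for ASCII to PETSCII, and the standard PETSCII to screen code folding. The screen codes in `HYPOTHETICAL_OVERRIDES` are placeholders picked from my reading of the standard chart, not necessarily the glyphs finally chosen.

```python
def asc2pet(code):
    """ASCII -> PETSCII for the Uppercase/Lowercase set (assumed: swap letter case)."""
    if 65 <= code <= 90:       # ASCII 'A'-'Z' -> PETSCII shifted letters
        return code + 128
    if 97 <= code <= 122:      # ASCII 'a'-'z' -> PETSCII unshifted letters
        return code - 32
    return code                # digits, punctuation and the eight symbols pass through


def pet2scr_default(pet):
    """Standard PETSCII -> screen code folding for printable codes."""
    if pet == 255:
        return 94
    if 32 <= pet <= 63:
        return pet
    if 64 <= pet <= 95:
        return pet - 64
    if 96 <= pet <= 127:
        return pet - 32
    if 160 <= pet <= 191:
        return pet - 64
    if 192 <= pet <= 254:
        return pet - 128
    raise ValueError(f"PETSCII {pet} is a control code with no glyph")


# Keyed by PETSCII value (same as the ASCII value for these symbols).
# The screen codes are illustrative placeholders, see the note above.
HYPOTHETICAL_OVERRIDES = {
    92: 127,   # backslash   -> meant to be a two-block diagonal
    95: 100,   # underscore  -> meant to be a low horizontal bar
    96: 39,    # back tick   -> drawn as an apostrophe
    123: 115,  # left brace  -> meant to be a vertical stroke with a tick to the left
    124: 101,  # pipe        -> meant to be a vertical bar against the left edge
    125: 107,  # right brace -> meant to be a vertical stroke with a tick to the right
    126: 119,  # tilde       -> meant to be a thick bar at the top
}


def pet2scr(pet):
    """PETSCII -> screen code, with the remapped symbols taking priority."""
    return HYPOTHETICAL_OVERRIDES.get(pet, pet2scr_default(pet))


def render(text):
    """Run ASCII text through both stages and return the screen codes."""
    return [pet2scr(asc2pet(ord(ch))) for ch in text]


print(render("a_B"))
```

The caret has no entry in the table, so it falls through to the default folding and still shows as the up arrow.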

Showing an alternative character to screencode mapping

The first six and a half lines show the unmodified shifted set from ROM. We had eight problematic characters. Underscore is most handily supported, because PETSCII graphic characters include a variety of horizontal bars of varying thickness and positions. You can see the underscore looks essentially perfect in its example use in a URL path.

The pipe is in a similar boat. PETSCII graphic characters have lots of full height vertical bars. The only downside is that the only bars positioned in the middle of the character are in the Uppercase/Graphics character set. In the Uppercase/Lowercase set that C64 OS uses, I'll have to choose whether the pipe appears butted against the left or against the right. Above you see how it would look butted against the left, in some mock code used as a bitwise OR.

The tilde is slightly less ideal, but not bad, in my opinion. What would be best would be a half hatch that fills the top half of the character. Strangely, neither of the ROM character sets has such a character. Instead, I've opted for a thick bar at the top. It's like an underscore but it's oriented at the top and is a bit thicker. The tilde is not super common, and this seems to be sufficiently analogous that it definitely puts this character in the "workable" bucket.

The caret was already in the workable bucket. The up arrow. I think it continues to work and there is no more opportune PETSCII graphic to map it to.

I am the most fond of the solution for the left and right braces. Both character sets include these characters, which is very fortunate. There are four characters which consist of a full-size stroke either vertically down the middle or horizontally across the middle of the character, plus a half-width stroke that extends from the middle to an edge. It's the same character essentially rotated in four different directions. The vertical stroke with tick to the left looks remarkably reminiscent of an open brace. And its mirror opposite looks remarkably like a right brace. I love these. I feel like we really lucked out on these ones.

The backslash is a bit of a sad story. The reason is that the Uppercase/Graphics character set has an absolutely perfect backslash character. There is however no exact equivalent in the shifted character set. I tried a few variants with completely off the wall choices, but finally settled on what you see above in the mock Windows directory path. It's like a low-res backslash. Two oversized pixels that form the shape and direction of a backslash. It's the best I think we can get from the PETSCII graphics in the shifted set. But backslash is not hugely used on the internet, so I think this puts it in the "workable" bucket.

Lastly we have the back tick. Oh the back tick. How I loathe thee. This is a bizarro character to start with. I looked up the proper usage of the back tick on the web. It turns out that in the English language there is no proper use for it. It was once used in older times as a visually alternative apostrophe or single quote, but this use is deemed inadvisable by grammarians. It is also used in certain PC/Mac computer programming languages, and in some shell scripting. But, quite simply, it is not widely used. I think if we encounter it in text, it should simply be drawn as an apostrophe.

Conclusion

So, where does this leave us? By mapping common characters found on the internet to some shifted set PETSCII graphics characters, we can take eight problematic characters and deal nicely with seven of them. The 8th is uncommon and not necessary even to produce. Sticking with this technique, we can save nearly 2 kilobytes of RAM. The saving is a little less than 2K, because the map of these characters and some logic in PET2SCR will have to be written. But overall it is a huge memory saving.

One last consideration is what we will not have access to from PETSCII. We will lose the British pound symbol. This is represented in Unicode however, and it is an adventure and a few posts unto themselves to deal with Unicode in C64 OS. We will also lose the back arrow. Again, only in text rendering, not in drawing the UI. We will lose the full width dash, which is interesting. I can imagine a way to get a Unicode em dash to render using this character. We will lose a half-hatch and a full-hatch. No biggie. We will lose a pipe character, again not a problem as this character is reintroduced via a different encoding. And lastly, we'll lose the full height and width PLUS character. This is also not a big deal. The real plus is what is encoded on the web. This oversized PLUS was used only for PETSCII graphics, and will still be available to us for drawing C64 OS's UI should we choose to use it.

That's all for that. Again, comments, thoughts and suggestions are always welcome.