
December 16, 2021 • #116 Programming Theory

Text Rendering and MText


I wrote the first two posts in a 3-part series about the VIC-II and FLI Timing (Part I and Part II,) but I have a lot to say about things I'm working on presently, and I don't want to miss the moment to talk about them while they're fresh in my mind. We'll return to Part III of the VIC-II and FLI Timing after this.

In mid-2019 I wrote a post called Webservices: HTML and Image Conversion. I have since revisited the image conversion part in the post, Image Search and Conversion Service, but have not yet returned to the idea of a simple marked-text format, which I refer to as MText.

In the meantime, I have implemented the Toolkit classes that are always memory resident:

  • TKObj
  • TKView
  • TKScroll
  • TKSbar
  • TKSplit
  • TKTabs
  • TKLabel
  • TKCtrl
  • TKButton

Plus several more that are runtime loadable, relocated in memory and dynamically linked to their superclass:

  • TKTable
  • TKTCols
  • TKInput
  • TKPlaces
  • TKFileMeta
  • TKPathBar
  • TKInfoBar

All these classes and the whole Toolkit system are now quite mature—many bugs have been found and squashed, and several optimizations have been made—and are in productive use in the File Manager and across numerous Utilities.

Most recently I have been tackling the TKText class. It is for displaying (but not editing) multi-line text. It has numerous features that we'll get into in some detail in this post.

I stumbled upon a Stack Overflow question about how to optimize for making text selections in a custom text rendering view. The consensus opinion is that implementing a custom text rendering class is "obscenely difficult." That's why operating systems provide classes to do that work for you. And that's what TKText is all about.

The reason for an OS to provide a class is twofold:

First, so you don't have to do the technical heavy lifting yourself. You can just instantiate the class, call some of its methods and—BOOM—your application has rich text handling.

Second, text handling is tricky and nuanced. If every application implemented its own solution, the quality of the experience would vary substantially between applications. (Just as it does between two Commodore 64 programs chosen at random.)

By providing a class that handles it for you, lazy programmers (lazy-good, not lazy-bad) will rely on the provided class and the users benefit from a more consistent experience. Now let's get into what TKText provides and some of the difficulties implementing it.

 

What is TKText?

TKText is a C64 OS Toolkit class that descends directly from TKView. This allows it to snap into any view hierarchy, anchored to its parent for flexible resizing. It can be assigned as the content view of a TKScroll with which it negotiates its vertical and horizontal content size. It handles keyboard and mouse events. And it can have other objects attached to it as children.

At its heart, it displays a multi-line body of text.

Text is complicated. Not as complicated as HTML, granted, but it's more complicated than you'd first imagine.

There are at least two different encodings: PETSCII as found on and produced by the C64 and other Commodore 8-bit machines, and ASCII. Someone asked me on Twitter, "Where does ASCII come from on a C64?" It comes from the C64's interaction with the outside world. As soon as you start transferring files from a PC or Mac (via an SD Card or over a network, or reading a PC floppy disk, or a CD-ROM,) you end up with ASCII text files.

There are multiple line endings: LF, CR, CRLF, which come from Unix or modern macOS, classic macOS or the C64 itself, or Windows-based PCs, respectively.

Further complications in text handling include: soft word wrapping, line justification, dynamic memory management, and efficient scrolling. Toss in text selections and the hit-testing required to make that work, and things are getting pretty complicated.

Plain Text, plus just a little more

MText is effectively plain text, with just a little more. We'll talk about plain text first. And later we'll get into some features of MText and the extra level of complication that introduces.

There is no way in hell the C64 is going to support unicode or UTF-8 encoding, so, "no" to that.

Let's instead just worry about plain text in ASCII and PETSCII. These are relatively straightforward: every character is represented by a single byte. This neat arrangement is made more complicated by tab, which, although it is a single byte, produces varying amounts of physical space between two characters by aligning the character following the tab to the start of the next available tab stop. TKText, as of this writing, does not support tabs; they are simply ignored for now.

Another complication is that CRLF line endings are two bytes; the CR and LF are each one byte. This is not a big deal, but it's a complication that we'll see soon. The existence of both tab and varying line ending lengths hints at the difficulty of applying simple arithmetic to determine, with certainty, where a character is going to fall. This has implications for efficient scrolling, hit-testing, and word wrapping.

Before simply diving straight in, a bit of historical context may be helpful for understanding the situation we're in when we try to open and render a random text file on the Commodore 64. The complications pertain to wrapping vs. unwrapping, fixed vs. flexible layouts, and the problem of multi-line indentation, (otherwise known as a hanging indent,) on a narrow (40-column) screen.

What a mess this all leads to.


The Humble Origins of Plain Text

While doing some reading for the work on TKText and MText, I discovered some pretty interesting tidbits from the history of information technology.

Once upon a time, before computers existed, electro-mechanical machines were used to transmit textual information over wires or by radio signals. The earliest of these machines, such as the telegraph set, date back to the middle of the 19th century.

Hughes telegraph, an early teleprinter built by Siemens and Halske in 1855. Very Steampunk.

The problem is that different telegraph machines used different encoding schemes for the characters. There was no universal standard for exchanging information between the different types of machines.

Development of ASCII, the American Standard Code for Information Interchange, began only in 1961, over 100 years after the telegraph machine pictured above. Finally, if every machine—even those made by different companies and manufacturers—adopted this same code, any one machine could send information and it would be successfully interpreted by any receiving machine. The development of standards, more generally, was a brilliant human innovation.

In the days of telegraph and teletype machines, which existed for well over a century before the first computers began to use video displays, the information code (ASCII and its teletype code predecessors) was tightly coupled to printed output. These machines had no computational ability; they had no memory, no capacity to look ahead before outputting, and hence no layout calculations of any kind were performed. The individual ASCII codes were so coupled to the hardware that they literally instructed the hardware what to do and where to go, one symbol at a time.

A teletype machine, with multi-line paper, hooked to a telephone.

First, the familiar example. I'm reasonably sure everyone reading this blog knows that the reason why Windows PCs use CRLF as their line endings is because it dates to the time when ASCII drove teletype machines. The CR, Carriage Return (0x0d), tells the printing mechanism, the carriage, to return to the start of the current line. The LF, Line Feed (0x0a), tells the apparatus to roll the paper forward one line. Together, they prepare the hardware to begin printing from the start of the next line. Unix-based systems, though, standardized on using only the LF character, ditching the CR, while Commodore and classic-era Apple standardized on CR only, ditching the LF. Explaining why some computer systems abandoned two-character line endings in favor of one is the stuff of legend and lore. Some say it was because storage was at an extreme premium and it is cheaper to store one byte than two. I think that might be part of it, but I also think it's only half the story.

After all, transmission speed was also at an extreme premium. And there is little reason why a single character couldn't have been used to tell the electro-mechanical telegraph machine to perform both functions. So why was it split into two characters in the first place? The reason is because printing to paper is in some respects fundamentally different from displaying on a screen; there are things you can do with a paper printout that you cannot do with a screen. It is this difference that made two-character line endings irrelevant in a world that was transitioning to mostly screens rather than mostly printed paper. By the 1980s, even when you got a printer, the screen was dominant. The screen was truth, and you only printed something when you wanted a paper hardcopy of what was on the ephemeral screen. In the old days, though, paper was not an afterthought. Paper was the truth, and could do things that the screen was unable to replicate. At least not until screens were being driven by much more powerful computers and the development of Unicode in the late 80s and early 90s.

So what are we talking about here? I'm 40 years old, and I've been using computers for around 35 years, and there are things about ASCII that I just learned while doing my reading for implementing the TKText class. Either I'm a naive youngin, or this is stuff most people don't know.

The only reason to split the functions of Carriage Return and Line Feed into two steps is so that you can trigger one of them, but not the other. If you issue a Carriage Return without a Line Feed, the carriage returns to the start of the current line. From there, further characters that are printed will be combined on the paper with the characters that are already there. Of course, this makes no sense on a primitive video display terminal, but how was it used on paper?

Ever wonder what the underscore character is for? Maybe, like me, you've come across all these standard definitions: (From here.)

Explanations for what an underscore is and how to use it on a modern computer.

All of these are reinterpretations of how an underscore can be used on a modern computer.

Underscore is ASCII's answer for underline. If you want to underline an entire line, you issue a Carriage Return with no Line Feed. Then you issue a series of underscores which are printed overtop of the existing characters, combining with them to produce underlined text. AH! Now it makes sense. Now we get the juices flowing. How about for strikethrough? No problem. Print a line, issue a CR, then a series of minus characters that get dashed along the midline of each character.

This is pretty cool, but returning to the start of the whole line is a bit of a blunt instrument. What if you only want to underscore a single word that's 5 letters long in the middle of a line? If you do a Carriage Return, you'd have to issue a bunch of spaces to reposition the carriage at the start of the word to underline. For this, ASCII has the backspace code. OH! "Back" and "Space", now this makes sense too. Backspace is not called delete for good reason. And the original delete is a forward delete because of how it worked on paper. The Backspace code moves the carriage back just one character position. You could then "delete" a character by issuing the delete code, which prints a solid block over top of the existing character, obscuring it and positioning the carriage once again following the deleted character. So, ASCII and paper-based deleting looked a lot more like this:

Deletions in printed ASCII looked a lot more like redactions on legal documents.

Of course, Backspace isn't only for deletions/redactions. If you just printed out a 5 letter word and you want to underscore that word, it's a simple relative move: Backspace 5 times, then Underscore 5 times, and you're right back to where you were before. OH, this is delicious, the epiphanies just keep coming. Just what the heck are the Tilde (~ 0x7e), the Caret (^ 0x5e) and the Backtick (0x60) for anyway? If you think they represent your home directory, the symbol for a control-sequence, and how you run a command as the argument for another command in a shell script, respectively, congratulations! You are a modern computer nerd!

But now we know what they are really for. ASCII produces accented letters by arbitrarily combining them with common diacritical characters. Tilde is not how you represent the current user's home directory, it's how you spell "jalapeño."

J A L A P E N backspace tilde O

Although ASCII is occasionally panned as an english-centric American code, it in fact has pretty decent support for other Western European languages when used in its original printed context. The backtick and the caret, otherwise known to linguists as an accent grave and a circumflex, can be combined with any character, even those not typically found in, say, French. As some people online have noticed, I speak Esperanto. Esperanto uses a circumflex over 5 different letters: ĉ ĝ ĥ ĵ and ŝ. To make words like "seĝo" (rhymes with "hedge-Oh", chair) in ASCII, over a telegraph machine, would be as simple as:

S E G backspace caret O

The same was true, mostly, for regular old typewriters too.

The back spacer key, from the Remington Standard Typewriter Manual.

Why did we go on such a long side trip into the history of ASCII? Because I love sidebars. I love to learn things I never knew before, and I like to share what I find interesting. But now that we have a better perspective on the historical prevalence and pervasiveness of printed text, we are in a much better place to understand the transition of text to the land of computers.


Wrapping vs. unwrapping

Since the C64 uses a carriage return (CR) code for its line ending, I'll refer to a line ending as a CR, for the sake of brevity.

Text wrapping is when a sentence flows off the end of one line and continues onto the start of the next. Hard wrapping means that a CR is found in the text data before a fixed line width is reached. The CR forces the text that follows to begin on the next line down. When text is soft wrapped, the text data itself doesn't have a CR at the end of every line, it has a CR only at the end of a whole paragraph. To wrap the text the computer performs calculations, based on the currently available width, to find a suitable place to break the string of text such that it gets drawn neatly on multiple lines, typically without cutting a word in half at an inappropriate place.

An alternative to soft wrapping, (although, this is only possible on a screen,) is to unwrap the text. Unwrapped, a whole paragraph is displayed in a single line. That long line simply disappears off the right edge of the screen. To see the part of the line that is wider than the line width, the viewport (sometimes the whole screen, sometimes a window on the screen) has to be scrolled horizontally.1

There are advantages and disadvantages to both soft and hard wrapping, which we're about to discuss, all of which have to be dealt with by TKText and helper libraries in C64 OS.

Hard wrapping is the original practice and, if you think about how this works on a mechanical typewriter, was once the only feasible practice. There is no intelligence in a typewriter; the intelligence lies solely in the typist. The typist strikes keys and the carriage moves across the page. Only the typist can know whether the next word that he or she will type is going to fit in the remaining space on the line. Each key pressed is immediately laid down with ink on the paper. The typist must hard wrap the text in their head, deciding when it is appropriate to go on to the next line. How physically to go to the next line depends on the model and features of the typewriter but traditionally involved pushing a metal arm to manually move the carriage back to the left side. On the Remington Standard Typewriter Models 10 and 11, (according to the manual,) a line feed occurred mechanically and automatically as part of the process of returning the carriage.

Teletype machines, like typewriters, had a range of characters per line, but there were a few common standards. 80 characters per line was a standard width. This width relates to the pitch size of the characters and the standard width of the paper. Since computers initially used teletype-style machines for their textual input and output, computer systems adopted this width as a standard too. The design of IBM punch cards, with 80 characters per row, was based on the teletype's printed line width. And when the teletype machines were gradually replaced by video terminals, they too adopted 80 columns to mimic the printed standard.

An IBM standard punch card, 80 columns per row.

So here's the thing. If your display can show you 80 columns, and the text you're trying to display is hard wrapped at 80 columns, you're in luck. Soft wrapping is computationally expensive; hard wrapping is much easier. You just draw out each character until you hit the line ending. Then you start drawing out the next row of characters on the next line down. Nothing could be easier than that.

The problem is that 8 pixels per character, times 80 characters per line, requires a display matrix of at least 640 pixels across. That takes a lot of memory, a powerful video chip, and fast access to memory. As was discussed at some length in the first two parts of the posts on VIC-II Timing (here and here,) there is barely enough time for the VIC-II to pull off 320 pixels per line. That's only half the standard requirement for 80 columns. The C64 and many other 8-bit computers show 40 columns, with 8 pixels per character.

The C64 can do a soft-80, but it requires drawing characters in bitmap mode (slow) and each character gets only 4 pixels across. You try drawing an M or a W in only 4 pixels. It can be done, but it's brutal. I spent years of my teenage life accessing BBSes and the internet via dial-up shell account in Novaterm using a soft-80 font. My 15-year-old eyes forgave me, my 40-year-old eyes do not.

A C64 4x8 pixel font, for soft 80 columns.

Fixed vs. flexible layout

The fact of the matter is, a great many text files in this world are hard wrapped to 80 columns, or perhaps slightly less than 80. For example, there are tens of thousands of free books at Project Gutenberg, and their plain text files are virtually all hard wrapped to 72 characters per line. See The Time Machine by H.G. Wells as a perfect example.

If your video display (or your teletype machine, or your printer) can handle 80 columns, then hard wrapping to just under 80 columns is ideal. You can grab that text off Project Gutenberg, and dump it to a printer without any calculations, with no computational contribution to layout whatsoever, and it will look great. In fact, with your Commodore 64, a Wifi Modem, and a few lines of BASIC, you could source that file, read the data in over RS-232 and stream out the entire 200KB direct to a printer, and you would hold an 85-page book in your hands. From the perspective of the history of the world up until that point, there is something beautiful in the simplicity of the hardware requirements that gives you access to the power of… well, the Gutenberg revolution.

The Time Machine from Project Gutenberg, hard wrapped.


But rarely do we print today. We want to see and read stuff on the screen, especially content accessed from the web. The problem with hard wrapped text is that it is inflexible. Our C64's screen is only 40 columns wide, but in practice slightly less because we'll have at least a vertical scroll bar. Without any additional wrapping work, a hard wrapped text will display neatly on successive lines, but each individual line is too long to be seen at once. Maybe that's okay for some kinds of data, but there is no way in hell you'd want to read 40,000 words scrolling left and right on every line.

Although soft wrapping text is more computationally expensive, it is significantly more flexible, and thus preferable for our C64 with its low screen resolution. What we prefer, then, is that the text data itself not be wrapped. Just put a CR (or two) between paragraphs. Then have the computer decide how many words ought to fit on a line. This is not only more flexible for handling a narrow, but still fixed, width like 39 characters, but is truly realtime flexible. In the Help Utility, there will be a list on the left side that shows a table of contents. Click an item from the table and it loads in that file. This leaves less room for the text view, but with a splitter you could drag the sidebar either narrower or completely closed. You want the text in the text view to dynamically rewrap to best suit the available space.

Hard wrapped text that is wrapped too wide is the worst case scenario. You can always try to soft wrap hard wrapped text, and that will have the advantage that you won't have to scroll horizontally to see everything. But it looks unnatural because alternating lines have different lengths. For example, if a text is hard wrapped to 60 columns, and you soft wrap that to 39 columns, every odd row will be maybe around 35 characters long on average, and every other row will only have an average of 25 characters, giving the whole an awkward jagged right edge.

Books on iPhone soft wraps hard wrapped text.

Milton's Paradise Lost is an epic poem; that's why it's hard wrapped. A new line appears after every stanza. But the result of soft wrapping gives a noticeably jagged edge that you wouldn't see if there were a greater total available width.

Multi-line Indentation or Hanging Indents

Hard wrapping does have an advantage over soft wrapping though. Or rather, there is an extra complication to soft wrapping that hard wrapping doesn't need to worry about: How to handle hanging indents.

First, what is a hanging indent? Let's say you have a list of bullet points. The start of the line of the first point has, say, SPACE SPACE ASTERISK SPACE. The first text of the point begins 4 columns in from the left. Each bullet point ends with a CR, as though it were its own paragraph, because you want the next point to begin on the next line down.

If the text of a point is short, everything works out. But if the text is long enough that it needs to be wrapped to multiple lines, things get tricky.

In a hard wrapped text, the person doing the wrapping puts in the CR at the appropriate place (near the end of the fixed line width), and immediately following the CR puts in SPACE SPACE SPACE SPACE, four hardcoded spaces, so the text of the point's second line aligns with the text of the first line. The fact that the second line is indented more than the asterisk of the line above it, is called a hanging indent. When text is hard wrapped hanging indents are encoded into the data manually.

Hanging indent in a hard wrapped text.

If you soft wrap a hard wrapped text that has a hanging indent, things really look like shit. The hardcoded indent (the four SPACE characters) usually get stuck at some arbitrary place in the line, they don't vertically align, and just look like crazy extra space in the middle of a line, plus you lose the original intent of the hanging indentation in the first place, so the bullet points also get muddled.

If you soft wrap an unwrapped text that has the intention of a hanging indent, it's also quite tricky. If the text data simply uses two SPACEs at the start of the line, followed by perhaps an asterisk, or a minus, or some other bullet-like character, there is no way for the wrapping code to know that a hanging indent is intended. It will wrap it as though it were a regular paragraph. It doesn't look terrible, but it isn't as rich a layout as can be accomplished with the hard wrapped text.


What TKText does for Plain Text

Now that we know a bunch of definitions and can see some of the complications, let's talk about what TKText does for rendering plain text.

When instantiated, TKText automatically loads a text-wrapping shared library called wrap.lib. TKText does not load text, nor save text. Your Application or Utility is responsible for loading the text into some memory somewhere, and the text must be a valid C-String. That is, the bytes of the text are consecutive in memory and end with a NULL (0x00) byte.

TKText has two setter methods: setstrp (Set String Pointer) and setstrf (Set String Flags). To have TKText present the text you have loaded, you call its setstrp method, passing a pointer to the start of the text string.


A brief aside on loading text

Loading text, (i.e. handling the file system and allocating the required memory,) using C64 OS KERNAL calls is not hard. Usually something is producing the file reference for you. For example, if you choose a file to open using the File Open Utility, the Utility produces the file reference structure to that file for you. In the case of the TextView Utility loading a file that is selected in the File Manager, the File Manager is maintaining the file reference. TextView Utility merely makes use of it without worrying about where (on what device, in which partition and directory) the file resides.

You first call fopen, which is a C64 OS KERNAL call, and pass a pointer to the file reference to open, along with the flags: FF_R | FF_S. These stand for File Flag Read, logically OR'd with File Flag Stat. The C64 OS KERNAL opens the file for read, but the stat flag also has it fetch the block size of the file, which it writes into a property of the file reference structure. Each file block represents 254 bytes of data, which is close but slightly less than a page of memory.

The fread KERNAL call takes two inline arguments, two 16-bit numbers that follow the JSR fread. Calls that take inline arguments manipulate the stack so that they return control to the code immediately following the arguments. The arguments are: Address in memory to read the data to, and 16-bit data length to read. You can read the block size off the file reference structure and write it into the high byte of fread's data length argument.

Next, with the block size from the file reference already in the register, you make the C64 OS KERNAL call pgalloc. This allocates a number of consecutive 256-byte memory pages, initializes them all to zero automatically (which clears them of crap that may have previously been in them), and returns the page address of the first page of the allocated block. You write this page address into the high byte of fread's buffer argument, and then the code continues on to call fread.

Fread then tries to read in 256 bytes times the number of blocks of the file size, into the allocated buffer. Fread stops reading automatically when it reaches the end of the file, even if it hasn't read in the full buffer size. The consequence is actually quite handy. Let's say the file reference indicates the file is 3 blocks big. You take that and allocate 3 pages. You get a buffer 3 * 256 bytes, or 768 bytes big, all zeroed out. And you ask fread to read up to 768 bytes. The file is in fact only a maximum of 3 * 254 bytes, or 762 bytes. So the read falls short of filling the buffer, leaving the pre-zeroed excess bytes to serve as the NULL terminator of the string. That's perfect.

It sounds complicated, because assembly language is very detail-oriented, but it only takes a few lines. It's really quite straightforward. Here's an example:
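The register conventions, constant names and file-reference field offsets here are illustrative assumptions, not the exact C64 OS API:

          ;open the file for read, and stat its size in blocks
          lda #<fref          ;pointer to the file reference
          ldx #>fref
          ldy #ff_r|ff_s      ;File Flag Read OR'd with File Flag Stat
          jsr fopen           ;writes the block size into the file ref

          lda fref+frsize     ;block size (frsize is an assumed offset)
          sta readlen+1       ;high byte of fread's length argument
                              ;(blocks * 256 >= blocks * 254)

          jsr pgalloc         ;block count in A; allocates that many
          sta readbuf+1       ;zeroed pages, returns the first page,
                              ;which becomes the buffer's high byte

          jsr fread           ;inline arguments follow the JSR
  readbuf .word $0000         ;arg 1: buffer address (page-aligned)
  readlen .word $0000         ;arg 2: maximum length to read

          ;execution resumes here; the text is in the buffer, NULL-
          ;terminated by the pre-zeroed excess bytes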


As soon as a TKText has a string assigned to it, the string is processed by wrap.lib which builds and caches a line table. TKText marks itself dirty, i.e. in need of a redraw. How it builds the line table depends on the TKText's string flags. One of the flags indicates whether the text should be soft wrapped or not. When you modify that bit, TKText flushes any existing line table and immediately has wrap.lib rebuild a new one. So, if your App knows that it wants the text to be wrapped, it would save time and memory to set the wrap bit first, and then set the pointer to the string.
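In code, that ordering looks something like this. The flag constant and the plain JSRs are simplifying assumptions; real Toolkit method calls dispatch through the object's class.

          lda #sf_wrap        ;assumed wrap flag constant
          jsr setstrf         ;set string flags first: wrap on

          ldx #<textbuf       ;pointer to the NULL-terminated text
          ldy #>textbuf
          jsr setstrp         ;assign the string; wrap.lib builds the
                              ;line table once, and the view is dirty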

TKText is a subclass of TKView. All TKViews have a width and a height; these derive from the view's anchoring properties and how it is nested in the view hierarchy. Typically a TKText will be assigned as the content view of a TKScroll. This anchors the TKText to top, bottom, left and right of the containing TKScroll and manages insets automatically to accommodate the presence of vertical and horizontal scrollbars.

Building an Unwrapped Line Table

The unwrapped line table is clearly the easier of the two to build. The library simply reads through the text until it finds a line ending, and pushes a pointer to the address following the line ending as the start of a new line. Pushes it onto what? A line table, which it automatically allocates and reallocates as the table grows. When it reaches the end of the string, it returns the page address of the start of the line table and the number of lines in the table (a 16-bit number), plus the length of the longest line in the table via a zero page workspace address.
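The heart of that scan might be sketched like this, with the helper routines (nextchar, pushline) and the table-growing logic assumed:

          ldy #0
  scan    lda (txtptr),y      ;read the next character
          beq done            ;NULL terminator: end of text
          cmp #$0d            ;a hard line ending?
          bne skip
          jsr nextchar        ;assumed: advance txtptr past the CR
          jsr pushline        ;assumed: push txtptr, the start of a
          jmp scan            ;new line, onto the line table
  skip    jsr nextchar        ;ordinary character: advance txtptr,
          jmp scan            ;tracking the longest line so far
  done    rts                 ;return table address, line count and
                              ;longest line, as described above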

TKText now knows how long the longest line is, so it can adjust its own content width property, and it knows how many lines are in the table, so it can adjust its own content height property. This causes its containing TKScroll's scrollbars to render correctly. It also copies and retains a pointer to the line table. The wrap library is done; it no longer retains any hold on the line table. When the TKText needs to build a new line table, either because it gets a new string, or because its wrap flag is flipped, or because its content is wrapped and its width changes, the first thing it must do is free the previous line table. Then the wrap library simply builds a new one, probably reusing the same memory that was just freed.

I tell you this just for curiosity's sake. As a developer, if you make use of TKText, you don't really have to worry about how it works.

Building a Soft Wrapped Line Table

The soft wrap algorithm is clearly a good deal more complicated. I scoured the web for an hour or two (a year or so ago) looking for a fast soft wrapping algorithm with a small memory footprint. I never found one. I found some that were voraciously memory consuming, and I found some that modify the original source text inserting line endings at key places, effectively converting unwrapped text into hard wrapped text. None of these was what I was looking for, so I sat down and wrote one myself.

It's a fair bit more nuanced than you would think. To begin with, you have to have a clear picture of exactly what its behavior should be under various circumstances. For example, what should it do if you put hordes of spaces between two ordinary words in a line? What should it do with hordes of spaces that come at the beginning of a line, i.e. just following a line ending? What should it do if hordes of spaces come at the end of a line, i.e. just before a line ending? What should it do if a single word exceeds the maximum line length? And so on. I experimented with these various cases using a standard soft wrapping text view in macOS, until I had a good understanding of the scope of behavior.

Then I implemented a routine using PHP. Why PHP, you ask? No reason, other than that I am an experienced PHP programmer and I just wanted to use something I was familiar with that I could run from a command line. After I had the routine working, I made it an executable script that could take arguments for a text file and a line width. After that, I rewrote the script, again in PHP, but in a highly non-standard way that mimicked the limitations of 6502 assembly to make porting to 6502 straightforward. Rather than using functions, it uses labels and goto's to simulate JMPs and branches.

Those two PHP routines can be found on my Github account, here and here, respectively.

Preview of the wordwrap routine in PHP on GitHub.

The way it works is unimaginative. It maintains 3 pointers (line index, word index and character index) and 2 counters (line length and word length.) All three pointers are initialized to the start of the text and the two counters are initialized to zero.

It advances through the entire text by incrementing the character index pointer, one character at a time. It reads through non-space characters until it hits the first space-character, then computes the size of the word as the difference between the character pointer and the word pointer. Then it checks to see if the current line length plus the word length exceed the maximum line length. If not, it adds the word length to the line length, and resets the word index to the current character index, and carries on.

If the addition of the word length would cause the line to exceed the maximum line length, it instead pushes the line pointer to the line table, and copies the word pointer to become the new line pointer, and resets the line length to zero. Effectively making the start of the word that was just read over become the start of the next line.
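The core decision can be sketched with 8-bit lengths (sufficient at 40 columns); the word-pointer bookkeeping and the pushline helper are assumed:

          clc
          lda linelen
          adc wordlen         ;line length if this word is added
          cmp maxline
          bcc store           ;strictly under the maximum: it fits
          beq store           ;exactly the maximum: still fits

          jsr pushline        ;assumed: push lineptr to the line table
          lda wordptr         ;the word that didn't fit becomes
          sta lineptr         ;the start of the next line
          lda wordptr+1
          sta lineptr+1
          lda #0              ;and the line length resets to zero
  store   sta linelen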

There are numerous gotchas that need to be handled, of course. A run of spaces is treated as though it is a word. A run of spaces comes to an end by encountering a non-space character, the start of the next word. But a regular word (a non-space-word) and a word of spaces (a space-word) are handled differently. If a space-word can be fitted onto this line then it gets added to the line just like any other word. However, if a space-word cannot fit it does not get bumped to the next line, but is allowed to extend infinitely long off the end of this line. The next non-space-word gets assigned as the start of the next line. This might sound odd at first, but, try it in a modern operating system. I've only tested this in macOS, but I'm 99.9% sure it will work the same in Windows, Linux etc. I should try it in a text editor or word processor on the Amiga to see how it behaves there.

The other big gotcha is what happens when a non-space-word exceeds the length of a line all on its own. First, the start of the big word gets bumped to the start of the next line, just as any word that can't fit on the previous line gets bumped down. But now, the line length is still exceeded. So the line index gets pushed to the line table and then advanced by precisely the size of the maximum line length, and the current line length is decreased by that length. If this is still longer than the maximum line length, this repeats until it's not. This effectively breaks a super long word at an arbitrary place. But before you think maybe this isn't standard behavior, this is in fact exactly what macOS does.

Notice where the long word starts and where it breaks.

Handling Line Endings

Even when the text is soft wrapped, line endings are still encountered. They are just encountered at the end of a paragraph, without regard for how long that paragraph is, rather than at the end of every so many words before they exceed the maximum line length. When a line ending is encountered, regardless of how long the current line is, the line index is pushed to the line table, the line length is reset to zero, and the start of the next word is copied to be the next line index.

Nothing is ever simple though.

Different types of text files have different line endings. As discussed above, Commodore computers and classic-era Apple computers use a single CR for a line ending. Unix and Linux, and similarly based OSes like modern macOS, use a single LF. Windows and MS-DOS and maybe some other OSes retain the original ASCII standard of CRLF, two characters in a row and always in that order.

Short of analyzing the data, you don't know ahead of time which type of line endings a file uses. If a file is produced in its entirety by a single process (like a text editor, or a log generator, etc.) the line endings will at least be consistent throughout. However, as someone pointed out on Twitter, if two files with different line endings were concatenated, you'd end up with a file with mixed line endings. I've decided simply not to handle this rare case. If it happens, well, something won't render quite right. Not the end of the world.

It's not as simple as just ignoring all LFs, and only attending to CRs. That will get you a new line for CR files and CRLF files, but it will completely miss all new lines in Unix text files that only have LFs and no CRs. But then, you don't want to create a new line for every CR and every LF, either. That would handle files with CRs correctly, and it would handle Unix files with LFs correctly, but you'd get double-spacing on those Windows files with both CR and LF.

I've opted for something in the middle that is still simple to implement. When parsing of a file begins, an ignore_LFs flag is cleared. Thus, initially, LFs are honored and result in a line index being added to the line table. However, if a CR is encountered, the ignore_LFs flag is set. When an LF is encountered and the ignore_LFs flag is set, no line index is pushed. Here's how this plays out. If the file contains only LFs, then a CR is never encountered, the ignore flag is never set, and all the LFs are honored. If the file contains only CRs, then every CR is honored, the ignore_LFs flag gets set, but it doesn't matter because no LFs ever get encountered. If the file is all CRLF, then the CR gets encountered first, the flag gets set, and all subsequent LFs get ignored including the one immediately following this CR.
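Sketched in the parsing loop, where newline, regular and skipchar are stand-in labels for the surrounding code:

          cmp #$0d            ;CR?
          bne notcr
          sec
          ror ignore_lfs      ;set bit 7: ignore LFs from here on
          jmp newline         ;and push the next line index
  notcr   cmp #$0a            ;LF?
          bne regular         ;no: just an ordinary character
          bit ignore_lfs      ;has a CR been seen before?
          bmi skipchar        ;yes: swallow the LF and carry on
          jmp newline         ;no: honor the LF as a line ending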

Even in the rare case where different line endings may be mixed in a file, if the file starts with LFs and then later switches to CR or CRLF, all of the initial LFs will be honored. The first CR will set the flag and subsequent LFs will be ignored. The only combination of mixed line endings that will cause a problem is if LFs follow CRs or CRLFs. The later solitary LFs will get ignored and those lines will string together. But they'll still soft wrap. Slightly more robust, I suppose, would be, every time you encounter an LF, check if the immediately preceding character is a CR. If it is, ignore the LF, otherwise honor it. But it's slightly slower and slightly more code, and I don't know if it'll ever matter.


Rendering the text via a line table

With the line table built, now let's imagine how we would render the text.

Let's suppose first that the text is soft wrapped. This guarantees that no line is longer than the visible width of the view. Thus, we can turn the horizontal scroll bar off to gain an extra line. We know how many lines there are, because it was returned from the wrap library and set as the TKText's content height.

As we scroll through the text, it's easy to find the first line to start drawing. The top scroll offset of the TKText view is used as an index into the line table. A typical way this works in 6502 assembly is to put a pointer to line table in zero page, put the index to the line into the Y index register, and use indirect indexed addressing to read the value from the table. The problem is that the indexing registers are only 8-bit, so they can only access 256 values. Each pointer is 2 bytes, so this would only be enough for 128 pointers.

Another common technique to access a 2-byte pointer with a single index is to split the pointers into two tables, one for high bytes, the other for low bytes. The same index can be used to pull one byte from each table. This technique is great, but only extends the available number of pointers from 128 to 256. Unfortunately, 256 lines of text is still not enough. 256 lines / 20 lines-per-screen is only around 12 screens of text. Assuming an average of maybe 35 characters per line, that's only 20 lines x 35 characters x 12 screens = 8400, or around 8KB of text. That's not much, considering that a C64 OS application might have 40 or more KB of free memory, we might need more than 5 times that number of pointers.
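For reference, the split-table idiom looks like this (table and index names are stand-ins):

          ldy linenum         ;8-bit line index (0-255)
          lda linelo,y        ;low byte from one table...
          sta lineptr
          lda linehi,y        ;...high byte from the other
          sta lineptr+1       ;lineptr -> start of line linenum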

The line table is instead made of 2-byte pointers, with each pointer's low and high bytes together as a pair. One page of memory is enough to reference 128 lines. As soon as the line table requires 129 lines, it auto-allocates 2 consecutive memory pages, and copies the content of the first page into the first of those 2 pages. The next line pointer ends up at the start of the second page. This process continues, adding pages as the line table grows.


A very brief aside about memory allocation

The use of memory grows in an interesting way:

First 1 page is allocated. But then 2 pages are allocated before the first is freed. If these two immediately follow the 1, then you end up with a 1 page gap.

Next, it tries to allocate 3 pages, but it can't use the 1 page gap, because it isn't consecutive with the other pages. So it gets 3 pages following the 2, then it frees the 2, which coalesce with the original 1 to make a 3 page gap.

Damn, this seems to be wasting lots of memory. Because, clearly, when it tries to allocate 4 pages, the 3 page gap can't be used either. The 4 are allocated following the 3 page gap.

But something special happens at this point. The 3 freed pages coalesce with the previous 3 page gap to form a 6 page gap. When it tries to allocate 5 pages, they fit into the 6 page gap and all the gaps disappear, compacting memory back down. Then the pattern begins again.

So, you do need a few pages to play with, while the line table is being built, but it doesn't continue indefinitely to leave an ever growing gap behind. At certain breakpoints it re-acquires the gaps leaving no fragmentation behind. And of course, if a gap remains after the table is finished being built, that gap can still be used by other processes for other purposes.

Memory allocation pattern for line table.

Having the line table pointers in a contiguous range of memory is useful because, although we can't access them all with a single index register, we can use simple arithmetic to set a pointer to the line table entry for any arbitrary line.

When rendering, we must acquire the pointer to the first visible line. Start by setting a line table pointer equal to the line index (i.e. the top scroll offset). Then multiply it by 2 (since line pointers are 2 bytes each), by left shifting the low byte first. Next left shift the high byte, and add the base page address of the line table to the high byte before even writing it back to the line table pointer.

Now you have a pointer into the line table, so you can fetch the line pointer by reading index 0 and index 1 from the line table pointer. And you're done!

Let's look at this one in code:
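(topline, linetable and ltabptr are stand-in names for the top scroll offset, the table's base address and the zero page pointer.)

          lda topline         ;low byte of first visible line number
          asl                 ;times 2: line pointers are 2 bytes
          sta ltabptr
          lda topline+1       ;high byte of first visible line number
          rol                 ;times 2, pulling in the carry; bit 7
                              ;was 0, so the carry is now clear
          adc #>linetable     ;add the table's base page address,
          sta ltabptr+1       ;no clc needed before the adc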

This is a very short amount of code. It multiplies and adds the line table offset all inline with simply copying the current line number to a zero page pointer. It's so tight! This is one of those cases where I absolutely love 6502 assembly. You don't even have to clear the carry before the addition, because while the wrap library and TKText can handle more than 256 rows, they cannot handle 32,768 rows. This guarantees that the high byte's high bit will be zero, which then gets shifted into the carry during the multiply. And the carry is thus clear and ready for the offset add.

Now, what you have in the ltabptr zero page pointer is not a pointer to the start of the line of text. Rather, it's a pointer to somewhere within the line table. To get the pointer to the line itself, we use indirect indexing, that's the Y index register through a zero page pointer, to copy the line pointer from the line table into another zero page pointer.
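In a sketch (lineptr being the destination zero page pointer):

          ldy #0
          lda (ltabptr),y     ;low byte of the line pointer
          sta lineptr
          iny
          lda (ltabptr),y     ;high byte of the line pointer
          sta lineptr+1       ;lineptr -> first visible line of text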

What's more, we only have to compute the ltabptr once, for the first line to draw. Now, recall that the Y register is only 8-bit, and because each pointer is 2 bytes, we can only read 128 line pointers through a given zero page pointer. However, this is not a problem, because the screen can only show a maximum of 25 lines, and we've adjusted the ltabptr to point not to the start of the whole table, but to the first line to draw. Thus, from that position within the table, we can use regular (fast) indirect indexed accesses to get the rest of the line pointers.


We need 3 zero page pointers:

  • A pointer into the line table, to the line pointer, to the first visible line,
  • A pointer to the current line being drawn, and,
  • A pointer to the start of the next line.

To draw all of the lines, we use two loops. The outer loop iterates over each line to draw, and the inner loop iterates over each character in that line. At the start of each line we set the draw context's local row and column. This does all sorts of magic that is way outside the scope of this post, but needless to say, it prepares pointers and clipping parameters so that successive calls to context draw (ctxdraw, a KERNAL call) will put the characters into the right spots in the screen buffer.

The inner loop iterates over the characters in the line. We've got one pointer to the current line, plus another pointer to the start of the next line. We simply need to read characters from the first line, and output them with ctxdraw, up until we reach the address of the start of the next line. Remember, the text is soft wrapped, so there may be no indication within the text data itself about where this "soft line" ends. However, there could be an indication within the text. We still have to watch for a real line ending. There are actually three ways this line could end:

  • It could end with a hard line ending.
  • It could be the last line of the content, and end at the NULL terminator.
  • On most lines, though, it will end by bumping up against the address of the start of the next line.

How do we actually read through a line?

Let's say we have two zero page pointers:

  • $2c/$2d points to the current line
  • $2e/$2f points to the start of the next line

Now suppose we use the Y index register to access characters offset from where $2c/$2d points. Y is 0, we read a byte and output it. Increment Y to 1. Now, how do we determine that the address in $2c/$2d PLUS Y is equal to the address in $2e/$2f? The answer is, it's actually really hard. Because you can't simply add Y to the low byte, and then have that possibly overflow, causing the high byte to increment, and end up with something useful to compare against the address in $2e/$2f. This just isn't something the 6502 can do.

I've said this before, this is the sort of thing that CPU designers and software framework designers have noticed and collaborated on in the past. If this is what the software needs to do to draw text into a UI, the algorithm could be made much more efficient if the CPU directly accommodated its needs. The trusty old 6502 only offers us what it offers us though, and we have to work with that. We have to find algorithms that work with what it can do.

As it happens, it is easier to increment the zero page pointer on every character output. The pointer that begins by pointing to the start of the current line marches forward through the line as the line is output. After each increment it is compared to the next line pointer, if it is equal to the next line pointer, this line is finished, and the rest of the visible line is padded out with blank space.

Let's take a quick look at how to increment a zero page pointer.
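(ptr stands in for whichever zero page pointer is being advanced.)

          inc ptr             ;increment the low byte first
          bne done            ;didn't wrap? skip the high byte
          inc ptr+1           ;wrapped to zero: increment high byte
  done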

We increment the low byte first, and if the result is not zero skip past incrementing the high byte. When the low increment results in zero, we know an overflow has taken place. Don't branch, fall through to increment the high byte too.

Incrementing is slightly easier and faster than decrementing, for the simple reason that an overflow while incrementing sets the zero flag, but an underflow while decrementing doesn't set any uniquely identifiable flags. It will set the negative flag, sure, but the negative flag gets set on fully half of the decrements, all those that result in any negative number (i.e. any number where bit 7 is set.)

Here's the fastest way I know of to decrement a 16-bit number.
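(Same stand-in ptr as above.)

          lda ptr             ;load the low byte (any register works)
          bne declo           ;not zero? the high byte is unaffected
          dec ptr+1           ;zero: borrow from the high byte first
  declo   dec ptr             ;then decrement the low byte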

Before decrementing anything, load the low byte. If it is currently zero, then a decrement will result in an underflow, therefore decrement the high byte and fall through to decrement the low byte. It only requires 2 bytes and 3 cycles more than incrementing. It has the unfortunate side effect of clobbering a register, although it doesn't have to be the accumulator. Loading the low byte into any register will have the same effect on the zero flag.


Supporting ASCII

I've written blog posts and reference texts before that discuss the difference between ASCII and PETSCII.

I could summarize the organization of PETSCII and contrast it with ASCII in my sleep. So here's a very brief summary. PETSCII is an 8-bit code designed for home computers to display on a video screen. The 256 codes are divided into 8 blocks of 32 characters each, numbered Block 1 to Block 8. The blocks are grouped into two mirrored sets: the lower blocks (1 to 4) and the upper blocks (5 to 8). The lower blocks contain:

  1. Non-rendering control sequences.
  2. Numbers and symbols.
  3. Lowercase letters and some symbols and punctuation.
  4. Undefined and not used.

Blocks 5, 6, 7 and 8 are the SHIFTED mirrors or equivalents of Blocks 1, 2, 3 and 4 respectively. For instance, the characters in Block 5 have the same bit pattern as those in Block 1, except with bit 7 (the highest bit) set.

  1. Non-rendering control sequences. (Often inversed function of Block 1 equivalents.)
  2. Graphical symbols.
  3. Uppercase letters and some symbols and graphic symbols.
  4. Undefined and not used.

Now let's contrast this with ASCII. ASCII is a 7-bit code originally designed to be printed to paper, but later adapted for use on video displays. The 128 codes are divided into 4 blocks of 32 characters each, numbered Block 1 to Block 4.

  1. Non-rendering control sequences. (Many for communications and data transmission.)
  2. Numbers and symbols.
  3. Uppercase letters and some symbols and punctuation.
  4. Lowercase letters and some symbols and punctuation.

What does this all mean? TKText supports both ASCII and PETSCII encoding. PETSCII doesn't use Block 4, but ASCII uses it for lowercase letters. ASCII doesn't use anything in the upper 4 blocks, but PETSCII uses Blocks 5, 6 and 7. A flag that can be set on the TKText informs it that it should translate from ASCII. It does this using a C64 OS KERNAL call asc2pet. This translates bytes found in Block 4 (ASCII lowercase) to Block 3 (PETSCII lowercase). And it translates bytes found in Block 3 (ASCII uppercase) to Block 7 (PETSCII uppercase). All other characters it leaves unmodified.
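The block arithmetic can be sketched like this, with the character in the accumulator; the real asc2pet KERNAL routine may differ in its details:

          cmp #$80
          bcs done            ;$80 and up: leave unmodified
          cmp #$60
          bcs lower           ;$60-$7f is Block 4, ASCII lowercase
          cmp #$40
          bcc done            ;below $40: Blocks 1 and 2, shared
          ora #%10000000      ;set bit 7: Block 3 -> Block 7 (upper)
          rts
  lower   and #%11011111      ;clear bit 5: Block 4 -> Block 3 (lower)
  done    rts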

This leaves Block 8 unused. We'll return to this, as Block 8 is used for MText formatting codes.

On a side note, auto-detecting if a text is ASCII is pretty straightforward. PETSCII uses no characters in Block 4, as that whole block is undefined. ASCII on the other hand puts its lowercase letters in Block 4. The vast majority of characters in standard text are lowercase letters. Therefore, if any character is found in Block 4, it is almost certainly ASCII. Alternatively, ASCII (true 7-bit ASCII) defines no characters in Blocks 5, 6, 7 or 8. Therefore if any character is found in one of those, the file is probably PETSCII.
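A detection pass over the bytes can therefore be sketched as follows (nextchar, and how the result is consumed, are assumed):

          ldy #0
  detect  lda (txtptr),y      ;the next byte of the text
          beq unknown         ;NULL: no deciding byte was found
          cmp #$80
          bcs petscii         ;Blocks 5-8: beyond 7-bit ASCII
          cmp #$60
          bcs ascii           ;Block 4: almost certainly ASCII
          jsr nextchar        ;assumed: advance txtptr
          jmp detect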

Now, here's a philosophical question: If the file only consists of characters found in Blocks 2 and 3, plus a few standard characters in Block 1 (such as CR and LF), is it an ASCII file or a PETSCII file? The answer is, it's both. It's perfectly valid in both. It's just a matter of intention, "DID THE AUTHOR INTEND TO WRITE IN ALL CAPS? IF YES, THEN IT'S ASCII." Or "did the author intend to write in all lowercase? if yes, then it's petscii." There is no way to tell.


Horizontal Scrolling

When text is soft wrapped, every line is guaranteed to be no longer than the visible width of the view. That is after all the point of soft wrapping.2 However, when unwrapped, the lines extend out to wherever the line ending character(s) appear.

When we scroll downwards, we needed some quick way to get a pointer to the first line to be drawn. With horizontal scrolling we have a new rendering concern. As we scroll rightwards, we need a quick way to move the line pointer through the content of the line up to the point where it will start outputting.

Well, that's easy! This is just plain text. All we need to do is add the scroll offset to the line pointer, and start outputting from there, right?

The First Complications

This works, in the simplest case, but quickly some problems are uncovered. The first complication is that with unwrapped text, some lines may be significantly longer than others. Let's say the view's bounds are 35 characters wide, and its longest line is 100 characters wide. Let's also say that one of its lines is only 10 characters wide. The content width must be reported as 100 columns, so that the scroll bars permit scrolling far enough to accommodate the longest line. So far, so good.

However, if we scroll such that the first visible column is 50, we will take the line pointer to the start of the short line and then add 50 to that. But the line is only 10 long to start with, so the line pointer automatically gets blown past the start address of the next line. Now, it is no longer acceptable to:

  1. Output a character
  2. Increment the pointer, and
  3. Check for pointer equality.

The current line pointer has to be checked for greater than or equal to the next line pointer before each character is output.
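A greater-than-or-equal test on two 16-bit pointers is a standard 6502 idiom: compare the low bytes, subtract the high bytes with the resulting borrow, and test the carry (lineptr and nextline are stand-in names).

          lda lineptr         ;current position within the line
          cmp nextline        ;compare the low bytes
          lda lineptr+1
          sbc nextline+1      ;subtract high bytes, with borrow
          bcs linedone        ;carry set: lineptr >= nextline, so
                              ;nothing more to draw on this line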

The next problem I ran into was a true harbinger of what was to come. When scrolling rightward, something funny was happening. When not scrolled at all, everything lined up. But when scrolling one column to the right, the first line would scroll, but the other lines didn't scroll, leaving the lines vertically unaligned. Scrolling one more column would cause all the lines to scroll but the vertical misalignment would remain.

What's going on here? Upon investigation I realized that files with CRLF line endings would see the CR and identify that as the end of the line. The next line pointer would then point at the LF. The line endings are used when wrapping and generating the line table, but when rendering the text, the line table is used to determine what should fall on each line. What to do then when a line contains an LF? CRs and LFs fall into block 1 (non-rendering control characters) in both PETSCII and ASCII. Therefore, when one of those is encountered, I simply ignore it, don't render anything, increment the pointer and move on to the next.

Have you spotted the problem? When scrolled to column 0, we read the LF from the start of the line, and skip it. Thus the character at index 1 of the line gets drawn to column 0. But, when we scroll the view to start rendering from character index 1, we add 1 to the start of line pointer, and start drawing from there. But this just skips the LF that was already being skipped, so those lines that start with an LF don't actually get scrolled. Only the first line—the first line of the file doesn't start with an LF—gets scrolled, the others merely skip their leading LF.

That may sound confusing, but the upshot is this: If any characters are non-rendering, it's not enough to just apply a scroll offset to the start of line pointer to figure out where to start drawing from.

For LFs in particular, I worked around this problem by allowing the LF to hang off the end of the preceding line. The next line pointer, rather than pointing at the LF, points to the character right after the LF. That solved this particular problem, but it doesn't solve the larger problem: the existence of non-rendering characters in the data means you cannot just apply an offset, because you have no idea how much of that offset is "rendering" and how much of it is "non-rendering." And what you really want is to offset the pointer by X number of "rendering" bytes.


Beyond Plain Text: MText encoding

We saw that Block 1 is shared by PETSCII and ASCII for control codes. Some of the meanings of the control codes correspond, but most don't. Block 5 is used by PETSCII for additional control codes, but it is unused by ASCII (it's above ASCII's range.) Block 4 is unused by PETSCII (its values are simply undefined by PETSCII), but it is used by ASCII. Therefore we don't want to use those bytes to mean something else in MText, or there would be contention of meaning, and you wouldn't be able to mix MText codes with ASCII.

Block 8, on the other hand, is undefined by PETSCII and unused by ASCII too. Block 8 is thus used for MText marker characters. Here's what I've defined so far, but this might change:

Code  Meaning        Code  Meaning
$E0   Black          $F0   Normal Text
$E1   White          $F1   Strong Text
$E2   Red            $F2   Emphatic Text
$E3   Cyan           $F3   Link Text
$E4   Purple         $F4   Left Justification
$E5   Green          $F5   Right Justification
$E6   Blue           $F6   Center Justification
$E7   Yellow         $F7   Full Justification
$E8   Orange         $F8   Horizontal Rule
$E9   Brown          $F9   reserved
$EA   Light Red      $FA   reserved
$EB   Dark Grey      $FB   reserved
$EC   Medium Grey    $FC   reserved
$ED   Light Green    $FD   reserved
$EE   Light Blue     $FE   reserved
$EF   Light Grey     $FF   reserved


Absolute Color Styles

First, the colors. Why, if PETSCII already defines colors, are those colors defined here again? Because I can't make heads or tails (nor find rhyme or reason) of why the PETSCII color bytes are assigned as they are. They seem to be just randomly assigned bytes in PETSCII. The problem with that is that you need a lookup table to map them. The colors in MText are designed such that you just AND #%00001111 and then send the result directly to the draw context, because the result is the VIC-II's value for that color.
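In C terms, the whole mapping is a single mask. This is only a sketch with a hypothetical name, but the arithmetic is exactly as described:

  /* MText color markers $E0-$EF are laid out so that the low nybble IS
     the VIC-II color value. E.g. $E5 (Green) & $0F = $05, VIC-II green. */
  unsigned char mtext_color_to_vic(unsigned char marker)
  {
      return marker & 0x0F; /* AND #%00001111 */
  }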

Semantic Color Styles

Next there are 3 semantic styles. While the first 16 are hardcoded color values, the semantic styles indicate a meaning that is user definable. The C64's text mode is very limited in what it is capable of displaying. See my first post about MText for a description of what sorts of things it can reasonably represent. Each byte marks a change in style. There is no "end marker", because they are not nested. When an Emphatic Text marker is encountered, the output goes into an emphatic text style. That style persists until another style marker is found, up to the end of the paragraph.

The Themes Utility allows you to customize the color of Normal Text (display or body text, used principally for reading long paragraphs comfortably,) Strong Text (bold, used for short segments that draw your attention,) Emphatic Text (used for emphasis, an alternative to bold, usually not as attention grabbing, but subtly different from Normal Text.) Thus, in the text, if an Emphatic Text marker is found, the defined color for Emphatic Text is looked up from the Theme table and that new color is set in the draw context. To end Emphatic Text before the end of the paragraph, a Normal Text marker must appear. The Normal Text color is then looked up in the Theme table and applied to the draw context.

They're not nested because there is no point; it isn't possible to employ Emphatic and Strong text at the same time. Why use Semantic Styles at all? Because if you change to a dark theme, with a dark or black background, you get to define the three colors that work best on that dark background; on a light or white background you can define the three colors that work best on that background. In that case, why offer hardcoded colors? Because although they could produce a conflict with the background, they also open the possibility of creative and artistic styling that may be highly desirable, such as employing color gradients across rows or columns. The two most common backgrounds will be white and black; therefore, it would be wise to avoid the white and black hardcoded colors.

Justification

There can be only one justification for an entire paragraph; thus the justification byte must be the first byte of a new paragraph. Justification bytes found elsewhere within the paragraph are invalid and are ignored.

Full justification is not implemented yet.

Horizontal Rule

The Horizontal Rule is special. Typically it should be found on a line of its own. That is, a CR should precede and follow it. Or even two CRs could precede and follow it, to give it some space. The Horizontal Rule is always drawn all the way across the visible area, but pulled in one character from either end. No matter how wide or narrow the view, no matter how you scroll horizontally, the Horizontal Rule flexes to nicely fill the space. But what character should be repeated to draw the divider? This is defined by the single character that immediately follows the Horizontal Rule marker. So the sequence $F8 $2A will draw a flexible divider made of asterisks. And the sequence $EE $F8 $2D will draw a Light Blue flexible divider made of dashes. And so on.

Links and Link Values

There is a pair of ASCII bytes, partially shared by PETSCII, in Block 1: $02 and $03, which mean Start of Text (STX) and End of Text (ETX). End of Text is defined as STOP by PETSCII. These two codes are used to wrap an arbitrarily long string value. Value for what? Whatever they immediately follow. At the moment these values are only used for links, but they can be used for other values in the future.

A link, thus, is defined by the Link Text marker, followed by characters that are to be displayed as a clickable link. Then an STX character starts the non-rendering value of the link, up to the ETX character. Following the ETX character should be another style character, such as Normal Text, to return the draw context to a non-link style.
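As a hypothetical example (the link text and URL here are made up for illustration), the raw bytes of a link might look like this:

  /* A link reading "C64 OS" whose non-rendering value is a URL. */
  unsigned char link[] = {
      0xF3,                            /* Link Text marker            */
      'C','6','4',' ','O','S',         /* rendered, clickable text    */
      0x02,                            /* STX: value string begins    */
      'h','t','t','p',':','/','/',
      'c','6','4','o','s','.','c','o','m','/',
      0x03,                            /* ETX: value string ends      */
      0xF0                             /* Normal Text: end link style */
  };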

At the moment, although this might change, links are displayed in the same style as a Toolkit Button. In the Daylight theme this means light red, reversed text. Since we're used to seeing buttons in User Interfaces, links appear as buttons sitting in the middle of a block of text.

When you click a link, a pointer is set at the index of the clicked text. The pointer is then advanced forward until it encounters the first STX character. The TKText class then calls a "link clicked" callback, with a pointer to the start of the link's value string (the byte immediately following the STX). A link could be parsed and handled manually, of course, but more likely the pointer to the link value will be forwarded to something like a URL library.

Functions in the URL library will handle strings that are terminated normally, with a NULL (0) byte, or with the ETX character. This way, link values can be embedded within larger strings, and the text doesn't have to be copied anywhere.
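A minimal sketch of that dual-terminator convention (a hypothetical helper, not the URL library's actual API):

  #include <stddef.h>

  /* Length of a link value that ends at either NULL (0) or ETX ($03),
     so it can be measured in place inside the larger MText buffer. */
  size_t link_value_length(const unsigned char *value)
  {
      size_t n = 0;
      while (value[n] != 0x00 && value[n] != 0x03)
          n++;
      return n;
  }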


MText wrapping and rendering

As I said, that floating LF at the start of a line was just a harbinger of what was to come. Although plain text is more or less one-to-one (nearly all characters are rendered characters), there are some exceptions, "Tab" being the main example. I haven't yet implemented support for a tab character advancing output to the next tab stop.

With the introduction of MText though, all that simple one-to-one stuff goes out the window. There are now plenty of non-rendering characters strewn throughout the rendered characters.

MText is a strict superset of plain PETSCII or ASCII text. Thus, a plain text file is in fact an MText file, but one that just happens to not have any markers in it. Converting to plain text, or rendering as though it were plain text, is a matter of two things:

  1. Ignore all bytes in block 8, and
  2. Ignore all bytes from an STX to an ETX, inclusive
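Here is a sketch of those two rules in C (a hypothetical converter, not code from C64 OS; note that the fill character following a Horizontal Rule marker would slip through these two rules as a single stray character):

  /* Stream MText out as plain text, applying the two rules above. */
  void mtext_to_plain(const unsigned char *src, void (*emit)(unsigned char))
  {
      while (*src) {
          if (*src >= 0xE0) {           /* Rule 1: Block 8 marker */
              src++;
          } else if (*src == 0x02) {    /* Rule 2: STX..ETX, inclusive */
              while (*src && *src != 0x03)
                  src++;
              if (*src)
                  src++;                /* step past the ETX itself */
          } else {
              emit(*src++);             /* everything else renders */
          }
      }
  }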

The wrap library (wrap.lib) handles MText implicitly. It's exactly the same code for handling plain text and MText, and the additional work is very lightweight. When wrapping text to a width, it counts the rendering characters in the line. Any character greater than or equal to $E0 (i.e., Block 8) doesn't increment the characters-per-line count. If an STX is encountered, it goes into a short subloop that skips all characters until the first ETX is found. And that's very, very nearly all it needs to do! Super lightweight. It then generates the line table of pointers to starts of lines in precisely the same way, but the number of absolute bytes in a line may be more than the wrap width, because some of them don't render.
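Sketched in C, and simplified (no word-break logic, hypothetical names), the extra work amounts to this:

  /* Count rendering characters until the wrap width or a hard line
     ending; the returned pointer becomes the next line table entry. */
  const unsigned char *wrap_one_line(const unsigned char *p,
                                     unsigned char wrap_width)
  {
      unsigned char cols = 0;
      while (*p && cols < wrap_width) {
          if (*p == 0x0D)               /* CR: hard line ending */
              return p + 1;
          if (*p >= 0xE0) {             /* Block 8: doesn't count */
              p++;
          } else if (*p == 0x02) {      /* STX: subloop to the ETX */
              while (*p && *p != 0x03)
                  p++;
              if (*p)
                  p++;
          } else {
              cols++;                   /* a rendering character */
              p++;
          }
      }
      return p;
  }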

The tricky part is rendering. If you're looping over the characters in a line to output and you encounter a marker, it gets immediately processed to, say, change a style. But what happens if you are scrolled halfway through a paragraph, and the style byte that changes a color, or begins link text, falls on a line that comes BEFORE the first line you're rendering? The text should be Emphatic Text, but you don't know it, because you're past the Emphatic Text byte already. The same goes for a paragraph justification byte. If the first line of the paragraph is off the top of the viewport, you'll miss that byte and assume it should be normal left-aligned text.

Well… this is a real problem. How far back might you have to scan to find the justification byte? How far back might you have to scan to find a style byte? What if the text is just plain text and contains no style bytes at all? What if you're scrolled to near the end of the text? You'd have to scan backwards through the text all the way to the beginning. This would be horrifically inefficient. One idea I toyed with was having a table that indicates the style changes. But where would this table go? We'd have to allocate and build more structure. Should it be part of the line table? Must every line indicate the style and justification defined for it? What a pain in the ass, and it could result in a lot of bloat. If the view is narrow, resulting in lots of entries in the line table, every line table entry could end up longer than the content of the line itself!

Instead, I settled on a compromise.

All styles reset between paragraphs. This actually isn't all that different from HTML. In HTML tags can't be improperly nested.

For example, this is valid:

  <p>Some text, <strong>ending strong.</strong></p>
  <p><strong>Still strong,</strong> continuing into the next paragraph.</p>

Whereas this is not valid:

  <p>Some text, <strong>ending strong.</p>
  <p>Still strong,</strong> continuing into the next paragraph.</p>

The strong tag cannot be opened within one paragraph and then closed later inside another paragraph. If you want the content at the end of one paragraph to be strong AND you want the content at the start of the next paragraph to be strong too, you have to close the strong and redeclare it at the start of the next paragraph. Same with Links, you can't let an anchor tag flow across paragraphs. Same with the (albeit old and deprecated) center tag.

In MText, paragraphs are marked only by the line endings, just like in a book. A paragraph consists of a long string of text that gets soft wrapped onto multiple lines, and ends when a line ending is reached. Justification is a paragraph-level property only, and justification gets reset to left at the start of every new paragraph. If you want several paragraphs in a row that are all centered, another center marker must appear as the first character in each of those paragraphs.[3] The style is also automatically reverted to Normal Text at the start of every new paragraph.

It's hard for MText to be malformed, because markers don't typically require closing markers. The only exception to this is with Links and STX/ETX pairs. If you open an STX block, the wrap library will suck EVERYTHING into the value string (including line endings and other MText markers) until it finds the ETX. So, don't forget to close an STX/ETX block! And also, strange things would happen if you clicked a link that had no STX/ETX block following it. It would actually scan forward until it finds the first STX, regardless of how far away it is.

The upshot of this paragraph-level resetting of styles and justification is that when rendering, and only when rendering the first visible line, the code scans backwards only to the start of that line's paragraph to find any justification or style markers, which it then applies. If there were a very, very long paragraph, yeah, okay, that might still lead to some slowdown as it scans backwards through it. But in practice, most paragraphs are negligibly short, even for a 1MHz CPU.
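The backward scan itself is trivial, because a paragraph boundary is just a line ending. A sketch, with hypothetical names:

  /* Walk back from the first visible line to the start of its
     paragraph; styles reset there, so nothing earlier matters. */
  const unsigned char *paragraph_start(const unsigned char *first_visible,
                                       const unsigned char *text_start)
  {
      const unsigned char *p = first_visible;
      while (p > text_start && p[-1] != 0x0D) /* stop just past the previous CR */
          p--;
      return p;
  }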

Horizontal Scrolling MText

Now we know that from the first line rendered we have to scan backwards to the start of the paragraph. But there is also the horizontal scrolling issue. Just like that LF harbinger of complication, there are now numerous non-rendering characters within a line.

There is no magic solution here. First the start of line pointer is taken from the line table, but if the view is horizontally scrolled, the line must be scanned from the start one character at a time. It counts down the horizontal offset for every rendering character. These are, essentially, characters in Blocks 2, 3, 4, 6 and 7. And just like in the wrap process, if an STX is encountered it goes into a tight loop skipping characters until it finds the ETX. Along the way it tracks changes to the style, but without applying them to the draw context yet.

Thus it counts down rendering characters that fit into the columns scrolled off the left. This could result in the line pointer being greater than or equal to the next line pointer, as mentioned earlier. If, finally, the line pointer is still less than the next line pointer, it applies to the draw context any style character found in the first part of the line, and then proceeds to draw out the rest of the line like normal.
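Here's a sketch of that skip loop in C (simplified to remember only the last marker seen; names are hypothetical):

  /* Burn off scroll_x rendering characters from the start of the line,
     remembering style markers without applying them yet. */
  const unsigned char *skip_columns(const unsigned char *p,
                                    const unsigned char *next_line,
                                    unsigned char scroll_x,
                                    unsigned char *last_style)
  {
      while (scroll_x && p < next_line) {
          if (*p >= 0xE0 && *p <= 0xF3) {   /* color or text-style marker */
              *last_style = *p++;           /* remember; apply later */
          } else if (*p >= 0xE0) {          /* other Block 8 markers */
              p++;
          } else if (*p == 0x02) {          /* STX: tight loop to the ETX */
              while (p < next_line && *p != 0x03)
                  p++;
              if (p < next_line)
                  p++;
          } else {
              p++;                          /* a rendering character */
              scroll_x--;
          }
      }
      return p; /* may be >= next_line; the caller checks before drawing */
  }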


Final Thoughts

Holy Mackerolly! That sounds really complicated!

You know what? It is complicated. But you know what's even more complicated? Parsing HTML into a tree, and then trying to render that tree, whilst throwing away 90% of all the crap it wants us to do that we have no way of doing anyway.

With MText, I'm trying to apply some mild structure to plain text that informs us only about what we are capable of rendering on the screen, delivered in a format that is as tight and light on memory as humanly possible. I think this is going to go quite a distance towards that goal.

Hit testing, Selections and Links

This post is already 14,000 words long, so I had to tie it off here. There is more complexity involved in hit testing, which I haven't implemented yet. Hit testing is the code that figures out where in the source text you have clicked or dragged with the mouse. The hit testing code underlies two important features: Text Selections and Link Clicking.

I'm not going to get into the details here, but if you click the text, the hit testing needs to figure out whether you've clicked on a link. And if you press the mouse button down, the hit testing needs to figure out where your selection starts. And as you drag, with the button down, the hit testing needs to figure out where the mouse is, to set the selection end. Selection is a form of style: reverse text with a Theme-based selection color. The selection style CAN cross over paragraphs, and has to override any other inline style markers.

Hit testing is made somewhat more complicated by different justifications.

The Future and What's Next

Everything discussed earlier in this post is already implemented, and will be put to use, for example, in the Help Documentation and the Help Utility that will be part of the Version 1 release.

There are plans for the future that I'll briefly mention, but these will not make it into Version 1 of C64 OS. (Lest the release of Version 1 just be pushed off forever, which I do NOT want to do.)

Ultimately, I foresee MText as the foundation of web browsing in C64 OS. What it requires is an HTML to MText conversion webservice to act as a proxy. And since this proxy will be running on something like Linux in the cloud, it can be as sophisticated as a modern webservice can be. All the nasty complication of HTML can be dealt with in the cloud, and delivered as a much simplified, streamlined, low-memory-use MText document. Inline links are supported, although clicking on a link would usually direct the browser back to another proxy service.

The proxy could even divide a long HTML document into a series of pages. At the end of one MText document could appear a list of Page Links. Click one of those, and you go back to the proxy to fetch a particular subsection of the long website.

Downloads of C64-specific content, such as .D64s or .SID files, image file formats like Koala or Art Studio, or other resources directly consumable by the C64, need not go through a proxy but can be downloaded directly.

Inline images in non-native graphics formats (GIF, JPG, PNG, WebP) can be routed to the image proxy service for conversion to a C64-friendly format.

Smart Objects

Links, in theory, could be used to handle images. But, I have another idea that I think is highly doable, and would work well with how the Toolkit works. The idea is to support embeddable smart objects within MText. Sounds crazy?! It's not that crazy. All we need is an MText marker to signify "Inline Object", and for the single byte that follows it to indicate how many rows of content should be dedicated to the object. The row count permits the line table to index the lines that precede and follow it. So if you grab the scroll bar and scroll way past an inline object, the computations of where you are in the MText are still fast and easy.

Following the Inline Object marker and its row count could then come another STX/ETX pair for a custom value. This value gets parsed out, and specifies a C64 OS Toolkit class to instantiate, and perhaps some data to feed to its init method. These classes would all descend from TKView, which allows them to be appended to the TKText object, with their top offset set to the row where the inline object is supposed to appear. The Toolkit is already very efficient at determining whether a view's children are within bounds, and forwarding drawing and hit testing to them. To increase efficiency, an Inline Object could be destroyed when it is scrolled out of view, and instantiated and appended to the TKText only when it is scrolled into view.
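To make the idea concrete, a purely speculative byte layout might look like this. (The Inline Object marker is not assigned; $F9, one of the reserved codes, and the class name below are placeholders, not real C64 OS definitions.)

  unsigned char smart_obj[] = {
      0xF9,                          /* placeholder Inline Object marker    */
      0x05,                          /* rows of content the object occupies */
      0x02,                          /* STX: class name (and init data)     */
      's','i','d','p','l','a','y',   /* hypothetical Toolkit class to load  */
      0x03                           /* ETX                                 */
  };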

This could SERIOUSLY open the door to a lot of very cool shit! Like, does a webpage link to a SID file? How about a SID playing class that allows us to view some metadata about the SID file, and click a button to download it into memory and start playing it, right from there! Or, images don't have to be presented as dumb links, but could have a smart object that shows a cool little icon, gives the dimensions and format of the original file, and offers buttons for how you want to fetch it: Scale it down? View a subsection of it unscaled? Dither it? Then click to download the conversion straight to memory to be viewed in the C64 OS splitscreen graphics mode, OR, click another button to save it directly to disk. The sky's the limit.

MText itself is pretty cool. It offers some nice presentation of textual content in a very lightweight fashion, fast and well suited to the C64. But with the addition of smart inline objects, I think we could have a serious winner on our hands.

  1. In some primitive text editors only that one line gets scrolled, but this is less ideal as it horizontally misaligns consecutive lines.
  2. It's slightly more complicated than this. When a TKText view is appended to a parent TKScroll view, its bounds are automatically managed. But its content width and height are not the same as its bounds. When soft wrapped, the bounds width is passed as the wrap width. But this is not strictly necessary. An alternative wrap width could be used, which would make the soft wrapped content width still require horizontal scrolling.
  3. If you have eagle eyes (and you're still reading this far in), you may have noticed that this makes the explicit Left Justification marker redundant. I'm aware of this, and that's why the code values are not yet set in stone.

Greg Naçu — C64OS.com