April 22, 2019 (#80, Software)

Webservices: HTML and Image Conversion


Near the end of 2016 I had some wide-eyed ideas about using the C64 Gfx library as part of an online service for converting images. I have since developed that idea into an actual, functional webservice that has been online for many months. But I wanted to take a few minutes to talk about how it works and how it fits into a wider suite of services.

The place to check out is http://services.c64os.com/about. This is the page that officially documents the services that are available. It currently documents an image conversion service and a relocated SID catalog. I'll write a separate post sometime in the future about the SID catalog. For now, let's discuss some of the other services.


The Internet

I want my C64 to be able to reach and consume as much of the internet as possible. The concept of the internet, a global interconnected network of heterogeneous platforms, sharing common protocols and exchanging common file formats, is brilliant. But one big problem is that, beginning many years ago, the resources on the internet began to scale up with the ever increasing power of personal computers, and consequently most of those resources have become indigestible by older computers with more limited resources. This is not only a C64 problem; it's a problem for Amigas, and vintage Macs, and others.

The situation is not entirely unlike CD-ROMs. Many years ago, when CD-ROM drives became popular, it was not long before the C64 could read the content of CDs too. The IDE64 had support for CD-ROM drives, the CMD HD could be connected to SCSI CD-ROM drives, and even today the 1541 Ultimate can be used for accessing CD-ROM drives connected via USB. The problem is that these discs and their relatively high capacity were immediately taken advantage of to store large media files. Large (for the time) images, rich with thousands of colors and compressed to save space. Video clips, audio clips, PDFs, HTML files and of course Windows- and Macintosh-only executable code. Our C64s could read the CDs and traverse the file system, but there was little we could digest from those discs. Besides the odd small GIF, and a bit later the odd small JPEG, there were a few text files that became rarer and rarer and contained less and less useful information, like copyright and license files. Yay for those.

 

The internet is similar, though, as we'll see, different in important and convenient ways. We've had ethernet adapters for years, and we had dialup connections to the internet over PPP and shell accounts long before that. Now we have an abundance of WiFi modems. But the CD-ROM problem described above has gotten ever more extreme. The resources to be found on the internet are huge. Take a site with a relatively simple concept, macrumors.com. It's just a front page, listing a few of the most recent articles. The HTML content alone is 202 kilobytes. Its tiny logo is an additional 21 kilobytes, in a PNG format that the C64 (I believe) has never had the ability to decode. Even these sizes are out of reach, but it gets way worse. Each article usually comes with an image, and these images range from 100K to 500K… each.

There is nothing particularly egregious about a site like macrumors.com; it's just standard, middle-of-the-road practice for 2019. The page weight though—with only 12 summaries—is over 10 megabytes. Now, granted, a lot of that is in scripts that download ads, and we would not have to worry about any of that. But take a 300K JPEG. Do we want to see the image represented by that particular JPEG? Maybe it's critical to understanding the context of the article. Or maybe it's not, maybe it's useless. But how would we even find out, except to download it and see for ourselves?

At 9600 bps, which the UserPort WiFi modems can do (if we push it and the right drivers are used), it would take around 5 minutes to download just one of those 300K images. That size dramatically outstrips available internal memory, though. So it would have to be stored somewhere. If we have to stream the download to disk, it will take much longer than 5 minutes. Alternatively, it could be stashed in an REU. But what to do with it once it's downloaded? As I discussed in Thoughts About Graphics, decoding a JPEG that is just 19K on disk takes over 5 minutes on a C64 with JuddPEG. It's safe to say, even if you did wait around to download a 300K JPEG, decoding it is effectively impossible. Considering that an image of that size would also have to be scaled down to 320x200, such an image is just inaccessible. Even if it were a small image, and we waited minutes to download, and then minutes to decode, just to see… a blurry picture of an iPad that doesn't add anything to the text of the story. Even if possible, it's clearly not practical or tolerable, even for a die-hard retro computer nerd. It's just too much.
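
Just to ground those numbers, here is a quick back-of-the-envelope sketch in C. At 9600 bps we can move at best about 1,200 bytes per second before any protocol overhead, so real-world times are somewhat worse than these estimates.

#include <stdio.h>

/* Rough download-time estimates at 9600 bps.
   9600 bits/sec / 8 = 1200 bytes/sec, ignoring protocol overhead. */
int main(void) {
    const long bytes_per_sec = 9600 / 8;  /* 1200 */
    const long sizes[]  = { 10L * 1024,            /* ~10K C64-native image */
                            300L * 1024,           /* ~300K web JPEG        */
                            10L * 1024 * 1024 };   /* ~10MB page weight     */
    const char *labels[] = { "10K image", "300K JPEG", "10MB page" };

    for (int i = 0; i < 3; i++) {
        long secs = sizes[i] / bytes_per_sec;
        printf("%-10s ~%ld seconds (~%.1f minutes)\n", labels[i], secs, secs / 60.0);
    }
    return 0;
}

That works out to roughly 8.5 seconds for a 10K image, about 4.3 minutes for a 300K JPEG, and nearly 2.5 hours for a 10 megabyte page.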

But, this seems a shame because we have these TCP/IP-handling WiFi modems.

A useful download speed calculator: https://downloadtimecalculator.com.

Here's how I think about it. If the internet consisted of a global network of C64s sharing data with each other, we would not find ourselves in this situation.

If there were only C64s out there, and everyone was using them for their day-to-day activities, the source content would be radically scaled down into small digestible chunks. Even if serious news publishers were out there doing their thing, writing op-eds about President Trump, and providing us photos of his orange face, those images would be in a format best suited to our C64s. Just as people today don't put 50 megabyte RAW photos on mobile websites (yet), a world dominated by C64s would be supported by an industry that produces articles in lightly formatted text, divided into manageably small portions, accompanied by images no larger than 320x200 and no bigger than 9 to 10K, or perhaps 20K for some interlaced or wide format specialties. At 9600 bps, such an image would take just 8 seconds to download. Not exceedingly fast, but well within the bounds of tolerability.

A multi-color conversion of orange faced Trump.

Further, the render time would take 0 seconds. You wouldn't need an REU—although you could still benefit by having one—and caching the images to local storage would also be possible. It would only take another second or two to write the file out, which could be done while you're viewing the image. For that matter, you could watch as an image is loaded in, like the old progressive JPEGs. You could cancel the load after only half the image comes down if you didn't like what you were seeing. It would be helpful to pre-fill video memory with a default pattern and colors so you wouldn't see random data in the parts not yet loaded.

HTML vs text

I've been thinking a lot about HTML. And the more I think about it, the more I think it is entirely unsuitable for a C64. It is supposed to be an easy way to semantically mark up text. In the earliest days of the web, the idea of semantic markup seemed to go over most people's heads. But we've come a long way, and good web developers these days understand the meaning and the value of separating presentation from semantics. Coupled with the good practice of putting styles in external stylesheets, this is actually a big boon for users of older computers.

Let's take a common type of webpage, such as an article of news or a blog post such as you're reading now. What matters above all else is the textual content. If you had no styles, no images, and no other content, save for the plain text nicely formatted to fit the width of your screen, with wordwrap for maximum readability, you'd get to somewhere between 90% and 99% of the goal. This is after all why people pay 20+ dollars for a novel, which is nothing but hundreds of pages of plain, uniformly-sized, text. Because they care about the words and what they mean.

Perhaps what matters second most, and what makes the world wide web a web and not just a bunch of independently hosted text files, are the links from certain words in one document to the web address of some other document. Links are what elevate a document from mere text to hypertext. It so happens that HTML's markup language includes a special tag for implementing links, but it also includes a lot of other tags that are primarily for semantics.

HTML Tags | Purpose
html, head, body | define document structure
abbr, blockquote, caption, cite, code, label, pre, span, h1, h2, h3, h4, h5, h6, sub, sup | describe segments of text
em, strong, i, u, b, small, del, ins | describe the emphasis and accentuation of text
table, thead, tbody, th, tr, td | define the structure of tabular data
ul, ol, li, dd, dl, dt | define lists and their items
div, legend, article, aside, header, footer, p, nav, section, title | define typical article structure

This is not an exhaustive list of HTML tags, and my grouping and description of them above is loose and not meant to be authoritative but to help demonstrate a concept.

Here's the question. What is the point of all of those semantic markers? There are at least three very useful properties to a text being marked up semantically:


• The first is that search engines and crawlers can use the semantics to be able to categorize the content, prioritizing some parts and deprioritizing others. Statistical analysis on the kinds of semantics found in a document allows the crawlers to categorize webpages into genres and perform other tricks.

• The second reason is that semantic markers make the text ideal for alternative consumption and navigation. For example, screen readers for people with low-vision can use the semantics to help guide the user through the content in a way which would be impossible if it were only line after line of text. Simple commands can skip from major headline to major headline. Or within an article, one can have section headings read out, allowing a low-vision user to skip directly to the section of interest.

• And the last benefit is the ability to apply a uniform style to all text of a similar semantic type. Such as, all paragraphs having the same font family, size, weight and color. And all H1 headings looking the same as each other, but with a different font family, size, weight and color, and so on. This can also help with low-vision. If the text of paragraphs is too small to read, one change to one style rule can uniformly increase the font size of all paragraphs.


This is a taste of the lessons learned from Web 1.0 that led us to Web 2.0. But how do each of these properties translate to use on a machine like the C64?

The first point is that it helps search engines, robots and automated processes to have better and clearer insight into the content. Okay, but we're not a search engine or a dynamic ripping service. We're an end consumer. And to the end consumer, frankly, some of those arcane benefits might help to make the web work better, but they don't help us to view the content any more easily.

The second point speaks to people with accessibility needs. I have every sympathy for them, and the web, compared with say traditional books, is a watershed technology.1 But the C64 is a retro machine for computer and electronics enthusiasts. To my knowledge there are no screen reader devices for the C64, nor are there braille devices, etc. While accessibility features are a valuable development on modern platforms, they don't make a lot of sense on a platform where we're already looking for a reasonable way to cope with the smallest unit of content being 10X bigger—or more—than our total available memory.

Lastly we come to what people generally think HTML is good for: presentation. Even if you just hardcode the presentation properties (H1s big and bold, H2s a bit smaller than that, EM content italicized and STRONG text emboldened), we have a rather serious problem on the C64, especially in text mode.

  • We cannot make text bigger or smaller than usual.
  • We cannot offset text from the baseline.
  • We cannot make italicized text.
  • We cannot make underlined text.
  • We cannot make bold text.
  • We cannot make strike-through text.
  • We cannot change the font face, short of changing the font face for every piece of text everywhere on the screen at the same time.

All is not lost. There are a few things we can do.

  • We can change the color of text.
  • We can reverse text.
  • We can all-caps text.
  • We can center text.
  • We can indent text from the left or the right.

With the limited options we have for presentation, the lack of resources or special hardware for accessibility support, and the fact that we're the ultimate end consumer, is HTML really the best way to deliver the features we can support? I don't think so.

A Hyperlinked Text Web

The general idea of separating presentation details from semantic indicators is probably worth retaining. But, the number of semantic elements needs to be dramatically reduced. When I think about these things, I bounce the list of semantics we need to pare down against the set of presentation capabilities of the machine to see how they match up. And if we fundamentally cannot represent something, that thing ought to be struck from the list. A quick example might be the sub and sup tags. They indicate subscript and superscript text. But we have no way of doing anything with them.

Any one tag taken in isolation could be matched up with one of our presentation options. But then, when you mix them all together into the same document, you can't just use reverse text to mean 10 different things. I have come up with a minimal list of semantics worthy of supporting, paired up with a plausible way to represent each.

Color schemes are customizable, but for the purpose of consideration, let's imagine that the background is white. And remember that in C64 text mode (with access to all 256 screencodes) there can be only one background color common to every 8x8 cell, plus one custom color per 8x8 cell, which is used for all the set (high) bits within that cell.

Semantic | Representation
Heading | All Caps. Two blank lines below. High contrast color, black.
Sub-heading | Word Caps. One blank line below. High contrast color, black.
Body | Medium contrast color, dark grey.
Paragraph | No additional blank lines above or below. First line indented 2 spaces from the left.
Link | Alternative passive color, light blue.
Emphasis | Alternative active color, red.
Strong | High contrast color, black.
List | One blank line above, one blank line below.
Unordered item, level 1 | An asterisk and 1 space. Wrapped lines indented 2 spaces from the left.
Unordered item, level 2 | A hyphen and 1 space. Wrapped lines indented 2 spaces from the left.
Ordered item, level 1 | One to X digits and 1 space. Wrapped lines indented 2 spaces from the left.
Ordered item, level 2 | A, B, C ... AA, BB, CC and 1 space. Wrapped lines indented 2 spaces from the left.
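
As a rough sketch of how a renderer might hold that table internally, here is one possible C representation. The struct and field names are hypothetical, invented for illustration; only the colors and spacing rules come from the table above (using standard VIC-II color indices).

/* Hypothetical style table pairing each supported semantic with its
   on-screen treatment, per the table above. Illustrative only; these
   are not actual C64 OS structures. */
enum caps { CAPS_NONE, CAPS_WORD, CAPS_ALL };

struct style {
    const char   *semantic;
    unsigned char color;         /* VIC-II color index            */
    enum caps     caps;          /* case transformation           */
    unsigned char blanks_below;  /* blank lines after the block   */
    unsigned char indent;        /* first-line indent, in spaces  */
};

static const struct style styles[] = {
    /* semantic         color               caps       blanks indent */
    { "heading",         0 /* black     */, CAPS_ALL,  2,     0 },
    { "sub-heading",     0 /* black     */, CAPS_WORD, 1,     0 },
    { "body",           11 /* dark grey */, CAPS_NONE, 0,     0 },
    { "paragraph",      11 /* dark grey */, CAPS_NONE, 0,     2 },
    { "link",           14 /* lt. blue  */, CAPS_NONE, 0,     0 },
    { "emphasis",        2 /* red       */, CAPS_NONE, 0,     0 },
    { "strong",          0 /* black     */, CAPS_NONE, 0,     0 },
};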

Tables are a bit more difficult. The reason is that the number of columns is clearly defined, and the quantity of data within a cell is also well defined. But what to do if the table is 10 columns wide and we have a 40 column wide screen? That gives us just 4 characters per column, if we ignore any column dividing lines. Words just don't fit into cells that small, not without horizontal scrolling.

Fortunately tables, for layout, are used far less frequently today than they were in the early years of the web. Let's leave tables aside for now.

Here's an example of how the descriptions in the table above would look if put into practice on a C64.

Left: headings, sub-headings, paragraphs, strong and link text. Right: sub-headings, lists, emphasis and selected text.

On the left we have an All Caps title, followed by two blank lines. Then the word caps sub-heading followed by one blank line. Then we have a paragraph. It's edge to edge but word-wrapped so no word gets cut off. The first line of the paragraph is indented 2 spaces. Next is another word capped sub-heading. Note, there is one blank line between the bottom of the first paragraph and the top of the next sub-heading. This was not defined in the table above, but needs to be inserted to have the sub-heading stand alone. As with HTML, extraneous space within a paragraph, between words, below or above content would be removed, and replaced with the "Standard" spacing for visual consistency.

Below the Star Trek sub-heading are two paragraphs. There is no space between the paragraphs; just as in most novels, the first line of the new paragraph is indented two spaces. Otherwise, every other line begins with a non-white-space character.

Then we have another All Caps title followed by two blank lines to make it really stand out. And the start of another paragraph. In the first Star Trek paragraph the word "Enterprise" appears in black, which is the rendering of strong text. And in the final paragraph the words "This paragraph" appear in light blue, to indicate they are a link.

In the righthand screenshot we see some examples of unordered and ordered nested lists. The critical part is that when a list item wraps, the second line aligns below the first text character of the first line, leaving the "bullet" left hanging. This lets the bullets really stand out, and defines where one list item ends and the next begins.

Lists always have one blank line before and after. You can see this in the nested sublist. The bullet is changed from an asterisk to a hyphen (minus sign), and the wrapped lines now indent to 4 characters from the left. Note, though, that a list wants one blank line above itself, and a sub-heading wants one blank line below itself; but when a list immediately follows a sub-heading, there aren't two blank lines, just one shared blank line. The wrapping and line spacing are the job of the C64 renderer.

Within the list items we see some black text near the top for strong, near the middle is red text for emphasis, and finally at the bottom we see reversed light green text. This is not a result of the semantics of the text, but is a C64 OS text selection. This is just there to show that OS-imposed UI conventions are not conflicting with the content's own rendering.

MText, not HTML

Given that we can only realistically deal with ~8 to 10 kinds of semantic markup, I am recommending that the C64 not be fed HTML at all. The HTML content is just too much. Let's take the homepage of macrumors.com as an example. The main HTML page is 188K, but the text of the page is just 46K. The markup alone increases the content volume by more than 300% (!!). And most of it is for things we can't deal with anyway.

But it's actually worse than that. Because, whatever you load the 188K of HTML into, you then have to parse that content into memory structures that can be rendered to the screen. The easiest way to do this is to extract the text segments and format them together somewhere contiguously in memory. But you'd then end up having two copies: the 188K source HTML plus 46K or more of renderable content. You can't just render directly out of the HTML, it would be much too slow. Making text selections would also be out of the question.

And we haven't even mentioned yet that we would actually have to parse the HTML, with all of its warts, malformations, improper nesting, etc. Pre-emptive multi-tasking is possible on a C64, but in my opinion it isn't very useful beyond being an academic experiment in what can be done. It leaves you so resource-deprived that no process can actually be more than a few K big, while stealing precious cycles on every context switch. Parsing HTML is similarly useless. There have been HTML parsers written for the C64. But they are projects for demonstration, or to scratch an academic itch. They cannot deal with files beyond a relatively limited size and complexity. Which means if you put them to work on current websites, on the actual internet, they immediately choke.

When it comes to efficiency of memory usage, the source format, and the parsed format—ready to be rendered—need to be overlapping. I have a couple of examples of what I mean by this. A simple example is with common graphics file formats. Koala files begin with bitmap data, and are stored as PRG files with a load address of $6000. Same with RunPaint. Advanced Art Studio files begin with bitmap data and load to $2000. This didn't make sense to me at first, until I realized why they were doing this. $6000 and $2000 are both offsets that align perfectly with memory segments out of which the VIC-II can directly render a bitmap. So, you don't first ask an allocator for 8K of free memory, then load into that memory from disk, then copy the data from allocated memory to video memory. No no no. That would be insane. The video chip's addressable memory area is THE memory into which the file is directly loaded from disk. Then, only the color data, which is addressed discontiguously, needs to be moved to somewhere else.
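
For reference, here is the well-known Koala file layout sketched as a C struct. Treat this as a description of the file format rather than code you would run on the C64 itself; the point is simply that the bitmap comes first and the load address is $6000, so a KERNAL load drops the bitmap directly into a VIC-II bank where it can be displayed without copying.

/* Layout of a Koala Painter PRG file (10003 bytes total). */
struct koala_file {
    unsigned char load_addr[2];      /* $00,$60 -> loads to $6000           */
    unsigned char bitmap[8000];      /* lands at $6000-$7F3F, display-ready */
    unsigned char screen_ram[1000];  /* per-cell color pairs (video matrix) */
    unsigned char color_ram[1000];   /* per-cell color RAM nybbles          */
    unsigned char background;        /* shared background color             */
};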

My second example is with the C64 OS menu data files. The human editable data file is intentionally laid out such that 99% of the data stays put exactly where it was loaded. And the bytes have already been spaced out properly to allow for inline transformation into memory structures with pointers, flags, length counts, etc. necessary for quickly drawing them to the screen. This idea of inline transformation from source code to renderable structured content, is critical to using the memory efficiently.

Something akin to Markdown would be infinitely more accessible to the C64. But I don't think Markdown itself is exactly the right solution either, although it's much closer to being right than HTML is. Instead, I suggest something entirely new, called MText.

Thoughts about MText structure

There are several main types of "markup" (markup isn't the right word, I really mean "concepts from HTML") that I think we can and should support. First, there are block level elements: headings, paragraphs, lists, tables (although I'm still setting these aside for now). Second, there are inline markers designed to delineate certain segments of text: emphasis, strong, etc. These inline delineations cannot cross over block level boundaries, but can be applied to sections of the text within a block. For example, you can't structure HTML like this:

<p>
  This is <strong>strong text.
</p>
<p>
  And so is this.</strong> This is now normal text.
</p>

Third, there are external resource links. These come in two main forms: links to other webpages, and references to images (or audio, or whatever). Links always come with a URL, an address where the linked content resides. If it is an image, typically a web browser reads in the image and automatically displays it amongst the textual content. And if it's a link to another webpage, the link has some clickable element to serve as its presence. For an image, my thought is to blow a hole in the content, render a 2x2 (or 3x3) generic icon (using custom characters), followed by some textual representation of the image: title text if it's available; if not, then alt text; and if not that, then the filename as derived from the URL. Below the title text are several standard buttons, something like: copy image URL, get image metadata, view the image, or save the image to disk. Perhaps like this:

Example of handling an image.

The generic 3x3 character icon would be provided by the browser application. The icon and the buttons draw visual attention to the existence of the image. The label gives you some way to identify it without opening it. About the buttons: Copy URL is self-evident. Info would make a request via the proxy to get details about the image: dimensions, color depth, format, last updated date and time, and whatever other metadata the proxy service is able to extract from it. I haven't shown presenting the image info here, but I am imagining it being displayed in a panel that appears above the page, which you can look at and then close. The View button requests the image URL through the proxy service, which converts it to Koala. Other formats and settings exposed by the proxy service, such as gamma, brightness, luminance (and perhaps cropping and scaling) could also be configurable some time down the road. When viewing, the downloaded image goes straight to the image buffer (behind the KERNAL ROM), where it can be viewed with the OS's native support for switching screen modes and its splitscreen feature, and is supported by its events and compositing system. If you choose to Save instead of View, I imagine this making the exact same proxy service request, but streaming the data out to disk straight away.

The other kind of resource link is to another website, of course. These don't need a special block representation, like images have, because they're linked from the relevant text you can click. A small complication is that an anchor, a link to another website, can be wrapped around an image. In that case, you have the URL to the image itself, plus the URL to the site the image is linking to. I think in this case, the label of the image block would be colored as regular link text. You'd have the 4 buttons for the image, plus you could click the label text to navigate to the other site.

In either case, you've got a URL. And URLs are a bit of a pain because they can be of any arbitrary length. The idea I'm toying with is for all URLs to come packed together in a section at the end of the document. The document content itself would have some short, fixed-length (read: easily parsable) codes strewn throughout. But the code marking the beginning of either linked text or an image label would contain an index number. This could be safely ignored until such time as the user clicks a link, or clicks a button to interact with an image. Then the index can be used to look up the associated URL by scanning through the URL table at the end. Let's set this aside for a moment to cover the last major type of HTML concept we need to support.
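
To show why that layout is cheap to work with, here is a minimal sketch in C. It assumes, purely for illustration, that the URL table is a block of newline-terminated URLs appended after the body; the real MText layout isn't finalized and may differ.

#include <stddef.h>

/* Sketch: return a pointer to the Nth newline-terminated URL in a table
   appended to the end of an MText document. Returns NULL if not found. */
const char *url_for_index(const char *table, const char *table_end,
                          unsigned char index)
{
    const char *p = table;
    while (index > 0 && p < table_end) {
        if (*p == '\n') index--;          /* each newline ends one URL */
        p++;
    }
    return (p < table_end) ? p : NULL;    /* start of the Nth URL */
}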

Forms. Ohhh forms. The bane of every C64 web browser attempt. They're just so vast and varied and inherently hard to deal with. How can we deal with them? At the scale of the C64, we're talking no more than a few kilobytes for the code, maximum, for the whole application. Any more than that and there is no room left for the content. The TCP/IP is thankfully done by the WiFi modem, and the HTML parsing is done by the proxy, leaving us with very lightweight text with small semantic codes dotted throughout, and a table of URLs at the end. But what to do about the form? In my opinion, forms should be treated as sub-pages. What this means is that in the initial request for a webpage, the proxy service could replace the entire branch of the DOM contained within a form tag with a single code indicating the existence of a form, and an index. When the main page is rendered, just as an image shows a special block with some buttons, a special block could appear with a button to open the form. This could then make a proxy request for the same URL but with a parameter specifying a form index. And what we get back is data structured specifically for that one form. All the other content of the page can be removed from memory to make room for dealing with the form. (A page already in memory could be cached to disk or to REU to quickly return to it.) I don't want to go into more detail on forms here. But, that's what I'm thinking.
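
Purely as a hypothetical illustration (the endpoint and parameter names here are invented for this sketch and are not part of the actual service), the round trip might look like two proxy requests:

GET <proxy>/?url=http://example.com/contact          -> page as MText, form branch replaced by a form marker with index 0
GET <proxy>/?url=http://example.com/contact&form=0   -> just that one form, structured for rendering and submission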

MText Semantic Coding

XML-style tags are just way too verbose, with long, human-readable attribute names and values wrapped in quoted strings. It's all just way too much. Instead, we need something like single byte codes. I haven't thought this through completely, but I've got some ideas.

The proxy service will deliver the text to us in 7-bit ASCII. Which means, the high bit, bit-7, for normal text is always low. Therefore a byte which has its high bit set can be interpreted as a special code byte. There are then 7 more bits with which to specify the code value. It would be ideal if all the information for all codes could be packed into a single byte, but I don't think that will be possible. The most important thing, in my opinion, is the ease with which the codes can be skipped over by other mostly generic, mostly MText-agnostic routines. If we select a range of text in a document, and open the Text Utility, a common tool that can be used by many different C64 OS applications, we want the Text Utility to be able to do things like count words without getting tripped up by the presence of MText semantic codes. Similarly, when copying and pasting text from an MText source to something that handles plaintext.2 You want the transfer to the clipping to be able to strip the MText of its codes with minimal effort and minimal insight into how MText works.

If the codes were all only one byte long, and always had their high bit set, they'd be easy to identify. As I said though, I don't think one byte will be enough. I really like the way UTF-8 works. If you've never read about how UTF-8 works, you really should take a break and go read about it. It's very clever. This article looks like a good one:

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, by David C. Zentgraf — 2015 — http://kunststube.net/encoding/

The general idea is that C uses the special NULL byte ($00) to terminate strings. In other words, all of C's functions interpret a zero-byte in a string of data as the end of the string. Meanwhile, one-byte characters are not enough to represent all the characters of Unicode, so you need some way to represent them. One obvious way is to just say, okay, from now on every character is 2 bytes long. There are two main problems with that. The first is that basic ASCII texts instantly become twice the size. But the much worse problem is that the high byte of every plain ASCII character would end up being $00. Such strings would fail to pass cleanly through any system that expects those $00 bytes to mean, this is the end of the string. UTF-8 is a clever encoding scheme that is backwards compatible with ASCII and can be extended so that some characters require 2 bytes, some rarer characters require 3 bytes, and rarer still require 4 bytes. But none of those additional bytes is ever $00. So, UTF-8 strings can pass through older non-UTF-8-aware systems by sort of masquerading as ASCII.3
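
For reference, the standard UTF-8 byte patterns look like this (the x bits carry the code point). Every lead byte of a multi-byte sequence starts with %11 and every continuation byte starts with %10, so no byte of a multi-byte sequence can ever be $00:

1 byte:  %0xxxxxxx                                (ASCII, U+0000 to U+007F)
2 bytes: %110xxxxx %10xxxxxx                      (U+0080 to U+07FF)
3 bytes: %1110xxxx %10xxxxxx %10xxxxxx            (U+0800 to U+FFFF)
4 bytes: %11110xxx %10xxxxxx %10xxxxxx %10xxxxxx  (U+10000 to U+10FFFF)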

So, maybe all the MText coding bytes have their high bit set, but the value of the first one encountered determines whether the byte that follows it is a parameter. Something like this:

Code Byte | Parameter Byte | Meaning
%1000 0001 | - | Strong
%1100 0001 | - | /Strong
%1000 0010 | - | Emphasis
%1100 0010 | - | /Emphasis
%1000 0011 | - | Heading
%1100 0011 | - | /Heading
%1000 0100 | - | Sub-Heading
%1100 0100 | - | /Sub-Heading
%1000 0101 | - | Paragraph
%1100 0101 | - | /Paragraph
%1000 0111 | - | Ordered List
%1100 0111 | - | /Ordered List
%1000 1000 | - | Unordered List
%1100 1000 | - | /Unordered List
%1000 1001 | - | List Item
%1100 1001 | - | /List Item
%1000 1010 | %1xxx xxxx (URL index) | Link
%1100 1010 | - | /Link
%1000 1011 | %1xxx xxxx (URL index) | Image
%1100 1011 | - | /Image
%1000 1100 | - | Horizontal Rule
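
Under this scheme, a generic routine can strip or skip MText codes while knowing almost nothing about them, because every code byte and parameter byte has bit 7 set. Here's a minimal sketch in C of what a clipboard or Text Utility routine might do (the function name is illustrative, and the scheme itself is still just an idea):

#include <stddef.h>

/* Sketch: copy MText into a plain 7-bit ASCII buffer by dropping every
   byte that has its high bit set. Because code bytes and their parameter
   bytes all have bit 7 set, no knowledge of individual codes is needed.
   Returns the length of the stripped text. */
size_t mtext_strip(const unsigned char *src, size_t len, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if ((src[i] & 0x80) == 0) {    /* plain ASCII byte: keep it   */
            dst[out++] = src[i];
        }                              /* high bit set: a code, skip  */
    }
    return out;
}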

It's a big topic. And these are just some thoughts on how to make the coding much more lightweight both for transfer size and storage but also for processing time and code complexity.

I'm envisioning very little nesting. It is interesting to draw an analogy between nesting and multi-tasking. When you have very few resources, memory and CPU time, multi-tasking comes along and cuts those already sparse resources in half, while adding overhead in the form of context switching and code relocation. Nesting in an HTML document, such that a sidebar of content can appear beside body content (as in the layout of the website you're reading now) is just as problematic when you've only got a screen with 40 columns. A sidebar of only 10 columns is scarcely wide enough to hold one word. For example, such common words as "information" would not be able to fit in a sidebar that's 1/4th the width of the screen. Meanwhile, with such a sidebar, the body content then gets limited down to only 30 characters wide, or 29 if you want to leave a margin. It's just impractical.

Some nesting still makes sense. Such as a list inside a list. Essentially, while moving down through the text the context needs to keep track of how deeply nested the current content is, and translate the nesting depth into an offset from the left. Anything more complex and I think we start to lose the plot.

How to get MText?

Whatever MText is, the web is not made of it. The web is made of HTML (often broken or invalid HTML) and that's not going to change anytime soon. So, the idea is to use the services at services.c64os.com, a growing suite of proxy services, to convert the HTML to MText.

The MText retains the links, as full, ordinary URLs. Following a link to another page will route the request back through the proxy to get the next page as MText. The links of a page may point to more than one kind of content, though. Images in a document should be converted to a special kind of link. If you try to follow a link to an image, rather than routing through a service for converting HTML to MText, it will go to the image conversion service. Links to binary files for download, for C64 software for example, would not need to go through the proxy.

The web is a vast and varied place though. And conversion from complex HTML to a simplified MText format will be rife with issues. The way I see it, at the moment, we can hardly consume any of the web. So, if we could have a proxy service that worked reliably and well enough on some small portion of the web, well then that's some small portion that is opened up to us.

I have a few particular targets in sight. Let's talk about these.

  • C64.com
  • csdb.dk
  • Duck Duck Go
  • Wikipedia

C64.com is home to thousands of SID tunes and games in .D64 format.

A download can be made from c64.com with a GET request to the download URL, with an ID of the game in the URL. For example, to download Bubble Bobble as a .D64, one only needs to make a GET request, via a WiFi modem to:

http://www.c64.com/games/download.php?id=91
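
At the protocol level, that is nothing more exotic than opening a TCP connection to www.c64.com on port 80 (which the WiFi modem takes care of) and sending a plain HTTP request, roughly like this:

GET /games/download.php?id=91 HTTP/1.1
Host: www.c64.com
Connection: close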

There are hundreds of games available, directly downloadable to a C64, if only the ability to navigate this one website were within reach. So it's an obvious target for making sure the conversion to MText works smoothly.

The same is true for CSDB.dk. It hosts thousands of demos, programs and tools. The URL scheme for initiating a download is quite simple. It doesn't depend on anything fancy. If we focus on getting the proxy to deal well with this one site, a huge library of C64 software becomes available directly.

Searching CSDB is as easy as putting the search term in the URL. Let's say we want to search for "techno", it's as easy as this:

https://csdb.dk/search/?search=techno

The fact that this address is HTTPS is not a problem; the proxy can handle that with ease. Once a release is found, downloading it is just as easy as from C64.com. Here's how to download a .PRG file of the "Technotronic Demo":

https://csdb.dk/release/download.php?id=214578

The only difference when downloading each release is the ID number in the URL. Without any UI at all, you could download releases at random simply by changing the ID.

What about more generic stuff, like searching the web? DuckDuckGo.com has a no-JavaScript, all-HTML version of itself that is incredibly trimmed down and easily convertible to MText. The search term is similarly sendable in the URL. You don't even need to POST from an HTML form. Let's say we want to look up "Star Trek". Here's how:

http://duckduckgo.com/html/?q=Star%20Trek

Now, these links take you to other websites, and the usability of the resultant site after being converted to MText will obviously vary from site to site. But for pure informational lookup, many sites out there are little more than an article on a blog or news site, and it would be easy to get to the heart of what matters, the textual content, with a solution for viewing images too.

Another major target I would consider is Wikipedia. Wikipedia articles come in a consistent format, so any abnormalities in conversion could be explicitly worked around to ensure usable MText output. Searching is similarly incredibly straightforward, and this one site alone opens up millions of articles in the English language.

If we roll back all the way to the start of this post, the magic of mass storage began with the CD-ROM. Hundreds of megabytes fitting on a single disc opened the door for computers to have access to whole encyclopedias of information. But we were mostly left out of that revolution. CD-ROMs are now a distant memory. My kids have no concept of specific content only being available if you happen to own a shiny physical disc on which the data is stored.

The net is so much more accessible to the C64 than CD-ROMs ever were.

  1. Books are the ultimate final product. They are like a statue carved from stone. There is no separation whatsoever between content and presentation. There is very little metadata, and what little there is is woven into the content in an unparsable, inextricable way.
  2. Such as to the Notes Utility. Although, who knows, maybe the Notes utility could also work with MText; that would be cool. But maybe a plain text editor, then.
  3. They'll break if a C-string function tries to manipulate them, but it's better than nothing!


Greg Naçu — C64OS.com
