Unicode Ate My Brain

What you are about to read is a slightly edited set of messages I wrote on a local bulletin board system. If you recognize the software, you're one of the few. Be warned that my understanding of Unicode and UTF is cursory at best, so don't rely on anything you read here.
2574 20:48  Smarasderagd

Milky: You want to hear about Unicode, do you? Har har. You FOOL. FOOLS. You are all just DOOMED, to hear me recount the story of the Great Character Set Internationalization Wars. Which aren't over.

Long long ago in the predawn mists of time (i.e. some time this century) a numerical code was devised for the transmittal of information between computers. And it was called the American Standard Code for Information Interchange, abbreviated ASCII. It had many excellent tendencies. The digits from 0 to 9 had codes which could be translated into their numerical values almost effortlessly. Most (but not all) of the characters available from the standard typewriter keyboard were in it. Later it was extended to encompass characters many typewriters lacked. But we shall consider this another time.

Time passed, relentlessly, and, well, America neglected to conquer the rest of the world. Germany was partitioned, but German still flourished. And along with it that pesky umlaut, and that weird mutant B-looking thing that is actually two s's stuck together. No problem: to represent the umlaut's modification of a vowel, add an e. And you can always write "ss" instead of that B-like freak.

France had been ravaged, but French flourished. Along with the acute, grave, circumflex accents, and the cedilla, at which point many of the proponents of ASCII began to feel the first faint stirrings of doubt. Sure, you can represent the accents by putting the quote, backquote, and caret after the accented letter, although there was potential confusion between a quote's use as an accent or as an apostrophe. But that cedilla.. there's nothing in ASCII that looks like it at all, unless you want to go completely off the beam and nominate the comma.. but let's not get into that, or we'll be forced to dwell on the distressing European tendency to use it as the decimal point, and to avoid using the period in representations of numbers entirely.

Finland had suffered Finlandization, but Finnish still flourished, and at this point most ASCII lovers threw up their hands in dismay, while the Finnish reluctantly appropriated some of the less used characters of ASCII to represent their own set of accented characters.

2574 21:00  Smarasderagd
More and more, ASCII begins to look a little, well, cramped.

I'm undead, but.. I'm NOT STOPPING! AAAaaaaaaaaaaaaaaaaaah.

So here things stood with ASCII, mostly effective, but getting rather worn around the edges, when..

China. Japan. Islam. Russia. Korea. India. Africa.

At which point it became clear that while there was probably a need for an International Standard Code for Information Interchange, ASCII was not it. Something new was needed. Something new was invented.

Several new things, in fact.

And that was the problem.

Because while certainly Latin-1, ISO-this and wide characters and multi-face character sets which reserved a chunk of space for each language and so forth and so on were all well and good, there remained the question as to WHICH one would rule the world. Latin-1 wasn't up to it. It was a noble and thoughtful effort at enlarging ASCII to take in all of the European languages, but there was no way it would represent any of the three separate character sets used by Japanese, or the unearthly welter of ideograms in even the revised and simplified written language of (ahem) continental China. So it fell to the super-sets, those which added bits to ASCII's original seven-plus-one, and there was a problem.

They were FAT.



Gross. Huge. Too big. Most of them were four times as big as ASCII, and the prospect of trading in a 2 gigabyte drive for an 8 gig drive brought little gladness to the hearts of most data managers. There was a solution, considered by some to be an abomination, but which held great promise.

Its name was Han.


2577 11:35  Smarasderagd

The epic story of unicode, as told by the ignorant veeblefester

When we last left our intrepid international character set standard, it was waiting in the wings, along with a crucial, and as I mentioned, unpopular, concept, known by the name "Han." Now all -- and more -- can be revealed.

Han unification can trace its origin back to the strenuous efforts made to bring written Chinese into something resembling the modern world. Boiled down to essentials, when applied to the vast welter of international characters as a whole, it means this:

If it looks the same, it IS the same.

Or alternatively:

A difference that makes no difference IS no difference.

And thus was unicode born. Folding multiple character sets unrelentingly together, it managed to squeeze a serviceable international character set into 16 bits. Data managers everywhere (well, not really, but figuratively speaking) were somewhat relieved. They only had to get a 4 gig drive to replace each 2 gig drive. But you had to wonder if it might be possible to shrink things down even more than that, just possibly? And what about byte order, hm? What about that?
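For the curious, here is the byte order problem in miniature. This is only a Python sketch, assuming the 16-bit form in question is what Python calls UTF-16; the variable names are made up for illustration. The same character comes out as two different byte pairs depending on which end goes first, and reading one as the other quietly produces a different character.

```python
# One character, U+00E9 (e with an acute accent), as a 16-bit code.
ch = "\u00e9"

big_endian = ch.encode("utf-16-be")     # high byte first: 00 E9
little_endian = ch.encode("utf-16-le")  # low byte first:  E9 00

# Same character, different bytes on the wire.
print(big_endian.hex(), little_endian.hex())

# Guess the byte order wrong and you get some other character entirely.
misread = big_endian.decode("utf-16-le")
print(hex(ord(misread)))
```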

Enter UTF. Its goal: provide as smooth and unruffled a transition from ASCII to Unicode as possible, while dealing with the byte order problem, and allowing programs to scan character streams without guesswork. Its goal: achieved.

The first 128 codes (0 through 127) are the same as they always were in ASCII. If the high bit of a byte is set, though, this is a trigger. The high bit pattern of the first byte determines whether this is a 2- or 3-byte sequence, and the succeeding bytes carry a pattern of their own that marks them as continuations. Since every byte of a multi-byte sequence has its high bit set, and lead bytes and continuation bytes have different high bit patterns, a program can tell by examining the high bits of an individual byte whether it's at the start of a multi-byte sequence or in the middle of one, and can find the start by backing up a byte or two. Sound unimportant? Heh. I guess YOU'VE never had to go backwards in a string before..
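To make the high-bit business concrete, here is a rough Python sketch, assuming the encoding described is what is now called UTF-8 and covering only the 1-, 2- and 3-byte cases mentioned above. The function names byte_kind and char_start are hypothetical, invented for this example. It classifies a byte by its top bits, then uses that to back up to the start of a character from anywhere inside it.

```python
def byte_kind(b):
    """Classify one byte of a UTF-8 stream by its high bits."""
    if b & 0x80 == 0x00:
        return "single"        # 0xxxxxxx: plain ASCII
    if b & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx: middle or end of a sequence
    if b & 0xE0 == 0xC0:
        return "lead2"         # 110xxxxx: starts a 2-byte sequence
    if b & 0xF0 == 0xE0:
        return "lead3"         # 1110xxxx: starts a 3-byte sequence
    return "other"             # longer or invalid forms, not handled here


def char_start(data, i):
    """Back up from index i to the first byte of the character containing it."""
    while i > 0 and byte_kind(data[i]) == "continuation":
        i -= 1
    return i


# 'a' is 1 byte, e-acute is 2 bytes, the euro sign is 3 bytes.
data = "a\u00e9\u20ac".encode("utf-8")
print([byte_kind(b) for b in data])
print(char_start(data, 5))  # index of the euro sign's first byte
```

Going backwards never takes more than a couple of steps, and no guessing is involved, which is the whole point.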

The other crucial advantage is that zero still means end of string, since it can only appear as a single byte, and never as part of a multi-byte sequence. This has the pleasant effect that naive programs can treat UTF streams as streams of ordinary 8-bit stuff, without worrying about the Unicode aspect if they don't want to. And that brings me to what I mentioned about what I was doing at work, umpty-ump messages ago. But that will have to be explored in another message.
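A small Python sketch of that naive-program point, again assuming the encoding is UTF-8. The helper c_strlen is hypothetical, a stand-in for C's strlen that counts bytes up to the first zero; the sample text is arbitrary. Because no byte of a multi-byte sequence is ever zero, the byte-at-a-time scan stops at the real terminator and nowhere else.

```python
def c_strlen(buf):
    """Count bytes up to the first zero byte, the way C's strlen does."""
    n = 0
    while n < len(buf) and buf[n] != 0:
        n += 1
    return n


text = "na\u00efve \u03a9 \u6f22"        # some 2- and 3-byte characters mixed in
encoded = text.encode("utf-8")
buf = encoded + b"\x00" + b"junk"        # terminator, then leftover garbage

# The naive scan finds the terminator without knowing anything about UTF.
print(c_strlen(buf) == len(encoded))

# Every byte of a multi-byte character has its high bit set, so none is zero.
print(all(b >= 0x80 for b in "\u6f22".encode("utf-8")))
```

Note that the length it reports is a count of bytes, not characters, which is fine for a program that only wants to copy the string around.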


2577 11:42  Smarasderagd

My hovercraft is full of eight-bit eels.

The interminable saga of unicode and utf is almost at a conclusion. After years of high-falutin' international jockeying, here's where I come in.

I'd gotten ahold of a bunch of X programs which supported utf/unicode. An editor (called sam) and a command line window (called 9term). I also had a shell that could tolerate all this high-bit crap without having a gagging fit (it was called rc). And there the matter rested.

For you see, the story's not quite over. There's still the matter of input, which anyone seriously wanting to use unicode would have brought up by now. How do you enter one of a set of thousands of characters with a 101-key keyboard? Could the common ones be a little easier to type, please? All this and more, next on the great utf mystery show.


2579 07:41  Smarasderagd

Blerf. I have returned home after abusing coffee at work for four days. I figured out how to call here from there, though, so the future is bright, if a little radioactive.

Now, where was I.

Oh, right.

I'd had the set of unicode-capable programs for quite a while, without actually making use of unicode (they each had sterling qualities which made them useful in their own right) for the simple reason that I had no idea how to cause these wacky characters to appear. Plus the unicode font set was fucking huge, which is something you might expect of a font set that includes all the ideograms in standard Japanese, plus hiragana, katakana, and Greek AND Cyrillic and.. well anyway.

Well, one fateful day, I was sitting at my console (at home, as it happened, but I have the same environment set up at work) when the thought occurred to me, faintly but persistently:

"Maybe there's a manual page for this."

I'd seen some passing description of sam's mouse interface in its manual page, but in general I found I had better luck clicking the mouse and seeing what happened. Working this way means your life is full of pleasant and/or embarrassing surprises, such as the time I accidentally found out how the string searching facility in 9term works, which increased its usefulness tenfold, and prompted intense chagrin at not having found it earlier, particularly since it WAS mentioned in the manual page. Still and all though, I'd come to distrust the documentation.

But still. "Maybe there's a manual page for this."

So I looked. And there was. How embarrassing. You could enter every character in any European language with ridiculous ease. And other stuff with only slightly less ease. So I tore off to work and socked the whole unicode font turd into place, and soon enough, despite all the warnings from friends and family (actually, no one knew I was doing this, so I got no warning, but "all" includes "none" among its possibilities, so there) I became a hopeless glyph junkie.

Chess pieces in my font. I delved into the theory of partitions to come up with a way to show them in every possible order. Well, I exaggerate slightly, and it would have been a lot easier if I hadn't been stupefied with caffeine at the time. And I see that space is running out, so we must wait another message before I confess the final madness.


2579 07:54  Smarasderagd

In which our tale of character sets comes to a bold, if deranged, close.

Well, God spank my face, but I didn't MEAN to use a default summary in that last message. Nevertheless. Onwards.

After I'd played around a while with the panoply of whacky characters available, I became gradually aware of just how pathetically wimpy Unix is when confronted with Evil, Nasty Characters With Their High Bit Set. ls refused to believe in them. vi spit them out. Even emacs kept its distance, hedging every one with a backslash and an octal code. So few places where a forlorn multi-byte sequence could find PEACE, ACCEPTANCE..

Converting the heathens abroad was too large a task. I contented myself with making my own revolution. ls was against me. Very well. The file system would have to go its own way. No other editor would display this freakish stuff. No problem, I only ever used sam, anyway.

So I made my statement where I could, in defined shell functions. And named them, not AFTER chess pieces. The names were THEMSELVES chess pieces. The white rook roams over the source tree, reconfiguring and updating, putting links in place, preparing the way. The white knight moves carefully and justly, if a little crookedly, compiling one file at a time. Fast and dangerous, the black knight sweeps across the network, compiling on all the machines at once.. just be sure you don't get caught.

So for now I am in my private paradise, although the walls are too high to see outside, and I worry that everything may collapse on me, I can say, though I can't say it properly here..

I [heart] unicode.

The End.