June 2007

Unicode Fanaticism

Graydon Saunders, ATI Technologies, Inc.

If you’re starting to explore XML authoring, you probably hope to secure particular XML advantages:

      • semantic tagging
      • explicit separation of form and content
      • standards adherence through DITA or DocBook
      • information architecture designed to support topic-based authoring
      • language independence

Securing these advantages requires paying attention not only to how XML will represent your content but also to how XML itself is represented.

Unicode exists to represent all world languages in a single encoding. This allows one set of software to work for everybody, and any content, even Chinese Algebra, to be reliably represented to any viewer, provided everyone is using Unicode tools.

Thorough, Unicode-enforcing Unicode support has to be present as a core part of the software design. In testing terms, this means that useful Unicode environments are all-or-nothing; an environment or application can handle the entire Unicode code point definition, or it can’t. Platform dependencies in character handling mean you can expect applications (which typically rely on the underlying platform for character handling support) to show platform-specific behaviors with respect to their Unicode handling, even if the application or environment being tested is advertised as being cross-platform. Testing well outside the set of characters you expect to use on all the platforms you intend to use is a very good idea for this reason.

Platform-specific behaviors make it hard to recommend specific applications, but in general you want an application that runs on a platform which supports all languages with the same version of the platform. Java and any modern Unix (including Linux) meet this criterion. Applications like the oXygen XML editor or Eclipse, which run on a Java platform, can reliably provide enforced Unicode character representation anywhere you can run Java. Applications like OpenOffice, which include their own character handling support and do not rely on the underlying platform to provide it, can reliably provide enforced Unicode character representation on any supported platform.

The Specific Case of XML

XML processors are required by the XML specification to understand UTF-8 and UTF-16, two equivalent mechanisms for representing Unicode. The processor can understand any other encoding the implementers care to add, but those two are the only two you can count on. As a result, XML is by default represented in UTF-8. However, the XML declaration can specify that you want some other encoding instead.

<?xml version="1.0"?> is equivalent to:
<?xml version="1.0" encoding="UTF-8"?>

If you want UTF-16, you must ask for it:
<?xml version="1.0" encoding="UTF-16"?>
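
As a rough illustration of the two guaranteed encodings, the following Python sketch serializes the same small document as UTF-8 and as UTF-16 and checks that a parser recovers identical text from both. The sample text and the use of Python's standard `xml.etree.ElementTree` module are choices made for this example, not something the XML specification prescribes.

```python
import xml.etree.ElementTree as ET

# Sample content chosen for this sketch: Greek, a symbol, and a CJK character.
text = "\u03a9 \u2122 \u4e2d"

template = '<?xml version="1.0" encoding="{enc}"?><p>{body}</p>'

# Python's "utf-16" codec writes a byte order mark at the start of the data.
utf8_doc = template.format(enc="UTF-8", body=text).encode("utf-8")
utf16_doc = template.format(enc="UTF-16", body=text).encode("utf-16")

parsed_utf8 = ET.fromstring(utf8_doc).text
parsed_utf16 = ET.fromstring(utf16_doc).text

print(utf8_doc != utf16_doc)        # do the stored bytes differ?
print(parsed_utf8 == parsed_utf16)  # do the parsed characters agree?
```

The stored bytes differ between the two files, but the characters the parser hands back should be the same, which is the sense in which the two encodings are equivalent.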

So far, this is pretty bland, and in an ideal world where everything that handled XML reliably enforced Unicode, it would actually be pretty bland.

We are not in an ideal world

Because we are not in an ideal world, the XML specification includes what XML processors are required to do if they encounter an encoding error. They are required to fail.
As stated at the end of section 4.3.3, “Character Encoding in Entities,” of the XML specification:

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process.

The XML specification defines “fatal error” as follows:

[Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document’s logical structure to the application in the normal way).

This means that an XML processor–which includes an XSLT style sheet processor, as well as validating and non-validating XML processors–is required to fail if it finds one broken character. You may get a lengthy and detailed error report, but you won’t get your regular output.
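
A minimal sketch of this behavior, assuming Python's standard `xml.etree.ElementTree` parser stands in for the XML processor: the second document declares UTF-8 but contains an é saved as a single Latin-1 byte, and the parser raises an error instead of returning content. The helper name `try_parse` is made up for this example.

```python
import xml.etree.ElementTree as ET

source = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'

good = source.encode("utf-8")    # é stored as the two bytes C3 A9
bad = source.encode("latin-1")   # é stored as the single byte E9, not valid UTF-8 here

def try_parse(data):
    """Return (text, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(data).text, None
    except ET.ParseError as err:
        return None, str(err)

print(try_parse(good))
print(try_parse(bad))
```

The two documents look identical in many editors; only the parser's refusal reveals that one of them is broken.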

Since the XML processor isn’t required to give you a useful error message or to continue past the first error it finds, it’s quite possible to find yourself going through a long XML file one or two lines at a time, trying to fix the encoding errors using the same Unicode-supporting-but-not-enforcing tools that got you into your present difficulty in the first place.
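
One way to avoid the one-error-at-a-time cycle is to scan the raw bytes yourself before handing the file to the XML processor. The sketch below, with the hypothetical helper `find_invalid_utf8`, assumes the file is meant to be UTF-8 and reports every byte sequence that is not, with its line number and byte offset.

```python
def find_invalid_utf8(data):
    """Return (line number, byte offset, offending bytes) for each invalid sequence."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is clean
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            line = data.count(b"\n", 0, start) + 1
            problems.append((line, start, data[start:end]))
            pos = end  # resume scanning just past the bad bytes
    return problems

# Sample data for this sketch: one clean UTF-8 line, then two lines
# containing bytes left over from a Latin-1 source.
sample = (
    "caf\u00e9\n".encode("utf-8")
    + b"na\xefve\n"
    + "r\u00e9sum\u00e9".encode("latin-1")
)

for line, offset, raw in find_invalid_utf8(sample):
    print("line", line, "offset", offset, "bytes", raw.hex())
```

Re-decoding the remainder after each error is wasteful on very large files, but it keeps the sketch short and lists every problem in a single pass over the report.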

This scenario leads to complete failure, and it leads to failure in the writing step, long before localizing or generating output for customer deliverables become considerations.

Why don’t all applications enforce Unicode?

Many applications fail to enforce Unicode. Many applications claim, correctly, that they support Unicode; that does not mean that they enforce Unicode encoding of all content. Unicode-supporting applications allow other encodings; they may well allow several encodings in a single file. A common example of this permissiveness is Web browsers, which are designed to do the best job they can of displaying something to the user.

We’ve seen cases where localized HTML documentation claimed to be in a specific encoding and was actually in five different encodings (two of them being UTF-8 and UTF-16) before the mess got sufficiently bad that the browser could no longer cope and it became obvious to the person checking the localized documentation that there was a problem. (You can’t safely use a browser to check your XML for exactly the same reason – the browser won’t enforce an encoding requirement.)

Unless your entire tool chain – everything you use to create, store, and process your XML content – enforces Unicode, in either UTF-8 or UTF-16, various difficult-to-detect errors can – and will! – wind up in your XML, absorbing hours of frustrating effort while you try to correct them. Requiring human intervention to correct the XML content removes most XML writing advantages; you can’t rely on automated processing to produce your output, and you won’t recoup writer effort by separating form from content since the effort that used to go into formatting will instead go into correcting encoding errors.

Where do the errors come from?

Importing existing content, whether by cut-and-paste or through direct file importation, can introduce characters into an XML document that are not UTF-8 or UTF-16. A tool that is not fully Unicode aware can allow you to type characters in a non-Unicode encoding, or it can translate named entities (such as &trade; for the trademark sign ™) into a non-Unicode representation. (And, really, for XML, we want only a subset of Unicode representations; while there are several Unicode encodings, XML parsers are required to handle only UTF-8 and UTF-16, so we want to stick to those.) If the application attempts to appropriately transcode (“convert from one encoding to another encoding”) content from a non-Unicode encoding to a Unicode encoding and gets it wrong, you’re going to get errors as well. (For some reason, the trademark symbol, U+2122, is particularly prone to this error on Windows platforms.)
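
The trademark case can be reproduced in a few lines. This Python sketch assumes the legacy encoding involved is Windows-1252 (Python's `cp1252` codec), which is only one plausible source of the problem; the variable names are illustrative.

```python
trademark = "\u2122"

utf8_bytes = trademark.encode("utf-8")       # three bytes: E2 84 A2
cp1252_bytes = trademark.encode("cp1252")    # one byte: 99

# Mistake 1: UTF-8 bytes read as Windows-1252 produce three wrong characters.
mojibake = utf8_bytes.decode("cp1252")

# Mistake 2: the Windows-1252 byte read as Latin-1 becomes an invisible
# control character (U+0099) rather than a trademark sign.
control = cp1252_bytes.decode("latin-1")

# Mistake 3: the Windows-1252 byte pasted into a UTF-8 file is not valid UTF-8.
try:
    cp1252_bytes.decode("utf-8")
    pasted_byte_is_valid = True
except UnicodeDecodeError:
    pasted_byte_is_valid = False

# Correct transcoding names the source encoding explicitly.
repaired = cp1252_bytes.decode("cp1252").encode("utf-8")

print(repr(mojibake), repr(control), pasted_byte_is_valid, repaired == utf8_bytes)
```

Only the last step yields a trademark sign that an XML processor will accept in a UTF-8 document; the first two mistakes produce valid but wrong characters, which is why they are hard to spot.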

Why can’t you see the errors?

Unicode has a concept of “explicit inequality,” where two code points can be represented by the same glyph but are expressly not the same thing. So “Greek Capital Letter Omega,” Ω, Unicode code point U+03A9, may use the same glyph as “Ohm Sign”, Ω, Unicode code point U+2126. Despite the identical glyph, it’s not the same Unicode character. The “why can’t I get rid of that space?” version of this is the colon, U+003A, versus the “Fullwidth Colon,” U+FF1A. Depending on the display mechanism, you can spend a lot of time trying to figure out where the space on either side of the colon is coming from, or why the search for everything after the colon isn’t matching.

It is important that your Unicode-enforcing writing tool include a display mechanism for the code point of a given character; just looking at the glyph can’t tell you everything you need to know.
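
If your editor lacks such a display, a few lines of script can stand in for it. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Greek omega, ohm sign, ordinary colon, fullwidth colon.
for ch, code_point, name in describe("\u03a9\u2126:\uff1a"):
    print(ch, code_point, name)
```

The two omega-shaped glyphs and the two colons print with different code points and names, which is the information the glyph alone cannot give you.
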
A standards-compliant tool chain can use this explicit inequality of code points for a kind of semantic checking. If you’re producing a text on Homeric poetry, you can automatically check to make sure you have no Ohm signs. If you’re producing a reference manual related to electrical engineering, you can automatically ensure that you have no capital omegas.
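
A check of this kind might look like the following Python sketch. The function `find_forbidden` and the sample texts are hypothetical; a real tool chain would run the same test over the text content of your XML.

```python
OHM_SIGN = "\u2126"
GREEK_CAPITAL_OMEGA = "\u03a9"

def find_forbidden(text, forbidden):
    """Return (line number, column, code point) for each forbidden character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in forbidden:
                hits.append((line_no, col, "U+{:04X}".format(ord(ch))))
    return hits

# A text on Homeric poetry should contain no ohm signs.
homeric_text = "\u03a9 \u03c0\u03cc\u03c0\u03bf\u03b9\nThe scholiast wrote \u2126 here."
print(find_forbidden(homeric_text, {OHM_SIGN}))

# An electrical reference manual should contain no capital omegas.
manual_text = "Use a 47 \u2126 resistor.\nDo not use a 47 \u03a9 resistor."
print(find_forbidden(manual_text, {GREEK_CAPITAL_OMEGA}))
```

Each report points at the one character that looks right on screen but carries the wrong meaning.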

If you’re using a symbol font for your Greek letters, or the HTML &Omega; or &ohm; entities, you get hexadecimal character 57. This encoding is the same hexadecimal value (in UTF-8, ASCII, or Latin-1 encodings) as a capital W; in UTF-16 it isn’t anything at all. One interesting potential consequence is that if anything happened to the symbol font, your documentation would start measuring resistance in watts, the units for power. This makes as much sense as measuring distance in gallons and makes subject matter experts justifiably upset.
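
The arithmetic behind this is easy to check. The Python sketch below assumes a Symbol-style font that draws an omega glyph for the character stored as hexadecimal 57; the stored byte itself is only ever a capital W.

```python
stored = bytes([0x57])  # what a symbol-font "omega" actually saves

# The same byte decodes to a capital W in all three encodings.
decoded = {enc: stored.decode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# A single byte is only half of a UTF-16 code unit, so it cannot be decoded at all.
try:
    stored.decode("utf-16")
    utf16_ok = True
except UnicodeDecodeError:
    utf16_ok = False

print(decoded, utf16_ok)
```

Nothing in the stored data records that an omega was intended; that information lives only in the font choice.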

A Unicode-enforcing tool chain ensures that you don’t suffer from this sort of symbolic rot. You may wind up with blank spaces or hollow boxes in your final output if you’ve used a code point that your fonts do not support, but you don’t get one glyph turning into another glyph because you’re using the font choice to contain a semantic distinction.

If you’re avoiding this problem by using a non-Unicode encoding’s actual capital omega rather than overloading a capital W through a font choice, the semantic distinction between “ohm” and “Greek letter” isn’t available to you; if you’re overloading a character through a font choice, the semantic distinction between “Latin letter w” and “ohm” might not be available to you. In pretty much all non-Unicode encoding schemes, there will be code points that correspond to more than one semantic entity (ohms and omega) or, worse, to more than one glyph, depending on some other metadata such as font family. This becomes a serious problem with localized content because you can’t necessarily lexicographically order your content appropriately based on the one-to-many code point to glyph mappings. (“Ohms” and “omega” should not sort to the same place in the index.)
It is a substantial advantage of enforced Unicode that you can rely, in automatic processing, on the semantic distinctions implied by the code point names. You may not care if the ohm signs are treated differently from the capital omegas, but you almost certainly do care that indexing can be made to work properly, irrespective of language, and that the character you entered is the character that appears in the output.

However, having decided that you want to use XML in a way that conforms to the specification, you will not find conformance easy to do. You have to test and test carefully that the whole tool chain properly enforces Unicode, and you have to do this before you decide which tool chain you’re going to use. How an application handles character encodings, whether that application is an XML editor, XML processor, or XSL processor, is a very low-level design choice; it can’t be patched afterwards to work correctly if it doesn’t work correctly now.

Is this effort worthwhile?

It is impossible to avoid encoding issues and retain XML standards compliance. You can find XML processors to validate your content which don’t follow the specification and either silently ignore non-Unicode characters or accept flags to delete any problematic character during processing.
Everything contains tradeoffs; what you’re accepting is a certain amount of up-front convenience in trade for never being quite sure that you’re going to get what you expect or that your content will work with other tools. Standards are of no benefit if you don’t actually follow them.

If you do rigorously follow standards, there are some benefits. One benefit is composed normalization (“Normalization Form C”), which the Unicode Technical Committee defines as “A process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence.” This guarantees that however you entered information, it’s all the same when it’s stored. This convergence is a minor benefit in Latin languages – combining characters used to produce accented characters wind up as the accented character, and numeric entity references to specific, rarely typed, or hard-to-type code points get converted into those characters – and a substantial benefit in ideographic languages, such as Japanese or Chinese, where there are multiple input methods in active use.
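
A small sketch of Normalization Form C, using Python's standard `unicodedata.normalize`. The two sample pairs (a Latin accented letter and a Japanese kana with a combining voiced sound mark) are chosen for illustration only.

```python
import unicodedata

def nfc(text):
    """Return text in Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)

# Each pair: (sequence as it might be typed, single composed character).
pairs = {
    "e + combining acute accent": ("e\u0301", "\u00e9"),
    "ka + combining voiced sound mark": ("\u304b\u3099", "\u304c"),
}

for label, (typed, composed) in pairs.items():
    # Before normalization the strings differ; afterwards they compare equal.
    print(label, typed == composed, nfc(typed) == composed)
```

After normalization, a plain string comparison is enough to tell whether two pieces of text are the same, regardless of how they were entered.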

Another benefit is avoiding surprise; the Chinese “National Standard” (“Guojia Biaozhun”) for character representation, GB18030, is legally mandated for all software sold in China after August 1, 2006. Because GB18030 is an encoding of all Unicode code points, one-to-one transformations between it and other Unicode encodings are straightforward. Transformations between arbitrary non-Unicode character representations and GB18030 are not straightforward and are not guaranteed to be possible. Someone with a Unicode-enforcing tool chain is in a much better position to adapt to this kind of regulatory requirement than someone who does not have such a tool chain in place, even when the final output format is not a Unicode encoding (in which case, you work in Unicode and transform as the final output step).
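
As a sketch of that final output step, the following Python fragment converts a Unicode sample to GB18030 and back using Python's built-in `gb18030` codec. The sample string is arbitrary, and Latin-1 appears only as an example of a legacy encoding that cannot hold the same content.

```python
# Greek, a symbol, Chinese, and a character outside the Basic Multilingual Plane.
sample = "\u03a9 \u2122 \u4e2d\u6587 \U00020000"

as_gb18030 = sample.encode("gb18030")
as_utf8 = as_gb18030.decode("gb18030").encode("utf-8")
round_trip = as_utf8.decode("utf-8")

# A legacy single-byte encoding has no representation for most of the sample.
try:
    sample.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(round_trip == sample, latin1_ok)
```

Because every character survives the trip in both directions, the choice between GB18030 and UTF-8 can be deferred to the moment of output.
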
Fundamentally, you must decide what you’re trying to do with your XML writing system. If your objective includes semantically tagged information managed and presented through automatic machine processing, getting the encoding right is every bit as important as not introducing corrupt records into a database. If your objective includes interoperability, reliability, and avoiding regulatory surprise through adherence to open standards in a global marketplace, enforced Unicode is the only game in town for complete character representation.

About the Author

Graydon Saunders
ATI Technologies, Inc.

Graydon Saunders is the Documentation Team Lead for a documentation team inside Advanced Micro Devices, Inc. He does design, functional specification, and vendor relations while wearing the Delenda DITA CMS project hat. He writes Perl scripts, creates XSL output style sheets to handle 22 languages, configures CMS output generation, and does internal relationship and expectation management while wearing the tools and processes hat. Graydon has a degree in Computing Science and a long-term interest in how you tell information from data.