OAXAL—Open Architecture for XML Authoring and Localization
XML has become ubiquitous. There is no major publishing system in use today that is not based on XML, including Microsoft Office, XHTML, FrameMaker, and Adobe InDesign. From the localization point of view, it became obvious that pieces were missing from a comprehensive Open Standards-based solution. As usual, proprietary solutions quickly appeared to plug the gaps, but with the usual drawbacks: lack of openness and transparency.
One of the most fundamental aspects of Open Standards is how well designed they are. This is not surprising: take a group of industry experts, a democratic charter, peer and public review, and you will usually end up with a professionally designed solution. I never cease to be surprised at how much better Open Standards-based solutions are than their proprietary equivalents.
Open Standards also provide an example of IT best practice: you create a specification by talking to all of the interested parties, publish the draft for public comment, and then publish the agreed specification. Sometimes things do not go according to plan, but the nature of standards allows for revision and review in light of practical feedback. Rather like democracy, they provide the ability for self-correction.
Open Architecture for XML Authoring and Localization (OAXAL) is a newly established OASIS reference architecture technical committee standard. It covers all of the aspects of technical publishing, including translation, to create an open and effective solution. It provides a reference model for how to construct an effective and efficient system for XML authoring and localization based on Open Standards.
One of the things that XML has initiated, starting in 1998 when XML 1.0 was published as a W3C Recommendation, is an explosion of Open Standards. The main reason is that XML provided the extensible vocabulary for defining key aspects of data models that was previously missing. The other reason has been the dramatic reduction in communication costs, which makes regular teleconferences cheap and lets geographically distant participants cooperate easily and effectively.
OAXAL is made up of a number of core standards from W3C, OASIS and, last but not least, LISA (Localization Industry Standards Association). The LISA standards have now been taken over by ETSI LIS (European Telecommunications Standards Institute, Localization Industry Standards). Figure 1 shows a diagrammatic representation of the OAXAL standards component stack.
Figure 1: OAXAL Standards Component Stack
Let us have a look at all of these standards in detail.
Those of us who remember the ‘bad old days’ before Unicode praise it every day. The ‘tower of Babel’ of various illogical and contradictory encoding schemes that preceded Unicode was the cause of much grief for anyone involved in translation of electronic documents.
Where would we be without XML? It has been a monumental standard that has given us the extensible language that was lacking previously. It was as if the IT industry was finally given a common language in which to talk to one another. It is not perfect, but rather like democracy, all of the alternatives are so much worse that anything else is not worth considering. Coming on the back of the lessons learned from SGML, it will remain for many years the fundamental building block of all sensible IT systems. Interestingly enough, the adoption of XML in the publishing industry has been much slower than that in computer science in general. There have been many reasons for this, but with OpenOffice and the latest version of Microsoft Office, the final hurdles have been cleared.
A few things that are not well appreciated are the following:
- You should always use UTF-8 or UTF-16 encoding for XML documents. This goes for monolingual documents as well as those destined for translation, as Unicode provides the basic typographical elements which most documents require.
- You should always use the xml:lang attribute on the root element of your topics to denote the language of the document and anywhere within a document where a change of language occurs.
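The two recommendations above can be illustrated with a minimal sketch using Python's standard library. The sample topic and the function name `languages` are made up for illustration; the sketch only shows how a UTF-8 document with xml:lang on the root, and on an element where the language changes, lets a tool work out the language of every piece of text.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace; ElementTree expands it like this.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample topic: language declared on the root, overridden on one paragraph.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<topic xml:lang="en">
  <p>Change the oil filter.</p>
  <p xml:lang="fr">Changez le filtre à huile.</p>
</topic>"""


def languages(xml_text):
    """Return (language, text) pairs, inheriting xml:lang from ancestors."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    found = []

    def walk(element, inherited):
        # An element without its own xml:lang takes the language of its parent.
        lang = element.get(XML_LANG, inherited)
        text = (element.text or "").strip()
        if text:
            found.append((lang, text))
        for child in element:
            walk(child, lang)

    walk(root, None)
    return found


for lang, text in languages(SAMPLE):
    print(lang, text)
```

Without the root-level xml:lang, the first paragraph would come back with no language at all, which is exactly the information a localization tool needs.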
ITS stands for Internationalization Tag Set. This is the contribution from W3C to the localization process <http://www.w3.org/TR/its/>.
ITS allows for the declaration of Document Rules for localization. In effect, it provides a vocabulary that allows the declaration of the following for a given XML document type, for instance DITA (Darwin Information Typing Architecture), an advanced vocabulary for component-based publishing:
- Which attributes are translatable.
- Which elements are ‘inline’, that is they do not break the linguistic flow of text, for instance ‘emphasis’ elements.
- Which inline elements are ‘sub flows’, that is, although they are inline, they do not form part of the linguistic flow of the encompassing text, for instance ‘footer’ or ‘index’ markers.
W3C ITS provides much more including a namespace vocabulary that allows for fine-tuning localization for individual instances of elements within a document instance. W3C ITS is therefore at the core of localization processing.
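As a rough sketch of how such Document Rules might be consumed, the following Python applies a small set of ITS 1.0 global rules to a document. The rule set, the sample content, and the function `classify` are illustrative assumptions. The element names in the sample are DITA-like, and because the standard library supports only a subset of XPath, the selectors are kept deliberately simple. Inheritance of the translate setting by child elements is not modeled.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical rule set: code phrases are not translated, <b> is inline,
# and <fn> (footnote) is an inline sub flow ("nested" in ITS terms).
RULES = f"""<its:rules xmlns:its="{ITS_NS}" version="1.0">
  <its:translateRule selector="//codeph" translate="no"/>
  <its:withinTextRule selector="//b" withinText="yes"/>
  <its:withinTextRule selector="//fn" withinText="nested"/>
</its:rules>"""

DOC = ("<topic><p>Press <b>Start</b> and run <codeph>init()</codeph>"
       "<fn>See the manual.</fn></p></topic>")


def classify(rules_xml, doc_xml):
    """Return (tag, translate, withinText) for every element, in document order."""
    rules = ET.fromstring(rules_xml)
    doc = ET.fromstring(doc_xml)
    # ITS defaults: element content is translatable and elements are not inline.
    settings = {el: {"translate": "yes", "withinText": "no"} for el in doc.iter()}
    for rule in rules:
        name = rule.tag.split("}")[1]
        # ElementTree needs relative paths, so "//name" becomes ".//name".
        selector = "." + rule.get("selector")
        for el in doc.findall(selector):
            if name == "translateRule":
                settings[el]["translate"] = rule.get("translate")
            elif name == "withinTextRule":
                settings[el]["withinText"] = rule.get("withinText")
    return [(el.tag, s["translate"], s["withinText"]) for el, s in settings.items()]


for row in classify(RULES, DOC):
    print(row)
```

A production implementation would use a full XPath engine and honor the inheritance and precedence rules of the ITS specification; the sketch only shows the shape of the data a localization tool derives from the rules.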
Standard XML Vocabularies
DITA, DocBook, XHTML, SVG: all of these standards dramatically reduce the cost of XML adoption. One of the factors that initially limited the adoption of XML was the high cost of implementation. Defining an XML DTD and/or Schema is neither simple nor cheap. The key benefit of having standard XML vocabularies is that they reduce the cost of adoption: software companies can write standard applications, tools, and utilities that lower the adoption price, sometimes dramatically. Not only that, but as is the case with DITA, they often introduce key advances in the way we understand, build, and use electronic documentation.
xml:tm is a key standard from LISA OSCAR that has now been transferred to ETSI LIS <http://portal.etsi.org/LIS> and <http://www.gala-global.org/lisa-oscar-standards>.
Think of xml:tm as the standard for tracking changes in a document. It allocates a unique identifier to each translatable sentence or standalone piece of text in an XML document. It is a core element of OAXAL because it links all of the other standards into an elegant, integrated system.
At the core of xml:tm are the following concepts which together make up ‘Text Memory’:
- 1. Author Memory
- 2. Translation Memory
You can think of Author Memory in terms of change tracking, but also as a way to ensure authoring consistency—a key concept in improving authoring quality and reducing translation costs.
As far as Translation Memory (TM) is concerned, xml:tm introduces a revolutionary approach to document localization. It is very rare that a standard introduces such a fundamental change to an industry. Rather than separating memory from the document by storing all TM data away from the document in a relational database, xml:tm uses the document as the main repository with no duplication of data. This approach recognizes fundamentally that documents have a life cycle and, that within that life cycle, documents evolve and change, and that at regular stages in that cycle documents require translation.
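The idea of tracking each sentence through a document's life cycle can be sketched in a few lines of Python. This is not the xml:tm markup or its matching algorithm, only an illustration under simple assumptions: a document is treated as a list of sentences, each with an identifier, and the hypothetical function `carry_ids` compares two versions so that unchanged sentences keep their identifiers (and hence their existing translations) while new or modified ones receive fresh identifiers.

```python
from difflib import SequenceMatcher
from itertools import count


def carry_ids(old_units, new_sentences, next_id):
    """Compare two document versions sentence by sentence.

    old_units: list of (identifier, sentence) from the previous version.
    new_sentences: list of sentences in the updated version.
    next_id: iterator that yields unused identifiers.
    Returns a list of (identifier, sentence, status).
    """
    old_sentences = [sentence for _, sentence in old_units]
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text keeps its identifier, so its translation can be reused.
            for uid, sentence in old_units[i1:i2]:
                result.append((uid, sentence, "unchanged"))
        elif tag in ("replace", "insert"):
            # New or modified text gets a fresh identifier and needs translating.
            for sentence in new_sentences[j1:j2]:
                result.append((next(next_id), sentence, "new"))
        # "delete": the sentence no longer exists in the document.
    return result


old = [(1, "Open the hood."), (2, "Change the oil filter."), (3, "Close the hood.")]
new = ["Open the hood.", "Change the oil filter and the seal.", "Close the hood."]
for unit in carry_ids(old, new, count(4)):
    print(unit)
```

Only the sentence marked "new" would need to be sent for translation; the other two are carried over from the previous version of the document.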
SRX (Segmentation Rules eXchange) is the LISA OSCAR standard for defining and exchanging segmentation rules <http://www.gala-global.org/oscarStandards/srx/srx20.html>.
SRX uses an XML vocabulary to define the segmentation rules for a given language and to specify all of the exceptions. SRX uses Unicode regular expressions to achieve this standardization. The key benefit of SRX is not so much exchange, as the ability to create industry-wide repositories for the segmentation rules for each language. To this end, companies such as Heartsome, Max Programs, and XML-INTL have donated their own rule sets to LISA.
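The rule-plus-exception model can be sketched as follows. An SRX file expresses each rule in XML as a break or no-break flag with a "before break" and an "after break" regular expression; here the same idea is written as Python tuples for brevity. The two rules are illustrative rather than taken from any published rule set, and Python's `re` module stands in for the Unicode regular expression syntax that SRX specifies.

```python
import re

# (is_break, before-break regex, after-break regex), in priority order.
# Exceptions come first so that they win over the general rule.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|etc)\.", r"\s"),  # no break after common abbreviations
    (True, r"[.?!]", r"\s+"),                  # break after sentence-final punctuation
]


def segment(text):
    """Split text into segments; the first rule matching at a position decides."""
    breaks = []
    for pos in range(1, len(text)):
        for is_break, before, after in RULES:
            before_ok = re.search(f"(?:{before})\\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    breaks.append(pos)
                break  # lower-priority rules are not consulted at this position
    segments, start = [], 0
    for pos in breaks:
        segments.append(text[start:pos].strip())
        start = pos
    segments.append(text[start:].strip())
    return [s for s in segments if s]


print(segment("Dr. Smith changed the filter. Did it help? Yes."))
```

Because the abbreviation exception is listed first, "Dr." does not end a segment, while the full stop after "filter" does. Sharing rule sets of this kind per language is what the repositories mentioned above make possible.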
Unicode Technical Report 29
Unicode does not end with the encoding of character sets. The technical reports, which form part of the standard and are included as an annex, are equally important. TR #29 stands out as the way to define what constitutes words, characters, and punctuation. If you are writing a tokenizer for text, Unicode TR29 is where you start <http://www.unicode.org/reports/tr29/>.
Translation Memory eXchange (TMX) is the original standard from LISA <http://www.gala-global.org/oscarStandards/tmx/tmx14-20041007.htm>.
It helped break the monopoly that proprietary systems had over translation memory content. TMX allows customers to change systems and Language Service Providers without losing their TM assets.
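To give a feel for why TMX makes translation memory portable, here is a minimal sketch that reads a small TMX 1.4 document into a source-to-target lookup table. The sample content, the header values, and the function name `load_memory` are illustrative assumptions; inline markup inside segments is flattened to plain text.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: one translation unit with an English and a French variant.
TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Change the oil filter.</seg></tuv>
      <tuv xml:lang="fr"><seg>Changez le filtre à huile.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def load_memory(tmx_text, source, target):
    """Return {source segment: target segment} for one language pair."""
    root = ET.fromstring(tmx_text)
    memory = {}
    for tu in root.iter("tu"):
        # Collect the text of each language variant in this translation unit.
        segs = {
            tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
            for tuv in tu.iter("tuv")
        }
        if source in segs and target in segs:
            memory[segs[source]] = segs[target]
    return memory


print(load_memory(TMX, "en", "fr"))
```

Any tool that can read this structure can take over the memory, which is the point of the standard.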
Global Information Management Metrics eXchange (GMX) is a three-part standard from LISA concerning translation metrics. GMX/V treats the issue of what constitutes word and character counts, as well as allowing for the exchange of metric information within an XML vocabulary <http://www.gala-global.org/oscarStandards/gmx-v/gmx-v.html>.
Believe it or not, before GMX/V there was no standard for word and character counts. GMX/V defines a canonical form for counting words and characters in a transparent and unambiguous way. The two associated standards will be GMX/C for ‘complexity’ and GMX/Q for ‘quality’. These are still to be defined. Once the three GMX standards are available, they will provide a comprehensive way of defining a given localization task.
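A small experiment shows why an agreed counting method matters. The two functions below are both plausible ad hoc ways of counting words, and neither is the GMX/V algorithm; the sample sentence is made up. They disagree on the same text, which is the kind of ambiguity a canonical definition removes.

```python
import re

TEXT = "Re-fit the oil filter (part no. 42-A), then don't over-tighten it."


def count_by_whitespace(text):
    """Count whitespace-delimited tokens, as many word processors do."""
    return len(text.split())


def count_by_word_characters(text):
    """Count runs of letters and digits, splitting on hyphens and apostrophes."""
    return len(re.findall(r"\w+", text))


print(count_by_whitespace(TEXT))       # 11
print(count_by_word_characters(TEXT))  # 15
```

A difference of this size on a single sentence becomes significant when a translation job is priced per word.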
XML Localization Interchange File Format (XLIFF) is an OASIS standard for the exchange of data for translation <http://www.oasis-open.org/committees/xliff>.
Rather than having to send full, unprotected electronic documents for localization, with the inevitable problems of data and file corruption, XLIFF provides a lossless way of round-tripping text to be translated. LSPs, rather than having to acquire or write filters for different file formats or XML vocabularies, merely have to be able to process XLIFF files, which can include translation memory matching, terminology, and so on. Similarly, Computer Assisted Translation (CAT) tool providers have only the one format to deal with, rather than a spectrum of different original or proprietary exchange formats.
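The round trip can be sketched with a minimal XLIFF file and a translation memory lookup. The sketch assumes the XLIFF 1.2 layout and namespace; the file name `manual.dita`, the sample sentences, and the function `pretranslate` are hypothetical. Units found in the memory receive a target element, and the rest are reported as still needing translation.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
NS = {"x": XLIFF_NS}

# Hypothetical extraction of two sentences from a source document.
XLIFF = f"""<xliff version="1.2" xmlns="{XLIFF_NS}">
  <file original="manual.dita" source-language="en" target-language="fr" datatype="xml">
    <body>
      <trans-unit id="1"><source>Change the oil filter.</source></trans-unit>
      <trans-unit id="2"><source>Check the oil level.</source></trans-unit>
    </body>
  </file>
</xliff>"""


def pretranslate(xliff_text, memory):
    """Fill in targets from memory; return the tree and the ids left untranslated."""
    root = ET.fromstring(xliff_text)
    untranslated = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.find("x:source", NS).text
        if source in memory:
            # Exact match: add a <target> element next to the <source>.
            target = ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
            target.text = memory[source]
        else:
            untranslated.append(unit.get("id"))
    return root, untranslated


memory = {"Change the oil filter.": "Changez le filtre à huile."}
root, todo = pretranslate(XLIFF, memory)
print(todo)
```

The source document itself never leaves the publishing system; only the extracted text travels, and the identifiers on each unit allow the translations to be merged back afterwards.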
Putting it All Together
All of the above mentioned standards can be put together in the elegant architecture shown in Figure 2.
Figure 2: OAXAL Interaction of Standards
This architecture allows for a high degree of automation of the whole workflow for both authoring and localization, as shown in Figure 3.
Figure 3: Source Document Lifecycle
The key aspect with regard to authoring and OAXAL is the ability to maintain a completely automated list of changes to the document, as well as providing metrics in terms of word counts. In addition, OAXAL provides for the introduction of an Open Standards-based facility for Author Memory. Author Memory allows authors to access existing sentences that contain the same key words and to choose an existing sentence if appropriate, rather than writing something else from scratch. This choice can be accomplished in a number of ways: either in an automated fashion, where the system monitors what the author is writing, or in a manner that allows the author to see what exists by typing in key words such as “engine change oil filter” and seeing all relevant sentences that contain these key words.
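The key word lookup described above might look like the following minimal sketch. The sentence store and the function `find_candidates` are hypothetical, and matching is a plain "all key words present" test with no stemming or ranking, which a real Author Memory facility would be expected to improve on.

```python
# Hypothetical store of sentences that already exist in published documents.
SENTENCES = [
    "Change the oil filter when you service the engine.",
    "Check the coolant level.",
    "After an engine rebuild, change both the oil and the oil filter.",
]


def find_candidates(keywords, sentences):
    """Return the sentences that contain every key word, ignoring case."""
    wanted = set(keywords.lower().split())
    matches = []
    for sentence in sentences:
        # Strip surrounding punctuation so "filter." matches "filter".
        words = {w.strip(".,;:!?()\"'").lower() for w in sentence.split()}
        if wanted <= words:
            matches.append(sentence)
    return matches


for sentence in find_candidates("engine change oil filter", SENTENCES):
    print(sentence)
```

Offering the author the first and third sentences at this point avoids a third variant of the same instruction, and therefore a third translation of it.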
The localization OAXAL workflow is shown in Figure 4. The key point is to enable a high degree of automation within an Open Standards framework.
Figure 4: Document Localization Lifecycle
Why Does this Matter?
The answer is simple, but not necessarily obvious at first glance. The typical workflow for a localization task previously looked like the chart in Figure 5.
Figure 5: Traditional Translation Workflow
This workflow shows how the vast majority of localization tasks are conducted. Each arrow is a potential point of failure, as well as being very labor-intensive. At the ASLIB conference in 2002, Professor Reinhard Schäler of the Localization Research Center, Limerick University presented the standard model for the Localization Industry as shown in Figure 6.
Figure 6: True Cost of Localization
Over half the cost of a localization effort is consumed by project management. This workflow is very error-prone and labor-intensive.
With OAXAL you can automate the complete workflow as shown in Figure 7.
Figure 7: OAXAL-based Automated Localization Workflow
The OAXAL workflow provides considerable cost savings and improves speed, efficiency, and consistency. It also allows for a standard and consistent way of presenting the text to be translated via a browser interface, which further removes many manual processes. The current generation of Web 2.0 browsers allows for the creation of a fully functional translator workbench via a web browser, including the ability to have multiple translators working on the same file, auto-propagation of matches within the file being worked upon, and support for very large files.
A translator’s work is constantly saved as well as written to a translation memory database for immediate availability as outlined in Figure 8.
Figure 8: Browser-based Translator Workbench
OAXAL and DITA
So what relevance does this have to DITA? A great deal! OAXAL provides an integrated SOA (Service Oriented Architecture) approach to DITA in a production framework. Sure, you can achieve this with proprietary solutions, but with the same problem of vendor lock-in. Doesn’t it make sense to put an Open Standard like DITA into an Open Standards-based workflow context? The standards are all there, and they are not difficult to implement. OAXAL effectively provides DITA with an Open Standards framework that enables automation and a high degree of cost reduction.
So much for the theory! None of this would be convincing without commercial and Open Source implementations; without them it would be no more than an interesting academic exercise. OAXAL is backed up with real-life successful examples, including:
- XTM from XTM International <http://www.xtm-intl.com>
- Open Source implementation <http://okapi.sourceforge.net/>
Up to this year it would have been difficult to persuade people that an economic XML publishing solution would be available. Thanks to Open Standards within an OAXAL framework, this has been achieved.
You can get full details of the OAXAL specification from the OAXAL OASIS TC web page at <http://www.oasis-open.org/committees/oaxal/>.
Andrzej Zydron, CTO at XTM International, is one of the leading IT experts on localization and related Open Standards. Zydron has worked in IT since 1976 and has been responsible for major successful projects at Xerox, SDL, Oxford University Press, Ford of Europe, DocZone, and Lingo24 in the fields of document imaging, dictionary systems, and localization. Zydron is currently working on new advances in localization technology based on XML and linguistic methodology.