Keith Schengili-Roberts, Independent Consultant
All You Ever Wanted to Know About DITA Plug-ins
One of the great things about the design of the DITA Open Toolkit (DITA-OT) is that it is extensible; in other words, you can tack on other processes that further manipulate the data in your published documentation. The DITA-OT uses Ant to launch the various processes that, depending on what you want, produce XHTML output, a PDF, Eclipse Help or any of the other natively supported output types. But you can also insert targets into the Ant build that launch additional routines, extending its original functionality. These extensions to the DITA-OT are more commonly known as “plug-ins” and they provide additional features that can process the content of a published document in a specific way.
There are several open-source DITA-related plug-ins currently available. A good list of these can be found on the dita.xml.org site at http://dita.xml.org/wiki/plugins-for-the-dita-open-toolkit; it covers useful functions for some precise requirements, such as providing output support for particular specializations (like the Java APIRef Reference specialization or the Troubleshooting specialization). Others are designed to aid in the conversion process from other formats, such as FrameMaker or DocBook. Many of these plug-ins have been provided by IBM, and are actively maintained by their programming staff for the open source community. There are a few more generalized plug-ins available, including the Ancestors Plug-in, which adds breadcrumb links to XHTML output, so that users reading an online manual can more easily navigate through its hierarchy of pages. There’s also the HTMLSearch Plug-in, which provides the necessary underpinnings for a search mechanism for online XHTML help. There’s even the “Music of DITA” plug-in, which takes an iTunes music library listing (which is contained in an XML file) and converts it into DITA XML files that can then be published and viewed in a standard web browser. Yes, you can easily DITA-fy your music collection if you want to!
All of these plug-ins are good to have, but there are often more specific (and even more general) functions that technical writing departments might like to have in their DITA toolset. One of the things I have always advocated is that whatever can be automated in a publication process should be. Even the very best and most senior technical writers are fallible, and if a computer can take over a repetitive task that a person would otherwise have to do, which is the type of work most prone to error, all the better.
Another chief advantage of writing in DITA XML (or any XML-based documentation format, for that matter) is that everything is data and can potentially be used as an input into some other process. That’s where DITA plug-ins can come to the fore, by taking the data that is already contained within the topics of a map and transforming it in a way that provides an additional benefit to the end-user.
I’ll look at the thought process behind creating an index listing plug-in for PDF output that is currently being worked on by Running Bond Software.
Identifying the Need
One of the things that my business partner and veteran programmer Mark Zagrodney (also known as “The Guy Who Makes Things”) and I realized from our experience with DITA-based documentation processes is that there was no auto-indexing mechanism available. Certainly there are tags designed to handle the creation of an index, but we wondered why there wasn’t some sort of mechanism that could take a list of keywords and then apply them to a target document at output. Our thinking was: why have a technical writer laboriously go in and tag all of the index-able terms in a document when we could design and create a plug-in that could automate the process? But before any coding could begin, we had to sit down and sketch out (on a napkin, as it was a discussion that came up during lunch) the process flow so that we knew how it was going to work.
The first question was: do we want the plug-in to add <indexterm> tags to the actual DITA topics, or should it just generate an intermediate version which would be tagged and then output? After some discussion we decided on the latter for several reasons. If we tagged content in the actual topics, we might run into problems if the process tried to re-index the same term later (the nightmare of endlessly nested <indexterm>-tagged terms filled our heads). What if we wanted to index a term in one publication but not in another? And there is a programming principle that Mark reminded me of which says: “do no harm.” Using this principle as a guide, we preferred tagging a virtual version of the document on the off-chance our plug-in did something unexpected. If it was designed so that it did not actually change the source topic files in any way, the only place any error would appear would be in the PDF output. At worst the user would have to re-generate the PDF file. This way any problems could also be identified and fixed, and no actual “harm” would ever be caused to the original topics.
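The “do no harm” idea can be sketched in a few lines. This is only an illustration in Python, not the plug-in’s actual code; the function name make_working_copy and the directory prefix are hypothetical. It copies the source tree into a fresh temporary directory so that any tagging happens on the copy and never on the original topics.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(source_dir: Path) -> Path:
    """Copy the whole source tree into a new temp directory; return the copy's path.

    The name and the "autoindex-" prefix are hypothetical, chosen for this sketch.
    """
    work_root = Path(tempfile.mkdtemp(prefix="autoindex-"))
    work_dir = work_root / source_dir.name
    shutil.copytree(source_dir, work_dir)  # source_dir itself is only read
    return work_dir
```

If the tagging step goes wrong, the worst case is a bad copy that can be thrown away and regenerated from the untouched source.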
The next question was: how do we tell the plug-in what it should tag? The writers would need to have an understanding of the primary terms they might want to index and then be able to enter them in a form that the plug-in could recognize and process. The idea is not to produce a concordance for the document (i.e., make a list of all of the terms used in the document), but to produce a useful index so that an end-user wanting to look up a specific term would be able to find all instances where it appears in the document. While you could index all instances of the word “the” or “a” in a document, doing so would be of little use to the reader.
There were a couple of possible ways to tackle this situation from an input standpoint: we could create a separate UI mechanism that would store the terms in a separate list, or we could create a topic that contained a master list of all index-able terms. We chose the latter path, since a) creating and modifying a single topic was frankly easier and arguably more robust than creating a new program for storing and modifying a list, and b) technical writers are already familiar with editing topics and would implicitly understand that a term added to the index topic would be reflected in output. In our minds, it was a classic case of “why re-invent the wheel?” when DITA already provided us with what we needed. While the details are still being worked out, the plan is to specialize a topic (which would be recognized by the plug-in) containing a list of index-able words, which a technical writer would have to create.
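To make the idea of an index topic concrete, here is a minimal sketch in Python of how such a topic might be read. Since the specialization details are still being worked out, everything about the markup is an assumption made for illustration: the outputclass="indexlist" marker, the plain <ul>/<li> layout, and the function name read_index_terms are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical index topic: an ordinary topic flagged with outputclass="indexlist",
# holding one term per list item. The real specialization may look quite different.
INDEXLIST_TOPIC = """\
<topic id="indexlist" outputclass="indexlist">
  <title>Index terms</title>
  <body>
    <ul>
      <li>via</li>
      <li>solder mask</li>
      <li>  ground plane </li>
      <li>via</li>
    </ul>
  </body>
</topic>
"""


def read_index_terms(topic_xml: str) -> list[str]:
    """Return the list items as terms: whitespace normalized, duplicates dropped."""
    root = ET.fromstring(topic_xml)
    terms: list[str] = []
    for item in root.iter("li"):
        term = " ".join("".join(item.itertext()).split())
        if term and term not in terms:
            terms.append(term)
    return terms
```

Because the list is just a topic, a writer can maintain it in the same editor used for everything else, which is the point of not building a separate UI.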
Initially we thought we would need only one master list for all of the terms that might need indexing. A master list would make things easier in one respect, since we could simply apply it universally to every document that we would output. But after some further thought we realized that there might be cases where an index-able term in one document would not be something you might want to index in another document. Indexing the word “via” might make sense in an electrical engineering document, because it is the term for a plated hole that connects the layers of a circuit board, but if you wanted to publish a document using the same process that mentions that “you can get to Woodbine subway station via the Bloor subway line” then there’s little point in indexing the same term. So we settled on one index topic per document, included in that document’s map and recognized by the plug-in.
Finalizing the Process Flow
The process flow we decided upon was this: the technical writer creates a specialized “indexlist” DITA topic which contains a simple list of terms that are to be indexed. This list is appended to the map of the document to be indexed as a specialized topic. When a publish-to-PDF process is launched by the writer, the Ant task launches our plug-in, which creates a virtual version of the whole document, and then systematically adds <indexterm> tags to its content. When the plug-in finishes making its way through the list, it deletes the virtual copy of the indexlist topic so that it is not included in the document, and then goes on to publish the indexed PDF.
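One step of this flow, removing the indexlist topic from the virtual copy of the map before publishing, might look like the following. This is a hedged sketch in Python rather than the plug-in’s implementation; it assumes (hypothetically) that the index topic’s topicref carries outputclass="indexlist", and the function name detach_indexlist is invented for the example.

```python
import xml.etree.ElementTree as ET

# A small stand-in for the working copy of a map; file names are made up.
WORKING_MAP = """\
<map>
  <title>Board Design Guide</title>
  <topicref href="overview.dita"/>
  <topicref href="routing.dita">
    <topicref href="vias.dita"/>
  </topicref>
  <topicref href="indexlist.dita" outputclass="indexlist"/>
</map>
"""


def detach_indexlist(map_root: ET.Element):
    """Remove the first topicref flagged as the index list; return its href or None.

    Only the in-memory working copy is changed. The outputclass convention is an
    assumption of this sketch.
    """
    for parent in map_root.iter():
        for child in list(parent):
            if child.tag == "topicref" and child.get("outputclass") == "indexlist":
                parent.remove(child)
                return child.get("href")
    return None


map_root = ET.fromstring(WORKING_MAP)
index_href = detach_indexlist(map_root)  # where to read the terms from
to_publish = [ref.get("href") for ref in map_root.iter("topicref")]
```

The returned href tells the process where to find the terms, while the remaining topicrefs are the topics that still get tagged and published.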
There are some additional issues that need to be worked out, such as setting exclusion conditions where you might not want to index particular terms depending on placement, such as in a running header or footer in a document, or where a term appears within a <codeblock>. While the original topics are never altered by the plug-in, the virtual copy of the map and its topics created by the indexing process still needs to be valid DITA XML for it to render properly, so the plug-in still needs to “tread carefully.”
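The exclusion idea can be sketched as a tree walk that simply refuses to descend into certain elements. The Python below is an illustration under stated assumptions, not the plug-in’s code: it uses DITA’s standard <indexterm> element as the index tag, the SKIP_TAGS list is a hypothetical set of exclusions (titles stand in for text that may feed running headers), and the function names are invented. It also does not validate the result against the DITA DTDs, which a real implementation would need to consider.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical exclusions: code samples, titles, and existing index entries
# (skipping the last one avoids nesting an indexterm inside an indexterm).
SKIP_TAGS = frozenset({"codeblock", "title", "indexterm"})


def compile_terms(terms):
    """Build one case-insensitive, whole-word pattern; longer terms are tried first."""
    if not terms:
        raise ValueError("at least one index term is required")
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", re.IGNORECASE)


def _markers_for(text, pattern):
    """Cut text after each match; return (leading text, new indexterm elements)."""
    matches = list(pattern.finditer(text or ""))
    if not matches:
        return text, []
    ends = [m.end() for m in matches] + [len(text)]
    markers = []
    for i, match in enumerate(matches):
        marker = ET.Element("indexterm")
        marker.text = match.group(0)
        marker.tail = text[ends[i]:ends[i + 1]]  # text up to the next cut
        markers.append(marker)
    return text[:ends[0]], markers


def add_index_markers(element, pattern, skip=SKIP_TAGS):
    """Insert an indexterm after every match outside skipped elements; return the count."""
    if element.tag in skip:
        return 0
    added = 0
    # Visit children last to first so inserts never shift a position still to be visited.
    for pos in range(len(element) - 1, -1, -1):
        child = element[pos]
        added += add_index_markers(child, pattern, skip)
        # Text following a child belongs to this element, even if the child was skipped.
        child.tail, markers = _markers_for(child.tail, pattern)
        for offset, marker in enumerate(markers, start=1):
            element.insert(pos + offset, marker)
        added += len(markers)
    element.text, markers = _markers_for(element.text, pattern)
    for offset, marker in enumerate(markers):
        element.insert(offset, marker)
    return added + len(markers)
```

Running this twice over the same tree would add duplicate entries, which is one more reason to start each publish from a fresh copy of the source.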
By the time this article makes it to print, we hope to have a beta version of the indexing plug-in available for testing from the runningbondsoftware.com site. We are looking at creating additional plug-ins for those trying to automate other aspects of their DITA-based documentation processes, each of which will have to go through the same type of thought process as described here.
The lunch that Mark and I had thankfully didn’t involve any messy foods, so our napkin with the blueprint for the plug-in’s process flow did not have to serve double-duty that day. 😉
About the Author
Keith Schengili-Roberts is an independent DITA consultant/trainer, and is “The Guy Who Talks to People” for Running Bond Software. He was formerly the Manager for Documentation and Localization at AMD, where he successfully deployed a DITA-based Content Management System throughout its graphics engineering division. Keith is also an award-winning lecturer on Information Architecture and Information Management at the University of Toronto’s Faculty of Information Sciences. He has given many talks on DITA production metrics and best practices, and was one of the founders of the Semiconductor DITA Implementers Group (SDIG). He also runs a popular industry blog at DITAWriter.com. Keith is the author of four professional technical titles, the most recent being Core CSS, 2nd Edition.