The Black Box: Converting Legacy Documentation

CIDM Best Practices Newsletter

February 2002


Tools of the Trade: Part Two of a Series


David Walske, David Walske, Inc.

Pop Quiz

A train leaves Chicago at 10:00 am traveling west at 45 miles per hour. A train leaves Los Angeles at 10:30 am traveling east at 33Ä miles per hour. Then a miracle occurs. At what point will the two trains pass each other?

Even with a slide rule, a Cray supercomputer, and a steamer trunk full of Amtrak schedules, the precise answer to this question could elude you indefinitely. You might as well posit the number of angels that can dance on the head of a pin. And yet some organizations embark upon the daunting task of converting their unstructured documentation base to Extensible Markup Language (XML) or Standard Generalized Markup Language (SGML) under a plan centered on a murky deus ex machina process that is no less indefinable.

This kind of process is sometimes referred to as a “black box” (see the figure below). When you operate a system that includes a black box, you do so with no specific knowledge of and with absolute blind faith in the operations that occur within its mysterious realm. Data goes in, a miracle occurs, and data comes out. No wonder documentation managers are hesitant to dive into the pool.


Figure 1: The “Black Box” Process

Successful content management, the kind that promotes content reuse and repurposing to multiple output formats, requires well-ordered structure. Tagged text languages, such as XML and SGML, can provide order through the alignment of documents around a common structural design. This much is clear: the orderly world of structured documents is the Promised Land. How you get there is not so clear.

Though you walk through the valley of chaos, fear no evil. I can’t help you answer the question about the two trains, but just maybe I can point you toward the path of glorious structured documents.

This, the second article in a series, continues a journey of discovery. In the first article, “Braving the Wilds of Document Structure Specification,” we began our exploration by using Adobe FrameMaker+SGML to amplify and illuminate the inherent structure in existing documentation to produce a schema. Future articles in this series will include discussion of tools such as SoftQuad XMetaL, TIBCO Extensibility XML Authority, and others. In this article, we’ll continue with FrameMaker+SGML as we journey bravely into the desert of document conversion.

Structuring Content with Information Models

Previously, we worked with a very simple document to learn the process of divining structure from existing documentation to create a schema. We then created a FrameMaker+SGML Element Definition Document (EDD) and explored its parallel Document Type Definition (DTD). Finally, we applied the structural rules of our schema in FrameMaker+SGML, manually converting our unstructured document to a structured document.

Measured in real time, this process would have taken, depending upon skill and experience, anywhere from two to four hours from start to finish. Once a schema has been defined and the EDD created and refined, this conversion time would decrease considerably for subsequent documents that follow the same schema. Even so, our sample document was quite brief. A medium-sized documentation base can easily run to a million words or more. Investing the kind of time required to manually convert documents, wrapping each chunk of content in SGML or XML tags, would be absurd in the extreme. Fortunately, there is a better way.

Automatic conversion
Automatic conversion is a relative concept. That is to say that the very best document conversion methods will do about 80 percent of the work, and that only on a good day. And above all, the rule of GIGO (garbage in, garbage out) applies. When we manually converted our sample document in the previous article, we had the distinct advantage of human judgment every step of the way. When you convert automatically, you depend upon the judgment and intelligence that is built into both the conversion process and the documents that are to be converted.

Because of its simplicity, we could easily convert our previous sample document using automatic conversion. We know exactly what to expect from it and how to process what we find. But this scenario is pretty far from the real world of documentation. Instead, we’ll develop an information model that anticipates content as yet unwritten. Next, we’ll create a sample document that conforms to our information model. Then, we’ll use FrameMaker+SGML to create a conversion table. And finally, we’ll use that newly created conversion table to automatically transform our sample document into a structured document.

Defining an Information Model
Information models don’t exist in a vacuum. To demonstrate the creation of an information model, we’ll need to work with sample content requirements. A common software documentation requirement is for installation instructions. We’ll use this as the basis of our sample information model and the sample content that follows. As before, we’ll start by analyzing content. Normally you would have an existing documentation base analysis to help guide your judgments, having perhaps already called upon the help of a content audit expert. For purposes of this article, we’ll work a bit more off the cuff.

The purpose of our sample content is to assist software users in the installation of a specific software product. To fulfill this requirement, we’ll have to provide information and instructions.

To install the software product, the user will need answers to the following questions:

  • What does this software product do? Am I installing the correct software?
  • What are the minimum system requirements? Do I have what I need to install the software?
  • What is new in this version? Is there anything that I should know before upgrading from a previous version I may have installed?
  • How do I install the software? What are the specific steps required to install the software?

Answering these questions here is not necessary. That’s the writer’s job. But we do need to know that we are asking the right questions. This is where a content audit provides invaluable guidance and direction. For the purpose of this article, we’ll assume these are the right questions and move on to defining an information model.


Figure 2: An Information Model

Building a schema
Now we’re ready to specify document structure. From our information model, we’ll specify a document structure that will be the basis for the creation of a FrameMaker+SGML EDD (see the figure).


Figure 3: A Document Structure

The highest-level element is labeled Install and contains four sub-elements. These sub-elements, Overview, SysReq, WhatsNew, and HowTo, also contain sub-elements of their own. The relationships of these elements to each other are often expressed in terms such as parent, child, sibling, and so on. That is to say, Install is parent to Overview, while Overview and SysReq are siblings that are both children of Install.
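To make the parent, child, and sibling vocabulary concrete, here is a minimal Python sketch that builds the top of this hierarchy with the standard library's ElementTree module. Only the elements named in the text are included. The `parent_of` map and the `siblings` helper are illustrative names of our own and have nothing to do with FrameMaker+SGML itself.

```python
import xml.etree.ElementTree as ET

# Build the top of the hierarchy described in the text.
install = ET.Element("Install")
for name in ("Overview", "SysReq", "WhatsNew", "HowTo"):
    ET.SubElement(install, name)

# Overview contains Title and Body, per its content description.
overview = install.find("Overview")
ET.SubElement(overview, "Title")
ET.SubElement(overview, "Body")

# ElementTree elements do not store a parent pointer, so map child -> parent.
parent_of = {child: parent for parent in install.iter() for child in parent}


def siblings(element):
    """Return the tags of the other children that share this element's parent."""
    parent = parent_of.get(element)
    if parent is None:
        return []  # the root element has no parent and therefore no siblings
    return [other.tag for other in parent if other is not element]


sysreq = install.find("SysReq")
print(parent_of[overview].tag)  # the parent of Overview
print(siblings(sysreq))         # the siblings of SysReq
```

Because position in the tree carries meaning in a structured document, questions such as "what is this element's parent?" become simple lookups.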

At this stage of the schema development, there is a certain amount of subjectivity. For example, an argument could be made for promoting the Title element to be a direct child of Install. This is a simple example of the kind of judgment calls a schema designer must make.

Using standard SGML notation, the contents of Install can be described as:

Overview, SysReq, WhatsNew, HowTo

The description for Overview is as follows:

Title, Body

Several elements contain sub-elements labeled Item. The description for Item is as follows:

Para+

This description indicates that the Item element must contain at least one Para element.

The Para element is described as follows:

<TEXT>

This description indicates that text is contained directly in the Para element.

The description for OrderedList is as follows:

ListIntro, Item, Item+

These descriptions are written in standard SGML notation, which was covered in some detail in the first article of this series. For more information, see “Braving the Wilds of Document Structure Specification” on pages 134-139 in the December 2001 issue of Best Practices.
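As a rough illustration of what these descriptions enforce, the following Python sketch translates each comma-separated description into a regular expression over child element names and checks a sequence of children against it. It is a simplification under stated assumptions: it handles only ordered sequences with the +, ?, and * occurrence indicators, it omits the text-only Para description, and the names `CONTENT_MODELS`, `model_to_regex`, and `is_valid` are our own inventions rather than part of any SGML tool.

```python
import re

# Content descriptions copied from the text above.
CONTENT_MODELS = {
    "Install": "Overview, SysReq, WhatsNew, HowTo",
    "Overview": "Title, Body",
    "Item": "Para+",
    "OrderedList": "ListIntro, Item, Item+",
}


def model_to_regex(model):
    """Translate a comma-separated description into a regex over child names."""
    parts = []
    for token in model.split(","):
        token = token.strip()
        name = token.rstrip("+?*")      # element name without its indicator
        indicator = token[len(name):]   # "+", "?", "*", or "" for exactly one
        # Each child name is followed by a space so names cannot run together.
        parts.append(f"(?:{re.escape(name)} ){indicator}")
    return re.compile("".join(parts))


def is_valid(element, children):
    """Check whether the ordered child names satisfy the element's description."""
    pattern = model_to_regex(CONTENT_MODELS[element])
    sequence = "".join(child + " " for child in children)
    return pattern.fullmatch(sequence) is not None


print(is_valid("Item", ["Para", "Para"]))
print(is_valid("OrderedList", ["ListIntro", "Item"]))
```

Note that the OrderedList description requires an Item followed by at least one more Item, so a list with a single step fails the check.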

Using judgment and tacit knowledge of our software product, we have created an information model to serve a specific documentation need. From that information, we devised a structural design or schema. Before we can produce FrameMaker+SGML structured documents that can be validated against our information model, we’ll need an EDD based on that model.

You can download a copy of the EDD for the information model we’ve just specified, as well as all of the sample files mentioned in this article, at www.walske.com.

Converting Unstructured Documents

Ideally, structured documents are born, not converted. That is to say, writers in a documentation team should create new documents as structured documents, eliminating the need for back-end conversion. Of course, in the real world this is not always the case. Likely, you have a large base of legacy documentation in the form of unstructured documents. And because life goes on, your team is probably creating more unstructured documents even as you read these words. Maybe your team or company is not quite ready yet to start cranking out structured documents because of a lack of funds for new software or training or both. That doesn’t mean you can’t start producing documents that conform to an information model.

Writing unstructured for structured
Without retooling to a structured document production environment, you can make certain changes to your unstructured documents to support ease of conversion later. The automatic conversion process we’ll explore shortly uses the implied structure of your unstructured documents to transform them into structured documents. The more your documents follow rules of structure, the more efficient the conversion process will be. That’s why it is important to establish information models as soon as possible, well before you are ready to enforce the models with a schema.

Creating sample content
Next, we’ll create a short unstructured document for a fictional software product based on the information model that we just developed (see the figure below).


Figure 5: Sample Content

Creating a conversion table
FrameMaker+SGML has a built-in facility for converting unstructured FrameMaker documents to structured FrameMaker+SGML documents. To use this function, we’ll need to create a conversion table that tells FrameMaker+SGML how to wrap the content in our document into elements and then how to arrange those elements. To create structure, the conversion table process uses the paragraph tags associated with the paragraphs that make up your unstructured FrameMaker document.

The conversion table is a simple three-column table contained within an ordinary FrameMaker document (see the tables below). The first column specifies paragraphs of content by paragraph tag name. The second column specifies an element in which to wrap each paragraph. The third column specifies a qualifier, which is a temporary identifier used for further processing.

For clarity, our conversion table is organized into several smaller tables contained in a single FrameMaker file. You can think of each of these smaller tables as a processing pass. The first table below wraps the lowest-level items and applies qualifiers.

The Lowest-Level Elements

Wrap this object or objects    In this element    With this qualifier
P:Title                        Title
P:Body                         Para               Body
P:Heading1                     Heading            Heading1
P:BulletItem                   Para               BulletItem
P:Heading2                     Heading            Heading2
P:HowTo                        Para               HowTo
P:StepItem                     Para               StepItem

As each row is processed, paragraphs that are tagged as specified in the first column are each wrapped in an element as specified in the second column. In the sample content, the title paragraph, “Installing the PrintRight Laser Printer Optimizer,” is wrapped in a Title element, while several other paragraph types are each wrapped in a Para element. The third column applies a qualifier, which we’ll use next to specify a parent element in which to wrap each of the Para elements.
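The first pass amounts to a lookup from paragraph tag to element and qualifier. The Python sketch below mimics it outside of FrameMaker+SGML, purely for illustration. The rule values come from the table above, but the function name, the dictionary layout, and the handling of unmatched paragraphs are our own assumptions and do not describe how FrameMaker+SGML works internally. Apart from the title, the sample text is placeholder content.

```python
# Rows of the first conversion table: paragraph tag -> (element, qualifier).
LOWEST_LEVEL_RULES = {
    "Title": ("Title", None),
    "Body": ("Para", "Body"),
    "Heading1": ("Heading", "Heading1"),
    "BulletItem": ("Para", "BulletItem"),
    "Heading2": ("Heading", "Heading2"),
    "HowTo": ("Para", "HowTo"),
    "StepItem": ("Para", "StepItem"),
}


def wrap_paragraphs(paragraphs):
    """Wrap each (paragraph_tag, text) pair in an element and keep its qualifier."""
    wrapped = []
    for tag, text in paragraphs:
        # Assumption of this sketch: paragraphs with no matching row stay
        # unwrapped (element None) so that a person can review them later.
        element, qualifier = LOWEST_LEVEL_RULES.get(tag, (None, None))
        wrapped.append({"element": element, "qualifier": qualifier, "text": text})
    return wrapped


sample = [
    ("Title", "Installing the PrintRight Laser Printer Optimizer"),
    ("Body", "Placeholder overview text."),    # hypothetical content
    ("StepItem", "Placeholder first step."),   # hypothetical content
]

for node in wrap_paragraphs(sample):
    print(node)
```

The Body and StepItem paragraphs both become Para elements, and only the qualifier records which kind of paragraph each one started out as.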

The second conversion table below shows how the qualifier is used to wrap each of the Para elements in a parent element that supports our information model and EDD. In its first row, for example, the qualifier allows us to single out the Para elements that contain Body paragraph text and then wrap those elements in a parent Body element. The prefixes E: and P: specify whether the values we enter in column one are element names or paragraph tag names. The qualifiers allow us to perform complex manipulation of the elements we create. The new qualifiers assigned in this table, Bullet and Step, come into play when we wrap two kinds of lists in the third conversion table.

The Para Elements

Wrap this object or objects    In this element    With this qualifier
E:Para[Body]+                  Body
E:Para[BulletItem]             Item               Bullet
E:Para[StepItem]               Item               Step
E:Para[HowTo]                  ListIntro
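The first-column entries follow a small, regular pattern: a P: or E: prefix, a name, an optional qualifier in square brackets, and an optional occurrence indicator. The Python sketch below parses only the forms that appear in this article. The regular expression and the `parse_cell` function are our own illustration, and real conversion tables may allow syntax that this pattern does not cover.

```python
import re

# Prefix, name, optional [qualifier], optional occurrence indicator.
ENTRY = re.compile(
    r"(?P<kind>[EP]):(?P<name>\w+)"
    r"(?:\[(?P<qualifier>\w+)\])?"
    r"(?P<occurrence>[+?*]?)"
)


def parse_cell(cell):
    """Split a first-column cell into one dictionary per comma-separated entry."""
    specs = []
    for part in cell.split(","):
        match = ENTRY.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {part!r}")
        specs.append(match.groupdict())
    return specs


print(parse_cell("P:BulletItem"))
print(parse_cell("E:Para[Body]+"))
```

In the parsed result, a kind of P marks a paragraph tag from the unstructured source, while E marks an element created by an earlier pass.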

Remember that all of the paragraph values (values with the P: prefix) in column one are derived from the unstructured source document. Our source document assigns a paragraph tag of BulletItem to bulleted list items and a paragraph tag of StepItem to numbered list items. Our schema, however, labels all list item elements Item. This represents a fundamental difference in the way this kind of information is represented in structured and unstructured documents. The unstructured document assigns a bulleted or numbered paragraph style directly. The structured document represents this information based on the element’s position in relation to other elements. If the Item element is a child of an UnorderedList element, then the text that it contains appears bulleted. If the Item element is a child of an OrderedList element, then the text that it contains appears numbered. Notice how the use of qualifiers helps preserve this distinction (see the third table below).

The List Elements

Wrap this object or objects              In this element    With this qualifier
E:Heading[Heading2]?, E:Item[Bullet]+    UnorderedList
E:ListIntro, E:Item[Step]+               OrderedList
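To see why the qualifiers matter at this stage, consider the Python sketch below, which applies the two list rules to a flat run of elements. Both kinds of list are made of Item elements, so only the Bullet or Step qualifier tells the matcher which parent to create. The greedy matcher, the dictionary-based nodes, and all function names are assumptions made for this illustration and say nothing about how FrameMaker+SGML evaluates a conversion table internally.

```python
# Each rule: a sequence of (label, quantifier) pairs and the parent to create.
# Quantifiers: "1" exactly one, "?" zero or one, "+" one or more.
LIST_RULES = [
    ([("Heading[Heading2]", "?"), ("Item[Bullet]", "+")], "UnorderedList"),
    ([("ListIntro", "1"), ("Item[Step]", "+")], "OrderedList"),
]


def label(node):
    """Render a node as Element or Element[Qualifier] for matching."""
    qualifier = node.get("qualifier")
    return f"{node['element']}[{qualifier}]" if qualifier else node["element"]


def match_rule(pattern, labels, start):
    """Greedily match pattern at start; return the end index or None."""
    pos = start
    for wanted, quantifier in pattern:
        count = 0
        while pos < len(labels) and labels[pos] == wanted:
            pos += 1
            count += 1
            if quantifier in ("1", "?"):
                break  # these quantifiers consume at most one element
        if count == 0 and quantifier != "?":
            return None  # a required element is missing
    return pos


def wrap_lists(nodes):
    """Replace each run that matches a rule with a single parent node."""
    labels = [label(node) for node in nodes]
    result, pos = [], 0
    while pos < len(nodes):
        for pattern, parent in LIST_RULES:
            end = match_rule(pattern, labels, pos)
            if end is not None:
                result.append({"element": parent, "children": nodes[pos:end]})
                pos = end
                break
        else:
            result.append(nodes[pos])  # no rule applies, so keep the node as is
            pos += 1
    return result


flat = [
    {"element": "Heading", "qualifier": "Heading1"},
    {"element": "Heading", "qualifier": "Heading2"},
    {"element": "Item", "qualifier": "Bullet"},
    {"element": "Item", "qualifier": "Bullet"},
    {"element": "ListIntro", "qualifier": None},
    {"element": "Item", "qualifier": "Step"},
    {"element": "Item", "qualifier": "Step"},
]

for node in wrap_lists(flat):
    print(node["element"], len(node.get("children", [])))
```

The Heading1 heading is left alone because neither rule claims it, while the Heading2 heading is absorbed into the UnorderedList that follows it.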

Each table builds on the previous one, working up the hierarchy. Subsequent passes rely increasingly on the expected order of elements instead of qualifiers to wrap each level, including the highest-level element, Install. Notice that the fourth conversion table uses no qualifiers at all because the processing decisions at this point are based on expected order of content.

The Highest-Level Elements

Wrap this object or objects                  In this element    With this qualifier
E:Title, E:Body+                             Overview
E:Heading, E:Body+, E:UnorderedList+         WhatsNew
E:Heading, E:UnorderedList                   SysReq
E:OrderedList+                               HowTo
E:Overview, E:SysReq, E:WhatsNew, E:HowTo    Install

Using a conversion table to apply structure
Because much of the processing performed by the conversion table depends upon predictable structure, it is vitally important that authors maintain good structure in their writing. Following an agreed-upon information model is the best assurance that consistent structure is maintained.

Once you’ve created a conversion table, there are three steps to transforming your unstructured documents into structured documents. First, you’ll need to import the structural elements contained in your EDD into your unstructured FrameMaker documents. Then, you’ll use the utilities included in FrameMaker+SGML to apply the processing rules contained in your conversion table. Finally, because automated conversion is never 100 percent effective, you’ll need to manually validate your document, making corrections where necessary.

Further Exploration

This article is necessarily a somewhat cursory overview of FrameMaker+SGML, XML, and schema development. The next article in this series explores XML translation and transformation. Future articles will investigate native SGML and XML in greater depth. We’ll also compare and contrast XML with SGML.
