CIDM Best Practices Newsletter, June 2012


Migrating to DITA: How automated content conversion works and why it matters to you


Patrick Baker, Stilo International

An unavoidable part of moving to DITA, or any other structured authoring system, is converting your existing content into the new format. Most organizations that make the move to structured writing have to make the change while still continuing to meet their regular delivery schedules. This need means you must convert content and get it up and running correctly in the new system between two product cycles and usually without much in the way of added staff or resources.

Content conversion is a key part of your migration strategy, and the quality and completeness of that conversion is essential to getting the migration done within the constraints of your schedule. However, content conversion is a bit of a black box for many people. It is hard for writers and managers to anticipate how difficult the conversion process is going to be, how long it is going to take, how much it is going to cost, and how much clean up of the output is going to be required after the conversion is done. The purpose of this article is to lift the lid off that black box. In particular, understanding how automated conversion works will help you form a more reasonable expectation about how your own conversion project is going to go and what you can do to make it go more smoothly.

No automated conversion is ever 100 percent clean, but the difference between an 80 percent clean conversion and one that is 95 percent clean is huge: the first leaves 20 percent of the content to fix by hand and the second only 5 percent, a fourfold difference in cleanup costs. What makes the difference between 80 percent clean and 95 percent clean? Between 95 percent clean and 98 percent clean? Such outcomes certainly depend upon how well managed the conversion process is. However, having the right approach to content conversion is of critical importance.

Knowledge is the Key to Intelligent Content Conversion

There are three essential mechanisms that content conversion technology may leverage. They are

  • patterns
  • context
  • guided conversion

When encoding content in a semantically rich format such as DITA, it is important to understand the meaning of the content in order to apply the correct tags. While people can understand the full meaning of the text they are reading, a computer cannot, at least not very deeply. What a computer is exceedingly good at is recognizing patterns in the content. But patterns don’t provide the full solution. Patterns, when found in a given context, reveal much more about the meaning of the text in question. A sequence of 5 digits, for example, may represent a zip code in the context of a US postal address or an ICD-9 diagnostic code in the health care sector. Guided conversion is supported by the provision of high-level mapping rules that hint at the current context so that patterns are interpreted correctly by the automated conversion tool. Compiling these hints depends on having an intimate familiarity with the document set destined for conversion. It is the content owners, armed with this content knowledge, who are best positioned to specify the mapping rules.

Patterns

Patterns are everywhere in content. Patterns occur both in the content itself and in the file format that contains the content. The foundation of all content conversion tools is the ability to recognize patterns.

People also use patterns to recognize things in content. For instance, a reader will immediately recognize what these numbers mean based on their pattern:

9/30/12

+1 (613) 745-4242

$65.12

Software can recognize these patterns as well, so if your target format requires semantic markup for dates, telephone numbers, or monetary amounts, a simple pattern-matching algorithm can find them and supply the markup. For example, a conversion program could recognize the sequence:

“+” numbers space “(” number*3 “)” space number*3 “-” number*4

It can then capture each number sequence in this pattern and write it out using whatever XML format you choose for phone numbers, for instance:

<phone-number country="1" area="613" exchange="745" number="4242"/>
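The pattern and output above can be sketched in a few lines of Python. This is only an illustration: the function name `mark_up_phone_numbers` is hypothetical, the regular expression is a direct transcription of the pattern in the text, and the surrounding text is assumed not to need XML escaping.

```python
import re
from xml.sax.saxutils import quoteattr

# "+" digits, space, "(" 3 digits ")", space, 3 digits, "-", 4 digits
NA_PHONE = re.compile(r"\+(\d+) \((\d{3})\) (\d{3})-(\d{4})")


def mark_up_phone_numbers(text: str) -> str:
    """Replace each matched phone number with a <phone-number/> element."""

    def to_element(match: re.Match) -> str:
        # Each captured digit sequence becomes one attribute value.
        country, area, exchange, number = match.groups()
        return (
            f"<phone-number country={quoteattr(country)} "
            f"area={quoteattr(area)} exchange={quoteattr(exchange)} "
            f"number={quoteattr(number)}/>"
        )

    return NA_PHONE.sub(to_element, text)


print(mark_up_phone_numbers("Call +1 (613) 745-4242 today."))
```

Text that does not match the pattern is passed through unchanged, which is why a single rigid pattern tends to miss numbers written in other styles.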

Of course, recognizing phone numbers is a bit more complicated than this. For one thing, people do not always include the country code when they write a phone number. People often omit the parentheses and the dash from the number, especially when the country code is used. This is where local knowledge of your content comes in: if you have a corporate style for phone numbers, you can tell your conversion software exactly what to look for. Otherwise, the conversion program can use multiple patterns to detect phone numbers in different formats.

Also, this pattern only works for North American phone numbers. Many other countries write their phone numbers differently. In this case, we can use a context clue to improve our detection of phone numbers. For instance, we can use the country code to determine which pattern to expect. The following pattern detects a UK phone number:

“+44” numbers-and-spaces

UK phone numbers use a different format from North America (details available at http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom), so our original pattern will not detect them correctly. A conversion program can detect phone numbers as a two-step process. First you detect the country code to determine which country the number belongs to, then you select a pattern appropriate to the chosen country to fully analyse the number.
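A minimal sketch of the two-step approach follows. The table `PATTERNS_BY_COUNTRY` and the function `parse_phone` are hypothetical names, and both patterns are deliberately simplified: the UK pattern only checks for digits and spaces after the country code, as in the pattern given above.

```python
import re
from typing import Optional

# Step 2 patterns, keyed by the country code detected in step 1.
PATTERNS_BY_COUNTRY = {
    "1": re.compile(r"\+1 \((\d{3})\) (\d{3})-(\d{4})"),
    "44": re.compile(r"\+44([\d ]+)"),
}


def parse_phone(candidate: str) -> Optional[dict]:
    """Return the country code and digit groups, or None if unrecognized."""
    # Step 1: detect the country code, trying longer codes first.
    for code in sorted(PATTERNS_BY_COUNTRY, key=len, reverse=True):
        if candidate.startswith("+" + code):
            # Step 2: apply the pattern chosen for that country.
            match = PATTERNS_BY_COUNTRY[code].fullmatch(candidate)
            if match is None:
                return None
            parts = [group.replace(" ", "") for group in match.groups()]
            return {"country": code, "parts": parts}
    return None


print(parse_phone("+1 (613) 745-4242"))
print(parse_phone("+44 20 7946 0958"))
```

Supporting another country in this sketch means adding one entry to the table rather than changing the detection logic.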

You can expect support for matching common patterns, such as phone numbers, to be built in to conversion software. However, it should be easy to extend the system with new patterns specific to the vocabulary of a particular domain.

Context

Patterns, though they are an indispensable part of automated conversion, cannot on their own address the challenge of imparting to the content the depth of meaning or understanding required for the intelligent application of semantic markup. This is where context comes in.

For example, consider a list in a FrameMaker document. In FrameMaker, while a table is a distinct type of object, a list is not. In FrameMaker, you create a list simply by adding a bullet or number style to a set of paragraphs. The result is something that looks like a list in the output. However, the FrameMaker file format does not record the fact that this is a list. The human eye can see the list in the output, but it is a little more challenging for a conversion program to figure out where a list begins and ends and what belongs to each item in a list.

Why does the conversion program have to figure out where the list begins and ends? Because most XML formats treat lists as distinct objects. When an XML document is styled, the style is generally applied to the list as a whole rather than to the individual paragraphs in the list. This is usually the only way that an XML-based system provides for styling lists, so if the conversion software does not recognize the list in the source and create a proper XML list element in the output, chances are that the list will not be styled properly in the final output.

Example: A nested list

Quick-drop cookies

1. Prepare the dough.

a. Beat the egg in a large bowl.

b. Add flour.

c. Stir in milk.

2. Prepare the topping.

a. Mix brown sugar and cinnamon
in another bowl.

3. Form 1-inch round balls of dough.

It is helpful to use a spoon when
forming these balls.

4. Roll each ball in the topping.

5. Place each ball on an ungreased
cookie sheet.

Bake at 425 °F for 12 to 15 minutes.

This is the kind of construct that often occurs in complex procedures in technical documentation. The conversion program has to deal with multiple paragraphs within a single list item, as well as nested lists.

In this example, a paragraph that begins with a numeral indicates a first-level list item, while a paragraph beginning with a letter indicates a nested second-level list item. An automated conversion should leverage this pattern to determine the logical nesting level of each item. Alternatively, it could identify the nesting level from the indentation or styles that were used. Either way, the conversion needs to track the current nesting level to ensure that the lists are properly opened and closed and that each list item belongs to the correct list. In our example, we emit an opening <ol> each time we transition from an outer list item to a more deeply nested list item and emit a closing </ol> when transitioning in the other direction. The correct output is as follows:

<p>Quick-drop cookies</p>
<ol>
  <li>Prepare the dough.
    <ol>
      <li>Beat the egg in a large bowl.</li>
      <li>Add flour.</li>
      <li>Stir in milk.</li>
    </ol>
  </li>
  <li>Prepare the topping.
    <ol>
      <li>Mix brown sugar and cinnamon in another bowl.</li>
    </ol>
  </li>
  <li>
    <p>Form 1-inch round balls of dough.</p>
    <p>It is helpful to use a spoon when forming these balls.</p>
  </li>
  <li>Roll each ball in the topping.</li>
  <li>Place each ball on an ungreased cookie sheet.</li>
</ol>
<p>Bake at 425 degrees Fahrenheit for 12 to 15 minutes.</p>

Note that the list markers (1., 2., a., and so on) have been removed by the conversion.
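The nesting logic described in this section can be sketched as follows. This is an illustration under stated assumptions, not how any particular conversion tool works: each source paragraph is assumed to arrive as a hypothetical `(level, text)` pair, where the level (0 for body text, 1 or 2 for list depth) has already been derived from indentation or style, and levels are assumed never to skip a depth. The names `convert`, `MARKER`, and `SAMPLE` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# A list marker is a number or a lowercase letter followed by a period.
MARKER = re.compile(r"^(?:\d+|[a-z])\.\s+")


def convert(paragraphs):
    """Convert (level, text) pairs into <p> and nested <ol>/<li> elements."""
    body = ET.Element("body")
    open_lists = []  # open <ol> elements, outermost first
    open_items = []  # the current <li> inside each open list
    for level, text in paragraphs:
        marker = MARKER.match(text)
        if marker and level > 0:
            # Close any lists deeper than this item.
            del open_lists[level:]
            del open_items[level - 1:]
            # Open a nested list when moving to a deeper level.
            if len(open_lists) < level:
                parent = open_items[-1] if open_items else body
                open_lists.append(ET.SubElement(parent, "ol"))
            item = ET.SubElement(open_lists[-1], "li")
            item.text = text[marker.end():]  # drop the list marker
            open_items.append(item)
        elif level > 0 and len(open_items) >= level:
            # Unmarked, indented paragraph: it continues the current item.
            del open_lists[level:]
            del open_items[level:]
            item = open_items[-1]
            if item.text:
                first = ET.Element("p")
                first.text, item.text = item.text, None
                item.insert(0, first)
            ET.SubElement(item, "p").text = text
        else:
            # Body text closes every open list.
            open_lists.clear()
            open_items.clear()
            ET.SubElement(body, "p").text = text
    return body


SAMPLE = [
    (0, "Quick-drop cookies"),
    (1, "1. Prepare the dough."),
    (2, "a. Beat the egg in a large bowl."),
    (2, "b. Add flour."),
    (2, "c. Stir in milk."),
    (1, "2. Prepare the topping."),
    (2, "a. Mix brown sugar and cinnamon in another bowl."),
    (1, "3. Form 1-inch round balls of dough."),
    (1, "It is helpful to use a spoon when forming these balls."),
    (1, "4. Roll each ball in the topping."),
    (1, "5. Place each ball on an ungreased cookie sheet."),
    (0, "Bake at 425 °F for 12 to 15 minutes."),
]

for element in convert(SAMPLE):
    print(ET.tostring(element, encoding="unicode"))
```

Building an element tree instead of printing tags directly means every list that is opened is also closed, so the output is well-formed by construction.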

Guided Conversion

So, how can we establish the appropriate context of a given piece of content? The most reliable authority is the content owner who is familiar with the content. A mechanism is required which enables the content owner to easily express what the correct context is for any document content. The mechanism must have a high-level interface that does not require the user to be a programmer or technical expert.

Example: Task steps

Upon further reflection, the markup provided by the previous example is not ideal. An improved DITA markup of these instructions for preparing the quick-drop cookies would use steps within a task topic. But to target a semantically rich content model such as a DITA task, a conversion tool requires guidance. Such guidance may be provided by means of annotations attached to portions of the content, as illustrated in Table 1.

Text                                                      Annotation

Quick-drop cookies                                        task title
1. Prepare the dough.                                     step level 1
a. Beat the egg in a large bowl.                          step level 2
b. Add flour.                                             step level 2
c. Stir in milk.                                          step level 2
2. Prepare the topping.                                   step level 1
a. Mix brown sugar and cinnamon in another bowl.          step level 2
3. Form 1-inch round balls of dough.                      step level 1
It is helpful to use a spoon when forming these balls.    tip
4. Roll each ball in the topping.                         step level 1
5. Place each ball on an ungreased cookie sheet.          step level 1
Bake at 425 °F for 12 to 15 minutes.                      (no annotation)

Table 1: Annotations Attached to Content

The task title annotation can be based on the formatting properties of bold and underline. The annotation of step level 1 or 2 can be based on the presence of the list markers or the indentation level of the text. The tip might be recognized by the paragraph styling. The conversion should be smart enough to fit the last sentence into the task in a way that makes sense and is permitted by the DITA task content model. The elements <result>, <example>, and <postreq> are good candidates, and a preference can be set for the documentation set. In this case, <postreq> is the best choice.

Guided by these annotations, the conversion software should produce the following output:

<task>
  <title>Quick-drop cookies</title>
  <taskbody>
    <steps>
      <step>
        <cmd>Prepare the dough.</cmd>
        <substeps>
          <substep><cmd>Beat the egg in a large bowl.</cmd></substep>
          <substep><cmd>Add flour.</cmd></substep>
          <substep><cmd>Stir in milk.</cmd></substep>
        </substeps>
      </step>
      <step>
        <cmd>Prepare the topping.</cmd>
        <substeps>
          <substep><cmd>Mix brown sugar and cinnamon in another bowl.</cmd></substep>
        </substeps>
      </step>
      <step>
        <cmd>Form 1-inch round balls of dough.</cmd>
        <info><note type="tip">It is helpful to use a spoon when forming these balls.</note></info>
      </step>
      <step><cmd>Roll each ball in the topping.</cmd></step>
      <step><cmd>Place each ball on an ungreased cookie sheet.</cmd></step>
    </steps>
    <postreq>Bake at 425 degrees Fahrenheit for 12 to 15 minutes.</postreq>
  </taskbody>
</task>
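To make the role of the annotations concrete, here is a minimal sketch that turns annotated paragraphs into a task structure like the one above. It assumes the annotations arrive as hypothetical `(annotation, text)` pairs in document order, that a step level 2 or tip never appears before the first step level 1, and that the preferred element for unannotated trailing text is passed in as a parameter. The names `build_task` and `ANNOTATED` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# List markers such as "1. " or "a. " are dropped from the output.
MARKER = re.compile(r"^(?:\d+|[a-z])\.\s+")


def build_task(annotated, trailing_element="postreq"):
    """Build a <task> element from (annotation, text) pairs."""
    task = ET.Element("task")
    title = ET.SubElement(task, "title")
    body = ET.SubElement(task, "taskbody")
    steps = ET.SubElement(body, "steps")
    step = substeps = None
    for annotation, text in annotated:
        text = MARKER.sub("", text)
        if annotation == "task title":
            title.text = text
        elif annotation == "step level 1":
            step = ET.SubElement(steps, "step")
            substeps = None  # a new step starts without substeps
            ET.SubElement(step, "cmd").text = text
        elif annotation == "step level 2":
            if substeps is None:
                substeps = ET.SubElement(step, "substeps")
            substep = ET.SubElement(substeps, "substep")
            ET.SubElement(substep, "cmd").text = text
        elif annotation == "tip":
            info = ET.SubElement(step, "info")
            ET.SubElement(info, "note", type="tip").text = text
        else:
            # Unannotated text goes into the preferred trailing element.
            ET.SubElement(body, trailing_element).text = text
    return task


ANNOTATED = [
    ("task title", "Quick-drop cookies"),
    ("step level 1", "1. Prepare the dough."),
    ("step level 2", "a. Beat the egg in a large bowl."),
    ("step level 2", "b. Add flour."),
    ("step level 2", "c. Stir in milk."),
    ("step level 1", "2. Prepare the topping."),
    ("step level 2", "a. Mix brown sugar and cinnamon in another bowl."),
    ("step level 1", "3. Form 1-inch round balls of dough."),
    ("tip", "It is helpful to use a spoon when forming these balls."),
    ("step level 1", "4. Roll each ball in the topping."),
    ("step level 1", "5. Place each ball on an ungreased cookie sheet."),
    (None, "Bake at 425 °F for 12 to 15 minutes."),
]

print(ET.tostring(build_task(ANNOTATED), encoding="unicode"))
```

The annotation vocabulary stays small and readable, while the knowledge of the DITA task content model lives in the conversion code.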

Typical Problems to Look Out For

Here are some examples of the types of conversion issues that cause problems for conversion solutions that do not make full and integrated use of patterns, context, and guided conversion.

Multiple Sets of Steps Within a Task Topic

A DITA task topic must contain only one procedure. However, many existing user guides are not written that way and may have more than one procedure in a section. If you are converting sections into topics and a section has more than one procedure, the conversion software needs to do something to produce valid output that includes both procedures.

Some control of context is required even to recognize that this problem exists. A conversion that depended solely on pattern matching would not even notice that it was creating an illegal second procedure. For a conversion tool to avoid this error, it has to be aware of the context of the procedure, not only in the input it is reading, but in the output it is creating.

Though the content cannot be automatically re-authored, the conversion software can insert an empty task <title> based on context, effectively breaking the topic into two tasks. This action allows the conversion software to apply the semantically correct <step> and <cmd> markup to the content of the second procedure. The user still needs to provide the proper text for the title of the second procedure, post-conversion, but this work is much quicker and easier, and less error-prone, than re-authoring the topic, either in the source or in DITA.
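The following sketch illustrates the idea of tracking the output context. It assumes the section has already been classified into hypothetical `(kind, text)` blocks, that the first block is the section title, and that a paragraph always separates the two procedures. As a further assumption, that separating paragraph is moved into the new task as its lead-in. The names `Task`, `split_into_tasks`, and `SECTION` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    context: list = field(default_factory=list)   # paragraphs before the steps
    steps: list = field(default_factory=list)
    trailing: list = field(default_factory=list)  # paragraphs after the steps


def split_into_tasks(blocks):
    """Group (kind, text) blocks into tasks with one procedure each."""
    tasks = []
    for kind, text in blocks:
        if kind == "title":
            tasks.append(Task(title=text))
            continue
        task = tasks[-1]
        if kind == "step":
            if task.trailing:
                # The current task's steps have already ended, so this
                # step begins a second procedure: start a new task with
                # an empty title for the writer to fill in later.
                lead_in, task.trailing = task.trailing, []
                task = Task(title="", context=lead_in)
                tasks.append(task)
            task.steps.append(text)
        elif task.steps:
            task.trailing.append(text)
        else:
            task.context.append(text)
    return tasks


SECTION = [
    ("title", "Installing the widget"),
    ("para", "Before you begin, unpack the widget."),
    ("step", "Attach the bracket."),
    ("step", "Tighten the screws."),
    ("para", "To remove the widget:"),
    ("step", "Loosen the screws."),
    ("step", "Detach the bracket."),
]

for task in split_into_tasks(SECTION):
    print(repr(task.title), task.context, task.steps)
```

The decision to split depends on what has already been written to the output (a closed set of steps), not on anything visible in the current input block alone.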

Procedures Authored as a Table

A number of organizations use tables to lay out the steps of a procedure. For a generic conversion program, this structure is going to look like a table, not a procedure, and the result will be that the content will come out as a table rather than a task in the DITA XML, which is not what you want.

Guided conversion can identify such tables based on, for example, the content of the first column (Step 1, Step 2, and so on), or the header row, or possibly the table style. The identified tables can be stripped of their table markup and their contents automatically mapped into step commands, info, examples, and so on. Again, the paragraphs can be identified based on the fact that they were contained in such a table, so there is no need to rely on styles.

Tables that contain definition lists, advisories, or any other content can be similarly identified and stripped of their table markup.
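As a rough illustration of the first-column rule, the sketch below treats a table as a procedure when every data row starts with a label such as "Step 1". The table is assumed to be available as a list of rows, each a list of cell strings with at least two cells, and the first row is assumed to be the header. The three-column layout and the names `is_procedure_table` and `table_to_steps` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

STEP_LABEL = re.compile(r"^Step\s+\d+$", re.IGNORECASE)


def is_procedure_table(rows):
    """True when every data row's first cell looks like 'Step N'."""
    data = rows[1:]  # skip the header row
    return bool(data) and all(STEP_LABEL.match(row[0].strip()) for row in data)


def table_to_steps(rows):
    """Map each data row to a <step>: column 2 is the command, column 3 is info."""
    steps = ET.Element("steps")
    for row in rows[1:]:
        step = ET.SubElement(steps, "step")
        ET.SubElement(step, "cmd").text = row[1].strip()
        if len(row) > 2 and row[2].strip():
            ET.SubElement(step, "info").text = row[2].strip()
    return steps


TABLE = [
    ["Step", "Action", "Notes"],
    ["Step 1", "Open the cover.", ""],
    ["Step 2", "Insert the cartridge.", "It clicks when seated."],
]

if is_procedure_table(TABLE):
    print(ET.tostring(table_to_steps(TABLE), encoding="unicode"))
```

The same detection could be driven by the header row or the table style instead; only the test in `is_procedure_table` would change.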

Conditional Text

Some conversion tools have trouble working with files that contain conditional text. Sometimes the tool requires that all conditions be turned on before conversion and then loses the conditions in the output.

Guided conversion should be used to specify rules indicating how different conditions in the source content map to XML. The conversion rule can target the DITA otherprops attribute or a specialization of the props filtering attribute for the capture of the conditional information. A guided conversion rule could also cause conditions of a specified type to lead to the creation of entries in the relationship table of the DITA map.
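A mapping rule of this kind can be as simple as a lookup table. In the sketch below, the condition names (`PrintOnly`, `OnlineOnly`, `ProEdition`) and the attribute values are invented examples of what a content owner might specify, while `otherprops` and `product` are standard DITA filtering attributes. Conditions without a rule are returned for review rather than silently dropped.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source condition name -> (DITA attribute, value).
CONDITION_RULES = {
    "PrintOnly": ("otherprops", "print"),
    "OnlineOnly": ("otherprops", "online"),
    "ProEdition": ("product", "pro"),
}


def apply_conditions(element, conditions, rules=CONDITION_RULES):
    """Record source conditions as filtering attributes on the element.

    Returns the list of conditions that have no mapping rule.
    """
    unmapped = []
    for condition in conditions:
        if condition not in rules:
            unmapped.append(condition)
            continue
        attribute, value = rules[condition]
        # Filtering attributes hold space-separated values.
        values = element.get(attribute, "").split()
        if value not in values:
            values.append(value)
        element.set(attribute, " ".join(values))
    return unmapped


paragraph = ET.Element("p")
paragraph.text = "See the printed quick-start card."
leftover = apply_conditions(paragraph, ["PrintOnly", "ProEdition", "Draft"])
print(ET.tostring(paragraph, encoding="unicode"), leftover)
```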

Constructing Book and Map Files

While the aim of a conversion to DITA is to be able to reuse topics in many places, the first place you are probably going to want to use your converted topics is in the same book they came from. That intent means you will need a ditamap and/or bookmap that reproduces the structure of the converted book. Your conversion tool should produce the required ditamap and bookmap for you.

Discerning the hierarchy is not always as simple as matching heading levels. Not every heading marks a change in hierarchy, and authors do not always use headings in strict hierarchical sequence. Additionally, different topic divisions may be indicated by the use of different heading types. Managing all of these issues requires sophisticated management of context informed by a detailed knowledge of the content and the style conventions that were used to create it.
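The sketch below shows one way to rebuild a map hierarchy from headings that are not in strict sequence. It assumes the headings that start a topic have already been selected and given a numeric level (1 or greater), a title, and a target file name; a heading that skips a level is simply nested under the nearest open ancestor. The function name `build_map` and the sample file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_map(title, headings):
    """Build a <map> from (level, navtitle, href) tuples in document order."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = title
    stack = [(0, root)]  # (level, element) pairs for the open ancestors
    for level, navtitle, href in headings:
        # Close every open topicref at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        topicref = ET.SubElement(
            stack[-1][1], "topicref", href=href, navtitle=navtitle
        )
        stack.append((level, topicref))
    return root


HEADINGS = [
    (1, "Introduction", "introduction.dita"),
    (2, "Overview", "overview.dita"),
    (1, "Installation", "installation.dita"),
    (3, "Release notes", "release_notes.dita"),  # skips level 2
]

print(ET.tostring(build_map("User Guide", HEADINGS), encoding="unicode"))
```

Deciding which headings reach this function, and at what level, is where knowledge of the source style conventions comes in.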

Another important issue is discovering the book information such as publication date, document number, and so on. For some organizations, discovering book information may involve the creation of a specialized bookmap if the standard DITA bookmap does not capture all of the publication information the organization uses.

Book information is not always easy to find in the source files. No generic conversion software can ever accurately detect, extract, and preserve this publication information, since its format and location are always specific to an individual organization. However, with guided conversion pinpointing the location, pattern, and context of this information, a conversion tool can build the correct map.

In some cases, important metadata is found in the headers and footers rather than the main text flow of the document. Once again, guided conversion can pinpoint the data of interest and relate it correctly to the ditamap and bookmap files you are building.
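As a small illustration, organization-specific rules for locating publication information might be expressed as named patterns applied to the header and footer text. The field names, the patterns, and the sample footer below are all invented; a real rule set would reflect the conventions of the document set being converted.

```python
import re

# Hypothetical rules: metadata field -> pattern with one capturing group.
METADATA_RULES = {
    "document_number": re.compile(r"Document No\.\s*([A-Z0-9-]+)"),
    "publication_date": re.compile(r"Published:\s*([A-Z][a-z]+ \d{4})"),
}


def extract_metadata(regions, rules=METADATA_RULES):
    """Search header and footer text for each field; first match wins."""
    found = {}
    for field_name, pattern in rules.items():
        for text in regions:
            match = pattern.search(text)
            if match:
                found[field_name] = match.group(1)
                break
    return found


HEADER = "User Guide"
FOOTER = "Document No. UG-1042  Published: June 2012"
print(extract_metadata([HEADER, FOOTER]))
```

The extracted values would then be written into whichever bookmap elements the organization uses for this information.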

Choosing Your Conversion Strategy

Knowledge has been defined in the following way [1]:

“Knowledge is the meaningful organization of information, expressing an evolving understanding of a subject and establishing a basis for judgment and the potential for action.”

The level of success that an automated conversion technology can hope to achieve is bounded by the depth of knowledge it can attain of the content to be converted. Context, supported by guided conversion, provides for the meaningful organization of the information revealed by patterns. The conversion software can act on this evolved understanding of your content to produce the richest XML possible. Knowledge is the key to intelligent content conversion.

Because intimate familiarity with the content is so important to specifying the patterns and the context that will produce a high quality conversion requiring little clean up, you probably don’t want to simply send your files away to be converted. Without your specialized knowledge to supply the patterns and context clues, the conversion you get back is going to be pretty generic, and that is going to mean you will have to do a lot of manual clean up before the content is really usable.

On the other hand, the people with this knowledge are writers and editors in your organization, and they generally don’t know how to express these kinds of context clues in a programming language. Trying to learn to do conversion programming so that you can write your own conversions that exploit your knowledge of the content is going to be even more time consuming than cleaning up all the problems left by a generic conversion.

To get the best of both worlds, you need to work with a conversion service provider who understands the importance of patterns, context, and knowledge of the content in the conversion process. That provider should work with you to define the conversion rules that will greatly improve the quality of your conversion output, saving you weeks or months of cleanup effort. The provider should also have intelligent conversion tools that let you capture and express all the context recognition rules in a high-level, human-readable way, without the need for programming or technical expertise.

[1] Gollner, J. The Anatomy of Knowledge. Retrieved March 8, 2012, from http://jgollner.typepad.com/files/the-anatomy-of-knowledge-jgollner-sept-2006.pdf


Patrick Baker

Stilo International

pbaker@stilo.com

Patrick Baker is VP, Development and Professional Services at Stilo International, where he heads product development and is actively engaged in the successful deployment of content conversion solutions for publishing clients. Patrick has been associated with best practices for complex content conversion for over a decade and has successfully delivered custom solutions on behalf of organizations in the automotive, airline, defence, and commercial publishing sectors. With a B.Sc. degree in Mathematics and an M.Sc. in Computer Science from McGill University, he leads an expert team of highly talented content conversion specialists at Stilo.

Stilo International <http://www.stilo.com> is the leading provider of cloud content conversion services and content conversion tools and developer of enterprise conversion solutions that aggregate and convert rich content from source formats such as SGML/XML, Word and FrameMaker to target publishing formats including DITA, S1000D and custom XML.

 
