Legacy Content Pitfalls and Potentials


June 2008


Troy Klukewich, Oracle Corporation

Imagine that you wake up one day to discover that none of your shoes fit. That is what attempting to convert legacy content to structured XML can feel like. What are some of the pitfalls of converting legacy content to a highly structured standard like DITA XML? What are the potentials? What content converts well and what does not? I have worked on numerous XML conversion projects for companies and can share from experience.

As with most complex projects, the devil is in the details. In this case, the details invariably involve not only the structure but also the intention of legacy content versus modern, structured documentation projects. I will describe worst and best-case scenarios, address retrofitting the book model to DITA, and provide a comparative, action-oriented model for measuring legacy content. Though I focus on topic-oriented DITA conversions for software projects, the principles I address are applicable to structured documentation systems in general.

Why Not Convert Everything?

It seems like such an innocent idea. We love our old content (at least some of us do). It has been with us for years, like an old, frayed coat. It may not be perfect, but at least we know where everything is. Why not convert everything to XML and gain the benefits of easier translations, downstream single sourcing, and so on? We get all the benefits of moving forward and all the benefits of our years of hard work. How hard can it be?

Eye of the Needle

When you convert legacy content, a critical question is: what (and where) are the implicit documentation types, whether formal or informal? How many types are there? How are they structured in legacy? Are they neatly organized into separate topics with headings or is everything mashed together in a stream-of-consciousness style? Is there a preponderance of one documentation type and a glaring absence of another?

Structured documentation is a specific approach to designing and connecting formal content types. DITA does not designate new fundamental content types. In fact, the types that DITA identifies have been in documentation all along. Almost every documentation project contains some mix of conceptual, perhaps procedural (task or how-to), and reference information. These fundamental content types are capable of near-infinite variations.

In essence, what structured documentation does is separate content types into discrete topics. That is really it. Put the conceptual material in a conceptual file. Put the task material in a task file. Put the reference material in a reference file. Link them together in a strategic architecture using maps to form the shape of your documentation. Use alternate maps for alternate structures. It sounds simple. And it is, when you start from scratch. Starting from legacy is a different story.
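As a minimal sketch of the idea, the following Python builds a small DITA map that links one concept, one task, and one reference file. The file names and the map title are hypothetical and chosen only for illustration. The `map`, `title`, and `topicref` elements and the `href` and `type` attributes are standard DITA.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic files, one per content type (names are illustrative only).
TOPICS = [
    ("concept", "about_web_services.dita"),
    ("task", "publish_a_service.dita"),
    ("reference", "service_parameters.dita"),
]


def build_map(title, topics):
    """Return a DITA map element that links each topic file by reference."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = title
    for topic_type, href in topics:
        # The type attribute records which kind of topic the file contains.
        ET.SubElement(root, "topicref", href=href, type=topic_type)
    return root


ditamap = build_map("Web Services Guide", TOPICS)
print(ET.tostring(ditamap, encoding="unicode"))
```

The topic files themselves stay untouched. An alternate map would simply be a second call to `build_map` with a different selection or order of the same files.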

Worst Case

Let us take a worst-case scenario. I will refer to a real software company that I once worked for that shall remain nameless to protect the guilty (especially me). Our products had grown increasingly complex over the years, evolving into a suite of applications with complex integrations. The documentation in turn evolved over the years, layer upon layer of morphing technologies. Translation costs had grown beyond reason. The documentation was increasingly difficult to maintain. Writers were spending an increasing portion of their time finessing desktop publishing formats rather than working on content.

It was time for structured documentation and XML to save the day. Our first idea was to restructure the legacy content into strict, structured documentation, which would entail extracting conceptual, task, and reference material. We wanted to start on the right foundation and build up from there, release over release.

A Small Experiment

I am a big believer in small, contained experiments. I believe it is much better to perform a little experiment and experience a little flop than to bravely mount a full production and bomb on opening night. I put a challenge to our team. Let us take one chapter, a self-contained domain, and attempt to convert it to structured documentation. We could then determine the work effort and note the time, which we could then extrapolate to the rest of the materials.

Our web services functionality had hundreds of pages of supporting material. We started with the main chapter. Between releases, I gave the chapter to our most seasoned writer, a superb minimalist, and she confidently set to work. After a couple of weeks, she gave up. She said it was impossible to rework the material. Why?

A Big Problem

The web services material was a mix of conceptual, task, and reference content with no clear demarcations from one documentation type to another. Sometimes conceptual material morphed into reference material and back again. Sometimes topics were neither the one nor the other, but something in-between.

Tasks were the biggest problem and, I would argue, the most important content type. In most cases, tasks were not really tasks at all. They were pseudo-tasks. They might look like a task describing some application process, but if you attempted to perform the steps, you might find that three operations required fifteen discrete, undocumented steps (strangely enough, many of our customers never did quite figure out what all those implicit steps were, as our support calls attested).

To extract real tasks meant understanding the goal of what the original author was trying to communicate in the first place, which was not clear given the state of the documentation. The writer would essentially have to reverse-engineer the content from first principles. As the tasks did not really describe reproducible steps, we had to determine what the steps were, which required a reinvestigation of the original functionality.

After a couple of weeks, the writer determined that it would take less time to write content from scratch than it would to decipher and rewrite the existing legacy content. The original content was simply too confusing. It slowed us down. If it slowed us down in a basic rewrite, imagine what it must have done for the comprehension of customers reading it.

The exercise was enlightening. We put some estimates together and, given the many thousands of features, determined that it would take the full complement of writers a number of years to structure the content properly. This estimate should not be a surprise. It took many years to document the product in the first place.

The legacy content did not map to the new structure. Restructuring existing content was not an option, as cleaning up the mix was too time-intensive or impossible in practice. Writing from scratch to cover an entire project was not an option either. We sought a compromise.

A Hybrid Solution

We converted to XML and loaded the entire book into a pseudo-reference node and left it more or less as is. We positioned it as an “advanced” system reference because the vast majority of topics were reference-oriented. We then built a thin structured layer on top of the TOC, with a focus on practical goals and tasks, initially positioned for new users. Over time the thin layer built out and became the main access into the documentation. It did not take long.

The advantage of the hybrid architecture is that we could at least go forward with an ideal structure and not worry about trying to fit new content into an old, incompatible style. We could also link into the old material if necessary, though we avoided doing so because we wanted to deprecate most of the material over time. We still realized some benefits from converting the legacy content to XML. Our translation costs dropped, though not as much as a fully single-sourced, structured content system would have. The disadvantage of the hybrid model is that the legacy content will tend to stay around forever like a vestigial organ.

I was involved in numerous conversion projects across multiple products, including new acquisitions. I soon recognized the same factors repeatedly at work. Few projects mapped well to true structured documentation architecture. Occasionally, a project would come along that would surprise me.

Best Case

Is legacy conversion a lost cause, a case of minimal rewards from the move to XML alone? When do legacy projects convert well to structured documentation? Structure in documentation is nothing new. Practically from the moment typesetting moved to computers, structure was required to describe elements of a virtual page. Over the years, I have seen projects move away from the book as the fundamental information type, but the importance of structure remains, regardless of packaging.

Long before DITA, I was using structured documentation approaches in large, complex ERP implementation projects in help and curriculum, all without the benefit of XML. Structured documentation was simply a practical approach that cut the fluff and got to the point. The huge advantage that XML in general and DITA in particular provide is single sourcing and multiple outputs in addition to a defined structured approach based on topics. But structure has been in documentation projects all along.

Legacy content converts best when the content itself is structured. If a large book is structured into chapters, and chapters structured into discrete topics, and those topics are in turn clearly organized into conceptual, task, and reference subjects, the legacy can convert very well indeed. Once the topics are extracted out of large files into discrete topic files per type, ditamaps can then restructure the topic hierarchy. Now with single sourcing, you have great flexibility. You can recreate the linear “book” structure or experiment with alternate structures like a three-tier concept, task, and reference model used in many modern projects.
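A rough sketch of that flexibility, assuming the topics have already been extracted into separate files: the same list of files feeds two different maps, one that recreates the original linear order and one that regroups the files by type. The file names are hypothetical, and grouping under `topichead` elements is only one possible reading of a three-tier model.

```python
import xml.etree.ElementTree as ET

# Hypothetical topics in their original book order: (type, file name).
EXTRACTED = [
    ("concept", "integration_overview.dita"),
    ("task", "configure_endpoint.dita"),
    ("reference", "endpoint_fields.dita"),
    ("task", "test_endpoint.dita"),
]


def linear_map(topics):
    """Recreate the book: one flat sequence in the legacy order."""
    root = ET.Element("map")
    for topic_type, href in topics:
        ET.SubElement(root, "topicref", href=href, type=topic_type)
    return root


def three_tier_map(topics):
    """Regroup the same files under concept, task, and reference headings."""
    root = ET.Element("map")
    for tier in ("concept", "task", "reference"):
        group = ET.SubElement(root, "topichead", navtitle=tier.capitalize())
        for topic_type, href in topics:
            if topic_type == tier:
                ET.SubElement(group, "topicref", href=href, type=topic_type)
    return root


book = linear_map(EXTRACTED)
tiers = three_tier_map(EXTRACTED)
print(ET.tostring(tiers, encoding="unicode"))
```

Neither function edits a topic file. Only the map changes, which is why small, typed topics make alternate structures cheap to try.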

Old Wine Into New Bottles

The majority of projects that I have seen that did not convert well to structured documentation were based on books. This occurred not because books are inherently unstructured. It depends on the conventions used to design the book.

New to Oracle, I have noticed that many of the software manuals, though based on a traditional presentation of a book, are very well structured with discrete conceptual, task, and reference material contained within chapters. In the past, the frequent problem I have seen with books based on open-ended desktop publishing systems is that you can do anything in them and, often, anything happens.

I have noticed one constant throughout my career: regardless of format or packaging conventions, the materials that are easiest to understand and use are the ones that are well structured. I have gone into failing projects with technically accurate but poorly structured content and saved the day by injecting structure into the material, with dramatic turnaround successes and measurable results. The only thing that changed was the structure.

Time for a Reappraisal

When moving to structured documentation, it may well be time to coldly appraise the hard-earned content you have labored over for years. How well does it do the job? What job does it do? Does a structure that possibly worked well ten years ago still work well today with larger, more complex, modular, and increasingly integrated products? It is time to ask.

A huge temptation when moving to XML is to regenerate the legacy structure verbatim, most often based on a book and chapter model, but use XML as the underlying markup. There are some advantages. You still gain some efficiency in translation, and the possibilities for single sourcing, though complicated, are there. As a result, some projects continue to use the book as the primary publication metaphor.

I have nothing against books. By some strange coincidence, the arc of my career has avoided books and focused on modular, hypertext systems as the primary deliverable with book presentations as secondary. This is to say that books are not the only way to present complex information to audiences and in some cases may not be the best way.

Books and Single-sourcing

You can build books in DITA with ditamaps. But you can also build alternate structures just as easily. If you do choose to build books, the structure of the XML files does not have to mirror chapters. Though the choice to use large chapters or small topics may seem arbitrary, storing many loosely related topics in a single large chapter constrains the DITA model. Unlike DocBook, DITA is optimized for topics, not chapters.

While the topic-based approach to structured documentation is relatively new, best practices are emerging for shared content that you need to consider prior to conversion. These practices may influence your decision whether to use legacy structure, especially large chapters, to contain converted topics.

Many information architects prefer to chunk at the topic level and share content via ditamaps. In contrast, large chapters have a greater dependency on complex content references versus simpler ditamap references. In general, smaller topics provide the most flexibility for shared content and multiple outputs.

According to discussions on the dita-users newsgroup, leading-edge companies are finding that unconstrained content references to small chunks of information within larger files lead to spaghetti documentation references and tracking complexities. I am finding that experienced information architects recommend ditamaps over content references for single-sourcing when possible. Hard-coding topics into a long linear presentation within single files, though possible, constrains the utility of single-sourcing chunks of information.

In general, if a chunk of information smaller than a topic is truly capable of single sourcing and multiple contexts, put it in a shared content library and link to the library with a content reference. The library convention automatically signifies a shared context and supports better maintenance. In most cases, ditamaps will suffice for single-sourcing content when using small topics, but content references do have their place. Use ditamaps for single-sourcing topics and content references for chunks of information smaller than a topic.
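To make the library convention concrete, here is a simplified sketch of a content reference into a shared library. The library topic, the note text, and the ids are hypothetical. The `conref` value follows the standard DITA form of file, topic id, and element id. The `resolve_conrefs` function is only an illustration of the idea. Real DITA processors apply additional rules when they resolve content references.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical shared library topic: each reusable chunk carries its own id.
LIBRARY = ET.fromstring(
    '<topic id="shared_library"><title>Shared content</title><body>'
    '<note id="backup_warning">Back up the database before you continue.</note>'
    '</body></topic>'
)

# A task step that pulls in the note by content reference instead of copying it.
STEP = ET.fromstring(
    '<step><cmd>Run the upgrade.</cmd>'
    '<info><note conref="shared_library.dita#shared_library/backup_warning"/></info>'
    '</step>'
)


def resolve_conrefs(element, library):
    """Replace each conref placeholder with a copy of the matching library element."""
    # Snapshot the tree first so that replacing children does not disturb iteration.
    for parent in list(element.iter()):
        for index, child in enumerate(list(parent)):
            target = child.get("conref")
            if target is None:
                continue
            element_id = target.split("/")[-1]  # the part after "topicid/"
            source = library.find(f".//*[@id='{element_id}']")
            if source is not None:
                parent[index] = copy.deepcopy(source)
    return element


resolved = resolve_conrefs(STEP, LIBRARY)
print(ET.tostring(resolved, encoding="unicode"))
```

Because every reference points into one library file, a writer who changes the note knows at a glance that the text is shared, which is the maintenance benefit the library convention is meant to provide.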

Knowledge versus Know-how

As you consider moving a legacy project to structured documentation, it is a good idea to appraise the legacy first, to determine what it does and does not do. To do so effectively, it is best to understand what a structured documentation project should look like as an ideal model. You can only measure effectively against a target when you know what the target is. Comparing documentation models, you will most likely uncover fundamental differences in approach.

Understanding through Knowledge

The main problem I have noticed with legacy conversions is that documentation is a complex thing. Different documentation approaches accomplish remarkably different goals. The majority of legacy projects that I have seen are founded on the idea that information supports understanding and that understanding is what we’re trying to deliver to customers. So if we deliver a lot of information, we deliver a lot of understanding.

The problem with the “understanding through knowledge” approach is that there is no end to it. Projects based on this approach typically contain a voluminous amount of information. I have seen numerous topic domains with hundreds of pages of fascinating information. Yet customers and consultants routinely complained that they had no idea how to do anything or where to start.

The pure information model is likely an artifact of the early days of software when products were small, self-contained silos. Writers talked more about what a thing is than about what it does, perhaps because it did not do much at first. Then came rapid growth, complexity, and integration.

What numerous documentation projects have experienced over time with the “understanding through knowledge” model is gigantic increases in word counts and translation costs with a corresponding decrease in usability. The approach does not scale. At best, the model works well for general overviews and specific reference topics. But these two implicit documentation types (concepts and reference) are not enough to transfer effective know-how. Today, know-how is the bottom line measure of success in complex software projects.

Understanding through Action

There is more than one way to understand a complex domain. We can learn through memorizing information. And we can learn through actions. Enter an alternate model of documentation: understanding through action.

It is possible to learn certain things through understanding facts and information, certainly. You can learn all kinds of facts about Paris, perhaps even more than the average Parisian. But you do not know Paris until you walk the streets and breathe the air, until you take action within the domain. We can learn by understanding information, but we also learn by action.

Structured documentation, especially topic-based content, is superbly suited to working with audience goals and tasks. When working on complex ERP implementations in my early career, I specifically identified user goals first, then determined what tasks were needed to accomplish those goals, then determined what conceptual and reference material was needed to support those tasks.

The advantage of an action-oriented approach is that it constrains conceptual topics. It puts them in their place. In the “understanding through knowledge” model there is no clear point at which enough is enough. The boundaries of the information space are arbitrary. In the “understanding through action” model, the conceptual material is constrained by user goals. You need only as much conceptual and reference material as it takes to support the goal. Once the goal is accomplished, you move on. The end result is more concise, focused content (minimalism).
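One way to picture this constraint is as a simple inventory check: list the goals, the tasks under each goal, and the supporting topics each task relies on, then see which legacy topics fall outside that set. The sketch below assumes a hypothetical inventory. The class names, topic titles, and the `topics_in_scope` helper are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Task:
    title: str
    supporting_topics: List[str] = field(default_factory=list)


@dataclass
class Goal:
    title: str
    tasks: List[Task] = field(default_factory=list)


def topics_in_scope(goals: List[Goal]) -> Set[str]:
    """Collect only the concept and reference topics that some task relies on."""
    return {
        topic
        for goal in goals
        for task in goal.tasks
        for topic in task.supporting_topics
    }


# Hypothetical inventory; none of these titles come from a real project.
goals = [
    Goal("Expose an order service", [
        Task("Publish the service",
             ["What a service endpoint is", "Endpoint field reference"]),
        Task("Test the service", ["Endpoint field reference"]),
    ]),
]
legacy_topics = [
    "What a service endpoint is",
    "Endpoint field reference",
    "History of the messaging layer",
]

scope = topics_in_scope(goals)
keep = [t for t in legacy_topics if t in scope]
out_of_scope = [t for t in legacy_topics if t not in scope]
print(keep, out_of_scope)
```

A topic that supports no task is not necessarily worthless, but under this model it becomes a candidate for the legacy reference or for retirement, not for the main goal-oriented layer.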

Adult learners do not need to know everything before they can do something useful. In fact, learning works naturally when building on small domains of accomplishment. Once learners have an understanding through action, they can quickly adapt their understanding to accomplish more goals, often with little or no further documentation.

The most important deliverable in contemporary documentation and training is to provide adult learners with an effective mental model of the product that they can build on themselves. Burying readers under an avalanche of facts, no matter how interesting, pervasive, or technically correct, no longer works.

What is the Goal?

The majority of legacy projects that I have seen do not clearly identify goals and context. Often they do not have tasks. And when they do have tasks, they are often geared to discrete elements of the user interface (UI), divorced from goals and context. I am finding that this kind of micro-documentation is of decreasing value, better handled with embedded UI documentation (field descriptions, tool tips, and pop-ups).

I have also seen semi-structured projects that contained discrete conceptual topics, but the concepts were geared to small UI elements, not user goals. So the concepts often go into great detail about small, self-evident features and UI descriptions. As UIs are increasingly web-based, using common web metaphors, and are often self-documented, this kind of micro-conceptual documentation is of decreasing value and subject to offshoring.

The higher-end, analytical work for technical writers using structured documentation involves determining audience goals and pain points. With goals and pain points identified, the writer can design a powerful structured documentation solution to directly address customer needs.

Instead of talking about features of the product at great length, we talk about what the users will do with the product to accomplish their goals. We put the features within a context that users can immediately understand. The difference is fundamental. We have to understand what the users will typically do with the product, what their goals are. Structured documentation, especially the model that DITA supports, aligns particularly well with goals and actions. Understanding is action. Action is power.


The pitfall of legacy content is not merely one of incompatible structure but of the intention behind that structure. When moving to structured documentation, it is time to take an inventory of the effectiveness of legacy documentation to ensure it is ready for the next ten years of product development.

In many cases, it is easier to start over than attempt a restructuring during conversion. If the legacy content does provide useful content of a certain type, a hybrid structure may retain the advantages of the legacy while promoting new practices. In some cases, portions of the legacy content may be jettisoned while new documentation structures and types supersede legacy design with a superior, more concise, action-oriented approach based on goals.

Troy Klukewich


Troy Klukewich is an Information Architect and a DITA evangelist at Oracle. Prior to Oracle, Troy worked at Borland Software where he led an early structured documentation initiative in XML, similar to DITA. Prior to Borland, Troy worked at PeopleSoft where he headed up the cross-product documentation effort for Enterprise Integration Points in XML. During his consulting days, he worked on numerous SAP implementation projects, both as a technical writer and a curriculum designer, including stints at Microsoft and other Fortune 500 companies. He has a proven track record turning mission-critical documentation projects into top-rated, measurable success stories using structured documentation. He is currently working on a DITA conversion at Oracle. Troy is a world traveler, has lived in Africa, and reads avidly in German and French.