Metadata: Origin and Destiny of Documentation


June 2016

Annu Rantamäki, Varian Medical Systems, Inc. & Richard Forster

Meticulous metadata management is one of the keys to a successful DITA implementation. Richard Forster and Annu Rantamäki take a systematic look at the different metadata questions that an organization faces when working with DITA. They illustrate the discussion with specific lessons learned in their daily work in medical technology. While the regulatory environment enforces a particularly rigorous approach, many of these issues need to be addressed implicitly or explicitly in any industry. It transpires that many of the most complex challenges with DITA can be framed as questions of metadata management. Ultimately, metadata tell you where your documentation is coming from and help you shape where it is going.


Imagine a world without metadata. More specifically, imagine a world where at 3:52 p.m. on Thursday next all metadata disappears with an almost inaudible sound—plop!—and gone they are. First you wouldn’t worry much and would just keep on typing your memo. But then, just as you want to reach out to your phone to call an old business partner, you notice that all contact details have disappeared from your directory. Hoping to find his number, you try searching for his latest email in your inbox. But the search returns an empty result since all the search indexes have disappeared too. And so on for the rest of the afternoon.

At the end of a fairly unproductive day at the office you drive home. For lack of anything better to do, you decide to boot up your home computer and finally organize those holiday pictures from half a year ago which you had postponed time and again. But holy Homer—your treasured photo collection has gone up in thin air! Disaster!! No wait, there it is. Phew! But hey, what happened to it? Where have all the folders and tags gone which you used to organize people and places by? All lost! You already see yourself painfully reconstructing your life by going through the photographs one-by-one together with your diary to organize them again, starting from the oldest and then slowly moving on to the latest. But stop—not only tags and folders are gone, so are all the timestamps and geolocation coordinates! Even the filenames are no longer in any order, but randomized sequences of digits and letters. Your great and famous collection has just degenerated into an almost random mess of colored pixels of unknown origin.

You pray to wake up from this hellish nightmare—but you don’t. This is not just a hellish nightmare; this is the world without metadata.

The Three Types of Metadata

So let us look at the meaning and purpose of metadata in technical publications and DITA. Metadata being “data about data” is a well-known and quite appropriate definition. The classification of the National Information Standards Organization (NISO) further helps us structure the discussion:

  • Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.
  • Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters.
  • Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it.


Descriptive Metadata

Identification and search
One of the first and most obvious occasions for dealing with metadata is when you introduce a content management system (CMS) for your DITA content. While some systems use folder-like structures to help you navigate, others are database-oriented and work exclusively with queries. In both cases, it is important that you are able to identify and describe resources (topics, maps, images) with relevant metadata so that they can be found and retrieved. Let us call this the backend search, as it is not exposed to the customer. Then, of course, we also have the search functionality offered in our deliverables to the customer—the frontend search.

As soon as the documentation reaches a modicum of complexity or size, both backend and frontend search become indispensable. Here are some of the most important metadata for identification and search, together with some of the challenges to solve:

  • Title—The title of the resource is the foremost identification item, both for the frontend and backend. Especially in the case of topics, the title should be a succinct and accurate description of the resource. A style guide may define how to form titles for each type of topic to improve consistency and facilitate navigation.
  • Type—The type of the resource is typically used at the backend to structure the content development process. Countless battles can be fought over which topic types to use and when, how to use them, and whether to create your own topic specializations. At the frontend, the topic type is usually not explicitly shown, but may be deduced from the way the topic title is written.
  • Abstract—Next to the title, the abstract or short description is another typical way of summarizing the content. Here it is important to think about different delivery methods. The short description has a very different function and value in an online system, with search results and progressive information disclosure, than in a standard book or PDF output. If you are fully committed to single sourcing, it is important to use these elements in a way that makes the content useful, stringent, and free of annoying redundancies, irrespective of the final delivery format.
  • Author—Knowing who worked on a particular topic is usually of vital interest to internal users, but irrelevant for customers. DITA provides the <author> element, but if you use a CMS with user identification, chances are that the CMS already stores and provides that kind of metadata. The same applies to metadata such as <publisher>, <critdates>, or (depending on interpretation) the status attribute.
  • Keywords—Specific terms to characterize a topic. They are typically given special weight in a search context and may also be used to classify topics. Before adding <keyword> to your content model, make sure to develop global conventions and strategies. Do you use keyword in the metadata only, or do you also use the keyword element to tag terms in the actual text? Do you use just <keyword> or some of the more specialized keyword metadata elements like <audience>, <platform>, <prodinfo>, <brand>, <category>, <source>, and so on?
    Also, when using such keywords, everyone should know how they are going to be used further down the road and where they might appear in the output. For instance, if writers are not aware that their keywords will be included in the <meta> tags of the HTML output, they might inadvertently put sensitive information in the public domain or influence search results in an unexpected way.
  • Index terms—Indexes have a function similar to keywords, but there are considerable differences. Index terms usually have little significance at the backend, but in the output they serve as a third, independent access point to the content, right after the search function and the table of contents. Even more so than with titles or keywords, it is crucial to index the content consistently throughout.
    Providing a good and globally consistent index with DITA content is hard. While in traditional book production indexing is a separate, concentrated activity performed once on the whole book, in topic-based authoring it is often an activity distributed over multiple writers that happens for each topic individually and at different times.
    For an index in a DITA book to be consistent, all topic owners need not only a similar understanding of the depth of the indexing, but they also need to strictly follow the same approach regarding grammatical form, orthography, as well as nesting of the index entries. Otherwise the index will be either incomplete or cluttered by many very similar concepts that should have been unified.
    Particular care is required with cross-references in the index (<index-see> and <index-see-also>). DITA does not support an automatic integrity check for such cross-references, so you should not be surprised if your cross-references lead into limbo.
    One option to address many of these difficulties is to move the index terms from the topics to the <topicmeta> elements of the map. This allows the map owner to create a consistent index without having to change any of the topics. But there is a price to pay for this: First, if the topic is reused, then the index metadata needs to be created or maintained separately for each occurrence of the topic. Second, such index terms are always on the topic-level and cannot be included at specific places inside a topic.
  • CMS categories—a content management system may have some additional, non-DITA features such as labels, tags, or favorites to organize topics. This kind of metadata is typically independent from the DITA structure and only used internally. As with keywords, its success depends on the existence of clear guidelines, taxonomy structure, and use cases.
    Typically, this kind of metadata is not versioned or formally tracked. As a result, it is easy to use, but also easy to abuse, and in a complex multi-user environment, the lack of an audit trail may mean that the access rights to updating such categorization metadata need to be managed purposefully.
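
The missing integrity check for index cross-references, mentioned in the list above, is the kind of gap a small script can close. The following is a minimal sketch, not a feature of DITA or of any particular CMS: the two topics are hypothetical and inlined as strings, and the check simply assumes that a see or see-also target is valid if some topic defines an index term with exactly the same text.

```python
import xml.etree.ElementTree as ET

# Two hypothetical topics, inlined as strings so the sketch is self-contained.
TOPICS = {
    "beam.dita": """<topic id="beam"><title>Beam setup</title>
      <prolog><metadata><keywords>
        <indexterm>beam<indexterm>calibration</indexterm></indexterm>
        <indexterm>ray<index-see>beam</index-see></indexterm>
      </keywords></metadata></prolog><body/></topic>""",
    "dose.dita": """<topic id="dose"><title>Dose</title>
      <prolog><metadata><keywords>
        <indexterm>dose<index-see-also>dosimetry</index-see-also></indexterm>
      </keywords></metadata></prolog><body/></topic>""",
}


def term_text(element):
    """Return the element's own leading text, ignoring nested markup."""
    return (element.text or "").strip()


def dangling_index_refs(topics):
    """List (file, term, target) for see/see-also targets no topic defines."""
    defined, refs = set(), []
    for name, xml in topics.items():
        root = ET.fromstring(xml)
        for term in root.iter("indexterm"):
            defined.add(term_text(term))
            # Only direct children can be cross-references of this term.
            for child in term:
                if child.tag in ("index-see", "index-see-also"):
                    refs.append((name, term_text(term), term_text(child)))
    return [ref for ref in refs if ref[2] not in defined]
```

Calling `dangling_index_refs(TOPICS)` reports the see-also reference to "dosimetry", which no topic defines. A real check would also have to normalize spelling and nesting, which is exactly where the consistency rules discussed above come in.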

Filtering and Conditions
A particularly powerful kind of metadata is the set of DITA profiling attributes (also known as “conditions” or “filters”). They are descriptive in the sense that, at the authoring stage, they indicate the audience, platform, product, and so on, for which a specific part of the content is intended. From an end user’s point of view, however, these attributes are much more than descriptive: they decide whether a certain part of the content is visible at all (so you might also classify conditions as structural metadata).

Almost everything that has been said in the previous section about keywords also applies to profiling attributes. Questions of governance and taxonomy need to be answered. What is the “dimension” of each condition? Should content for different product releases be managed with conditions?

Since conditions are used to decide about inclusion and exclusion of text, it is vital to understand how they work. Especially with multiple cascading conditions, it can become very tricky to know which combinations of conditions need to be expected at any given point of the entire topic and map hierarchy.

Some of the most important practical questions to address:

  • Controlled values—Your content management system and your editing tool need to support controlled values. Because of the great importance of conditions, it is vital to avoid any typing errors. A taxonomy needs to be implemented and maintained.
  • Search—Once your documentation reaches a certain level of complexity, you will need a way of finding and identifying conditions in your content management system. For example, you will need to find all the conditions used in a specific map and locate all the occurrences of the conditions.
  • Editing conditional content—When working on the actual text, it is desirable or even necessary that the author is fully aware of the different conditions in use (and active for different deliverables). If your XML editor has a DITA WYSIWYG view, it may be able to not only show the conditions prominently, but also to gray out or hide “inactive” content (that is, content that will be invisible under a given set of conditions).
    However, this only addresses part of the issue, as the filtering context cascaded from outside the topic is still missing. A full solution needs to go beyond the individual topic. It needs to make the topic editing tool aware of the full filtering context from the given map structure, and it needs to support the association and interpretation of specific filter rules (DITAVAL specifications) with the map.
  • Review—You need to identify an appropriate review strategy for different scenarios. If a topic is reused in different deliverables with different conditions, you need to choose if you have the topic reviewed multiple times in the different contexts (possibly multiplying the complexity of the feedback if the reviewers are not the same) or if you review the topic only a single time, showing the reviewer the different conditions.
  • Status—In a controlled environment, your topics are typically in a status such as “work,” “reviewed” or “approved.” Because the topic is the base unit of most DITA content management systems, it is usually not possible to manage the status for smaller units (such as paragraphs or sentences).
    In the context of filtering this means that a topic can only move to an approved status if every element, including all its conditional elements, has been approved. In other words, for a topic to move to the “approved” status, it is not enough that such a topic passes the approval process in only one context, because the elements filtered out have not been reviewed or approved. A topic can only obtain the approved status when every element has been exposed to review and approval at least once.
    If a topic is developed for different deliverables with different timelines, this creates an obvious dilemma and forces you to synchronize certain timelines in your deliverables.
  • “Dead wood”—Over the lifetime of a reused topic, it may be added and removed from various maps, and its cascading conditions may change. In both cases, some of the topic’s conditional content may become obsolete because the respective filtering conditions have changed. Such “dead wood” is not only obstructing the author, it also means additional cost for translations that will never be used in any context.
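
The need for controlled values in the list above can be illustrated with a short check. This is a sketch under stated assumptions: the taxonomy in `ALLOWED` and the sample topic are hypothetical, and only four of the standard profiling attributes are inspected. It is not how any particular CMS or editor enforces controlled values.

```python
import xml.etree.ElementTree as ET

# Profiling attributes inspected by this sketch.
PROFILING_ATTRS = ("audience", "platform", "product", "otherprops")

# Hypothetical controlled taxonomy; real values would come from the CMS.
ALLOWED = {
    "audience": {"operator", "service"},
    "product": {"alpha", "beta"},
}

# Hypothetical topic with one typing error and one unknown product value.
TOPIC = """<topic id="t1"><title>Startup</title><body>
  <p audience="operator">Press Start.</p>
  <p audience="servce">Open the service panel.</p>
  <p product="alpha gamma">Applies to two products.</p>
</body></topic>"""


def uncontrolled_values(topic_xml, allowed):
    """Return sorted (attribute, value) pairs not found in the taxonomy."""
    problems = set()
    for element in ET.fromstring(topic_xml).iter():
        for attr in PROFILING_ATTRS:
            # Profiling attributes hold space-separated lists of values.
            for value in element.get(attr, "").split():
                if value not in allowed.get(attr, set()):
                    problems.add((attr, value))
    return sorted(problems)
```

Here `uncontrolled_values(TOPIC, ALLOWED)` flags the misspelled "servce" and the unknown "gamma". Passing in only the values that some current deliverable still filters on would turn the same function into a rough detector for "dead wood" candidates.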

Storing Metadata
For much of the metadata discussed so far, there is a choice where to store it:

  • in the DITA resource itself (for instance, in the topic)
  • in the parent context of the resource (as <topicmeta> in a <topicref>, for instance)
  • as separate information in the CMS

Much depends on the concrete circumstance, the capabilities of your CMS, the output transformations that you use, plus the type and amount of reuse in your content.

For metadata that is actively managed or created by the CMS, it usually makes little sense to try to store it in the DITA resources themselves. Leaving such metadata entirely to the CMS tends to be more efficient and more reliable, and it also keeps your source files tidier. Furthermore, change control of the source files becomes more complicated with each additional way in which the CMS actively modifies the XML.

For DITA-specific metadata, choosing the storage location is more difficult.

When storing the metadata inside the topic, it is closer to the content and the metadata is also automatically available in every context where the topic is being used. On the other hand, the topic owner needs to be more aware of the different present (and future) contexts where the topic might be used.

Storing the metadata in the topic’s context (such as the parent map) has the advantage of being more flexible and avoiding the need to update a topic when only its metadata change (which simplifies change control and status changes). On the downside, in the authoring interface, it is more difficult to present such metadata to the topic author, and it increases the overhead and redundancy if the same metadata needs to be replicated in various contexts.

Metadata for images are a particular challenge, for at least two reasons:

  1. Unlike with DITA XML files, in the majority of cases images do not contain any text or other attributes that could be used when searching, so descriptive metadata is almost the only way of finding an image in a database.
  2. Different image formats have different ways of storing native metadata. Therefore, it is usually not possible to consistently store and manage metadata in the image files themselves. Much therefore depends on the metadata that the CMS can handle and whether it has special extensions for managing metadata for images.

Since these challenges are both unrelated to DITA, they are not further investigated here.

Structural Metadata

Structural metadata describes the relations between the data. For DITA, it is almost as important as descriptive metadata. In the strictest sense of the word, maps are also structural metadata in their own right, as their main purpose is to arrange topics and bring them into some sort of order or hierarchy.

DITA is full of references that define a rich and intricate structure: maps with references to maps and topics (<mapref>, <topicref>), cross-references (<xref>), image references, content references, and key references (@href, @conref, and @conkeyref attributes). One of the key tasks and benefits of a CMS is that it takes care of referential integrity, at least alerting you to broken references. Ideally, a mechanism is in place that prevents references from breaking in the first place.

However, a DITA authoring environment needs to address more than just referential integrity:

  • Visualizing dependencies—When writing a topic, the author needs a lot of information from the CMS, and the information needs to be visualized in a meaningful way:
    • Outgoing references: Direct references found in the topic such as xrefs, conrefs, image references.
    • Incoming references: All references from other topics and maps to the current topic. If such a reference addresses a particular element (by ID), this information must also be known to the author.
    • References to other copies of the current topic (cloned and branched versions, but also translated versions, published versions).
    • Cascading references need to be shown; in particular, a mechanism is needed to see all objects referenced directly or indirectly from a given map.
  • Managing out-of-scope links—Another type of referential integrity check involves cross-references to topics that are not in the scope of the root map. Publishing such a map will result in broken cross-references. Your CMS needs to maintain either a specialized index or else provide a function to dynamically identify out-of-scope links in the maps that are used for independent deliverables.
  • Dynamic linking—The whole concept of keys in DITA (also known as dynamic linking or indirect references) is very powerful. But it also requires a complex interaction between the CMS and the XML editing component if the writer is to be supported. A full-fledged key resolution mechanism needs to be built into your system to enable the writer to see dynamically if and how the keys are resolved in a given (map) context. Ideally, your system also provides the option to run referential integrity checks for indirect linking, along the same lines as for direct linking (out-of-scope checking).
  • Output types—One of the promises of DITA is single sourcing. In theory, DITA topics are written without knowledge of, or reference to, any particular output type (PDF, HTML, Word, to mention a few). In reality, however, one often ends up with certain elements and attributes that are only useful in a particular type of output. Some topics are easily handled with any type of output; other topics are only fully functional in a given type of output and may need further adaptations if used in additional output types.

By the very definition of DITA, the structural metadata is stored in the source files (XML files). However, this means that the information is spread out over all the files, and a holistic view is not possible when working only with the sources. It is therefore essential that a CMS extracts the structural metadata from the source files and aggregates it, so that the structural metadata can be presented to the user in a compact fashion.
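
To make the idea of extracting and aggregating structural metadata more concrete, here is a minimal sketch. It assumes a hypothetical mini repository inlined as strings, looks only at direct @href and @conref references (no keys), and treats "in scope" as everything pulled in by <topicref> or <mapref> from the root map. A production CMS would maintain such an index persistently rather than rebuilding it on demand.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mini repository, inlined so the sketch is self-contained.
FILES = {
    "guide.ditamap": '<map><topicref href="intro.dita"/>'
                     '<topicref href="setup.dita"/></map>',
    "intro.dita": '<topic id="intro"><title>Intro</title><body>'
                  '<p>See <xref href="setup.dita#setup/step1"/> and '
                  '<xref href="legacy.dita"/>.</p></body></topic>',
    "setup.dita": '<topic id="setup"><title>Setup</title><body>'
                  '<p id="step1" conref="shared.dita#shared/warn"/>'
                  '</body></topic>',
    "shared.dita": '<topic id="shared"><title>Shared</title><body>'
                   '<p id="warn">Caution.</p></body></topic>',
    "legacy.dita": '<topic id="legacy"><title>Legacy</title><body/></topic>',
}


def build_reference_index(files):
    """Map each file to its outgoing targets and each target to its referrers."""
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for name, xml in files.items():
        for element in ET.fromstring(xml).iter():
            for attr in ("href", "conref"):
                target = element.get(attr)
                if target:
                    # Drop the element part; "#id" alone means the same file.
                    target_file = target.split("#")[0] or name
                    outgoing[name].add(target_file)
                    incoming[target_file].add(name)
    return outgoing, incoming


def map_scope(root_map, files):
    """Collect files pulled in by topicref/mapref, following nested maps."""
    scope, pending = set(), [root_map]
    while pending:
        current = pending.pop()
        if current in scope or current not in files:
            continue
        scope.add(current)
        root = ET.fromstring(files[current])
        if root.tag == "map":
            for ref in root.iter():
                if ref.tag in ("topicref", "mapref") and ref.get("href"):
                    pending.append(ref.get("href"))
    return scope


def out_of_scope_xrefs(root_map, files):
    """List (source, target) for xrefs that leave the scope of the root map."""
    scope = map_scope(root_map, files)
    found = []
    for name in sorted(scope):
        for xref in ET.fromstring(files[name]).iter("xref"):
            target = xref.get("href", "").split("#")[0]
            if target and target not in scope:
                found.append((name, target))
    return found
```

With this sample data, `build_reference_index(FILES)` shows that setup.dita is referenced from both the map and intro.dita, and `out_of_scope_xrefs("guide.ditamap", FILES)` reports the link from intro.dita to legacy.dita, which would break when the map is published on its own.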

Administrative Metadata

The third and last type of metadata, administrative metadata, is of a more technical nature.

Access Rights and Workflow Management
In a multi-user environment with dozens or hundreds of contributors, tasks and responsibilities need to be managed beyond mere convention or verbal agreement. Ideally your CMS has an integrated workflow management system that involves groups, roles, assignments, deadlines, and so on. The CMS also needs a full-fledged access right system which allows you to define who is allowed to do what at which point on a very granular level.

Source Archive
DITA content typically goes through an iterative cycle of work, review, possibly approval, and then publication, before work resumes again. For some of these steps (usually publication), a tagged copy of the entire deliverable needs to be retained.

Such “snapshots” of a deliverable can be preserved in two ways. Either you create a separate, read-only copy of all the sources comprising the book, or, if your system has a full-fledged version history, the snapshot consists of a set of metadata that identifies all objects and their respective versions, so that the full set can be retrieved at any time.

However, proper archiving requires more than just the sources. In particular:

  • Complete definition of all the conditions used for each deliverable (enter <ditavalref> in DITA 1.3—of course, the DITAVAL files need the same sort of document control as the actual sources).
  • Complete copy of the transformation (stylesheet), including all the third party libraries being used.
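
A metadata-only snapshot of the kind described above can be pictured as a small manifest. The following sketch is purely illustrative: the `Snapshot` fields, the file names, and the shape of the `ARCHIVE` version store are all assumptions, not the data model of any real CMS.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical manifest that pins versions instead of copying sources."""
    deliverable: str
    object_versions: dict    # object ID -> version pinned at publication
    ditaval_versions: dict   # DITAVAL file -> version pinned at publication
    stylesheet_version: str  # transformation, including third-party libraries


def missing_from_archive(snapshot, archive):
    """Return pinned (object, version) pairs the version store cannot supply."""
    pinned = {**snapshot.object_versions, **snapshot.ditaval_versions}
    return sorted(
        (obj, version)
        for obj, version in pinned.items()
        if version not in archive.get(obj, set())
    )


# Hypothetical version store: object ID -> set of retained versions.
ARCHIVE = {
    "intro.dita": {1, 2, 3},
    "setup.dita": {1},
    "operator.ditaval": {1, 2},
}

SNAPSHOT = Snapshot(
    deliverable="User Guide 2.0",
    object_versions={"intro.dita": 3, "setup.dita": 2},
    ditaval_versions={"operator.ditaval": 2},
    stylesheet_version="pdf-plugin 1.4",
)
```

In this example `missing_from_archive(SNAPSHOT, ARCHIVE)` reveals that version 2 of setup.dita is no longer retrievable, so the published deliverable could not be reconstructed. The point of the sketch is that the conditions and the transformation are pinned in the same manifest as the sources.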

Audit Trail and Status
Especially in a regulatory environment, but also in any quality-conscious setup, it is essential that all changes happen in a controlled manner. At all times it must be possible to track who changed what and when, and which versions were published.

Apart from guaranteeing that every source change is authorized and accounted for, the CMS must maintain an accurate record of every change, containing at least the identity of the modifying user and the exact date and time of the change.

To support a controlled document development process, a CMS will need to work with the concept of “status.” The status maintained by the CMS is most likely to be defined at the level of the individual file (topic, map, and so on). As we have seen in the discussion of filtering, this can lead to problems when different parts of a topic are in conflicting states. However, because DITA provides attributes for status information on every single element, it is worth thinking about the benefits of a more fine-grained status level and how the CMS can support it. Obviously, a more fine-grained status level is not something that one would want to maintain manually.

Finally, a meta question on the audit trail: While it captures all the content changes, to what degree can and should it capture the changes in metadata? Most systems will log all metadata changes when the metadata occur in the topic itself, treating such metadata in the same way as the content data. Some will also log any metadata change on the CMS level (for instance, changing assignments or workflow statuses). Rarely or never will the audit trail include changes to DITA metadata that are not part of the topic itself, but defined in another resource (<topicmeta> in the parent map, for instance).

Version History
The version history is closely related to the audit trail. The version history is relevant when we need to know or reconstruct the evolution of a document.

One typical use case is the content review with two special aspects:

  • Delta—Typically, reviewers expect to be shown the textual changes since the last review or last publication. Two methods are possible:
    • Continuous change tracking, where the XML editor typically inserts XML processing instructions to mark up deleted and added text. This method is more accurate, but also relies on suitable strategies and tool support for consolidation (accepting, rejecting changes). It is useful where reviewers also work in an XML-aware environment, whereas transforming the mark-up into a PDF or Word file for review is a difficult endeavor.
    • File comparison, where a comparison tool is used to compare two plain XML files. Various approaches exist, from special XML-aware comparison editors to tools that merge two or more XML versions with sophisticated use of markup tags. The file comparison approach has the advantage that it can always be performed after the fact, between any versions, and on the fly. In some cases, it also allows transformation into a traditional format with highlighting. The downside is that file comparison, especially with a complex XML format like DITA with cross-references, tables, indentation changes, and so on, is not at all trivial.
  • Justification—While the delta answers the “what,” the justification answers the “why” of an update. Especially when there are a lot of changes, reviewers may want to skip the minor linguistic alterations and concentrate on the relevant content changes. Also, for a particular review only changes in connection with a specific product update or product feature may need to be highlighted.
    While it would be overkill to attribute a justification to each individual text edit, it is helpful to choose one or more categories of metadata for each revision of a topic. As with some of the other descriptive metadata discussed before, having a taxonomy or a controlled list of values and patterns not only facilitates the implementation, but also makes it possible to use filters to select and detect topic updates of a particular kind.
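
The justification categories discussed above could be modeled as follows. This is only a sketch: the `Reason` values, the `Revision` fields, and the sample data are hypothetical and would have to be replaced by the organization's own controlled list.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    """Hypothetical controlled list of justification categories."""
    LINGUISTIC = "linguistic"
    FEATURE = "feature"
    SAFETY = "safety"


@dataclass
class Revision:
    topic: str
    version: int
    reasons: frozenset  # one or more Reason values for this revision
    note: str = ""


def revisions_for_review(revisions, wanted):
    """Keep only revisions justified by at least one wanted category."""
    wanted = frozenset(wanted)
    return [rev for rev in revisions if rev.reasons & wanted]


# Hypothetical revision log for three topics.
REVISIONS = [
    Revision("intro.dita", 4, frozenset({Reason.LINGUISTIC}), "typo fixes"),
    Revision("setup.dita", 7, frozenset({Reason.FEATURE, Reason.SAFETY}),
             "new interlock"),
    Revision("dose.dita", 2, frozenset({Reason.FEATURE})),
]
```

A reviewer interested only in safety-relevant changes would call `revisions_for_review(REVISIONS, {Reason.SAFETY})` and skip the purely linguistic revision of intro.dita. Because the categories come from an enumeration rather than free text, the filter cannot silently miss a revision because of a spelling variant.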

Tracing Documentation Requirements
The primary drivers behind technical documentation are high-level:

  • Business requirements; for example, enabling your customer to use the product safely and efficiently, marketing your product and strengthening your brand, reducing support requests and repairs.
  • Regulatory requirements; for example, meeting the requirements for product registration in different markets, complying with quality and industry standards.

However, derived from those high-level requirements, we also see a number of very specific detailed requirements which mandate the inclusion of very specific text or even prescribe the exact wording of certain warnings or technical details in the documentation. Especially in a regulated environment, such requirements can arise in great number.

The technical publication team is then responsible for meeting these requirements and for monitoring and reporting that all the requirements are met in the documentation, present and future. Various situations arise where we need to document for internal or external stakeholders how and where certain requirements are met in the documentation.

Appropriate metadata (what else?) is the key here, too. Depending on the nature of the requirements, it may be sufficient to attach the metadata to topics, but other cases require a more fine-grained approach where individual paragraphs or even single sentences must be highlighted. Inserting the metadata directly in the DITA source code has the advantage not only of allowing accurate and comprehensive reporting, but also of immediately alerting a writer to any existing requirements from earlier revisions, thereby reducing the risk of inadvertently changing or deleting critical parts of the documentation.

The next part of this article describes in more detail how we solved the challenge of tracing detailed documentation requirements in a DITA environment.

Case Study: Tracking Documentation Requirements

Business Requirements
In a highly regulated environment producing content for medical devices, all decisions must be documented, and documented in a way that allows finding the reasons behind the decisions afterwards, in addition to helping us keep track of plans and work during a development project.

In regulatory terms, our content is considered design output that must correspond to design input alongside the rest of the product. For the product, design input is created in the form of requirements, and the design output is verified and validated against the design input, that is, the product is checked to confirm that it corresponds to the requirements. The same regulatory requirements apply to the content, or labeling, as the content is known in regulatory language. As we have specific requirements for the content, we need to be able to trace the content to those requirements, that is, to document the implementations of those requirements.

The requirements are stored in a database, in which each requirement is defined using standardized metadata that trace them to other related requirements. Most importantly, each requirement has a unique ID.

Safety-Related Information
The most significant of the requirements that apply to the content are safety-related requirements. Safety requirements are designed to prevent hazardous situations during the use of the product. The safety requirements are part of a larger risk management context. The risk management procedures define the hazardous situations, and the requirements are the mechanism used to make sure that the product features will prevent the hazardous situations. Some of the safety-related content is also based on IEC requirements (such as IEC 62083:2009, Medical electrical equipment – Requirements for the safety of radiotherapy treatment planning systems). Risk management procedures are described in a Risk Management Plan and Risk Management Report for each development project; safety requirements are described in the form of Safety System Requirements.

Each safety requirement links to a hazard cause, that is, the sequence of events leading to a hazardous situation. Hazard causes are described in the metadata of the requirement. Hazard causes are categorized per severity of the potential hazard that may be present if certain circumstances are ignored. The hazards range from potential death to harm to the equipment. The implementations of the safety requirements, that is, the mitigations of the hazards, are certain features in the product, or in the case of labeling, they usually take the form of Warnings, Cautions, Notices, or non-labelled paragraphs, depending on the severity.

The Risk Management Report documents the relevant safety requirements in a development project, and it also documents the labeling safety requirements, together with their implementations and the locations of the implementations. Currently we have hundreds of requirements that we need to trace to implementations among thousands of topics.

Technical Requirements and Principles
Because the safety requirements vary in scope, we need to be able to use metadata mark-up on various content levels: bookmaps, maps, topics, sentences, phrases, and graphics. To be able to reliably search the metadata, create reports based on it, and batch-modify it, the metadata needs to be strictly structured and standardized.

It is also important to keep redundancy minimal. The metadata should only contain what is strictly needed to identify the purpose of the metadata and to link to external sources, such as the requirements database. Therefore, instead of repeating the requirements metadata, we only store the requirement IDs in DITA, and the rest is found in the requirements database.
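
The principle of storing only the requirement ID in DITA and looking up everything else can be sketched as below. Note the assumptions: the `<data name="requirement" value="..."/>` convention, the IDs, and the `REQUIREMENTS` dictionary standing in for the requirements database are all hypothetical and chosen for illustration. They are not necessarily the mark-up the authors adopted.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the requirements database, keyed by unique ID.
REQUIREMENTS = {
    "SSR-0042": "Warn the user before the beam is enabled.",
    "SSR-0077": "State the supported dose calculation range.",
}

# Hypothetical markup convention: a <data> element carrying only the ID.
TOPIC = """<task id="enable_beam"><title>Enabling the beam</title><taskbody>
  <context>
    <note type="warning"><data name="requirement" value="SSR-0042"/>
      Verify the room is clear before enabling the beam.</note>
  </context>
</taskbody></task>"""


def trace_requirements(topic_xml, requirements):
    """Return trace rows for tagged elements and the IDs left untraced."""
    root = ET.fromstring(topic_xml)
    # ElementTree has no parent pointer, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    rows = []
    for data in root.iter("data"):
        if data.get("name") == "requirement":
            req_id = data.get("value")
            rows.append({
                "id": req_id,
                "topic": root.get("id"),
                "element": parent_of[data].tag,
                # The requirement text is looked up, never copied into DITA.
                "text": requirements.get(req_id, "UNKNOWN REQUIREMENT"),
            })
    traced = {row["id"] for row in rows}
    untraced = sorted(set(requirements) - traced)
    return rows, untraced
```

For the sample task, `trace_requirements(TOPIC, REQUIREMENTS)` yields one row linking SSR-0042 to the warning note and lists SSR-0077 as not yet implemented. Run over all topics, the same join would produce a trace matrix directly from the sources, with the requirement text always taken from the database.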

We also needed to keep translation and localization in mind when developing the metadata approach. On a practical level, this means that the way metadata items are placed in the content needs to respect certain boundary conditions to avoid difficulties later in the translation and localization process. Metadata mark-up in the wrong place can break a sentence in two when the sentence enters the translation management system.

Past Situation
As mentioned above, we have thousands of topics in which hundreds of safety requirements are implemented. Managing all of them requires documentation, and we understood that metadata would enable us to produce it. We began to add the unique IDs of the requirements directly to the content, but in a form that would not be shown in the output. In the FrameMaker world, that meant using a condition that would be hidden in the final output. Following the transition from FrameMaker to DITA, and without a systematic metadata approach, we ended up with a plethora of creative approaches, used opportunistically whenever the need arose. To mention only a few of the means of recording metadata, we could find draft-comments formatted in whatever way each writer preferred, free-form check-in comments in the CMS, inconsistently used keywords, individual XML comments, and various <data> elements.

The positive aspect was, of course, that the metadata was recorded, but because of this variety, it was laborious to use. All reporting on the metadata had to be done manually, and every one of the recording methods had to be kept in mind when looking for the metadata. To produce the required documentation for purposes such as the Risk Management Report, we assembled a matrix in Excel. The matrix was put together manually, pulling in data from the requirements database and the content: we simply copy-pasted the requirements metadata and the implementations into the matrix. The matrix contained the requirement ID, the requirement text, part of the requirements metadata, the implementation, and the location of the implementation. Although it was possible to repeat the process in each development project, it was heavy and error-prone, especially when some of the requirements changed a few times during a project.

It started to become very clear that we needed to standardize the use of metadata, and we needed a way to link to the requirements metadata instead of manual copy-paste operations.

Our answer to this need to trace the content to various requirements by means of metadata in DITA was a specialization of the <data> element—the <vmstrace> element. It is used to mark up the content as necessary to keep track of the requirement implementations, and it can be extended to discrepancy reports.

The <vmstrace> element is a container element for the metadata related to specific requirements. Each type of requirement is represented by a specialized child element in the <vmstrace> element. For example, a system requirement is represented by a <syrs> element. Its main attribute is a reference to a unique requirement ID, which allows linking to, for instance, the requirements databases, which are separate from the CMS.

The most common, and the simplest, use case of the <vmstrace> element is to mark up individual elements, such as common paragraphs, but also Warnings or Notes. In those cases, the metadata of <vmstrace> always applies to its entire parent element.

However, the requirements are often such that they are not possible to implement by one single paragraph in the labeling. Therefore, we need a more powerful solution which also allows us to easily tag a sequence of elements (such as multiple list items) or only parts of an element (such as an individual sentence in a longer paragraph). For this purpose, the <vmstracestop> element was created. It is used to define the end-of-scope point of a specific <vmstrace>. The <vmstracestop> element uses a reference ID attribute to link itself to a specific preceding <vmstrace> element.

For the same reason of varying requirement scopes, in addition to this smallest possible context for the <vmstrace> element, we also need to be able to use it for larger contexts, such as entire topics or even maps. A <vmstrace> element placed in the <prolog> applies to the entire topic; one placed in a map’s metadata applies to an entire map.
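The two simplest scoping rules described above (a <vmstrace> inside an element applies to its parent, and a <vmstrace> in the <prolog> applies to the whole topic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production tooling: the sample topic, the requirement number SYRS100, and the function name trace_scopes are hypothetical, only <syrs> children are read, and ranges closed by <vmstracestop> are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample topic; element and attribute names follow the article,
# the requirement number SYRS100 is invented for illustration.
TOPIC = """
<topic id="scaling">
  <title>Scaling Images</title>
  <prolog>
    <vmstrace><syrs syrsNumber="SYRS100"/></vmstrace>
  </prolog>
  <body>
    <p>Image scaling is necessary after extracting slices from a single film.</p>
    <note type="warning"><vmstrace><syrs syrsNumber="SYRS123"/></vmstrace>
      <p>Ensure that the image scaling is correct.</p></note>
  </body>
</topic>
"""


def trace_scopes(xml_text):
    """Return (requirement ID, scope) pairs for every <vmstrace> in a topic."""
    root = ET.fromstring(xml_text)
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for trace in root.iter("vmstrace"):
        parent = parent_of[trace]
        # A trace in the prolog covers the topic; otherwise it covers its parent.
        scope = "topic" if parent.tag == "prolog" else parent.tag
        for req in trace.findall("syrs"):
            results.append((req.get("syrsNumber"), scope))
    return results


print(trace_scopes(TOPIC))
# [('SYRS100', 'topic'), ('SYRS123', 'note')]
```

Because only the requirement ID is read from the markup, everything else about a requirement still has to come from the requirements database.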

Once the metadata was in place, we wanted to be able to create PDF outputs indicating where <vmstrace> elements were used, and to create metadata reports in tabular format. The PDFs help us and reviewers understand how we have implemented requirements by showing the implementations in context, and the tabular reports serve as the required documentation for risk management.

PDF Output Showing Metadata
The following example shows the most common use case of <vmstrace>. In the metadata output, the red color in the Warning text and the requirement number (SYRS123.45) in the margin next to the Warning indicate the use of <vmstrace> to trace the Warning to a particular requirement.

Looking at the XML of this example, the <vmstrace> is used as follows.

<title>Scaling Images</title>
<p>When importing images … scaling of the images is essential.</p>
<p>Image scaling is necessary after extracting slices from a single film.</p>
<note type="warning"><vmstrace><syrs syrsNumber="SYRS123"/></vmstrace><p>It is extremely important that you ensure that the image scaling is correct for each of your image modalities. If the images are incorrectly scaled, a mistreatment could occur.</p></note>

The example above shows <vmstrace> elements added to an entire paragraph (in this case a <note>). To mark up only part of a paragraph, such as an individual sentence, the <vmstracestop> element is used together with <vmstrace>. This partnership creates restrained metadata. The following example shows the PDF output of restrained metadata. Again, the red text at the end of the paragraph shows where <vmstrace> and <vmstracestop> are added; the corresponding requirement number is shown in the margin.


The XML coding of the above example looks like this:

<p>Verification plans are created using either of the following methods:</p>
<ul>
<li>Verification using a phantom</li>
<li>Verification using portal dose prediction</li>
</ul>
<p>The fields contained in a verification plan have the same geometry and accessories as their counterparts in the original plan. <vmstrace id="trace1"><syrs syrsNumber="SYRS234"/></vmstrace>The field angles in verification plans are always defined in the IEC scale. <vmstracestop trace_id="trace1"/> The MU, modulation … verification plan.</p>
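To show how a reporting tool might resolve such a restrained range, the following Python sketch collects the text between a <vmstrace> and the <vmstracestop> that refers to it. It is a minimal illustration, not the CMS implementation: the sample paragraph is a shortened stand-in for the XML above, and the helper names walk and traced_ranges are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the paragraph shown in the article.
PARA = (
    '<p>Text before the traced sentence. '
    '<vmstrace id="trace1"><syrs syrsNumber="SYRS234"/></vmstrace>'
    'The field angles in verification plans are always defined in the IEC scale. '
    '<vmstracestop trace_id="trace1"/> Text after the traced sentence.</p>'
)


def walk(elem):
    """Yield ("open", element) and ("text", string) events in document order."""
    yield ("open", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def traced_ranges(xml_text):
    """Return (trace id, requirement IDs, traced text) for each closed range."""
    root = ET.fromstring(xml_text)
    open_traces = {}  # trace id -> (requirement IDs, text fragments so far)
    results = []
    for kind, value in walk(root):
        if kind == "text":
            # Text belongs to every range that is currently open.
            for _, fragments in open_traces.values():
                fragments.append(value)
        elif value.tag == "vmstrace" and value.get("id"):
            req_ids = [s.get("syrsNumber") for s in value.findall("syrs")]
            open_traces[value.get("id")] = (req_ids, [])
        elif value.tag == "vmstracestop":
            entry = open_traces.pop(value.get("trace_id"), None)
            if entry is not None:
                req_ids, fragments = entry
                text = " ".join("".join(fragments).split())  # normalize spaces
                results.append((value.get("trace_id"), req_ids, text))
    return results


print(traced_ranges(PARA))
```

For the sample paragraph, the only range returned is trace1 with requirement SYRS234 and the sentence about the IEC scale. Because each open range collects text independently, overlapping ranges are handled as well, although highlighting them in the PDF remains a separate problem.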

Using colors for highlighting and placing metadata in the margin has the advantage of leaving the original page layout intact. However, the use of coloring to indicate metadata and requirement implementations also has some obvious limitations. If multiple <vmstrace> elements overlap, appropriate highlighting becomes difficult as colors cannot be easily combined. More accurate highlighting methods can easily be imagined, but typically they will need access to some advanced features (such as Adobe’s PDF comments) or they will have an impact on the pagination (inserting markers directly in the text).

To document the requirements implementation, as needed in the Risk Management Report, the metadata retrieved from the requirements database and the metadata retrieved from the CMS are merged into one tabular report. The CMS can output an Excel report for a given bookmap, which lists the metadata IDs and their positions plus excerpts of text where the metadata is found within the topics of the bookmap. The requirements database produces a similar Excel report, listing the requirements and requirement metadata, sorted by the requirement IDs. It is then a matter of linking the metadata ID field and the requirement ID field together in order to generate one unified report which contains all the requirements together with the information about their implementations in the labeling.
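The linking step amounts to a join on the ID field. The Python sketch below illustrates the idea with plain dictionaries in place of the two Excel exports. All column names, rows, and the function name merge_reports are hypothetical and chosen only for illustration; the actual report layouts are not reproduced here.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two Excel exports.
requirement_rows = [
    {"requirement_id": "SYRS234", "text": "Hypothetical requirement B"},
    {"requirement_id": "SYRS123", "text": "Hypothetical requirement A"},
    {"requirement_id": "SYRS999", "text": "Hypothetical requirement C"},
]
cms_rows = [
    {"metadata_id": "SYRS123", "topic": "Scaling Images",
     "excerpt": "It is extremely important that you ensure ..."},
    {"metadata_id": "SYRS234", "topic": "Verification Plans",
     "excerpt": "The field angles in verification plans ..."},
]


def merge_reports(requirement_rows, cms_rows):
    """List every requirement with its implementations; None marks a gap."""
    implementations = defaultdict(list)
    for row in cms_rows:
        implementations[row["metadata_id"]].append(row)

    merged = []
    for req in sorted(requirement_rows, key=lambda r: r["requirement_id"]):
        hits = implementations.get(req["requirement_id"], [])
        if not hits:
            # Keep unimplemented requirements visible in the report.
            merged.append({**req, "topic": None, "excerpt": None})
        for hit in hits:
            merged.append({**req, "topic": hit["topic"], "excerpt": hit["excerpt"]})
    return merged


for row in merge_reports(requirement_rows, cms_rows):
    print(row["requirement_id"], "|", row["topic"], "|", row["excerpt"])
```

Starting from the requirements rather than from the content means that a requirement without any traced implementation still appears in the report, with empty implementation fields, instead of silently dropping out.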

Table 1 shows a simplified example of a tabular report.


Instructions for Using Metadata
To make the use of metadata consistent, we created metadata instructions and ran hands-on workshops for the entire technical publications team. In the instructions and workshops, in addition to the actual markup techniques, certain principles were emphasized:

  • Trace metadata for a specific purpose or to meet a specific requirement.
  • Plan the use of metadata and have the plan reviewed by all stakeholders.
  • Be aware of implications for translation and localization. For instance, in a translation memory, the addition of metadata is seen as a content change even if the metadata is the only change in a topic. Note that markup placed incorrectly (in the middle of a sentence, for instance) can break a sentence in two when it enters the translation management system.
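The last principle lends itself to a simple automated check. The Python sketch below flags trace markers whose immediately preceding text does not end with sentence-final punctuation. It is a rough heuristic under stated assumptions, not a description of how any translation management system segments text: the function name, the punctuation list, and the sample paragraphs are hypothetical, and only the text node directly before the marker is inspected.

```python
import xml.etree.ElementTree as ET

# Assumed sentence-final characters; real segmentation rules are more complex.
SENTENCE_END = (".", "!", "?", ":")


def mid_sentence_markers(xml_text):
    """Return (tag, preceding text) for markers placed inside a sentence."""
    root = ET.fromstring(xml_text)
    flagged = []
    for parent in root.iter():
        preceding = parent.text or ""
        for child in parent:
            if child.tag in ("vmstrace", "vmstracestop"):
                before = preceding.strip()
                # Empty text means the marker starts the element: acceptable.
                if before and not before.endswith(SENTENCE_END):
                    flagged.append((child.tag, before[-30:]))
            preceding = child.tail or ""
    return flagged


GOOD = ('<p>First sentence. <vmstrace id="t1"><syrs syrsNumber="SYRS234"/>'
        '</vmstrace>Second sentence. <vmstracestop trace_id="t1"/> Third.</p>')
BAD = ('<p>The field angles <vmstrace id="t1"><syrs syrsNumber="SYRS234"/>'
       '</vmstrace>are always defined in the IEC scale.</p>')

print(mid_sentence_markers(GOOD))  # []
print(mid_sentence_markers(BAD))   # [('vmstrace', 'The field angles')]
```

A check of this kind could run before content is sent to translation, so that misplaced markers are corrected by the writer rather than discovered as broken segments later.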


Not surprisingly, metadata is one of the keys to a successful DITA implementation. The complexity of a single-sourcing, multi-channel, multi-user component-based authoring environment with liberal reuse, a variety of filtering conditions, and multiple localization needs cannot be handled without several carefully crafted layers of additional information.

The discussion in this article touched on many of the different aspects of DITA. Key insights from our own experience include the following:

  • Structured authoring also requires structured metadata. Plan carefully, implement systematically, and periodically review your metadata strategies.
  • Find the appropriate layer to handle metadata. Avoid unnecessary redundancy or manual work wherever a system can do the job for you.
  • Have a clear purpose. Plan and implement metadata with specific use cases in mind.
  • After identifying a problem, look for an appropriate solution. If DITA or your CMS does not have a ready-made solution, extend the framework and use specializations.

Master your metadata, so you will not only know the origin of your data but also be able to steer its destiny.

About the Authors

Richard Forster

Richard Forster holds a PhD in computational linguistics and is an expert in text engineering. He was the lead information architect and documentation systems engineer at Varian Medical Systems from 2010, when the DITA CMS was introduced, to 2016.


Annu Rantamäki
Varian Medical Systems, Inc.

Annu Rantamäki holds an MA in translation studies. She has been responsible for the content creation and technical publications project management at Varian Medical Systems since 1998. She was actively involved in the transition to DITA and has been continuously collaborating in the development and establishment of authoring best practices for Varian’s DITA CMS environment.