Tools of the Trade: Part One of a Series. Braving the Wilds of Document Structure Specification
Function over Form
Have you ever walked between structures in a carefully planned campus of buildings and noticed the ruts in the lawn where pedestrians have left the sidewalk? Whether artful or pedantic in design, no matter how well thought out, the superimposed structure of the walkway succumbs to the relentless force of the natural traffic pattern of the pedestrians who traverse its course.
Some time ago, inspired by this phenomenon, the campus building and planning department of a major university undertook a bold experiment. Upon completing a new building surrounded on all sides by a large field of grass, the project manager elected to postpone the construction of concrete sidewalks connecting the new building to the rest of the campus. Not until students, faculty members, and others who use the building had worn paths in the surrounding turf did the construction of sidewalks along these same paths begin. The resulting pedestrian walkway design followed a structure in perfect harmony with the natural patterns of traffic specific to the building that the walkways serve. To accomplish this, it was necessary to allow for a period of chaos so that the inherent order might become apparent. From the unstructured pedestrian traffic sprang the correct structure for the new set of sidewalks. Cacophony begot harmony.
The process of transforming a documentation base, or information set of any kind, from disorder to order is much the same. Successful content management, the kind that promotes content reuse and repurposing to multiple output formats, requires well-ordered structure. Tagged text languages such as Extensible Markup Language (XML) and Standard Generalized Markup Language (SGML) can provide order through the alignment of documents around a common structural design, often called a schema. While there are some generic industry standards, there is certainly no one perfect schema. Attempts to create “one size fits all” structural designs have produced schemas that are either constrained to the point of being unusable in the production environment or so lenient as to be valueless in enforcing orderly, unambiguous content.
The key to success in this endeavor is allowing the natural structure to emerge. The challenge is developing a method of recognizing this structure and shaping it into a formalized, effective schema. The project manager of the new campus building needed only to climb a nearby hill to find a vantage point. Where will you find yours?
This, the first article in a series, embarks on a journey of discovery. We’ll explore the use of software tools to amplify and illuminate the inherent structure in existing documentation, molding and redefining it as a truly workable schema for structured documents. Some of the tools we’ll explore in this series of articles, such as Adobe FrameMaker+SGML and SoftQuad XMetaL, might also be used by content authors to create documentation. We’ll also discuss the use of tools, such as XML Authority, that authors would not typically employ in the creation of content.
The metamorphosis from content chaos to content management has three distinct phases: content audit, tool-independent technology, and tool-specific solution architecture. This series of articles assumes that you are already well into the content audit process, possibly with the direct help of an expert such as JoAnn Hackos, Ann Rockley, or Elizabeth Anders, to name a few. While it is true that you know your own content best, it may be important for you to have the trained eye of a content audit expert to assist you in deciphering patterns that you might otherwise not notice. It is also important to note that while this series of articles is framed around the use of tools, the process being employed is squarely in the realm of the tool-independent technology of tagged text languages. We use tools to help manage this task, but our deliverable is tool independent. It is not until the solution architecture phase that you will determine the specific tools appropriate to your situation and requirements.
Discovering the Structure Within
Because each documentation base is unique, your discovery process will also be unique. Your requirements may vary greatly from those of your contemporaries, and so may the resources available to you. Many factors may affect the process you employ in defining a schema for your structured documents. You may even define multiple schemas designed to interoperate with each other. There are advantages and disadvantages to each of the many approaches you might take. We’ll discuss some of them in the course of this series of articles.
In spite of the wide variety and potential complexity of documentation bases, to begin to understand how to extract and refine the natural, implicit structure in unstructured documents, we’ll start with sample content that is succinct and not overly complex. We’ll use a short essay composition. This content is not meant to be representative of your documentation base or necessarily an expression of the opinions of the author or publisher.
Converting Unstructured Documents
This article assumes that your documentation base is composed of unstructured documents. These documents might be in Microsoft Word, Adobe FrameMaker, or some other proprietary format. If your documents are not in Word or FrameMaker, you’ll need to be able to convert your files to Rich Text Format (RTF) to use FrameMaker+SGML with your content.
The sample content, “The rise and fall of the dot com economy,” was created as an unstructured RTF document in Microsoft Word. But studying the content reveals implicit structure that we can use to convert it to a structured document (see the figure below).
Given that the sample content document is representative of a sample documentation base, we can use it to establish an initial structural design for all of the documents in the set. Undoubtedly, much iteration will be required, but we can expect to build a working draft from which to begin the process. We’ll start by deducing the appropriate structure for the sample content by examining the implicit structure suggested by the content. Review the labels to the left of the sample content in the previous section.
Note that the sample content contains six major sections: Title, Intro, Thesis (statement), Con (arguments against the thesis statement), Pro (arguments supporting the thesis statement), and Conclusion.
Two of these sections have sub-sections. Con includes the sub-sections Concession and Rebuttal. Pro includes the sub-sections Argument and Info. Note that Rebuttal always follows Concession, but Info does not always follow Argument. Using this information, garnered by analyzing the unstructured content, we’ll create a schema that represents the document structure.
We’ll use a kind of shorthand to describe these relationships. The contents of Essay can be described as:
Title, Intro, Thesis, Con, Pro, Conclusion
The description of Con is a little more complex:
(Concession, Rebuttal)+
The description for Pro is as follows:
(Argument, Info?)+
This shorthand is an industry standard for expressing structural design (see the table below). Items separated by commas, as in the description for Essay, occur once and in the defined order. Parentheses indicate a grouping of items. A plus sign indicates that an item, or group of items, appears one or more times. The description of Con uses this type of notation to indicate that one or more instances of Concession followed by Rebuttal are contained within the Con item. Finally, the notation for Pro adds a question mark to indicate that a single instance of Info may optionally follow each of one or more instances of Argument within the Pro item.
An asterisk indicates that an item or group is optional but may appear more than once. So, for instance, if the notation for Pro read (Argument, Info*)+ instead, any number of Info items could optionally follow each of one or more instances of Argument within Pro, rather than at most one Info per Argument.
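Because the commas, parentheses, plus signs, question marks, and asterisks of the shorthand behave much like the operators of a regular expression, the rules can be tried out in a few lines of Python. The following is only a minimal sketch under that assumption: the function names model_to_regex and conforms are our own inventions, and a real SGML or XML parser does far more than this.

```python
import re

def model_to_regex(model):
    """Translate content-model shorthand into a regular expression.

    Each element name becomes a token made of the name plus one trailing
    space, so that "Con" can never match the start of "Conclusion".
    """
    compact = re.sub(r"\s+", "", model)            # spacing carries no meaning
    pattern = re.sub(r"[A-Za-z]\w*",
                     lambda m: "(?:" + m.group(0) + " )",
                     compact)
    # A comma means "followed by", which is plain concatenation in a regex.
    # Parentheses, +, ?, * and | already mean the same thing in both notations.
    return re.compile(pattern.replace(",", ""))

def conforms(model, children):
    """Return True if the list of child element names satisfies the model."""
    sequence = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None

# The three descriptions from the article.
ESSAY = "Title, Intro, Thesis, Con, Pro, Conclusion"
CON = "(Concession, Rebuttal)+"
PRO = "(Argument, Info?)+"

print(conforms(ESSAY, ["Title", "Intro", "Thesis", "Con", "Pro", "Conclusion"]))
print(conforms(CON, ["Concession", "Rebuttal", "Concession", "Rebuttal"]))
print(conforms(CON, ["Concession"]))                  # Rebuttal is missing
print(conforms(PRO, ["Argument", "Argument", "Info"]))
print(conforms(PRO, ["Argument", "Info", "Info"]))    # two Info items in a row
print(conforms("(Argument, Info*)+", ["Argument", "Info", "Info"]))
```

The third and fifth checks fail, while the last one passes because the asterisk allows any number of Info items after an Argument.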
So far we’ve been exploring the structure of the sample content document without the use of specialized tools. The simplicity of the content allowed us to manually notate structure. But schemas can become very complex, with many concatenated notations that are difficult to read in plain text form. FrameMaker+SGML provides a set of tools to help visualize and specify structural design. FrameMaker+SGML structured documents have the same look and feel as standard FrameMaker documents but additionally contain structural information.
The Element Definition Document (EDD) specifies FrameMaker+SGML document structure. This is a FrameMaker-specific file that is comparable to the SGML Document Type Definition (DTD). A DTD is a type of schema file used by SGML and some XML documents.
Confused? Don’t be.
Remember that a schema is simply a description of document structure. In SGML, this schema file is called a DTD. SGML files are tagged text files. That is to say they are plain text files with special embedded instructions that help software programs process the information contained in the files. If you’ve ever created a Web page using a text editor, then you’ve already worked with one type of tagged text file. FrameMaker+SGML files are non-text or binary versions of SGML files. These are the files referred to in the preceding paragraph as FrameMaker+SGML structured documents. FrameMaker+SGML also maintains a binary version of the DTD file. This is the EDD file. It contains the same schema information that is in a DTD file plus additional FrameMaker-specific information (see below).
FrameMaker+SGML EDD files represent structural design with a graphical depiction but also use the same notation we just learned in the preceding paragraphs.
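To make the idea of a tagged text file concrete, the sketch below holds a small XML-style instance of the essay structure in a Python string and lets the standard library read it back as a tree. The element names come from the sample content; the text inside each element is placeholder wording of our own, and the outline helper is a hypothetical name used only for this illustration.

```python
import xml.etree.ElementTree as ET

# A tagged text version of the essay. The tags are the "special embedded
# instructions"; everything else is ordinary text (placeholders here).
essay_text = """\
<Essay>
  <Title>The rise and fall of the dot com economy</Title>
  <Intro>Placeholder introduction.</Intro>
  <Thesis>Placeholder thesis statement.</Thesis>
  <Con>
    <Concession>Placeholder concession.</Concession>
    <Rebuttal>Placeholder rebuttal.</Rebuttal>
  </Con>
  <Pro>
    <Argument>Placeholder argument.</Argument>
    <Info>Placeholder supporting information.</Info>
    <Argument>Placeholder argument with no Info.</Argument>
  </Pro>
  <Conclusion>Placeholder conclusion.</Conclusion>
</Essay>
"""

essay = ET.fromstring(essay_text)

def outline(element, depth=0):
    """Return the element names as an indented outline, one name per line."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(essay)))
```

The printed outline is the same hierarchy a structure view displays graphically: Essay at the top, with Con and Pro each holding their own sub-sections.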
A schema must be lenient enough to support the author’s writing process but strict enough to enforce standards of structure. This is always a balancing act for the schema designer. At best, a schema reaches a fair compromise between conflicting philosophies. Your solution will be very specific to your individual needs and requirements. For example, your content authoring group may require two separate schemas: an editing schema and a production schema.
The schema that we’ve created for the sample content is fairly lenient. While the main sections are specified absolutely and in a specific order, the structure of the sub-sections (items contained within Con and Pro) is rather loosely drawn. How would you notate the requirement for at least two Concession sub-sections within Con and at least three Argument sub-sections within Pro without changing the existing notations for the follow-on sub-sections Rebuttal and Info? Review the preceding paragraphs on schema notation and venture a guess before reading ahead.
The notation for Con would be:
(Concession, Rebuttal), (Concession, Rebuttal)+
The notation for Pro would be:
(Argument, Info?), (Argument, Info?), (Argument, Info?)+
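The shorthand has no operator for “at least n,” so a minimum is spelled out by repeating the group, as in the two notations above. A small helper can write that repetition for any minimum. This is a sketch of our own (the name at_least is hypothetical), not a feature of FrameMaker+SGML.

```python
def at_least(group, minimum):
    """Spell out "group, repeated at least `minimum` times" in the shorthand."""
    if minimum < 1:
        raise ValueError("for a minimum of zero, write the group with an asterisk")
    # minimum - 1 required copies, then a final copy marked with a plus sign.
    return ", ".join([group] * (minimum - 1) + [group + "+"])

print(at_least("(Concession, Rebuttal)", 2))
# (Concession, Rebuttal), (Concession, Rebuttal)+
print(at_least("(Argument, Info?)", 3))
# (Argument, Info?), (Argument, Info?), (Argument, Info?)+
```

Raising the minimum lengthens the notation by one group each time, which is one reason long content models quickly become hard to read in plain text.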
Can you think of other notations? Try writing notations of greater and lesser leniency. How would the description for Pro read if we wanted the Info sub-section to be interchangeable with the Argument sub-section?
The notation for Pro would be:
(Argument | Info), (Argument | Info), (Argument | Info)+
The vertical bar indicates that any one of the elements in a group may appear. FrameMaker+SGML allows us to experiment with our structural design decisions, easily adding and deleting constraints and latitudes as we prototype our schemas. The following screenshot demonstrates how FrameMaker+SGML displays the sample content EDD for editing.
The following is the text equivalent (DTD version) of that same EDD.
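As a rough illustration of what such a text version can look like, the Python sketch below assembles DTD-style element declarations from the lenient notations for Essay, Con, and Pro. It is not a listing exported from FrameMaker+SGML: the declarations use plain XML-style syntax, the assumption that every remaining element holds only text (#PCDATA) is ours, and an SGML DTD would typically carry further detail.

```python
# Content models taken from the shorthand used in this article.
CONTENT_MODELS = {
    "Essay": "(Title, Intro, Thesis, Con, Pro, Conclusion)",
    "Con": "(Concession, Rebuttal)+",
    "Pro": "(Argument, Info?)+",
}

# Assumption: every other element contains only character data.
TEXT_ELEMENTS = [
    "Title", "Intro", "Thesis", "Concession",
    "Rebuttal", "Argument", "Info", "Conclusion",
]

def build_dtd():
    """Return one element declaration per line, containers first."""
    lines = [f"<!ELEMENT {name} {model}>"
             for name, model in CONTENT_MODELS.items()]
    lines += [f"<!ELEMENT {name} (#PCDATA)>" for name in TEXT_ELEMENTS]
    return "\n".join(lines)

print(build_dtd())
```

Even for this tiny structure the result runs to eleven declarations, one for every element that can appear in the essay.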
This sample content schema is a very brief and uncomplicated structural design. Even so, it is easy to see why the use of FrameMaker+SGML or another specialized tool is preferable. With only a little additional structure, this text version would quickly become difficult to edit and manage. One popular publishing industry DTD is over 2,000 lines in length.
Ideally, content authors write with the appropriate schema at the core of their process. Authoring applications like FrameMaker+SGML guide the author during the authoring process, enforcing or suggesting structural rules according to what the schema designer has defined. Applying structure after the fact should not be a standard part of the content author’s experience. Documentation managers, however, will at one time or another have to wrestle with the task of applying structure to unstructured documents. And schema designers will certainly need to test prototype EDDs by applying structure to existing content.
During schema prototyping, structure is best applied by manually wrapping pieces of content in structural elements. FrameMaker+SGML makes this a fairly straightforward task.
After the EDD has taken on a more stable form, custom FrameMaker+SGML conversion tables can be created to handle some of the more mundane aspects of document conversion. It is important to note, however, that the very best conversion table will do about 80 percent of the conversion, and that on a good day. Because the automated conversion process relies on the implied structure in unstructured documents, a certain amount of manual correction will always be necessary. Fortunately, mass document conversion is a one-time proposition. Once your document base is in SGML or XML, it can be transformed automatically into any required format. Documentation managers may want to employ writers in the document conversion process. The fine-tuning process benefits greatly from the judgment and skill that writers bring to the task.
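The idea behind a conversion table can be sketched in Python: paragraph styles in the unstructured document are mapped to elements, and anything the table does not recognize is set aside for a person to handle. This is not FrameMaker+SGML’s conversion table format. The style names, the two mapping dictionaries, and the convert function are all hypothetical, chosen only to show why some manual correction remains.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph styles mapped to the essay's element names.
STYLE_TO_ELEMENT = {
    "EssayTitle": "Title",
    "Opening": "Intro",
    "ThesisStatement": "Thesis",
    "Concession": "Concession",
    "Rebuttal": "Rebuttal",
    "Argument": "Argument",
    "Support": "Info",
    "Closing": "Conclusion",
}

# Sub-sections that must be wrapped in a container element.
CONTAINER_FOR = {
    "Concession": "Con", "Rebuttal": "Con",
    "Argument": "Pro", "Info": "Pro",
}

def convert(paragraphs):
    """Wrap (style, text) pairs in elements; collect what cannot be mapped."""
    essay = ET.Element("Essay")
    needs_review = []
    for style, text in paragraphs:
        name = STYLE_TO_ELEMENT.get(style)
        if name is None:
            needs_review.append((style, text))   # left for manual correction
            continue
        parent = essay
        container = CONTAINER_FOR.get(name)
        if container is not None:
            children = list(essay)
            # Keep filling the container if it is the most recent child;
            # otherwise open a new one.
            if children and children[-1].tag == container:
                parent = children[-1]
            else:
                parent = ET.SubElement(essay, container)
        ET.SubElement(parent, name).text = text
    return essay, needs_review

sample = [
    ("EssayTitle", "The rise and fall of the dot com economy"),
    ("Opening", "Placeholder introduction."),
    ("ThesisStatement", "Placeholder thesis statement."),
    ("Concession", "Placeholder concession."),
    ("Rebuttal", "Placeholder rebuttal."),
    ("Argument", "Placeholder argument."),
    ("Support", "Placeholder supporting information."),
    ("Normal", "A paragraph whose style says nothing about its role."),
    ("Closing", "Placeholder conclusion."),
]

essay, needs_review = convert(sample)
print([child.tag for child in essay])
print(needs_review)
```

The paragraph tagged with the generic style ends up in needs_review rather than in the tree, which is the kind of leftover that calls for a writer’s judgment.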
This article has only scratched the surface of FrameMaker+SGML and schema development. Subsequent articles in this series will investigate the native SGML DTD in depth. We’ll also compare and contrast XML with SGML and continue to explore the elusive butterfly of automatic document conversion.
About the Author