What’s the Big Deal?—Just Cut and Paste?
When converting documents from publishing systems like Microsoft Word, Quark, or Adobe FrameMaker, many people cut and paste from the original documents. Yet cutting and pasting can prove inaccurate, time-consuming, and costly.
Over the years, people have said to me: “What’s the big deal with data conversion? Why not simply cut and paste? It’s easy to do and doesn’t cost anything.” They think that once they’ve pulled out the text all they need to do is throw a couple of tags around it and the job is done. While that’s partially true, the devil is in the details.
The truth is that the old adage “five percent of the details take up ninety-five percent of the effort” applies in the world of document conversion.
You are likely to come across a number of difficulties when using cut and paste to convert documents to XML. In a collection of simple documents, the difficulties may not be a big deal, but the cost of cleaning up the “little details” in a collection of complex documents can be substantial.
Special Characters and Emphasis
While standard text is typically extracted accurately using cut and paste, special characters such as mathematical symbols and non-Latin letters can be dropped or altered in the course of moving into an XML environment.
So while a registered trademark symbol might transfer without any problem, a Greek Alpha will often be converted to a plain capital “A.” If that character occurs 300 times in a document, you will need to find and fix all 300 of them manually.
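One way an automated process can protect such characters is to write them as XML numeric character references, so they no longer depend on fonts or clipboard encodings. The following is a minimal sketch only; the function names and the sample sentence are made up for illustration.

```python
from xml.sax.saxutils import escape


def to_xml_safe(text: str) -> str:
    """Escape XML markup characters, then turn every non-ASCII
    character into a numeric character reference."""
    escaped = escape(text)  # handles &, < and >
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def count_non_ascii(text: str) -> int:
    """Count the characters that are at risk in a careless transfer."""
    return sum(1 for ch in text if ord(ch) > 127)


# Hypothetical sample: a Greek alpha and a registered trademark symbol.
sample = "The angle \u03b1 is measured in degrees; Widget\u00ae is a trademark."
converted = to_xml_safe(sample)
print(converted)
# The angle &#945; is measured in degrees; Widget&#174; is a trademark.
```

Running `count_non_ascii` on the source text before conversion gives a quick tally of how many special characters should be present in the output.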
Likewise, emphasis such as bold, italic, small caps, underline, superscript, and subscript does not convert well with cut and paste. In most cases, you will not end up with the tagging needed to represent the desired emphasis in the resulting XML document.
Superscripts deserve special mention because they are heavily used in documentation, both as hyperlink anchors and as footnote markers, especially within tables and journal bibliographies. If these links are lost in cut-and-paste conversion, reconnecting them can be difficult.
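To illustrate what carrying emphasis across looks like, here is a small sketch that assumes the source text has already been read as a list of formatted runs, each a piece of text plus a set of style names. That input shape, the style names, and the tag names (`b`, `i`, `sup`, `sub`, `p`) are all assumptions for the example; a real project would use whatever its source reader and target DTD or schema define.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from source style names to target tag names.
TAG_FOR_STYLE = {
    "bold": "b",
    "italic": "i",
    "superscript": "sup",
    "subscript": "sub",
}


def runs_to_xml(runs):
    """Wrap each run of text in the tags that represent its styles."""
    parts = []
    for text, styles in runs:
        fragment = escape(text)
        # Sort the styles so nesting order is the same on every run.
        for style in sorted(styles):
            tag = TAG_FOR_STYLE[style]
            fragment = f"<{tag}>{fragment}</{tag}>"
        parts.append(fragment)
    return "<p>" + "".join(parts) + "</p>"


runs = [
    ("Water is H", set()),
    ("2", {"subscript"}),
    ("O, ", set()),
    ("not", {"bold", "italic"}),
    (" HO.", set()),
]
print(runs_to_xml(runs))
# <p>Water is H<sub>2</sub>O, <i><b>not</b></i> HO.</p>
```

A plain-text paste of the same sentence would read "Water is H2O, not HO." with the subscript and the emphasis gone.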
Technical documents tend to contain many tables. If the table is relatively simple (and has little in the way of column or row spanning), the text in the table cells, as well as the tab characters between cells, will be retained.
However, the majority of tables in technical documents are made up of much more than cell contents. Properties such as spanning, alignment, header row designation, and cell borders often do not convert accurately using cut and paste. When they are lost, you must reinsert these important properties by hand.
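The gap between pasted table text and a properly tagged table can be sketched as follows. The cell dictionaries, the sample data, and the element and attribute names (`table`, `tr`, `th`, `td`, `colspan`, `align`) are assumptions chosen for the example; a real target might be a CALS table model or something else entirely.

```python
import xml.etree.ElementTree as ET


def as_pasted_text(rows):
    """What cut and paste typically keeps: cell text and tabs only."""
    return "\n".join("\t".join(cell["text"] for cell in row) for row in rows)


def build_table(rows, header_rows=1):
    """Build a table element that also records spans, alignment,
    and which rows are header rows."""
    table = ET.Element("table")
    for index, row in enumerate(rows):
        tr = ET.SubElement(table, "tr")
        for cell in row:
            name = "th" if index < header_rows else "td"
            element = ET.SubElement(tr, name)
            element.text = cell["text"]
            if cell.get("colspan", 1) > 1:
                element.set("colspan", str(cell["colspan"]))
            if "align" in cell:
                element.set("align", cell["align"])
    return table


# Hypothetical two-row table with one spanning header cell.
rows = [
    [{"text": "Part"}, {"text": "Torque", "colspan": 2}],
    [{"text": "Bolt A"}, {"text": "12", "align": "right"}, {"text": "Nm"}],
]
print(as_pasted_text(rows))
print(ET.tostring(build_table(rows), encoding="unicode"))
```

The first print shows only tab-separated text, where nothing says that "Torque" covers two columns. The second keeps that information as attributes on the cells.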
If you do decide to use cut and paste to convert your documents, someone will likely need to manually insert the necessary tagging into the resulting document. This work requires people with some XML training and a good understanding of all the rules and tagging requirements for your particular markup environment. These requirements make it difficult to put together a scalable process.
Beyond the added work, people are rarely 100 percent correct or consistent when tagging by hand. In the majority of cases, it makes sense to automate the bulk of a conversion, a move that significantly reduces tagging inconsistencies.
Technical documentation is typically filled with potential XML hyperlinks, such as “See Figure 2.2.7” or “Refer to Step 12.” These references need to be set up. Cut-and-paste conversion will always require the manual setup of hyperlinks, even if the hyperlinks have already been created in the original document. This process is labor-intensive and a likely source of rework.
Automated conversions, on the other hand, can retain the hyperlinks in the original document and can create links even where only text existed before. A plain-text reference such as “See Figure 2.2.7” in the original document is all that is required to produce the desired hyperlink.
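As a rough illustration of how plain-text references can become links, the sketch below uses a regular expression to find phrases like “Figure 2.2.7” or “Step 12” and wrap them in a link element. The `xref` element, its `target` attribute, and the ID pattern are assumptions for the example, not a description of any particular conversion tool.

```python
import re
from xml.sax.saxutils import escape

# Matches "Figure 2.2.7", "Table 3", "Step 12" and similar references.
REFERENCE = re.compile(r"\b(Figure|Table|Step)\s+(\d+(?:\.\d+)*)")


def link_references(text: str) -> str:
    """Wrap each recognized reference in a hypothetical <xref> element."""

    def replace(match):
        kind, number = match.group(1), match.group(2)
        # Assumed ID pattern: lowercase kind, dots replaced by hyphens.
        target = f"{kind.lower()}-{number.replace('.', '-')}"
        return f'<xref target="{target}">{match.group(0)}</xref>'

    return REFERENCE.sub(replace, escape(text))


print(link_references("See Figure 2.2.7 or refer to Step 12."))
# See <xref target="figure-2-2-7">Figure 2.2.7</xref> or refer to <xref target="step-12">Step 12</xref>.
```

Because the same rule runs over every paragraph, every reference of the same form is linked the same way, which is hard to guarantee when the links are typed in by hand.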
When it comes to hyperlinking, automated conversion is especially useful for large and complex procedures and for converting maintenance manuals.
Besides creating hyperlinks, you will need to create the IDs that those hyperlinks point to. When tagging IDs, you must follow a consistent pattern. This consistency is straightforward when using automated conversion, but a nightmare when using cut and paste.
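A consistent ID pattern also makes it possible to check the links mechanically. The sketch below assumes a simple pattern (lowercase element kind, then the number with dots replaced by hyphens) and hypothetical `id` and `target` attributes on made-up elements; it reports duplicate IDs and links whose targets do not exist.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def make_id(kind: str, number: str) -> str:
    """Build an ID from an assumed pattern, e.g. Figure 2.2.7 -> figure-2-2-7."""
    return f"{kind.lower()}-{number.replace('.', '-')}"


def check_links(xml_text: str):
    """Return (duplicate IDs, link targets that match no ID)."""
    root = ET.fromstring(xml_text)
    id_counts = Counter(el.get("id") for el in root.iter() if el.get("id"))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    targets = {el.get("target") for el in root.iter("xref") if el.get("target")}
    dangling = sorted(targets - set(id_counts))
    return duplicates, dangling


# Hypothetical document: one good link and one link to a missing step.
doc = f"""<manual>
  <figure id="{make_id('Figure', '2.2.7')}"/>
  <step id="{make_id('Step', '12')}"/>
  <p>See <xref target="figure-2-2-7">Figure 2.2.7</xref> and
     <xref target="step-13">Step 13</xref>.</p>
</manual>"""
print(check_links(doc))
# ([], ['step-13'])
```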
Other Special Mark-Up Requirements
During cut-and-paste conversion, you run the risk that important information could be lost. Most at risk is information “buried” in source documents, such as tables of contents and indexes. These would need to be maintained in some way by the target mark-up environment.
In addition, there may be other embedded pieces of information required for final output, such as GUIDs and LOINC codes for pharmaceutical SPL documents. These codes may be stored in document header fields or in other documentation sets, such as Excel spreadsheets. With the cut-and-paste approach, this information would need to be inserted manually.
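As a sketch of how such information might be merged in automatically, the example below reads section codes from a lookup table exported as CSV and attaches a code and a freshly generated GUID to each section. Everything specific here is an assumption for illustration: the CSV columns, the element and attribute names (which are not the actual SPL schema), and the codes themselves, which are placeholders rather than real LOINC codes.

```python
import csv
import io
import uuid
import xml.etree.ElementTree as ET

# Stand-in for a spreadsheet export; the codes are placeholders.
CODES_CSV = """section_title,code
Indications and Usage,00000-1
Adverse Reactions,00000-2
"""


def load_codes(csv_text: str) -> dict:
    """Read the title-to-code lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["section_title"]: row["code"] for row in reader}


def build_section(title: str, codes: dict) -> ET.Element:
    """Create a section carrying a new GUID and its looked-up code."""
    if title not in codes:
        # Fail loudly so a missing code is never silently left out.
        raise KeyError(f"no code listed for section {title!r}")
    section = ET.Element("section", {"id": str(uuid.uuid4())})
    ET.SubElement(section, "code", {"value": codes[title]})
    ET.SubElement(section, "title").text = title
    return section


codes = load_codes(CODES_CSV)
section = build_section("Adverse Reactions", codes)
print(ET.tostring(section, encoding="unicode"))
```

Raising an error for an unlisted title is a deliberate choice in this sketch: a gap in the lookup table surfaces during conversion rather than after delivery.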
The cut-and-paste approach is feasible, and may even make sense, for small and simple documentation sets. For large or complex documentation sets, the “little details” loom larger, and automated conversion should be seriously considered for faster, more accurate, and more complete tagging.
Data Conversion Laboratory, Inc.
Michael Gross is Chief Technology Officer and Director of Research and Development for Data Conversion Laboratory. He is responsible for all software-related issues, including product evaluations, feasibility studies, technical client support, and management of in-house software development. He has been solving digital publishing conversion problems at DCL for twenty years and has overseen thousands of legacy conversion projects.