Automating XML Checks: A Metric-Driven Approach to Quality Improvement
Documentation practices at MathWorks changed rapidly during a two-year beta period of instituting a new documentation design. The mountain of new writing guidelines was matched only by the mountain of legacy documentation that required cleanup. We needed to understand, prioritize, communicate, and track the problems that needed fixing by the end of the beta, as well as guide writers through the huge cleanup effort. This article describes how automatic markup checks, implemented using Schematron and XQuery, provided relevant metrics that contributed to a successful and more efficient transition. Even if a writing group’s markup is not in transition, automated markup checks can help writers adhere to that group’s existing guidelines.
Although our group of over 50 technical writers and editors had been using an XML-based publishing system for more than eight years, transitioning our documentation to new standards posed these challenges:
- All content, the source for over 40,000 HTML files, needed to migrate to the new design by Fall 2012.
- Writers and editors needed to assimilate many concepts, guidelines, and XML markup practices that were new to them.
- The guidelines changed during the transition, as we explored and learned. Early-adopter writers needed to revisit their work and didn’t always know how the guidelines had changed under them.
- A specific category of problems involved links that went to targets that were present but undesirable as link targets. The links were not broken, so a standard HTML link checker could not help us find them.
Use of Schematron and XQuery
The automated checks we implemented had three facets: Schematron rules, XQuery reports, and a numerical summary report.
From the Arbortext Editor authoring environment, writers launched a process that checked their in-progress documents against a set of markup rules. The resulting report enabled writers to navigate to each problematic site in their document.
Behind this feature were 33 markup rules that we designed, along with error or warning messages to guide writers to solutions. The implementation used the Schematron validation language. Schematron is a concise markup language for expressing patterns you want to report on. For those familiar with XPath, Schematron is easy to learn.
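The idea of a rule as a context pattern, a test, a severity, and a message can be sketched in Python. This is not Schematron syntax and not one of the 33 rules, which this article does not list. The element names (`section`, `title`, `para`), the rule IDs, and the messages are hypothetical, and the sketch uses only the limited XPath subset that the Python standard library supports.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    rule_id: str                         # hypothetical identifier
    severity: str                        # "error" or "warning"
    context: str                         # path selecting the elements to test
    test: Callable[[ET.Element], bool]   # returns True when the element is acceptable
    message: str                         # guidance shown to the writer


# Hypothetical rules, for illustration only.
RULES = [
    Rule("section-title", "error", ".//section",
         lambda el: el.find("title") is not None,
         "A section must contain a title."),
    Rule("empty-para", "warning", ".//para",
         lambda el: bool((el.text or "").strip()) or len(el) > 0,
         "A paragraph should not be empty."),
]


def check_document(xml_text: str, rules: List[Rule] = RULES) -> List[dict]:
    """Apply each rule to every element matching its context path."""
    root = ET.fromstring(xml_text)
    findings = []
    for rule in rules:
        for el in root.findall(rule.context):
            if not rule.test(el):
                findings.append({
                    "rule": rule.rule_id,
                    "severity": rule.severity,
                    "element": el.tag,
                    "message": rule.message,
                })
    return findings


SAMPLE = """<doc>
  <section><title>Intro</title><para>Text.</para></section>
  <section><para></para></section>
</doc>"""

for finding in check_document(SAMPLE):
    print(finding["severity"], finding["rule"], finding["message"])
```

Keeping the severity and message next to the test mirrors how a rule can both locate a problem and tell the writer what to do about it.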
Schematron complemented but did not replace our DTD. We kept our DTD relatively flexible, so we could:
- Experiment with design ideas
- Refactor content more easily
- Phase in new markup guidelines without instantly making all legacy documents invalid
Complementing the flexible DTD with Schematron rules enabled us to:
- Describe detailed content models that a DTD cannot express
- Stratify our markup rules, so that Arbortext Editor strictly enforced some rules throughout the authoring process, while Schematron gave more nuanced feedback on demand (for instance, some flagged warnings could be accepted as allowed exceptions).
See Figure 1 for the process flow.
Figure 1. Mapping of Source Files to End Products
Writers viewed a Microsoft Excel spreadsheet that listed instances of errors and warnings throughout the XML source repository. The spreadsheet had columns that indicated the location of each error or warning (for instance, the path to the XML file and relevant elements and attributes) and a descriptive message.
XQuery is a language for querying XML data. We had previously developed tooling that built an XQuery library daily based on all XML source in the repository. The ability to query our source globally proved immensely valuable and versatile. For this project, we used Qizx software by XMLmind to execute a query against the library daily and produce a working report in XML format. The query codified 12 markup rules that were distinct from the 33 Schematron rules. We converted the XML report into an Excel spreadsheet for convenient viewing and filtering.
We implemented some of these 12 markup rules in XQuery rather than Schematron because determining an error condition required the tools to look beyond the document at hand. For example, links to invalid targets seemed impractical to catch using Schematron rules: a writer who runs the Schematron checks has one particular document open in Arbortext Editor, while the link targets might be in other documents.
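A minimal Python sketch shows why such a check needs a repository-wide view. It is not the XQuery we ran. The `link` element, the `targetid` and `id` attributes, and the policy of which element types are acceptable targets are all hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy: only these element types are desirable link targets.
ALLOWED_TARGET_TAGS = {"section", "refentry"}


def build_target_index(docs):
    """Map every id found in any document to (document path, element tag)."""
    index = {}
    for path, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for el in root.iter():
            target_id = el.get("id")
            if target_id is not None:
                index[target_id] = (path, el.tag)
    return index


def find_undesirable_links(docs):
    """Report links whose targets exist but are not allowed as targets."""
    index = build_target_index(docs)
    problems = []
    for path, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for link in root.iter("link"):
            target_id = link.get("targetid")
            if target_id not in index:
                continue  # a broken link: an ordinary link checker finds these
            target_path, target_tag = index[target_id]
            if target_tag not in ALLOWED_TARGET_TAGS:
                problems.append((path, target_id, target_path, target_tag))
    return problems


# Two tiny in-memory documents standing in for the source repository.
DOCS = {
    "a.xml": '<doc><section id="s1"><para id="p1">Hi</para></section></doc>',
    "b.xml": '<doc><para>See <link targetid="s1"/> and <link targetid="p1"/>.</para></doc>',
}

for problem in find_undesirable_links(DOCS):
    print(problem)
```

The index must be built from every document before any single link can be judged, which is the kind of global lookup that a per-document check cannot do on its own.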
To give managers a high-level view of the cleanup, we created a daily Excel spreadsheet that combined and summarized results from both the XQuery reports and Schematron rules. This spreadsheet had a column for each type of error or warning and a row for each MathWorks product. It contained mostly numbers, with no error messages or location details. We created this spreadsheet by:
- Generating the XQuery report, in XML format, as described earlier.
- Generating a global report on the Schematron rules, in XML format. To do this, we created an XSLT transform that converted our Schematron source file into an XQuery query that identified the same instances of errors and warnings. Whenever we changed the Schematron rules, we regenerated the query. The XQuery library and this generated query enabled us to report on the 33 Schematron rules throughout the XML source repository. In effect, the single Schematron source file drove both a convenient UI for writers working on specific XML files and a global view of the repository.
- Combining data from the two global reports to produce a consolidated summary of the 45 rules, in Excel format. For this step, we used MATLAB®, the flagship MathWorks product. Alternatively, we could have used other tools capable of reading XML files, consolidating data, and writing Excel files.
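The consolidation step can be sketched in Python as one of those alternative tools. This is an illustration under stated assumptions, not the MATLAB code we used. The report layout (`finding` elements with `product` and `rule` attributes), the product names, and the rule names are hypothetical, and the sketch writes CSV rather than a native Excel file.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_findings(report_xml):
    """Count findings per (product, rule) pair in one XML report."""
    counts = Counter()
    for finding in ET.fromstring(report_xml).iter("finding"):
        counts[(finding.get("product"), finding.get("rule"))] += 1
    return counts


def summarize(*reports):
    """Merge reports into a table: one row per product, one column per rule."""
    total = Counter()
    for report in reports:
        total.update(count_findings(report))
    products = sorted({product for product, _ in total})
    rules = sorted({rule for _, rule in total})
    table = [["product"] + rules]
    for product in products:
        table.append([product] + [total[(product, rule)] for rule in rules])
    return table


def to_csv(table):
    """Serialize the summary table as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(table)
    return buffer.getvalue()


# Hypothetical report contents standing in for the two global reports.
XQUERY_REPORT = (
    '<report>'
    '<finding product="ProductA" rule="bad-link-target"/>'
    '<finding product="ProductB" rule="bad-link-target"/>'
    '</report>'
)
SCHEMATRON_REPORT = (
    '<report>'
    '<finding product="ProductA" rule="section-title"/>'
    '<finding product="ProductA" rule="section-title"/>'
    '</report>'
)

print(to_csv(summarize(XQUERY_REPORT, SCHEMATRON_REPORT)))
```

Because the summary keeps only counts, it stays small enough to scan at a glance even when the underlying reports list thousands of findings.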
Additional Processes That Supported Our Transition
Creating reports does not guarantee that writers have the time, motivation, or expertise to fix the errors and warnings. These additional processes and materials supported the cleanup effort:
- A design process that helped the group prioritize by dividing the 45 markup rules into two categories: errors and warnings. The distinction enabled us to phase the cleanup work across multiple releases. Writers needed to fix errors quickly, but could defer warnings.
- An initial week-long “quiet period” during which writers canceled nonessential meetings and fixed as many errors as possible. The large reduction in errors boosted morale.
- Detailed instructions on how to fix certain errors, why they were considered errors, and whom to consult with questions.
- Regular email updates to the group, containing charts derived from the numerical summary, kudos on progress, and alerts about regressions.
- Regular check-in discussions between managers and their staff members. Managers facilitated load balancing and helped resolve questions about guidelines.
Benefits and Results
The automated checks represented 45 fewer things to check manually. The remaining manual checks focused more on content and judgment calls. Furthermore, automated checks were quick and easy to perform after changing a topic or near a deadline.
The numerical summary helped documentation managers quantify how much time writers had to spend on the cleanup. This data helped leadership and development teams understand that it was a big effort, that it would justifiably take time away from other tasks, and that it was ultimately doable.
In five months, writers fixed over 8,900 errors and 1,800 warnings. The vast reduction in errors gave us confidence about the quality of what we finally shipped. Remaining errors are few, and the daily global reports will alert us to any regressions. Writers also left nearly 1,600 warnings to address in a future release. Distinguishing between errors and warnings helped us prioritize.
What We’ve Learned and Where We’re Heading
Undesirable markup patterns can end up in the documentation set if the DTD or schema allows them. Nevertheless, making the enforcement too restrictive throughout the authoring process can make writers frustrated and inefficient. This project taught us that some restrictions deserve continual authoring-time enforcement, while others are more appropriate as periodic or on-demand checks.
From a technical perspective, we learned that:
- It suited our needs to use Schematron and XQuery together because their strengths differ. In addition, they are useful independently of each other.
- XQuery requires more overhead than Schematron, but is more versatile. XQuery is useful for many purposes unrelated to XML markup checks.
- Some software for authoring and managing XML content includes built-in support for Schematron rules or XQuery queries. While such support could provide a jumpstart, the software and versions we were using did not offer that benefit.
Moving forward, we are broadening our automated checks to cover more areas of our markup guidelines. We also plan to deepen the set of checks so that if a discussion about “interesting” markup leads to a decision that the pattern should be discouraged, we can add an automated check for it. Between the authoring environment and the global report, such a check can help writers learn the new guideline just in time, eliminate legacy instances, and catch new instances early. Achieving these outcomes through automation helps the group learn, stay consistent, and direct its creative energy in areas that require and benefit from human
The MathWorks, Inc.
Amanda Galtman develops quality-checking tools, XML markup designs, and XSLT stylesheets for MathWorks software documentation. She integrated XQuery and Schematron into the Documentation Department’s tool chain. She was previously a technical writer at MathWorks.