Amanda Galtman, MathWorks
You have an authoring environment with the tools and practices that support your writers. Your writers understand customers’ needs and create the content they require. As writers continue to create and update content, the question arises—how do you find and delete content you no longer need? It is an easy task to ignore because it tends not to have a deadline, and the benefits are hard to quantify. However, having a large quantity of unused content can waste time and storage space and cause confusion. In this article, I describe the approach that the MathWorks Documentation group took to identify unused content and delete it efficiently.
Over the years, MathWorks has accumulated thousands of documentation source XML and graphics files that are no longer needed. As writers update their content, they unwittingly create situations that cause files to become unnecessary, such as:
- Evolving content without removing files from earlier approaches
- Changing graphics to a different file format, such as GIF to PNG, without removing the old files
- Automating creation of graphics without removing the manually submitted graphics files
In addition to having a low priority, deleting unused files traditionally has not been a well-supported task in the MathWorks authoring environment. Roadblocks to deleting unused files include the following:
- The environment has no component content management system, and the revision control system in use until recently made deletion somewhat tricky.
- MathWorks translates some documentation into several languages, and use of source files varies by language.
- File use can be indirect, as when writers store an illustration in both vector and bitmap formats. Though not referenced from XML documentation, the vector file is a source for the bitmap file.
- Until recently, we had no specific tools or processes to help us determine the use of each source file. Writers often erred on the side of caution when they did not know if a file was in use.
We structured the cleanup as a group-wide initiative. We used an in-house file identification tool that traces through the XML source to find referenced XML and graphics files. The tool follows references in makefiles and XInclude elements, but it could just as easily have been designed to read Ant scripts and DITA maps.
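The core of such a tracer can be sketched in a few lines. The Python sketch below is not the actual in-house tool: it follows only XInclude references, starting from a set of entry-point files, whereas the real tool also reads makefiles. The element and attribute names used are standard XInclude, not anything MathWorks-specific.

```python
import os
from collections import deque
import xml.etree.ElementTree as ET

# Fully qualified tag for <xi:include> in the standard XInclude namespace.
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

def trace_references(entry_points):
    """Breadth-first trace of XInclude references, starting from the
    given entry-point XML files. Returns the set of reachable paths;
    any file not in this set is a candidate for deletion."""
    reachable = set()
    queue = deque(os.path.normpath(p) for p in entry_points)
    while queue:
        path = queue.popleft()
        if path in reachable or not os.path.exists(path):
            continue
        reachable.add(path)
        if not path.endswith(".xml"):
            continue  # graphics and other leaf files reference nothing
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            continue  # skip malformed files; the real tool would report them
        base = os.path.dirname(path)
        for inc in tree.iter(XI_INCLUDE):
            href = inc.get("href")
            if href:
                # hrefs are resolved relative to the including file
                queue.append(os.path.normpath(os.path.join(base, href)))
    return reachable
```

Comparing the returned set against a full directory listing yields the "seemingly unused" list that the project steward distributes for review.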
Prior to the first round of cleanup, we piloted the entire process with several writers. We also documented how to restore files, if necessary.
The initiative includes these steps:
1. A project steward uses the tool to prepare lists of seemingly unused files.
2. File owners review the lists and indicate which files are safe to delete.
3. A production editor prepares and submits a batch job.
These steps occur in non-overlapping time windows of approximately 2 weeks, 3 weeks, and 1 week, respectively.
To ensure that we do not delete necessary files, we use these safety measures:
- Writers review the lists of files suggested for deletion. Reasons for preserving specific files are often known only by file owners. Files marked for preservation include:
  - New files for use in the next release
  - Content about features whose status in the product is undecided
  - Content undergoing active refactoring
- The file identification tool interprets change-tracking markup conservatively, assuming that deletions are rejected and additions are accepted.
- When the tool changes, the developer and project steward review its logic and output, looking for bugs.
- In the build system, the production editor does a dry run of file deletion, building all documentation deliverables. The project steward searches the build logs for error messages that indicate the absence of necessary files.
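The log search in the last safety measure can be as simple as matching a few error patterns across the build logs. The sketch below is illustrative only: the two pattern strings are invented examples, not the actual messages our build system emits.

```python
import re
from pathlib import Path

# Hypothetical error patterns; a real build system's messages will differ.
MISSING_FILE_PATTERNS = [
    re.compile(r"No such file or directory: (\S+)"),
    re.compile(r"cannot find referenced file '([^']+)'"),
]

def scan_logs(log_dir):
    """Return the set of file paths reported missing in any *.log file
    under log_dir, per the patterns above."""
    missing = set()
    for log in Path(log_dir).glob("**/*.log"):
        for line in log.read_text(errors="replace").splitlines():
            for pattern in MISSING_FILE_PATTERNS:
                match = pattern.search(line)
                if match:
                    missing.add(match.group(1))
    return missing
```

A non-empty result means the dry run broke a reference, so the corresponding files go back on the preserve list before the real deletion job runs.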
Benefits and Results
We’ve completed two rounds of cleanup, with a savings of about 2 GB of disk space. Writers find the review process to be quick. Several writers have gone out of their way to express that they are happy to be rid of the old files with so little effort.
We were able to streamline quality reports by removing clutter arising from unused files. Also, we simplified maintenance of the code that generates the reports by removing the special treatment required for certain files.
Having all writers complete their review on the same schedule, and then preparing, qualifying, and submitting one deletion job has proved to be very efficient. This approach is preferable to expecting writers to delete unused files as they work. The approach also offers flexibility. When specific writers cannot meet the review deadline, such as when they are on extended leave, we can defer cleanup of their areas until the next round.
We’ve learned that dividing the file list into separate Excel spreadsheets works well. In the first round, we used a single spreadsheet that covered all our documentation, but after filtering to show only their areas of responsibility, some writers inadvertently changed hidden data. Having separate spreadsheets that align with writers’ areas removes the need to filter the data.

Having honed this cleanup process, we now repeat it easily and keep our file clutter under control. We expect future cleanups to save less disk space because the backlog is gone. Even so, a plot of the number of unused files over time suggests that unused files are simply part of the documentation life cycle.
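Splitting the master list by owner is straightforward to script. The sketch below uses CSV files as a stand-in for Excel workbooks (a real version might use a library such as openpyxl) and assumes the master list has an "owner" column; both details are illustrative assumptions, not our actual tooling.

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_by_owner(master_csv, out_dir):
    """Split a master file list into one CSV per owner, so each writer
    reviews a standalone list with no filtering (and no hidden rows to
    change by accident). Assumes an 'owner' column in the master list."""
    rows_by_owner = defaultdict(list)
    with open(master_csv, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        for row in reader:
            rows_by_owner[row["owner"]].append(row)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for owner, rows in rows_by_owner.items():
        with open(out / f"{owner}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
```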