Larry Koller
August 1, 2020

Computers are good at brute-force tasks. For example, they can compare thousands of paragraphs with each other, looking for matches or near-matches without getting tired or bored.

A content developer can use the results to:

  • find and correct inconsistencies
  • create collection files for reusable text.

I built a simple reuse analyzer from existing open-source tools and code, using a script to loop through any number of topics (after stripping markup).

Behind the scenes

Reuse analysis uses a technique called fuzzy matching. In a traditional comparison, the result is always a Boolean — true or false. Fuzzy matching gives a floating-point result between zero and one, where 1 is a perfect match, 0 is no match at all, and 0.95 might be “close enough.”

For example, the following two strings are not identical, but should be in a technical document:

Click OK to close the dialog.
Click OK to close the window.

Comparing these strings returns a score of 0.93 — in other words, 93% identical.

Fuzzy matching, at least in this implementation, uses an algorithm called the Levenshtein distance. This is the number of single-character changes (or edits) — additions, changes, or deletions — required to change one string to another. The algorithm looks complex but can be expressed in less than 30 lines of code. A WikiBooks page provides implementations in many different programming languages.

Calculating the score is equally simple: if l1 and l2 are the lengths of the two strings, and d is their Levenshtein distance, the score is: (l1+l2d)/(l1+l2).

There are other fuzzy matching techniques, but I used this one as a starting point.

Preparing the content for analysis

Ideally, the content needs to be stripped of all markup. The text of one block element should all be on one line. My original thought was to write (or find) a DITA-OT plugin that would publish a bookmap to CSV, where each record would contain the file name and one block (or paragraph, if you prefer) of text.

This took more effort than the analysis script, believe it or not. After a brief experiment with a “plain text” plugin, I decided to try exporting to Markdown, a transform built-in to DITA-OT 3.1 and newer. From there, a utility called pandoc stripped the remaining markup and eliminated line-wrapping. The commands can be placed in a shell script:

dita --format=markdown_github --input=book.ditamap --args.rellinks=none
cd out
for i in *.md; do
    f=`basename $i .md`
    pandoc --wrap=none -t plain -o $f.txt $i
done
delete index.txt

A medium-sized book contains perhaps 2000 to 3000 block elements. Creating a book-of-books would be useful to look for reuse possibilities over multiple books.

The analysis script

The script is written in awk for rapid development, ease of maintenance, and maximum portability. (One can even install awk on a smartphone, but I would not recommend trying to do reuse analysis on it.) Although the original awk release was in the 1970s, the language has found a modern niche in “big data” processing applications. The entire script, including comments and blank lines, is just over 130 lines.

The script collects all the block elements into an array, eliminating blank lines and stray markup that pandoc did not remove. Even with thousands of blocks, this phase is over in an eyeblink. The script then loops through each block, comparing it to all the other blocks and outputting any matches above a specified threshold.

Technically, the script must perform n2 comparisons, where n is the number of blocks in the book. The real-world test contained 2361 blocks, resulting in 5.5 million fuzzy comparisons! My home computer, a late 2013 iMac, averages about 600 comparisons per second. Fortunately, there are a few optimizations that can help. The goal is to not do any fuzzy matches you don’t have to.

  • Eliminate the obvious first: don’t compare an entry to itself and do a (cheap) Boolean match to skip other identical entries. These were small improvements, but easy to do. Time to process: 2 hr 48 mins.
  • Delete each entry after comparing it to the others. This eliminates duplication (after comparing block A to block B, comparing block B to block A is a waste of time), cutting comparisons and processing time in half. Time to process: 1 hr 24 mins.
  • When comparing two entries, their minimum Levenshtein distance is the difference in string lengths. If the resulting ratio is less than the minimum, the script can skip the fuzzy match. This made a dramatic difference in the test document, reducing processing time by a factor of more than 30. Time to process: 5 mins 30 secs.

The script has an option to ignore block elements shorter than a user-specified threshold. Even if processing times improve only minimally, it can cut down on clutter in the output.

Another potential optimization would be to first calculate the maximum Levenshtein distance allowed for two entries, then stop processing if that distance is exceeded. Some of the example WikiBooks implementations include that improvement.

Using an awk to C translator, like awka, further improves processing time, depending on the amount of data the script must process. The document set that took 5-1/2 minutes to process with the interpreted script took about 2 minutes with the compiled version.

Potential improvements

The current way of preparing a bookmap for analysis includes all current reused (conref’ed) content, meaning all current reuse gets flagged. This makes the script best suited for newly converted content.

A DITA-OT plugin that generates a text file, one block per line, eliminating reused content, would be ideal for analyzing established content for reuse improvements.

The current version of the script outputs redundant groups when it finds more than two matches. For example, if there is a group of three or more matches, like this:

Matches for file1.txt block 3:
This is a test.
    file2.txt block 7 (ratio 1):
    This is a test.
    flle3.txt block 12 (ratio 1):
    This is a test.

Then there will be a redundant group for all but the first match:

Matches for file2.txt block 7:
This is a test.
    flle3.txt block 12 (ratio 1):
    This is a test.

A second script, that takes the piped input from the report, would be a good way to weed out those redundant groups.

Conclusion

In some cases, it may be worthwhile to use commercial offerings instead. Their interfaces are certainly more pleasant to look at than a command line. Some vendors claim to be able to build a collection file automatically, and you might recover the costs in time savings. However, this script provides a cost-effective way to get started.

If you are about to move from word processor documents to DITA, you could export your manuals to plain text and use this script to find reusable strings before conversion.

My homebrew reuse analyzer script is available on GitHub.