August 1, 2020
Computers are good at brute-force tasks. For example, they can compare thousands of paragraphs with each other, looking for matches or near-matches without getting tired or bored.
A content developer can use the results to:
- find and correct inconsistencies
- create collection files for reusable text
I built a simple reuse analyzer from existing open-source tools and code, using a script to loop through any number of topics (after stripping markup).
Behind the scenes
Reuse analysis uses a technique called fuzzy matching. In a traditional comparison, the result is always a Boolean — true or false. Fuzzy matching gives a floating-point result between zero and one, where 1 is a perfect match, 0 is no match at all, and 0.95 might be “close enough.”
For example, the following two strings are not identical, but should be in a technical document:
Click OK to close the dialog.
Click OK to close the window.
Comparing these strings returns a score of 0.93 — in other words, 93% identical.
Fuzzy matching, at least in this implementation, uses an algorithm called the Levenshtein distance: the number of single-character edits (insertions, substitutions, or deletions) required to change one string into another. The algorithm looks complex but can be expressed in fewer than 30 lines of code. A WikiBooks page provides implementations in many different programming languages.
Calculating the score is equally simple: if l1 and l2 are the lengths of the two strings and d is their Levenshtein distance, the score is (l1 + l2 - d) / (l1 + l2).
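To make this concrete, here is a minimal sketch of both calculations in awk, the language the analysis script below is written in. It follows the standard dynamic-programming formulation of the Levenshtein distance; the file name, function names, and the demo are illustrative, not taken from the author's script, and the whole thing does indeed fit in under 30 lines:

# levenshtein.awk: a minimal sketch of the calculation described above.
# File and function names are illustrative, not the author's script.

function levenshtein(s, t,    m, n, i, j, cost, del, ins, sub, d) {
    m = length(s); n = length(t)
    # d[i, j] holds the distance between the first i characters of s
    # and the first j characters of t
    for (i = 0; i <= m; i++) d[i, 0] = i
    for (j = 0; j <= n; j++) d[0, j] = j
    for (i = 1; i <= m; i++)
        for (j = 1; j <= n; j++) {
            cost = (substr(s, i, 1) == substr(t, j, 1)) ? 0 : 1
            del = d[i-1, j] + 1        # delete a character
            ins = d[i, j-1] + 1        # insert a character
            sub = d[i-1, j-1] + cost   # substitute (or match)
            d[i, j] = (del < ins) ? del : ins
            if (sub < d[i, j]) d[i, j] = sub
        }
    return d[m, n]
}

# The score formula from the article: (l1 + l2 - d) / (l1 + l2)
function score(s, t,    l1, l2) {
    l1 = length(s); l2 = length(t)
    return (l1 + l2 - levenshtein(s, t)) / (l1 + l2)
}

BEGIN {
    printf "%.2f\n", score("Click OK to close the dialog.",
                           "Click OK to close the window.")   # prints 0.93
}

Running the sketch with awk -f levenshtein.awk prints 0.93, matching the example above.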
There are other fuzzy matching techniques, but I used this one as a starting point.
Preparing the content for analysis
Ideally, the content needs to be stripped of all markup. The text of one block element should all be on one line. My original thought was to write (or find) a DITA-OT plugin that would publish a bookmap to CSV, where each record would contain the file name and one block (or paragraph, if you prefer) of text.
This took more effort than the analysis script, believe it or not. After a brief experiment with a “plain text” plugin, I decided to try exporting to Markdown, a transform built into DITA-OT 3.1 and newer. From there, a utility called pandoc stripped the remaining markup and eliminated line-wrapping. The commands can be placed in a shell script:
# Publish the bookmap to GitHub-flavored Markdown, one file per topic
dita --format=markdown_github --input=book.ditamap --args.rellinks=none

# Strip the remaining markup and line-wrapping from each topic
cd out
for i in *.md; do
  f=$(basename "$i" .md)
  pandoc --wrap=none -t plain -o "$f.txt" "$i"
done

# The converted map index is not content, so remove it
rm index.txt
A medium-sized book contains perhaps 2,000 to 3,000 block elements. Creating a book-of-books would be a useful way to look for reuse possibilities across multiple books.
The analysis script
The script is written in awk for rapid development, ease of maintenance, and maximum portability. (One can even install awk on a smartphone, but I would not recommend trying to do reuse analysis on it.) Although the original awk release was in the 1970s, the language has found a modern niche in “big data” processing applications. The entire script, including […]
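The heart of any such script is a brute-force loop over every pair of blocks. The sketch below is illustrative, not the author's actual script: it assumes the levenshtein() and score() functions from the earlier sketch are pasted into the same file, takes as input the one-block-per-line .txt files produced by the pipeline above, and uses an arbitrary 0.9 “close enough” threshold. The length pre-filter is one simple optimization: because the distance is at least the difference in lengths, pairs with very different lengths can never reach the threshold and can be skipped without the expensive comparison.

# reuse.awk: a rough sketch of the pairwise comparison, not the author's
# script. Paste in the levenshtein() and score() functions shown earlier.
# Usage: awk -f reuse.awk out/*.txt

{
    if (NF > 0) {             # remember every non-empty block
        text[++n] = $0
        src[n] = FILENAME
    }
}

END {
    threshold = 0.9           # an arbitrary "close enough" cutoff
    for (i = 1; i < n; i++)
        for (j = i + 1; j <= n; j++) {
            l1 = length(text[i]); l2 = length(text[j])
            # The distance is at least |l1 - l2|, so the score can never
            # exceed 2 * min(l1, l2) / (l1 + l2); skip hopeless pairs.
            lmin = (l1 < l2) ? l1 : l2
            if (2 * lmin < threshold * (l1 + l2)) continue
            s = score(text[i], text[j])
            if (s >= threshold)
                printf "%.2f\t%s: %s\n\t%s: %s\n", s,
                       src[i], text[i], src[j], text[j]
        }
}

For a medium-sized book of 2,000 to 3,000 blocks, this works out to a few million comparisons, exactly the kind of brute-force work computers are good at.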