Click here for the annotated topic list mentioned in this article.
Bill Hackos, Comtech Services, Inc.
You are doing a DITA pilot project, reusing topics to create three documents. How do you measure your success? I present three metrics here. Two are widely used, but one is invalid and the other gives misleading results when topics vary widely in size. The third, which I strongly recommend, gives valid results because it measures content rather than topics.
To illustrate the three methods, I have created a typical but entirely fictional annotated topic list. (Example annotated topic list) My example calculations using the three metrics are based on this annotated topic list. For more information on creating an annotated topic list, see JoAnn Hackos’s latest book, Information Development: Managing Your Documentation Projects, Portfolio, and People (Wiley 2006). The annotated topic list accounts for all the topics that will be used in a particular library, including an account of each topic that is used in multiple deliverables.
1. Percent topic reuse
The percent reuse metric measures the fraction of topics that are being reused somewhere. It does not take into account how many times topics are reused in a library. To do the calculation, count how many topics are used at least twice and divide that number by the total number of topics. In our example, the repository holds 32 topics, 15 of which are used more than once, so the ratio of reused to total topics is 15/32, or 47%.
A complication arises due to the way some authors handle notes and warnings. In the above example, the notes and warnings have been put into separate topics, one for each note or warning. If the authors had put the notes and warnings into a container topic and linked to them by means of conrefs, only 9 topics would have been reused out of a total of 25 topics, for a percent reuse of 36%. For the same amount of reuse, but with a slightly different architecture, the metric gives very different results.
Another good way to test this algorithm is to make a small change in the annotated topic list and see if the new metric value measures the change in the appropriate way. Suppose we were to add a small safety manual to the library using six topics that have already been reused. Since no topics are added that were not already used more than once, the ratio stays the same. Still 15/32 or 47% — no change, although clearly the reuse has increased. This algorithm does not react to the change in the expected way. This metric fails a validity test.
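The behavior described above can be sketched in a few lines of Python. This is only an illustration: the per-topic use counts are hypothetical, chosen so that the totals match the figures quoted in this article (32 topics with 15 reused, and 25 topics with 9 reused), and the function name is my own.

```python
def percent_topic_reuse(use_counts):
    """Share of topics that are used at least twice."""
    reused = sum(1 for uses in use_counts if uses >= 2)
    return reused / len(use_counts)

# Assumed distribution: 17 topics used once, 9 used twice, 6 used three times.
separate_topics = [1] * 17 + [2] * 9 + [3] * 6

# Assumed distribution for the container-topic architecture: 25 topics, 9 reused.
container_topic = [1] * 16 + [2] * 9

# Safety manual: six topics that were already reused each gain one more use.
with_safety_manual = list(separate_topics)
for index in range(17, 23):
    with_safety_manual[index] += 1

print(f"{percent_topic_reuse(separate_topics):.0%}")     # 47%
print(f"{percent_topic_reuse(container_topic):.0%}")     # 36%
print(f"{percent_topic_reuse(with_safety_manual):.0%}")  # 47%, unchanged
```

The last line is the point of the validity test: the library clearly contains more reuse, yet the number does not move.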
Another problem with using topics as a unit to measure reuse is related to the enormous difference in size of topics in a typical DITA project. In the example I have provided, the largest topic contains 6,457 words while the smallest contains 32 words, a ratio of about 202 to 1! That is something like comparing watermelons to grapes. It’s possible to reuse the smallest topics many times without having much impact on the total content if the larger topics are not reused.
2. Average topic reuse
A better way of measuring reuse is the average reuse calculation, which measures the average number of times a topic has been reused in a library. With this metric, we add up all of the instances in which topics are reused (every use beyond the first) and divide by the total number of topics in the repository. From the example spreadsheet, we find that there are 32 topics with 21 instances of reuse, resulting in an average reuse of 21/32, or 66%. Average reuse can exceed 100% when the instances of reuse outnumber the topics. If the authors have used a container topic with conref links, each link is considered an instance of reuse, so the result is the same: 21/32, or 66%.
Let’s check this algorithm with the same example used above. If we add a small safety manual to the library with six topics that have already been reused, the calculation changes to 27/32 or 84%. This is a small increase due to a small change in the library in the expected direction. This algorithm passes the validity test.
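A minimal sketch of the average reuse calculation follows. The per-topic use counts are again hypothetical; they are chosen only so that 32 topics produce the 21 instances of reuse cited in the example, and the function name is an assumption of mine.

```python
def average_topic_reuse(use_counts):
    """Instances of reuse (uses beyond the first) divided by the topic count."""
    reuse_instances = sum(max(uses - 1, 0) for uses in use_counts)
    return reuse_instances / len(use_counts)

# Assumed distribution: 17 topics used once, 9 used twice, 6 used three times.
# That gives 9 * 1 + 6 * 2 = 21 instances of reuse across 32 topics.
library = [1] * 17 + [2] * 9 + [3] * 6

# Safety manual: six already-reused topics each gain one more use.
with_safety_manual = list(library)
for index in range(17, 23):
    with_safety_manual[index] += 1

print(f"{average_topic_reuse(library):.0%}")             # 66%
print(f"{average_topic_reuse(with_safety_manual):.0%}")  # 84%
```

Unlike percent topic reuse, this number responds when already-reused topics are used again.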
Because the unit of measure is the topic, this calculation still has the watermelon/grape problem discussed above. This problem is solved by the third method that I present below.
3. Percent repository words reused in context
A more meaningful metric is the ratio of the repository content to the produced content. Unlike the first two metrics, it works at the content level rather than at the topic level. It is directly proportional to the actual costs saved by reuse in content maintenance and translation, because translation is charged by the word and maintenance costs are proportional to the volume of content rather than to the number of topics. Of the three metrics, it is the only one that can anticipate cost savings.
From the example annotated topic list:
Document1 – 25,413 words
Repository Content to Produced Content Ratio = 40,060/74,848 = 0.54
The repository in this case contains just over half the number of words of the produced documents.
I prefer to illustrate this metric in terms of the Percent Repository Words Reused in Context (PRWRC). The algorithm works as follows:
PRWRC = (Words in All Produced Content – Words in the Repository)/(Words in the Repository)
I add the words “in context” because the random reuse of words is meaningless. Only reuse of words in the context of topics has value.
If there is no reuse, the value of this quantity is 0% because the difference between the repository content and the produced content is zero. The percentage can rise over 100% if the reuse is high enough. In our example:
PRWRC = (74,848 – 40,060)/40,060 = 87%
This metric can be directly related to cost savings. In the above example, words equal to 87% of the repository (74,848 – 40,060 = 34,788 words) do not have to be re-translated, as they would be if the documents were created independently. At 30 cents per word, that amounts to a savings of about $10,436 in translation costs for one language. This repository represents a library of about 150 pages. For multiple languages, reviews, and maintenance, the savings would be greater. Only the PRWRC metric allows this direct comparison with costs.
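The arithmetic can be checked with a short sketch. The two word totals and the 30-cent rate come from the example; the function and variable names are my own, and the reused word count is taken here as the exact difference between the two totals.

```python
def prwrc(produced_words, repository_words):
    """Percent Repository Words Reused in Context, returned as a fraction."""
    return (produced_words - repository_words) / repository_words

PRODUCED_WORDS = 74_848      # words in all produced documents
REPOSITORY_WORDS = 40_060    # words stored once in the repository
RATE_CENTS_PER_WORD = 30     # translation rate used in the example

repository_ratio = REPOSITORY_WORDS / PRODUCED_WORDS
reuse = prwrc(PRODUCED_WORDS, REPOSITORY_WORDS)
reused_words = PRODUCED_WORDS - REPOSITORY_WORDS
savings = reused_words * RATE_CENTS_PER_WORD / 100

print(f"{repository_ratio:.2f}")  # 0.54
print(f"{reuse:.0%}")             # 87%
print(f"{reused_words:,}")        # 34,788
print(f"${savings:,.2f}")         # $10,436.40 for one language
```

Multiplying the savings by the number of target languages gives a first estimate for a multilingual library, assuming the same per-word rate applies to each language.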
Consider the validity test by using the example above. If we add a small safety manual to the library using six topics that have already been reused with a total of 7,082 words, the new PRWRC is:
PRWRC = (7,082+74,848-40,060)/40,060 = 105%
A small increase in reuse results in the expected increase in the value of the Percent Repository Words Reused in Context. This metric passes the validity test.
Because the unit of measure is content, this calculation does not have the watermelon/grape problem discussed above.
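The validity test and the size sensitivity can both be sketched with the numbers already given. The safety manual figure is from the example; the last two scenarios are hypothetical and simply reuse the smallest (32-word) and largest (6,457-word) topic sizes mentioned earlier. Names are my own.

```python
def prwrc(produced_words, repository_words):
    """Percent Repository Words Reused in Context, returned as a fraction."""
    return (produced_words - repository_words) / repository_words

PRODUCED_WORDS = 74_848
REPOSITORY_WORDS = 40_060

baseline = prwrc(PRODUCED_WORDS, REPOSITORY_WORDS)

# Safety manual: 7,082 words of already-stored content are used again.
with_safety_manual = prwrc(PRODUCED_WORDS + 7_082, REPOSITORY_WORDS)

# Hypothetical: the smallest topic (32 words) is reused five more times.
five_small_reuses = prwrc(PRODUCED_WORDS + 5 * 32, REPOSITORY_WORDS)

# Hypothetical: the largest topic (6,457 words) is reused once more.
one_large_reuse = prwrc(PRODUCED_WORDS + 6_457, REPOSITORY_WORDS)

print(f"{baseline:.0%}")            # 87%
print(f"{with_safety_manual:.0%}")  # 105%
print(f"{five_small_reuses:.0%}")   # 87%
print(f"{one_large_reuse:.0%}")     # 103%
```

Five extra uses of a grape barely register, while one extra use of a watermelon moves the metric by about 16 points, which is the behavior a content-based measure should show.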
People tend to look for easy solutions to many of the problems they encounter. That seems particularly true for employing metrics to measure the success of their reuse projects. The test of a good metric should not be how easy it is to use but how accurate it is in providing the measure you are testing. The data for such a metric may be more difficult to obtain.
Percent topic reuse
The percent topic reuse metric is completely invalid despite the fact that it is the most popular metric used to measure reuse. It is very easy to employ, but it has no real meaning.
Average topic reuse
This metric is valid but does not translate directly into cost savings. It also suffers from the watermelon/grape problem.
Percent repository words reused in context
This metric is valuable because it is directly related to cost savings. Although it requires a word count for each topic (which will change with every edit) and can be complicated by graphics, I recommend this metric as a valid reuse metric. Arbortext Editor, as well as other editors, has a facility for giving word counts for a topic. It is important not to count DITA comments as words, since they are not part of the text of the topic. It is tempting to use bytes rather than words as a measure of content. However, bytes include all of the tags and would not be a valid measure of content.
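Where an editor's word count is not available, a rough count can be scripted. The sketch below uses Python's standard xml.etree.ElementTree module, whose default parser discards XML comments, so only element text is counted. The sample topic and the function name are hypothetical, whitespace splitting is only an approximation of an editor's word count, and a real topic with a DOCTYPE declaration or custom entities may need additional handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA concept topic containing one comment.
SAMPLE_TOPIC = """<concept id="hot_surface">
  <title>Hot surface warning</title>
  <!-- reviewer: confirm the temperature with engineering -->
  <conbody>
    <p>Do not touch the fuser until it has cooled.</p>
  </conbody>
</concept>"""

def count_topic_words(dita_xml):
    """Count whitespace-separated words in element text, ignoring tags and comments."""
    root = ET.fromstring(dita_xml)       # comments are dropped by the default parser
    text = " ".join(root.itertext())     # text content only, no markup
    return len(text.split())

word_count = count_topic_words(SAMPLE_TOPIC)
byte_count = len(SAMPLE_TOPIC.encode("utf-8"))

print(word_count)               # 12
print(byte_count > word_count)  # True: bytes also include tags and the comment
```

Summing such counts over the repository and over each produced document gives the two totals that the PRWRC calculation needs.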
We hope that developers of Content Management Systems will consider providing this key metric automatically from their systems. Publications managers’ goal should be to minimize the size of their repository. As with inventory control, we should aim to produce the most output (manuals and other content sources) with as little inventory of words in context as possible.