Managing Scalability and Reuse
“Reuse is Good”
Amongst the many best practices identified for new and experienced writers, the ideal of ‘Reuse’ is often a top recommendation. The ability to take an existing section of content and use it again in a different, but appropriate, context promises measurable benefits.
The reason is simple: if some content is already well written, or applies unchanged across several versions of a product, why waste time trying to create the content again from scratch? If you rebuild your content on a regular basis, reuse should make it quick and easy to include the latest version of the content. The benefit is that you are always delivering the most current material. Where you have a corporate style for content, reusing material makes it easier to maintain continuity with that style and helps new writers learn about and apply style consistently to their own work. So reuse is seen as a good thing.
The application of reuse can be measured and counted in many ways. For example, creating a single, clearly defined product name and reusing that definition throughout the rest of your content gives you various benefits such as a simple multiplier (“Name X is reused 1,000 times”), and tangible time savings (“Renaming Name X to Name Y took just 5 seconds, rather than 1,000 lots of 5 seconds or about 80 minutes”).
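The single-definition arithmetic can be sketched directly. This is an illustrative sketch only: the template convention, the `render` function, and the counts are invented for the example, not taken from any real authoring tool.

```python
# Sketch: one reusable definition of a product name, referenced everywhere.
# The {product} placeholder convention is invented for this illustration.

PRODUCT_NAME = "Name X"  # the single, reusable definition

def render(template: str) -> str:
    # Every topic refers to the definition rather than hard-coding the name.
    return template.replace("{product}", PRODUCT_NAME)

topics = ["{product} installs in minutes."] * 1000  # 1,000 reuse instances

# Renaming is one edit to the definition, not 1,000 edits to topics.
PRODUCT_NAME = "Name Y"
rendered = [render(t) for t in topics]

print(sum("Name Y" in t for t in rendered))  # → 1000, all instances updated
```

One change to the definition propagates to every instance at build time, which is exactly the "5 seconds rather than 80 minutes" saving described above.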
For a discipline where qualitative improvement measures are rarely so absolute, the ability to demonstrate specific savings as a result of reuse is an ‘easy sell’ to project managers and other people who want hard numbers to prove value for money.
The increasing adoption of markup technologies such as DITA has further enabled reuse. Markup makes it much easier to identify discrete units of content, suitable for reuse. The unit might be a small one- or two-word phrase, such as a product name. It might be a larger collection of words, such as a sentence or a commonly used step in a task sequence. It might be an entire topic that is frequently required but rarely changed, such as a list of trademark acknowledgments. Finally, it could be a large collection of content, such as a ‘must-gather’ list of tasks to perform when requesting product support, where each task is described in its own topic.
Determining the boundaries of a reuse unit is not always easy, especially where translation is concerned. A useful test is whether the unit can be translated meaningfully and consistently. Wise counsel can be found in the W3C guidelines on Best Practices for XML Internationalization <http://www.w3.org/TR/xml-i18n-bp/#AuthInsText>, which recommends that you:
“Make sure that any piece of inserted text is grammatically independent of its surrounding context.”
We can summarize the benefits of reuse under three headings:
- Consistency—by having single definitions of terminology, brands, and names.
- Simplification—by solving the problem of creating good content just once.
- Cost savings—by reducing task time and inheriting material from other teams.
Unfortunately, discussions of reuse tend to stop at that point. The concept has been explained, the techniques described, a few cautions administered, and the benefits outlined. It seems that the case has been made.
The problem is that—much like the notorious ‘Bubble Sort’ often taught to novice programmers—a concept that is easy to explain and apply on the small scale turns out to have problems on a larger scale.
The Scaling Problem
The scaling problem arises the moment you take advantage of a reuse opportunity. More accurately, the scaling problem is ‘enabled’ the moment you begin to reuse. To explain why, let us revisit the simple example of defining a single reusable unit of content for the product name. Within your set of documentation, each instance of the product name is populated by referring back to that single definition. When you have only a single development set or ‘stream’ of documentation, moving from distinct version to distinct version, there is no difficulty from this kind of reuse.
The difficulty is when you have to introduce another, parallel line of documentation. This is a common requirement in enterprise-level documentation. Rather than having a single stream of documents, you have two or more logically distinct but parallel streams. Much of the content might be largely identical, but there will be sufficient differences between the intended deliverables to make separate logical streams a reasonable strategy.
We can see how the streams work by considering an example. Version 1 of SuperProduct might be developed and delivered on one hardware platform X. A corresponding set of documentation is created. This is the first stream. Later, the commercial success of SuperProduct means that a decision is taken to continue development to Version 2 and at the same time to introduce support for a new platform Y. The product documentation team now faces the challenge of structuring the documentation set for Version 2 and its multiple platforms.
The first option is to maintain a single documentation set, with platform-specific details clearly called out within the overall content. This means the team can keep working with a single stream. Initially, this might appear to be an attractive and simple option, but as the number of differences increases—more platforms, more versions, and so on—the content becomes more difficult to read and apply. Customers would spend more time working out what does not apply to them than reading useful content.
The second option is to have several different collections of all files—for example, one collection for each specific platform and version, perhaps stored in separate folders. This makes each individual document set easier to read, because it contains only that material relevant to the stated platform and version. But it rapidly becomes an impossible task to maintain and develop the content consistently. The likelihood becomes too great that you might miss one file during an update or inadvertently add new material where it does not apply.
The third option is to produce platform- and version-specific builds of the documentation using clearly defined streams that exist within a single collection of files. Each specific stream is built and delivered by using different filters to include or exclude content, according to the required output. This option poses an increasingly complicated logic problem as more permutations of content must be managed, but is nevertheless the preferred approach used by many documentation teams. The reason this option is preferred is that there should be only one file to go to for specific content, therefore reducing the chance of inconsistency. The option is especially tempting if strategic vision statements suggest that later versions of the product will support additional platforms or if support must be maintained for all versions concurrently.
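The include-or-exclude logic of option three can be sketched in a few lines. The attribute names, the sample content units, and the build profile below are all invented for illustration; real DITA builds express the same idea with `props` attributes and ditaval files.

```python
# Sketch: one source collection, many filtered builds (option three).
# Each content unit carries applicability attributes; a build profile
# decides what to include. Names and values here are illustrative.

source = [
    {"text": "Welcome to SuperProduct.", "platform": None, "version": None},
    {"text": "Install on platform X.",   "platform": "X",  "version": None},
    {"text": "Install on platform Y.",   "platform": "Y",  "version": None},
    {"text": "New in version 2.",        "platform": None, "version": "2"},
]

def build(platform: str, version: str) -> list[str]:
    # Include a unit if each attribute is unset (common content) or
    # matches the requested build profile.
    return [u["text"] for u in source
            if u["platform"] in (None, platform)
            and u["version"] in (None, version)]

v2_on_y = build(platform="Y", version="2")
print(v2_on_y)
# ['Welcome to SuperProduct.', 'Install on platform Y.', 'New in version 2.']
```

The attraction is visible even at this scale: the common material exists exactly once, so there is only one file to go to for specific content.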
Reuse best practice suggests that using streams should be straightforward and work well. Much of the content such as Getting Started, Tutorials and Scenarios, and Basic Tasks will be common across the different platforms and indeed similar from version to version. Such content can be shared or reused. Other content, such as Installation or Diagnostic Help, would be platform- or version-specific and so would be unique to the corresponding stream. We can represent the general case of having several output document streams built from a single collection of source files using the diagram in Figure 1.
Content that started as a single collection of files produced a single document stream for a single platform. As development progresses, the content branches off into several different streams built from the files. Each new stream is for a newer version or additional platform.
As time goes by, these documentation streams tend to lose synchronization with each other. In addition to the obvious content differences that arise from planned differences, other factors start to have an impact. For example, a software development team might change the usage instructions for certain features, either for marketing reasons or simple platform constraints. The changes must be reflected in the corresponding documentation streams. Since each stream represents a distinct set of content, unique within a collection of multi-version, multi-platform, and multi-capability material, each subsequent platform or version difference increases the divergence between streams, making it more difficult to identify and maintain common material. We can understand this divergence by defining the ‘distance’ between two streams as being the number of words to add, change, or remove to convert from one stream to another. The problem caused by this divergence is that it represents a form of ‘resistance’ to re-integration of content.
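The 'distance' measure just defined can be approximated with a word-level diff. This is a sketch using Python's standard difflib, not a tool described in the paper; the two sample streams are invented.

```python
import difflib

def stream_distance(a: str, b: str) -> int:
    """Approximate the 'distance' between streams: the number of words
    to add, change, or remove to convert stream a into stream b."""
    wa, wb = a.split(), b.split()
    distance = 0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, wa, wb).get_opcodes():
        if op != "equal":
            # A replaced word counts once; pure adds/removes count as themselves.
            distance += max(i2 - i1, j2 - j1)
    return distance

x = "Install SuperProduct on platform X using the setup wizard"
y = "Install SuperProduct on platform Y using the command line"
print(stream_distance(x, y))  # → 3 words differ between the streams
```

Tracked over time, a steadily growing distance between two streams is a direct measure of the 'resistance' to re-integration described above.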
This point is extremely significant. By definition, streams are concurrent but exclusive. The product differentiating changes in one stream are not necessarily inherited or propagated across to other streams. Inevitably, there is a divergence and a loss of synchronization. Other than for the simplest situations, once branching starts to occur, streams do not later converge; at most they come to an end.
We can now see that what started as a benefit might become a problem. As reuse continues and increases, so the scaling problem increases correspondingly but not necessarily proportionately. Reuse facilitates branching into streams, but offers nothing to help manage the increasing number of streams.
Some of these problems might be eased by using a good Content Management System (CMS). However, many best practices supported by a CMS focus on tidy expansion and promotion of reuse, not reintegration of content, and therefore do not really help the scaling problem.
Do I Have a Problem?
There are several ways in which you might test for a scaling problem in your project. Two markup-oriented methods are outlined in this paper. Their suitability for your projects will vary, but they should help you think about, and test, other methods optimized for your circumstances.
In method one, you identify the ratio between pre- and post-build plain text. Begin by removing all the markup elements from your source files, then counting the number of ‘ordinary’ words left. Call this value X. Next, build your files into (X)HTML. As before, remove the markup elements from the build results, then count the number of ‘ordinary’ words left. Call this value Y. You now have a ratio X:Y. A smaller ratio (where X < Y) suggests a larger amount of reuse. The reason is that reuse tends to lead to a larger number of words in the post-build content, because reuse markup in the source is replaced by actual words in the generated output.
If you have no reuse, then the ratio of pre-build to post-build plain text is 1:1. If your post-build content expansion ratio is more than 25 percent, you might have a scaling problem. If it’s more than (say) 50 percent, you do have a scaling problem. The actual percentage numbers will certainly be different for your circumstances, so you should interpret the values by thinking about your projects where reuse is, and is not, a problem.
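A minimal sketch of method one follows. The crude tag-stripping regex, the sample keyref markup, and the resolved product name are all illustrative assumptions; a real measurement would parse the source and build output properly.

```python
import re

def plain_word_count(markup: str) -> int:
    # Strip tags crudely (illustrative only; a real check would parse
    # the markup properly), then count the remaining 'ordinary' words.
    return len(re.sub(r"<[^>]+>", " ", markup).split())

# Pre-build source: a keyref stands in for the product name.
source = '<p>Run <keyword keyref="prodname"/> to start.</p>'
# Post-build output: the keyref has been resolved into actual words.
output = "<p>Run SuperProduct Enterprise Edition to start.</p>"

x = plain_word_count(source)   # pre-build plain words
y = plain_word_count(output)   # post-build plain words
expansion = (y - x) / x * 100  # percent expansion caused by reuse

print(x, y, round(expansion))  # → 3 6 100
```

Here three source words expand to six delivered words, a 100 percent expansion; by the thresholds suggested above, that would flag a scaling problem worth investigating.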
An advantage of this method is that it is a simple calculation. The ratio can be determined automatically every time a document build is run and the results monitored. You could set up alerts to email you when a build produces a ratio that exceeds a certain threshold. A side benefit of this method is that a sudden change in ratio might indicate a significant and perhaps unexpected alteration in the delivered content. Finally, recording and monitoring the ratios can be an interesting part of the project review, to see how reuse has changed over time. Steady and gentle changes in ratio would indicate correspondingly steady and therefore better-managed reuse. Conversely, erratic changes in ratios suggest uncontrolled reuse.
Method two takes a different approach, by looking at the number of possible permutations for any given collection of content. As the number of permutations increases, so the resistance pressure between different streams of output documentation also increases. That pressure makes it more difficult to reintegrate streams or to make common changes without adverse effect on one or more specific streams.
DITA is able to build different permutations of content by selectively including or excluding material according to the values of specific attributes supplied at build time. These attributes are defined in ditaval files and enable us to mark content as being applicable to a given platform and, therefore, to be included or excluded during a build. Some DITA build implementations can also filter content by selecting different search paths during the build. For any given set of documentation source, the number of distinct attributes and the range of possible values for each of those attributes gives us the number of possible permutations for building the content. If your build environment allows dynamic search paths, this can be treated as an additional attribute, using the number of possible build paths as the range of values.
As a simple worked example, assume that file.dita has three distinct attributes: attr1, attr2, and attr3. It doesn’t matter if an attribute appears more than once in file.dita; the point is that if an attribute is used, it must be included in the calculation. The variety of ditavals used for building means that attr1 might have four possible values, attr2 might have eight possible values, and attr3 might have two possible values. This means that file.dita could potentially be built in 4 x 8 x 2 = 64 different ways. Some files will have only one permutation. Other files might have many more permutations.
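The worked example can be checked mechanically. The attribute names and value ranges below are the ones assumed in the example itself; in practice you would extract them from your source files and ditavals.

```python
from math import prod

# Possible values per distinct attribute used in file.dita
# (the counts assumed in the worked example above).
attribute_values = {
    "attr1": 4,   # four possible values across the ditavals
    "attr2": 8,   # eight possible values
    "attr3": 2,   # two possible values
}

# Each distinct attribute contributes its range of values exactly once,
# no matter how many times it appears in the file.
permutations = prod(attribute_values.values())
print(permutations)  # → 4 x 8 x 2 = 64 possible builds of file.dita
```

Run over every file in a collection, the same calculation gives the distribution of permutation values that the next paragraph suggests analyzing.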
You can use this permutation value in several ways. You could analyze the distribution of permutations amongst all your files to see whether you have generally low reuse (mostly low permutation values) or high reuse (high permutation values). More generally, you could consider how easy it is to describe each of the valid permutations, or whether they can be listed on no more than one side of paper. If you need more space than that, you might have evidence of complex permutations that were initially driven by reuse but now promote an increased risk of error or inconsistency between what should be valid streams. In effect, it becomes more difficult to detect and prevent an invalid content build because of the larger number of possible permutations.
Planning For, and Preventing, Scaling Problems
There are certain traps that make it easy for scaling problems to occur.
Some traps are procedural. For example, your content development process might not have an explicit technique for managing reuse instances. This trap is easy to detect. Any time you want to create a reuse instance, is an evaluation performed to help you justify that reuse? All too often, no evaluation is performed because it seems obvious that reuse is required. As the instances increase and measures such as the ratio and permutation values described previously start to grow, you become more aware of the impact of uncontrolled reuse. A related procedural trap is failing to establish and apply a process for identifying and safely removing instances of reuse.
Other traps are technical. For example, you might define reuse attributes or values that are not sufficiently orthogonal from each other, such as Platform_X_32-bit and Platform_Y_32-bit. These values are apparently different because of a specific platform, but are similar in referring to a 32-bit environment. The effect is that you might be forced to include a more complex set of filter conditions in your source to ensure you have the correct content for each permutation, and that same complexity makes errors or inconsistency more likely.
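One way to see the cost of non-orthogonal values is to compare the filter conditions each design forces. The value names below extend the Platform X and Platform Y example from earlier in the paper and are otherwise invented.

```python
# Non-orthogonal: one attribute mixes platform and word size.
# Selecting 'all 32-bit content' means enumerating every matching value.
flat_values = {"Platform_X_32-bit", "Platform_X_64-bit",
               "Platform_Y_32-bit", "Platform_Y_64-bit"}
thirty_two_bit = {v for v in flat_values if v.endswith("32-bit")}
print(sorted(thirty_two_bit))  # two values to repeat in every filter

# Orthogonal: separate attributes, so a filter names one value per axis.
platforms = {"X", "Y"}
wordsizes = {"32-bit", "64-bit"}
# A '32-bit' filter is now the single condition wordsize == "32-bit",
# and adding a Platform_Z needs no new wordsize values at all.
print(len(platforms) * len(wordsizes))  # same 4 permutations, simpler filters
```

The permutation count is identical in both designs; what changes is how many values every filter condition must mention, and therefore how easy it is to get a filter wrong.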
Another technical trap is where the reuse content is not sufficiently decoupled from its original context. The material might not make sense when read out of context. This is less of a problem for smaller units of content, such as product names. For larger units such as topics or collections, it is often too easy to carry over context specifics such as product details or assumptions, making reuse inappropriate.
Being aware of traps such as these will help you recognize the process and organizational characteristics that encourage the kind of reuse that leads to scaling problems.
Ultimately, however, you should think about and implement a reuse management process to help prevent and mitigate the problems. The process should include mechanisms to constrain the proliferation of instances of reuse. In other words, copy only what you need for reuse, don’t copy the entire stream. Have a way to minimize or ‘retire’ instances of reuse, where appropriate. Constrain reuse to only those parts of your content that need to be different. Apply different reuse strategies to different sections of the documentation. For example, in some sections you might reuse small content units, whereas in other sections you might prefer a simpler duplication of an entire topic, accepting that consistency must be watched carefully.
Finally, don’t be afraid to adapt your process according to the project specifics and in light of what you see happening as your reuse monitoring continues.
Sensible application of reuse is a good thing because it helps to increase consistency and reduce error by omission. The problem is that, like many other ‘good things,’ reuse taken to excess can become a bad thing. Specifically, excessive or uncontrolled reuse can enable scaling problems. These are problems that get worse with time, especially if you neglect to refresh your reuse strategy.
A number of tools are available to help you recognize the symptoms of uncontrolled reuse. These tools range from simple assessments to more advanced content management systems.
As you think about different reuse strategies, recognize that some approaches work better than others, and they vary according to project. Some strategies you might otherwise reject could prove to be simpler and therefore more efficient.
Many thanks to my colleague Ian Larner of IBM United Kingdom Limited for his review of this paper and his constructive comments.
About the Author:
Adrian R. Warman, Ph.D.
IBM United Kingdom Limited
Adrian is an Information Architect for IBM in Hursley, England. He supports information developers in the IBM SDK for Java team, and is actively involved in Mobile, eBook and accessibility technologies and systems. Before joining IBM in 2000, Adrian worked in the telecommunications and banking sectors, having started his career as a University Lecturer, researching Information Systems and Computer Security. Adrian presented at the DITA Europe 2012 conference in Frankfurt, Germany.