Becoming a Taxonomist: A review of the book The Accidental Taxonomist and its value to the DITA community
Despite the talk about taxonomies in the DITA community, not many organizations are using them to their full potential. The Content Wrangler article “DITA, Metadata Maturity and the Case for Taxonomy” by Paul Wlodarczyk & Stephanie Lemieux summarizes the results of research on the subject:
Many organizations have turned to component-oriented content creation to create more sophisticated knowledge products, in more languages, and at lower cost. Our research shows that organizations that use XML authoring are more mature than their peers with respect to the adoption of best practices for search and metadata. However, the use of native DITA metadata capabilities is rare, and many are also missing out on opportunities to use taxonomy for content reuse and improved content findability.
While organizations may use metadata for filtering or workflow tracking, they are not taking full advantage of descriptive metadata, such as keywords and index terms. Perhaps their products lack sufficient complexity or variety to warrant extensive classification and terminology control for precise topic search. A complex product taxonomy more naturally leads to using a complex taxonomy for product documentation. Consider the case of Analog Devices, Inc., which hired a consulting firm to help develop taxonomies to improve its website’s search capabilities and content maintenance. As a highly engineered site, <www.analog.com> offers users many options for searching an extensive array of products and the documentation for those products.
If your organization has considered developing a taxonomy to improve CMS searching, Heather Hedden’s The Accidental Taxonomist provides great training on the subject. The book belongs to a series of Accidental … books published by Information Today aimed at working and prospective librarians faced with unexpected roles. The book also addresses the goals of information developers who want a systematic way to develop keywords and index terms for DITA topics. They know that if authors apply their own keywords and index terms ad lib, metadata searches in a CMS will not return the best results, and back-of-the-book indexes will be chaotic. As they start to investigate the idea of a controlled vocabulary of terms, they are immersed in the world of taxonomies, synonym rings, authority files, thesauri, and ontologies. They may not be librarians, but they realize that they must become accidental taxonomists if they wish to achieve their goals.
Much has been written on taxonomies to date, but The Accidental Taxonomist fills an important gap in the literature. Earlier publications teach theory but have few guidelines to help a would-be taxonomist get started. This book engages the reader as a practical field guide that is both accessible to information developers and a fresh resource for students in library and information science. The book uses numerous real-world examples to illustrate principles and practices and builds on users’ experiences with retail web sites and search engines.
Types and Uses of Taxonomies
The first chapter defines what taxonomies are and distinguishes a simple hierarchical term list (a classification system) from a thesaurus. A thesaurus creates relationships between terms. It allows a term to exist in more than one hierarchy and to have synonyms and related terms. A thesaurus can more easily contain a greater number of terms and support more specific and extensive indexing.
The chapter identifies taxonomies as primarily serving one of the three following functions:
- Indexing support
- Retrieval support
- Organization and navigation support
Each of these functions can be directly related to what is needed for a successful DITA and CMS implementation. For indexers and metadata experts, a thesaurus type of controlled vocabulary helps them choose terms consistently and use preferred terms rather than synonyms. This type of thesaurus also supports retrieving topics effectively by metadata search.
Navigation support calls for a hierarchical taxonomy, such as the CMS folder structure that users can browse to find a topic. This taxonomy is often organized according to the product taxonomy and document types.
The Accidental Taxonomist teaches you how to think about different kinds of taxonomies and how to design an appropriate structure for a specific purpose. You can learn to set up your controlled vocabularies from a strategic perspective. Seeing numerous examples and exploring existing taxonomies can give you ideas and models for what would work in your situation. You don’t need to build an entire taxonomy on your own. The book references sources for taxonomies for indexing that can be purchased or licensed. You may find well-defined taxonomies for your industry that already standardize terminology. You could start with those terms and then add your own. Understanding more about taxonomies can also help you to take advantage of the DITA 1.2 subject scheme capabilities as well as the glossary specialization, which includes a number of features in common with a thesaurus.
Creating Terms and Their Relationships
Hedden’s third chapter guides you in performing a content audit and identifying concepts and terms for a thesaurus. In the process, you determine preferred terms and their synonyms and the best format for each term, based on guidelines. You may also include scope notes to define the scope of usage for a preferred term. The results of this type of term development can easily map into the DITA 1.2 glossary specialization.
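To make that mapping concrete, here is a minimal sketch that writes one thesaurus entry as a DITA glossary topic. It assumes the element names glossentry, glossterm, glossdef, glossBody, glossScopeNote, glossAlt, and glossSynonym from the DITA 1.2 glossary specialization; verify them against the specification before relying on them. The function name to_glossentry, the sample definition, and the scope note are invented for illustration and do not come from the book.

```python
import xml.etree.ElementTree as ET


def to_glossentry(entry_id, preferred, definition, synonyms=(), scope_note=None):
    """Build a glossentry element from one thesaurus record (illustrative only)."""
    entry = ET.Element("glossentry", id=entry_id)
    ET.SubElement(entry, "glossterm").text = preferred   # preferred term
    ET.SubElement(entry, "glossdef").text = definition
    if synonyms or scope_note:
        body = ET.SubElement(entry, "glossBody")
        if scope_note:
            # Scope note is written before the alternate forms.
            ET.SubElement(body, "glossScopeNote").text = scope_note
        for synonym in synonyms:
            # Each nonpreferred term becomes one alternate form.
            alt = ET.SubElement(body, "glossAlt")
            ET.SubElement(alt, "glossSynonym").text = synonym
    return entry


# Sample values are hypothetical.
entry = to_glossentry(
    "physician",
    "Physician",
    "A person licensed to practice medicine.",
    synonyms=["Doctor"],
    scope_note="Use for medical practitioners only.",
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

A real glossary topic would also need the DOCTYPE declaration and file layout that your DITA toolchain expects; the sketch covers only the term-level mapping.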
The next chapter describes how to specify the relationships between the terms that you have identified. As with an index, you want to guide users to the preferred term. A synonym or nonpreferred term within a specific context, such as a phone book or medical reference, might have a listing: Doctor. See Physician. In a thesaurus, the equivalence relationships between preferred and nonpreferred terms have different notations. You see the terms Doctor USE Physician and Physician USED FOR Doctor, which is shortened to Physician UF Doctor. Hierarchical relationships contain broader terms (BT) that have under them narrower terms (NT) that represent subordinate concepts. A subordinate concept could be a member, part, example, or instance of a broader concept, class, or category. Often it seems intuitive to understand broader and narrower terms, but in subtle cases you need to test the relationships to see if they are truly hierarchical and not merely associative. Without the guidance found in The Accidental Taxonomist, it would be easy to go astray and create associative relationships that appear to be hierarchical. A truly associative relationship connects two concepts through a related term (RT) notation. This relationship is equivalent to the See also notation used in indexing.
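The three relationship types can be pictured as a small data structure. The following is a rough sketch, not a method from the book: the class name Thesaurus and its methods are invented, and the broader term "Health professionals" and related term "Hospitals" are hypothetical examples added alongside the Doctor/Physician pair from the text.

```python
from collections import defaultdict


class Thesaurus:
    """Toy thesaurus with equivalence, hierarchical, and associative links."""

    def __init__(self):
        self.use = {}                      # nonpreferred -> preferred (USE)
        self.used_for = defaultdict(set)   # preferred -> nonpreferred (UF)
        self.broader = defaultdict(set)    # term -> broader terms (BT)
        self.narrower = defaultdict(set)   # term -> narrower terms (NT)
        self.related = defaultdict(set)    # term -> related terms (RT)

    def add_equivalence(self, nonpreferred, preferred):
        # One call records both directions: USE and UF.
        self.use[nonpreferred] = preferred
        self.used_for[preferred].add(nonpreferred)

    def add_hierarchy(self, broader, narrower):
        # BT and NT are reciprocal, so store both sides.
        self.narrower[broader].add(narrower)
        self.broader[narrower].add(broader)

    def add_related(self, a, b):
        # RT is symmetric, like a pair of "See also" references.
        self.related[a].add(b)
        self.related[b].add(a)

    def preferred(self, term):
        """Resolve a term to its preferred form, or return it unchanged."""
        return self.use.get(term, term)


t = Thesaurus()
t.add_equivalence("Doctor", "Physician")
t.add_hierarchy("Health professionals", "Physician")  # hypothetical BT/NT pair
t.add_related("Physician", "Hospitals")               # hypothetical RT pair
print(t.preferred("Doctor"))
```

Storing each relationship in both directions mirrors the reciprocal notations in a printed thesaurus, where every USE has a matching UF and every BT a matching NT.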
The book surprises us with guidance on the different types of taxonomy software available for creating and managing taxonomies. While simple spreadsheets might be sufficient for starting out or for small taxonomies, using dedicated taxonomy software can take the headache out of maintaining a growing thesaurus. It can flag duplicates, unresolved references, and illegal relationships, update all relationships if a term changes or is deleted, generate reports in different display formats, and export the taxonomy in different platform-independent formats, such as XML for a CMS and HTML for a web server. Although the DITA 1.2 glossary specialization might replace a type of thesaurus, not having software to manage the relationships between terms would make the glossary specialization more difficult to maintain.
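The kinds of checks described above can be approximated in a few lines, which also shows why they become tedious to run by hand in a spreadsheet. This is a simplified sketch under stated assumptions, not a description of any particular product: the function check_thesaurus, the tuple format for relations, and the sample data are all invented, and "illegal relationship" is narrowed here to two cases (a term related to itself, and a pair of terms linked both hierarchically and associatively).

```python
def check_thesaurus(terms, relations):
    """Return a list of (problem_type, detail) tuples.

    terms: list of term strings.
    relations: list of (source, kind, target), kind in {"USE", "BT", "NT", "RT"}.
    """
    problems = []

    # Duplicates: the same term entered twice, ignoring case.
    seen = set()
    for term in terms:
        key = term.casefold()
        if key in seen:
            problems.append(("duplicate", term))
        seen.add(key)

    hierarchical, associative = set(), set()
    for source, kind, target in relations:
        # Unresolved references: a relation points at a term that is not listed.
        for end in (source, target):
            if end.casefold() not in seen:
                problems.append(("unresolved", end))
        # Illegal: a term cannot be related to itself.
        if source.casefold() == target.casefold():
            problems.append(("illegal", f"{source} {kind} {target}"))
            continue
        pair = frozenset((source.casefold(), target.casefold()))
        if kind in ("BT", "NT"):
            hierarchical.add(pair)
        elif kind == "RT":
            associative.add(pair)

    # Illegal: the same pair is both hierarchical and associative.
    for pair in hierarchical & associative:
        problems.append(("illegal", " / ".join(sorted(pair))))
    return problems


# Hypothetical sample data with one problem of each kind.
terms = ["Physician", "Doctor", "physician", "Hospitals"]
relations = [
    ("Doctor", "USE", "Physician"),
    ("Physician", "BT", "Health professionals"),  # target is not in the term list
    ("Physician", "RT", "Hospitals"),
    ("Hospitals", "NT", "Physician"),             # clashes with the RT above
]
problems = check_thesaurus(terms, relations)
for problem in problems:
    print(problem)
```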
Taxonomies for Human Indexing and Automated Indexing
Hedden’s sixth chapter describes how indexers can use a taxonomy and how best to design and display a taxonomy to support indexers. A group needs comprehensive indexing policy guidelines and training to ensure consistent, accurate indexing. Taxonomists and indexers need to communicate about adding new terms, eliminating terms, or changing terms to improve the quality of the taxonomy. Managers need to decide on how much to control the addition of terms from indexers or other sources. Even terms from social tagging or folksonomies may be introduced as candidate terms if they reveal emerging concepts that should eventually be merged into the controlled taxonomy. For instance, internal communities within your company may employ product-related tagging with terms that may end up in product documentation.
The book covers automated indexing because the process involves developing taxonomies or pre-indexed documents to train the indexing software. Many companies are using automated indexing for enterprise content management. The typical context is a large organization with a vast volume of content, content that needs to be indexed quickly, or rapidly changing content. Automated indexing supports a search engine’s ability to retrieve documents at a rough granularity but not the task of assigning precise metadata in DITA files for topic retrieval or book indexing.
Taxonomy Structures and Displays
Chapter 8, on taxonomy structures, addresses the design of hierarchies and facets and the use of multiple vocabularies versus one vocabulary organized into categories. The way to structure a taxonomy depends on the needs of the people who will be using it. This chapter describes the characteristics of different structures and the factors to consider in implementing them.
The following chapter on taxonomy displays describes how the same taxonomy can be implemented with different user interfaces or displays to meet the needs of different kinds of users. A thesaurus of terms with no dominating hierarchies can be output in a number of different formats using one of the thesaurus software applications.
Examples of hierarchical taxonomies on websites show various strategies in real-world situations. One particularly interesting example shows the results of recursive retrieval in a hierarchical taxonomy. Searching for a book on database design on Amazon.com takes the user through the categories Books > Computers & Internet > Databases > Database Design. At the Databases level, over 23,000 results are listed. Such a high number indicates a recursive retrieval at that level, because it includes the sum total of all books in narrower categories. Such examples provide practical awareness of taxonomies in action and the effect of their design on the user experience.
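Recursive retrieval is easy to see in a toy model: the count shown for a category is its own items plus the items in every narrower category beneath it. In the sketch below the category path comes from the example above, but the sibling category "SQL" and all of the item counts are invented purely for illustration.

```python
# Narrower categories for each category (the path from the example above,
# plus one hypothetical sibling, "SQL").
narrower = {
    "Books": ["Computers & Internet"],
    "Computers & Internet": ["Databases"],
    "Databases": ["Database Design", "SQL"],
    "Database Design": [],
    "SQL": [],
}

# Items indexed directly at each category. These numbers are made up.
direct_counts = {
    "Books": 0,
    "Computers & Internet": 1,
    "Databases": 5,
    "Database Design": 3,
    "SQL": 2,
}


def recursive_count(category):
    """Count items at this category plus all items in narrower categories."""
    own = direct_counts.get(category, 0)
    below = sum(recursive_count(child) for child in narrower.get(category, []))
    return own + below


print(recursive_count("Databases"))        # 5 direct + 3 + 2 from narrower terms
print(recursive_count("Database Design"))  # a leaf category counts only its own items
```

A non-recursive display would report only the direct count at each level, so the same taxonomy can look sparse or crowded depending on which retrieval rule the interface applies.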
A fielded search display provides another type of access to a taxonomy. The user can select or enter a term for a variety of fields, which may represent facets, vocabulary files, or categories. A CMS could have a fielded search display for different types of DITA and CMS metadata. The values of the different fields could be linked by Boolean AND operators so that all criteria must be met to return a result. The taxonomy developed for controlling the terms used in the metadata enables the CMS to return results with greater precision.
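A fielded search with AND logic amounts to keeping only the topics that match every field the user filled in. The following sketch assumes a simple in-memory list of topic records; the field names (product, doctype, audience), the topic IDs, and the function fielded_search are hypothetical and do not correspond to any particular CMS.

```python
# Hypothetical topic metadata records, one dictionary per DITA topic.
topics = [
    {"id": "t1", "product": "Modem", "doctype": "Reference", "audience": "Developer"},
    {"id": "t2", "product": "Modem", "doctype": "Task", "audience": "Developer"},
    {"id": "t3", "product": "Handset", "doctype": "Reference", "audience": "Developer"},
]


def fielded_search(records, **criteria):
    """Return records whose metadata matches every supplied field (Boolean AND)."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


hits = fielded_search(topics, product="Modem", doctype="Reference")
print([record["id"] for record in hits])
```

The exact-match comparison is what makes a controlled vocabulary matter: if one author tags a topic "Modem" and another tags it "Modems", the AND search silently misses the second topic.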
Developing and Maintaining a Taxonomy
The tenth chapter of the book provides guidelines for planning, designing, and creating a taxonomy. The governance needed for the design and build phases can also apply to the maintenance phase. The initial taxonomy design-plan document can become the basis for one or more governance documents that define policies and procedures. Within the DITA environment, taxonomy maintenance could be considered a continuing information-development activity, as each new document written in or converted to DITA may require adding new keywords and index terms to the taxonomy.
The penultimate chapter discusses issues of taxonomy implementation and evolution. In some situations, you might want to export a taxonomy for use in other systems, such as a CMS search engine. A transfer of terms from a thesaurus to the DITA 1.2 glossary specialization might also fall under this category. Updating a taxonomy may include revising it as well as adding to it. You may need to combine separate taxonomies on different subjects into a master taxonomy, or you may have two redundant taxonomies in the same subject area that you want to merge into one taxonomy. If your DITA implementation requires localization, you may need to develop multilingual taxonomies, which are supported by the more sophisticated taxonomy software applications.
Learning More About Taxonomy
The last chapter describes the nature of taxonomy work and the opportunities for working as a professional taxonomist. Information developers may appreciate the listings of classes, training programs, workshops, and online tutorials that can help an accidental taxonomist develop skills for the role. Other valuable references for learning include links to conferences and meetings, online discussion groups, social networking sites, and web resources.
Heather Hedden has posted a web page about her book at <www.accidental-taxonomist.com>. From there, you can also find links to her courses on taxonomies and controlled vocabularies. Her online course for groups or corporations provides feedback on taxonomy lessons and makes it easy to try out some of the taxonomy software available.
The Accidental Taxonomist provides a rich and comprehensive view of the field of taxonomy and leaves the reader feeling like a budding expert at the end. It offers guidance on how to proceed with a taxonomy application in a professional way and shows real-world examples for inspiration.
Susan started out documenting software development tools and then migrated into the telecom industry. She worked for Qualcomm, then Ericsson, and again for Qualcomm. Her recent work has focused on introducing single sourcing and DITA methods for chip design documents and smartphone software documents. Her group is starting a CMS implementation and is facing the challenges of creating taxonomies for a complex subject domain.
Heather Hedden. The Accidental Taxonomist. Medford, NJ: Information Today, Inc., 2010.
Paul Wlodarczyk and Stephanie Lemieux. “DITA, Metadata Maturity and the Case for Taxonomy.” The Content Wrangler, January 7, 2010.