BookBuilder: Custom Publishing with Topic Maps
For a high-level discussion of topic maps, we recommend that you read Tina Hedlund’s book review, “Topic Maps 101,” in this issue.
We present here a solution developed at Aspen Publishers that uses topic map technology (XTM and ISOTM) for multidimensional indexing and classification of content across various products and publications. A topic map composed of numerous merged indexes and classifications is used for navigation and, most important, for repurposing content and building new, customized publications.
Aspen Publishers, an information provider for attorneys, business professionals, and law students, has an unusual perspective: unlike the audience for most publishers, its audience is equally receptive to print and electronic products. Because Aspen publishes products across many subject lines, customers may want information contained in multiple products (for example, tax issues are covered in such disparate areas as pensions, corporate law, and insurance). As more of its titles move to XML, Aspen aims to repurpose material in ways that make it more useful, more easily located, or more readily navigated.
Because of the cross-disciplinary nature of many subjects, Aspen’s first look at “chapter-chunking” custom publishing was unsatisfactory. Chapters and sections listed in a table of contents (TOC) provided a base of topics that was too wide. Instead, customers should be able to combine smaller components of multiple publications to fit their needs more precisely. Working with Cogitech, a topic maps consultant, Aspen devised an approach that let the XML markup used for indexes become the basis for an index-based book builder.
With the browser-based utility described here, as with any custom-publishing application, a library of books can be accessed and those books’ chapters, sections, and subsections dragged and dropped into a column representing the book to be constructed. But, uniquely, all of the library’s book indexes can also be called up and individual entries can be replaced or added. And, of course, the custom book that is created has a single, unified index.
The topic map approach combines all the indexes, TOCs, glossaries, and referenced materials from all books in the series into a single, multidimensional index. Using topic map associations, the entries in one book’s index can be related to one another; to entries in the other books’ indexes, glossaries, and TOCs; and to external referenced materials such as IRS publications or Web sites. Each index entry points directly to the content in the book, which is displayed in the BookBuilder application, so that the user can explore all the associations any topic brings up. The explanation here concludes with a discussion of the underlying topic-map technology and the use of XSLT to provide real-time creation of the HTML display.
How Did We Get Here?
Aspen Publishers is a New York-based legal publisher and part of the WoltersKluwer Legal-Tax-Business cluster in North America. As it happens, legal texts are prime candidates for XML publishing for several reasons. In the first place, lawyers have fully embraced electronic publications: along with everyone else, lawyers go to the Internet first for research, but, unlike customers in other areas of publishing, they have also eagerly embraced CD-ROM publications. Many legal texts are consequently good candidates for release in three different formats.
Additionally, legal texts are relatively stable. Many issues of law are pretty much settled, so when new laws are passed, new court cases decided, new regulations promulgated, you’re not usually throwing away the previous text, but instead adding to it or editing it somewhat. This type of authoring also lends itself to in-house or staff authoring, which facilitates using XML, but for historical reasons, Aspen, unlike its sister companies, works almost entirely with outside authors and is unable to dictate their working environment.
The first series of books that Aspen put into structured markup shared not only the same design but the same structure as well: a question-and-answer (QA) or frequently asked question (FAQ) format. (See Figure 1.) As the 41 titles in this series of AnswerBooks, which span the full breadth of the law, were migrated into XML, the question arose as to how else to take advantage of this fact. Aspen includes a legal-education division, which publishes books for law school classes, and the educators’ natural desire for material customized to specific classes introduced the notion of creating a custom-publishing application. Mixing and matching chapters, sections, subsections, or even individual questions into a new book seemed a natural way to take advantage of the different books’ identical formatting and structure. And of course this application was envisioned to generate a single, unified index for any custom publication.1
Figure 1. Opening Page of an AnswerBook
The AnswerBook chapters begin with an introduction and include a “mini TOC” for the chapter, with references to specific page numbers for the section, subsection, and subsubsection headings. The questions are numbered sequentially, with the chapter number as the first component. Supplements to a full edition include only edited and new questions (which receive a “dot” or decimal number using the number of the question after which they are inserted). The QA groups, as you would expect, cross page boundaries.
At the time this project was first discussed, each book’s TOC and list of questions (LOQ)2 were being generated programmatically from XML. Each book’s index, however, consisted of an XML file that was manually edited as entries and references were added or deleted. (See Figures 2 and 3.) Similarly, the “end-tables” (such as the table of cases) existed as XML files instead of being generated programmatically. In the body of the book, each QA pair is contained within a <qagroup> element. The index and end-tables point to the questions by number and not to the page where the question begins, while the TOC and LOQ point to the page containing the question and beginning of the answer.
Figure 2. An Index Page
Figure 3. Markup for Indexes
The indexes for AnswerBooks have a three-deep hierarchy. References to specific questions point to the question number and not the page where the question appears. In a supplement, references to questions in the full edition retain their initial numbering, but edited and inserted questions are preceded by the letter “S,” indicating the latest information is in the version that appears in the supplement. Ranges, consequently, have to be interrupted when they include a supplement question, such as that under the item “Age 50 catch-up limit on contributions.”
In the XML file Aspen started with, the question numbers were stored as attributes, while all punctuation for the print edition was stored as content.
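The way a supplement question interrupts a range can be illustrated with a short sketch. The tuple shape (chapter, question, supplement flag) and the printed "8:1-8:3" style are assumptions made for illustration; they are not Aspen's actual markup or house style.

```python
from typing import List, Tuple

# Hypothetical reference shape: (chapter, question, in_supplement).
Ref = Tuple[int, int, bool]


def label(ref: Ref) -> str:
    """Render one reference, prefixing supplement questions with 'S'."""
    chapter, question, in_supplement = ref
    text = f"{chapter}:{question}"
    return f"S{text}" if in_supplement else text


def format_refs(refs: List[Ref]) -> str:
    """Collapse consecutive full-edition questions into ranges.

    A supplement question is always listed on its own, so it
    interrupts any range that would otherwise run through it.
    """
    parts: List[str] = []
    run: List[Ref] = []

    def flush() -> None:
        # Emit the pending run as a range (2+ items) or a single label.
        if len(run) > 1:
            parts.append(f"{label(run[0])}-{label(run[-1])}")
        else:
            parts.extend(label(r) for r in run)
        run.clear()

    for ref in sorted(refs, key=lambda r: (r[0], r[1])):
        chapter, question, in_supplement = ref
        if in_supplement:
            flush()
            parts.append(label(ref))
            continue
        # Start a new run when the question does not directly follow.
        if run and (run[-1][0], run[-1][1] + 1) != (chapter, question):
            flush()
        run.append(ref)
    flush()
    return ", ".join(parts)


refs = [(8, 1, False), (8, 2, False), (8, 3, False),
        (8, 4, True), (8, 5, False), (8, 6, False)]
formatted = format_refs(refs)
```

Here the supplement question 8:4 splits what would otherwise be a single range covering 8:1 through 8:6.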
In a traditional custom-publishing scenario, Aspen could list the books’ contents divided into chapters or segments smaller than chapters, such as sections, subsections, and subsubsections, and provide a mechanism for checking off which segments a professor or professional user wanted in the custom publication. This list would then be fed to the pagination application, which would assemble the one-off title and an index and format it. Of course, this scenario would depend on Aspen’s converting the index’s XML file into inline references.3 Some publishers actually construct such titles by combining already existing PDFs, automatically generating a unified index. They believe that the TOC hierarchy provides sufficient indication of a segment’s content, at least so far as a professor is concerned.
Aspen Custom Publishing
This crude level of chapter-chunking does not serve the professional user well. Just as a story about a sports star signing a new contract appears in a newspaper’s sports section but is also a business topic, many AnswerBook questions have relevance to more than a single topic. In fact, the mere concept of an index acknowledges that the primary hierarchy is not suitable for all types of access into a book’s contents. As Aspen studied the effort to transform the index’s XML file into inline indexing,3 these advantages for custom publishing were identified:
- All titles use the same DTD/schema.
- Inline indexing would facilitate question moving and renumbering, as well as automatic generation of a unified index.
- Material is already atomized into QA groups.
- Index entries point to questions, not pages.
- QA grouping provides start- and stop-points in the text.
- The author for each question is already identified.
- A unique ID, such as a digital object identifier (DOI), was considered for each question, potentially allowing for access from outside the system.
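To make the inline-indexing idea concrete, here is a minimal sketch of regenerating an index from references stored inside each QA group. Only the <qagroup> element comes from Aspen's markup as described in this article; the num attribute, the <indexref> element, and its term and sub attributes are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical markup: only <qagroup> appears in the article;
# the "num" attribute and the <indexref> element are assumptions.
SAMPLE = """
<book>
  <qagroup num="8:66">
    <indexref term="Contributions" sub="limits"/>
    <indexref term="Compensation"/>
  </qagroup>
  <qagroup num="8:67">
    <indexref term="Contributions" sub="limits"/>
  </qagroup>
</book>
"""


def build_index(xml_text):
    """Map (term, subterm) to the question numbers that carry it."""
    index = defaultdict(list)
    root = ET.fromstring(xml_text)
    for qa in root.iter("qagroup"):
        for ref in qa.iter("indexref"):
            key = (ref.get("term"), ref.get("sub", ""))
            index[key].append(qa.get("num"))
    return dict(index)


def render(index):
    """Produce sorted index lines that point at questions, not pages."""
    lines = []
    for (term, sub), nums in sorted(index.items()):
        heading = f"{term}, {sub}" if sub else term
        lines.append(f"{heading}, Q {', '.join(nums)}")
    return lines


index_lines = render(build_index(SAMPLE))
```

Because each reference travels with its QA group, renumbering a question changes only the num value, and the index is rebuilt from the text rather than edited by hand.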
At this juncture, Nikita Ogievetsky of Cogitech provided the necessary guidance to steer Aspen toward using topic maps for its custom-publishing application. Topic maps, as he pointed out, derive from indexes, and indexed material translates easily into topic maps. As to their suitability for this project, you can be the judge here.
We used topic maps’ ability to store multidimensional indexes. Cogitech’s Adaptive Classificator framework, developed earlier, formed the basis for the proposed approach. The first step was to separate the QA content from the back-of-the-book indexes and the so-called end-tables.4 QA pairs from all books were collected into a uniform corpus of QA pairs and kept in an XML repository. All indexes, including the TOC, back-of-the-book index, glossary, references to IRS publications, and so forth, would be extracted in the form of topic map associations that point to information resources (for example, QA pairs) stored in the content repository. Some additional associations can be inferred based on the topics’ similarity and semantic proximity. Topic maps extracted from all books were merged into a single topic map and maintained in a topic map repository. The XML repository can be used for this purpose with proper normalization.
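The merge step can be sketched in a few lines. This is an illustration under simplifying assumptions, not the Adaptive Classificator framework itself: each per-book topic map is reduced to a dictionary from topic name to a set of QA-pair identifiers, the identifiers (such as "pab/6:1") are invented, and topics are merged by normalized name.

```python
from collections import defaultdict


def merge_by_name(*topic_maps):
    """Merge topic maps whose topics are keyed by name.

    Each input maps a topic name to a set of occurrence IDs
    (here, hypothetical QA-pair identifiers). Topics with the
    same normalized name collapse into a single topic.
    """
    merged = defaultdict(set)
    for topic_map in topic_maps:
        for name, occurrences in topic_map.items():
            merged[name.strip().lower()] |= set(occurrences)
    return dict(merged)


# Invented sample data for two of the books.
pension = {"Compensation": {"pab/6:1", "pab/6:2"}}
k401 = {"compensation": {"401k/9:44"}, "Contributions": {"401k/8:66"}}
merged = merge_by_name(pension, k401)
```

After the merge, the single "compensation" topic points at QA pairs from both books, which is what lets one index entry lead into several titles.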
Individual books can be reconstructed following a TOC. This is possible because the TOC, as well as the book indexes, is represented as a set of topic map associations. TOC associations apply in the context of the original book, while other indexes (back-of-the-book index, index of IRS publications, and so forth) are ubiquitous across all books.
A new book is just a set of new TOC associations. Once a custom book’s TOC is picked up, the content and associations of the indexes follow. Book reconstruction starts with selecting a TOC that points to individual QA groups. Consequently, selected QA groups take along referencing portions of back-of-the-book and other indexes. This allows content repurposing and facilitates maintenance of back-of-the-book and other indexes.
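A small sketch shows how selected QA groups can "take along" the index entries that reference them. The dictionary layout and the identifiers are hypothetical; the real application works on topic map associations rather than plain dictionaries.

```python
def index_for_selection(index, selected):
    """Keep only index entries that point at selected QA groups.

    `index` maps an entry heading to the QA-group IDs it references;
    `selected` is the list of QA-group IDs chosen for the new book.
    """
    chosen = set(selected)
    result = {}
    for heading, targets in index.items():
        kept = [qa for qa in targets if qa in chosen]
        if kept:
            result[heading] = kept
    return result


# Invented merged index and custom selection.
merged_index = {
    "Compensation": ["pab/6:1", "401k/9:44"],
    "Contributions, limits": ["401k/8:66"],
    "Rollovers": ["403b/10:3"],
}
custom_toc = ["401k/8:66", "pab/6:1"]
custom_index = index_for_selection(merged_index, custom_toc)
```

Entries with no surviving references, such as "Rollovers" here, drop out, so the custom book's index covers exactly the material that was selected.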
The merits of this approach become even clearer when one considers its application for custom publishing. Indexed content of the original books can be used as the basis for the new publication. A book in our framework is merely a collection of TOC associations. Construction of a customized book becomes almost as easy a process as reconstruction of any of the original ones. The steps are as follows:
1. Create a topic for the new book; provide it with a title and some other metadata.
2. Create new or clone existing TOC items: chapters, sections, subsections, and so forth. Build the TOC by associating these items, as seen in Figure 4. Add a TOC theme to the scope of these associations to determine the content of the book. Add references to existing or newly created QA pairs to these TOC associations.
3. Enable authors to select existing content by navigating topic map relationships (TOC, back-of-the-book and other indexes, and inferred associations) extracted from the previously “mapped” books.
4. Add new QA pairs, which requires authoring the text and mapping it onto a topic map of existing indexes.
Figure 4. Association of TOC Items
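The first two steps can be sketched with a pair of small data classes. Everything named here (the Topic and Association classes, the "toc-contains" association type, the "toc:" theme strings, and the sample book title) is hypothetical and stands in for the corresponding XTM constructs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Topic:
    topic_id: str
    name: str


@dataclass
class Association:
    kind: str      # hypothetical association type, e.g. "toc-contains"
    parent: Topic
    child: Topic
    scope: str     # TOC theme that identifies the book
    qa_refs: List[str] = field(default_factory=list)


def book_contents(associations, theme):
    """Return the QA references of one book, in association order."""
    refs = []
    for assoc in associations:
        if assoc.scope == theme:
            refs.extend(assoc.qa_refs)
    return refs


# Step 1: a topic for the new book (the title is invented).
book = Topic("custom-1", "Catch-Up Contributions Reader")
# Step 2: a TOC item associated with the book under the book's TOC theme.
chapter = Topic("custom-1-ch1", "Contribution Limits")
associations = [
    Association("toc-contains", book, chapter, scope="toc:custom-1",
                qa_refs=["401k/8:66", "pab/6:1"]),
    # An association scoped to an original book is left untouched.
    Association("toc-contains", Topic("401k", "401(k) AnswerBook"),
                Topic("401k-ch8", "Contributions"), scope="toc:401k",
                qa_refs=["401k/8:66", "401k/8:67"]),
]
custom_refs = book_contents(associations, "toc:custom-1")
```

The same QA pair can appear in both the original and the custom book because the scope, not the content, decides which TOC an association belongs to.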
Once the conceptual framework was settled, a few requirements were delineated. Namely, any application built would have to accommodate the following editorial needs:
- Questions (or more properly, QA groups) in the AnswerBooks move about and have to be renumbered when a full edition is issued.
- New questions have to be inserted when supplements are prepared.
- New and deleted material requires that the index entries be updated to reflect the text changes.
It was recognized that relying on inline indexing would allow QA numbering to move forward in the book production timeline, closer to print time. This was not a requirement, but any easing in the schedule makes life simpler for the editors.
As we built the prototype of Aspen’s BookBuilder custom-publishing application, it was clear that several different user groups would have to be considered:
- law school professors constructing texts
- specialist practitioners collecting information on small as well as large topics
- editors creating specialty one-off titles out of the general texts
- Web site visitors, using our interface to navigate the AnswerBooks’ content online
This last item was not part of the custom-publishing notion, but along the way the content navigation model was admired by some of the people responsible for Web site design.
Specifically, the custom-publishing application included these requirements:
- It would have to construct a new contents list that could be handed off to another application that would compose the custom publication in PDF or Microsoft Reader format. Both local generation, using the open-source Apache PDF engine FOP, and online generation, using the facilities of Aspen’s 200-page-per-minute PDF generator, Datalogics Pager, were considered.
- It would have to use a drag-and-drop interface to construct the new contents list.
- Two parallel development paths would have to be considered: online and standalone. A standalone application would be distributed on CD-ROM but still might require online generation of a PDF.
- The design of the custom-publishing application would need to facilitate both pre-planned research and spur-of-the-moment wandering, as typified by Web browsing. Thus, the application would need to display a variety of levels of content of any question indicated.
- Last, and most critical for custom publishing, the related material in otherwise different books would need to be identified.
This last requirement is truly what brought Aspen to topic maps.
Why Use Topic Maps?
In the first place, topic maps require close and accurate subject identification for all the occurrences that the topics are going to point to. This is no small requirement, especially when the material you are dealing with moves into the tens of thousands of pages. But, odd as it may seem to anyone who has worked in the low-margin world of publishing, book publishers, unlike most businesses, have already invested vast amounts of money to perform this identification.5 That is, they have hired subject-matter experts (SMEs) to index their titles, not for building topic maps but for navigating a book by topic. Because this expense has already been incurred, it provides not a barrier to but an incentive for using topic maps to capitalize on an existing investment. In the case of Aspen’s AnswerBook indexes, serendipitously the start- and stop-points of any index entry were already identified, because the index pointed not at a page number but at a QA group. That meant the inclusiveness of any material could be identified with 100 percent certainty.
Indexes also contain several useful relationships easily modeled in topic maps, most prominently the see and see also relations. Other relationships are discussed below.
Let us take a look at the prototype for Aspen’s custom book constructor, the Aspen BookBuilder. It uses the CTW framework, an XSLT library for topic maps as described in “Cogitative Topic Maps Websites (CTW) Framework-Information and Tutorial” (http://www.cogx.com/ctw), XML Topic Maps: Creating and Using Topic Maps for the Web (Addison-Wesley 2002), and XSLT CookBook (O’Reilly 2002). The prototype was designed for proof of concept and speed of development, and consequently does not represent the final application in many respects. For this prototype, Aspen selected just 5 of its 41 AnswerBook titles and then supplied approximately 20 percent of the questions from these titles. In some cases, questions or index entries were selected because it was clear they had relations that would span the titles; in other words, the sample material was probably richer than the material as a whole. The five titles, namely Pension Answer Book, 401(k) AnswerBook, 403(b) AnswerBook, ERISA Fiduciary AnswerBook, and Nonqualified Deferred Compensation AnswerBook, were assumed to have related material on benefits and pensions and are in fact marketed together as the Panel Pension Library, Panel being an Aspen imprint.
The prototype that was demonstrated in Philadelphia is shown in Figures 5, 6, and 7. These are Web-based screenshots, reflecting not only the simpler path of developing a Web application but also a reliance on standard technologies such as HTML, CSS, and XSLT, as well as XTM. Not shown here is how the same underlying topic maps can be inserted into a Microsoft SQL Server database and queried using an SQL implementation of TMQL (topic map query language) queries.
Figure 5. Aspen BookBuilder Opening Screen
Figure 6. BookBuilder Displaying a Top-Level Index Entry
Figure 7. Question and Answer Display
The top row of BookBuilder contains the names of the five included books, selectable by the user. In this case, the second title, 401(k) AnswerBook, was chosen. Either the TOC or the index for each title can be displayed along the screen’s left side. Before any index or TOC entry is selected, the middle column simply displays the book title and other metadata. The right column will hold the items selected for the book being constructed. Any topic in the left column can be dragged into the list on the right side.
Although our figures do not display the TOC, any chapter or subpart of a chapter can be dragged into the book construction list, which can be re-ordered once it is complete. The mechanisms for dealing with questions that are pulled in more than once also have to be put into place but were not required for the proof of concept.
When Index is selected, entries or topics from the book’s index are displayed. Here, the top-level entry “Compensation” has been selected. This entry has two subentries, which are displayed, as are the questions referred to in the second-level index entry. Note that similar topics from other titles are listed below the Compensation entries. In this case, the similarity relationship is based on the use of the same term in the different indexes. Wherever an index entry refers to specific QA groups, the questions themselves are displayed. From a navigational standpoint, any question or index entry shown in the middle column is clickable; clicking it moves the display to that entry or to that question and its answer.
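The name-based similarity used for the "similar topics" list can be sketched as a simple lookup across the other books' indexes. The data layout, the case-insensitive comparison, and the question numbers are assumptions for illustration only.

```python
def similar_topics(indexes, book, term):
    """Find entries in other books whose heading matches `term`.

    `indexes` maps a book title to its {heading: [question numbers]}
    index. Matching is by case-insensitive name only.
    """
    wanted = term.casefold()
    hits = {}
    for title, index in indexes.items():
        if title == book:
            continue  # only other titles count as "similar topics"
        for heading, questions in index.items():
            if heading.casefold() == wanted:
                hits[title] = list(questions)
    return hits


# Hypothetical sample data; the question numbers are invented.
indexes = {
    "401(k) AnswerBook": {"Compensation": ["9:44"], "Contributions": ["8:12"]},
    "Pension Answer Book": {"Compensation": ["6:1", "6:2"]},
    "403(b) AnswerBook": {"Rollovers": ["10:3"]},
}
matches = similar_topics(indexes, "401(k) AnswerBook", "Compensation")
```

Matching on names alone is cheap but depends on the indexers having used the same term, which is why richer identification would matter in production.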
If a question is selected rather than an index entry, the question and its answer are displayed. Roughly 12 lines of text are displayed before scrollbars in the shaded answer part are required. In the prototype, users can view the complete answer, but any restriction can be placed here. Related topics within the same book refer to the other index entries that point to this question. While the user has navigated to this spot by selecting 401(k) AnswerBook > Contributions > limits, he or she can also reach it using one of the other three entries in the index that point to this question. Any question that shares an index entry with another question has a “related topics” relationship.
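Both "related topics" relationships described here can be derived from the index alone, as this sketch suggests. The index contents and question numbers are invented; the sample is arranged so that one question is reachable from four entries, mirroring the situation in the figure.

```python
def related_entries(index, question, current_entry):
    """Other index entries in the same book that point to `question`."""
    return sorted(
        heading
        for heading, questions in index.items()
        if question in questions and heading != current_entry
    )


def related_questions(index, question):
    """Questions that share at least one index entry with `question`."""
    related = set()
    for questions in index.values():
        if question in questions:
            related.update(questions)
    related.discard(question)
    return sorted(related)


# Hypothetical index for one book: heading -> question numbers.
index = {
    "Contributions, limits": ["8:12", "8:13"],
    "Age 50 catch-up limit on contributions": ["8:12"],
    "Elective deferrals, limits": ["8:12", "8:20"],
    "EGTRRA, contribution limits": ["8:12"],
    "Compensation": ["9:44"],
}
entries = related_entries(index, "8:12", "Contributions, limits")
questions = related_questions(index, "8:12")
```

No extra markup is needed for either relationship; both fall out of the existing index once entries point at questions rather than pages.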
Although the link here to Question 8:66 is not highlighted, links within the text content were built both within the text and to external sources, such as the reference to the Economic Growth and Tax Relief Reconciliation Act (EGTRRA) § 616(a)(2)(A). Because Aspen includes an online primary-law research site, Loislaw, this availability of the primary source with explanatory material heightens the navigational strengths of this interface.
Topic Map Graph Visualizations
Some software vendors provide graphical visualizations of topic maps, which offer an alternative, and sometimes more powerful and comprehensible, means of topic map navigation than plain HTML.
If topic maps are to be used primarily for navigation, it may well be that alternate means are more powerful ways to represent the navigation path than HTML’s linear display and serial linking approach. Exploring this idea further, we took the same topic map used in the BookBuilder and dropped it into the Empolis K42 topic map engine, which includes Inxight Software’s star-tree technology for displaying topics. In Figure 8, a portion of the topic map is shown in a star tree.
Figure 8. Basic Topic Map Displayed as a Star Tree
We put the basic topic map into the Empolis K42 topic map software and opted for the star-tree view. Any topic can be made the center of a star-tree representation, with its relations spreading out from the central topic. Any node can be expanded and its relations displayed. Here three successive nodes have been expanded to show how a star tree is navigated. A node might link to an HTML page or display information when the mouse pointer passes over it.
The BookBuilder prototype easily related index entries from five different titles, using XML topic maps to identify the index entries and the basic relationships used to connect associated concepts. Merging the different topic maps was based on names, but in the actual production arena it was envisioned that published subject indicators would be used for merging. After the application was constructed and the notion extended to other series, the natural fit of this material to this technology became more apparent. First, the atomized nature of the FAQ approach makes it simple to identify the start- and stop-points of any topic for inclusion in the custom publication. For many books, a second pass through the index by an SME would need to take place, to map the exact start- and stop-points. Second, the detailed index of the material that is necessary for this approach to work poses no obstacle to technical publishers, who have made huge investments in indexing already. As was noted before, choosing open standards facilitated the development of this project-although to the authors’ knowledge, no index-based, custom-publishing application had ever been built before. The Aspen BookBuilder took less than two weeks to construct from requirements list to working prototype. Most of the subsequent work was simply in cleaning up the user interface.
Content repurposing is not just a buzzword in publishing but is in fact one of the essential means for justifying the expenditures for better, richer content. XML, of course, presupposes repurposing as the natural course of events. Whether the topic map methodology described here is used for custom publishing, small-run narrow-interest republishing, or Web navigation, it clearly makes separate books into a web of interconnectedness that puts exploration of the material onto a higher level.
About the Authors
Nikita Ogievetsky is a consultant in knowledge technologies and information management. He leads the community in finding enterprise solutions for real-life problems using topic maps, XSLT, and other XML technologies. He has developed the Cogitative Topic Maps Websites (CTW) framework and is actively involved in enabling the interchange between the RDF and topic map standards. Nikita has authored more than 20 papers on knowledge technologies and applied math and physics.
Roger Sperberg is a consultant in information architecture working with WoltersKluwer’s LTBNA group. He was formerly manager of electronic publishing systems at Aspen Publishers, a legal publisher that has begun publishing its books direct from XML. Prior to that he was director of content services for The Ballantine Publishing Group at Random House. The author of many Web articles and co-author of a multimedia text, he is the author of the for.eWords column at eBookWeb.
1. Tommie Usdin of Mulberry Technologies provided Aspen with guidance at this juncture, critically pointing out the frustrations engendered by a custom publication whose chapters are referenced only by a collection of indexes from the books supplying chapters. Tommie successfully pointed the effort in the direction of a unified index.
2. There are as many as 1500 questions in some of the series titles, so this is an important navigational feature.
3. Each entry in the index would, therefore, require an element to be inserted into the text. If a question was referenced three times in the index, three references would be attached to the QA group in order to regenerate the index at publication time. Some QA pairs are referenced as many as 50 times in the index.
4. Although not specifically discussed here, these separate back-of-book sections are structurally no different from the main index, but they have specialized references, such as sections of the Internal Revenue Code or Treasury Regulations.
5. At Aspen, for instance, the amount budgeted for indexing typically equalled that allocated for composition.