Publishing Structured Documents in an Open Source Wiki

CIDM Best Practices Newsletter, February 2009




Peter Dykstra, MetaphorX LLC

The rise of Wikis is a game changer for technical communications groups. Companies are using Wikis to deliver large, structured documentation sets.

Built around lightweight browser-based components, Wikis don’t have all the bells and whistles of specialized publishing tools, but they are also easier to learn and administer. They offer advantages over traditional desktop publishing tools, especially for multi-user collaboration and XML-based processing. At the same time, they tend to be less complex and have a lower cost of ownership than proprietary content management and publishing systems.

Wikis provide new capabilities, especially for web-oriented publishing, and can significantly simplify the traditional publishing process. In this article, I show how to use Daisy, a Wiki-based Content Management System (CMS), to publish structured technical documentation.

Built from the ground up as web-based, multi-user tools, Wikis can be used to set up efficient content review and publishing cycles, with defined roles for writers, editors, reviewers, managers, and readers. An administrator can configure role-based access to specific documents (or document parts) as well as to specific functions such as the ability to create, edit, or publish documents based on document type or metadata tags. XML processing can support single-source, multi-channel publishing of traditional books as well as online content. Flexible search and browse features, including techniques such as faceted browsing, make information easier to find and use for end users.

The net result: more timely publishing of information that’s easier to use, with fewer errors and at reduced cost.

What’s a Wiki?

Wikis were first introduced around 1995, defined by Ward Cunningham (who first used the term “Wiki”) as “the simplest online database that could possibly work.” Since then they’ve grown and been widely adopted, and the technology has matured. Wiki technology now provides many features previously available only in proprietary content management and publishing systems. Some Wikis now support very large websites; the best known of these may be Wikipedia, with over 2.5 million articles in its English edition.

Though different Wikis have different feature sets, at a minimum they allow users to format text and add links within and between pages, as well as links to other websites. Syntax and formatting capabilities vary. Some use simple Wiki-style formatting commands. Others support HTML with built-in WYSIWYG editors. Many Wikis offer more sophisticated features, some of which are described below.

Why Should Technical Publications Groups Care?

Wikis are a new class of software designed for managing and publishing text—the core activity of technical publishing groups.

Though they have generally been associated with informal publishing with few rules or controls, Wiki features such as defined roles and access rights, metadata support, versioning, and XML-based processing can be leveraged to provide additional advantages in structured-publishing environments.

Wiki technology should be of interest to technical publishing groups for at least three reasons:

  • A Wiki can provide an accessible way to do content publishing.
  • It can simplify collaboration within and among groups.
  • It is built from the ground up around the web and web standards.

Content publishing

First of all, Wikis provide an accessible, low-risk point of entry to the world of content management and single-source publishing.

Demands on publishing groups have grown in recent years, as the amount of information, the rate of change, and the demand for multiple versions and formats have all increased. A single-source publishing strategy built around a content management system can dramatically increase efficiency.

Open source Wikis provide content management and single-source publishing capabilities at a lower cost than proprietary systems. In fact, the licenses are free—the main costs are staff time and effort, and hiring or developing the necessary skills. Though these costs are not trivial, a Wiki can still provide a low-risk way to learn before you buy or build your own solution. The technology is good and getting better. The features meet the requirements of many typical publishing situations.

Simple collaboration

Many traditional desktop-based tools and CMSs have collaboration features to allow file sharing; these are powerful but can also be complex.

Collaboration with a Wiki is simple and intuitive. Subject-matter experts can have direct access to text and can contribute and review text online without specialized tools—all they need is a web browser. The ability to set access rights means each person can be authorized to see, change, or publish the appropriate documents or document types. It’s also easy to set up workflow views—separate views listing topics that are ready to be reviewed or approved topics ready for publishing, for example.
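
As a rough illustration of access rights and workflow views, the following Python sketch models both with plain data structures. The role names, action names, and status values are hypothetical and chosen only for this example; they are not Daisy’s actual configuration or API.

```python
from dataclasses import dataclass

# Hypothetical roles and the actions each may perform, per document type.
PERMISSIONS = {
    "writer": {"Topic": {"view", "edit"}},
    "reviewer": {"Topic": {"view"}},
    "editor": {"Topic": {"view", "edit", "publish"}},
}


@dataclass
class Doc:
    doc_id: int
    name: str
    doc_type: str
    status: str  # hypothetical values: "draft", "ready-for-review", "approved"


def can(role: str, action: str, doc: Doc) -> bool:
    """Return True if the role may perform the action on this document type."""
    return action in PERMISSIONS.get(role, {}).get(doc.doc_type, set())


def workflow_view(docs: list[Doc], status: str) -> list[str]:
    """List the names of documents currently in the given workflow status."""
    return [d.name for d in docs if d.status == status]


docs = [
    Doc(1, "Choosing decking material", "Topic", "ready-for-review"),
    Doc(2, "Setting the posts", "Topic", "approved"),
    Doc(3, "Deck overview", "Topic", "draft"),
]

print(workflow_view(docs, "ready-for-review"))
print(can("reviewer", "edit", docs[0]))
```

A workflow view in this sketch is just a filter on a status field, which is why such views are cheap to set up once the metadata exists.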

Standard web-based technology

Web-based publishing is already a requirement for many technical publishing groups, and this requirement is likely to grow. Web-based content is also created by many other groups throughout an organization; tech pubs groups have new opportunities to tap into and help organizations manage this decentralized knowledge, but they need to learn about the tools first. Knowledge of XML and XML publishing strategies can be invaluable to a publishing group.

One virtue of open source tools is that they tend by necessity to be standards-based. Migrating an organization’s content into a structured format that can be addressed by XML processing is generally a good idea, since it reduces the risk of orphaned content in proprietary formats and provides a base for continued development into new areas.

An example of Wiki-based structured publishing

The rest of this article shows how to create structured documentation using the Daisy CMS, a Wiki-based content management system.

“Structured” in this case means that documentation is written in modular topics, with topic types such as Concept, Procedure, and Reference. Authors identify each topic by topic type and with other identifying tags (such as version and status) in metadata fields attached to each topic. The modular topics are then assembled and published as HTML or as PDF books. The examples show a small number of topics; Daisy can manage sets containing thousands of topics with more complex information architectures.
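
The idea of modular topics with a topic type and attached metadata can be sketched in a few lines of Python. This is only an illustration of the concept: the class, the field names, and the assemble function are assumptions made for the example and do not describe how Daisy stores or publishes documents.

```python
from dataclasses import dataclass, field

TOPIC_TYPES = {"Concept", "Procedure", "Reference"}


@dataclass
class Topic:
    title: str
    topic_type: str
    body: str
    metadata: dict = field(default_factory=dict)  # for example version, status

    def __post_init__(self):
        # The only structural rule checked here is the topic type itself.
        if self.topic_type not in TOPIC_TYPES:
            raise ValueError(f"unknown topic type: {self.topic_type}")


def assemble(topics, order):
    """Join modular topics into one output, in the given order of titles."""
    by_title = {t.title: t for t in topics}
    return "\n\n".join(
        f"{by_title[title].title}\n{by_title[title].body}" for title in order
    )


topics = [
    Topic(
        "What is a ledger board?",
        "Concept",
        "A ledger board attaches the deck to the house.",
        {"version": "2008", "status": "approved"},
    ),
    Topic(
        "Attaching the ledger board",
        "Procedure",
        "1. Mark the height. 2. Bolt the board in place.",
    ),
]

book = assemble(topics, ["What is a ledger board?", "Attaching the ledger board"])
print(book)
```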

Unlike DITA-based systems, Daisy doesn’t use a validating XML editor and thus doesn’t enforce different tagging rules for different document types. On the downside, this lack of validation limits the type and degree of structure enforced by the system within each topic. On the upside, this lack of validation makes the system less complex and much easier to use and makes Daisy a good tool for casual users.

In the examples, a Daisy document type named Topic requires that writers specify a topic type for each topic document. Rules for content and formatting of the topic types are enforced externally through the use of authoring guidelines.

Daisy’s features

First, a bit about Daisy. The Daisy system is produced by Outerthought, a software company based in Belgium.

Out of the box, the system supports editing and publishing of web-based sites consisting of Daisy documents. These can be Daisy HTML documents, or they can be attachment documents containing files such as PDFs, Word documents, and Excel spreadsheets (or virtually any other file, including multimedia).

Some of the more interesting Daisy features:

  • browser-based WYSIWYG editing of Daisy HTML documents
  • definable metadata fields attached to each document type for use in accessing and managing documents
  • versioning of all edits, with DIFF views comparing any two versions
  • inclusions (the ability to include documents in other documents)
  • support for document variants. A variant consists of a branch (for a specific content version) and a locale (for a specific language).
  • assembly and publishing of PDF books
  • a Publish feature which requires explicit publishing of Staged documents to make them Live
  • role-based rights to view, edit, or publish documents
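
The versioning and Publish features in the list above can be pictured with a small Python sketch. It assumes that every saved edit creates a new version and that readers see the most recent version that has been explicitly published. The class and method names are hypothetical and do not reflect Daisy’s internal model.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int
    text: str
    published: bool = False


@dataclass
class Document:
    name: str
    versions: list = field(default_factory=list)

    def save(self, text: str) -> Version:
        """Every edit is kept as a new, unpublished (staged) version."""
        version = Version(number=len(self.versions) + 1, text=text)
        self.versions.append(version)
        return version

    def publish(self, number: int) -> None:
        """Explicitly mark one version as published."""
        self.versions[number - 1].published = True

    def live(self):
        """Return the newest published version, or None if nothing is live."""
        for version in reversed(self.versions):
            if version.published:
                return version
        return None


doc = Document("Attaching the ledger board")
doc.save("First draft")
doc.save("Reviewed text")
doc.publish(2)
doc.save("Work in progress")  # staged only; readers still see version 2

print(doc.live().number, doc.live().text)
```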

Features for readers:

  • view public pages without logging on. Users who log on can be granted access to restricted sites or documents based on their role(s).
  • use a document basket feature to collect arbitrary sets of topics and view or print them as a set
  • convert any Daisy page (or the contents of a document basket) to PDF on-the-fly
  • add public or private document comments
  • search the full text of documents (including HTML documents and common attachment types of PDF, Word, Excel, and text)
  • use faceted browsing views to drill down into a repository by filtering out unwanted documents
  • subscribe to documents to receive notifications when a document changes

Daisy architecture

Daisy consists of two main components—the Daisy repository, in which documents are stored, and the Daisy Wiki, the front end through which users create, manage, and view documents in the repository. The Wiki communicates with the repository using a documented application programming interface (API). You can also write other applications to interact with the repository using the API. (If you wanted, you could build a separate application that could use the repository without using the Wiki at all.)

Documents in Daisy each contain one or more parts. Part types include (among others) Daisy HTML, free-form HTML, images such as .png, .gif, or .jpg, and various Attachment types which allow storage of just about any kind of content, including binary files such as PDFs or multimedia. An administrator can set up multiple document types, each with one or more parts plus optional fields with optional picklists defined to attach metadata to each document.

The default document type contains one part of type Daisy HTML. This part is used most often and is supported by the built-in WYSIWYG editor. Daisy HTML uses a restricted subset of standard HTML tags that describe the structure of documents (such as headings, paragraphs, lists, tables, notes, and monospaced text), plus Daisy include statements that can query the repository and include links to (or the full text of) other Daisy documents. The Daisy HTML part of a Daisy document can be created and edited in Daisy’s WYSIWYG browser-based editor and is stored as well-formed XML in the Daisy repository.

On the back end, the repository stores documents in an open source MySQL database. The repository API includes a custom SQL-like Daisy query language which is used internally (and can be used directly in the Wiki) to query the repository for documents. The Wiki uses the query language under the hood to retrieve and display documents.

The Wiki is built on the Apache Cocoon XML processing framework, which supports real-time XML processing. Daisy’s Search feature uses the Apache Lucene search engine. Wiki users see documents organized into one or more Daisy sites. The repository is not aware of sites—within the repository documents are stored in one ‘big bag.’ Each document is identified by a unique, sequentially assigned Daisy ID number. This scheme makes it easy to reorganize documents on a site and also to reuse documents in multiple hierarchies.
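
The "big bag" idea can be illustrated with a short Python sketch: documents sit in one flat mapping keyed by ID, and each site is only a hierarchy of IDs. The data layout and the outline function are assumptions made for this example, not Daisy’s storage format.

```python
# Flat repository: every document lives in one mapping keyed by its ID.
repository = {
    1: "Deck overview",
    2: "Choosing decking material",
    3: "Setting the posts",
    4: "Fastener reference",
}

# Each site is only a hierarchy of IDs; the documents themselves are not copied.
# Each node is a tuple of (document ID, list of child nodes).
homeowner_site = [(1, [(2, []), (3, [])])]
dealer_site = [(1, [(4, []), (2, [])])]


def outline(tree, depth=0):
    """Render a hierarchy of IDs as an indented list of document names."""
    lines = []
    for doc_id, children in tree:
        lines.append("  " * depth + repository[doc_id])
        lines.extend(outline(children, depth + 1))
    return lines


print("\n".join(outline(homeowner_site)))
print("\n".join(outline(dealer_site)))
```

Because the hierarchies hold only IDs, reorganizing a site means editing its tree, and the same document (ID 2 here) can appear in both sites without duplication.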

Daisy can be installed on just about any PC (or Mac) on a network and be accessed from anywhere on the network. A Daisy production instance is generally installed on a network server with backup capability or, for public Internet access, on a public web application server.

The Deck Doc Example

The following example shows how you can use Daisy to manage a documentation set using modular topics with defined topic types.

Sample Daisy site

A sample Daisy site, shown in Figure 1, contains topics about adding a deck to your home. A navigation pane on the left lists the topics in a site-specific hierarchy. Daisy can support multiple sites, each of which may be visible to (and editable by) different sets of users.

[Figure 1]

This site uses a customized skin with a logo and customized font settings. Documents are written as modular topics, organized here by topic type with types of Overview, Procedure, Concept, and Reference.

The WYSIWYG editor

Daisy HTML documents are created in a WYSIWYG editor shown in Figure 2. Paragraph styles are selected from a drop-down selection list. Style tags are available for a subset of common HTML elements used to indicate document structure. The editor also includes controls for creating and formatting HTML tables. With minor modification, Daisy can support the use of CSS class attributes to control formatting, for example to support a range of predefined table formats.

[Figure 2]

The HTML view in Figure 3 shows the tags of a Daisy HTML document. Most editing can be done in the WYSIWYG view, but familiarity with HTML is helpful for fixing occasional problems. (The WYSIWYG editor isn’t perfect: it can get confused by things like multi-level lists. When problems do occur, they can almost always be fixed by switching to the HTML view and making adjustments there.)

[Figure 3]

Metadata fields

Each document type can be defined to include an optional Fields tab containing fields set up for that document type. The Topic document type has the fields shown in Figure 4 (Product, Status, Topic Type and Audience).

[Figure 4]

The Navigation document

Each site has a Navigation document as shown in Figure 5. This document defines the sequence and indentation levels for the documents that appear in the Navigation pane. (A Navigation document is similar in concept and function to a DITA map.) The Navigation document can refer to individual documents by ID, or it can use queries based on built-in document properties (such as name, owner, or document ID) or customized metadata (such as ProductName, DocumentType, or TopicType) to assemble a set of topics for the site.
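
To make the two kinds of navigation entries concrete, here is a minimal Python sketch in which an entry is either a fixed document ID or a query on a metadata value. The dictionary layout and the resolve function are invented for illustration and are not the format of a real Daisy Navigation document.

```python
documents = [
    {"id": 1, "name": "Deck overview", "TopicType": "Overview"},
    {"id": 2, "name": "Setting the posts", "TopicType": "Procedure"},
    {"id": 3, "name": "Attaching the ledger board", "TopicType": "Procedure"},
    {"id": 4, "name": "Fastener reference", "TopicType": "Reference"},
]

# A navigation definition mixes fixed entries (by ID) with query entries
# that select documents by a metadata value.
navigation = [
    {"id": 1},
    {"label": "Procedures", "query": {"TopicType": "Procedure"}},
    {"id": 4},
]


def resolve(navigation, documents):
    """Expand navigation entries into (indent level, name) pairs."""
    by_id = {d["id"]: d for d in documents}
    entries = []
    for node in navigation:
        if "id" in node:
            entries.append((0, by_id[node["id"]]["name"]))
        else:
            entries.append((0, node["label"]))
            matches = [
                d for d in documents
                if all(d.get(k) == v for k, v in node["query"].items())
            ]
            for d in sorted(matches, key=lambda d: d["name"]):
                entries.append((1, d["name"]))
    return entries


for level, name in resolve(navigation, documents):
    print("  " * level + name)
```

With a query entry, a newly added Procedure topic would show up in the navigation without anyone editing the navigation definition.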

[Figure 5]

Embedded queries

Queries and Include statements can also be embedded within a Daisy HTML document to include lists of documents (or the full text of documents) in other documents. For example, a one-line query embedded in Figure 6 results in a list of topics about decks.

[Figure 6]

Another use for the document Include feature might be to define a so-called Snippet document type for short documents designed to be included in other documents. Snippet documents could then be reused for boiler-plate text or to customize topics for specific audiences.
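
The snippet idea can be sketched as a simple text substitution. The example below assumes a made-up include marker of the form [[include:ID]] and a small in-memory snippet store; Daisy’s own include syntax is different, so treat this purely as an illustration of reuse by inclusion.

```python
import re

# Hypothetical snippet store keyed by document ID.
snippets = {
    101: "Always wear eye protection when cutting lumber.",
    102: "Check local building codes before you begin.",
}

# Hypothetical include marker: [[include:ID]]
INCLUDE = re.compile(r"\[\[include:(\d+)\]\]")


def expand_includes(text: str) -> str:
    """Replace each include marker with the text of the referenced snippet."""
    return INCLUDE.sub(lambda m: snippets[int(m.group(1))], text)


topic = "Cutting the joists\n[[include:101]]\nMeasure twice, cut once."
print(expand_includes(topic))
```

Updating snippet 101 in one place would change every topic that includes it the next time the topic is rendered.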

Faceted browsing

In addition to full-text search (not shown), Daisy’s Faceted Browser view (shown in Figure 7) lets users apply filters in any order to drill down to topics that are of interest—for example, Procedures for version 2 of a Product X specific to Dealers in Region 3. (Faceted browsing can be a more reliable way to find topics when a user does not know the right search term.)
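
Faceted browsing amounts to repeatedly filtering a set of documents on metadata values while showing how many documents remain under each value. The Python sketch below shows the principle on a handful of invented topics; the field names and functions are assumptions for the example, not Daisy’s implementation.

```python
from collections import Counter

topics = [
    {"name": "Setting the posts", "TopicType": "Procedure", "Version": "2", "Audience": "Dealers"},
    {"name": "Sealing the deck", "TopicType": "Procedure", "Version": "2", "Audience": "Customers"},
    {"name": "Fastener reference", "TopicType": "Reference", "Version": "2", "Audience": "Dealers"},
    {"name": "Setting the posts (v1)", "TopicType": "Procedure", "Version": "1", "Audience": "Dealers"},
]


def apply_filters(topics, filters):
    """Keep only topics matching every selected facet value."""
    return [t for t in topics if all(t[f] == v for f, v in filters.items())]


def facet_counts(topics, facet):
    """Count how many remaining topics carry each value of a facet."""
    return Counter(t[facet] for t in topics)


# Filters can be applied one at a time, in any order; the result is the same.
step1 = apply_filters(topics, {"Version": "2"})
step2 = apply_filters(step1, {"TopicType": "Procedure"})
step3 = apply_filters(step2, {"Audience": "Dealers"})

print(facet_counts(step1, "TopicType"))
print([t["name"] for t in step3])
```

The counts are what let a user who does not know the right search term see which values exist and narrow down from there.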

[Figure 7]

PDF Processing

Daisy provides several types of PDF conversion for topics, sets of topics, or full books. First, users can view and print individual topics in PDF on the fly as shown in Figure 8.

[Figure 8]

Single document PDF

Second, users can also assemble any subset of topics on the fly using the Document Basket and then view or print the set as HTML or PDF.

PDF Books

Third, an authorized user (usually an administrator or editor) can define and publish PDF books (shown in Figure 9) using documents in the repository.

[Figure 9]

A book’s table of contents can show topics at multiple levels as specified in a book definition.

PDF books include standard features such as formatted pages and chapter and section numbering (Figure 10).

[Figure 10]

Publishers can produce multiple versions of a book for separate audiences or users of different product versions.

In the Deck example, the Audience field on each topic could be used to create separate versions for pre-sale, customers, or dealers, and to maintain an entire set of books for each product version—for example, to continue maintaining the 2007 version alongside the 2008 version.
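
One way to picture this is as a selection over topic metadata, producing one book per combination of product version and audience. The Python sketch below assumes that each topic can carry several Audience and Version values; the topic names, the field layout, and the build_books function are hypothetical and serve only to illustrate the approach.

```python
from itertools import product

topics = [
    {"name": "Why add a deck?", "Audience": {"pre-sale"}, "Version": {"2007", "2008"}},
    {"name": "Setting the posts", "Audience": {"customers", "dealers"}, "Version": {"2007", "2008"}},
    {"name": "Composite decking options", "Audience": {"customers", "dealers"}, "Version": {"2008"}},
    {"name": "Dealer pricing guide", "Audience": {"dealers"}, "Version": {"2008"}},
]


def build_books(topics, audiences, versions):
    """Return one list of topic names per (version, audience) combination."""
    books = {}
    for version, audience in product(versions, audiences):
        books[(version, audience)] = [
            t["name"] for t in topics
            if audience in t["Audience"] and version in t["Version"]
        ]
    return books


books = build_books(topics, ["pre-sale", "customers", "dealers"], ["2007", "2008"])
print(books[("2008", "dealers")])
print(books[("2007", "customers")])
```

Topics shared between versions are written once and selected into each book, so maintaining the 2007 set alongside the 2008 set does not require copying them.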

Customization

Daisy is built using the Java programming language, but because of its modular architecture and standard open source components, it can be configured and customized by anyone familiar with XML technology, often without the need for Java programming.

Here are some of the ways Daisy can be configured and customized, from least to most complex:

  • To use Daisy, you must set up some out-of-the-box configuration options. You can set up document types, field definitions, access rights, and roles using Daisy’s GUI-based Administration page. Setting up Daisy sites can be done by editing a simple set of XML configuration files. These files also include some basic customization such as adding a custom logo as part of the site definition.
  • By digging a little deeper, you can set up site-specific CSS stylesheets. They allow you to modify aspects such as site fonts and colors. Although there is basic documentation about Daisy’s CSS design, changing a stylesheet requires an understanding of CSS and some experimentation to understand how it is implemented in Daisy.
  • By modifying Daisy XSL transforms, you can control the sequence and structure of elements on a page—to combine Daisy documents or document parts on a single page or hide menu options based on user role, for example. These require an understanding of XSL and begin to look a little more like programming.
  • You can change PDF page formats. By modifying a set of XSL-FO stylesheets, you can control the formatting of Daisy PDF pages. You can modify a PDF parameter setting to control aspects such as font and page size. By digging deeper, you can add your own parameters—for example, you might add a parameter to control the font size in tables independently of normal body text.
  • At the high end, you can download the source code, set up a Daisy Java development environment, and modify any aspect of Daisy you wish. Java coding would be necessary to make changes to hard-coded functions, such as customizing search pages.

What You Will Need

The Daisy site (cocoondev.org/daisy) provides a complete description of what you need to download and install Daisy. In addition to a computer and an Internet connection, you’ll need three main things:

  • a Java environment, which can be downloaded from Sun
  • the MySQL Community edition, which can be downloaded from the MySQL site
  • Daisy, which can be downloaded from SourceForge (you can link to this from the Daisy site)

You can install Daisy on a Windows PC, a Mac, or a Unix/Linux computer.

Is Daisy For You?

Daisy may or may not meet your requirements for structured publishing. Here are some reasons technical publishing groups have chosen NOT to use Daisy or other Wiki-based solutions for production use:

High end or specialized requirements

If you need to publish documents with high-end production values, such as detailed control of typography and layout, you probably would not want to rely on a Wiki editor. (On the other hand, a Wiki can be ideal for distributing images or PDF documents created in separate high-end tools.) If your organization needs to use DITA, Daisy is not currently a solution (though there have been discussions about adding DITA support to Daisy and other open source projects).

Need to work offline

If writers and editors need to work offline, a Wiki isn’t a good fit, since Wikis require a connection to the server. You can copy and paste documents back and forth from Word or another editor, but that is not a satisfactory solution on an ongoing basis.

Lack of technical orientation

An open source Wiki requires up-front decisions about setup and configuration, as well as ongoing technical support. In addition to making decisions about which tools to use and how to deploy them, you need to set up detailed guidelines for use (including an information architecture describing document types, metadata, and content guidelines), and there needs to be a ‘go to’ person to deal with operational questions, both on the back end (how to manage the server) and on the front end (which paragraph tags to use or how to fix a document format). Help is available for open source tools, but you may have to find it on your own.

Can’t afford the learning curve

If you have a mission-critical function with a tight deadline, high volume, or the need for high-end features like domain-specific translation memory or decentralized publishing workflow, you may want a vendor’s expertise.

On the other hand, many technical publishing applications, especially in areas such as technical product documentation, don’t have these requirements and may be well served by the benefits of a Wiki publishing environment.

About the Author

Peter Dykstra
MetaphorX LLC
peter.dykstra@metaphorx.com

Peter Dykstra is founder and principal consultant of MetaphorX LLC. He has 30 years of experience in technical communication, including 10+ years as the Product Information Director at a 500M IT firm. MetaphorX provides consulting and training services for Daisy in the US.
