Using Semantic Mark-up Languages with XML to Aid Search
The Information Problem
We’ve all been frustrated when looking for information on the web or on an intranet: timely access to the right information is hard to come by. Intranets and the web allow us to access information quickly, but often overwhelm us with other information. The right information has to be
- detailed enough without too much extra information
- free of excess links to extra information, so I am not confused about which extra content is actually relevant
- based on the context of the situation
The context can be defined by what I need to do, what information would answer my question, what my role is, what the product or item is, and where I am.
For an example of this problem, consider a conference registration web page. We see things like the name of the conference, the city and country it’s held in, the dates, the location, the speakers and the subjects they will discuss, how to register, and the price. Often, however, very little of this specific information is actually tagged in relation to the rest, so the topics, places, or presenters cannot be searched on their various facets. For example, how many people are speaking about the DITA Open Toolkit versus a content management system? While we might have tags for name, location, speakers, and registration, the details are rarely organized so that we can take further advantage of searching through them (see Figure 1).
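To see why this matters for search, here is a small sketch in Python: the same conference programme represented first as untagged prose and then as tagged records. The speakers and topics are invented for illustration.

```python
# A hypothetical conference programme, first as untagged prose.
untagged = ("Jane Doe presents on the DITA Open Toolkit. "
            "John Roe presents on a content management system. "
            "Ann Poe presents on the DITA Open Toolkit.")

# Free text cannot reliably answer "how many speakers per topic?", but once
# each talk is tagged with explicit facets, the question becomes trivial.
talks = [
    {"speaker": "Jane Doe", "topic": "DITA Open Toolkit"},
    {"speaker": "John Roe", "topic": "content management system"},
    {"speaker": "Ann Poe",  "topic": "DITA Open Toolkit"},
]

def speakers_on(topic):
    """Return the speakers whose talks are tagged with the given topic."""
    return [t["speaker"] for t in talks if t["topic"] == topic]

print(len(speakers_on("DITA Open Toolkit")))          # 2
print(len(speakers_on("content management system")))  # 1
```

The faceted query works only because each record carries the topic explicitly; no amount of string matching over the prose version is as reliable.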
Figure 1: Sample HTML Web Tagging Code (Copyright David De Roure)
We need to add meaning to the content so that machines can better interpret the information for us, that is to say, answer more sophisticated queries that give us the direct answers we want instead of 1 million Google hits. This means that tags cannot be arbitrary; they must be interpretable by machines and must be able to go beyond a limited controlled vocabulary. Tagging can’t be informal. It needs a vocabulary of standard language elements to add meaning to the content. The language elements can be combined to build more precise statements, such as the class of objects, their allowable relationships with other objects, and data used to describe those objects.
Ultimately, we need a vehicle for representing knowledge. Here’s an example for a search paradigm that could use a knowledge model:
I’m a network engineer. I want to configure the Ethernet Routing Switch 1234 to use the IPv6 protocol with the Command Line Interface for product version 5.1.
The Information Tagging Challenge
The first thing to realize is that technical writers tag for user context, but they also tag based on their own expert knowledge. There are several problems with this approach. First, we may make incorrect assumptions about the customer. Second, customers may think about the information differently than the writer does; if they do not understand the relationships being described, they cannot pick the right words, and key tags get missed. Third, customers often use synonyms, slang, or incorrect syntax when searching. Google attempts to correct for incorrect syntax, but may not provide comprehensive results for the synonyms “truck” and “lorry”. Fourth, how can you tag metadata for all of this at the point of creation? You can’t tag everything, and even a minimally controlled vocabulary is incomplete. Typically, authors focus on keywords in a title and the introductory sentence. Often some content is only implied, such as the graphical user interface or operating system.
Constraining the Metadata Problem
Metadata is often explained as data about data. The good news is that there are already some solutions for metadata. For example, most technical metadata about an electronic file is automatically generated—its time and date of creation, file size, image resolution, and creator.
Publication metadata is also easy to add and is well established after our many years of classifying books and journals. Elements like ISBN, copyright year, and copyright holder are clear to add, even if it needs to be done manually.
Then there is descriptive metadata. This is where things get harder, because different users describe the same things in different ways, and metadata about the product or its components can be ambiguous. In the DITA world, <keywords> and <category> are deliberately vague to allow wide-scale adoption of the metadata elements.
Following is an example of DITA standard metadata:
- Audience = Network engineer
- Author = Joe Smith, tech writer
- Category = Job function (Configuration)
- Keyword/index term = Protocols > IPv6
- Prodinfo =
- Prodname = Ethernet Routing Switch 1234
- Vrm = Release 5.1.0
- Platform = Linux
- Component = Base software
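As a sketch, the metadata above might appear in a DITA topic’s <prolog> roughly as follows. The element names follow the DITA 1.2 language reference, but the exact attribute choices (for example, on <audience>) can vary; Python’s standard-library ElementTree is used only to show that the values become machine-readable.

```python
import xml.etree.ElementTree as ET

# A minimal DITA <prolog> carrying the metadata listed above.  Element names
# follow the DITA 1.2 language reference; the values are the article's sample
# values, and the <audience> attributes are an illustrative choice.
prolog_xml = """
<prolog>
  <author>Joe Smith</author>
  <metadata>
    <audience type="other" othertype="network engineer"/>
    <category>Configuration</category>
    <keywords><indexterm>Protocols<indexterm>IPv6</indexterm></indexterm></keywords>
    <prodinfo>
      <prodname>Ethernet Routing Switch 1234</prodname>
      <vrmlist><vrm version="5" release="1" modification="0"/></vrmlist>
      <platform>Linux</platform>
      <component>Base software</component>
    </prodinfo>
  </metadata>
</prolog>
"""

root = ET.fromstring(prolog_xml)
print(root.findtext(".//prodname"))   # Ethernet Routing Switch 1234
vrm = root.find(".//vrm")
print("%(version)s.%(release)s.%(modification)s" % vrm.attrib)  # 5.1.0
```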
DITA is vague about metadata for <category> and <keyword>. As a result, an information architect must work with a product subject matter expert to develop hierarchies or taxonomies of how information relates. DITA does not attempt to provide more guidance on the subject because each domain or industry and product has its own context.
One significant metadata management issue in DITA is that the <keyword> and <indexterm> elements end up replicating other metadata or omitting the higher framework altogether. A number of potential taxonomies, controlled vocabularies, thesauri, data dictionaries, and metadata registries exist to describe various parts of the metadata. But how do all of these multi-dimensional documents relate to each other?
Metadata Tagging is Broken
Metadata helps categorize content, but there are a number of issues with managing metadata in DITA:
- DITA has limited categories.
- DTDs take time to change to accommodate metadata specialization.
- Relationships between metadata categories are not explicit.
- There is often difficulty maintaining metadata when content is used in new contexts.
- You can’t predict where content will get used.
As a best practice for metadata management, Joe Gelb recommends that you maintain categorizations, taxonomies, and relationships outside of the content.1
With the release of DITA 1.22, we now have the SubjectScheme3, which allows authoring teams to go beyond a simple hierarchy of information: they can define taxonomies for each metadata category as well as the relationships between categories. The SubjectScheme lets us standardize the subject matter, index, and glossary; create associations across hierarchies; and enable better planning, content reuse, robust filtering, and production delivery. The SubjectScheme also allows us to classify content using categories not in the metadata, for instance, by defining values for the <category> element.
The DITA SubjectScheme can organize hierarchical and associative relationships using simple object property relationships. The SubjectScheme has five key relationships:
- hasInstance4 (“ProductName” hasInstance “Ethernet Routing Switch”)
- hasKind5 (“Switch” hasKind “Layer 1 Switch” vs. “Layer 2 Switch”)
- hasNarrower6 (“Telco Equipment” hasNarrower “Switch”)
- hasPart7 (“Switch” hasHardware “Ethernet Port”)
- hasRelated8 (“Switch” hasRelated (usesProtocol) “IPv6”)
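As an illustration, the five relationships above can be treated as simple (subject, relationship, object) facts, and a query can then be broadened by walking the hasNarrower and hasKind links. This is a toy sketch in Python, not a SubjectScheme implementation; the terms are the article’s own examples.

```python
# The five SubjectScheme relationships above, sketched as
# (subject, relationship, object) facts.
facts = {
    ("ProductName", "hasInstance", "Ethernet Routing Switch"),
    ("Switch", "hasKind", "Layer 1 Switch"),
    ("Switch", "hasKind", "Layer 2 Switch"),
    ("Telco Equipment", "hasNarrower", "Switch"),
    ("Switch", "hasPart", "Ethernet Port"),
    ("Switch", "hasRelated", "IPv6"),
}

def narrower(term):
    """All terms reachable from `term` by following hasNarrower/hasKind."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in facts:
            if s == current and p in ("hasNarrower", "hasKind") and o not in found:
                found.add(o)
                frontier.append(o)
    return found

# Broadening a query on "Telco Equipment" now also matches kinds of Switch.
print(sorted(narrower("Telco Equipment")))
# ['Layer 1 Switch', 'Layer 2 Switch', 'Switch']
```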
Few tools currently support the implementation of the DITA SubjectScheme. Current tools use two-dimensional relationship tables to define associations between subjects and their related topics. Two-dimensional relationships are often hard to visualize as the relationships become more complex.
The limitation of the SubjectScheme is that it is not expressive enough; it lacks:
- Equivalent/non-equivalent classes or properties
- Unions/intersections of classes
- Restrictions on values or relationships
- A related query language
- Reasoning or inferencing
- Complex “folksonomies” of terms
- Easy visual representation of more complex metadata models
In essence, the SubjectScheme is not very responsive to end-user language, and it has too little logic to handle situations where a query for a “truck” also needs to bring back results for a “lorry”.
When you have more complex information problems, you can turn to the recently developed semantic standards that use mark-up languages like
- RDF/RDFS (Resource Description Framework/RDF Schema)
- OWL (Web Ontology Language)
XML is the common denominator mark-up language and RDF, RDFS, and OWL build on it to offer increasing levels of logic and reasoning expressiveness (see Figure 2). This hierarchy of increasing expressivity enables the application of logic, reasoning, and inference when the basic XML cannot. XML alone has no logic other than a hierarchical taxonomy.
The record of writing has evolved from a scroll, to a codex (book), to recent innovations such as topic-based modular authoring (like STOP Storyboarding), to chunks or blocks (like those advocated by Information Mapping™), to the representation of facts or triples as the smallest unit of content to build all knowledge.
Resource Description Framework (RDF)
RDF9 is a general method for modelling information. It is a W3C standard (2004) with the support and advocacy of Tim Berners-Lee, the inventor of the World Wide Web. RDF forms the foundation for RDFa10, RDFS11, and OWL12. RDF allows you to create general classes of information in hierarchies and then add instances of those classes. For example, a Vehicle class has Car and Truck subclasses, and the Car class has an instance, Ford Focus. RDF Schema (RDFS) also allows you to model data properties so that you can describe your instances. For example, your Ford Focus has data properties for wheels = 4, seats = 5, and motor size = 2.5 liters.
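The vehicle example can be sketched as triples, with a small loop that computes an instance’s direct and inherited classes. The predicates mimic rdf:type and rdfs:subClassOf with prefixes abbreviated; this is a toy model, not an RDF library.

```python
# The vehicle example as RDF-style triples: rdfs:subClassOf for the class
# hierarchy, rdf:type for instances, and plain predicates for data properties.
triples = [
    ("Car",       "rdfs:subClassOf", "Vehicle"),
    ("Truck",     "rdfs:subClassOf", "Vehicle"),
    ("FordFocus", "rdf:type",        "Car"),
    ("FordFocus", "wheels",          4),
    ("FordFocus", "seats",           5),
    ("FordFocus", "motorSizeLitres", 2.5),
]

def types_of(instance):
    """Direct and inherited classes of an instance."""
    classes = {o for s, p, o in triples if s == instance and p == "rdf:type"}
    changed = True
    while changed:
        supers = {o for s, p, o in triples
                  if p == "rdfs:subClassOf" and s in classes}
        changed = not supers <= classes
        classes |= supers
    return classes

# A query for Vehicles now finds the Ford Focus even though it was only
# typed as a Car.
print(sorted(types_of("FordFocus")))  # ['Car', 'Vehicle']
```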
RDF is based on triples in subject-predicate-object form. In RDF Schema, a property’s domain constrains the class of allowable subjects and its range constrains the class of allowable objects. The combination of these constructs allows you to make machine-readable statements about your content.
RDFS is approximately equivalent in expressiveness to the DITA SubjectScheme. RDF triples are normally stored in a separate Triple Store database13. You can still keep your XML content in your current database and use the Triple Store database to maintain the additional semantic information. This way you do not have to maintain two complete sets of data; rather, you have an additional layer on top of your original XML data.
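The layering idea can be sketched as two stores: the XML content keyed by topic id, and a separate set of triples describing those topics. The topic ids, titles, and predicates here are invented for illustration.

```python
# XML content lives in one store; the semantic layer lives in another and
# refers to topics only by id.  All ids and predicates are illustrative.
content_store = {
    "t1": "<topic id='t1'><title>Configuring IPv6</title></topic>",
    "t2": "<topic id='t2'><title>Replacing the fan tray</title></topic>",
}

triple_store = [
    ("t1", "aboutProtocol", "IPv6"),
    ("t1", "aboutProduct",  "Ethernet Routing Switch 1234"),
    ("t2", "aboutProduct",  "Ethernet Routing Switch 1234"),
]

def topics_where(predicate, value):
    """Query the semantic layer, then fetch the matching XML content."""
    ids = [s for s, p, o in triple_store if p == predicate and o == value]
    return [content_store[i] for i in ids]

for xml in topics_where("aboutProtocol", "IPv6"):
    print(xml.split("<title>")[1].split("</title>")[0])  # Configuring IPv6
```

Because the triples only reference topic ids, the semantic layer can be rebuilt or extended without touching the XML source.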
RDF provides a general modeling environment suitable for defining the DITA category and keyword/index terms, and it works well for low-level data annotation. You do, however, have to spend time building knowledge models suitable for managing DITA information and integrating those knowledge model files into current information development and search toolsets. For example, how will the XML editor application call up the various attribute file definitions and control the associations between the various attributes?
Web Ontology Language (OWL)
OWL is a family of well-defined logic languages (OWL Full, OWL DL, and OWL Lite) and is often called OWL 2 since the release of the 2009 version14. OWL offers formal semantics based on description logic15, a fragment of first-order logic. It also builds on the RDF framework. It has use cases16, tools17, and implementations18. It has become popular with some segments of industry, particularly pharmacology and bioinformatics.
OWL DL is based on a description logic with known complexity and decidability properties, known reasoning algorithms, and implemented systems. In particular, OWL offers the ability to express
- unions of information: Microsoft® Office 2010 (class = Product suite) is a union of Word® (Product) and PowerPoint® (Product)
- disjointness: two classes share no members, like Car and Truck (though both may belong to the parent class Motor Vehicle)
- subsumption: one class definition is inferred to fall under another; for example, a Parent must have children, all of whom are Persons and each either male or female
- satisfiability reasoning: an SLR camera belongs to the class “Entry Level SLR” when it is a Camera and has skill level = Entry
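The camera example can be sketched as toy instance classification: a “defined class” states its conditions, and individuals are classified by checking them. A real OWL DL reasoner generalizes this to arbitrary class expressions; the camera names here are invented.

```python
# A defined class names necessary-and-sufficient conditions; individuals are
# classified by testing those conditions.  Camera names are hypothetical.
defined_classes = {
    "Entry Level SLR": lambda ind: ind["type"] == "Camera"
                                   and ind.get("skillLevel") == "Entry",
}

individuals = {
    "EOS-100": {"type": "Camera", "skillLevel": "Entry"},
    "EOS-900": {"type": "Camera", "skillLevel": "Professional"},
}

def classify(name):
    """Return the defined classes this individual is inferred to belong to."""
    ind = individuals[name]
    return [c for c, test in defined_classes.items() if test(ind)]

print(classify("EOS-100"))  # ['Entry Level SLR']
print(classify("EOS-900"))  # []
```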
The value OWL brings is the improved search results in terms of precision and recall through subsumption reasoning, satisfiability reasoning, and instance classification. OWL also provides a knowledge model for more than search—data integration, text mining, tagging, and auto summarization. OWL does come at a cost. You need to do more knowledge modeling, and you need more infrastructure to support each additional task the model provides.
As an alternative to the formal rigor of RDF and OWL, microformats, which came out in 200519, can also add more metadata to your content. Microformats are much less logic-based and more practical; often a single microformat can describe your product or service. Microformats also help describe how your DITA XML can be exported as XHTML, Atom, or RSS equivalents. The most useful microformats for DITA include:
- hProduct20: Product definition format
- hNews21: News article format
- hReview22: Content rating format
- rel=”tag”23: Author designated hyperlink tag format
- Extensible Open XHTML Outlines (XOXO)24: Open outline format
- hCalendar25: Meeting download format
Table 1 shows an example of the hProduct microformat.
Table 1: hProduct Microformat for a Canon Camera
The value of microformats is that they make key data available for search engines and indexing. They work well within the available microformat scenarios. In fact, some OWL ontologies use microformats as part of their structure. One disadvantage of microformats is that not many are available; another is the inability to reason over them and their mixing of data properties with class instances.
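As a sketch, an hProduct block for a camera might look like the snippet below, using the published hProduct class names (fn, brand, price); the product details are invented. Python’s standard-library HTMLParser pulls the fields out much the way a search engine’s indexer might.

```python
from html.parser import HTMLParser

# A minimal hProduct snippet; the camera details are hypothetical.
html_doc = """
<div class="hproduct">
  <span class="brand">Canon</span>
  <span class="fn">EOS Example SLR</span>
  <span class="price">$599</span>
</div>
"""

class HProductParser(HTMLParser):
    """Collect the text of elements carrying hProduct property classes."""
    WANTED = {"fn", "brand", "price"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        hits = self.WANTED & set(classes.split())
        self._current = hits.pop() if hits else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = HProductParser()
parser.feed(html_doc)
print(parser.fields)  # {'brand': 'Canon', 'fn': 'EOS Example SLR', 'price': '$599'}
```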
Benefits of Using RDF, OWL, and Microformats
The use of semantics is more than just metadata tagging for search. Semantics can provide for
- semantic integration of various databases
- knowledge exploration
- semantic annotation
- automatic document tagging
- auto-summarization of a document
- business intelligence mash-ups
- decision making support
- common metadata vocabulary for self-annotation of documents
Infrastructure for a Semantic Solution
Building a semantic solution requires people, hardware, software, and ultimately investment. Required human resources include
- Solutions Architect
- Domain Expert
- Logics Reasoner
- Application Developer
- Database Architect
- Information Architect
- Business Analyst
- Project Manager
- Text Mining Engineer
Hardware is normally limited to one or more servers to house the various applications. In particular, one will be needed for the Triple Store. There is a wide variety of benchmarked Triple Stores available26.
New software tools you will have to learn include:
- ontology editor
- text miner
- rules engine/reasoner
- end user interface
- query languages27
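Query languages for RDF data (SPARQL is the W3C standard) center on matching triple patterns that contain variables. A rough Python sketch of that basic operation follows, with invented facts and “?x”-style strings standing in for SPARQL variables.

```python
# SPARQL's basic operation is matching triple patterns with variables.
# All facts below are invented for illustration.
store = [
    ("ERS1234", "supportsProtocol", "IPv6"),
    ("ERS1234", "hasComponent",     "Base software"),
    ("ERS5678", "supportsProtocol", "IPv4"),
]

def match(pattern, store):
    """Yield variable bindings for one (s, p, o) pattern."""
    for triple in store:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break
        else:
            yield binding

# Analogous to: SELECT ?product WHERE { ?product supportsProtocol "IPv6" }
print([b["?product"]
       for b in match(("?product", "supportsProtocol", "IPv6"), store)])
# ['ERS1234']
```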
While semantic solutions can be more expensive than traditional methods, the expense is largely due to the newness of the solution (a steep learning curve), a still limited group of people with the appropriate job skills, and the need to build new types of end user applications to take advantage of the power of semantics. The good news is that semantics can bring a holistic view to your information rather than supporting multiple piecemeal approaches that, when added up, may require more effort to build, maintain, or modify than a semantic solution. Typically, there is greater value in the semantic solution per dollar spent.
The main activities in a semantic project include these high level tasks:
- Defining the problem
- Creating an ontology
- Composing the standard queries
- Creating a text mining pipeline
- Creating an end user application
Developing an ontology involves the tasks outlined in Figure 3.
Figure 3: Workflow for Developing an Ontology (Source: A. Garcia Castro)
Using semantic languages offers you a much better way to manage your metadata to enhance search. Metadata proves useful both for information developers when authoring content and for end users looking for the final published product. Semantics can also provide much more than enhanced search. Semantic knowledge models allow you to increase the value of your information because of the rigor spent ensuring high-quality information. Semantics can also unlock value in the new search paradigms this form of knowledge representation opens up, such as faceted search, knowledge discovery (inferred facts), and browsing associated information that is expressed not in the content itself but in the knowledge model. The level of semantics you need will depend on the complexity of the product and its users and the information management problems you are trying to solve.
Bradley Shoebottom has worked as a technical writer, instructional designer, and information architect for many telecommunications products. He is also the resident ontologist designing knowledge models for products. He works for the Research and Product Development group at Innovatia trialing new knowledge management practices and technologies. Before Innovatia, he worked as an instructional designer for LearnStream and as a Canadian Armed Forces instructor. He is also currently a lecturer in Technology and Warfare and Canadian Military History. He has self-published several Canadian history articles and co-authored several articles on semantic web projects implemented at Innovatia. He has a BA in International Relations, an MA in Warfare Studies, and an MA in Atlantic Canadian History, and is currently completing a PhD in Interdisciplinary Studies studying the adoption of semantic web technologies by organizations.
1 Joe Gelb, CMS-DITA NA 2011.
2 DITA Version 1.2 Specification <http://docs.oasis-open.org/dita/v1.2/spec/DITA1.2-spec.html>
3 SubjectScheme Maps <http://docs.oasis-open.org/dita/v1.2/os/spec/archSpec/subjectSchema.html#subjectSchema>
4 hasInstance <http://docs.oasis-open.org/dita/v1.2/os/spec/langref/hasInstance.html#hasInstance>
5 hasKind <http://docs.oasis-open.org/dita/v1.2/os/spec/langref/hasKind.html#hasKind>
6 hasNarrower <http://docs.oasis-open.org/dita/v1.2/os/spec/langref/hasNarrower.html#hasNarrower>
7 hasPart <http://docs.oasis-open.org/dita/v1.2/os/spec/langref/hasPart.html#hasPart>
8 hasRelated <http://docs.oasis-open.org/dita/v1.2/os/spec/langref/hasRelated.html#hasRelated>
9 Resource Description Framework (RDF) <http://www.w3.org/RDF/>
10 RDFa Primer <http://www.w3.org/TR/xhtml-rdfa-primer/>
11 RDF Vocabulary Description Language 1.0: RDF Schema <http://www.w3.org/TR/rdf-schema/>
12 OWL Web Ontology Language Reference <http://www.w3.org/TR/owl-ref/>
13 Triple Store <http://en.wikipedia.org/wiki/Triplestore>
14 OWL 2 Web Ontology Language Document Overview <http://www.w3.org/TR/owl-overview/>
15 Description Logic <http://en.wikipedia.org/wiki/Description_logic>
16 OWL Web Ontology Language Use Cases and Requirements <http://www.w3.org/TR/webont-req/>
17 Semantic Web Development Tools <http://www.w3.org/2001/sw/wiki/Tools>
18 OWL Implementations as of December 2003 (Historical) <http://www.w3.org/2001/sw/WebOnt/impls>
19 Microformats Wiki <http://microformats.org/wiki/Main_Page>
20 hProduct <http://microformats.org/wiki/hproduct>
21 hNews 0.1 <http://microformats.org/wiki/hnews>
22 hReview 0.4 <http://microformats.org/wiki/hreview>
23 rel=“tag” <http://microformats.org/wiki/rel-tag>
24 Extensible Open XHTML Outlines <http://microformats.org/wiki/xoxo>
25 hCalendar <http://microformats.org/wiki/hcalendar>
26 RDF Store Benchmarking <http://www.w3.org/wiki/RdfStoreBenchmarking>
27 Bradley Shoebottom, Getting Started With Semantics, 2010 <http://rmc-ca.academia.edu/BradleyShoebottom/Papers/358330/Getting_Started_with_Semantics_in_the_Enterprise>