Access to Shared Linguistic Assets through Open Information Retrieval Standards
In this discussion, I review the possibilities of applying open information retrieval standards to enhance access to terminology assets. Terminology management and control is a critical area in the production of technical documentation. A systematic approach to the use of linguistic assets is a key factor in any technical documentation project, as terminology control is needed to
- help authors and reviewers use a common, defined set of terms
- standardize the writing of technical documents among the different writers involved in content production
- save money in internationalization activities, by sharing with translation and localization agencies the linguistic assets that should be reused
- ensure a common understanding of the concepts used in documents with customers, partners, or any other third parties involved in the project
- avoid misinterpretations of potentially ambiguous terms
The management of organizational linguistic assets includes the control and management of specialized technical terminology. This management includes a set of well-defined activities, among others:
- Identify and extract terms by analyzing existing documents or through discussions with domain experts.
- Record linguistic information for the selected terms—part of speech, variants, and so on.
- Identify similar and related terms.
- Identify and document the semantic relationships between terms.
- Record terms’ definitions and scope.
- Record the different contexts in which the terms are used in the documents.
- Record equivalent terms in other languages.
- Communicate and manage the resulting databases or glossaries to ensure they are known and properly used by the company staff when writing documents or creating translations.
The result of these activities is usually recorded in databases made up of a set of terminography records where the following data are recorded: definitions and scope notes with recommendations for the use of the terms; textual fragments or illustrations showing how the terms are used in different contexts; synonyms and related terms that build a conceptual structure that can be used later to explore and browse knowledge; and the equivalences between terms in different languages that refer to the same concepts.
The use of these databases is not limited to the elaboration of technical documents. In fact, terminology as a discipline, and its products, are closely related to other disciplines like Documentation and Information Science and Knowledge Engineering. Documentation and Information Science focuses on terminology to identify terms that are valid as descriptors to help users of information systems retrieve information. Knowledge Engineering focuses on the use of terms to create shared understandings, or ontologies, for special knowledge domains to support the automation of knowledge-intensive processes.
For the three disciplines (Technical Communication, Documentation and Information Science, and Knowledge Engineering), the management of linguistic assets has an added-value that goes beyond the standardization of writing and editorial processes.
Linguistic Assets: Who can benefit from them?
Terminology produces different artifacts with different levels of semantic elaboration. The list of terminology artifacts includes
- Glossaries: alphabetically ordered lists of terms with their definitions and related terms.
- Thesauri: list of authorized and non-authorized terms with basic semantic relationships among them (broader term BT, narrower term NT) plus scope notes ordered alphabetically and systematically (by subject area). Thesauri are mainly used for indexing documents and information retrieval.
- Translation memories: databases containing sentences that are equivalent in different languages. They are used to streamline and ensure consistency in the translation processes.
- Ontologies: set of terms and concepts systematically arranged, recording complex semantic relationships among them. Ontologies are used to support complex reasoning and information processing activities by software agents.
The potential users of these terminology artifacts include actors involved in different software development processes: a) technical writers and translators benefit from these artifacts when writing documents because they are useful to standardize language and vocabulary;
b) engineers benefit from terminology artifacts because they are useful to ensure the consistency of the specification of the system’s behavior and avoid misunderstandings; and c) finally, staff involved in customer support processes can benefit from these assets because they ensure quick, direct access to those information items (customer-support cases, for instance) containing data related to a specific issue or technical problem.
Encoding and Sharing Linguistic Assets
One of the risks associated with linguistic assets is their misuse or underutilization. The real potential of linguistic assets depends on the ability to share them effectively. It is necessary to ensure that linguistic assets are known to all their potential users and that terminology data are made available in formats that are easily reusable.
To achieve this goal, different XML-based schemas have been designed to encode different types of linguistic assets or work products derived from terminology activities. The list of machine-readable, XML-based standards and specifications includes, among others, ISO 12200 MARTIF (Machine-Readable Terminology Interchange Format), TBX (TermBase eXchange), OLIF (Open Lexicon Interchange Format), UTX (Universal Terminology eXchange), and GlossML (Glossary Markup Language). It is also possible to include within this group specifications like SKOS (Simple Knowledge Organization System), used for encoding thesauri and indexing languages, RDF-S (Resource Description Framework Schema), and OWL (Web Ontology Language) for sharing ontologies.
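As an illustration of how these schemas look in practice, a thesaurus entry in SKOS might be encoded as in the following sketch (the URIs, labels, and definition are invented for the example):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- One concept per term: prefLabel carries the authorized term,
       altLabel a non-authorized variant, broader points to the BT -->
  <skos:Concept rdf:about="http://example.org/terms/firmware">
    <skos:prefLabel xml:lang="en">firmware</skos:prefLabel>
    <skos:altLabel xml:lang="en">embedded software</skos:altLabel>
    <skos:definition xml:lang="en">Software stored in non-volatile
      memory that controls a device.</skos:definition>
    <skos:broader rdf:resource="http://example.org/terms/software"/>
  </skos:Concept>
</rdf:RDF>
```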
Some of these specifications are clearly linked to specific terminology artifacts. For example, UTX—developed by the AAMT (Asia-Pacific Association for Machine Translation)—was designed to absorb differences between existing formats for machine translation and is considered a useful format for user-created dictionaries. GlossML, which aims to be simpler than the alternatives, is an XML-based schema for encoding mono- and multilingual glossaries with just six elements and four attributes. On the other hand, TBX, published by LISA in 2002, offers a more complex format that incorporates several data categories (subjectField, definition, term, partOfSpeech, source, context, xGraphic, grammaticalGender, termType, and so on) and structural levels (concept, language, all concept entries, and so on).
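To give an idea of TBX's structure, a minimal concept entry might look like the following sketch (the martifHeader is omitted for brevity, and the entry content is invented):

```xml
<martif type="TBX" xml:lang="en">
  <text>
    <body>
      <!-- One termEntry per concept, one langSet per language -->
      <termEntry id="c42">
        <descrip type="subjectField">software engineering</descrip>
        <descrip type="definition">Software stored in non-volatile
          memory that controls a device.</descrip>
        <langSet xml:lang="en">
          <tig>
            <term>firmware</term>
            <termNote type="partOfSpeech">noun</termNote>
          </tig>
        </langSet>
        <langSet xml:lang="es">
          <tig>
            <term>firmware</term>
          </tig>
        </langSet>
      </termEntry>
    </body>
  </text>
</martif>
```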
Terminology Encoding in DITA
Although DITA was not initially conceived as a schema for encoding and sharing terminology, this specification incorporates a method to encode terminological data that should be considered as a valid choice in terminology management programs. DITA’s terminology management is based on the use of the specialized topic type <glossentry>.
This topic type is used to add entries to a glossary. In its simplest form, a separate <glossentry> topic is kept for each concept, containing a mandatory <glossterm> child element and, typically, a <glossdef> child element. Optional <related-links> child entries can be added to the entry. <glossentry> elements can be grouped together within a <glossgroup> element.
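In its minimal form, such a topic might look like this (the id, file name, and content are invented for the example):

```xml
<glossentry id="gloss-firmware">
  <glossterm>firmware</glossterm>
  <glossdef>Software stored in non-volatile memory
    that controls a device.</glossdef>
  <related-links>
    <link href="gloss-software.dita"/>
  </related-links>
</glossentry>
```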
More complex uses of the glossentry topic type are possible, as described by Warburton [1] and Self [2]. These two guides provide detailed information about the creation of individual glossary topics, how to add glossary topics to ditamaps, and how to link terms appearing in content topics to their definitions in the glossary or to their expanded forms.
The common approach for using glossaries in DITA documents is based on adding <glossref> elements to the map that point to the glossary entries. These <glossref> elements have a @keys attribute that uniquely identifies them. Inline terms that need to be linked to glossary entries are tagged in the form <term keyref="term1">, where @keyref points to the corresponding <glossref> element. Instead of the <term> tag, it is also possible to use the <abbreviated-form> tag to indicate that the tagged fragment needs to be replaced by the corresponding value of the <glossSurfaceForm> or <glossAcronym> element.
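The following sketch illustrates the mechanism (the file names and key are invented):

```xml
<!-- In the DITA map: <glossref> binds the key "firmware"
     to the corresponding glossary topic -->
<map>
  <topicref href="tasks/install-update.dita"/>
  <glossref keys="firmware" href="glossary/gloss-firmware.dita"/>
</map>

<!-- In a content topic: an empty <term> with @keyref is resolved
     against the key defined in the map -->
<p>Download the latest <term keyref="firmware"/> image
  before starting the update.</p>
```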
With DITA encoded glossaries, technical writers can benefit from the automated expansion or display of pop-ups for terms, linking terms to their definitions, and the generation of both printed and online, sortable, and searchable glossaries. In addition, DITA 1.2 incorporated additional elements for encoding glossaries like <glossSynonym>, <glossShortTerm>, <glossSymbol>, <glossPartOfSpeech>, <glossAlt>, <glossUsage>, <glossStatus>, and so on, that make it possible to consider the DITA schema as a valid alternative for the storage and sharing of terminology data.
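Assuming the DITA 1.2 content model, an entry that records an acronym, a prohibited synonym, and usage guidance might be sketched as follows (the content is invented for the example):

```xml
<glossentry id="gloss-abs">
  <glossterm>Antilock Braking System</glossterm>
  <glossdef>A safety system that prevents the wheels
    from locking during braking.</glossdef>
  <glossBody>
    <glossPartOfSpeech value="noun"/>
    <!-- Form to be used the first time the term appears in output -->
    <glossSurfaceForm>Antilock Braking System (ABS)</glossSurfaceForm>
    <glossUsage>Spell out on first use; use the acronym
      afterwards.</glossUsage>
    <glossAlt>
      <glossAcronym>ABS</glossAcronym>
    </glossAlt>
    <glossAlt>
      <glossSynonym>anti-skid system</glossSynonym>
      <glossStatus value="prohibited"/>
    </glossAlt>
  </glossBody>
</glossentry>
```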
Standardizing Access to Linguistic Assets
As indicated in a previous section, terminology and linguistic assets not only provide benefits to technical writers working with DITA. Other users (engineers and support teams) require access to terminology assets in different work scenarios and contexts. Some potential use cases where access to terminology is needed are listed below:
- Engineers can use terminologies when writing specifications based on Linguistic-Patterns and templates.
- Engineers can access terminologies when writing other technical documents.
- Technical writers can access terminologies when editing content to ensure the use of preferred or approved terms in the right context.
- Technical writers can add glossaries and acronyms at the beginning of technical documents to help readers understand their content.
- Engineering and support teams can access terminologies for searching and indexing when writing or answering problem descriptions or Engineering Change Requests.
These requirements imply that access to terminology data and glossaries be given to different users working with different content and data management tools (word processing, XML editors, content editors, tools for managing requirements, test cases, tickets, or support cases).
To accommodate these heterogeneous needs, implementing some kind of “universal access to shared linguistic resources” would add value.
The technical solution presented in this paper answers the need to give access to terminology artifacts from different tools in distributed work environments. Actors involved in different software development processes can access centralized terminology data from their preferred work tools. The possibility of accessing terminology data from different tools also leverages the organization’s existing glossaries and terminology resources and ensures that they are consistently used across departments and processes.
The proposed solution makes use of DITA and Semantic Web standards (RDF-S, OWL, and SKOS) as well as open information retrieval specifications developed to search data in remote web repositories through Web Services. Search/Retrieve via URL (SRU) is the chosen information retrieval protocol to implement searching capabilities. SRU is a lightweight, URL-based web service (its SOAP-based counterpart is known as SRW) initially developed by the US Library of Congress to search bibliographic catalogs. SRU proposes a general abstract model for distributed information retrieval that can be used for any kind of information retrieval process, not only for bibliographic data; in fact, any kind of XML-based record may be retrieved using SRU. This protocol is quite interesting for accessing terminology data because it makes it possible to simultaneously search different databases and get a single, integrated list of results.
A typical SRU interaction involves a client application (in our case, the content editing tool used by technical writers or engineers) and a server application where the terminology database is stored. The client application sends requests to the server based on a URL, for example: <http://localhost/sruSrvr/termDB_processRequest.php?version=1.2&operation=searchRetrieve&query=QUERY_CQL>
The server processes the request and returns the matching records as XML formatted data. These XML data are in turn processed by the client application, usually to display the terms, definitions, relationships, and so on, to the user. The user can then browse the terms, select and add terms to the documents, compare definitions or contextual information, or select the expanded form of the chosen term.
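A response to such a request might look like the following sketch, here assuming that matching glossary entries are returned as DITA records (the record content is invented):

```xml
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <version>1.2</version>
  <numberOfRecords>1</numberOfRecords>
  <records>
    <record>
      <recordPacking>xml</recordPacking>
      <recordData>
        <!-- The payload: a matching glossary entry -->
        <glossentry id="gloss-firmware">
          <glossterm>firmware</glossterm>
          <glossdef>Software stored in non-volatile memory
            that controls a device.</glossdef>
        </glossentry>
      </recordData>
    </record>
  </records>
</searchRetrieveResponse>
```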
Figure 1 describes the workflow between the involved tools.
The SRU protocol establishes a set of well-defined messages for exchanging queries and responses. The protocol needs to be adapted to each information retrieval need by defining a profile. Our implementation is based on a profile designed for accessing terminology assets encoded in SKOS, GlossML, and DITA glossaries. An SRU profile establishes a set of “abstract access points,” that is to say, fields or sets of fields that can be used to run queries against the terminology database or glossary. The set of fields in our profile includes: TERM, ALL FIELDS, SYNONYMS, ACRONYMS, RELATED TERMS, and RELATIONSHIPS.
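Queries against these access points are expressed in CQL (Contextual Query Language), the query language used by SRU. The following lines are hypothetical examples of what the query parameter could carry before URL encoding (the concrete index names used by a given implementation are an assumption here):

```
TERM exact "firmware"
SYNONYMS any "embedded software"
TERM exact "firmware" and SYNONYMS any "embedded"
```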
These access points are said to be abstract because they need to be mapped—in the implementation of the SRU server—to the actual fields used in the target database or glossary. This mapping of abstract access points to actual fields in the target database or glossary makes it possible to maintain the independence of the terminology database hosted in the SRU servers and the client applications that access them. Client applications can be any content editing tool that can send valid URL-based queries. The implementation is now available for the Altova Authentic XML editor, Microsoft Word®, and IBM DOORS®. The server application is built on top of an Oracle® XMLDB and a set of PHP pages in charge of processing SRU messages and interacting with the database (as shown in Figure 2).
Of course, the implementation of this solution requires some technical work:
a) client tools need to be adapted to implement the capability of sending URL requests and processing the XML responses sent back by the servers and b) servers need to be adapted to implement the capability of accepting requests and sending responses in the agreed XML format. But it is important to note that once the adaptation is done for a specific tool (server or client), it can be reused with no change to work with any other client or server application implementing the SRU profile.
The possibility of accessing corporate linguistic assets from different tools (XML editors, word processors, requirements management tools, and so on) helps ensure the consistent use of business and technical concepts in the information items created by different actors involved in software construction and maintenance. It is also an interesting choice to leverage existing glossaries and similar artifacts owned by the organization and improve their visibility and use. It also demonstrates the possibility of using DITA as a valid format for sharing terminology data. The implementation I have described here may help achieve these objectives. Its benefits are the following:
- Users (technical writers, engineers, and so on) can search and retrieve terminological data stored in shared, common repositories.
- These repositories do not need to be based on a specific encoding language, although DITA can be used as an alternative to encode glossaries.
- Terms retrieved from the shared repository can be selected and incorporated into the information item being edited (DITA documents, DOORS requirements, and so on).
- Centralized management of terminological data gives additional options like automatic updates of referenced terms in document collections.
About the Author:
Universidad Carlos III de Madrid, Spain
Dr. Ricardo Eito-Brun is an associate professor at Universidad Carlos III de Madrid, Spain, where he teaches different subjects related to publishing, knowledge organization and representation, classification, and information management. Ricardo holds master's degrees in Informatics from Universidad Carlos III de Madrid and in Documentation and Information Science from the University of Granada (Spain), and a doctoral degree from the University of Zaragoza (Spain) on the application of distributed collaboration environments and Semantic Web techniques for the description and classification of archival materials. He has been responsible for several large-scale content management and web-based publishing projects for companies and public institutions in European countries. He is the author of four books on markup languages and XML and numerous articles and conference papers in the field of information management.
1. Kara Warburton, DITA 1.2 Glossary and Terminology Specialization Feature Description. Available at: <https://www.oasis-open.org/committees/download.php/34831/GlossarySpecializationBestPractice_Final.pdf>
2. Tony Self, DITA 1.2 Feature Description: Improved Glossary and Terminology Handling. Available at: <https://www.oasis-open.org/committees/download.php/35766/glossary_and_term_management.pdf>