Find vs. Search: Maximizing Information Value with Content Applications



April 2008


Andy Feit, MarkLogic Corporation

Internet search has transformed the way we obtain information, bringing search engine technology to the forefront of the content publishing agenda. Search engines help us quickly find answers to questions using information in far-flung sources we could not access otherwise. But when it comes to enterprise search—to accessing and making use of a company’s huge store of unstructured information for improved decision making—the technology fails to deliver. In the enterprise, a long list of links that take time to peruse does not produce usable information.

To deploy a solution that locates the exact information users need and then “delivers” that information in a variety of ways to meet the personalized and customized needs of different users, organizations must go far beyond enterprise search and develop content applications. What are content applications? Think of a database. That’s where structured data is stored. Applications built atop a relational database are designed for specific users working on a specific task, an order entry system for example. In the same way, content applications are built atop a content server that stores unstructured data. Content applications are aware of the role and task of a user, just like data-centric applications, except they are based on unstructured content. This article explains what an XML content server is, the types of applications that can be built on one, and why an XML content server is very likely in your future.

Basic Internet Search Interfaces Fall Short in Enterprise Applications

Dimes are wonderful coins, but they make lousy screwdrivers. And internet search engines, which point you to page after page of links that you must open and review one at a time, can be great for tracking down information in the clutter of the internet. However, when it comes to building applications that access and use information that you have stored in the enterprise, using a search engine is like using a dime as a screwdriver. Sure, it may work, but there are better ways.

Even the smallest example reminds us of the utility of internet search. It used to be that if you couldn’t remember the name of an actor in a movie, you might call a friend or go to a library or video rental store to find it. Today, it takes only seconds to find the answer. The fact that this “service” is free (beyond your hardware and internet access costs) makes it even more appealing.

These search links work fine for finding general information like an actor’s name. But when you need specific information to perform a task, you don’t want to click through dozens or hundreds of links to find it. Internet content is typically organized by the page for easy retrieval and consumption. Enterprise content often is not page-oriented; longer documents are the norm. Even though the required information is in there, a search engine won’t get you to the right “chunk” quickly. And in enterprise search, unlike on the internet, most content is not hyperlinked, so one search result doesn’t always lead to others. Finally, an enterprise search engine must try to search across content stored in siloed systems, where complex security schemes make accessing the information you really need next to impossible.

Take the Best of Search and Add Database Capabilities to Create a Content Server

For the enterprise, you need a different model, one in which all the information is consolidated and you are able to immediately find the exact information you need. For example, imagine you are a doctor and need to get information quickly so you can treat patients more efficiently. Digging through links is not acceptable. What you need is to have an application that presents the information to you in a way you can use quickly. The application knows you are a doctor, so it presents the information clinically, and it knows you are trying to diagnose a problem, so it presents the information in a manner that speeds up the diagnostic process.

In some respects, this process is similar to a search of a relational database management system (RDBMS). When all the information exists in one place, we can construct exact queries using SQL, and applications built atop the RDBMS can take into account the role and requirements of the user. But databases don’t work for unstructured content because the data doesn’t exist in rows and columns.

What if we could combine the strengths of a database management system with the strengths of search so that it delivers targeted content in a manner that is context aware? This combination is what a content server does. It has a single repository of unstructured information that can be indexed for extremely fast access. It lets us make queries using a standards-based query language called XQuery (the high-level W3C-standard query language) that offers the same level of precision as database queries. It allows us to enrich the content by adding metadata or changing the content as needed in the repository or during publishing. Most important, an XML content server returns the actual content that is needed—not simply links to the entire file—and applications built on top of it allow the content to be used in an unlimited number of ways. This capability helps organizations better leverage their content.
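The difference between returning links and returning content can be sketched in a few lines. The example below is only an illustration under stated assumptions: Python's standard library stands in for a content server and for XQuery, and the file names, the `section` element, the `topic` attribute, and the sample text are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical "contentbase": long documents split into tagged sections.
CONTENTBASE = {
    "handbook.xml": (
        "<book><title>Handbook</title>"
        '<section topic="diagnosis"><para>Compare symptoms against the reference table.</para></section>'
        '<section topic="billing"><para>Submit forms monthly.</para></section>'
        "</book>"
    ),
    "notes.xml": (
        "<book><title>Notes</title>"
        '<section topic="scheduling"><para>Diagnosis appointments run on Tuesdays.</para></section>'
        "</book>"
    ),
}


def link_search(keyword):
    """Search-engine style: name every whole document that mentions the keyword."""
    return [name for name, xml in CONTENTBASE.items() if keyword.lower() in xml.lower()]


def chunk_search(topic):
    """Content-server style: return only the sections tagged with the topic."""
    hits = []
    for name, xml in CONTENTBASE.items():
        root = ET.fromstring(xml)
        for section in root.iter("section"):
            if section.get("topic") == topic:
                # Join the text of the section and normalize whitespace.
                text = " ".join("".join(section.itertext()).split())
                hits.append((name, text))
    return hits


if __name__ == "__main__":
    print(link_search("diagnosis"))   # whole documents to open and read
    print(chunk_search("diagnosis"))  # the actual chunk of content
```

The keyword search names both documents and leaves the reader to open each one. The tag-based query returns only the section that is actually about diagnosis.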

Store Your Unstructured Information in XML

The technologies that make this effective information retrieval possible are XML and XQuery. Storing content in XML gives organizations the flexibility to reuse and repurpose their content in many ways. Content stored in an XML format includes the metadata for that content, so systems such as an XML content server can quickly repurpose it to create new documents or deliver it within applications.

XML allows us to separate the meaning of the content from the presentation. For example, take this standard line of HTML code:

<font face="arial" size="14">David </font>

This code tells the browser to display David in the Arial font in 14-point size. Now look at these lines of XML code:

<person nickname="Goose"> David </person>

<place lat="48 51 30.09 N" lon="2 17 40.53 E"> Paris </place>

These “tags” don’t control the display. They contain searchable information. “David” is a person with the nickname Goose—not to be confused with other people named David. And “Paris” is a place at a specific latitude and longitude, not to be confused with Paris, Texas, or the man who ran off with Helen of Troy.
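As a rough sketch of how such tags become searchable, the Python below parses the two elements above and queries them by attribute. The wrapper element `doc`, the second `person` with the nickname “Dave,” and the second `place` with a `state` attribute are hypothetical additions, included only to show disambiguation. Python's standard `xml.etree.ElementTree` module stands in for an XML content server here.

```python
import xml.etree.ElementTree as ET

# The <doc> wrapper and the second person/place entries are hypothetical.
SAMPLE = """
<doc>
  <person nickname="Goose"> David </person>
  <person nickname="Dave"> David </person>
  <place lat="48 51 30.09 N" lon="2 17 40.53 E"> Paris </place>
  <place state="Texas"> Paris </place>
</doc>
"""

root = ET.fromstring(SAMPLE)


def find_people(tree, nickname):
    """Return the names of people whose nickname attribute matches exactly."""
    matches = tree.findall(f".//person[@nickname='{nickname}']")
    return [p.text.strip() for p in matches]


def places_with_coordinates(tree):
    """Return (name, lat, lon) for every place that carries both coordinates."""
    return [
        (p.text.strip(), p.get("lat"), p.get("lon"))
        for p in tree.iter("place")
        if p.get("lat") and p.get("lon")
    ]


if __name__ == "__main__":
    print(find_people(root, "Goose"))
    print(places_with_coordinates(root))
```

A plain keyword search for “David” or “Paris” would match every entry. The attribute queries pick out the one person and the one place that were meant.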

By adding meaningful tags and separating meaning from presentation, XML enables precise finding of desired content. Then, using XQuery, content can be custom assembled for a particular purpose and delivered in different formats for print, online, or mobile use.

Even if your content is not yet in XML, many systems can convert existing documents from their current formats. There is also an ever-increasing number of XML-based enrichment tools that automatically extract entities, facts, sentiment, parts of speech, summaries, and so on from existing unstructured information and apply more and better tags. These tools make it possible to automatically transform existing content into XML-tagged content. In addition, XML will become ubiquitous, in part thanks to Office 2007, which uses XML as its standard file format.

As described above, an XML content server is a special purpose database management system (DBMS) that stores XML documents and can be queried using XQuery. Using an XML content server, we can now develop knowledge applications that access any part of any content in the “contentbase,” integrating and repurposing it for fine-grained contextual search, custom publications, and content analytics.

Don’t Confuse Content Servers with Content Management Systems

The great irony of a content management system (CMS) is that it is more about the “M” than the “C.” Its focus is management, and it has its place in document control, versioning, workflows and approval systems, and library services (checking content in and out). But a CMS operates against content in “finished format” and wraps it with a layer of metadata. The components of those documents are usually stored as a single aggregated whole, which makes them much more difficult to leverage individually.

A content server and a CMS are actually complementary and can work together to ensure the success of content management initiatives. A CMS is one type of content application that can be built on an XML content server. Since XML content servers are relatively new, most of the CMS systems today are built on relational databases. However, a relational database is the wrong architecture for unstructured data, and the next generation CMS systems will likely be built on XML content servers.

Best Practice: Build Content Applications that Respond to User Needs

Content applications are task and role aware, meaning they know who the users are and what the users are trying to do. A highly targeted application of this kind results in a significant improvement in user productivity. Below are some examples.

The Oxford African American Studies Center

The Oxford African American Studies Center, billed as “the first online resource center to fully document the field of African American Studies and Africana Studies,” is a subscription-based web site. The site includes the familiar keyword search box, but the application differs substantially from a typical search engine. For example, when a user enters “Maya Angelou” in the keyword search box, the application pulls together rich, authoritative content in various formats from thousands of books, articles, and original source documents, all authored for other purposes.

However, instead of a long list of links, the application can build, on the fly, pages such as:

  • An at-a-glance summary of Maya Angelou, including picture, place and date of birth, and professions
  • Summaries of the top five articles on Maya Angelou
  • A timeline of milestones in Maya Angelou’s life correlated with other timelines, such as the American civil liberties movement
  • Recommendations of related people and resources
  • Views of the content by type of information (biographies, images, charts and tables, and so on), category (history, politics, and so forth), or other criteria

None of these pages exists in the contentbase. The application finds pieces of information, retrieves them, and composes the page dynamically. The site puts useful information into a context appropriate to the researcher—in short, it supplies information, not links.
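The idea of composing a page that exists nowhere in the contentbase can be sketched as follows. This is not how the Oxford site is implemented. It is a minimal illustration in Python in which the `fragment` element, its `subject`, `kind`, and `year` attributes, the source names, and all sample text are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical source documents, each authored for a different purpose.
SOURCES = [
    '<article source="encyclopedia">'
    '<fragment subject="Example Author" kind="summary">Poet and memoirist.</fragment>'
    '<fragment subject="Other Person" kind="summary">Historian.</fragment>'
    "</article>",
    '<book source="chronology">'
    '<fragment subject="Example Author" kind="timeline" year="1970">First memoir published.</fragment>'
    '<fragment subject="Example Author" kind="timeline" year="1960">Moved abroad.</fragment>'
    "</book>",
]


def compose_page(subject):
    """Gather every fragment about the subject and group it by kind of information."""
    page = defaultdict(list)
    for xml in SOURCES:
        root = ET.fromstring(xml)
        for frag in root.iter("fragment"):
            if frag.get("subject") == subject:
                entry = (frag.get("year") or "", frag.text, root.get("source"))
                page[frag.get("kind")].append(entry)
    # Sort each group so that timeline entries come out in year order.
    return {kind: sorted(entries) for kind, entries in page.items()}


if __name__ == "__main__":
    for kind, entries in compose_page("Example Author").items():
        print(kind.upper())
        for year, text, source in entries:
            print(f"  {year} {text} (from {source})".replace("   ", "  "))
```

The returned page has a summary section and a chronologically ordered timeline section, although no source document contains either one in that form.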


SafariU, a joint venture between O’Reilly Media and the Pearson Technology Group, lets professors “rip, mix, and burn” custom textbooks for their courses. Searching an extensive collection of books and articles, they can select just the sections and chapters that are most relevant to their course. They can then assemble these items, add their own material along with content from the web, preview a low-resolution PDF version of the new book (complete with table of contents and index), and order printed copies of the book to be delivered to the university store. This content application really starts where search leaves off, finding the exact information the professor needs and producing a relevant, up-to-the-minute textbook that can change every semester if necessary.


PathConsult from Elsevier offers tools to help pathologists perform diagnoses more quickly and accurately. The application consolidates content from images (such as x-rays and microscope slides) and text (including diseases and diagnostic procedures) drawn from standard medical texts. The system then places the material into the context of the diagnostic process specific to each patient, giving pathologists the information they require as they need it in the course of performing a truly mission-critical activity. For example, a pathologist might select two or more diagnoses to see a side-by-side comparison with “diagnostic pearls,” review cytology images and descriptions, or show side-by-side stain comparisons. At any step, additional information about any aspect of a diagnosis is just a click away.

Content-driven applications like these truly go where no search engine has gone before. They create interactive, personalized information products that incorporate the role, activity, and process of individuals. Successful content applications simplify and accelerate the process of finding exact or contextualized information and using that information in an unlimited number of ways. Search engines aren’t bad. They’re just limited. Content servers are something entirely different. They will do for unstructured information what database software did for structured data, and the impact on the enterprise will be even more profound.

Andy Feit

MarkLogic Corporation

Andy Feit brings to Mark Logic more than 20 years of high technology marketing and sales experience. Prior to joining Mark Logic, Andy served as chief marketing officer of KNOVA, where he expanded KNOVA’s presence in the customer service and support markets while extending the company’s reach across new markets. Before joining KNOVA, Andy was senior vice president of marketing at Adomo, and prior to Adomo, Andy served as senior vice president of marketing for Verity, Inc., where he was responsible for worldwide marketing and product strategy for the company’s full range of search, categorization, and business process management solutions, generating more than $140 million in annual revenues. Previously, Andy held executive and marketing positions at Quiver, Inktomi, and Infoseek. Andy holds a B.S. degree in Chemical Engineering from Tufts University.