Metadata: Classical Genius Reconstituted


December 2002

David Walske, David Walske, Inc.


Bob burst into my office with a look of astonishment on his face. “I can’t believe it,” he shouted as he waved a page he had just printed from the Web. I gave him my full attention. My friend Bob is not prone to hyperbole.

“I just opened my Web browser to Amazon and look, they’re devoting the entire home page to books about New Haven, Connecticut!” Being a descendant of a New Haven founding family, Bob has for some time been collecting books, old deeds, and pretty much anything to do with New Haven. As a hobbyist collector, he has pursued this subject for many years. But the apparent fact that his obscure hobby had suddenly gone mainstream was incomprehensible to him.

The reality, of course, is that the Amazon home page displayed in Bob’s browser had been assembled dynamically, based upon pages he had previously viewed and past purchases he had made from Amazon. While New Haven is undoubtedly a charming town, its history and heritage had not quite yet become the object of a national obsession. I broke the news to Bob as gently as possible.

To each his own
The New Haven of today is a very different place from the town described by many of the books in Bob’s collection. Their depictions reflect an agrarian society, in which the primary focus of human endeavor centered on the production of crops, livestock, and other products of farming. Blacksmiths and other craftsmen made farm tools built to the exact specifications of each individual farmer. The level of customization and quality in these products was quite high. So too was the cost, in terms of the skill, effort, and many hours of labor required to produce such customized products.

One size fits all
Inevitably, the custom manufacturing of the agrarian age was replaced by the revolutionary techniques of automated production. By the early 1900s, a worldwide industrial revolution was at full steam. In 1913, the Ford Motor Company used the first ever large-scale assembly line, with workers stationed along a moving conveyor belt assembling standardized parts, to produce the Model T automobile. With the advent of standardization and automation, the expense of high-quality manufactured goods was significantly reduced. This economy, however, came at the cost of customization. The Model T could be ordered in any color, as long as it was black.

Just what you had in mind
The agrarian model states, “You can have high-quality products built to order, but they’ll be costly.” The industrial model says, “You can have inexpensive, high-quality products, but they will not be built to your exact specifications.” Today, on the cusp of the twenty-first century, we are experiencing a new revolution of digital technology that breaks these rules. This revolution of personalization, sometimes called mass customization, says, “You can have it all: an economically produced, high-quality product that conforms to your exact needs.” The Amazon home page that individually reflects the specific interests of each user who visits the Amazon site is an example of such personalization.

Information Overload

The volume of information available to mankind long ago outstripped the native human capacity for information management. A genius, in the classical sense, is defined as one who knows all that is known by mankind. The last of the classical geniuses, Sir Francis Bacon, passed on April 9, 1626, marking the end of an era, just as global information volume began to rise exponentially, well before the advent of modern information technologies. (See Figure 1.)


Figure 1. Reprinted by permission of Joseph Busch (Taxonomy Strategies) and the American Society for Information Science and Technology.

In spite of the awesome data storage capacity of our modern digital technology, there is often a sense of something missing. Raw data storage lacks a feel for the information and the knowledge that it represents. Without a workable method of identifying and tracking the meaning and purpose of individual chunks of content stored in a database, effective content management is not possible. Data storage is simple; information retrieval is something else entirely. It is a core task for content management, but one that is not always easily performed.

The nihilistic database blues
There is an antidote to the feeling of melancholy that you, as a content manager, may be experiencing amid the mountains of data that are your wards. It’s not a tincture, or tablet, or even a transdermal patch. It’s metadata: rich, abundant metadata. Metadata is, by its simplest definition, “information about information.” But it is actually so much more. Metadata could be our salvation from an avalanche of soulless data that threatens to consume us all. Okay, so I’m being a bit melodramatic, but you get the idea.

“What does not kill me makes me stronger.”
-Friedrich Nietzsche 1844-1900

In tagged text languages, such as XML and SGML, metadata is commonly expressed through the use of attributes. Individual chunks of content are contained within XML or SGML elements. Think of these elements as envelopes. Attributes are attached to the envelopes to provide information about the data contained within. Compare attributes to the markings on the outside of a physical envelope in which you mail a letter to your favorite Aunt Suzie. These markings express something about the letter, or the data, if you will, inside the envelope. You don’t have to open the envelope to know that it contains a letter of some sort to Aunt Suzie. By reading the first line of the addressee block, we know this information about the content inside the envelope without ever actually seeing it.

What’s more, the information on the envelope of Aunt Suzie’s letter is organized in a predictable way. If for some reason the letter needs to be returned to its sender, that information can be found in the upper left corner of the envelope. The address block and the return address block contain attributes that define, in a standardized way, information pertaining to the “who and where” of the content inside the envelope. We can deduce that it is a personal letter of some sort by the lack of a business name in the address block.

Now, let’s imagine that the sender also included some photographs with the letter. The envelope might be marked, probably in the lower left corner, with words such as “Photos Enclosed. Please do not bend.” Now, we have additional information about the content inside the envelope. We know a bit more about what it is, and we know how to handle it. And of course there’s probably a postage stamp affixed to the envelope in the upper right corner. And let’s suppose that the stamp is one that commemorates Mother’s Day. We now have the basic information that any journalist looks for, “the what, where, how, who, and why.” And not once did we have to resort to opening the envelope to get a peek at its contents.

In the preceding analogy, we could see that there are different types of attributes. Some attributes express something about what the content is: a Mother’s Day greeting with photos. Others express what should be done with the content: the name and address of the person to whom the envelope should be delivered. As we continue to examine the individual attributes of Aunt Suzie’s letter, we recognize that they are bipartite in nature. Each attribute contains both a question and an answer.
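The envelope analogy maps naturally onto markup. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module. The Envelope element and its attribute names (addressee, handling, occasion) are invented here for illustration and are not taken from any real schema. Each attribute name plays the part of the question, and each attribute value plays the part of the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element and attribute names are illustrative only.
envelope_xml = """<Envelope addressee="Aunt Suzie"
          handling="Photos Enclosed. Please do not bend."
          occasion="Mother's Day">
  Dear Aunt Suzie, ...
</Envelope>"""

envelope = ET.fromstring(envelope_xml)

# Each attribute pairs a question (its name) with an answer (its value).
for question, answer in envelope.attrib.items():
    print(f"{question}: {answer}")

# The loop above never touches envelope.text, the letter itself.
```

As with the paper envelope, everything printed here comes from the outside of the wrapper; the content inside is never opened.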

The first line of the address block asks, “Who is to receive this envelope?” It also answers the question, “Aunt Suzie.” Further study reveals that the questions asked by these attributes can be categorized into distinct types. The first line of the address block asks a “fill in the blank” question. The last line of the address block asks a series of “multiple choice” questions. If Aunt Suzie lives in the United States, one of those questions will require a choice from a standardized list of fifty abbreviated state names.

Could an attribute ever ask a “true or false” question? Probably not in the case of Aunt Suzie’s letter, but in an XML database, an attribute could certainly ask a “true or false” question. Asking a “true or false” question might not, however, be the most efficient way to use attributes and should perhaps be avoided for reasons we’ll discuss a bit later in this article.
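The three question formats can be sketched as three small checks in Python. This is only an illustration under stated assumptions: the function names are invented here, and STATE_CODES is a deliberately truncated sample rather than the full list of fifty abbreviations.

```python
# Hypothetical helpers; the names and the sample value list are illustrative.
STATE_CODES = {"CT", "NY", "CA"}  # truncated sample of the fifty abbreviations


def is_fill_in_the_blank(value):
    """Any non-empty text is an acceptable answer."""
    return bool(value.strip())


def is_multiple_choice(value, choices):
    """The answer must be one of a fixed list of choices."""
    return value in choices


def is_true_or_false(value):
    """Only two answers are possible."""
    return value in {"true", "false"}


print(is_fill_in_the_blank("Aunt Suzie"))        # True
print(is_multiple_choice("CT", STATE_CODES))     # True
print(is_multiple_choice("Conn.", STATE_CODES))  # False
print(is_true_or_false("maybe"))                 # False
```

The “multiple choice” check is the strictest of the free-form options: an answer such as “Conn.” is rejected because it is not on the standardized list, which is what keeps the stored values predictable.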

We’ve learned that attributes are sets of questions and answers regarding the elements to which they are attached. We know that they can ask and answer these questions in a variety of formats. And finally, we can see that attributes are divided into two groups: those that describe what the content is and those that describe what should be done with the content. This last distinction is an important one. If large volumes of data are to be stored and then successfully retrieved, the metadata that describes what the content is must be rich and robust. You might refer to the metadata provided by this type of attribute as metonymical. Like a metaphor, this type of content-descriptive metadata contains a succinct expression that denotes a larger concept. We often use metonyms in everyday conversation. One might speak of “the Crown” in reference not merely to the headgear worn by Great Britain’s Queen Elizabeth but to the British monarchy in its entirety.

Metadata in Action

So far, we have discussed the metadata inscribed on the envelope of an imagined personal letter to a fictional Aunt Suzie. Now instead of a Happy Mother’s Day letter to Aunt Suzie, let’s imagine that we’re to create one of those annual holiday family newsletters that we all love receiving about as much as the inevitable bricks of fruitcake that show up on our doorsteps around the same time each year. But instead of a static form letter, we’ll use metadata to help us personalize the content of the letter to each recipient. Metadata might just save you from the impersonal personal holiday family form letter. You’re on your own with the fruitcake.

A schema for our scheme
The first thing we’ll do is set up a schema for our holiday letter. A schema is simply a list of the elements of content that we’ll use in our document and a description of how they’ll be arranged. Instead of specifying XML or SGML and the underlying code, we’ll express our schema more generally, in graphic form.

Figure 2 contains a graphical representation of one possible schema that could be used to represent the structure of our holiday letter. The organization is a familiar one that you might typically find in any personal correspondence.


Figure 2. A possible schema that could be used to represent the structure of our holiday letter.
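For readers who prefer something executable to a diagram, a schema of this kind can be sketched as a small Python table of which elements may contain which. This is an assumption-laden illustration, not a transcription of Figure 2: only HolidayLetter and Para are named in this article, while Salutation, Body, and Closing are hypothetical element names chosen to resemble ordinary personal correspondence.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: only HolidayLetter and Para are named in the article;
# Salutation, Body, and Closing are assumed for illustration.
SCHEMA = {
    "HolidayLetter": ["Salutation", "Body", "Closing"],  # in exactly this order
    "Body": ["Para"],  # one or more Para elements
}


def conforms(element):
    """Check an element and its descendants against the hypothetical SCHEMA."""
    children = [child.tag for child in element]
    if element.tag == "HolidayLetter":
        ok = children == SCHEMA["HolidayLetter"]
    elif element.tag == "Body":
        ok = len(children) > 0 and set(children) == set(SCHEMA["Body"])
    else:
        ok = not children  # every other element holds text only
    return ok and all(conforms(child) for child in element)


letter = ET.fromstring(
    "<HolidayLetter><Salutation>Dear friends,</Salutation>"
    "<Body><Para>News.</Para><Para>More news.</Para></Body>"
    "<Closing>Warmly</Closing></HolidayLetter>"
)
print(conforms(letter))
```

A real project would normally express these rules in a DTD or an XML Schema and let a validating parser enforce them; the table above only shows the idea of listing elements and their arrangement.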

For now, we are most interested in the Para element. This element is where the content lives, the individual paragraphs of our letter. Again think of elements as envelopes that contain an individual chunk of content, but this time the chunk is much more granular. Instead of enveloping the entire letter, the Para element is a wrapper for a single paragraph. Next, we’ll assign attributes that describe the content that it contains.

We’ll assign four attributes to the element Para: FamilyMember, BusinessAssociate, Location, and SubjectMatter. Let’s examine the sample Para element in Figure 3. Notice that the FamilyMember attribute has been set to read, “Suzie,” which tells us that the content contained in this specific Para element is of particular interest to Aunt Suzie. The BusinessAssociate attribute has no value assigned to it because this content would be of no interest to any of the business contacts or household merchants that might receive the holiday letter. The Location attribute indicates the inclusion of geographical references specific to the east coast of the US, and the SubjectMatter attribute specifies that the content has something to do with cooking. Again, we know a lot about this specific chunk of content without ever seeing the actual text.


Figure 3. The four attributes of the Para element.
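To show how these attributes could drive personalization, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The four attribute names and the FamilyMember value “Suzie” come from this article. Everything else is assumed for illustration: the Body wrapper element, the values “EastCoast,” “Cooking,” “Pat,” and “Work,” the paragraph text, and the helper function paragraphs_for.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup. The attribute names come from the article; the Body
# wrapper, most attribute values, and all paragraph text are assumed.
body_xml = """<Body>
  <Para FamilyMember="Suzie" BusinessAssociate="" Location="EastCoast" SubjectMatter="Cooking">Your clam chowder recipe was a hit again this year.</Para>
  <Para FamilyMember="" BusinessAssociate="Pat" Location="" SubjectMatter="Work">The office moved in March.</Para>
</Body>"""


def paragraphs_for(body, attribute, name):
    """Return the text of each Para whose given attribute names this recipient."""
    return [para.text for para in body.iter("Para") if para.get(attribute) == name]


body = ET.fromstring(body_xml)
print(paragraphs_for(body, "FamilyMember", "Suzie"))
```

The selection is made entirely from the attribute values. The paragraph text is only read once a Para has already been chosen for the recipient.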

Any of the attributes we’ve assigned so far could be specified in the schema as either “fill in the blank” or “multiple choice.” Which of these attributes could be set up as “true or false”? Either or both of the FamilyMember and BusinessAssociate attributes could be. But setting them up as “true or false” would diminish the value of the metadata they contain. We could only indicate groups of people (family members or business associates) instead of individuals, reducing the richness of the metadata. Now, let’s take a look at the highest-level element, HolidayLetter, in Figure 4.


Figure 4. The highest-level element, HolidayLetter.

We’ll assign an attribute called Email to the element HolidayLetter to indicate whether this letter should be sent as email instead of as a printed letter by postal mail. This attribute differs in two ways from the other attributes that we’ve specified so far. First, the attribute is specified as “true or false.” The holiday letter is either sent by email or not: Email is set to Yes or No. Second, unlike the attributes we’ve defined for the element Para, this attribute specifies what is to be done with the content instead of what it is about.

It would seem that we have a good use for a “true or false” attribute, but even here we can enrich the metadata further by changing to “multiple choice.” The “true or false” format limits us to two alternatives. What happens if we discover that some of the recipients would prefer to receive the letter as a fax? Perhaps it would be better to name this attribute “Output” and specify “Print,” “Email,” and “Fax” as the allowable choices. By making this change, we’ve not only made the metadata richer, but we’ve improved the scalability of our schema. If a new output method becomes available next year, we need only add it to the existing “multiple choice” list.
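The scalability argument can be sketched in a few lines of Python. The attribute name Output and its three allowed values come from this article; the function delivery_method and its error message are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The allowed values come from the article; the function itself is assumed.
OUTPUT_CHOICES = ("Print", "Email", "Fax")


def delivery_method(letter):
    """Read the Output attribute and reject any value outside the list."""
    output = letter.get("Output")
    if output not in OUTPUT_CHOICES:
        raise ValueError(f"Output must be one of {OUTPUT_CHOICES}, got {output!r}")
    return output


letter = ET.fromstring('<HolidayLetter Output="Fax"/>')
print(delivery_method(letter))  # Fax
```

Supporting a new output method means adding one entry to OUTPUT_CHOICES. With a “true or false” attribute, the same change would have required a second attribute or a redesign.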

Keys to Successful Content Management

In this article, we’ve shared a brief glimpse of the power of metadata and the practical use of attributes in tagged text languages, such as XML and SGML. There are many considerations in planning and implementing a content-management system. Providing for metadata that is rich, abundant, and scalable is one of the key tasks of successful content management.