
Metadata and Sub-Metadata

Are we forgetting about the messy end of the data spectrum?

This article originally appeared on the BeyeNETWORK.

Attention span is short, but data goes on forever, especially on the Web, where data in the form of content grows like weeds, virtually unformed and, at the atomic level, effectively unformattable. A lot of effort is expended in turning data that lives only on the Web into databases, plugging holes in the knowledgesphere and regularizing more and more information, but there's still a long way to go.

The Web today looks more and more like the real world in terms of data, information and knowledge, and we're running into many of the same problems in the virtual world as we've encountered in the real world. And all the technology in the world isn't going to solve the problem.

The Setup

Let's think about how this Web/Internet thing has grown over the past twenty years or so. At first, there wasn't much. You could access a fair bit of data over the Internet, but only if you had login privileges for mainframe databases that had Internet access points. There were no portals, no web spiders to feed data into indexing sites. In practice, most data was invisible unless you knew where to look.

With the opening of the Internet to commercial use in the early ’90s, and the birth of the World Wide Web, we started getting more data out in the open – and that meant it became easier to search for information without knowing in advance where it resided. Yahoo! was the most successful of the pure portal sites, and we started to see more and more portals where information could be summarized and aggregated. Amazon started out by listing virtually all books in print, along with information about them, and now is opening up the content of those books. Google and other web indexing sites made web searches more and more comprehensive. Industry and topic-specific portals are meta-sites, where you can go to read aggregated and summarized content being made available on related sites.

And so on: these meta/portals, like Slashdot and Metafilter, make it possible to leverage the amount of time spent surfing the Web, giving access to summarized/aggregated data from all over. You don't need to surf thousands of web sites to get the latest and greatest news from those thousands of web sites.

RSS feeds likewise help leverage surfing, letting you browse headlines culled from many meta/portals. So do the meta-meta/portals like Popurls, where you can browse headlines pulled from the most popular and influential of the meta/portals, like Flickr and Digg and Boing Boing.

Over time, things have changed so that now not only is most of that data out in the open, but there's such a flood of new data (in the form of blog articles and comments) that we're choking on information.

The Back Story

I originally intended this article to explore how some interesting and important content continues to be hidden as the volume of content overall continues to grow. I got the idea when I noticed something funny about an interesting article on Slashdot, “Advocating Linux/OSS to Management.” The post hit the Web on August 5, and comments were locked out about three days later, after a total of 466 comments were posted.

The body of the original post poses the question:

"I'm the Senior Developer at a fairly large agency, we're currently a 100% LAMP shop, but I've heard a reliable report through the grapevine that the management a few levels above our office wants to standardize our region on Microsoft .NET. As I'm sure most of you can appreciate, to do such a thing would be... counterproductive, and I could really do with a hand conveying this to a manager whose only real knowledge of Linux is, ‘If it's so good, why would you give it away for free?’"

Right away, it's clear that the headline, “Advocating Linux/OSS to Management,” is misleading: management had already, at some time in the past, either actively opted to choose a Linux/open source solution, or at least passively permitted that solution to be deployed, and now wants to switch to a proprietary/Microsoft-based solution. But that's the headline that showed up on Slashdot, as well as on the meta-meta sites like Popurls.

Even more perplexing (or challenging) is that many of what I thought were the best or most useful comments in that discussion were buried, having garnered low reader scores. Unless you are willing and able to read through all those hundreds of comments, chances are you'll miss out on the full value of that article.

So even if you read beyond the headline to realize the actual content, you may still miss out on some of the very real and practical suggestions and comments that, for whatever reason, didn't get a high rank. In any case, you would have to wade through an awful lot of noise before you got to the good stuff – with no easy way to differentiate what might turn out to be psychotic ramblings from rational opinion.

Comment ranks are helpful but no panacea: the people grading them often have agendas significantly different from yours, rendering their judgments worthless (to you). I can't imagine that any amount of metadata tagging of comments, the vast majority of which are the information equivalent of chaff, will allow us to distinguish the useful from the pointless.

Sure, we've got a lot of good tools for aggregating and filtering the flood of information, but we still have a way to go before we can automate the summarization and interpretation of all that data in a way that makes it useful.

And that's when I arrived at the analogy.

The Analogy

If you think about it, this interweb thing is about publishing data/information/knowledge – much like that last really big change in publishing, the printing press. Go back 500 years or so, and that new technology allowed dissemination of data/information/knowledge to much bigger populations, much more quickly. It started slowly, because there wasn't that much data to distribute: mostly religious stuff and copies of books written hundreds of years before. Then you started seeing new books published, and more data published. Books with information about books appeared, as well, in the form of bibliographical material in standard books as well as books that provided surveys of entire fields. Libraries, full of book-oriented metadata like the Dewey Decimal System, became the equivalent of meta/portals and extended the trend as more and more books became available and needed to be cataloged for easier access to all that information.

Consider too that as more and more printing presses became available at lower costs, and the ease with which words could be printed and distributed subsequently increased, we started seeing publication of periodicals: newspapers and magazines became the meta/portals of their times. Unlike in earlier times, when most people were concerned mostly with the stuff that happened in their town or village, and knowledge of great discoveries traveled slowly if at all, now people in Boston or Philadelphia could concern themselves with things that happened in London or Paris.

The growth of the periodical business resulted in the ubiquity of newspapers and magazines, and eventually newsreels and TV news shows, bringing more of the news of the world into the lives of ordinary people. The other result of that growth was the creation of a new type of worker, the journalist, who filtered, aggregated, interpreted and summarized all the events of the world.

The Problem

And that's where the problem comes in. Over time, the business of filtering, summarizing and interpreting the news – whether online or in print – has been delegated. This is partly a good thing. If you were a contemporary of Isaac Newton, you'd need to be a scientist to understand – or even have access to a copy of – his scientific writings. Most of his contemporaries would neither know nor care about what Newton was working on; now, there are legions of writers who eagerly interpret and simplify the latest theories of physics for the masses.

But what happens when those who are filtering, aggregating, summarizing and interpreting the news don't really understand their topic? During most of my career in technology I've been both a consumer of and a producer of technology writing, and, as a result, I have a pretty good sense both for the disdain most technologists have for journalists and the degree to which journalists often get it wrong.

There are so many ways to get the story wrong. Starting from the top:

  • Inaccurate headlines are often written by editors who don't really understand what the story is all about in the first place.
  • Oversimplification of complicated topics is unavoidable, especially when you've got to get the point across in 300 words – or even 3,000 words.
  • It's easy for journalists to miss the point, get confused, or misunderstand the implications.

This whole family of issues re-emerges on the Web because so much of the content is being filtered, aggregated, summarized and interpreted by intermediaries like the people who participate on the meta/portal sites and the meta-meta/portals. We may have access to an endless flood of articles about the minutest and most subtle points of technology, business or science, and still not have any good way to grasp the subtleties or the implications of those reports.

The Solution?

Remember how I said that the problem with all this data and information and knowledge is recapitulating the problem of knowledge in the real world? Popurls may be the digital equivalent of Reader's Digest or People Magazine, Slashdot may be the equivalent of Dr. Dobb's Journal, and The Business Intelligence Network may be the equivalent of CIO. All are good sources of information and knowledge when applied to the purposes for which they are intended.

The issues I raise here are mostly those of accuracy and value: as you aggregate and summarize more and more information, you tend to lose accuracy, and you also risk missing out on valuable knowledge. For most purposes, then, the solution is twofold: first, don't trust everything you read on the Web; second, dig deeper on your own when you need to find out more.

But those are the same maxims people followed 50 years ago: don't believe everything you read in the papers, and if you need more information you've got to track it down yourself. If time is of the essence, and you need to stay on top of your business, a third option is and always has been available: hire someone smart enough to do the digging for you.
