"Schemas? We don't need no stinking schemas!" My apologies to the mysterious B. Traven for that one. Traven is the author of The Treasure of the Sierra Madre, the book that led to the film that led to one of the most abused lines in the history of filmdom -- in this case, changing the subject from badges to database schemas and the speaker from a bandit to a representative of the big data community.
But is that a fair statement for Hadoop and NoSQL proponents to shout at IT managers asking to see their data management credentials? As someone who has been in the industry for 35 years, I have problems with how the term schema is used by big data vendors -- and a recent presentation at the Boulder BI Brain Trust (BBBT) brought the issue to the fore. A disconnect exists between the way IT has approached schemas for decades and the way big data folks are misrepresenting them -- to their own detriment and, ultimately, to that of users as well.
Let's start with a definition of schema, from Dictionary.com: 1. a diagram, plan or scheme; 2. an underlying organizational pattern or structure; a conceptual framework. In the relational database world, a schema includes descriptions of fields, rows, tables, indices and other elements that create an image of how data is laid out. It often has been thought of as something very rigid, but a schema is not limited to that single definition.
Take SGML and XML as examples. The two markup languages were created decades ago to let applications exchange less structured information more easily. There are no fixed fields, but there's certainly a schema -- it's called syntax. After all, what else is syntax but a schema for a language? There's a clear layout of special characters and words that helps a receiving application understand the data coming in. If the sending and receiving applications don't share the same syntax, they can't communicate.
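To make the point about syntax concrete, here is a minimal sketch in Python using the standard library's XML parser. The message layout and the read_call helper are invented for illustration; nothing here comes from a specific system. A receiver that shares the sender's syntax can pull out the fields, while a message that breaks the syntax can't be read at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical messages; the tag names are assumptions for illustration.
WELL_FORMED = "<call><id>c-1001</id><duration>240</duration></call>"
MALFORMED = "<call><id>c-1001<duration>240</call>"  # tags left unclosed


def read_call(message):
    """Return the message's fields as a dict, or None if the syntax is broken."""
    try:
        root = ET.fromstring(message)
    except ET.ParseError:
        # Sender and receiver no longer share a syntax, so nothing is usable.
        return None
    return {child.tag: child.text for child in root}


print(read_call(WELL_FORMED))
print(read_call(MALFORMED))
```

There are no fixed-width fields and no table definition anywhere in that sketch, yet the receiver still depends entirely on an agreed structure.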
Therefore, it seems odd that folks love to talk about schema-less data dumps or schema-on-read as if that somehow avoids data having a schema. All Hadoop does is let you pull data out of operational systems and put it in a place where you can begin to analyze it without affecting operations. That the data is dumped into a simplistic file system means nothing. It's not until some structure is applied that the data can become real information.
That doesn't mean Hadoop is without value. It's wonderful that there's a technology available for easily extracting and storing information from a vast variety of systems. However, this doesn't make Hadoop anything other than a modern operational data store (ODS).
Flexible schemas, yes -- no database schemas, no
Therefore, we need to stop thinking of Hadoop data sources as having no schema. We should also stop confusing people with schema-on-read silliness. During the BBBT session, another consultant mentioned flexible schema. I like that concept better.
The limitation of a relational model is that we have to decide up front which database schema will be used most, then provide extra indexing, stored procedures and other workarounds to support additional ways of looking at the information. That's great for the 80% in the 80/20 rule, but not so good for the 20% -- and even worse for the small percentage of folks truly doing data discovery.
What Hadoop, NoSQL databases and other modern big data tools allow is for each application or user to come to the raw data with a different schema. Take call center logs as an example. Someone performing a columnar analysis on time and call length has a different interpretation of the schema than someone doing a row search for a specific call. But they aren't imposing a schema-on-read; rather, they're flexibly addressing different components of the schema to maximize their individual query performance.
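The call center case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log layout, the field names and both helper functions are hypothetical, and a real Hadoop job would look quite different. What it shows is two consumers addressing different parts of the same underlying structure.

```python
import csv
import io
from statistics import mean

# Hypothetical raw call center log; the field layout is an assumption.
RAW_LOG = """\
call_id,start_time,duration_sec,agent,topic
c-1001,2014-03-01T09:15:00,240,alice,billing
c-1002,2014-03-01T09:40:00,600,bob,outage
c-1003,2014-03-01T10:05:00,120,alice,billing
"""


def rows(raw):
    """Parse the raw text into one dict per call."""
    return list(csv.DictReader(io.StringIO(raw)))


def average_duration_by_hour(raw):
    """Column-oriented view: only start_time and duration_sec matter."""
    by_hour = {}
    for record in rows(raw):
        hour = record["start_time"][11:13]  # the HH part of the timestamp
        by_hour.setdefault(hour, []).append(int(record["duration_sec"]))
    return {hour: mean(values) for hour, values in by_hour.items()}


def find_call(raw, call_id):
    """Row-oriented view: fetch one whole record by its identifier."""
    for record in rows(raw):
        if record["call_id"] == call_id:
            return record
    return None


print(average_duration_by_hour(RAW_LOG))
print(find_call(RAW_LOG, "c-1002"))
```

Neither function invents a structure at read time. Both lean on the header line that already ships with the data; they simply care about different columns of it.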
The same is true for information coming in from sensors monitoring a pipeline. Just as with XML, device manufacturers set up delimiters to let applications understand the originating schema and extract necessary data. An application analyzing dynamic flow rates through the pipeline and another focused on maintenance are going to use different views of that schema in order to extract the appropriate information. But it's wrong to say that there's no schema.
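A rough Python sketch of the pipeline case follows. The record format (pipe delimiters, key=value pairs), the sensor names and the vibration threshold are all assumptions made up for this example, not any manufacturer's actual format. The point is that both applications depend on the delimiters the device already emits.

```python
# Hypothetical sensor records; the delimiter layout is an assumption.
RAW_READINGS = [
    "PUMP-07|2014-03-01T09:00:00|flow=118.5|temp=41.2|vibration=0.8",
    "PUMP-07|2014-03-01T09:05:00|flow=121.0|temp=41.9|vibration=1.4",
    "PUMP-09|2014-03-01T09:00:00|flow=97.3|temp=39.0|vibration=0.3",
]


def parse(line):
    """Split one record on the device's delimiters into a dict."""
    sensor_id, timestamp, *pairs = line.split("|")
    record = {"sensor_id": sensor_id, "timestamp": timestamp}
    for pair in pairs:
        key, value = pair.split("=", 1)
        record[key] = float(value)
    return record


def flow_view(lines):
    """Flow analysis cares only about when and how much."""
    return [(r["timestamp"], r["flow"]) for r in map(parse, lines)]


def maintenance_view(lines, vibration_limit=1.0):
    """Maintenance cares about which device is vibrating too much."""
    return sorted({r["sensor_id"] for r in map(parse, lines)
                   if r["vibration"] > vibration_limit})


print(flow_view(RAW_READINGS))
print(maintenance_view(RAW_READINGS))
```

Both views go through the same parse step, which is where the originating schema lives; only the fields each one keeps afterward differ.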
Information pipeline starts with Hadoop
Too many people enamored with the latest technology are shouting about the end of the data warehouse, and even the end of the relational database. Get over it -- they're not going anywhere. We have mission-critical applications where time matters, where we know what the data structure is and where that structure speeds analysis and reporting. We have high-latency data that doesn't change and is used in the same way almost every time. Structure matters to making business decisions.
Think of Hadoop as the beginning of the information pipeline. It's the latest, greatest ODS. It lets you pull data from various systems without needing to have the perfect structure ready for that data. Yet much of the data will migrate down the pipeline into structured repositories offering better performance for specific analytics needs.
That's why the NoSQL movement doesn't have long-term legs for anything other than niche applications. SQL is the norm in IT and it isn't limited to relational queries. The companies making Hadoop data available for querying and analysis through SQL are doing the right thing. Only by accelerating the integration of Hadoop into the existing IT infrastructure will Hadoop providers speed their own acceptance into mainstream, mission-critical applications.
So, forget schema-less, schema-on-read and other nonsense that is of use only to theorists and niche players. Focus instead on providing ways for flexible database schemas to be integrated into the full business information pipeline.
About the author:
David A. Teich is principal consultant at Teich Communications, a technology consulting and marketing services company. Teich has more than three decades of experience in the business of technology. Email him at firstname.lastname@example.org.