Ingesting data into a data lake may give you data indigestion

Big data vendors routinely push the notion of ingesting all of your data into a data lake. But in many cases, doing so is an unnecessary step that could cause data ingestion problems.

Clickstreams. IoT. Social media. The sources of data have expanded rapidly over the past 10 years. I laugh at the term big data, because we've always had big data -- labeling this the big data era makes about as much sense as labeling one period of art, decades ago, modern art. Nevertheless, we certainly can capture more data today than ever before.

That will be the case years from now, too, as businesses continually look to gather as much data as possible to help improve decision-making. But ingesting data is only the first step. People often forget that there's a difference between data and information: Data is just a collection of individual elements. The challenge remains how to create useful information from incoming data, without letting the data ingestion process leave you suffering from data indigestion.

Understanding how to manage what's important to whom can help minimize data ingestion problems. In planning your strategy for ingesting data, don't assume you have to put everything in one place first and then fork it out to different destinations. Creating a data lake isn't the only way.

Data warehouses were a novel idea, and they had some intuitive business advantages when they began to appear in the late 1980s. A single view of all corporate information? "Great," said the CxO suite. However, the complexity of trying to extract the available data from the systems of the day was a bigger task than expected.

Data po-tay-to vs. po-tah-to

To this day, many of the early data warehouse proponents still make a comfortable living pushing the data warehouse and other centralized ways of storing data -- such as the data lake, which is really just an operational data store (ODS) by another name. Too often, the message from big data proponents is, "We need to dump all your stuff here!" -- mouthed while they claim to be moving beyond the data warehousing paradigm.

Just changing the name doesn't do that. It also minimizes the best part of the data warehouse movement: the focus on metadata.

By now, everyone likely has heard that metadata is data about data. When I type the character 9 into a pair of data fields, it might be an actual integer in one, while in the other, it might represent a letter. Metadata gives us the context to know which it is in each case. It also can help organizations with big data environments to control the indigestion and to get back to effective -- and efficient -- ways of ingesting data.
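The "character 9" example above can be sketched in a few lines of Python. The field names and metadata schema here are invented purely for illustration; the point is only that the raw value is meaningless until metadata supplies its context.

```python
# Hypothetical illustration: the same raw value "9" means different
# things depending on the metadata attached to its field.
FIELD_METADATA = {
    "engine_rpm_thousands": {"type": "integer"},  # 9 -> the number nine
    "trim_level_code":      {"type": "string"},   # 9 -> just a label
}

def interpret(field: str, raw: str):
    """Use field metadata to decide how to parse a raw value."""
    meta = FIELD_METADATA[field]
    return int(raw) if meta["type"] == "integer" else raw

print(interpret("engine_rpm_thousands", "9"))  # the integer 9
print(interpret("trim_level_code", "9"))       # the string "9"
```

Without the metadata table, both values are just the byte "9"; with it, one can be summed and averaged while the other can only be matched and counted.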

Take the example of a car transmitting data via the internet of things (IoT). When I was a teenager, I had a weird thing called a timing light in my toolbox. Along with other toys (OK, tools), it enabled me to keep my car's engine in good shape manually. Today, there are dozens of computers in a car, and no real way to fully tune the engine without using a diagnostic system. More to the point, a growing number of cars transmit operational data back to the automakers to provide information that can help them improve performance and safety.

Under the data lake paradigm, Hadoop vendors will sell you on dumping everything into a single data store, and then trying to figure out what's there. Here's a simple question: Why bother?

Every data element in its right place

At a car company, some of the data being captured from vehicles needs to go to one department to analyze engine performance. Other data might go to the folks who deal with creature comforts in the cabin, while still more goes elsewhere for safety analysis. Why does all the data need to go to a single, starting repository? Edge servers with an understanding of metadata can route different data to the appropriate departments.

Some sensors might also track data that's expected to be useful in the future but isn't currently needed. The devices can have switches to turn off that transmission, or the edge servers can drop the unneeded data rather than waste further transmission and storage costs on it.
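The routing and filtering described in the two paragraphs above can be sketched together. Everything here -- the sensor categories, department names and reading fields -- is invented for illustration; the point is that a small metadata table on the edge server, not a central repository, decides where each reading goes, and that readings nobody has registered for never leave the edge at all.

```python
# Hypothetical metadata-driven routing table on an edge server:
# sensor category -> departments that consume that category.
ROUTES = {
    "engine":  ["powertrain-analytics"],
    "cabin":   ["comfort-engineering"],
    "braking": ["safety-analysis"],
}

def route(reading: dict) -> list:
    """Return the destinations for a reading based on its category metadata."""
    return ROUTES.get(reading["category"], [])

def forward_batch(readings: list) -> dict:
    """Group a batch of readings by destination, dropping unrouted ones."""
    by_dest = {}
    for reading in readings:
        for dest in route(reading):
            by_dest.setdefault(dest, []).append(reading)
    return by_dest

batch = [
    {"category": "engine", "rpm": 3200},
    {"category": "infotainment", "volume": 7},  # no registered consumer
]
forwarded = forward_batch(batch)
# Only the engine reading is forwarded; the infotainment
# reading is dropped at the edge rather than stored "just in case".
```

A real deployment would route over a message bus rather than an in-memory dict, but the design choice is the same: the routing decision is made as early as possible, using metadata, instead of after everything has landed in one lake.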

This isn't absolute. The large volume of conversational data coming from social media is one example of what needs to be dumped into a larger data store, Hadoop or otherwise. Again, though, think of that data store as you do the old ODS used to drive manufacturing and other operational systems.

It's just a dumping ground: a place that different applications can access to mine the syntax and semantics of the social media data for relevant information. There's no need to integrate all of that data with data from other sources. Only the information identified as relevant then needs to be extracted and migrated to other places in a big data architecture.

In most cases, though, the growth of computing and networking power, the reduced cost of storage, and advances in analytical algorithms and artificial intelligence tools mean that there's no need to put all of the data you're collecting in one place. Let each piece of data go to as many places as possible where it might be used to gain information. It's up to higher-level systems to then extract only the information that is needed at each point and for each purpose within an organization.

The massive amount of data increasingly available to companies isn't a solution in and of itself. Data doesn't have its own purpose -- it's there only to be turned into useful information. As you're ingesting data, don't let it abuse you.
