Don't throw out design principles when jumping in Hadoop data lake

In a Q&A, data warehousing expert Joe Caserta explains why a new generation of developers building Hadoop clusters and other big data systems may need an introduction to some fundamental rules of ETL.

While there are often good reasons for technologies to change, useful skills are sometimes forgotten in the process. Today's Hadoop data lakes may be a case in point, according to Joe Caserta, founder and president of New York-based consulting practice Caserta Concepts. He says advances in Hadoop-style data handling are harder to achieve if data management teams forget basic means of data preparation, ones that apply whether the technology is an emerging Hadoop data lake or an established relational data warehouse. SearchDataManagement spoke with Caserta as he prepared to help lead a pair of courses in New York on Agile data warehouse design and ETL architecture and design, the latter focusing on extract, transform and load data integration processes.

Where are we at with big data? It seems like we are now out of the early innings.

Joe Caserta: What's been happening is we as an industry have been immersed in big data and emerging technologies, and now that this discovery stage is behind us and we are implementing systems, we find some of the same fundamental issues of ETL and data warehousing still remain -- that not a whole lot has changed.

The reality is that the fundamental principles of data warehousing and ETL are still as applicable as they have ever been. We shouldn't lose sight of that. In effect, we have a whole new set of people who don't know the basics of how to do data management for analytics.

A song in the old Humphrey Bogart movie, Casablanca, said "the fundamental things apply as time goes by." How does that play out in data management and analytics?

Caserta: The projects are being called data management and data analytics, but the core of these things is really still about ETL and data warehousing. You could call the technologies over the last few years nascent. The fact is the individuals who are trying to solve these problems sometimes are also nascent.

It seems like you can have a Hadoop data lake and store data, but once you want to do something with the data, you face the same kind of issues you always had.

Caserta: Right. You have to ensure the quality of the data. All of this stuff is really still applicable. There are too many people who don't have the key concepts of ETL and data warehousing down. With Hadoop, people talk about ELT instead of ETL, but that is just semantics. All that really is about is where you do your transformations. Sometimes it's just the vendors saying, "Do it on my technology instead of somebody else's." But at the end of the day, you have data, you need to prepare it and then you need to use it. That's extract and transform, whether you load it into the target database and transform it or transform it on its way. Where you transform the data -- well it's six of one, half a dozen of the other. What you have to know is how to transform it.
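
Caserta's point that ETL and ELT differ only in where the transformation runs can be illustrated with a minimal Python sketch. The record layout, the `clean` rule and the in-memory lists standing in for a source and a target are all assumptions made for illustration; nothing here comes from a specific Hadoop or ETL tool.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def clean(record: Record) -> Record:
    """Hypothetical transformation: trim names, normalize country codes."""
    return {
        "name": record["name"].strip(),
        "country": record["country"].strip().upper(),
    }


def etl(source: Iterable[Record], transform: Callable[[Record], Record]) -> List[Record]:
    """Transform each record on its way in, then load it into the target."""
    target: List[Record] = []
    for record in source:
        target.append(transform(record))
    return target


def elt(source: Iterable[Record], transform: Callable[[Record], Record]) -> List[Record]:
    """Load raw records into the target first, then transform them in place."""
    target: List[Record] = list(source)          # raw landing
    return [transform(record) for record in target]


raw = [
    {"name": " Ada ", "country": "gb "},
    {"name": "Grace", "country": " us"},
]

# Same transformation, different place: the prepared data is identical.
assert etl(raw, clean) == elt(raw, clean)
```

Either way, the rule inside `clean` is what has to be designed correctly; the choice between the two functions is an engineering decision about where the work runs.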

What you have to do is land your data as part of your process. You have to interrogate your data for cleanliness, thoroughness and completeness. You need to be able to establish confidence levels in the integrity of your data -- a quality score or integrity score. It's more important now than ever. With a data warehouse, you tried to achieve a quality level that was 100. With a data lake, you know it's not 100 -- but what is it? Should people be 50% confident that it's right? Should they be 85% confident? Getting 10 people to wrap their heads around that can be very challenging.
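
One way to picture the quality or integrity score Caserta describes is as the share of field-level checks that landed data passes. The sketch below is a hypothetical illustration in Python, not a method from the interview: the field names, the validation rules and the simple pass-rate formula are all assumptions.

```python
from typing import Any, Callable, Dict, Iterable, List


def quality_score(
    records: Iterable[Dict[str, Any]],
    required_fields: List[str],
    validators: Dict[str, Callable[[Any], bool]],
) -> float:
    """Return the percentage of field checks that pass (0 to 100).

    A field check passes when the value is present (completeness) and,
    if a validator is registered for the field, the validator accepts it.
    """
    checks = 0
    passed = 0
    for record in records:
        for field in required_fields:
            checks += 1
            value = record.get(field)
            if value is None or value == "":
                continue  # missing value: completeness check fails
            validator = validators.get(field)
            if validator is None or validator(value):
                passed += 1
    return 100.0 * passed / checks if checks else 0.0


# Hypothetical landed records and rules, for illustration only.
landed = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "", "amount": 5.0},
    {"id": 3, "email": "not-an-email", "amount": -1.0},
    {"id": 4, "email": "d@example.com", "amount": None},
]
rules = {
    "email": lambda v: "@" in v,
    "amount": lambda v: v >= 0,
}

score = quality_score(landed, ["id", "email", "amount"], rules)
print(f"Integrity score: {score:.1f}%")
```

Here 8 of the 12 field checks pass, so the score is about 66.7%. A single published number like this gives data lake users something concrete to agree on, even when it is well short of 100.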

I guess you're saying that at some point, when people get beyond just exploring the new thing, they will begin to apply more traditional management processes.

Caserta: Sure. Enterprises are starting to embrace big data, but they need to go back to standards to govern the data. There are processes that need to be in place, and they look and feel a lot like the legacy methods. Of course, there's an evolution happening. The legacy methods are not 100% applicable. But a lot of it is the same.

There's a lot going on. People need to learn how to create requirements so they're addressing a business problem, and that in turn has to be converted into a technical solution. There have been processes in place for years to address that, and, true enough, it is a constant struggle.

But for some people in big data today, it's useful to revisit fundamentals of basic data warehouse design. They work even if you're using emerging technology. They should be technology agnostic. They should be very agile methods. For example, one of the methods you can apply is model storming. That's a process where you talk about your business, and you identify all of the business's dimensions, and then start to bring modeling into the conversation.

Jack Vaughan is SearchDataManagement's news and site editor. Follow us on Twitter: @sDataManagement.
