Data lakes can give organizations more freedom in storing and analyzing data than they get from traditional data warehouses. But building a data lake architecture also presents IT teams with a raft of challenges.
TDWI data management analyst Philip Russom detailed the potential benefits and pitfalls of data lakes in a webinar this month; he also offered advice on priorities and best practices for a data lake implementation, highlighting things such as the need to tie it to real business issues and ensure that solid data governance processes are in place.
For example, Russom said data lakes -- which typically involve Hadoop and other big data platforms -- let data scientists and business users analyze data that wasn't accessible to them before: call center notes, social media posts, internet clickstream records and more. But, he added, the broader exploration enabled by data lakes has to be done "with an eye to getting real business value from this data."
To aid in that process, organizations need to be careful not to scrub away useful information in the raw data collected in a data lake, Russom said. That requires a different approach from the way that structured transaction data stored in a data warehouse is cleansed and consolidated before being made available for analysis.
"If you cleanse the data, if you get rid of anomalies, if you standardize the data so it all looks the same, you may lose some of the stuff that you're looking for," Russom said. As examples, he cited identifying customer segments for targeted marketing and detecting possible fraud in financial transactions, which might be overlooked if outliers in data sets are eliminated.
The webinar was based on a TDWI report released in March that included survey data on how companies are using data lakes and what benefits or drawbacks they're seeing. In the survey, conducted late last year, 23% of 252 respondents said their organizations were already using a data lake, while another 24% expected to have one in production in the next 12 months.
Russom outlined a list of 12 priorities for businesses implementing a data lake architecture. His tips can be condensed into these three main points:
Plan your data lake carefully, according to the specific needs of your organization. Using Hadoop is the most common way of building a data lake -- 40 of the 75 TDWI survey respondents with data lake experience said their platforms are built entirely on Hadoop. But it isn't the only way, and different approaches suit different situations, Russom said.
For example, 17 of those 75 respondents said they're using a hybrid architecture that combines Hadoop with a relational database. That's a logical combination for companies that have invested in data warehouses based on relational software, according to Russom. "I rarely find a data lake existing in a vacuum," he said.
Ultimately, a data lake should be designed to fit its planned uses and structured logically so that users can navigate the environment successfully, Russom said. Flexibility is also called for, he advised: If the initial data lake design doesn't end up being fully successful, it may need to be updated to make it more effective.
Don't expect to find lots of workers with the required skills and relevant experience. One of the biggest obstacles faced by companies looking to deploy data lakes is a lack of skills in Hadoop and other big data technologies.
To get around that issue and facilitate the process of building a data lake, Russom suggested retraining data management staffers to fit the necessary roles, instead of trying to hire new workers with data lake skills. Temporarily bringing in technical consultants can be an easier, more cost-effective way of integrating prior data lake experience, he added; consultants can also help train internal employees during the implementation process.
Avoid data "dumping" so your data lake doesn't get clogged up with useless data. Russom said the temptation with a data lake is to simply throw everything you can into it without any plan for organizing or structuring the data, which can make it messy and difficult for users to navigate. His prescription for avoiding that lies in effective data lake governance.
"A data lake will become a swamp if you allow anybody to dump any data into it any time, so there needs to be some controls," he cautioned. While fully cleansing and conforming data isn't always wise, the controls should include a governed process for introducing new data, to make sure it gets "vetted at least slightly" before being let into a data lake, Russom said.