Data lakes evoke images of vast pools of raw data, available for unfettered exploration and analysis. But the reality isn't so free and easy: To avoid information chaos, all that data needs to be cataloged and governed -- and doing so is still a developing and very often do-it-yourself process.
As a result, data lake management and governance frameworks are in the formative stage in many organizations, with IT and data management teams scrambling to piece together combinations of governance tools and mechanisms to help keep their big data environments in order.
That's the case at medical insurer Health Care Service Corp. (HCSC), which deployed a Hadoop data lake in April 2016 to give its data scientists and other analysts self-service capabilities for analyzing data from source systems across the Chicago-based company's operations. But self-service doesn't mean a free-for-all in the Hadoop cluster, explained Susan Swanson, senior manager of data modeling and architecture at HCSC. "We need something that's governed and controlled," she said, so users don't end up working with inconsistent data -- for example, different numbers on current membership in health plans.
The standardization effort includes a level of data integration, cleansing and preparation work, plus data quality rules, a catalog of available data and a metadata repository for tracking data lineage and populating a common data dictionary. "Data management hasn't gone away -- it's definitely here for big data and the data lake," Swanson said. The big difference, she added, is that the enabling governance technology "is so new and so emerging in the data lake environment."
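To make the catalog and lineage idea concrete, here is a minimal sketch of a catalog that records a data dictionary description for each data set and the sources it was derived from. It illustrates the concept only and is not HCSC's implementation; the `Catalog` class, its method names and the data set names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record: a dictionary description plus direct upstream sources."""
    name: str
    description: str
    sources: list = field(default_factory=list)


class Catalog:
    """Hypothetical in-memory catalog; a real one would sit on a metadata store."""

    def __init__(self):
        self._entries = {}

    def register(self, name, description, sources=()):
        # Refuse entries whose sources are unknown, so lineage never dangles.
        missing = [s for s in sources if s not in self._entries]
        if missing:
            raise KeyError(f"unregistered sources: {missing}")
        self._entries[name] = DatasetEntry(name, description, list(sources))

    def lineage(self, name):
        """Return every upstream data set of `name`, nearest first, no duplicates."""
        seen = []
        queue = list(self._entries[name].sources)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self._entries[current].sources)
        return seen

    def dictionary(self):
        """Return the common data dictionary: data set name -> description."""
        return {n: e.description for n, e in sorted(self._entries.items())}


catalog = Catalog()
catalog.register("claims_raw", "Claims as landed from the source system")
catalog.register("members_raw", "Membership records as landed")
catalog.register("claims_clean", "Deduplicated claims", sources=["claims_raw"])
catalog.register(
    "membership_summary",
    "Current members per health plan",
    sources=["claims_clean", "members_raw"],
)
print(catalog.lineage("membership_summary"))
```

Requiring every source to be registered before a derived data set is accepted is one simple way to keep lineage complete; production tools expose far richer models than this.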
That means the data architecture and management team at HCSC has to spend more time working on technical initiatives related to data lake governance than it does with the company's existing data warehouse, where it can focus more on resolving data quality issues and other governance tasks. "We do a lot of proof-of-concept projects," Swanson said. "It's kind of a pilot approach; we're going to figure out how to do something and then bring in tools to automate it." As an example, she noted that HCSC initially "cobbled together" a metadata repository that combined HCatalog, an open source metadata management tool, with the HBase database and Hive query engine.
Now, the insurer is installing Apache Atlas, a broader data governance and metadata framework for Hadoop that was first released in 2015 and is still designated as an "incubating" technology by The Apache Software Foundation. "It's not all 100% there -- you still have to do a lot of workarounds," Swanson said. "But I like a lot of the concepts."
Diving into the data lake
Data lake adoption is reaching sizable levels, according to recent surveys. In one conducted in late 2016 by IT research and education outfit TDWI, 23% of 252 respondents said their organizations were running production applications on data lake platforms, while 24% said they expected to start doing so in the next 12 months. A Forrester Research survey, also conducted in 2016, found an even higher deployment rate -- 48% of 543 respondents said they had either implemented or were implementing a Hadoop-based data lake, while 31% said they planned to build one within 12 months.
In the TDWI survey, however, a lack of data governance was cited as the biggest roadblock to data lake deployments, with 41% of respondents listing it as a possible obstacle (see "Barriers to Entry"). "Part of that is it's just unknown territory if you haven't done this before," TDWI analyst Philip Russom said in a webinar that covered the survey results and a report he wrote about them.
Robin Gordon, CoreLogic Inc.'s chief data officer, said she's looking to adopt a "factory assembly line model" for automating data lake management and governance processes at the Irvine, Calif., company, which provides information on real estate, mortgages and consumer credit to clients that include lenders, insurers and government agencies. Among other things, the plan would automate tracking of data lineage and usage rights so CoreLogic can "put guardrails in place and make sure we don't get too crazy on where our data goes," Gordon said. But for now, managing and governing data in the company's Hadoop-based big data environment "is more manual than we'd like it to be," she added.
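As a rough illustration of what an automated usage-rights guardrail could look like, the sketch below assumes that a derived data set may only be used in ways that every one of its sources allows. That inheritance rule, the function names and the sample data sets are assumptions made for this example, not a description of CoreLogic's planned system.

```python
def permitted_uses(dataset, sources, rights):
    """Uses allowed for `dataset`: the intersection of rights across its root sources.

    `sources` maps a derived data set to its direct parents (assumed acyclic);
    `rights` maps each root data set to the set of uses it is cleared for.
    """
    parents = sources.get(dataset, [])
    if not parents:
        return set(rights[dataset])
    allowed = None
    for parent in parents:
        parent_uses = permitted_uses(parent, sources, rights)
        allowed = parent_uses if allowed is None else allowed & parent_uses
    return allowed


def check_release(dataset, use, sources, rights):
    """Raise PermissionError if `dataset` is not cleared for `use`."""
    if use not in permitted_uses(dataset, sources, rights):
        raise PermissionError(f"{dataset} is not cleared for {use}")


# Hypothetical lineage and rights tables.
sources = {"risk_report": ["property_records", "credit_files"]}
rights = {
    "property_records": {"internal", "lender", "insurer"},
    "credit_files": {"internal", "lender"},
}

check_release("risk_report", "lender", sources, rights)  # passes silently
```

Because the check walks lineage back to the root sources, it only works if lineage is captured for every derived data set, which is why the two kinds of tracking tend to be automated together.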
More data, bigger challenges
The situation is similar at BT Group PLC, a London-based communications and TV services provider. BT deployed a Hadoop cluster in 2013 and is now expanding it into an enterprise data lake designed to handle feeds from up to 2,500 applications and support self-service analytics by thousands of data analysts and business users. Data governance will be a bigger challenge going forward, both in helping users find relevant data in the data lake and monitoring the data that's going into the system, said Jason Perkins, BT's head of business insight and analytics architecture.
BT has taken several steps to start getting its arms around those challenges. The company augmented its existing data governance program by setting up an analytics review board that will examine requests to create data sandboxes or individual data views in the data lake; Perkins is a member of the board, along with data management, governance and IT representatives. In addition, he and his team created a "cookbook" document that details the process for adding data to the lake. An internal Hadoop user group also was formed to discuss plans for the data lake and share ideas on analytics and data management best practices.
Furthermore, the data lake team is building a homegrown metadata repository called Midas, which incorporates commercial software such as Oracle Data Integrator and Cloudera Navigator, a Hadoop data governance tool that competes with Atlas. Perkins said BT is also evaluating emerging data lake management and governance platforms from outside vendors that might provide additional functionality on top of Midas.
"I don't see BT being a metadata software company," Perkins said. "We're just looking to fill a gap we see in the industry today." And the homegrown system will at least let BT "do some data governance, making sure that nobody can pollute the data lake with data we don't want in there," he noted. "The referee has to blow his whistle every now and then."