Business Information

Technology insights for the data-driven enterprise

Hadoop data governance takes hold in companies as data gets 'bigger'

LinkedIn, Cleveland Clinic and fitness company Beachbody are examples of organizations that have increased their data governance efforts in connection with big data applications.

When LinkedIn Corp. was a smaller company, it didn't matter so much internally how data captured from its social networking website for analysis was formatted and structured.

"You could really log anything and access it later," said Yael Garten, LinkedIn's director of data science. That let data scientists work quickly on analytics applications, she added, without having to worry about any data inconsistencies that might result.

But things changed as the company and the amount of data it generated grew rapidly. Now, people see the wisdom of better governing the data in LinkedIn's Hadoop environment so it's standardized throughout the analytics cycle, Garten explained. Otherwise, "it becomes a nightmare when you have hundreds of teams emitting data and hundreds of teams consuming data," she noted. That's particularly true, she said, if data is stored schema-free -- a lesson that LinkedIn learned early on.

Tools of the data governance trade

LinkedIn's Hadoop data governance process includes an internally developed system called the Unified Metrics Platform, which facilitates development of consistent metrics data for reporting uses. Garten also pointed to a data model review committee that evaluates whether models will successfully produce the specified data. And she cited another homegrown technology called Dali that provides a common API into Hadoop data sets for both data producers and users at the Mountain View, Calif., company, now owned by Microsoft.

Cleveland Clinic has made data governance a bigger priority in connection with a big data deployment as well. Eric Hixson, its senior program administrator for business intelligence, said the Cleveland-based health system created a formal data governance program last year after expanding from a conventional data warehouse architecture to one that also includes Hadoop, advanced analytics software, self-service BI tools and other technologies.

The new architecture, modeled after a logical data warehouse concept outlined by Gartner, was accompanied by a change in Cleveland Clinic's internal culture to make the health system more data-driven and position it to use analytics as a competitive differentiator, Hixson said during a presentation at the 2017 TDWI Leadership Summit in Las Vegas. To support that premise, the data governance initiative is aimed at upgrading risk management capabilities and improving data quality and usability, he added.

All pumped up for data governance

The deployment of a cloud-based data lake last December also prompted new Hadoop data governance processes at Beachbody LLC, a maker of fitness and nutrition products based in Santa Monica, Calif.

The big data system runs in the Amazon Web Services cloud and includes Hive and the Spark processing engine in addition to Hadoop, said Eric Anderson, Beachbody's executive director of data. It gives the company's data scientists and analysts self-service access to more types of data than they could get from its existing Oracle data warehouse, including website activity data, workout-video streaming logs and call-center records. They can also access more sensitive data than before, and at a more granular level. "Those are all governance challenges for us," Anderson said.

Data governance and usage policies have been documented for users of the data lake platform, he noted. Anderson's team has also created a data inventory that lists what's available in the system, along with a data dictionary and another document with data lineage information. That's all posted on a web portal to make the system "more transparent" to the users, Anderson explained. He added that the documentation "is more of an intermediate step than we maybe would have done before" in the data warehouse environment since there's less data to deal with there.

More and more organizations may well find themselves taking similar intermediate steps on big data and Hadoop data governance in the years ahead. William McKnight, president of McKnight Consulting Group, compared data to an abundant natural resource in a keynote speech at the Enterprise Data World 2017 conference in Atlanta. "We're not going to run out of data, but we might get overwhelmed by it," McKnight said, pointing to the growing importance of effective data management processes.
