When LinkedIn Corp. was a smaller company, it didn't matter so much internally how data captured from its social networking website for analysis was formatted and structured.
"You could really log anything and access it later," said Yael Garten, LinkedIn's director of data science. That let data scientists work quickly on analytics applications, she added, without having to worry about any data inconsistencies that might result.
But things changed, as the company and the amount of data it generated grew rapidly. Now, people see the wisdom of better governing the data in LinkedIn's Hadoop environment so it's standardized throughout the analytics cycle, Garten explained. Otherwise, "it becomes a nightmare when you have hundreds of teams emitting data and hundreds of teams consuming data," she noted. That's particularly true, she said, if data is stored schema-free -- a lesson that LinkedIn learned early on.
Tools of the data governance trade
LinkedIn's Hadoop data governance process includes an internally developed system called the Unified Metrics Platform, which facilitates development of consistent metrics data for reporting uses. Garten also pointed to a data model review committee that evaluates whether models will successfully produce the specified data. And she cited another homegrown technology called Dali that provides a common API into Hadoop data sets for both data producers and users at the Mountain View, Calif., company, now owned by Microsoft.
Cleveland Clinic has made data governance a bigger priority in connection with a big data deployment as well. Eric Hixson, its senior program administrator for business intelligence, said the Cleveland-based health system created a formal data governance program last year after expanding from a conventional data warehouse architecture to one that also includes Hadoop, advanced analytics software, self-service BI tools and other technologies.
The new architecture, modeled after a logical data warehouse concept outlined by Gartner, was accompanied by a change in Cleveland Clinic's internal culture to make the health system more data-driven and position it to use analytics as a competitive differentiator, Hixson said during a presentation at the 2017 TDWI Leadership Summit in Las Vegas. To support that premise, the data governance initiative is aimed at upgrading risk management capabilities and improving data quality and usability, he added.
All pumped up for data governance
The deployment of a cloud-based data lake last December also prompted new Hadoop data governance processes at Beachbody LLC, a maker of fitness and nutrition products based in Santa Monica, Calif.
The big data system runs in the Amazon Web Services cloud and includes Hive and the Spark processing engine in addition to Hadoop, said Eric Anderson, Beachbody's executive director of data. It gives the company's data scientists and analysts self-service access to more types of data than they could get from its existing Oracle data warehouse, including website activity data, workout-video streaming logs and call-center records. They can also access more sensitive data than before, and at a more granular level. "Those are all governance challenges for us," Anderson said.
Data governance and usage policies have been documented for users of the data lake platform, he noted. Anderson's team has also created a data inventory that lists what's available in the system, along with a data dictionary and another document with data lineage information. That's all posted on a web portal to make the system "more transparent" to the users, Anderson explained. He added that the documentation "is more of an intermediate step than we maybe would have done before" in the data warehouse environment since there's less data to deal with there.
More and more organizations may well find themselves taking similar intermediate steps on big data and Hadoop data governance in the years ahead. William McKnight, president of McKnight Consulting Group, compared data to an abundant natural resource in a keynote speech at the Enterprise Data World 2017 conference in Atlanta. "We're not going to run out of data, but we might get overwhelmed by it," McKnight said, pointing to the growing importance of effective data management processes.