Data governance process gets fine-tuned, as big data makes its mark

Big data analytics and digital transformation challenge the conventional data governance process, with many questions to answer in organizations. But a governed data lake shows how it can be done.

As data-driven business models, digital transformations, big data analytics and the like continue to rise, they challenge the conventional data governance process.

They also provide an opportunity to place data governance at the center of important business changes, according to participants in last week's Enterprise Data Governance Online (EDGO) 2017 webinar.

Among the most challenging new developments is the data lake, which, in its most basic form, eschews upfront curation and categorization of data. Curation, which includes cleansing data and assuring its consistency, is among the hallmarks of the data governance process.

Effective data governance can be applied to a Hadoop data lake, according to Shannon Fuller, director of data governance at Carolinas HealthCare System, based in Charlotte, N.C. The data-lake path was chosen for an innovative big data project, he said, because it could encourage more rapid application development and create a common repository, while protecting patients' information and the health system's intellectual property.

"We decided this would not be another data warehouse," Fuller said. "It would be stand-alone assets available to the whole organization." He added that data can be made available for analysis more quickly by structuring and managing it in smaller chunks. For example, physician payment records are set up as a separate element in the data lake.

One road to reports, another to sandbox

Fuller said his organization is taking a twofold path, carefully preparing sets of curated data for both business users and data scientists.

Driving the project is Carolinas HealthCare's push to look at a patient's overall treatment plan, taking disparate data into account and making decisions on compensation models. Fuller described his operation as an IBM InfoSphere shop, but said the pilot data lake was accomplished using Microsoft's HDInsight and Azure Data Lake Store cloud services.

Tresata software is used to catalog some of the source data, according to Fuller. Once treated and cleansed, the data is pushed back into the Azure Data Lake Store to be further analyzed, or to feed reports and executive dashboards. Separately, the curated data is made available in an analytics sandbox to data scientists, who can use HDInsight, Microsoft's R Server software and the Spark processing engine to build advanced analytical models based on the data.

"You have to ask, 'Do you want to govern the data going into the environment, or do you want to do it when it is in there?' We decided to bring in the raw data, land it in an operational layer and then curate it," Fuller said. The operational layer comprises raw data that is available to database administrators, system administrators and data integration developers creating extract, transform and load (ETL) jobs.
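The land-then-curate pattern Fuller describes can be illustrated with a minimal Python sketch. It is not Carolinas HealthCare's implementation: the `DataLake` class, its in-memory zones, the cleansing rules and the sample field names are all hypothetical stand-ins, and a real deployment would sit on services such as Azure Data Lake Store rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy in-memory stand-in for a two-zone lake (hypothetical, for illustration)."""
    operational: dict = field(default_factory=dict)  # raw data, exactly as landed
    curated: dict = field(default_factory=dict)      # cleansed, governed copies

    def land(self, name, records):
        # Land raw records untouched: no curation on the way in.
        self.operational[name] = list(records)

    def curate(self, name, key):
        # Assumed cleansing rules: trim whitespace in string fields, drop
        # records with a missing or empty key, and de-duplicate on the key
        # (first occurrence wins). The raw copy is left unchanged.
        seen = {}
        for rec in self.operational[name]:
            clean = {k: v.strip() if isinstance(v, str) else v
                     for k, v in rec.items()}
            if clean.get(key) in (None, ""):
                continue
            seen.setdefault(clean[key], clean)
        self.curated[name] = list(seen.values())
        return self.curated[name]


if __name__ == "__main__":
    lake = DataLake()
    # Hypothetical sample data set, echoing the article's physician payment example.
    lake.land("physician_payments", [
        {"id": " p1 ", "amount": 100},
        {"id": "p1", "amount": 100},
        {"id": "", "amount": 5},
    ])
    print(lake.curate("physician_payments", key="id"))
```

Keeping the raw copy intact in the operational layer means curation rules can be revised and rerun later without returning to the source systems, which is one common reason for governing data after it lands rather than before.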

Fuller said data scientists and business analysts alike benefit from the availability of curated and more governed versions of the data. "Understanding the context of data is very important," he said, noting that the health system wants to make sure different users work off of the same base of information so they get consistent results and correctly see connections between different data sets.

Data stewards should pursue context

Context can be a challenge for data stewards, too. They are the individuals charged with implementing a data governance process. If they can understand the context of innovations, data stewards can place data governance at the center of innovation in organizations, according to Fawad Butt, who also shared his views with EDGO 2017 conference attendees.

"Data stewards are the unit of execution for data governance. A lot of data governance programs spend a lot of time building their strategy and putting rules in place, but these are the folks that facilitate the enforcement of standards and policies," said Butt, chief data governance officer at Kaiser Permanente, a healthcare provider and insurer based in Oakland, Calif.

To help grease the data governance and stewardship skids, Butt advised his audience to "look for places where there is opportunity for transformation, or where there are already transformation alternatives underway," such as big data initiatives, business system updates and digitization strategies. "You're already in the process of disrupting how things are done, anyway," he said.

He also advised practitioners to walk around and learn the business they are in. As companies look to transform the way they work, that means keeping an eye on the business side of data.

"In order to discover the context you need to understand the environment you're in," Butt said. "When [data stewards] are in the IT world, sometimes there is a focus on aligning stewardship with data sets that IT manages on a day-in, day-out basis. But perhaps there is an opportunity to also align with the major business units. There's a lot of traction that can be gained in doing that."
