Roger du Mars, Contributor
Published: 08 Aug 2012
"Big data" holds out the alluring promise of competitive advantages to companies that can use it to unlock secrets about customers, website usage and other key elements of their business operations. But some caution should prevail: Without a proper data governance process, the rush to spearhead big data projects can unleash a mess of trouble, including misleading data and unexpected costs.
Data governance's role in keeping big data houses in order is just starting to emerge from the shadows, though. Big data, which typically involves large amounts of unstructured information, is a very recent phenomenon that has found its way into many organizations under the IT department's radar. As a result, governance of big data environments is at an incipient stage, and there are few widespread prescriptions for how to do it effectively, according to data management analysts.
"Big data is such a new area that nobody has developed governance procedures and policies," said Boris Evelson, an analyst at Forrester Research Inc. in Cambridge, Mass. "There are more questions than answers."
One fundamental problem is that pools of big data are oriented more to data exploration and discovery than they are to conventional business intelligence reporting and analysis, Evelson added. That, he said, creates a vicious cycle: "The data can't be governed until it is modeled, but it can't be modeled until it is explored [by data analysts]."
Data governance programs provide a framework for setting data-usage policies and implementing controls designed to ensure that information remains accurate, consistent and accessible. Clearly, a significant challenge in the process of governing big data is categorizing, modeling and mapping the data as it's captured and stored, particularly because of the unstructured nature of much of the information.
"To get meaningful business information from big data, all sorts of things need to be done, like semantic analysis of the data, which is then rendered into conceptual models or ontologies," said Malcolm Chisholm, president of data management consultancy AskGet Inc. in Holmdel, N.J. "And all that involves a heap of governance stuff."
Looking for clues on big data
The difficulty is that everything about the data governance process for big data is so new. "There is a great deal of immaturity when talking about big data, and the majority of data managers really don't have a clue going into this," Chisholm said.
Big data, which can also include large quantities of structured transaction data, has idiosyncratic features. It's commonly defined in accordance with the three V's: volume, variety and velocity. Forrester adds variability to its definition, while rival consulting company Gartner Inc. tacks on complexity.
In addition, the data often comes from external sources, and its accuracy can't always be easily validated; also, the meaning and context of text data isn't necessarily self-evident. And in many cases, it's stored in Hadoop file systems or NoSQL databases instead of conventional data warehouses. For many organizations, big data involves a collective learning curve for all concerned: IT managers, programmers, data architects, data modelers and data governance professionals.
Doing too much a danger
One of the biggest pitfalls in coping with, and trying to govern, the flood of big data is to lose sight of business priorities, said Rick Sherman, founder of Athena IT Solutions, a consultancy in Stow, Mass.
For example, much of the unstructured data being captured by organizations comes from social media, and typically only a small portion of that information is of significant value, according to Sherman. "Trying to manage or control everything in unstructured data would be a big mistake," he said, warning that companies could end up wasting time and resources on unimportant data.
Danette McGilvray, president of Granite Falls Consulting Inc. in Newark, Calif., also said that big data can be a big time-sink for data management and governance teams if it isn't handled intelligently and sensibly. "The only way we can figure out if the data is worth managing is if we know what the business need is," McGilvray said. "When it comes to big data, we still need to be reminded of that."
Gwen Thomas, founder and president of The Data Governance Institute LLC, a consulting and training company in Orlando, Fla., recommends that judgments about the quality of incoming data should be one of the top priorities for data governance managers looking to get their arms around big data. Proactive data quality checks can save a lot of time and grief down the road, she said.
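The proactive quality checks Thomas describes can be as simple as screening incoming records against a handful of rules before they are loaded for analysis, quarantining anything suspect for review. A minimal sketch, in which the field names and rules are hypothetical illustrations rather than anything prescribed in the article:

```python
def check_record(record):
    """Return a list of quality issues found in a single record."""
    issues = []
    # Completeness: key identifier must be present and non-empty.
    if not record.get("user_id"):
        issues.append("missing user_id")
    # Type conformance: timestamps expected as integer epoch values.
    if "timestamp" in record and not isinstance(record["timestamp"], int):
        issues.append("timestamp is not an integer epoch value")
    # Sanity bound on free-text size.
    if len(record.get("text", "")) > 10_000:
        issues.append("text field exceeds expected length")
    return issues

def triage(records):
    """Split incoming records into clean and quarantined sets,
    keeping the detected problems alongside each quarantined record."""
    clean, quarantined = [], []
    for rec in records:
        problems = check_record(rec)
        if problems:
            quarantined.append((rec, problems))
        else:
            clean.append(rec)
    return clean, quarantined
```

Running checks like these at ingestion time, rather than after analysts have already drawn conclusions, is what saves the "time and grief down the road" Thomas refers to.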
Proper alignment avoids disjointed data
Frequently underrated, Thomas added, is the importance of mapping the new data to the reference data that organizations use to categorize information. Aligning big data with existing reference data is "a huge detail," she said. "In fact, if this is not done right, the information that results from the processing of big data may be misleading, inaccurate or incomplete."
To help ensure that the data is mapped properly, the task should be assigned to a senior data architect instead of being left to a less experienced data modeler or someone outside of IT, Thomas advised.
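In practice, the alignment Thomas describes often means normalizing raw incoming values against an existing reference table, and flagging anything that fails to map rather than silently discarding it, since silent drops are exactly how results become misleading or incomplete. A minimal sketch, with a hypothetical country-code reference table standing in for an organization's real reference data:

```python
# Hypothetical reference data: raw variants mapped to canonical codes.
REFERENCE_COUNTRIES = {
    "us": "US", "usa": "US", "united states": "US",
    "uk": "GB", "united kingdom": "GB",
}

def map_to_reference(values, reference):
    """Map raw values to canonical reference codes; collect unmapped
    values for stewardship review instead of dropping them."""
    mapped, unmapped = [], []
    for raw in values:
        key = raw.strip().lower()
        if key in reference:
            mapped.append(reference[key])
        else:
            unmapped.append(raw)
    return mapped, unmapped
```

The unmapped list gives a data steward or senior architect a concrete work queue: each item is either a new variant to add to the reference table or genuinely bad data to investigate.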
Chisholm said data governance managers should also make it a priority to have productive conversations about the applicable data models with the programmers and business users who often initiate big data installations. Such discussions, though, should begin with a firm appreciation of Hadoop and NoSQL technologies and how they differ from relational databases -- and an understanding of the need for a unified approach to managing and governing big data.
What companies should avoid, Chisholm said, is letting programmers and users go their own way and bring silo-driven perspectives to the process of setting up big data systems and doing the required data modeling and mapping work. That could saddle them with big remediation bills, inadequate installations that don't yield the expected business benefits, and wasted investments in unnecessary systems.
Roger du Mars is a freelance writer based in Redmond, Wash. He has written for publications such as Time, USA Today and The Boston Globe, and he previously was the Seoul, South Korea, bureau chief of Asiaweek and the South China Morning Post.