For Eric Williams, the recipe for capitalizing on “big data” – and avoiding the data management problems it can lead to – reads like this: Start small, quickly demonstrate business value and keep in close contact with the end users who are running analytical queries against all that information.
Williams is executive vice president and chief information officer at Catalina Marketing Corp., a St. Petersburg, Fla.-based company that uses information gleaned from retailer loyalty cards to both track and predict the shopping habits of individual consumers in countries around the world. Thanks to a combination of data warehouse appliances and predictive analytics software, Catalina has been managing, and making sense of, enormous data sets since long before the phrase big data entered the IT vernacular.
On a typical day, the company receives up to 525 million pieces of data from U.S.-based retailers alone. In its systems, Catalina stores about 800 billion rows of customer data comprising the purchasing history of 200 million Americans over the past three years.
Williams’ advice to organizations that are launching big-data management and analytics strategies is straightforward: Resist the temptation to gather every available piece of information and simply dump it into a data warehouse for business users or analytics professionals to contend with. Instead, while loading a data warehouse or other databases with large volumes of information, begin the analytics process on a subset of key business data, looking for meaningful patterns and trends. That approach demonstrates the value of big data early and builds experience in overcoming its challenges.
“Take a sampling of your information from a limited time period or a limited set of products and get a person on board – you probably have one already – who can help with some of the analytics,” Williams said. “It doesn’t necessarily require a Ph.D. person to be able to do this. Much of it is just providing insights to somebody that is already making business decisions.”
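The sampling approach Williams describes can be sketched in a few lines. This is an illustrative example only, not Catalina's actual system: the record fields (`product`, `day`, `qty`) and the `sample_subset` helper are hypothetical, and the data is made up.

```python
import random
from collections import Counter
from datetime import date

# Hypothetical transaction records; field names are illustrative only.
transactions = [
    {"product": p, "day": date(2011, 7, d), "qty": q}
    for p, d, q in [
        ("milk", 1, 2), ("bread", 1, 1), ("milk", 2, 1),
        ("eggs", 3, 6), ("milk", 3, 2), ("bread", 4, 2),
    ]
]

def sample_subset(records, products, start, end, rate=1.0, seed=42):
    """Limit records to a product set and time window, then optionally down-sample."""
    rng = random.Random(seed)
    window = [r for r in records
              if r["product"] in products and start <= r["day"] <= end]
    return [r for r in window if rng.random() < rate]

# Start small: a limited product set over a limited time period.
subset = sample_subset(transactions, {"milk", "bread"},
                       date(2011, 7, 1), date(2011, 7, 3))

# A first, simple analytic pass over the subset.
units_by_product = Counter()
for r in subset:
    units_by_product[r["product"]] += r["qty"]
print(units_by_product)
```

Even a trivial aggregate like this, run against a bounded slice of the data, lets an analyst validate the pipeline and surface patterns before committing to a full-scale load.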
Big-data management has become one of the most talked-about trends in the IT industry, as organizations look to cope with the challenges of storing large data sets and mining them for nuggets of information that potentially can provide significant competitive advantages. Complicating matters is the fact that a big-data installation might include both structured transactional data from in-house systems and unstructured information from a variety of sources, including system logs, call detail records and social media sites such as Facebook and Twitter.
Distributed computing’s role in big-data management
For example, clickstream data lets companies track what people are doing on the Web, from PCs as well as mobile devices. That produces huge amounts of data, said Tony Iams, a vice president and senior analyst at Ideas International, a Rye Brook, N.Y.-based IT research firm. The benefit, Iams said, is that organizations can use that data to “create potentially a much more accurate picture of user behavior than ever before.” But the data needs to be properly structured and managed to make that possible.
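The structuring step Iams alludes to might look like the following sketch, which parses raw clickstream log lines into records suitable for analysis. The log format, regular expression and field names here are assumptions for illustration, not a description of any particular company's data.

```python
import re
from urllib.parse import urlparse

# Hypothetical combined-log-style lines, one per page view; the format is illustrative.
RAW_LOGS = [
    '198.51.100.7 - - [15/Aug/2011:10:01:02] "GET /products/coupons?id=42 HTTP/1.1" 200 "Mozilla/5.0 (iPhone)"',
    '203.0.113.9 - - [15/Aug/2011:10:01:05] "GET /home HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 6.1)"',
]

LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<url>\S+) [^"]+" '
    r'(?P<status>\d+) "(?P<agent>[^"]*)"'
)

def structure(line):
    """Parse one raw log line into a structured record, or None if malformed."""
    m = LINE_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["path"] = urlparse(rec["url"]).path          # strip query strings
    rec["mobile"] = "iPhone" in rec["agent"] or "Android" in rec["agent"]
    return rec

records = [r for r in (structure(l) for l in RAW_LOGS) if r]
```

Once the raw lines are reduced to consistent records, behavioral questions (which pages, which devices, which sessions) become straightforward queries rather than string-parsing exercises.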
Jill Dyche, a partner at Baseline Consulting Group in Sherman Oaks, Calif., said classifying data is a key first step in the big-data management process. “When we talk about it with clients, we very quickly move on to data classification,” Dyche said at the Pacific Northwest BI Summit 2011 in Grants Pass, Ore. “So they’re not just forklifting data onto a data warehouse platform or data marts but really looking at what the data is and how it’s used.”
Often, one of the defining characteristics of big data is that it’s too large for a standalone database server to process efficiently. In addition, nontransactional data types such as Web logs and social media interactions – “the other big data,” in the words of Gartner Inc. analyst Merv Adrian – aren’t always a good fit for traditional relational databases. As a result, many user organizations engaged in big-data management employ a distributed computing, or scale-out, model, often built around open source technologies such as Hadoop, MapReduce and NoSQL data stores.
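The MapReduce model at the heart of that scale-out approach can be illustrated with a toy, in-process example. In a real Hadoop deployment the map and reduce tasks run in parallel across many nodes and the framework handles the shuffle; the single-process version below only demonstrates the programming pattern, and the log data is made up.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit (key, 1) pairs — one per page view in the input line."""
    path = line.split()[1]          # assumes simple "ip path" records
    yield (path, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a final count."""
    return (key, sum(values))

log_lines = [
    "198.51.100.7 /home",
    "203.0.113.9 /products",
    "198.51.100.8 /home",
]
pairs = chain.from_iterable(map_phase(l) for l in log_lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Because each map call touches only one input record and each reduce call only one key's values, the work partitions naturally across commodity machines, which is what makes the scale-out model cost-effective for very large data sets.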
The distributed approach has worked well for Catalina Marketing, according to Williams. “This whole idea of grid computing or connecting standardized PC-type appliances and making them work in concert made all the sense in the world,” he said. “That’s really what has allowed us to scale to the size that we are and to do that very cost-effectively and efficiently.”
Another strategy Williams has put in place is a monthly user group meeting designed in part to help Catalina keep its data warehouse appliances performing at an optimal level. Williams said the meetings are critical because they allow the IT staff to see how the needs of business users – and the queries they’re looking to run – change over time.
“We work with them to understand how they operate, what they’re running and what their analytics are showing,” he said, adding that the process enabled his team to recognize that the existing data structure and query parameters “weren’t optimized to accommodate what [the users] needed.” The data structure has now been modified to accommodate new types of queries, Williams said.
Big-data challenges call for management oversight
For some organizations, one of the biggest challenges associated with managing and analyzing super-large data sets is finding valuable information that can yield business benefits – and deciding what data can be jettisoned.
For example, UPMC, a Pittsburgh-based health care network with 20-plus hospitals and more than 50,000 employees, has seen its data stores grow by leaps and bounds in recent years, largely because workers are afraid to delete any information, according to William Costantini, associate director of the company’s integrated operations center.
“The biggest issue right now is [figuring out] what do you purge and what can’t you purge, because everybody is afraid of liability and being sued,” Costantini said. “Everybody is afraid to throw anything out or get rid of it. At the same time, everybody wants to be budget-conscious and keep the size down.”
Adding to the big-data challenges facing organizations is the increasing popularity of “data sandboxes” that enable data analysts to explore and experiment on subsets of information, typically outside of a data warehouse. Companies need to keep a close watch on sandboxes to make sure that they don’t end up with inconsistent stovepipes of data, analysts said.
In addition, the databases and Hadoop installations used to store nontransactional forms of big data are often set up by application developers working independently of the IT department. “This is being done by people outside the usual IT focus, with different tools,” Adrian said at the Pacific Northwest BI Summit. “Managed is probably too generous a term.”
Gartner’s take, he added, is that organizations able to integrate those data types into a coherent information management infrastructure will outperform businesses that can’t.
Executive Editor Craig Stedman contributed to this article.