
Data warehousing best practices to support 'big data': No small task

Successfully managing "big data" as part of a data warehousing strategy requires careful planning and a clear understanding of the various technology choices, according to industry analysts.

Wayne Eckerson, research director for TechTarget Inc.’s business applications and architecture media group, puts it in simple terms: If you’re going to be successful in working with “big data,” you need the right culture, the right people, the right data and the right tools. If only it were that simple to put all those elements in place as part of a data warehousing best practices plan.

Getting to that point requires careful planning and a clear understanding of the potential opportunities and challenges presented by big-data management technologies and processes, according to Eckerson and other analysts.

For starters, Eckerson said, “you need people at the top of the organization willing to invest” in the required technologies -- and committed to instilling an analytics-oriented culture to ensure that the company will use the information “and not just revert to relying on spreadsheets” for data analysis. He added that as organizations look at how to respond to the challenges of storing and managing big data, they need to be open to the possibility of moving to more purpose-built data warehouse platforms. Such offerings can provide processing performance that’s “an order of magnitude better” than what general-purpose databases support, Eckerson said.

However, Richard Winter, president of Cambridge, Mass.-based consulting firm Winter Corp, warned that emerging technologies such as Hadoop and MapReduce aren’t the solution to every big-data management problem. Organizations need to take care not to “throw out the baby with the bath water,” Winter said. “Some people think they can now do everything in Hadoop and they can stop investing in [traditional] data warehouse technology -- but that would be a terrible mistake for most enterprises.”

Winter recommended looking at applications individually and assessing which platform is best for a particular set of big data. Two key factors are how long the data will be retained and how it will be used, he said. Core transaction data belongs in a data warehouse, where it can be systematically managed for long-term usefulness and value, Winter said. On the other hand, clickstream data, social network posts showing customer sentiment and other types of less structured data might be a good fit for a Hadoop cluster, especially if the information won’t be kept for as long as transaction data typically is. How broadly data needs to be accessed within an organization should also influence the choice of a technology platform, he said.

Volume isn’t the only characteristic of big data, according to separate but similar definitions of the term by Forrester Research Inc. and Gartner Inc.; both firms also take into account attributes such as variety and variability (or complexity, in Gartner’s model). But Forrester analyst James Kobielus said that in practice, preparing a data warehouse to handle big data is still fundamentally about scalability -- and he offered three sets of tips on data warehousing best practices aimed at helping organizations to deliver more powerful and scalable systems.

Big-data decision point: Scale up or scale out?
First, consider upgrading and potentially building parallelism into your data warehouse architecture. Possible steps include scaling up data warehouse server nodes based on shared-memory symmetric multiprocessors, or scaling out by using server clusters or shared-nothing massively parallel processing (MPP) systems, Kobielus said. Partitioning MPP installations into hub, staging and query tiers is another option. But Kobielus warned that attempting such changes without proper attention to the underlying technology infrastructure is likely to lead to disappointing results. For instance, he noted that single-core CPUs probably won’t measure up to MPP requirements and that storage I/O bandwidth typically must be increased to support the added processing capabilities.
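As a rough illustration of the scale-out idea, the Python sketch below mimics a shared-nothing layout: rows are hash-partitioned across nodes by key, each node aggregates only its own rows, and a coordinator merges the partial results. It is a toy model rather than any vendor's implementation; the function names, the `(customer_id, amount)` row shape and the use of CRC32 as the partitioning hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Pick a node for a key using a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(rows, num_nodes):
    """Distribute (customer_id, amount) rows so each key lives on exactly one node."""
    nodes = [[] for _ in range(num_nodes)]
    for customer_id, amount in rows:
        nodes[node_for(customer_id, num_nodes)].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Work done independently on one node: sum amounts per customer."""
    totals = defaultdict(float)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def scatter_gather_totals(rows, num_nodes):
    """Coordinator step: run the local aggregation per node and merge the results.

    Because rows are partitioned by customer_id, no key appears on two nodes,
    so the partial results can be merged without re-aggregating.
    """
    merged = {}
    for node_rows in partition(rows, num_nodes):
        merged.update(local_totals(node_rows))
    return merged
```

In a real MPP system the per-node step would run in parallel on separate servers, each with its own CPU, memory and storage, which is why the infrastructure concerns Kobielus raises about cores and I/O bandwidth matter.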

Second, Kobielus advised organizations to look at adopting data warehouse appliances where the hardware and software bundles can address specific performance issues or pain points. And third, he recommended that companies work to optimize the data management and storage layers of their data warehouses to help boost performance. That could include compressing data for maximum efficiency, improving database schemas, joins and partitions, and using nontraditional database technologies such as columnar and in-memory software “as needed to achieve specific goals,” he said.
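To show why columnar storage and compression are often mentioned together, here is a minimal Python sketch that pivots row-oriented records into columns and then run-length encodes one column. Run-length encoding is chosen here only as a simple example of a compression scheme; the article does not name a specific one, and the function names and sample fields are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def rle_encode(values):
    """Run-length encode a column: consecutive repeats become (value, count)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


if __name__ == "__main__":
    rows = [
        {"country": "US", "amount": 10},
        {"country": "US", "amount": 20},
        {"country": "CA", "amount": 5},
    ]
    columns = to_columns(rows)
    print(rle_encode(columns["country"]))
```

Storing each column contiguously puts similar values next to one another, so low-cardinality columns such as a country code can collapse into a handful of runs, and a query that touches only one column need not read the others.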

Lyndsay Wise, president and founder of Toronto-based consulting firm WiseAnalytics Inc., said the end goals of big-data projects are often much the same as those of traditional data warehousing initiatives -- for example, providing information that can help business users identify customer buying patterns or aid in fraud protection efforts. The challenges are similar, too: “There may be different nuances in terms of what you’re trying to get from the data, but the results may still depend on integration and data quality issues or on the challenges of data management and data governance,” she said.

Wise added, though, that the difficulty of those challenges can be heightened by the amount of data that needs to be managed as well as its complexity, especially if a big-data project involves pulling together information from multiple data sources. As a result, companies incorporating big data into a data warehousing process need to honestly assess their capabilities, she advised. “Organizations want to say they have great IT people, but unless their DBAs and developers are really well-versed in data warehousing and specific big-data technologies, it almost pays to invest in outside [help] to really develop a strong platform,” Wise said.

With big data, it’s also critical to be able to frame what you want to achieve from an analytics standpoint and to determine up front what information is needed and what kind of hurdles might be involved in pulling it all together, according to Wise. “It’s really important to understand how everything interacts,” she said.

Alan R. Earls is a Boston-area freelance writer focused on business and technology.
