Trouble spots: 'Big data' pitfalls in the data warehousing process

The growing volumes of "big data" that organizations are looking to store pose a variety of potential problems and issues for data warehousing teams.

“Big data” has already arrived in many organizations; in others, it’s coming. And like any new technology opportunity, big data comes with a raft of potential problems and issues that IT and data warehousing teams should approach with caution.

For example, Forrester Research Inc. analyst Brian Hopkins said that before organizations jump into big-data management, they need to figure out whether traditional data warehouse strategies and techniques will work for them in the context of information that often is unstructured and potentially not a good fit for mainstream relational databases. And if a traditional data warehousing process isn’t the answer for managing big data, companies might need to “get comfortable with open source technology,” such as Hadoop, MapReduce and NoSQL databases, Hopkins said.

David Menninger, an analyst at Ventana Research Inc., noted that the people within an organization can make or break a big-data initiative, particularly if new technologies are involved. Ventana recently surveyed 163 IT and business professionals from various countries on Hadoop adoption and other issues related to managing and using big data. “The participants told us the biggest obstacle for them is most often staffing and training, because these technologies are different enough and require further training and different training than what people have studied in school,” Menninger said.

If you understand how processing is distributed across a Hadoop cluster, for example, you can avoid moving excessive amounts of data around and potentially get better and faster results on analytical queries, according to Menninger. But first, he said, “you need people who know how to do these things.”
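To illustrate Menninger's point about data movement, the following is a minimal sketch in plain Python, not Hadoop code. It assumes three hypothetical nodes that each hold a handful of log records, and it compares shipping every raw record to one place against counting locally on each node and shipping only the per-node totals, which is roughly what a combiner step does in a MapReduce job. The data, function names and "records moved" measure are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical log records held on three separate nodes (illustrative data only).
partitions = [
    ["error", "info", "info", "warn"],
    ["info", "info", "error"],
    ["warn", "info", "info", "info"],
]

def naive_shuffle(parts):
    """Send every raw record across the network, then count centrally."""
    moved = [rec for part in parts for rec in part]
    return Counter(moved), len(moved)

def combined_shuffle(parts):
    """Count on each node first, then send only the per-node totals."""
    local_counts = [Counter(part) for part in parts]
    # Each node sends one (event type, count) pair per distinct event type.
    moved = sum(len(counts) for counts in local_counts)
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total, moved

naive_totals, naive_moved = naive_shuffle(partitions)
combined_totals, combined_moved = combined_shuffle(partitions)

print("records moved without local counting:", naive_moved)
print("records moved with local counting:", combined_moved)
print("same answer either way:", naive_totals == combined_totals)
```

Both approaches produce the same totals; the difference is how much data has to leave each node, and that gap tends to widen as the number of records per node grows relative to the number of distinct event types.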

Menninger said the Ventana survey showed that relational databases are still dominant, even for big-data management. About 90% of the respondents said their companies use relational software overall, and 75% were using it as their primary technology for supporting big data. On the other hand, more than half said they were evaluating Hadoop, while 22% already were in production with the open source technology and 12% planned to begin using it within the next year. Based on the survey, Menninger said, Hadoop most often is being used to store “loosely structured data -- log and event data and to a lesser extent text and social media data.”

Surprisingly to Menninger, the survey found that flat files were the second most popular big-data management technology, employed by about 70% of the respondents. “I think that is in part what leads to Hadoop,” he said. “If you’re working with flat files, it isn’t hard to consider using Hadoop. Hadoop is really all about flat files; it’s more sophisticated than that, but if you squint it’s sort of the same thing.”

Menninger added that companies should also watch out for two other potential big-data pitfalls as part of the data warehousing process: software licensing costs that potentially can soar along with data volumes, and inadequate integration between big-data technologies and business intelligence (BI) tools.

Big data: Too much information?

Avanade Inc., a Seattle-based IT consulting and professional services firm, also recently released a study on big-data trends and challenges, based on survey responses from 543 C-level executives and IT decision makers in 17 countries. Markus Sprenger, global director of Avanade’s BI and collaboration practices, said the survey showed that one of the primary hurdles of managing big data is simply figuring out what is worth keeping and what isn’t. “We found it is a question of how the business can identify relevant data and then apply it to a decision-making process,” he said.

Echoing one of Menninger’s points, Sprenger added that there aren’t enough IT and data warehousing workers available with experience in managing nontransactional forms of big data, both within the surveyed businesses and in the job market as a whole. Organizations typically have mature processes for handling structured transaction data, he said, but most are just starting to learn how to manage large quantities of unstructured and semi-structured data in an organized way.

That’s reflected on the technical side, where many of Sprenger’s clients are still struggling to understand what to do with big-data installations based on Hadoop and MapReduce -- assuming that the IT and data warehousing teams within organizations even know about the deployments and have responsibility for managing them, which often isn’t the case.

“We’re not at the point yet where IT can provide service-level agreements around these things -- that will probably take another year or two,” Sprenger said. “It is still more of an experiment for most organizations.”

Alan R. Earls is a Boston-area freelance writer focused on business and technology.
