Experimentation with what can collectively be dubbed big data tools -- including Hadoop clusters, the MapReduce programming model and NoSQL databases -- has produced emerging application scenarios and use cases that demonstrate clear business value. But these early successes raise a potentially complicated question: What is the best way to integrate big data systems into an enterprise data warehousing, business intelligence (BI) and analytics architecture?
Big data technologies don't have to be disruptive to existing data warehouse environments. Yes, the broad set of no-cost or low-cost tools that make up the Hadoop ecosystem lowers the barriers to entry, and their support for storing and managing massive data sets on commodity hardware has the potential to dislodge the traditional enterprise data warehouse from its perch at the center of the BI and analytics universe.
But organizations that have invested a significant amount of money, resources and time in deploying data warehouses to support querying, reporting and analysis aren't likely to want to turn their backs on those investments. And even if your company does opt to transition to a new BI and big data analytics architecture layered solely on top of Hadoop and NoSQL technologies, the switch is unlikely to happen overnight. More commonly, it will be made via a series of incremental changes to reduce the risk of decreased service levels or full-scale interruptions in analytics processes.
As a result, most organizations will benefit from an approach that values integration and interoperability to ensure a level of symbiosis between old and new technologies. An example might be a Hadoop-based analytical application for customer profiling coupled with an existing customer data warehouse. Data can be streamed from the warehouse into the Hadoop application, while enhancements to customer profiles and classifications generated as part of the analysis process can be merged back into the data warehouse.
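The round trip in that example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the customer data warehouse, a plain function stands in for the Hadoop-based profiling job, and the table, column and function names (customers, total_spend, segment, classify and so on) are hypothetical rather than taken from any real system.

```python
import sqlite3

def build_demo_warehouse():
    """Create an in-memory SQLite database as a stand-in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, total_spend REAL, segment TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (customer_id, total_spend) VALUES (?, ?)",
        [(1, 250.0), (2, 4200.0), (3, 1000.0)],
    )
    conn.commit()
    return conn

def export_customers(conn):
    """Step 1: pull customer rows out of the warehouse for external analysis."""
    cur = conn.execute(
        "SELECT customer_id, total_spend FROM customers ORDER BY customer_id"
    )
    return cur.fetchall()

def classify(rows, threshold=1000.0):
    """Step 2: stand-in for the profiling job; assigns a segment per customer.

    The threshold rule is a hypothetical placeholder for real analytics.
    """
    return [
        (cid, "high_value" if spend >= threshold else "standard")
        for cid, spend in rows
    ]

def merge_segments(conn, segments):
    """Step 3: merge the derived classifications back into the warehouse."""
    conn.executemany(
        "UPDATE customers SET segment = ? WHERE customer_id = ?",
        [(segment, cid) for cid, segment in segments],
    )
    conn.commit()
```

Calling merge_segments(conn, classify(export_customers(conn))) on the demo database performs the full loop. In a real deployment each step would likely be a separate job with its own scheduling and error handling.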
Making a big data connection
The first consideration for integration is establishing connections between data warehouses and big data platforms. At present, one of the most frequent uses of big data systems is for data warehouse augmentation, in which they provide expanded data storage at a lower cost than a traditional data warehouse or data mart can. Many early adopters are also using Hadoop clusters and NoSQL databases as staging areas for data before loading some or all of the information into a data warehouse for analytical uses. Such applications can be as simple as using the Hadoop Distributed File System to store data, or they can involve more complex links to data sets in Hive, HBase, Cassandra and other NoSQL technologies.
Incorporating those tools into a data warehouse and BI framework may require both connectivity and interpretation. Application programming interfaces can be used to provide access to Hadoop and NoSQL systems from data warehouses; in addition, numerous vendors offer packaged connectors between SQL databases and big data systems, including ones based on integration standards such as ODBC and JDBC. For systems that don't conform to a typical relational model, an interpretation layer may be needed to transform semi-structured objects (documents, for example) from their serialized form, such as YAML or JSON, into a format that BI applications can understand.
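As a rough illustration of such an interpretation layer, the Python sketch below flattens JSON documents into rows with a shared set of columns. It is one possible approach, not a description of any vendor's connector: the dotted column names, the choice to keep lists as JSON strings, and the function names flatten and documents_to_table are all assumptions made for the example.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted column names."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, extending the column prefix.
            row.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Lists have no natural single column; keep them as a JSON string.
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row

def documents_to_table(json_lines):
    """Turn an iterable of JSON strings into (columns, rows) for a BI tool.

    Columns are the sorted union of all flattened keys; a document that
    lacks a column gets None in that position.
    """
    rows = [flatten(json.loads(line)) for line in json_lines if line.strip()]
    columns = sorted({col for row in rows for col in row})
    table = [[row.get(col) for col in columns] for row in rows]
    return columns, table
```

Given one document with a nested address and another with a tags list, documents_to_table returns columns such as address.city and tags, with None wherever a document has no value for a column. Real data would also call for explicit type handling and a policy for conflicting field types.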
There are other approaches for even tighter integration between the two types of systems. For example, data warehouse systems are increasingly open to incorporating call-outs to MapReduce functions as enhancements to their native SQL vocabulary, enabling the results of an analytical process on a Hadoop cluster to be pulled directly into a BI query's result set. Another example is incorporating Hadoop-generated analytical results into data warehouses for reporting and further analysis.
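The second pattern, loading Hadoop-generated results into the warehouse for reporting, can be sketched as follows. SQL call-outs to MapReduce are vendor-specific and are not shown. Here a local map step and reduce step stand in for a job that would run on a cluster, SQLite stands in for the warehouse, and every table and function name (page_view_counts, map_phase, reduce_phase and so on) is hypothetical.

```python
import sqlite3
from collections import defaultdict

def map_phase(records):
    """Map step: emit a (customer_id, 1) pair for every page-view record."""
    for customer_id, _page in records:
        yield customer_id, 1

def reduce_phase(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def load_results(conn, totals):
    """Load the aggregated results into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_view_counts ("
        "customer_id INTEGER PRIMARY KEY, views INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO page_view_counts VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()

def report(conn):
    """BI-style query joining warehouse data with the loaded results."""
    return conn.execute(
        "SELECT c.name, COALESCE(v.views, 0) "
        "FROM customers c "
        "LEFT JOIN page_view_counts v ON v.customer_id = c.customer_id "
        "ORDER BY c.customer_id"
    ).fetchall()
```

Once the counts are loaded, the report query treats them like any other warehouse table, which is the point of this form of integration: BI tools need no knowledge of where the numbers were computed.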
Big data gaps need bridging
Integrating the different approaches will rapidly become an imperative for many IT and data warehousing teams as the business value of big data -- and how to unlock it -- becomes better understood. Coupling a degree of agility with good program planning for the integration process is critical. That means bridging some obvious gaps that will persist as adoption increases, including the following:
Disjointed architectures. The typical approach to pilot projects or proofs of concept, as well as many early production applications, involves deploying Hadoop or NoSQL systems in their own siloed environments. A well-structured integration plan must include engaging IT and data architects to properly envision, design and deploy the various stacked layers of a hybrid data warehouse, BI and analytics architecture.
Administration shortcomings. The open source nature of many big data tools often leads to an emphasis on functionality over management and administration. This gap will narrow over time as the commercial versions of big data software products mature, but for now you may need to compensate for the relative immaturity of their management capabilities.
Skills shortages. The steep learning curve for working with Hadoop and NoSQL technologies may be the biggest hurdle to clear in big data integration efforts. Knowledge of parallel and distributed computing techniques in general remains somewhat elusive in the IT staffing marketplace, and there are even fewer people with deep hands-on experience in developing and updating big data applications. Training internal staffers may be the fastest, and lowest-cost, way to put the required skills in place.
In a growing number of companies, Hadoop and NoSQL integration with data warehouse environments is a question not of "if" but of "how soon?" Starting to prepare now will help identify the potential roadblocks up front and enable the development of an effective project plan. That, in turn, should help you build repeatable processes for meeting your integration needs -- and that should be the ultimate goal of any initiative.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting, training and development services company that works with clients on business intelligence, big data, data quality, data governance and master data management initiatives. He also is the author of numerous books, including Big Data Analytics and The Practitioner's Guide to Data Quality Improvement. Email him at firstname.lastname@example.org.