Edmunds.com Inc., which publishes automobile pricing data, vehicle reviews and other car-shopping information online, is driving deep into big data applications territory to power its data warehousing and business intelligence (BI) operations. In February, the Santa Monica, Calif., company replaced its existing relational data warehouse with a Hadoop-based system in an effort to speed up data processing and enable its business users to run more complex and data-heavy analytics applications than the old platform could support.
But the Hadoop Distributed File System (HDFS) isn't the only engine under the hood of the new environment. After initially being processed in HDFS, dealer inventory information, vehicle configuration data sets and other forms of structured data are passed along to HBase, its companion NoSQL database, for storage. From there, aggregated information correlated with Internet clickstream data is transmitted to IBM Netezza and Amazon Redshift systems for ad hoc querying and to BI tools from MicroStrategy and Platfora for reporting uses, according to a June blog post by Philip Potloff, chief information officer at Edmunds.
Doing the required data integration work to tie everything together wasn't a simple matter. Edmunds had to replace the traditional extract, transform and load (ETL) processes that fed the relational data warehouse with new manually coded integration programs, using Java, MapReduce and Hadoop's Oozie job scheduler. Paddy Hannon, the company's vice president of architecture, said in an interview at the Hadoop Summit 2013 in June that the work took four developers about 18 months to complete.
Copying data sets from the file structure of HDFS into a database table format for storage in HBase wasn't much of a challenge, said Hannon, who took part in a panel discussion at the conference in San Jose, Calif. "The more difficult part," he said, "was unpacking the 10 to 15 years of ETL we'd done to find out what rules were important and which weren't." Then the developers had to incorporate the business rules deemed worth keeping into the new implementation.
Such challenges are common on big data projects. In many cases, the data integration process is likely to become harder to manage as all-encompassing data warehouses and rigid ETL routines give way to more dynamic environments that combine a variety of systems with flexible, on-the-fly integration to support specific data analysis needs. That can require a big shift in data management principles and procedures, covering data integration as well as related data cleansing and governance initiatives.
A federated format for big data applications
In the past, data integration in the form of ETL typically was "a self-contained process" that focused simply on moving cleansed and consolidated data from source systems to a target data warehouse, said Michele Goetz, an analyst at Forrester Research Inc. in Cambridge, Mass. "Now you've got this federated environment where data can be anywhere," she said. "And a lot of times you want to leave it where it is and just call it when it's needed [for use on another system]."
At least, that's where things are heading, according to Goetz and other analysts. The most prevalent big data deployment approach that Forrester is seeing among its clients is a Hadoop system tied to an enterprise data warehouse (EDW), with the two technologies augmenting one another. For example, a Hadoop cluster could serve as a staging area for data on its way to the EDW or become the primary repository for specific types of information.
Consulting and market research company Enterprise Management Associates Inc. (EMA) has mapped out what it calls a "hybrid data ecosystem," an architectural framework for big data applications that incorporates eight different categories of systems, including EDWs, data marts, Hadoop clusters, NoSQL data stores and specialized analytical databases. In a survey on big data initiatives conducted jointly by EMA and 9sight Consulting in the summer of 2012, 72% of the 255 business and IT professionals who responded said their organizations were using more than one of the eight technology platforms. Forty-six percent said they had three or more in place.
But as organizations move away from treating big data analytics as a siloed application and look to use the analytical results to drive their mainstream business processes, data quality and seamless upstream integration become more important. And the increased flexibility of big data architectures also brings a higher level of development and management complexity, which might require an infusion of new processes and skills -- and even a cultural overhaul -- in IT departments.
Slow start, fast finish
At Edmunds, Potloff wrote in his blog post, the first few months of the data warehouse replacement effort "were pretty slow going" as members of the development team learned the basics of using Hadoop technologies. Greg Rokita, the company's senior director of software architecture and leader of the Hadoop team, said in a Q&A section of the post that the developers had no prior experience with HDFS, HBase, MapReduce or other Hadoop tools. But, he added, the team eventually found its footing and adopted strategies such as abstracting complex data sets to simplify interactions with other information, and "continuous refactoring" of the code base to incrementally improve scalability and reliability in a controlled way.
As of June, according to Potloff, the newly combined data sets and improved processing capabilities of the Hadoop-based environment had enabled business analysts using the HBase-fed query and reporting systems to save more than $1.7 million in paid-search marketing expenses through better optimization of keyword bidding processes.
"We gave capabilities to the business that they had never had before," Hannon said at the Hadoop Summit. "It was well worth it in the long run."
Jack Vaughan, SearchDataManagement's news and site editor, contributed to this story.