Adobe Systems is well known for its portable document format (PDF) and multimedia editing products such as Photoshop and Premiere. The San Jose, California-based company tracks everything its users do -- for example, whether they've downloaded the latest software upgrade or how often they've used the product in the last week -- and puts that information into Event Tracing System, a Hadoop data store. The system captures more than 3 million events per day, according to Kevin Davis, a senior data warehouse engineer at Adobe.
What can be done with all that Hadoop data? Adobe sought to dump it all into its SAP HANA system for deep analysis, but it needed to find the right extract, transform and load (ETL) tool. In the end, it chose SAP Data Services over a third-party proprietary tool or the open-source Sqoop. Davis explains why.
Why did you decide to use SAP Data Services over another ETL tool for loading Hadoop data into SAP HANA?
Kevin Davis: When we compared Data Services to other ETL tools, Adobe IT has [become] standardized on Data Services. We have a lot of SAP data sources and we have HANA as our main analytics engine, so just the native connectivity between Data Services and those other SAP sources and HANA made Data Services our ETL choice. We didn't even really compare Data Services to Informatica or Pentaho or any of the others. We have so much SAP in our landscape that it just made sense to use an SAP product. When you have a problem trying to integrate data, you can log one ticket and all of the people who need to look at the system to figure out where the problem is all work for the same company.
When we looked at Sqoop, a lot of it had to do with the maturity of the product and the fact that we had a lot of developers familiar with Data Services. So they could use the same development paradigm that they've used to get data from other sources. They didn't have to learn any new technologies.
Then there were the data-transfer rates. When we did our tests, Data Services was more than two times faster than Sqoop [for] loading data from Hadoop into HANA.
What is your main challenge in using SAP Data Services to load Hadoop data into HANA?
Davis: One of the biggest challenges is version compatibility. The open source community evolves so quickly. In the last year, Cloudera released seven major versions of its Hadoop infrastructure. You only get two service pack releases of Data Services per year, so it's obviously hard for Data Services to keep up.
What are your plans going forward for Hadoop and SAP HANA?
Davis: We're looking to update all of our infrastructure -- HANA, Data Services and Hadoop -- within the next couple months just to get us on the latest versions of all of those. And now that we've solved the problem of getting Hadoop data into HANA, we know there are other use cases where we have data in Hadoop that we can use to gain some insights and use them in analytics models and dashboards. So we'll continue to implement use cases with that same model of filtering and aggregating that Hadoop data and loading it into HANA, and using it for dashboards.
Mark Fontecchio may be reached at firstname.lastname@example.org or on Twitter @markfontecchio.