Managing Hadoop projects: What you need to know to succeed
Open source Hadoop and supporting data management tools created by Web giants like Google, Yahoo and Facebook give more traditional companies new ways to tap into mounds of data from the Internet and other sources. But in most cases, deploying an enterprise Hadoop system takes more than just massive data-crunching capabilities.
At money-transfer and payment services company Western Union, for example, incorporating Hadoop into its enterprise operations also meant successfully integrating large amounts of unstructured Web information into the corporate workflow.
"We can do things we couldn't do in the past because the data was so huge and getting answers was very difficult. Hadoop is helping," said Pravin Darbare, senior manager of data integration and data discovery at Western Union. But considerable work was required to make new forms of data readily usable by the Englewood, Colo., company's data scientists, business analysts and marketing workers, he said.
That included building out self-service business intelligence and analytics applications for the end users. And to feed those applications with the needed data, Darbare said, Western Union had to integrate a Hadoop cluster with a variety of other systems and software pieces.
Enterprise Hadoop integration snapshot
Customers can use Western Union's website to send money to other people, pay bills, buy or reload prepaid debit cards, and find the locations of agents for in-person services. Log-based clickstream data about user activity on the site is captured in unstructured, non-relational formats, then combined with relational data and moved along a data analysis pipeline, said Darbare, who called these data feeds funnels.
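As a rough illustration of what such a funnel does, the Python sketch below parses raw clickstream log lines and joins them with relational customer records. It is not Western Union's implementation, which relies on Informatica tooling. The log format, the field names, the `customers` lookup and the sample data are all hypothetical, chosen only to show the parse-then-join pattern.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption, not Western Union's actual format):
# "<ISO timestamp> <user_id> <HTTP method> <path>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<user_id>\S+) (?P<method>[A-Z]+) (?P<path>\S+)$"
)


def parse_clicks(lines):
    """Yield one dict per well-formed log line; skip lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            yield match.groupdict()


def enrich(clicks, customers):
    """Attach a relational attribute (country) to each click, keyed by user_id."""
    for click in clicks:
        profile = customers.get(click["user_id"], {})
        yield {**click, "country": profile.get("country", "unknown")}


def page_views_by_country(enriched):
    """Count page views per (country, path) pair."""
    return Counter((row["country"], row["path"]) for row in enriched)


# Made-up sample data for illustration only.
raw_log = [
    "2014-03-01T10:00:00Z u1 GET /send-money",
    "2014-03-01T10:00:05Z u2 GET /pay-bills",
    "malformed line",
    "2014-03-01T10:00:09Z u1 GET /send-money",
]
customers = {"u1": {"country": "US"}, "u2": {"country": "MX"}}

counts = page_views_by_country(enrich(parse_clicks(raw_log), customers))
for (country, path), views in sorted(counts.items()):
    print(country, path, views)
```

In a production pipeline the parsing and joining would run as distributed jobs on the cluster rather than in a single process, but the shape of the work is the same: impose structure on the raw log, then enrich it with reference data before analysis.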
The data load going through such funnels is heavy: In 2013, more than 70 million users tapped into Western Union services online, in person or by phone, and the company said it averaged 29 transactions per second across 200 countries and territories.
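To put that rate in perspective, the short calculation below converts the reported average of 29 transactions per second into daily and yearly totals. It assumes the average holds around the clock over a 365-day year, which the company's figure implies but the article does not spell out.

```python
# Reported average from the article; the rest is plain unit conversion.
TRANSACTIONS_PER_SECOND = 29
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
DAYS_PER_YEAR = 365             # assumption: non-leap year, constant average

per_day = TRANSACTIONS_PER_SECOND * SECONDS_PER_DAY
per_year = per_day * DAYS_PER_YEAR

print(f"{per_day:,} transactions per day")    # 2,505,600
print(f"{per_year:,} transactions per year")  # 914,544,000
```

That works out to roughly 2.5 million transactions a day, or more than 900 million a year.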
And the load is likely to grow quickly as use of the Web and mobile devices continues to increase. The enterprise Hadoop system is helping to handle the data influx, and Western Union is relying on data integration software from Informatica Corp. to tie the cluster into its broader analytics architecture.
First, the streams of unstructured and relational data are parsed and prepared for analysis using Informatica's Big Data Edition and Data Replication tools. The integration platform in turn connects with an IBM Netezza engine for structured data analytics and with the cluster, which is based on Cloudera Inc.'s Hadoop distribution, for storage and processing of both structured and unstructured data. This is all connected to Tibco Software Inc.'s ActiveSpaces in-memory data grid, and to analytics tools from SAS Institute Inc. and Tableau Software.
Forging a fast path to Hadoop
Using a commercial Hadoop distribution rather than basic Apache Hadoop is important to Western Union's effort. "We need support for the software," Darbare said, recalling the advent of the open source Linux operating system in the late 1990s. "This is just like many years back. We went through this with Linux -- it's the same scenario."
In addition, he said, Informatica's data integration tools helped Western Union get useful Hadoop applications up and running quickly. For example, Western Union's marketing teams are able to use the newly culled data to study website activity and to refashion the site in an effort to create a better user experience.
Darbare was more guarded in discussing uses of the enterprise Hadoop system by the data scientists at Western Union. However, there are clear indications that risk analysis and compliance with financial regulations are major drivers for advanced technology use at the company, which needs to guard against money laundering and other financial crimes.
While applications could be built quickly, Darbare said it took a year of effort to get to the point where his team could say the Hadoop data was being widely used. Now, though, the information in the cluster is crucial to the analytics process.
"Today, I don't think we can live without Hadoop," he said. "What we hear now is people that say, 'If Hadoop is down, I can't do my work.' This kind of comment is good for us. People are depending on the data."