Because it runs on commodity server clusters, the Hadoop framework offers cloud-like scalable computing that threatens the IT status quo. Hadoop could slow the growth of the enterprise data warehouse by providing lower-cost data processing. It may also make incursions in a more surprising place -- mainframe modernization, a sometimes sleepy world that may be due for a shakeup of its own.
Many organizations would like to curb the amount of data being processed on their mainframes to help reduce IT expenses. Some would also like to correlate mainframe operations data with unstructured and semi-structured forms of data for analytical uses -- for example, associating hotel room bookings with social media comments, or matching customer account data with transcripts of call-center support calls. Hadoop can play a part in both the curbing and correlating scenarios.
Demand from mutual customers was the main force behind a recently announced technology partnership between mainframe data integration specialist Syncsort and Hadoop upstart Cloudera, which aims to bring mainframe data to Hadoop clusters for use in big-data analytics applications. At the Strata + Hadoop World 2013 conference in New York last week, these and other technology and services providers discussed the notion of bridging the gap between big data and big iron.
It can be hard sometimes to pull IT and data managers in New York away from their fevered trading floors, but the idea of updating the mainframe with Hadoop extensions may have drawn more than a few Wall Street wonks to the Strata event in midtown Manhattan. These people typically aren't looking to pull the plugs on either their mainframes or their data warehouses. But they may be looking to cap the data growth on those platforms, if Hadoop proves capable of picking up the processing slack.
Syncsort and Cloudera aren't alone in efforts to insert Hadoop into mainframe environments. MetaScale, a spin-off of Sears Holdings, has built a consulting services practice that includes a mainframe-to-Hadoop application migration methodology; the approach uses Hadoop's associated Pig query platform to run high-volume query applications. Sears officials report that they have even been able to turn off a mainframe or two based on their progress in offloading processing to Hadoop.
From here to legacy
Such offloading is also of interest to other users, especially in mainframe-bastion industries such as finance and insurance. In an interview at the Strata event, Syncsort President Josh Rogers said it's an increasingly common use case for Hadoop in the enterprise.
With the data warehouse, Rogers said, a drive to reduce extract, transform and load (ETL) processing is especially in play. ETL functions often account for a large portion of the overall processing work -- more than 30%, he said. Those workloads are a ready target for clustered Hadoop servers, which can load the data before transforming it to support extract, load and transform (ELT) schemes. Putting off the transformation step reduces up-front processing. And when the time comes for it, Hadoop has proved to be quite adept at high-speed data transformations.
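The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Syncsort, Cloudera or Hadoop implement it: the sample records, the in-memory lists standing in for a warehouse and a Hadoop cluster, and every function name are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def extract() -> List[Record]:
    # Hypothetical raw rows standing in for records pulled from a mainframe file.
    return [
        {"account": " 001 ", "amount": "12.50"},
        {"account": "002", "amount": "7.25"},
    ]


def transform(row: Record) -> Dict[str, object]:
    # Trim the account key and convert the amount text to a number.
    return {"account": row["account"].strip(), "amount": float(row["amount"])}


def etl(store: list) -> None:
    # ETL: every row is transformed before it is loaded.
    store.extend(transform(row) for row in extract())


def elt(store: list) -> None:
    # ELT: raw rows are loaded as-is, with no transformation work up front.
    store.extend(extract())


def query_total(raw_store: list) -> float:
    # In the ELT case, transformation is deferred until a query needs it.
    return sum(transform(row)["amount"] for row in raw_store)


warehouse: list = []
etl(warehouse)   # rows arrive already cleaned

cluster: list = []
elt(cluster)     # rows arrive raw; cleaning happens inside query_total()
total = query_total(cluster)
```

In the ELT path the load step does nothing but copy, which is the reduced up-front processing Rogers describes; the transformation cost is paid later, on the cluster, when the data is actually used.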
Rogers and others claim the cost of storage can be vastly lower on Hadoop clusters versus mainframes or data warehouses, sometimes sliding to $1,000 per terabyte compared with as much as $100,000 per terabyte. In describing the partnership with Cloudera, Syncsort CEO Lonne Jaffe said, "We've created a button you can push to suck in the expensive workloads."
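To put the quoted storage figures in perspective, the short calculation below applies both per-terabyte prices to the same capacity. The two prices are the extremes cited above ("sometimes" as low as one, "as much as" the other), so the result is a best-case gap rather than a typical one, and the 50 TB capacity is a purely hypothetical figure chosen for illustration.

```python
def storage_cost(terabytes: float, dollars_per_tb: float) -> float:
    # Simple linear cost model: capacity multiplied by price per terabyte.
    return terabytes * dollars_per_tb


HADOOP_PER_TB = 1_000     # low-end Hadoop figure quoted in the article
LEGACY_PER_TB = 100_000   # high-end mainframe/warehouse figure quoted in the article
capacity_tb = 50          # hypothetical capacity, not from the article

hadoop_cost = storage_cost(capacity_tb, HADOOP_PER_TB)
legacy_cost = storage_cost(capacity_tb, LEGACY_PER_TB)
ratio = legacy_cost / hadoop_cost

print(f"Hadoop: ${hadoop_cost:,.0f}")
print(f"Legacy: ${legacy_cost:,.0f}")
print(f"Ratio:  {ratio:.0f}x")
```

Because the model is linear, the ratio between the two quoted prices stays the same at any capacity; only the absolute dollar gap grows with the amount of data stored.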
Jaffe points out that some mainframe modernization efforts have stalled because they're risky and expensive. Even if they succeed, he noted, users often end up with nothing more than an original application that has simply been moved as-is to another platform. That could mean an opening for Hadoop.
Hadoop as additive for big iron
Ironically, moving some mainframe processing work to Hadoop platforms could bring new vitality to the legacy systems by fostering an analytics partnership between the two technologies, as IT analyst John Webster suggests.
"People want to get data from the traditional [mainframe] data sources that they have been using forever, especially customer data and transaction data, and to marry that with the other kinds of data just now becoming available," said Webster, a senior partner at Evaluator Group. "That is where interest in Hadoop comes in."
As a result, customers are driving Hadoop distribution providers to support reloads of mainframe data in combination with other types of data, according to Webster.
In the big data era, data processing architectures are becoming more fluid every day, and it was perhaps to be expected that the mainframe would be touched by that change. Service-oriented architecture, or SOA, which put mainframe applications in an all-purpose wrapper of Web services and XML, was a major form of mainframe modernization in recent years, but it seems to have reached a plateau. Hadoop innovations could enliven the venerable platform -- and accelerate its evolution -- once again.