Managing Hadoop projects: What you need to know to succeed
As one of the most vivid features of big data, the Hadoop file system could hardly have a more promising future in IT shops intent on probing data for business leverage. Hadoop's distributed approach to processing looks like a great fit for handling big volumes of unstructured data. But Hadoop and its associated MapReduce programming model are not automatic cure-alls -- MapReduce and Hadoop problems confront the big data newbie at every turn.
Problems that Hadoop implementers confront include complexity, performance and systems management. Still, interest is rising. MarketAnalysis.com estimated that the Hadoop-MapReduce market will grow at a 58% compound annual rate, reaching $2.2 billion by the end of 2018. The software that works along with Hadoop, which was originally created at Yahoo, is plentiful and growing. But just finding the right place for team members to start can be a challenge.
"When we speak of Hadoop, it is basically not much more than a distributed file system. All the distributions come with tools," said Joe Caserta, president of Caserta Concepts LLC, a New York-based data warehouse consulting and training company. "And the tools make sense of the data on the distributed data system. These include Hive [querying tools], which can emulate structured data; Pig, which is essentially an ETL language to manipulate data in Hadoop, and others."
For many, complexity with Hadoop comes in the form of development in Java.
While Java is a widely popular language, its use to date has been more common in application-oriented development than in data-oriented development. This is just one of many "context shifts" the Hadoop novice faces.
Even the large army of Java developers who are versed in SQL faces challenges with Hadoop. That is because Hadoop does not use SQL, except via associated tools such as Hive.
"Hadoop has a tool that helps with the context shift, and that is Hive. That works fairly well, but it is not always a clean transition," said Paul Dix, CEO and founder of Errplane, a New York City consultancy and maker of application monitoring software, and member of the New York Hadoop User Group.
Mixing Java skills and SQL add-ons does not assure Hadoop success either, according to Dix. A stellar Java developer may fail in deploying the Google-originated MapReduce model, often associated with Hadoop. MapReduce distills Hadoop-processed data into more easily usable subsets.
MapReduce: Low-hanging fruit of the Hadoop kind
"Most Java developers face issues in how they think about processing data into the MapReduce paradigm. They have to learn how to write MapReduce code to work with Hadoop. They have to learn to structure the problem correctly," Dix said.
Learning how to write MapReduce code to work with Hadoop to count words in text files is a first assignment for many Google developers, said Dix, himself a former Google software engineering intern.
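The word-count exercise Dix describes can be sketched in a few lines. The example below is a minimal illustration in plain Python that runs in memory; it does not use Hadoop or its Java API, and the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical names chosen for this sketch. It is meant only to show how the problem is restructured into a map step, a grouping step and a reduce step.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse the list of 1s for a word into a total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

In a real cluster, the map and reduce steps would run in parallel on different machines and the framework would handle the grouping between them; this sketch simply runs the three steps one after another.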
Finding a good starting point for the would-be Hadoop development team is key. One potential development project to start: putting log files into a Hadoop cluster and then applying MapReduce to that data to find out, for example, the number of unique visitors to a webpage, response time or the number of errors thrown for a Web application.
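The log-file project can be sketched the same way. The example below is a minimal, in-memory illustration in plain Python, not Hadoop code. It assumes a made-up log format of four space-separated fields (client IP, page path, HTTP status, response time in milliseconds), and it treats any status of 500 or above as an error; real Web server logs will differ and would need their own parsing. The function names are hypothetical.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: parse one log line and emit (path, (ip, is_error)).

    Assumed format (hypothetical): "<ip> <path> <status> <response_ms>".
    """
    ip, path, status, _response_ms = line.split()
    yield path, (ip, int(status) >= 500)


def reduce_page(path, records):
    """Reduce step: count distinct visitors and errors for one page."""
    visitors = {ip for ip, _ in records}
    errors = sum(1 for _, is_error in records if is_error)
    return path, {"unique_visitors": len(visitors), "errors": errors}


def analyze_logs(lines):
    """Group mapped records by page, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for path, record in map_log_line(line):
            grouped[path].append(record)
    return dict(reduce_page(path, records)
                for path, records in grouped.items())


if __name__ == "__main__":
    sample = [
        "10.0.0.1 /home 200 120",
        "10.0.0.2 /home 200 95",
        "10.0.0.1 /home 500 310",
        "10.0.0.3 /cart 503 40",
    ]
    for path, stats in sorted(analyze_logs(sample).items()):
        print(path, stats)
```

Keying the map output on the page path is the design choice that matters here: every record for a page ends up in the same reduce call, so counting distinct visitors becomes a simple set operation.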
"I would say that is the low-hanging fruit," said Dix, an Addison-Wesley author who recently presented online classes on working with big data.
Slam dunk and Hadoop
The MapReduce avenue is not the only one. "There are a lot of ways to do Hadoop without writing MapReduce programs from scratch," said Paul Mackles, senior manager of software architecture at Adobe Systems Inc., based in San Jose, Calif., who pointed to Hive and other approaches. But Hive, too, brings complications.
Hive converts SQL-like statements into MapReduce programs, but as with other query systems, tuning is often required to gain the best available performance. Data joins are "not its strong suit," according to Mackles, who spoke at TDWI's BI Executive Summit 2013 this month in Las Vegas. Performance can falter for other reasons, as Hive is batch-only and working with MapReduce incurs startup costs on processing jobs, and subsequent processing overhead once jobs are running, Mackles said.
Among Hadoop "gotchas" are some that are operational and intrinsic. The Hadoop file system is not POSIX-compliant, so the practices for mounting the file system are less widely known. Hadoop splits up files into manageable units for parallel processing, but the development or operations team members sometimes have to "think about how [they] split up and compress files," Mackles said.
For getting started in learning Hadoop, Mackles said it is valuable to pick easier, manageable projects first. Mackles included parking files, Hive aggregation and experimentation as "slam dunk use cases" for Hadoop, while noting that operational applications that strive for quick, real-time performance are not good first stops.
Mackles enumerated a long list of interesting tools being created to help Hadoop users over the hump. These include YARN, Impala, HCatalog and Hadoop 2.0, which is in alpha release, and, in Mackles' words, "represents a major shift toward real time." They may all help, but navigating the vast new base of tools for Hadoop is one of the first problems that implementers face.