The Hadoop framework is just beginning to edge into regular enterprise operations in many companies, and some Hadoop best practices for production are only now arising. A view into such practices emerged in a panel on the subject last month at the Strata Conference in Santa Clara, Calif.
How you stage rollouts and how Hadoop and its associated MapReduce programs fit into existing operations become important when moving to production, a data architect at a leading networking company told Strata attendees.
"When you move to production, the focus has to be, 'Can I deliver this efficiently?'" said Piyush Bhargava, chief architect for data architecture and innovation at Cisco, based in San Jose, Calif. He added that enterprise Hadoop efforts have to start with architecture.
According to Bhargava, developing Hadoop as part of an overall information plan allows Cisco to more effectively support use cases for Hadoop that go beyond the first few original test cases and to manage Hadoop so that the conversation with business leaders is about getting value out of the data, not about code patches and node failures.
At Cisco, he and his colleagues have started on a journey to create an enterprise Hadoop platform. The first use case was data warehouse offloading -- where data processing came at "one-tenth the cost" of existing systems. Subsequent Hadoop use cases have been in marketing, especially for bringing together offline and online customer experiences, he said.
"We started small, but over the last two years [Hadoop use] has been growing exponentially," Bhargava said. That has required bringing outlying Hadoop efforts into central IT.
He said Hadoop today has similarities with ERP in the 1990s. It is likely to be the center for analytics for years to come, so it needs to be integrated with all parts of the organization.
As a result, effective workload management becomes a necessary Hadoop best practice. Hadoop has to be managed at the level of entire clusters, not just single batch jobs, according to Bhargava. For its part, Cisco has set up a data management schedule that incorporates Hadoop work, as well as loads from traditional data warehouses and other systems.
Other session panelists pointed to further best practices for Hadoop in production. Cloud architecture and staffing were among the key points addressed.
Setting up the right team is crucial for Hadoop as it is for other endeavors. Team composition may look different from more recent data undertakings, but it may mirror much earlier efforts.
Because Hadoop is very much about pairing computation with data, it could mean returning to some mainframe-era roots, according to Scott Russom, who is director of software engineering for managed security services provider Solutionary in Omaha, Neb.
"Now, on my database team, I look for people with a processing mindset," he said, adding, "COBOL programmers who have been retooled in [MapReduce] are powerful; they understand data processing."
Meanwhile, cloud can be a path to successful Hadoop production. An architecture incorporating on-premises cloud and public cloud computing was employed as part of Hadoop efforts at San Francisco-based The Climate Corp., according to Strata panelist Andrew Mutz, who is the farming risk analysis company's director of engineering.
Mutz said on-premises Hadoop clusters allowed Climate Corp. teams to quickly experiment and in turn discover useful climate simulations, while learning how to scale reliably. His team can then move Hadoop processing to the cloud for easier maintenance.
"The combination of in-house and cloud-based makes us feel pretty good," he said. "It has required working on where the data comes from. You have to understand where the latency really lives."
For Cisco's Bhargava, the best practices for Hadoop production management come back to the blueprint. "Very often, you go to a conference and see cool things, but at the end of the day, you need to make sure it is scalable. You must have a blueprint for how it is going to grow," said Bhargava, who described himself as a technology "late adopter."
The companies represented on the Strata panel all opted for the Hadoop distribution from MapR, based in San Jose, Calif., which has focused especially on the management of Hadoop in the enterprise. The company was early to offer an alternative to the standard Hadoop Distributed File System (HDFS), in part to improve enterprise manageability.
Cool tools: Hive, Accumulo, Giraph, more
Strata panel moderator and Forrester analyst Mike Gualtieri likened the leap to Hadoop production to steps other technologies have had to take. He marked security, scalability and high availability among production qualities that must be achieved.
He noted that Hadoop is still in its early stages. Gualtieri said a recent Forrester survey showed only 16% of respondents had Hadoop in production. But many are moving in that direction, he said.
In the main, Hadoop is still in something of the "cool tools" phase dominated by the enthusiast more than the mainstream implementer, at least as described by Strata speaker Geoffrey Moore.
The author of the recently updated and re-released 1991 best-seller Crossing the Chasm, which became a roadmap for many entrepreneurs trying to navigate technology adoption lifecycles, pointed to Hadoop software ecosystem tools such as Hive, Accumulo, Giraph, Cassandra, Spark and others, saying, "If you can't have fun with these words, you are not a data enthusiast."
Moore said Hadoop today is still very much dominated by "project sponsorship from visionary leadership." But, he said, the clock is ticking as observers look for evidence of marquee enterprise use cases.