The Hadoop file system and framework have become "top of mind" technologies for organizations dealing with growing amounts of unstructured and semi-structured data. But for this new big-data kid on the block to succeed on a broad scale, it needs capable data management software tools.
Hadoop has found plenty of uses in Web applications, but much of the work done to date falls into the category of proof of concept. This is partly due to a lack of robust management capabilities. But that is changing: Deeper integration into operational workflows awaits Hadoop in the enterprise, as do better security and improved querying capabilities.
Clearly, a marketing push on Hadoop data management tools is under way. Recent announcements from IBM and Teradata, for example, look to show ways in which open source Hadoop can be tamed to meet the general operations needs of mainstream businesses.
The fact is that more than a few Hadoop projects to date have been sandboxed affairs, somewhat experimental in nature. We hear tell of one application where a Java-based Hadoop cluster was kept off the network and isolated in a locked room to which only a handful of people had access. Why? Because its operators could not vouch that the system held no sensitive data. If private information were to leak out, the corporation could be in for trouble.
The sandbox pattern has been seen before. Many a new technology goes through a stage where it's isolated for its own good. Parents have some degree of comfort when their children are playing in the sandbox -- at least, the number of things that can go wrong is more limited there. The same is true for IT and data managers tasked with deploying emerging technologies.
Hadoop as a staging area
Key parts of this week's Teradata Enterprise Access for Hadoop software rollout address needs for improved security, workload management and SQL access for the open source technology, Teradata marketing VP Steve Woolidge told me. Those capabilities point the way to better management of Hadoop framework implementations.
Woolidge doesn't embrace the notion of Hadoop as experimental technology for users, describing it as something akin to a data preprocessing area. "You can say 'experimentation' is happening," he said. "We see it instead as a staging area where people store large volumes of varied data types."
One of the advantages of Hadoop file data is that companies "can store it without transforming it," Woolidge said. But, he added, the next step is more complex: "It's easy with Hadoop to get information in, but it's hard to get out."
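To make the "easy in, hard out" point concrete, here is a minimal Python sketch of storing data untransformed and imposing structure only when it is read. It is an illustration only, not Hadoop or Teradata code; the record formats, field names and the `parse` and `clicks` helpers are all hypothetical.

```python
import json

# Hypothetical raw records, landed as-is: mixed formats, no upfront schema.
raw_lines = [
    '{"user": "a1", "action": "click", "ms": 120}',
    'a2,view,340',
    '{"user": "a3", "action": "click"}',
    'not a usable record',
]


def parse(line):
    """Impose structure at read time; return None if the line cannot be parsed."""
    line = line.strip()
    if line.startswith("{"):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            return None
        return {"user": rec.get("user"), "action": rec.get("action"), "ms": rec.get("ms")}
    parts = line.split(",")
    if len(parts) != 3:
        return None
    user, action, ms = parts
    try:
        return {"user": user, "action": action, "ms": int(ms)}
    except ValueError:
        return None


def clicks(lines):
    """Answer one question of the raw data: which records are clicks?"""
    return [rec for rec in map(parse, lines) if rec and rec["action"] == "click"]


click_users = [rec["user"] for rec in clicks(raw_lines)]
```

Loading was a matter of appending lines. All of the effort, including deciding what to do with missing fields and unreadable records, falls on whoever reads the data back.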
Teradata's new tools include a Smart Loader for Hadoop to allow business analysts (rather than Java jockeys) to provision Hadoop clusters and load jobs, plus software called "SQL-H" that they can use to query Hadoop data. "SQL-H is about making big data more manageable to the end user," Woolidge said. He sees security benefits as well: Putting a SQL-H layer on top of Hadoop means you can achieve row-level security for tables that are viewed in a Hadoop system, he said.
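The general idea of row-level security behind a SQL layer can be sketched in a few lines of Python. This is not SQL-H, whose syntax and internals are not described here; it uses Python's built-in sqlite3 module as a stand-in, and the table names, users and the `rows_visible_to` function are hypothetical.

```python
import sqlite3

# In-memory stand-in for a SQL layer sitting in front of stored data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, customer TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 'acme', 100.0),
        ('west', 'globex', 250.0),
        ('east', 'initech', 75.0);

    -- Hypothetical entitlement table: which regions each user may see.
    CREATE TABLE user_regions (username TEXT, region TEXT);
    INSERT INTO user_regions VALUES ('alice', 'east'), ('bob', 'west');
""")


def rows_visible_to(username):
    """Return only the sales rows whose region is granted to this user."""
    cursor = conn.execute(
        "SELECT s.region, s.customer, s.amount "
        "FROM sales AS s "
        "JOIN user_regions AS u ON u.region = s.region "
        "WHERE u.username = ? "
        "ORDER BY s.customer",
        (username,),
    )
    return cursor.fetchall()


alice_rows = rows_visible_to("alice")
bob_rows = rows_visible_to("bob")
```

Because every query passes through the layer, the filter is applied in one place rather than in each analyst's job. A user with no entitlements gets no rows back.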
Not just playing around anymore
Many companies are moving past the experimental phase with Hadoop, according to Bernie Spang, director of marketing for IBM's software group. That means Hadoop management capabilities are coming more to the forefront, he said in an interview after a recent IBM online briefing on its big data management strategy and technologies. "Now that they are applying Hadoop to business problems, people have to deal with the details," Spang said.
During the briefing, IBM unveiled a version of its PureData System that is optimized for Hadoop applications. The appliance is designed to streamline administrative workflow, provisioning and security for Hadoop-related efforts.
Hadoop will still have a place as the quick data-dicer it has become for some organizations, Spang said. Increasing enterprise adoption of the technology "doesn't mean the Hadoop systems that are more about sandboxing will go away -- that continues," he said.
But while the open source -- and thus, portable -- nature of Hadoop is a major selling point, the industry's general experience has been that enterprise software management tools reside mostly outside the realm of open source. Surrounding tools will likely be needed for many mainstream operations, and many of these will be commercial tools. Improvements in Hadoop data management are a trend to watch closely this year.
Jack Vaughan is SearchDataManagement.com's news and site editor. Email him at email@example.com.