Look for advances in serving the Hadoop data scientist

To date, the Hadoop data scientist has had to be a superhero, with a mix of data engineering, administration and analysis skills. Self-service tools will change that, but not overnight.

Hadoop and its proponents have been stymied in bringing the distributed processing framework to wider use, because getting the style of big data applications it supports up and running requires specialized end-to-end skills.

However, changes are afoot. The original incarnation of Hadoop, based on MapReduce and the Hadoop Distributed File System (HDFS), has given way to interchangeable configurations that may use neither MapReduce nor HDFS. Cloud-based Hadoop is on the upswing. And software vendors are increasingly bringing self-service capabilities to the Hadoop data scientist, for preparing data, tuning analytics algorithms and other tasks.

To some extent, the Spark data processing engine was a reaction to Hadoop's initial complexity. Spark set out to improve upon Hadoop's MapReduce data processing model, and also raised the level of abstraction for programmers. Building systems still required Java programmers to do much of the work, but they didn't have to deal with the down-and-dirty camshafts and flywheels of fairly raw Java. A similar motivation lay behind the tools created to open up Hadoop to a wider programming audience via SQL.

Yet, the problem remained: Hadoop and Spark in production called for gifted users with skills covering a range of jobs -- including system administrator, system programmer, Java developer and data engineer. Why not throw in domain expert, statistician and Hadoop data scientist, too?

Data superheroes

What became apparent through the first years of Hadoop was that, while data scientist was becoming a very hot job title, Hadoop wasn't a particularly good fit for the mere data scientist. That limited the new software's use because the superheroes whose data skills spanned the continuum from the engineering side to the science side were few and far between.

The first issue the data scientist faces is getting access to adequate data processing infrastructure. As more data scientists, power users and the like want to tap into Hadoop and Spark, they find themselves in a contest for allocation of system resources. As a result, there are many configuration issues that someone has to solve.

That isn't the end of the problems, however. Once users get the required access, they have to learn to tune jobs to run efficiently on specialized infrastructure -- something that probably wasn't part of their original job description. And data preparation, from finding relevant information to readying it for analysis, often consumes the majority of an analytics team's time.

Various vendors are working to address these issues. The moves they make could be viewed as steps on the road to self-service.

Earlier this month, Spark originator Databricks launched Databricks Serverless Pools into beta to reduce resource management overhead and provide data scientists with easier infrastructure access.

That followed the May release of Hadoop distribution provider Cloudera's Data Science Workbench, which uses container technology to isolate data scientist jobs for mounting on Hadoop infrastructure. The offering is based on technology Cloudera gained last year through its purchase of Sense and that boutique firm's data collaboration tools.

Hadoop rival Hortonworks is pursuing a similar tack via partnership rather than acquisition. Last week, as part of an IBM-Hortonworks deal most noteworthy for IBM's pledge to adopt Hortonworks' Hadoop distribution, Hortonworks announced that it would resell IBM's Data Science Experience, a bundle also intended to help data scientists more easily avail themselves of big data infrastructures.

These companies aren't alone, as a variety of players are addressing the issues. Datameer, Domino Data Lab, Paxata, Pentaho, Platfora, Trifacta and other vendors all offer tools that also provide more self-service capabilities to data scientists.

Pardon my morphing

In the background of all this, major big data conferences have begun peeling the name Hadoop from their titles, with the Cloudera-affiliated Strata + Hadoop World morphing into the Strata Data Conference and the Hortonworks-led Hadoop Summit becoming the DataWorks Summit.

By the time open source Hadoop distributed data processing becomes truly self-service and widely used, the big data industry will probably have come up with more than a couple of new names for it.

What is clear is that the goal of big data processing now is self-service for as many users as possible. The new workbenches and serverless pools head in this direction, joining data preparation tools that had already started going that way. But there is much more work to do before the Hadoop data scientist can push the all-purpose automated self-service button. Stay tuned.
