One of the signal traits of big data these days is an abundance of fresh data engines and data stores. In 2015, a wide variety of new data processing components were front and center, in the realms of NoSQL databases and Hadoop clusters. Keeping up with changes, tracking critical and not-so-critical updates, and successfully navigating the labyrinth of new systems has become one of the most pressing challenges for data management professionals. A look at the pages of SearchDataManagement in 2015 finds ample news and trend stories on the volume, variety and velocity of distributed open source architecture advances.
NoSQL databases show new sides
NoSQL software has opened up database options considerably in a field that previously had for all intents and purposes been boiled down to a handful of relational database vendors. But NoSQL variations are continually evolving, a process that continued apace this year and that bears watching again in 2016.
During 2015, for example, MongoDB introduced a new core data engine for its namesake NoSQL database and said it would support a variety of pluggable storage engines, bringing a flexible approach that should be familiar to developers experienced in the MySQL database world. MongoDB's new default engine, called WiredTiger, is intended, among other things, to address locking issues that sometimes dogged the original MongoDB platform. Another key NoSQL player, DataStax, which offers a commercial version of the Cassandra database, is working with the Cassandra open source community to release an engine update in 2016.
Of course, the single most defining trait of NoSQL databases is their lack of SQL -- or, in some cases, their only-partial support for the standard relational programming language. Yet, some NoSQL customers are finding that SQL-like traits are useful as part of an overall data workflow. And vendors are starting to go with that flow. This year, Couchbase brought its N1QL query language tools -- pronounced Nickel -- to market with the purpose of creating a SQL-like environment for working with the company's NoSQL database management system.
SQL adaptations for Hadoop flower
While the "SQL-ization" of NoSQL technologies may just be beginning, the crush of SQL adaptations for Hadoop has moved well along. SQL-on-Hadoop query engines such as Hive, Impala and Presto have been lining up for years, and several saw new releases in 2015. Such tools could help ensure that the information in Hadoop data lakes can be easily accessed and utilized -- making them more like data refineries and less like data swamps. But SQL on Hadoop continues to be the province of early adopters, working to uncover which tools work well with their different interactive jobs. And there's a lot riding on user successes, as Hadoop could be relegated to the fringes of many enterprises if it can't tap the skills of the legions of workers versed in SQL.
Vendors and open source contributors also continue to add new technologies to the core of Hadoop. In October, for example, Hadoop distribution provider Cloudera added the Kudu columnar data store to the mix. It's meant to work with Impala, the company's MPP SQL-on-Hadoop engine, in real-time analytics applications involving fast data inserts and updates. Depending on your view, Kudu is a complement or an alternative to the Hadoop Distributed File System that has long served as the Hadoop data storage workhorse.
Spark engine rolls onto big data block
Perhaps the biggest of the new kids on the big data block is Apache Spark. It was apparent as early as 2013 that the data processing engine, created at a UC Berkeley computer science lab, had the potential to replace Hadoop's original MapReduce engine in existing batch jobs while also supporting new near-real-time analytical uses. While MapReduce continues to find new users, especially among those looking for substitutes for traditional data warehousing transformation and load applications, it could be said that MapReduce has been reduced in the eyes of many Hadoop users who are eyeing Spark for faster processing.
2015 was notable for startup Databricks' formal entry into the big data competition with its cloud-based Spark offering, which became generally available in June. The company is led by some of Spark's originators from Berkeley. Rather than push on-premises Spark engines, Databricks decided -- at least for now -- to bet on the cloud as the delivery means for Spark via its Databricks Cloud platform. IBM was among many other vendors that pushed Spark efforts into overdrive: It was busy training Spark developers, embedding the Spark technology as a part of numerous products and, like Databricks, offering a cloud-based version of the engine.
In addition, an emerging crop of data integration startups utilized Spark and its machine learning libraries to "teach" systems repeatable integration steps. Among these self-service data preparation startups were some that had turned first to MapReduce, but then opted for Spark instead.
Vendors and end users will need to be fleet of foot going forward, ready to embrace newer data engines, while still keeping an eye on alternatives in the works. All of these technologies promote a new model of processing data that brings potential opportunities, but challenges as well, to the data professionals who must make the architectural decisions that bring organizations into the future.