Users have always wished for less dependence on vendors. However, that comes at a price. As they navigate the twists and turns of today's big data ecosystem, they take on responsibilities that were once the vendors', at least to some degree.
The new style of data engineering calls for a heaping helping of DevOps, the extension of Agile methods that asks developers to take more responsibility for how innovative applications perform in production. At the same time, engineers must learn new software at a breathtaking pace.
Of late, users have had to connect the dots represented by a steady parade of open source tools. This might be called the curse of interesting times.
Teams have also had to switch components in and out at a fairly swift clip. The most vivid example: many early adopters had to create MapReduce-based Hadoop applications, only to redo them later using the Spark processing engine.
Streaming in flux
Other examples of component hopping abound, with a variety of young open source offerings to sort through for Hadoop SQL querying tools, machine learning and other capabilities. Telling examples have emerged from the open source data streaming space, which is evolving along with a new class of real-time systems that go beyond batch processing.
In streaming, the tools are in more than a little flux. Early contender Apache Storm took a back seat to Apache Spark, which, in turn, found Apache Flink nibbling at its heels -- and all of this occurred in just a few quick years.
This is the very nature of modern data engineering, according to no less a personage than Hadoop co-creator Doug Cutting, chief architect at Cloudera. Today, people have to be ready to experiment with software components, he said.
In fact, it is not hard to find shops that have worked with several streaming architectures, involving a lot of on-the-job learning. As Spark moves to add record-at-a-time streaming via a recently announced Drizzle add-on, more learning will be in the offing.
This anticipated Spark update comes as teams still work to find out how streaming itself works in the context of a larger big data ecosystem. It was clear in technical sessions at last month's Spark Summit East conference that a whole lot of tuning was happening.
What that means is data engineers are figuring out ways to monitor streaming apps to stay ahead of processing failures. They are finding out how the components work in different combinations. Such application hardening is a big part of moving from proof of concept to production. End users are now a part of this quest, just as much as vendors.
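One piece of that hardening work is tracking whether each processing batch keeps up with the rate of incoming data. The sketch below is a minimal, hypothetical illustration of that idea in Python, assuming a micro-batch system with a fixed batch interval; the `BatchStats` and `StreamMonitor` names are invented for this example and do not correspond to any particular streaming framework's API.

```python
from dataclasses import dataclass, field

@dataclass
class BatchStats:
    """Processing stats reported for one micro-batch of a streaming job."""
    batch_id: int
    records: int
    duration_s: float  # wall-clock time taken to process the batch

@dataclass
class StreamMonitor:
    """Flags batches that take longer than the batch interval.

    When processing time exceeds the interval, the job is accumulating
    lag and, left alone, will eventually fail or drop data -- the kind
    of condition engineers want to catch before it reaches production.
    """
    batch_interval_s: float
    alerts: list = field(default_factory=list)

    def record(self, stats: BatchStats) -> None:
        if stats.duration_s > self.batch_interval_s:
            self.alerts.append(
                f"batch {stats.batch_id} lagging: "
                f"{stats.duration_s:.1f}s > {self.batch_interval_s:.1f}s interval"
            )

# A batch that keeps up produces no alert; one that falls behind does.
monitor = StreamMonitor(batch_interval_s=5.0)
monitor.record(BatchStats(batch_id=1, records=10_000, duration_s=3.2))
monitor.record(BatchStats(batch_id=2, records=42_000, duration_s=7.8))
print(monitor.alerts)
```

Real systems expose the same signal through their own metrics (scheduling delay, processing time per batch), but the principle is the same: alert when processing time crosses the batch interval, before the backlog becomes a failure.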
Be careful what you wish for
Think back: The vendors were in the driver's seat in the days when they acted as the sole source of innovation, and the user went along for the ride. General releases came when the vendor was ready, or so it seemed. Vendors may still drive much in the way of big data application implementation, but, like users, they are often riding a tiger.
The fact is that at least some of the lag time in product releases back in the day had to do with vendors making software ready for prime time -- that is, reliable enough to place in production. It is not coincidental that open source big data applications today find the move from proof of concept to production a difficult bridge to cross.
"Be careful what you wish for" is so common an adage that it has become suspect. But it bears repeating as data shops reach for ground-shifting new open source applications, which demand great doses of innovation in return.
For big data engineering to continue to move forward, teams will need to pursue in earnest the tenets of DevOps, or what some now call DataOps -- especially the principles that call for data engineers and IT architects to take responsibility for moving innovative ideas into production. As always, a zest to make it new will help, too.