

Embedded analytics to feel widest impact of machine learning projects

Ovum analyst Tony Baer discusses machine learning tools, IoT-driven streaming analytics and Hadoop in the cloud, all of which gained attention in 2016, with more likely in store for 2017.

Real use cases are what drive upticks in technologies, especially ones vying to be technology of the year. Ovum analyst Tony Baer knows this, and as he set about describing the big data trends to watch in 2017 for an Ovum report, he focused on the available evidence.

Baer concluded that machine learning projects, the internet of things (IoT) and real-time streaming analytics, all of which gained attention in 2016, will grab more of the spotlight in the year ahead. He expanded on those topics, and the growth of cloud-based Hadoop systems, in an interview. Particularly in streaming analytics, use cases and technology may be combining for a "perfect storm," he said.

You write that machine learning will be the biggest disruptor for big data analytics in 2017. Still, one wonders if machine learning projects will be limited to the top echelon of companies, or if use will be much broader than that.

Tony Baer: It's broad in that, in many cases, businesses and consumers are already using services that have machine learning embedded in them -- they just don't realize it. But in terms of how many companies have data scientists on board, ones that are writing or using machine learning algorithms and doing their own internal development, that will still be limited. That's even though there are libraries available for machine learning, so you no longer have to write the algorithms from scratch.

There are also emerging collaboration tools that are designed to connect the data scientist to the data engineer or the business. You're seeing an upswing of tooling, but largely, the appeal of that is going to be limited to those organizations that have very deep resources -- the same types of organizations, really, that were the pioneers with Hadoop.

Tooling is one thing. But it sometimes seems people don't realize machine learning projects require a learning phase that can be time-consuming and full of trial and error.

Baer: That's right. There's also an interesting thing going on. A few years ago, data science was the hot thing. Everyone wanted to be called a data scientist and wanted that on their business card. Now, the shiny new thing is machine learning, and so all these would-be data scientists want to jump on.


What they may be forgetting is step one: You really have to learn the data science. It's not synonymous with machine learning. It's synonymous with science, in that you are constantly testing hypotheses. It's the blocking and tackling of the scientific method. It requires a lot of patience and perseverance.

The spectrum in machine learning runs all the way from anomaly detection and clustering on one end to deep learning and cognitive [computing] at the very deep end of the pool. But you need to master data science before you can move on to machine learning, which includes advanced pattern recognition and many different approaches along that spectrum.
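The anomaly detection end of that spectrum can be illustrated in a few lines. This is a minimal sketch, not anything Baer describes: the sensor readings are made up, and the z-score threshold of 2.5 is an arbitrary choice for illustration.

```python
# Minimal anomaly detection sketch: flag readings whose z-score
# (distance from the mean in standard deviations) exceeds a threshold.
# Data and threshold are hypothetical.
from statistics import mean, stdev

def find_anomalies(readings, threshold=2.5):
    """Return readings that sit more than `threshold` standard
    deviations away from the mean of the sample."""
    mu = mean(readings)
    sigma = stdev(readings)
    return [x for x in readings if abs(x - mu) / sigma > threshold]

# A steady stream of ~10.0 readings with one obvious spike.
sensor_data = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 55.0, 10.0, 9.7, 10.3]
print(find_anomalies(sensor_data))  # the 55.0 spike stands out
```

Production systems would use robust statistics or model-based detectors rather than a single global z-score, but the "constantly testing hypotheses" discipline Baer describes applies either way: the threshold itself is a hypothesis to validate against labeled incidents.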

For machine learning, in the short term, the widest impact is going to be through capabilities that are packaged into analytics or applications, such as supply chain optimization, or the smart electric grid, or threat and fraud detection. It will be embedded in these applications. The headlines will talk about individual companies with courageous data scientists that are writing brilliant models. But, when it comes to broad impact, it is going to be via capabilities that are packaged under the hood.

You mentioned machine learning adopters being similar to Hadoop adopters. That technology has taken a while to germinate. Now, it seems bound for the cloud. At what pace do you think Hadoop can move to the cloud?

Baer: What I would call Hadoop is a multicomponent operating system. It's very much about mix and match, which made it hard to explain, and probably confused the market quite a bit. Now, in the cloud, it's even harder to explain because, when you go into the Amazon cloud, you may not be using [the Hadoop Distributed File System] -- you're probably using S3 (i.e., Amazon Simple Storage Service).
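The S3-instead-of-HDFS point can be made concrete with a configuration sketch. The fragment below is hypothetical: it assumes a Spark deployment with the Hadoop S3A connector (`hadoop-aws`) on the classpath, and the bucket name and credentials are placeholders.

```
# Hypothetical Spark settings for reading from Amazon S3 rather than
# HDFS, via the Hadoop S3A filesystem connector.
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=<YOUR_ACCESS_KEY>
spark.hadoop.fs.s3a.secret.key=<YOUR_SECRET_KEY>

# A job then points at an s3a:// path where it would have used hdfs://
#   df = spark.read.parquet("s3a://my-bucket/events/")
```

This is exactly the mix-and-match quality Baer describes: the storage layer is swappable, so "Hadoop in the cloud" often means the processing engines without the Hadoop Distributed File System underneath.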

Hadoop wasn't born to be on the cloud, but that is going to be the key adoption trend. From my conversations with vendors, it seems that a year ago, 15% to 20% of new workloads were going to the cloud. Now, it's one-third. And I'm basically expecting that we're going to hit the 50% mark for new workloads in 12 to 18 months.

It's fair to say data streaming bears a resemblance to complex event processing (CEP), in which the emphasis was somewhat on the "complex." We're dealing with different events these days, mostly things like cell phone activity and clickstreams. But are things really different this time?

Baer: Complex event processing was a solution looking for a problem -- well, except in some specialized cases, like financial services, where bleeding edge is part of what they do, part of how they compete. But now, we have the perfect storm.

That's because infrastructure has become more accessible and inexpensive, especially with the cloud. With CEP, when you worked on a small number of events, it wasn't very compelling. But when you can scale out on infrastructure like we now have, it becomes a viable idea. IoT alone is really moving this.

There are use cases that use IoT and have real value. IoT is increasing the urgency for real-time streaming analytics. Examples include anything that involves physical movement of things, whether it be supply chains, network optimization or smart cities and the like. Or, for example, any equipment working out in the field -- that is, asset management and fleet management. Events such as clickstreams are drivers, too. There are all these use cases that are tangible and actually have clear business value.

We have more smart devices out there that are generating real information. That's ultimately what's driving streaming analytics, and the technology is a mix of open source and proprietary. With CEP, on the other hand, the processing was expensive. The few tools that were out there were proprietary and required very specialized skills. With open source, the barriers to learning and experimenting come down. It's kind of a perfect storm in that all those things are happening.
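The kind of aggregation a streaming pipeline runs over clickstream or IoT events can be sketched in a few lines. This is a toy single-process illustration with hypothetical events and an arbitrary window size; real deployments would use a distributed streaming engine rather than an in-memory class.

```python
# Minimal streaming-analytics sketch: count events per key over a
# sliding time window, evicting events as they age out.
from collections import deque

class SlidingWindowCounter:
    """Hold (timestamp, key) events and tally keys inside the window."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key), oldest first

    def add(self, timestamp, key):
        self.events.append((timestamp, key))

    def counts(self, now):
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        tally = {}
        for _, key in self.events:
            tally[key] = tally.get(key, 0) + 1
        return tally

# Hypothetical clickstream: timestamps in seconds, 60-second window.
counter = SlidingWindowCounter(window_seconds=60)
counter.add(0, "page_view")
counter.add(10, "click")
counter.add(70, "click")
print(counter.counts(now=75))  # events at t=0 and t=10 have expired
```

The scale-out point in the interview is precisely that this logic, trivial for a handful of events, only becomes interesting when the same windowed aggregation runs over millions of device events per second on elastic cloud infrastructure.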
