Using big data platforms for data management, access and analytics
As data management and business intelligence options multiply, setting a course isn't getting easier for IT teams. Measuring the immediate and long-term impact of those options is John Myers' job. As managing research director for BI and data warehousing at Enterprise Management Associates Inc., Myers keeps close tabs on cloud technologies, Spark and the different types of databases now available. In an interview with SearchDataManagement, he said a key trend these days sees users moving to an architecture that allows different platforms to work to solve the data processing problems for which they're best suited.
Isn't the variety in workloads and workload mechanisms popping up today astounding?
John Myers: What we are really seeing is the emergence of a hybrid data ecosystem. We don't subscribe to the idea that a single data management platform can meet all the processing and data management needs you may have. People are looking at Hadoop and NoSQL entries such as Mongo and Cassandra.
We might throw analytics engines like Apache Spark or different types of databases in there, too.
Myers: Well, I would say Spark is much more of a processing engine than a data management platform.
Basically, when I think of a data management system, it has to meet the ACID [criteria], and part of that is durability. Spark is a nice processing engine. But it still needs to have that durability component to go along with it. Spark has to live somewhere. It has to leave its material somewhere. It's growing and getting better at what it does, and I don't know [if] you could ever ramp up MapReduce and YARN to get to where Spark is going to be. It's a great platform to start going toward, but it's only two or three years old. In that sense, it has a lot of work to do to learn a lot of things other engines have done for quite some time.
It has a lot of opportunities, but it is also very young in its maturity. For certain use cases, Spark works really well. But for some others, when you get it up and going, Spark will actually run slower than some other processing engines. It is especially dependent on the types of questions you are asking it. That's true for any platform -- it all depends on what you ask it.
Going back to relational databases and things of that nature, if you want to ask [a relational database management system] to add, subtract, multiply or divide, it'll do that all day long. That's what it's been trained to do for 40 years.
On the other hand, if you ask a relational database to do a graph analysis, something like what a graph database like Neo4j or an Objectivity [InfiniteGraph] can do, it's difficult. You have to ask the relational database to do a very recursive join, which is something that it doesn't like to do because, frankly, it wasn't designed to do that.
Whereas, with the graph database, if you ask it to do a graph analysis, if you say, 'Tell me who is the friend of a friend of a friend,' it'll say, 'Here you go, here's a list, have a nice day.' But if you ask a graph database to add, subtract, multiply and divide, it gets a little upset.
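To make the contrast concrete, here is a minimal sketch of the 'friend of a friend of a friend' question asked two ways. It is an illustration only, not anything Myers showed: the friendship data, the `friend` table and both function names are hypothetical, SQLite stands in for a relational database, and a plain in-memory adjacency map stands in for a graph database such as Neo4j.

```python
import sqlite3

# Hypothetical friendship pairs, invented for illustration.
EDGES = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]


def friends_within_sql(start, hops):
    """Relational approach: a recursive self-join over an edge table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friend (src TEXT, dst TEXT)")
    # Store each friendship in both directions so the join can follow either way.
    conn.executemany(
        "INSERT INTO friend VALUES (?, ?)",
        EDGES + [(b, a) for a, b in EDGES],
    )
    rows = conn.execute(
        """
        WITH RECURSIVE reach(person, depth) AS (
            SELECT ?, 0
            UNION
            SELECT f.dst, r.depth + 1
            FROM reach AS r JOIN friend AS f ON f.src = r.person
            WHERE r.depth < ?
        )
        SELECT DISTINCT person FROM reach WHERE person <> ?
        """,
        (start, hops, start),
    ).fetchall()
    conn.close()
    return sorted(name for (name,) in rows)


def friends_within_graph(start, hops):
    """Graph-style approach: walk neighbor links outward, one hop at a time."""
    adjacency = {}
    for a, b in EDGES:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, frontier = {start}, {start}
    for _ in range(hops):
        # Next frontier: neighbors of the current frontier not yet visited.
        frontier = {n for p in frontier for n in adjacency.get(p, ())} - seen
        seen |= frontier
    return sorted(seen - {start})


print(friends_within_sql("ann", 3))    # ['bob', 'cat', 'dan']
print(friends_within_graph("ann", 3))  # ['bob', 'cat', 'dan']
```

Both versions return the same list. The relational version has to re-join the edge table against its own intermediate results at every hop, which is the recursive join described above, while the graph-style version simply follows neighbor links.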
You find people wondering which of these platforms they should pick. But what I would emphasize is that there is more than enough room for multiple platforms.
How do you see the business side reacting to the new state of big data analytics?
Myers: Business stakeholders are intrigued with what can happen with big data analytics. Our research over the course of the last five years shows that big data projects are almost always aligned with raising revenues, limiting costs or improving margins.
We find up-sell opportunities [are] a significant chunk of projects. Another one is risk mitigation, either in the form of risk analysis or fraud detection management. The business stakeholders are getting the value and are driving those projects.
The fact is, IT people can load up Hadoop with data, but then they have to ask what to do with it next. At the same time, business people don't necessarily say, 'Give me the customer data that sits in Hadoop versus the customer data that sits in our enterprise data warehouse or in our operational system.' Instead, they say, 'Give me the customer data.'
So, it's the job of the IT teams to take event-level or behavior data, such as clickstream data from an online or mobile application that is probably stored in a Hadoop platform, and take curated data from a data warehouse and correlate those two so you can really get value.
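As a rough sketch of what that correlation step might look like, the example below joins event-level clickstream records to curated customer records on a shared customer ID. Everything in it is assumed for illustration: the field names, the sample records and the `customer_view` function are hypothetical, and in practice the two inputs would be read from a Hadoop platform and a data warehouse rather than defined inline.

```python
from collections import Counter

# Hypothetical event-level data, standing in for clickstream stored in Hadoop.
clickstream = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/upgrade"},
    {"customer_id": 2, "page": "/home"},
    {"customer_id": 3, "page": "/upgrade"},
]

# Hypothetical curated records, standing in for a data warehouse table.
warehouse_customers = {
    1: {"name": "Acme", "plan": "basic"},
    2: {"name": "Globex", "plan": "premium"},
}


def customer_view(events, customers):
    """Attach a page-view count from the event data to each curated record."""
    clicks = Counter(event["customer_id"] for event in events)
    return [
        {"customer_id": cid, **record, "page_views": clicks.get(cid, 0)}
        for cid, record in sorted(customers.items())
    ]


for row in customer_view(clickstream, warehouse_customers):
    print(row)
```

The sketch keeps only customers that exist in the curated set, so the events for customer 3 are dropped. Whether unmatched events should be kept, dropped or flagged is a design choice that depends on the question the business is asking.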
Is it fair to say that where big data and these different types of databases are moving us is to a place where we can put the clickstream data together with curated data so we can get such things as better margins, good cross-selling, better risk mitigation and so on?
Myers: Yup, exactly. But business people don't say, 'Let's use big data analytics.' Instead, they go, 'Let's expand the scope of the information we can look at for our customers.'