Data integration must take a new, more curatorial tack in the age of big data, according to Michael Stonebraker, a database development pioneer, MIT professor and serial entrepreneur whose latest venture is integration software startup Tamr Inc. But other participants at a data management conference held at MIT last week contended that new integration methods are already in play.
Stonebraker, speaking at the 2015 MIT Chief Data Officer & Information Quality Symposium, called for the use of emerging data curation tools backed by machine learning algorithms -- for example, Tamr's technology -- to help cope with the rise in data volume and variety that many businesses are experiencing as they deploy big data systems.
Methods relying on global data schemas and traditional extract, transform and load (ETL) tools can't adequately cope with the scale of big data applications or the variety of the data types they involve, the long-time data management industry player asserted.
"In the past, everyone considered ETL the gold standard. You would extract, transform and load to a common global schema," said Stonebraker, an adjunct professor in MIT's computer science department, as well as Tamr's CTO and co-founder, among other industry positions. But now, he said, "a global data model is a fantasy" for organizations.
The prime-quality global data schema was part of data warehousing's early rocky ride, in Stonebraker's estimation. "People thought they would build a global enterprise data model and everybody will use it," he said. "One hundred percent of them failed." Related ETL approaches proved to be labor-intensive, unmanageable and non-scalable, added Stonebraker, who in March received the Association for Computing Machinery's prestigious A.M. Turing Award for 2014.
As described by Stonebraker and others, data curation focuses on a streamlined process of discovering data sources of interest, cleaning and transforming the data, and semantically integrating it with other data sets before delivering a deduplicated composite result to data consumers within the organization. Tamr, an MIT research spin-off, based in Cambridge, Mass., offers what it calls a "data unification platform" for cataloging and curating large amounts of information. The company is joined in the new data integration scramble by rival vendors such as Paxata, Trifacta Inc. and WorkFusion.
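To make the curation steps concrete, here is a minimal Python sketch of the clean, transform, integrate and deduplicate sequence described above. It is an illustration only, not Tamr's or any other vendor's method: the two sources, their field names, the `FIELD_MAPS` table and the exact-match rule on email are all hypothetical, and a simple hand-written rule stands in for the machine learning matching that Stonebraker advocates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    email: str


# Hypothetical source data: two systems describing customers with
# different field names and inconsistent formatting.
crm_rows = [
    {"full_name": " Ada Lovelace ", "email": "ADA@example.com"},
    {"full_name": "Alan Turing", "email": "alan@example.com"},
]
billing_rows = [
    {"customer": "ada lovelace", "mail": "ada@example.com "},
    {"customer": "Grace Hopper", "mail": "grace@example.com"},
]

# Assumed per-source mappings onto a small shared record shape.
# Each source is mapped as it is added; no global schema is required up front.
FIELD_MAPS = {
    "crm": {"full_name": "name", "email": "email"},
    "billing": {"customer": "name", "mail": "email"},
}


def clean(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return " ".join(value.split()).lower()


def transform(row: dict, source: str) -> Customer:
    """Rename a source row's fields and clean each value."""
    mapping = FIELD_MAPS[source]
    fields = {target: clean(row[src]) for src, target in mapping.items()}
    return Customer(**fields)


def curate(sources: dict) -> list:
    """Integrate all sources and keep the first record seen per email."""
    seen = {}
    for source, rows in sources.items():
        for row in rows:
            record = transform(row, source)
            # Rule-based deduplication: identical cleaned email = same entity.
            seen.setdefault(record.email, record)
    return list(seen.values())


curated = curate({"crm": crm_rows, "billing": billing_rows})
for customer in curated:
    print(customer)
```

In this sketch the two differently formatted entries for the same person collapse into one record, which is the "deduplicated composite result" in miniature. Real curation tools face fuzzier matches, such as misspellings or missing keys, which is where statistical matching comes in.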
Different views on the podium
Other presenters and participants at the MIT event took a somewhat different view, saying that what Stonebraker called "the bondage of the schema" is already well on the wane.
Sharing the podium with Stonebraker was John Talburt, chief scientist at consultancy Black Oak Analytics in Little Rock, Ark. Talburt agreed that a global data model is "unachievable," and that many early data warehouses were failures. But he emphasized that few practitioners now pursuing data models aspire to be so completely encompassing.
"It's not a magic bullet -- you don't have to integrate data universally," Talburt said. He added that data management professionals have learned there are gradients of data quality that can be employed, based on different factors. For example, a company's data of record would be treated differently than social media data used to assess customer sentiment.
For Joe Caserta, president at Caserta Concepts LLC, a data warehouse consulting and training company in New York, talk of a central, all-encompassing data model recalls arguments heard in the early days of data warehousing and data marts.
Mr. Inmon, meet Mr. Kimball
Back then, the battle was engaged between a top-down data modeling theory championed by consultant Bill Inmon and others, and bottom-up methods espoused by consultant Ralph Kimball and his kindred spirits. To simplify the debate somewhat, the former called for designers to build a monolithic data warehouse that would set the stage for data marts, while the latter called for them to build dispersed data marts that set the stage for a central data warehouse.
Caserta co-authored The Data Warehouse ETL Toolkit with Kimball, so his preference for the bottom-up approach isn't surprising.
"Today, we use 'model storming' to very quickly figure out the business processes, the connections and the dimensions of data," he said. "We think of it broadly, with a central hub in mind." But the central data hub appears in iterations, and data modelers and enterprise architects understand it may never be fully achieved, he added.
"You need to think about the entire enterprise, yes -- but you need to plan large, but build small," Caserta said. And when you do that, you have to make sure you don't just create data silos, he noted, describing isolated data marts that may be replications of existing systems. "There is a way to come up with something of a central approach, but in an iterative way, without waiting for this big, monolithic project to be finished."
Data: The Day One problem
Technologies have evolved over the years, and ETL is one of them, said Murthy Mathiprakasam, principal product marketing manager for big data tools at data management vendor Informatica Corp. Reflecting on Stonebraker's remarks on data curation in an interview, he said that processes "have become a lot more agile, and there is a lot more collaboration between IT and the business" in many user organizations.
"It's inevitable that data is going to be all over the place these days," Mathiprakasam said. "It's unlikely that you can put everything in neat rows and columns. But you can still achieve centralized understanding of the data. You can have mechanisms to identify what data represents throughout the enterprise."
The bottom line, though, is that "the world of a heavily regulated monolithic schema no longer exists. That is not the world of 2015," he said.
A top data manager at The Bank of New York Mellon Corp. espoused a similar position at the MIT event.
"You don't build warehouses that require you to model the world from Day One," said Rajendra Patil, head of data strategy at New York-based BNY Mellon, speaking during a conference session that considered data responsibilities in financial services companies.
Patil echoed the critique of the data warehouse, and its long-running effects on the IT budgets of companies. "They've spent millions of dollars on data warehouses, and it has only worked to a certain extent," he said.