- Jack Vaughan, Senior News Writer
The beneficial connection between preparation and opportunity has been noted by more than a few sages, from the Roman philosopher Seneca to American self-help entrepreneur Tony Robbins. But data preparation's role in the opportunity known as big data analytics is often underappreciated, if not overlooked completely.
The data preparation process can be a stumbling block that stands between advanced analytics technologies and the business benefits organizations hope to get from them -- increased revenues, more efficient operations, better decision-making and more. And as big data environments proliferate, the work involved in integrating and preparing data for analytics uses is changing in some notable ways.
On the front end, there are more -- and more diverse -- sources of data with which to work. All that variety spices up the big data analytics stew, but it upends traditional data pipelines. The days of a one-directional data flow into an enterprise data warehouse are ebbing; in the big data world, data often needs to move back and forth between data warehouses, Hadoop clusters, Spark systems and other platforms to support different analytics applications.
There's also more variety on the back end, more specifically, in the needs of the people using the data. For example, data scientists are likely to want access to raw data so they can filter it as needed to support particular predictive analytics or machine learning applications. That creates more data preparation steps to navigate than a typical business analyst would require.
To meet those increasingly complex needs, one data pipeline may support a lot of automation as part of the data preparation process, while another may have to be structured to enable data scientists to play around with data in the analytics equivalent of a sandbox, walled off from the main data store or set up on a separate system. For IT teams, that means incorporating a mix of capabilities into data workflows to ensure that different analytics users can access the right information for what they're looking to do. And that's not always easy to accomplish.
Data challenges not just a size issue
"Today, you see the rise of big data -- but it's not just about the size of the data, it's also about multiple data sources," said Jason Brannon, supervisor of data architecture at insurance company USAble Life in Little Rock, Ark.
USAble offers term life policies and a variety of medical insurance options meant to supplement primary coverage in case of accidents, serious illnesses and other health issues. The insurer's operations require accurate synchronization of insurance enrollment information between the company, its Blue Cross and Blue Shield business partners, and its customers. From a data processing and analysis standpoint, "the requirements are increasing every day," Brannon said.
The variety of data requiring preparation for analysis is also an issue for Brock Christoval, founder and CEO of Flyspan Systems Inc. in Irvine, Calif. His company is building a data analytics platform for the commercial drone industry. The system, called FlyView, is being set up to pull in a wide range of sensor data from drones via the internet of things (IoT). Analytics teams at companies with fleets of drones will then be able to trawl the data both to analyze drone activity in the field and to support predictive maintenance on the devices.
Data prep software options grow
Established data management vendors like IBM, Informatica, SAS, Syncsort and Pentaho offer tools to help users handle the increasing data deluge. But the acute need to take in diverse data and ready it for different uses has given rise in recent years to a group of new vendors that focus on aspects of those issues through self-service data preparation software and other technologies. The contenders include Alation, Alteryx, Attivio, Datameer, Looker, Paxata, RedPoint Global, Tamr and Trifacta.
The data management team at USAble Life uses Pentaho's namesake software, which Brannon said has been particularly helpful in reducing work related to scripting extract, transform and load (ETL) integration jobs. For example, a so-called metadata injection capability that Pentaho added in April 2016 automates ETL and data preparation steps, speeding repetitive workflows.
The software frees USAble's developers from "having to support one form of scripting for file manipulation and another for ETL -- it puts all the dependencies in one place," Brannon explained. "As a result, it simplifies our development process, and allows us to be much more reactive to demands for data." Cutting time out of the data integration and preparation cycle is especially important to him because he has to meet the analytical data needs of both internal and external users.
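The general idea behind metadata-driven ETL can be illustrated with a minimal Python sketch. It is not Pentaho's API or USAble Life's code. The feed names, column names and the `run_template` helper are all hypothetical, chosen only to show how one generic transformation can be reused across sources by supplying different metadata instead of writing a separate script for each feed.

```python
import csv
import io

# Hypothetical per-source metadata: delimiter, column renames and type casts.
# In a real pipeline this would likely live in a config file or catalog.
SOURCE_METADATA = {
    "enrollment_partner_a": {
        "delimiter": ",",
        "rename": {"member": "member_id", "plan": "plan_code"},
        "types": {"member_id": int},
    },
    "enrollment_partner_b": {
        "delimiter": "|",
        "rename": {"MEMBER_NO": "member_id", "PLAN": "plan_code"},
        "types": {"member_id": int},
    },
}


def run_template(raw_text, metadata):
    """Apply one generic transform whose behavior is set entirely by metadata."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=metadata["delimiter"])
    rows = []
    for record in reader:
        # Rename source columns to the shared target names.
        row = {metadata["rename"].get(key, key): value.strip()
               for key, value in record.items()}
        # Cast the columns that the metadata marks as typed.
        for column, cast in metadata["types"].items():
            row[column] = cast(row[column])
        rows.append(row)
    return rows


# Two made-up feeds with different layouts, processed by the same template.
feed_a = "member,plan\n101,LIFE10\n102,ACC20\n"
feed_b = "MEMBER_NO|PLAN\n201|LIFE10\n"

combined = (
    run_template(feed_a, SOURCE_METADATA["enrollment_partner_a"])
    + run_template(feed_b, SOURCE_METADATA["enrollment_partner_b"])
)
print(combined)
```

Adding a third feed in this sketch means adding one metadata entry, not a new script, which is the kind of repetitive work such capabilities aim to cut.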
At Flyspan, Christoval sees benefits in the ability of Trifacta's software to provide an overall view of the data preparation process and automate the required steps. "It allows us to wrangle the disparate types of data we get from telemetry, distill [the raw data] down to what the decision-makers need and put that in their face," he said.
'No hands' on deck for preparing data
Such automation is becoming an imperative as more and more data needs to be processed a lot faster. "The new tools fit well because we're moving toward more 'hands-off' workflows," said Dave Wells, an independent consultant and industry analyst.
David Stodder, a director at IT research and education services provider TDWI, also credited self-service data preparation tools for their automated approach and their focus on big data and advanced analytics applications. "They're all trying to reduce the number of steps required, to make things more repeatable and easier," he said -- something that's more important for predictive analytics and machine learning involving large and diverse data sets than it is for mainstream business intelligence and reporting.
A TDWI report published in July 2016 that Stodder authored bears out the growing need to blend different types of data for analysis -- and the challenges IT and analytics teams face in doing so.
In the report, Stodder wrote that users increasingly want to see an integrated view of data to help them identify relationships, correlations and trends. Not surprisingly, relational databases and data warehouses lead the array of data sources enabled for self-service data preparation, according to a survey conducted for the report. But JSON, clickstream, social media and real-time streaming data are among newer types of information that survey respondents are adding to the analytics mix.
Nonetheless, in many organizations, "users across the spectrum deal with data chaos every day," Stodder wrote. Only 43% of 411 survey respondents said their users were satisfied with how easily they could find and understand relevant data; on the other hand, 37% said users were either somewhat dissatisfied or not satisfied at all. The increasing data demands and associated problems are prodding companies to rethink the traditional data preparation process, Stodder said, noting that improvements can "help both [the] business and IT become more productive and effective."
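The blending of relational data with newer sources such as JSON clickstream records, which the report describes, can be sketched in a few lines of Python. This is an illustrative example only, not a method taken from the TDWI report. The table names, fields and sample events are invented, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Hypothetical relational source: a small customer table in in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "retail"), (2, "wholesale")])

# Hypothetical JSON clickstream events, one document per line.
clickstream = """
{"customer_id": 1, "event": {"page": "/pricing", "ms": 5400}}
{"customer_id": 2, "event": {"page": "/docs", "ms": 1200}}
{"customer_id": 1, "event": {"page": "/signup", "ms": 800}}
"""


def flatten(line):
    """Turn one nested JSON event into a flat (customer_id, page, ms) tuple."""
    doc = json.loads(line)
    return (doc["customer_id"], doc["event"]["page"], doc["event"]["ms"])


# Load the flattened events next to the relational data.
conn.execute("CREATE TABLE clicks (customer_id INTEGER, page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)",
                 [flatten(line) for line in clickstream.strip().splitlines()])

# Blend both sources into one integrated view per customer segment.
blended = conn.execute(
    "SELECT c.segment, COUNT(*) AS views, SUM(k.ms) AS total_ms "
    "FROM clicks k JOIN customers c ON c.customer_id = k.customer_id "
    "GROUP BY c.segment ORDER BY c.segment"
).fetchall()
print(blended)
```

The preparation work here is the flattening step: once the nested events are in tabular form, they can be joined to existing warehouse data with ordinary SQL.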