StreamSets Inc. expanded its streaming data product line to include tools that bring more data governance to big data applications where data is in motion, not just at rest.
The StreamSets Data Protector works with the company's existing platform elements to automatically check that the vast quantities of new incoming data adhere to corporate governance policies.
Such tools could play an important role as companies work to adapt big data to old compliance standards like HIPAA and new ones like GDPR. These regulations are in place to protect old-fashioned personally identifiable information afloat in those new age data streams.
A step up from custom programming
Software like StreamSets Data Protector is an important step up from custom programming for handling compliance and data discovery, according to Mark Ramsey, senior vice president and chief data officer at the London-based pharmaceutical company GlaxoSmithKline (GSK) PLC. The software helps filter newly arriving data sources -- that is, data in motion -- and cuts down on programmer workload, he said.
"That enables us to understand which things could be considered personally identifiable data," Ramsey said. He added that his teams, which had previously hand-scripted software to meet such needs, worked with StreamSets to help define the new tooling.
He said Data Protector works with Data Collector software, which acts as an agent to identify patterns in intra-organization data flows.
It is all part of an overall effort to bring a less "vertical" and more "horizontal" view of data across GSK research and development work. Ramsey said the StreamSets tools are used to feed a data lake that makes diverse data available to different groups within R&D.
He said the tools "tie in with the needs of GDPR," ensuring companies like GSK can be "clear on data as we move it around."
What to look at
In effect, Data Protector pushes data detection and policy enforcement upstream. Machine learning is employed to check data sources against many known, relevant patterns. Working with the StreamSets Data Collector software, it looks for nine-digit entries for social security numbers and other telltale traits of personal data.
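The nine-digit check described above can be illustrated with a short pattern-matching sketch. This is not StreamSets code and it uses a plain regular expression rather than machine learning; the names `SSN_LIKE`, `flag_fields` and `mask` are hypothetical, and the pattern is an assumption about what a Social Security number-like token looks like.

```python
import re

# Assumed pattern: nine digits, optionally grouped 3-2-4 with hyphens or
# spaces, and not embedded in a longer run of digits.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}[- ]?\d{2}[- ]?\d{4}(?!\d)")


def flag_fields(record):
    """Return the names of fields whose values contain an SSN-like token."""
    return [name for name, value in record.items()
            if SSN_LIKE.search(str(value))]


def mask(value):
    """Replace every SSN-like token in a value with a fixed placeholder."""
    return SSN_LIKE.sub("XXX-XX-XXXX", str(value))


if __name__ == "__main__":
    incoming = {"name": "Ann", "note": "ssn 123-45-6789", "phone": "5551234567"}
    for field in flag_fields(incoming):
        print(field, "->", mask(incoming[field]))
```

A rule this simple will flag any nine-digit number, including ones that are not personal data, which is one reason a product would check against many patterns instead of one.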
The software can help with GDPR, the EU-originated regulation that could affect data policy globally when it becomes an enforceable standard in May. That is because, when it comes to sorting data and applying policies, streaming data in motion is just as important as stored data at rest.
But Girish Pancha, CEO at StreamSets, cautioned that data policy enforcement can affect workload performance. Sampling rates are at issue, he said.
"To inspect all of your data, that's a lot of overhead, and it can bring your infrastructure to a creaking halt. The greater the sampling rate, the more processing overhead. You've got to balance the business risk with the other elements, one of which is obviously [processing] cost."
Data Protector supports "a user-defined process on how much data to look at," Pancha said. This can be a periodic process of sampling data or a brute force inspection of all data that passes through Data Protector, Pancha said. Data Protector is currently in the hands of several early adopters with general availability planned for midyear.
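The tradeoff Pancha describes, sampling versus inspecting everything, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Data Protector's interface: `inspect_stream`, `sample_rate` and `check` are hypothetical names, and random sampling stands in for whatever sampling process a user might define.

```python
import random


def inspect_stream(records, check, sample_rate=1.0, seed=None):
    """Run `check` on a sampled share of records.

    sample_rate=1.0 inspects every record (the brute force case);
    lower values inspect roughly that fraction, chosen at random.
    Returns the number of records inspected and the ones flagged.
    """
    rng = random.Random(seed)
    inspected = 0
    flagged = []
    for record in records:
        if sample_rate >= 1.0 or rng.random() < sample_rate:
            inspected += 1
            if check(record):
                flagged.append(record)
    return inspected, flagged


if __name__ == "__main__":
    stream = [{"id": i, "sensitive": i % 10 == 0} for i in range(1000)]
    for rate in (1.0, 0.1):
        count, hits = inspect_stream(stream, lambda r: r["sensitive"], rate, seed=1)
        print(f"rate={rate}: inspected {count}, flagged {len(hits)}")
```

Lowering the rate cuts the number of checks, and with it the processing cost, but sensitive records that fall outside the sample pass through unflagged. That is the business risk being balanced.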
Keep calm and generally protect data
As with the Year 2000 bug that caused an industry-wide rewrite of code, the enforcement of GDPR will create some types of hysteria, Ventana Research analyst David Menninger said.
At the same time, it will cause organizations to reevaluate data preparation and governance processes, which will come to affect big data in motion, he said.
"Companies are building data lakes because data now is being created constantly. In effect, we have solved one problem, but we have left the rest of the exercise to the reader," said Menninger, referencing the growing need to effectively manage all of the data that goes into a data lake.
That process is benefiting from greater automation.
"We have raw tools to do data streaming, but now we are starting to see higher-level apps that apply governance to that streaming data," he said.
Senior executive editor Craig Stedman also contributed to this article.