
How to improve data governance for self-service analytics

Citizen data scientists and self-service analytics are on the rise as the data scientist shortage continues. Here are some data management best practices to support those roles.

The movement for improved self-service analytics to enable citizen data scientists is at a tipping point. According to analysts with Gartner, 2020 will be the year citizen data scientists surpass analytics professionals and data scientists in the amount of advanced analysis they produce.

However, to get the most out of this analytical groundswell, organizations will have to prioritize data governance for self-service analytics. Organizations that scale out self-service capabilities without the corresponding data management best practices will find business users struggling to make decisions because of inconsistent, stale and hidden data, among other problems. They also expose themselves to significant data privacy and security risks when they open up data sources without appropriate governance.

According to analytics and data management experts, organizations that want to better support citizen data scientists in the coming years while addressing these issues should focus on the following data management best practices.

Balancing data quality and timeliness

Data quality has increasingly moved to the epicenter of a number of analytics trends -- including self-service analytics, according to Emily Washington, executive vice president of product management at Infogix Inc., who said that it will be a key goal for organizations in 2020.

"As organizations continue to push the limits with data storage and processing, we see data quality as the underlying theme to ensure they're leveraging data they can trust," Washington said.

The problem enterprises face is not just providing consistent, clean data, but also doing it in real time. A recent survey by Actian Corp. showed that while 94% of IT decision-makers today said it's important to receive current data to power a data-driven enterprise culture, over half of them admitted they're forced to use stale data at least some of the time.

As a result, organizations are seeking innovative ways to serve up data. As Washington said, traditional batch processing where data is sent on a schedule from system to system is not meeting the needs of today's real-time environment.


To meet these demands, many companies are moving to event-driven architectures to handle large volumes of streaming data, leaning on distributed streaming platforms like Apache Kafka, ActiveMQ, Apache Pulsar and Amazon Kinesis. They're seeking to not only help citizen data scientists make decisions more quickly but also to open up more analytics use cases.

"There are exciting new analytics use cases, like customer-360 and hyper-personalized real-time offers, that simply don't work with stale data," said Jack Mardack, a vice president at Actian. "This blurs the lines between traditionally separate transactional databases and data warehouses and places new demands on the data management infrastructure, where real-time availability is now a requirement."

The point is that real-time data becomes a liability rather than an asset if it's not validated at speed.
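The validate-at-speed idea can be sketched without a real broker. The following is a minimal, in-memory illustration (the `Event` fields, thresholds and bucket names are hypothetical, not from any particular streaming platform): each incoming record is checked at ingestion, and anything that fails validation is quarantined instead of flowing into analytics.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A hypothetical streaming record, e.g. one transaction."""
    user_id: str
    amount: float


def validate(event: Event) -> bool:
    # Reject records that would poison downstream analytics:
    # missing identifiers or out-of-range values.
    return bool(event.user_id) and event.amount >= 0


def consume(stream):
    """Route each incoming event to a clean or quarantine bucket."""
    clean, quarantined = [], []
    for event in stream:
        (clean if validate(event) else quarantined).append(event)
    return clean, quarantined


events = [Event("u1", 19.99), Event("", 5.0), Event("u2", -3.0)]
clean, quarantined = consume(events)
```

In a production event-driven architecture the same check would run inside a Kafka, Pulsar or Kinesis consumer, so bad records are caught per event rather than in a nightly batch.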

Putting an emphasis on data governance

Establishing strong data governance for self-service analytics is at the heart of addressing data quality issues that hamper effective citizen data scientists. It's also crucial for ensuring that citizen data scientists' activities don't devolve into security and compliance nightmares.

"When enabling and managing citizen data scientists, governance should be a high priority," said Jen Underwood, an independent consultant and former senior director at machine learning vendor DataRobot, where she was in charge of product marketing for citizen data science uses. "For organizations in highly regulated industries -- financial services, pharmaceutical or biotechnology and energy -- effective data management solutions for supporting legal and regulatory compliance, mitigating risk and improving efficiency are simply not negotiable."

The good news is that data access policies for citizen data scientists don't have to be revolutionary. These policies can evolve from and mirror similar policies that organizations have been rolling out for self-service business intelligence enterprise functions.

The trick is adapting policies to new use cases that are cropping up, such as accounting for how machine learning data access practices need to change in light of data privacy laws like GDPR and the California Consumer Privacy Act.

Data governance is especially important with GDPR and the California Consumer Privacy Act, which goes into effect Jan. 1, 2020.
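One common way such policies show up in practice is column-level access control with masking of personal data. The sketch below is an illustrative assumption, not a reference to any specific governance product; the `POLICIES` table, role names and masking rules are all hypothetical.

```python
# Hypothetical column-level policy: which roles may see raw values,
# and how to mask the value for everyone else.
POLICIES = {
    "email":   {"roles": {"dpo"}, "mask": lambda v: "***"},
    "country": {"roles": {"dpo", "analyst"}, "mask": lambda v: v},
}


def read_row(row: dict, role: str) -> dict:
    """Return the row as a given role is allowed to see it."""
    out = {}
    for col, value in row.items():
        policy = POLICIES.get(col)
        if policy is None or role in policy["roles"]:
            out[col] = value          # unregulated column or authorized role
        else:
            out[col] = policy["mask"](value)
    return out


row = {"email": "a@example.com", "country": "US", "spend": 42.0}
analyst_view = read_row(row, "analyst")
dpo_view = read_row(row, "dpo")
```

The point is that the same self-service BI access policies an organization already has can be extended with masking rules like these as GDPR- and CCPA-style requirements arrive.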

Amp up data discovery and data prep with augmented analytics

Organizations are increasingly turning to the machine learning capabilities of augmented analytics to automate the discovery and preparation of the data that citizen data scientists need to glean insights from organizational information.

Data discovery is a crucial piece of this data management puzzle when it comes to getting the most out of self-service analytics.

"Acknowledged as important glue to enterprise software, delivery of a common catalog for finding, provisioning, securing and understanding data and other objects is important to customers," said Todd Wright, senior product marketing manager of data management and data privacy solutions at analytics software vendor SAS Institute. "Further, this discovered insight through application of advanced analytics delivers the ability to automate mundane data management tasks and find value in data that previously had been too difficult to discern."

Augmented and smart analytics can also drastically reduce the effort organizations must spend cleaning up data sets. According to Krzysztof Surowiecki, managing partner at Hexe Data, extract, transform and load (ETL) work consumes 80% of the time data analysts spend preparing data for use. Augmented analytics and AI stand to slash such time-consuming data prep activities. Wright agreed, saying that this approach to data governance for self-service analytics will unlock the kind of data democratization organizations need to empower citizen data scientists.

"To expand data manipulation activities to a wider audience, development of advanced data transformation using AI to automate cleansing and blending will empower nontechnical users," Wright said.
