Machine learning algorithms meet data governance

Machine learning algorithms may change computing, but they're a bit of a black box. Still, there are ways to tame them with flexible data governance, according to tech startup exec Andrew Burt.

As a lawyer on the staff of the FBI Cyber Division, Andrew Burt spent a good deal of time looking at the intersection of national security and technology. That meant working on policy in an organization charged with examining massive amounts of sensitive data. Now, as chief privacy officer and legal engineer at startup Immuta Inc., he is one of a new cadre working to bring more data governance to machine learning, the artificial intelligence-style technology that is moving from laboratories into mainstream computing.

Machine learning algorithms are something of a black box for governance, as the technology does not necessarily disclose how it reached its decisions. To cast some light on this black box and what it means to data governance, we recently connected with Burt to discuss sensitive data processing at scale.

How will data governance change when decisions made based on machine learning algorithms are more widely employed?

Andrew Burt: It's the multibillion-dollar question. What is challenging is that machine learning, for the first time, is starting to occupy a significant place in the decision-making process at scale. Organizations are using technology to make decisions in ways that at least have the potential to remove the human entirely from that decision-making process. That has excited some people and scared some, too.

It is different because the old types of governance were about process -- about who saw what data when. That's been the bread and butter of data governance. That model assumes there still is an audit trail and you can ask someone what happened.

As machine learning comes to occupy part of this decision-making sphere, we're losing that ability. Governance, now, is actually beginning to impact what types of decisions can be made, and what types of rights the subjects of the decisions have.

We are starting to hear people wonder if machine learning algorithms are too much of a black box. How do we begin to govern artificial intelligence?

Burt: Actually, there is a host of ways we can govern and actively control and monitor the process of creating machine learning models. It's not a binary choice between letting machine learning models run amok or strengthening governance so much that there is no machine learning.

There are really three buckets here. You have the data, the model and the decisions. There are ways to govern using each of those buckets. Each has a role to give visibility into the way that machine learning models are actually being deployed.

The most important bucket is understanding the data that is used to train the model. If you don't understand the data to start with, there can be huge risks embedded within the models. There is no better way of having visibility into black box models than having a very good understanding of what type of data is actually going into them. That includes everything from the time the data is collected -- gauging the possibility of biases in the data itself and observing the activity when it is [extracted, transformed and loaded] -- to the time it's used in a model.

And what about with the machine learning model itself?

Burt: There you find a spectrum where there is a tradeoff between the traceability and the actual accuracy of the model. There are some circumstances where governance concerns are going to have to hold some weight on the scale. There may be circumstances where there is a level of interpretability we just can't sacrifice.

Historically, in fields like finance, interpretability has really been prioritized. In fact, data scientists in that field have leaned very heavily on models like linear regression where you have the ability to play it back. So, the second bucket is about the model choice itself.
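To make the "play it back" idea concrete, here is a minimal Python sketch. It is an illustration only, not something Burt described: the feature names (`income`, `debt`), the toy data and the helper names `fit_linear` and `play_back` are all hypothetical. It fits an ordinary least squares model with NumPy and then breaks a single prediction into one contribution per feature, which is the kind of replay a linear model allows.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefs)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def play_back(intercept, coefs, x, names):
    """Break one prediction into a contribution per named feature."""
    contributions = {n: float(c * v) for n, c, v in zip(names, coefs, x)}
    prediction = float(intercept + sum(contributions.values()))
    return prediction, contributions

# Hypothetical toy data generated as: score = 2*income - 3*debt + 1
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2 * X[:, 0] - 3 * X[:, 1] + 1

intercept, coefs = fit_linear(X, y)
prediction, contributions = play_back(
    intercept, coefs, np.array([3.0, 2.0]), ["income", "debt"]
)
```

Because the prediction is just the intercept plus the sum of the contributions, a reviewer can see how much each input pushed the result up or down.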

But there are going to be circumstances when the models we use literally are black boxes. So, finally, the third bucket is the actual output of the decision. There are, in fact, some technical ways to reduce the level of opacity in these models. One is LIME, which stands for Local Interpretable Model-agnostic Explanations. What that is able to do, basically, is model the reason why each decision was made, after the fact. It isolates the features that are driving the decision. So, even in the face of black box algorithms, there is a level of 'post-hoc' or backward-looking review for some of these models.
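The core idea behind LIME can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LIME library or its API: `black_box` is a hypothetical stand-in for an opaque model, and `explain_locally` is a made-up helper that omits LIME's feature selection and interpretable representations. It perturbs one input, weights the perturbed samples by how close they are to the original, and fits a weighted linear surrogate whose slopes indicate which features drive the decision near that point.

```python
import numpy as np

def black_box(X):
    """Hypothetical opaque model: a nonlinear score over two features."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def explain_locally(predict, x, n_samples=2000, scale=0.1, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return its slopes."""
    rng = np.random.default_rng(seed)
    # Sample perturbed copies of the instance being explained.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict(Z)
    # Exponential kernel: samples closer to x get larger weights.
    dist2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-dist2 / (2 * scale ** 2))
    # Weighted least squares via row scaling by sqrt(weight).
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[1:]  # local effect of each feature on the prediction

x = np.array([0.0, 1.0])
weights = explain_locally(black_box, x)
```

For this toy model, the slopes should land close to the true local sensitivities at that point (about 1 for the first feature and about 2 for the second), which is what lets a reviewer say which input mattered most for that one decision.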

It seems what Immuta is pursuing could be a platform for differential privacy within an organization. Does that reflect the fact that 'one size does not fit all' for data these days?

Burt: Differential privacy up until now has lived within academic research and within the tech giants. What we have tried to do is make it easy to implement and easy to use. What that means is data can be shared while also having mathematical protection for the personally identifiable information within the data. That concept is what we call personalizing data.
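As a rough sense of what "mathematical protection" can look like, one standard building block of differential privacy is the Laplace mechanism. The Python sketch below is an illustration of that general technique and makes no claim about how Immuta implements it; the `private_count` helper, the sample ages and the `epsilon` value are hypothetical. It answers a counting question with noise scaled to `1 / epsilon`, because adding or removing one person changes a count by at most 1.

```python
import numpy as np

def private_count(values, predicate, epsilon, rng):
    """Count matching records, then add Laplace noise (sensitivity of a count is 1)."""
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(42)
ages = [23, 35, 41, 52, 67, 29, 73]  # hypothetical records

# "How many people are 65 or older?" answered with epsilon = 0.5
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy, while a larger one gives more accurate answers. Averaged over many queries the noise cancels out, which is why real systems also track a total privacy budget.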

Organizations are finding that what they need in order to speed their data science programs is the ability to have each user see only the data they are allowed to see, in the right form for each corresponding purpose. So, within any organization, permissions and rights, and the ability to use data for different purposes, are going to vary across the spectrum of users.
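One way to picture each user seeing only permitted data "in the right form" is a purpose-based view over a record. The Python sketch below is a hypothetical illustration, not Immuta's product or API: the purposes, column names, `Policy` class and `view_for` helper are all invented for the example. Each purpose maps to the columns it may see as-is and the columns it may see only in masked form, and anything else is withheld.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    visible: frozenset  # columns returned unchanged
    masked: frozenset   # columns returned in redacted form

# Hypothetical purposes and column rules
POLICIES = {
    "fraud_investigation": Policy(frozenset({"name", "account", "amount"}), frozenset()),
    "analytics": Policy(frozenset({"amount"}), frozenset({"account"})),
}

def view_for(purpose, row):
    """Return the version of a row that the given purpose is allowed to see."""
    policy = POLICIES.get(purpose)
    if policy is None:
        return {}  # unknown purpose: deny by default
    out = {}
    for column, value in row.items():
        if column in policy.visible:
            out[column] = value
        elif column in policy.masked:
            out[column] = "***"
        # columns in neither set are dropped entirely
    return out

row = {"name": "A. Example", "account": "12345", "amount": 250.0}
analytics_view = view_for("analytics", row)
investigator_view = view_for("fraud_investigation", row)
```

Here the analytics purpose gets the amount and a masked account number but never the name, while the investigation purpose sees the full record.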

Data access patterns are going to change depending on a variety of contexts. That relates to both the underlying storage technology and governance concerns. Different data will have different restrictions attached to it, and that will change.
