A long journey on the big data science road took a new direction this year when Kirk Borne joined management and technology consulting company Booz Allen Hamilton as principal data scientist. Borne's early work was in the field of astrophysics, where grappling with terabyte upon terabyte of star data set him on a path to today's intersection of data science and big data analytics. In between, he helped create one of the early academic data science programs, at George Mason University, where he also served as a professor of astrophysics and computational science. Now, besides helping Booz Allen Hamilton advance its data science techniques and advocating for greater data literacy, Borne is among the most prominent big data Twitterati. SearchDataManagement recently caught up with him for a Q&A on big data, data science and other topics.
How did you come to big data? From your background, it seems you may have come to it by the stars.
Kirk Borne: Yes, I guess you could say I started in the stars. My education was in physics and astronomy. Then I was doing astrophysics, and it always involved data analysis and data collection. But my 'day job' actually was supporting NASA contracts, including the NASA Hubble Space Telescope. I also did work at the National Space Science Data Center. It all included data systems and large data sets -- I was always surrounded by data. In the late '90s, some of the data sets were going off the charts in terms of size. I realized something was dramatically changing, and I started looking into data mining and data products, to help scientists explore the data. It became a mix of what we call data management and data analytics.
What is your mission at Booz Allen?
Borne: My title is principal data scientist. We have several chief data scientists and several hundred data scientists. The chief data scientists may focus on specific verticals, like national intelligence, sports analytics or health analytics, whereas my focus is about building across [industries]. Like a honeybee, I go wherever the flowers are blooming. I get involved in the conversations whether it is sports, cybersecurity, fraud or what have you.
It seems that a hallmark of the next era may be deep learning networks. It's a new thing for many organizations, but neural networks really go back a ways.
Borne: With the new work in deep learning, you see a neural network concept, but it's a very deep multilayer network, rather than the ones you dealt with in the early days, which were basically just one hidden layer plus input and output layers. The deep learning networks have many layers, often combined in what they call convolutional networks. All the different parameters are combined in different ways at each layer to produce much more intelligent outputs than we ever imagined before. It's more akin to the human brain. Your brain recognizes general things and then works toward specifics. Humans build up interpretations in layers. A deep network is actually doing that kind of thing.
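The layering Borne describes can be sketched in a toy example. This is not his example, and it is far shallower than a real deep network: a single hidden layer with hand-picked weights (a trained network would learn them from data), written in plain Python purely to show how one layer's general features feed the next layer's specific decision.

```python
def relu(v):
    """Rectified linear unit: the nonlinearity applied between layers."""
    return max(0.0, v)

def forward(x1, x2):
    """Tiny network with one hidden layer, computing XOR of two bits.

    The weights below are hand-picked for illustration; a deep network
    stacks many such layers and learns its weights from data.
    """
    # Hidden layer: each unit detects a *general* pattern in the inputs.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)   # fires if either input is on
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)   # fires only if both are on
    # Output layer: combines the hidden features -- not the raw inputs --
    # into a *specific* decision, echoing the general-to-specific layering.
    return 1.0 * h1 - 2.0 * h2
```

The hidden units here capture broad patterns ("any input on", "both on"), and the output layer recombines them, which is the same build-up-in-layers idea, just with two layers instead of dozens.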
Can we add Apache Spark to the conversation? Much of the interest there seems to be around Spark's machine learning libraries. But there is some confusion around its connection to Hadoop.
Borne: Well, people sometimes confuse Hadoop and Spark, and think one thing will replace the other. They are two different things. Hadoop is really the distributed data infrastructure. When you have very large data, it is very hard to access all that data. That was a problem we had [in astrophysics] -- that the data was sequential as opposed to parallel. What a distributed data system does is it gives you parallel access to large sets of data. Really, to use a library science metaphor, Hadoop serves as a kind of card catalog that allows a 'blob' of data to exist across a commodity cluster of machines that you can add to.
Spark is an analytical engine that does the processing on all that data. Before, it was about MapReduce in Hadoop. But if you think about doing a deep learning network with lots of calculations, MapReduce makes a lot of [writes and calls] to disk. That is extremely time-consuming and expensive processing for something like deep learning. What Spark does is to read all of that stuff into memory and do all the processing [there] -- and then, when it's done, to write the results back to the distributed data infrastructure. It's in-memory fast processing, running on top of what Hadoop provides. The two can live harmoniously together, I would say.
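The map/shuffle/reduce pattern Borne refers to can be sketched in plain Python. The function names and the word-count task here are invented for illustration -- this runs on one machine, whereas Hadoop distributes each phase across a cluster -- but the data flow is the same, and the comment in the shuffle step marks where MapReduce's disk writes (and Spark's in-memory alternative) come in.

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    """Map: emit (word, 1) pairs. In Hadoop this runs in parallel,
    with each mapper handling its own chunk of the data."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key. In Hadoop MapReduce the
    intermediate pairs are written to disk between phases; Spark keeps
    them in memory, which is why iterative jobs like deep learning
    run much faster on it."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data", "data science", "big big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

A deep learning job repeats this kind of pass many times over the same data, so paying the disk round-trip on every iteration is what makes classic MapReduce so expensive for it, and keeping the working set in memory is Spark's core advantage.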
You've also been a vocal advocate for data literacy. Why does that matter?
Borne: For me, data literacy is on a par with the other literacies we promote in our education system, like reading, writing and arithmetic. Literacy also includes an understanding of history and cultures; literacy makes you a well-rounded person because you can carry on a conversation, you can understand the world you live in, and you can be a productive worker in that world.
Since everything is now digital and everything is producing data -- our social media, our cars, our refrigerators -- the businesses of the world that will make the biggest splashes and the biggest revenues are those that make the best use of those data streams and digital signals. And they're looking for people who know how to do it. We need to not just train people in skills -- as far as machine learning skills go, very few will actually learn those. But every person needs to learn what data is. And it's not all positive. There are other issues around data privacy and ethics -- having an understanding of both the positives and the negatives, including what could go wrong. We have to work to learn the limitations of data.