In his role as principal data scientist at consulting firm Booz Allen Hamilton Inc., Kirk Borne sees the world in terms of data connections. "Life is about who is connected to whom and what is connected to what," Borne said, and he pointed to graph databases and graph analytics applications as new ways to capitalize on such connections.
That's because graph databases, a form of NoSQL software, document the connections between data points quite differently than mainstream relational databases do. Graph systems represent data not as elements in tables, but as nodes linked to one another by edges, with a set of properties that delineate the relationship between nodes.
Therefore, one of the advantages of graph databases is that they allow data analysts to navigate through data sets without the need to create and run complex queries to join combinations of tables together, as in the relational model.
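To make the node-and-edge model concrete, here is a minimal Python sketch of a property graph held in memory. It is an illustration only, not how any product mentioned in this article stores data; the `Edge` class, the `neighbors` and `follow` helpers, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A directed, labeled edge carrying its own properties."""
    target: str
    label: str
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: each node ID maps to its outgoing edges.
graph = {
    "alice": [Edge("bob", "KNOWS", {"since": 2015})],
    "bob": [Edge("acme", "WORKS_AT", {"role": "analyst"})],
    "acme": [],
}


def neighbors(graph, node, label=None):
    """Return the IDs one edge away from node, optionally filtered by edge label."""
    return [e.target for e in graph.get(node, []) if label is None or e.label == label]


def follow(graph, start, labels):
    """Follow a sequence of edge labels from start and return the set of end nodes."""
    frontier = {start}
    for label in labels:
        frontier = {t for n in frontier for t in neighbors(graph, n, label)}
    return frontier


# "Where do the people Alice knows work?" is answered by walking two edges.
employers_of_contacts = follow(graph, "alice", ["KNOWS", "WORKS_AT"])
```

In a relational schema, the same question would typically mean joining a people table to a relationships table and then to an employers table; here it is a walk along two edges.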
"Graphs make more sense from a data discovery perspective," Borne said. When graph algorithms and analytics tools are applied to data sets, basic functions, such as clustering, partitioning, search and estimating the shortest path between nodes, disclose patterns in the data, according to Borne and others.
Graph use cases on the rise
Borne noted that the graph approach underlies some bellwether online applications. That includes the page ranking system in Google's search engine and its Knowledge Graph, which pulls together factual data from various online sources. Facebook and LinkedIn also use graphs to map their networks of friends and connections. In addition, graph analytics is used in online recommendation engines.
These applications aren't implemented in graph databases in every case; graphs can be built, stored and managed in various platforms, including relational ones. But interest in graph database technology is growing among vendors and IT teams alike, with fraud detection, cybersecurity, text analysis, data catalogs, master data management and scientific research among the uses now in production.
The rising tide of big data is one of the factors prompting more users to consider the advantages of graph databases and graph data modeling approaches. "Today, graph tools have more rich data to do discovery against," Borne said.
For example, Neo4j Inc.'s namesake graph database provides a platform to collect and share a variety of genetic data and other information related to diabetes for Alexander Jarasch, head of data and knowledge management at the German Center for Diabetes Research in Munich.
Known by the acronym DZD in German, the center is using Neo4j as part of its efforts to create new therapies for diabetes patients and find ways to prevent the disease, Jarasch said. He started scripting and prototyping work with the graph database in April 2017 and was joined in that effort by the two other members of his team last year.
Goodbye to data joins
After more than 10 years of working in bioinformatics, Jarasch developed an aversion to relational databases -- or at least the data joins that so often are central to relational queries. "I hate joins," he said. "When you have data scattered over tables and you look for insights, it gets complicated."
Jarasch and his colleagues are looking to use the Neo4j database to enable easier sharing of diverse data within the DZD, which comprises a number of independent research organizations. The data comes from a mix of hospitals, labs and other sources, according to Jarasch; some of the data is on humans, and some is on test animals. "Everybody has their data in silos," he said. "They exchange data, but there's no overarching way to connect the data." That's what he seeks to achieve via the graph technology.
The first steps focus on creating metadata to accompany the raw data so researchers can see what's available to analyze. Graph analytics applications will follow. Jarasch said he expects data to become available for one or two research projects this year, including a project that will link anonymized data on humans with data from mice and pigs.
Despite the deployment of Neo4j, Jarasch doesn't foresee an end to the use of relational databases at the DZD. Rather, the graph database software gives research scientists "an additional layer for looking at their data," he explained.
Plenty of graph options
Other users looking to benefit from the advantages of graph databases have various technology options. In addition to Neo4j, vendors of native graph databases include Cambridge Semantics, Cray, Franz, Ontotext and TigerGraph. Cloud platform market leader AWS also offers a graph database called Amazon Neptune, launched in late 2017.
Graph technology is available from other cloud providers, too. Microsoft's Azure Cosmos DB multimodel database can be used to store and manage graph data. IBM supports the JanusGraph open source database in its cloud via a managed service called Compose for JanusGraph.
In addition, relational databases such as Oracle Database and Microsoft SQL Server can do graph processing and analytics. Graph functions are similarly supported in other types of NoSQL database management systems from DataStax, MarkLogic, Redis and others. Meanwhile, the Apache Spark analytical engine supports parallel graph computation.
Historically, graph databases have been divided into two categories. In addition to those that support property graphs with nodes and edges, there are RDF databases that are based on the Resource Description Framework and focus on the semantic aspects of data, storing information in triples that comprise subjects, predicates and objects. But that distinction is blurring as vendors move to support both types.
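For comparison with the property graph model, an RDF-style store can be sketched in a few lines of Python by treating each triple as a (subject, predicate, object) tuple, the standard RDF terms for the three parts. This is a toy illustration under that assumption, not a real RDF library; the `match` helper and the sample triples are hypothetical.

```python
# Hypothetical sample triples: (subject, predicate, object).
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "works_at", "acme"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Who knows whom?" becomes a pattern with only the predicate fixed.
knows = match(triples, predicate="knows")
```

The relationship itself is a first-class part of every record, which is why triple stores lend themselves to questions about what kind of connection exists between two things.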
Powering the grid with graphs
Guangyi Liu has begun to work with TigerGraph's massively parallel processing graph database as part of an effort to build a system that matches electricity supply and demand on the fly.
Bringing real-time analytics performance to electrical power distribution has been a holy grail in the utility industry, said Liu, CTO at the Global Energy Interconnection Research Institute North America. GEIRINA is an R&D center in San Jose, Calif., that's affiliated with the State Grid of China, a government-owned utility based in Beijing.
Liu's team is looking to do large-scale linear equation processing on a topology representing signals from the millions of sensors, actuators, relays and switches on a power grid. The project, which began in 2015, originally tested out Oracle's relational database software. But like Jarasch, Liu found drawbacks to the relational approach.
"With the Oracle database, you need to convert tables into a data structure representing the topology of the system," Liu said. With TigerGraph, however, "the topology is right there," he added. The graph database also makes it possible to do data searches and calculations in parallel, according to Liu.
Philip Howard, an analyst at London-based Bloor Research, said he expects the use of graph technology to continue to expand. In particular, he pointed to the advantages of graph databases over relational software for the large-scale "who knows who?" questions that underlie many modern applications.
Yet graph tools, if they are used at all, often serve as an adjunct to relational databases or other types of NoSQL systems, Howard said. Graphs may offer a more natural way to model and connect data, he noted, but IT teams tend to think "inside the box" when evaluating and selecting data management platforms.