Sergey Galushko - Fotolia

Data hub vs. data lake: Deciphering the differences

Data lakes and data hubs are approaches to data management that are typically opposed. Here are the main differences between these two storage options.

Joseph M. Carew

By

Joseph M. Carew

Published: 29 Apr 2020

Companies have realized that the more data they gather, the better they can understand their customers and users. And the way a company stores its data can allow for a more balanced and intelligent view of its operations.

Two storage options are data lakes and data hubs. A data lake stores raw data similar to a regular lake, while a data hub is composed of a core storage system at its center with data in spokes reaching out to different areas.

There has been an ongoing debate on data hub vs. data lake and which is the best way to approach data gathering and storage. Though these are both common terms, differentiating between the two can still be a challenge.

Defining a data lake

A data lake is a centralized option in which all forms of data can be stored in a variety of ways. There is no need to translate data to a singular form, as a data lake can hold a vast amount of raw data in its original format.

Data lakes were created by companies because they understood the value of their data, said Hossein Rahnama, MIT machine intelligence professor and founder and CEO of Flybits. Bringing all that data together allows companies to better predict the needs of their customers and the needs of their business.

A data lake acts as a repository for data from all different parts of an organization. This makes data storage easier than other storage solutions but can become a problem when it comes to drawing that data back out. In order to retrieve desired data from a data lake, it must be queried, and data lake users may struggle with accessibility. Highly technical skills are often required to find relevant information and draw conclusions from that data.

"Companies that are going to be successful leveraging their data lake are the ones that are also building a creative and interactive layer on top of that data lake so non-IT experts can also leverage data assets to build new capabilities," Rahnama said.

Data lakes are often associated with a Hadoop framework; however, many vendors now support data lake architectures, including Amazon, Cloudera and Microsoft. Many even offer the option to deploy data lakes in the cloud.

Key differences between data hub and data lake — A few key differences between data hubs and data lakes

Establishing a data hub

A data hub can be thought of as a hub-and-spoke approach to storing and managing data. Data is physically moved and reindexed into a new system. This provides more structure to the data and permits diverse business users to access information that they need more rapidly than in a data lake.

Data hubs are usually created as a joint effort between complementary businesses, Rahnama said. It could be between a telecom operator, a bank and a supermarket, and they will all come together to share insights and elements of data.

Each spoke of this wheel would have access to some or all of the collective data gathered, depending on what they were looking to gain from it.

"The telecom operator may have a data cloud [storing] telecom information, the financial organization may have another cloud owning transaction data and the supermarket may have another data set," Rahnama said. "Now, these organizations have two options to create a data alliance or a data hub; they may agree to host their data in a centralized repository that can be accessible by all three of them."

This brings up concerns about privacy, as information collected by a bank could find its way to a completely different company. To ease these worries, it is critical for companies using data hubs to ask for user consent to sharing their data.

Similar to data lakes, data hubs were originally built on a Hadoop framework, but there are now other popular vendors, including MarkLogic and Google.

Data hub vs. data lake

Creating a data hub does not mean that data lake architecture is unavailable, however.

"A data hub, at the same time, may or may not use a data lake architecture," Rahnama said. "I can use a data lake with different stakeholders to participate in. Or I can completely decentralize it and leverage something like a blockchain or edge of the cloud or other decentralized mechanism to still form the alliance but in a decentralized way."

Giving numerous businesses access to a communal data lake would, for example, combine both a data lake and a data hub in one solution. This would increase the amount of participating companies but would do nothing to mitigate the accessibility of data lakes.

Data hub and data lake use cases

Both models are strong contenders to reduce data silos, as they are built to be accessible across business divisions' access to the same data.

The multipronged approach of a data hub is popular for use cases that require multiple interpretations to the same data. For example, analyzing similar data for both marketing and financial analytics. This makes data hubs popular for enterprises that analyze various types of data to perform tasks, such as fraud detection and customer service.

Because data lakes are built to store data until it's necessary, they tend to be more popular among enterprise with a less urgent need for data. Data lakes are popular for storing IoT data and archival data. Active archive data stored in a data lake can be used by data scientists for research across industries, including health sciences.

The debate between data lakes vs. data hubs isn't straightforward. With both filling different needs and having a combination as a possibility, the right data management approach boils down to company needs.

Dig Deeper on Database management

Business Analytics

MicroStrategy targets trusted AI with latest platform update
The vendor's platform now includes capabilities aimed at helping customers trust NLP responses, as well as automate workflows ...
Tableau adds generative AI tools, tightens Databricks bond
The analytics vendor's new features include a tool that enables customers to explore metrics using natural language as well as ...
Logi analytics suite to add new GenAI, SaaS capabilities
Insightsoftware, parent company of the embedded BI specialist, unveiled a new generative AI assistant and SaaS version of ...

AWS Control Tower aims to simplify multi-account management
Many organizations struggle to manage their vast collection of AWS accounts, but Control Tower can help. The service automates ...
Break down the Amazon EKS pricing model
There are several important variables within the Amazon EKS pricing model. Dig into the numbers to ensure you deploy the service ...
Compare EKS vs. self-managed Kubernetes on AWS
AWS users face a choice when deploying Kubernetes: run it themselves on EC2 or let Amazon do the heavy lifting with EKS. See ...

Content Management

Benefits and challenges of a headless CMS
Headless CMSes enable omnichannel publishing and improve front-end flexibility. Yet, these platforms can have steep learning ...
7 SharePoint problems that spur customers to leave the platform
SharePoint is a well-known content management and collaboration platform. Despite its popularity, it can introduce many ...
5 benefits of enterprise search
With a proper enterprise search strategy in place, organizations can improve their employees' efficiency and ensure customers ...

Oracle sets lofty national EHR goal with Cerner acquisition
With its Cerner acquisition, Oracle sets its sights on creating a national, anonymized patient database -- a road filled with ...
With Cerner, Oracle Cloud Infrastructure gets a boost
Oracle plans to acquire Cerner in a deal valued at about $30B. The second-largest EHR vendor in the U.S. could inject new life ...
Supreme Court sides with Google in Oracle API copyright suit
The Supreme Court ruled 6-2 that Java APIs used in Android phones are not subject to American copyright law, ending a ...

SAP earnings for Q1 indicate strong cloud growth
SAP's cloud revenue for the first quarter of 2024 indicates healthy growth and sets the stage as customers plan cloud migrations ...
SAP chief AI officer: Waiting on AI is the wrong strategy
SAP's first chief AI officer, Philipp Herzig, outlines the company's new AI-focused organization and underscores why companies ...
SAP, Nvidia partner to boost Business AI development
SAP and Nvidia are working together to combine platforms and services that help customers build business-specific generative AI ...

Close