Global companies have many applications. One company I recently worked with had over 600 documented IT applications, just one of which was its ERP system. To get any sense of business performance across the enterprise, you need to somehow aggregate this data, resolving inconsistencies in classifications of products, customers, suppliers, etc.
This non-trivial task, which requires dealing with data quality problems and other thorny issues, often results in a data warehouse. Keeping that data warehouse up to date amid corporate restructuring, acquisitions and other business changes is a major challenge for data management teams, but it's what companies have -- with mixed degrees of success -- relied on to give them a unified view of their business.
Welcome data lakes
The advent of big data complicated things further: data now arrives in volumes too large or diverse for commercially licensed databases to handle economically. Such data includes data from smart meters, sensors, web logs, telephone masts, social media and more. A modern airplane generates 5 TB of data per flight, while an autonomous car spews out 40 TB per day. Traditional databases were never designed for such high volumes, and costs can quickly rise when scaling up.
A cheaper storage option has been Hadoop, an open source distributed processing framework. This allows very large volumes of data to be stored and managed on clusters of commodity hardware. Hadoop has been pressed into service to deal with the big data that companies now generate, but it is important to understand that this data is raw -- not processed and summarized like the data found in a data warehouse.
The term data lake is used to describe a store of raw data. Think of the difference between the water of a real lake compared to a bottle of Evian, which has been cleansed, branded and packaged for easy consumption.
Initially, data lakes were all hosted inside the corporate firewall on dedicated hardware. However, maintaining a growing data lake -- adding and managing servers as the data pours in -- takes resources. Just as vendors have stepped into other markets that companies used to handle in-house, it is not surprising that the same has happened with data lakes.
Data lakes in the cloud
Big data lake management in your own corporate data center -- dealing with backups, security and hardware failures, etc. -- is a major effort. That's why managed cloud services have become a popular alternative to self-managed Hadoop for data lakes.
Amazon, Microsoft and Google offer data lakes in the cloud. But there are some important data lake management issues to consider before handing over your data to a cloud service provider.
On the plus side, the administration is someone else's problem, and you can scale up or down as needed without having to invest in new hardware. On the other hand, you need to consider whether you trust the provider to handle the security of your data, much of which may be highly sensitive, and whether you trust its ability to keep an operational service running.
Although the major providers are getting more reliable, even in 2019 there were significant outages affecting the Google Cloud (on June 2) and Microsoft (on January 24). But is your internal data center any less prone to outages?
The decision to run a data lake in the cloud or in-house comes down to whether you trust a third-party provider to keep your data safe and secure, or would rather take on that responsibility yourself.
In the early days of the cloud, corporations were very nervous about having their data lake in the cloud, outside the corporate firewall. Gradually, the economic benefits outweighed those concerns.
These days, more and more applications are moving to the cloud, including data lakes. Cloud computing grew almost 24% in 2019 over 2018, according to an IDC report, and 90% of companies use some type of cloud service, according to a 2017 survey by 451 Research.
Making that data useful
Beyond deciding whether to host your data lake in the cloud or in-house, the larger hurdle companies face is how to actually make use of the data filling up their data lakes at an increasing pace.
Being a data analyst confronted by such a high volume of data is like trying to drink from a fire hose. You need to classify the data being stored in the data lake, tag its data sets with meaningful metadata that makes it identifiable later and start to map how this data relates to your corporate data. Adding meaningful metadata or tags to the raw data is especially important. If you don't, your data lake will be more of a data swamp.
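To make the tagging idea concrete, here is a minimal sketch of a data set catalog in Python. The file paths, tag names and functions are hypothetical -- real data lakes typically rely on a dedicated catalog service for this -- but the principle is the same: record identifying metadata at load time so data sets can be found later.

```python
# Hypothetical sketch: a catalog mapping each raw data set in the lake
# to descriptive metadata, so analysts can locate it later.
catalog = {}

def register_dataset(path, source, domain, refreshed, owner):
    """Record identifying metadata for a raw data set in the lake."""
    catalog[path] = {
        "source": source,        # system that produced the data
        "domain": domain,        # business area, e.g. "customer"
        "refreshed": refreshed,  # last load date, for freshness checks
        "owner": owner,          # who to ask about this data
    }

def find_datasets(**criteria):
    """Return paths of data sets whose metadata matches all criteria."""
    return [
        path for path, meta in catalog.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    ]

# Example registrations (paths and owners are made up for illustration).
register_dataset("raw/web_logs/2019/06", source="web", domain="clickstream",
                 refreshed="2019-06-30", owner="analytics team")
register_dataset("raw/social/twitter/2019/06", source="twitter",
                 domain="customer", refreshed="2019-06-30", owner="marketing")

print(find_datasets(domain="customer"))  # -> ['raw/social/twitter/2019/06']
```

Without even this minimal discipline, data sets pile up with no record of where they came from or how fresh they are -- the data swamp scenario described above.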
Companies typically set up their data lakes alongside their traditional data warehouses, with data pumped out of the lake into the warehouse as needed. Before deciding whether to go with a cloud service for your data lake, you need to consider whether that service is a good complement to your data warehouse.
For example, if your data lake takes in social media feeds with customer comments about your brand, how can you relate this data to your customer database? You might want to pay a lot more attention to a customer complaining if they are a valued customer in your corporate loyalty scheme, but are you able to make that connection?
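As a sketch of what making that connection might look like, the snippet below flags negative social media mentions from high-value loyalty members. The matching key (an email address) and the record layouts are assumptions for illustration; in practice, resolving a social media handle to a customer record is far messier.

```python
# Hypothetical customer database keyed by email, with loyalty tier.
loyalty = {
    "ann@example.com": {"name": "Ann", "tier": "gold"},
    "bob@example.com": {"name": "Bob", "tier": "standard"},
}

# Raw social feed records as they might land in the data lake.
mentions = [
    {"email": "ann@example.com", "text": "Package arrived broken!",
     "sentiment": "negative"},
    {"email": "zoe@example.com", "text": "Love this brand",
     "sentiment": "positive"},
]

# Flag negative mentions from gold-tier loyalty members for follow-up,
# attaching the customer's tier to each flagged record.
priority = [
    {**mention, "tier": loyalty[mention["email"]]["tier"]}
    for mention in mentions
    if mention["sentiment"] == "negative"
    and loyalty.get(mention["email"], {}).get("tier") == "gold"
]
print(priority)
```

The join itself is trivial once both sides share a reliable key; the hard part, as the article notes, is producing that key from raw, unvetted lake data in the first place.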
Wrangling the raw data lake and combining it with mainstream corporate data presents many opportunities, but it is also a major challenge for hard-pressed data management staff.