Businesses have the perpetual problem of trying to get a grip on their performance. Executives need to have the latest information on their revenue, costs and profitability. They also want these figures segmented by business unit, geography, product line and customer.
The problem is that getting this overall picture is difficult. The typical large company might have several hundred applications deployed globally to capture sales, logistics and supplier data. Customer and product data are scattered across these applications, often with conflicting or inconsistent classifications.
Challenges with corralling data
Corralling all this data and making sense of it has been a thorny problem for decades.
Traditionally, companies copied key data from their transaction systems into a corporate data warehouse. They resolved conflicting definitions, such as mismatched sales or product hierarchies, as the data was loaded.
The data then went through some data cleaning and was funneled into a carefully designed schema and stored in a relational database. Subsets of the database could be spun out into local data marts to satisfy the needs of a specific business unit.
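As a minimal sketch of that load-time reconciliation, the Python below maps product codes from two source systems onto a single corporate category and holds back anything it cannot map. The system names, product codes and the `CATEGORY_MAP` table are hypothetical and used purely for illustration; a real warehouse load would be far more involved.

```python
# Hypothetical mapping from (source system, local product code)
# to one agreed corporate category.
CATEGORY_MAP = {
    ("crm", "SOFT-DRK"): "Beverages",
    ("erp", "BEV"): "Beverages",
    ("erp", "SNK"): "Snacks",
}


def conform(rows):
    """Split rows into those mapped to a corporate category and those rejected."""
    loaded, rejected = [], []
    for row in rows:
        category = CATEGORY_MAP.get((row["source"], row["product_code"]))
        if category is None:
            # Unknown code: hold the row back for manual review.
            rejected.append(row)
        else:
            loaded.append({**row, "category": category})
    return loaded, rejected


rows = [
    {"source": "crm", "product_code": "SOFT-DRK", "revenue": 120.0},
    {"source": "erp", "product_code": "BEV", "revenue": 80.0},
    {"source": "erp", "product_code": "XYZ", "revenue": 5.0},
]
loaded, rejected = conform(rows)
```

Keeping the rejected rows separate, rather than dropping them silently, means the mapping table can be extended when a new source system or product line appears.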
The data modeling and cleaning required time and scarce technical skills, and the carefully designed database schema was inflexible. If the company acquired another firm, it could take months to adapt the data warehouse schema to deal with the data of the newly acquired company.
This inherent time lag meant business users would not always have the up-to-date data they required. Many of them circumvented the IT department and created data feeds they could control.
Data volume strains data lakes
One of the most prominent data lake challenges is sifting through the copious amounts of data.
Web traffic, sensor data and the like could be an order of magnitude higher in volume than traditional sales data, and relational databases struggled to cope with the sheer amount of data, especially at an affordable price. More and more data came from outside the enterprise. Much of it was unstructured, such as documents and images rather than numbers.
This pressure led to the development of big data file systems such as the Hadoop Distributed File System (HDFS), which were designed for very large-scale storage using inexpensive commodity disk storage. The data lake -- using such storage and dealing with raw, unprocessed data -- was born.
However, HDFS is a file system -- not a database -- and lacks the index structures that enable the complex SQL-based queries that relational databases were built for. A data lake may rest on HDFS but can also use NoSQL databases that lack a rigid schema and the strict data consistency of a traditional database.
Challenges with data structure
Data lakes and their raw data are very different from data warehouses that have carefully cleaned, processed and indexed data.
Data lakes complement data warehouses rather than compete with them. A business analyst who wants to run queries on sales performance would hardly know where to start in the dark depths of a data lake, which is the natural preserve of a data scientist who has the skills to navigate uncharted raw data.
In practice, even data scientists can face data lake challenges. In some organizations, there is now an attempt to tame this wild west of raw data by adding a layer of metadata on top of the data lake to catalog it. In some cases, the metadata may add commonly used aggregates and calculations.
This is where the dividing line between a data lake and a data warehouse blurs.
Data warehouses were built to put some structure on top of a chaotic world of raw transactional data. But people now realize that data lakes present many of the same challenges that confronted early data warehouses. To make sense of all the data, you need some structure to know when the various data files were loaded, where they originated from and who loaded them.
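The kind of structure described here can be sketched as a small catalog that records, for each file, when it was loaded, where it came from and who loaded it. This is an illustrative in-memory example only: the `CatalogEntry` and `Catalog` names and their fields are assumptions and do not reflect the API of any particular catalog product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CatalogEntry:
    path: str            # location of the file in the lake
    source_system: str   # where the data originated
    loaded_by: str       # who loaded it
    loaded_at: datetime  # when it was loaded


class Catalog:
    """A toy in-memory catalog keyed by file path."""

    def __init__(self):
        self._entries = {}

    def register(self, path, source_system, loaded_by, loaded_at=None):
        # Default to the current UTC time when no load time is supplied.
        when = loaded_at or datetime.now(timezone.utc)
        entry = CatalogEntry(path, source_system, loaded_by, when)
        self._entries[path] = entry
        return entry

    def from_source(self, source_system):
        """Return every entry that originated from the given system."""
        return [e for e in self._entries.values()
                if e.source_system == source_system]


catalog = Catalog()
catalog.register("raw/web/clicks-001.json", "web", "ingest-job")
catalog.register("raw/erp/orders-001.csv", "erp", "alice")
```

With this in place, a call such as `catalog.from_source("erp")` answers the question of which files came from a given system without opening any of them.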
Other data lake challenges
Data inconsistencies may still need to be resolved when combining different data sets. You also need to impose some control over the data -- e.g., clearly differentiating production data from sandbox data used for testing and experimentation.
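One simple way to keep production and sandbox data apart is to encode the zone in the storage path and check it before data reaches a report. The sketch below assumes a hypothetical layout of `<zone>/<dataset>/<file>`; the zone names and the helper functions are illustrative and not a standard convention.

```python
from pathlib import PurePosixPath

# Assumed layout: "<zone>/<dataset>/<file>", where zone is one of these names.
ZONES = {"production", "sandbox"}


def zone_of(path):
    """Return the zone a lake path belongs to, or raise if it has none."""
    parts = PurePosixPath(path).parts
    if not parts or parts[0] not in ZONES:
        raise ValueError(f"path is outside the known zones: {path}")
    return parts[0]


def production_only(paths):
    """Keep only the paths that are safe to feed into production reports."""
    return [p for p in paths if zone_of(p) == "production"]


paths = [
    "production/sales/2024-01.parquet",
    "sandbox/sales_experiment/sample.csv",
]
```

Raising an error for paths outside the known zones, instead of treating them as sandbox by default, forces every new data feed to be classified explicitly.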
In addition, certain questions need to be answered. Who owns the data sources and feeds? Who is the arbiter when competing versions of product hierarchies are found? The latter is the territory of data governance, which was just as necessary when building corporate data warehouses.
In short, data lake challenges are similar to those found in data warehouses. The underlying storage layer may have changed, but the issues of data governance, security, metadata, data quality and consistency still lurk beneath the surface of the data lake. These areas need to be baked into the design and management of a data lake, just as they were with data warehouses.