Deploying a data lake lets you absorb information from a wide variety of sources in a single repository. But that glut of available info creates challenges -- especially when integrating and preparing data in a consistent way. Big data analytics applications can hit rocks as a result, a fate that metadata management tools are designed to help you avoid.
There's no doubt that a data lake provides an efficient platform for capturing massive amounts of raw data and enabling data scientists and other downstream users to pursue different methods of preparing and analyzing the accumulated data. In addition, storage capacity can be expanded by adding more disk space to the cluster that underlies the data lake, which is often a Hadoop cluster; that in turn makes it easy to capture more and more data for processing and analysis.
The incoming data may be static, with an entire data set delivered at one time. Or it can be dynamic -- data that's continuously streamed and appended to existing files in the data lake. In either case, a broader collection of data clears the way for more advanced processing and analytics. Examples include indexing data for search uses, applying predictive models to data streams and implementing automated processes triggered in real time by predefined events or patterns recognized in a data set.
However, all that varied information, and the various ways in which it's used, puts data integration and preparation processes to the test. The main issue is that a data lake inverts the usual order of data ingestion, transformation and preparation: data is stored first and given structure later, rather than the other way around.
One-size-fits-all view of data
In a traditional data warehouse environment, data sets are initially integrated and stored on a server that's set up as a staging area, where they're standardized, reorganized and configured into a predefined data model. Then, the processed data is loaded into the data warehouse and made available for use. The work is typically managed by the IT department, which usually applies a monolithic set of data integration routines and transformations. This process, referred to as schema-on-write, effectively limits all end users to working with the same interpretation of a particular data set.
That's too much of a constraint for many data analysts; data science team members often have their own ideas about how they want to use data. As a result, it's likely that each one will want to individually assess data sets in their raw formats before devising a target data model for a specific analytics application and engineering the required data transformations.
A data lake architecture gives analysts the ability to impose their own structures and transformations on data sets as needed. This schema-on-read approach provides greater flexibility in using data but poses a risk to data and analytics consistency. It's entirely possible that different users will infer completely different meanings from the same set of data. The risk grows even greater when analytics applications include external data sets whose provenance may not be fully known.
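To make the consistency risk concrete, here is a minimal Python sketch of schema-on-read. Everything in it is hypothetical and chosen for illustration: the raw file contents, the column names, the `read_with_schema` helper and the two analysts' assumptions about units and date order. It isn't tied to any particular data lake product.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw file as it might land in a data lake:
# plain text, with no types or units enforced at write time.
RAW = "id,amount,date\n1,1050,03/04/2023\n2,200,11/12/2023\n"


def read_with_schema(raw_text, schema):
    """Apply a caller-supplied schema (column name -> converter) at read time."""
    rows = csv.DictReader(io.StringIO(raw_text))
    return [
        {col: convert(row[col]) for col, convert in schema.items()}
        for row in rows
    ]


# Analyst A assumes amounts are in cents and dates are month/day/year.
schema_a = {
    "id": int,
    "amount": lambda v: int(v) / 100,
    "date": lambda v: datetime.strptime(v, "%m/%d/%Y").date(),
}

# Analyst B assumes amounts are whole dollars and dates are day/month/year.
schema_b = {
    "id": int,
    "amount": float,
    "date": lambda v: datetime.strptime(v, "%d/%m/%Y").date(),
}

view_a = read_with_schema(RAW, schema_a)
view_b = read_with_schema(RAW, schema_b)

# Same raw bytes, two different "truths" about the first record.
print(view_a[0])
print(view_b[0])
```

Both reads succeed without error, yet the two analysts end up with amounts that differ by a factor of 100 and dates that fall in different months. Nothing in the raw data indicates which reading is correct.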
Self-service data preparation tools provide some relief by standardizing the approaches used to profile, assess and transform raw data. But such tools are often used in a virtual vacuum, with individual analysts still integrating and preparing data independently. That leads to duplicated efforts at best and inconsistent analytical results at worst.
Pair metadata tools with collaboration
The ultimate goal is to reduce confusion, streamline data interpretation and lower the level of effort needed to integrate and prepare data. And that can be accomplished by combining collaboration processes with the use of metadata management tools. When done correctly, maintaining a shared set of metadata definitions can help foster consistent treatment of data among data scientists and other analysts, thereby lowering the risk of conflicting interpretations.
Here's an example of how to facilitate de facto standards for data use: Profile a data set and note any inferences about source data elements, then consult the metadata repository for previously documented impressions and definitions. For a specific data element, if any of the definitions in the repository are consistent with your inferences, select one of them to use in your application.
If none of the definitions are usable, document your observations in the metadata repository and reach out to other analysts to explore the root causes of why people have different views of the raw data. Hopefully, you'll eventually reach a point where there's broad agreement on the proper interpretations.
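The workflow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the API of any real metadata management tool: the `Definition` and `MetadataRepository` classes, the `resolve` function and the rule that two definitions are "consistent" when their data type and meaning match exactly are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    """One analyst's documented interpretation of a data element."""
    element: str
    dtype: str
    meaning: str
    author: str


@dataclass
class MetadataRepository:
    """Hypothetical in-memory stand-in for a shared metadata repository."""
    definitions: dict = field(default_factory=dict)    # element -> list of Definition
    open_questions: list = field(default_factory=list)  # elements needing discussion

    def lookup(self, element):
        return self.definitions.get(element, [])

    def document(self, definition):
        self.definitions.setdefault(definition.element, []).append(definition)


def resolve(repo, inferred):
    """Reuse a consistent definition if one exists; otherwise document the
    new observation and flag the element when it conflicts with earlier ones."""
    existing = repo.lookup(inferred.element)
    for candidate in existing:
        # Assumed consistency rule: same data type and same stated meaning.
        if (candidate.dtype, candidate.meaning) == (inferred.dtype, inferred.meaning):
            return candidate, "reused"
    conflicts_with_earlier = bool(existing)
    repo.document(inferred)
    if conflicts_with_earlier:
        repo.open_questions.append(inferred.element)
    return inferred, "documented"


repo = MetadataRepository()
repo.document(Definition("amount", "integer", "price in cents", "analyst_a"))

# A second analyst reaches the same conclusion and reuses the definition.
chosen, status = resolve(
    repo, Definition("amount", "integer", "price in cents", "analyst_b")
)

# A third analyst disagrees; the observation is recorded and flagged.
other, other_status = resolve(
    repo, Definition("amount", "float", "price in dollars", "analyst_c")
)
```

In this sketch, the `open_questions` list stands in for the step of reaching out to other analysts: it only records which elements have competing interpretations, and people still have to settle them.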
To help you get there, modern metadata management tools are being equipped with more sophisticated methods of facilitating collaboration. For instance, many tools now support discussion threads to use when sharing current info and historical context on how data is being integrated, prepared and used.
Aligning data integration and preparation steps with the corresponding metadata definitions also provides a sanity check for ensuring that data is interpreted and used in the same way across applications. And that consistency will go a long way toward making your data lake a reliably productive platform.