The concept of the data lake originated with big data's emergence as a core asset for companies and Hadoop's arrival as a platform for storing and managing the data. However, blindly plunging into a Hadoop data lake implementation won't necessarily bring your organization into the big data age -- at least, not in a successful way.
That's particularly true in cases where data assets of all shapes and sizes are funneled into a Hadoop environment or another big data repository in an ungoverned manner. A haphazard approach of this sort leads to several challenges and problems that can severely hamper the use of a data lake to support big data analytics applications.
For example, you might not be able to document what data objects are stored in a data lake or their sources and provenance. That makes it difficult for data scientists and other analysts to find relevant data distributed across a Hadoop cluster, and for data managers to track who accesses particular data sets and determine what level of access privileges each data set requires.
Organizing data and "bucketing" similar data objects together to help ease access and analysis is also challenging if you don't have a well-managed process.
None of these issues have to do with the physical architecture of the data lake or the underlying environment, whether that's the Hadoop Distributed File System or a cloud object store like Amazon Simple Storage Service -- or a combination of those technologies, each containing different types of data. Rather, the biggest impediments to a successful data lake implementation result from inadequate planning for, and oversight of, how the data is managed.
Do what needs doing with Hadoop data
The good news, however, is that these challenges can be overcome. Here are seven steps to address and avoid them:
- Create a taxonomy of data classifications. Organizing data objects in a data lake depends on how they're classified. Identify the key dimensions of the data as part of your classifications, such as data type, content, usage scenarios, groups of possible users and data sensitivity. The last of these relates to protecting both personal and corporate data -- such as personally identifiable information on customers in the first case and intellectual property in the second.
- Design a proper data architecture. Apply the defined classification taxonomy to direct how the data is organized in your Hadoop environment. The resulting plan should include things like file hierarchy structures for data storage, file and folder naming conventions, access methods and controls for different data sets, and mechanisms for guiding data distribution.
- Employ data profiling tools. In many cases, the absence of knowledge about all the data going into a data lake can be partially alleviated by analyzing its content. Data profiling tools can help by gathering information about what's in data objects, thereby providing insight for classifying them. Profiling data as part of a data lake implementation also aids in identifying data quality issues that should be assessed for possible fixes to make sure data scientists and other analysts are working with accurate information.
- Standardize the data access process. Difficulties in effectively using data sets stored in a Hadoop data lake often stem from different analytics teams using a variety of data access methods, many of them undocumented. Instituting a common and straightforward API instead can simplify data access and ultimately allow more users to take advantage of the data.
- Develop a searchable data catalog. A more insidious obstacle to effective data access and usage arises when prospective users don't know what's in a data lake, where different data sets are located in the Hadoop environment, or what the lineage, quality and currency of the data are. A collaborative data catalog allows these -- and other -- details about each data asset to be documented. For example, it captures structural and semantic metadata, provenance and lineage records, information on access privileges and more. A data catalog also provides a forum for groups of users to share experiences, issues and advice on working with the data.
- Implement sufficient data protections. Aside from the conventional aspects of IT security, such as network-perimeter defenses and role-based access controls, utilize other methods to prevent the exposure of sensitive information contained in a data lake. That includes mechanisms like data encryption and data masking, along with automated monitoring to generate alerts about unauthorized data access or transfers.
- Raise data awareness internally. Finally, make sure that the users of your data lake are aware of the need to actively manage and govern the data assets it contains. Train them on how to use the data catalog to find available data sets and how to configure analytics applications to access the data they need. At the same time, impress upon them the importance of proper data usage and strong data quality.
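To make several of these steps more concrete, here is a minimal Python sketch of how a classification taxonomy could drive folder naming, a searchable catalog, basic profiling and masking. It is an illustration under stated assumptions, not a reference design: the class and function names (Classification, CatalogEntry, DataCatalog, profile_column, mask_value), the three sensitivity levels, the /lake/... path convention and the salt value are all hypothetical, and a real deployment would more likely rely on dedicated catalog, profiling and security tools.

```python
import hashlib
from dataclasses import dataclass, field
from enum import Enum


class Sensitivity(Enum):
    # Assumed sensitivity levels; a real taxonomy would define its own.
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"  # personally identifiable information


@dataclass(frozen=True)
class Classification:
    """A few of the taxonomy dimensions named in the first step."""
    data_type: str            # e.g. "parquet", "json"
    content: str              # subject area, e.g. "customers"
    sensitivity: Sensitivity


@dataclass
class CatalogEntry:
    name: str
    source: str                      # provenance: the originating system
    classification: Classification
    tags: set = field(default_factory=set)

    def storage_path(self) -> str:
        """Derive a folder path from the classification (assumed convention)."""
        c = self.classification
        return f"/lake/{c.sensitivity.value}/{c.content}/{c.data_type}/{self.name}"


class DataCatalog:
    """In-memory stand-in for a searchable data catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list:
        """Return sorted names of entries whose name, content or tags match."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.classification.content.lower()
            or any(kw in t.lower() for t in e.tags)
        )


def profile_column(values: list) -> dict:
    """Tiny profiling pass: count rows, missing values and distinct values."""
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }


def mask_value(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive value with a stable pseudonym (salted SHA-256 prefix)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


if __name__ == "__main__":
    catalog = DataCatalog()
    entry = CatalogEntry(
        name="crm_accounts",
        source="crm",
        classification=Classification("parquet", "customers", Sensitivity.PII),
        tags={"sales"},
    )
    catalog.register(entry)
    print(entry.storage_path())
    print(catalog.search("customers"))
    print(profile_column(["a", None, "a", "b", ""]))
    print(mask_value("alice@example.com"))
```

Deriving the storage path from the classification keeps the folder hierarchy and the taxonomy from drifting apart, and routing every lookup through a single search method is one simple way to give analytics teams a common access point.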
To meet the ultimate objective of making a data lake accessible and usable, it's crucial to have a well-designed plan for dealing with the data prior to migrating it into your Hadoop environment or cloud-based big data architecture. Taking the steps outlined here will help streamline the data lake implementation process. More important, the right combination of planning, organization and governance will help maximize your organization's investment in a data lake and reduce the risk of a failed deployment.