E-Handbook: Data lake concept needs firm hand to pay big data dividends Article 2 of 3

7 steps to a successful data lake implementation

Flooding a Hadoop cluster with data that isn't well organized and managed can stymie analytics efforts. Take these steps to help make your data lake accessible and usable.

By David Loshin, Knowledge Integrity Inc.

Published: 08 Oct 2019

The concept of the data lake originated with big data's emergence as a core asset for companies and Hadoop's arrival as a platform for storing and managing the data. However, blindly plunging into a Hadoop data lake implementation won't necessarily bring your organization into the big data age -- at least, not in a successful way.

That's particularly true in cases where data assets of all shapes and sizes are funneled into a Hadoop environment or another big data repository in an ungoverned manner. A haphazard approach of this sort leads to several challenges and problems that can severely hamper the use of a data lake to support big data analytics applications.

For example, you might not be able to document what data objects are stored in a data lake or their sources and provenance. That makes it difficult for data scientists and other analysts to find relevant data distributed across a Hadoop cluster, and for data managers to track who accesses particular data sets and determine what level of access privileges is needed for them.

Organizing data and "bucketing" similar data objects together to help ease access and analysis is also challenging if you don't have a well-managed process.

None of these issues has to do with the physical architecture of the data lake or the underlying environment, whether that's the Hadoop Distributed File System or a cloud object store like Amazon Simple Storage Service -- or a combination of those technologies, each containing different types of data. Rather, the biggest impediments to a successful data lake implementation result from inadequate planning and oversight of how the data is managed.

Do what needs doing with Hadoop data

The good news, however, is that these challenges are easily overcome. Here are seven steps to address and avoid them:

1. Create a taxonomy of data classifications. Organizing data objects in a data lake depends on how they're classified. Identify the key dimensions of the data as part of your classifications, such as data type, content, usage scenarios, groups of possible users and data sensitivity. The last of these relates to protecting both personal and corporate data -- such as personally identifiable information on customers in the first case and intellectual property in the second.
2. Design a proper data architecture. Apply the defined classification taxonomy to direct how the data is organized in your Hadoop environment. The resulting plan should include things like file hierarchy structures for data storage, file and folder naming conventions, access methods and controls for different data sets, and mechanisms for guiding data distribution.
3. Employ data profiling tools. In many cases, the absence of knowledge about all the data going into a data lake can be partially alleviated by analyzing its content. Data profiling tools can help by gathering information about what's in data objects, thereby providing insight for classifying them. Profiling data as part of a data lake implementation also helps identify data quality issues that should be assessed for possible fixes, so that data scientists and other analysts are working with accurate information.
4. Standardize the data access process. Difficulties in effectively using data sets stored in a Hadoop data lake often stem from different analytics teams using a variety of data access methods, many of them undocumented. Instituting a common and straightforward API instead can simplify data access and ultimately allow more users to take advantage of the data.
5. Develop a searchable data catalog. A more insidious obstacle to effective data access and usage arises when prospective users are unaware of what's in a data lake and where different data sets are located in the Hadoop environment, or lack information about data lineage, quality and currency. A collaborative data catalog allows these -- and other -- details about each data asset to be documented. For example, it captures structural and semantic metadata, provenance and lineage records, information on access privileges and more. A data catalog also provides a forum for groups of users to share experiences, issues and advice on working with the data.
6. Implement sufficient data protections. Aside from the conventional aspects of IT security, such as network-perimeter defenses and role-based access controls, use other methods to prevent the exposure of sensitive information contained in a data lake. That includes mechanisms like data encryption and data masking, along with automated monitoring to generate alerts about unauthorized data access or transfers.
7. Raise data awareness internally. Finally, make sure that the users of your data lake are aware of the need to actively manage and govern the data assets it contains. Train them on how to use the data catalog to find available data sets and how to configure analytics applications to access the data they need. At the same time, impress upon them the importance of proper data usage and strong data quality.
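To make the taxonomy and architecture steps more concrete, here is a minimal Python sketch of how a classification record might drive a folder naming convention. The class names, the sensitivity levels, the `/lake` root and the `load_date=` folder pattern are all illustrative assumptions, not a convention prescribed by Hadoop or by any particular tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Sensitivity(Enum):
    # Hypothetical sensitivity levels; adapt to your own governance policy.
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"                    # personal data, e.g. customer details
    CORPORATE_IP = "corporate-ip"  # intellectual property


@dataclass(frozen=True)
class Classification:
    # One field per classification dimension named in the taxonomy step.
    data_type: str                # e.g. "csv", "json", "parquet"
    content: str                  # subject area, e.g. "customer-orders"
    usage: str                    # usage scenario, e.g. "churn-analysis"
    user_groups: Tuple[str, ...]  # groups of possible users
    sensitivity: Sensitivity


def storage_path(c: Classification, dataset: str, load_date: str) -> str:
    """Derive a folder path from a classification (assumed convention)."""
    parts = [
        "/lake",
        c.sensitivity.value,   # top level split by sensitivity eases access control
        c.content,
        c.data_type,
        dataset,
        f"load_date={load_date}",
    ]
    return "/".join(parts)


orders = Classification(
    data_type="parquet",
    content="customer-orders",
    usage="churn-analysis",
    user_groups=("data-science",),
    sensitivity=Sensitivity.PII,
)
print(storage_path(orders, "orders_2019", "2019-10-08"))
```

Putting sensitivity at the top of the hierarchy is only one possible design choice; it tends to make coarse-grained access controls simpler, at the cost of scattering a subject area across several branches.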
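For the data catalog step, the following sketch shows the basic idea of registering data sets with their location, provenance and currency, letting users attach notes, and searching by keyword. It is an in-memory toy under assumed names (`CatalogEntry`, `DataCatalog`, the sample paths), not the data model of any real catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    name: str
    location: str        # where the data set lives in the lake
    description: str     # semantic metadata in free text
    source: str          # provenance: originating system
    last_updated: str    # currency, as an ISO date string
    tags: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)  # shared user comments


class DataCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def add_note(self, name: str, note: str) -> None:
        """Let users share experiences and advice about a data set."""
        self._entries[name].notes.append(note)

    def search(self, keyword: str) -> List[str]:
        """Return names of entries whose name, description or tags match."""
        k = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if k in e.name.lower()
            or k in e.description.lower()
            or any(k in t.lower() for t in e.tags)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders_2019",
    location="/lake/pii/customer-orders/parquet/orders_2019",
    description="Customer orders exported from the order management system",
    source="order-management",
    last_updated="2019-10-08",
    tags=["sales", "customer"],
))
catalog.register(CatalogEntry(
    name="web_clickstream",
    location="/lake/internal/clickstream/json/web_clickstream",
    description="Raw page view events from the public website",
    source="web-analytics",
    last_updated="2019-10-07",
    tags=["marketing"],
))
catalog.add_note("orders_2019", "Order dates are in UTC, not local time.")
print(catalog.search("customer"))
```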
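Data masking, mentioned in the data protection step, can take several forms. One common approach is pseudonymization with a keyed hash, so that a sensitive value is replaced by a stable token that still supports joins and counts. The sketch below uses Python's standard `hmac` and `hashlib` modules; the field names, the 16-character token length and the hard-coded key are assumptions for illustration only, and a real deployment would load the key from a secrets manager.

```python
import hashlib
import hmac


def mask_value(value: str, key: bytes) -> str:
    """Replace a sensitive value with a keyed, deterministic token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened token; the length is an arbitrary choice


def mask_record(record: dict, sensitive_fields: set, key: bytes) -> dict:
    """Return a copy of the record with the sensitive fields masked."""
    return {
        name: mask_value(str(value), key) if name in sensitive_fields else value
        for name, value in record.items()
    }


# Hypothetical key and record, for demonstration only.
demo_key = b"replace-with-a-managed-secret"
customer = {"customer_id": 42, "email": "a@example.com", "country": "US"}
print(mask_record(customer, {"email"}, demo_key))
```

Because the same input and key always yield the same token, analysts can still group and join on the masked column without seeing the underlying value.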

To meet the ultimate objective of making a data lake accessible and usable, it's crucial to have a well-designed plan for dealing with the data prior to migrating it into your Hadoop environment or cloud-based big data architecture. Taking the steps outlined here will help streamline the data lake implementation process. More important, the right combination of planning, organization and governance will help maximize your organization's investment in a data lake and reduce the risk of a failed deployment.
