E-Handbook: Enterprise data lakes hold the key to actionable insights Article 2 of 4

Key factors for successful data lake implementation

There are many important parts to a data lake implementation, from technology to governance. Read on for the top factors to evaluate in your implementation strategy.

By Chris Foot

Published: 06 Jul 2020

In addition to the business drivers behind the growth of data lakes, the cloud's ability to offer vast amounts of storage and processing power at ever-decreasing price points is making data lake platforms increasingly attractive to organizations of all sizes.

Data lake implementation continues to capture the attention of the IT community. A recent analysis report from Research and Markets forecasts that the data lake market will grow at a 26% compound annual growth rate (CAGR), reaching $20.1 billion by 2024.
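To make the growth figure concrete, here is a minimal sketch of the compounding arithmetic behind a CAGR. The 26% rate and the $20.1 billion end value come from the forecast above; the five-year window is an assumption for illustration only, because the base year of the forecast is not stated here, and the function names are hypothetical.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value forward at a constant annual growth rate."""
    return start_value * (1.0 + cagr) ** years


def implied_start_value(end_value: float, cagr: float, years: int) -> float:
    """Invert the compounding to recover the starting value."""
    return end_value / (1.0 + cagr) ** years


# The 26% rate and $20.1 billion end value are the figures cited above.
# ASSUMPTION: a five-year window; the forecast's base year is not given.
ASSUMED_YEARS = 5
start = implied_start_value(20.1, 0.26, ASSUMED_YEARS)
print(f"Implied starting market size: ${start:.1f} billion")

# Growth multiple after three years at 26% per year.
print(f"Multiple after 3 years: {project_value(1.0, 0.26, 3):.2f}x")
```

At 26% a year, a value roughly doubles every three years, which is a useful rule of thumb when reading forecasts like this one.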

If your organization is considering a data lake implementation, here are the main factors to weigh.

What is a data lake?

An easy way to define and better understand data lakes is to compare them to data warehouses. Although data warehouses and data lakes are both used to store large amounts of data, there are significant differences.

Organizations can use data lake information in many ways, and the data sources do not need a predefined purpose to qualify for ingestion into a data lake. Analysts explore, experiment and evaluate data lake information to identify its benefits and use cases. Meanwhile, data warehouses ingest and store data for a predetermined purpose.

Data warehouse specialists often perform a high level of analysis to evaluate and identify input sources. But the strategy for a data lake implementation is to ingest and analyze data from virtually any system that generates information.

Data warehouses use predefined schemas to ingest data. In a data lake, analysts apply schemas after the ingestion process is complete.

Data lakes store data in its raw form. As a result, data ingestion is a fairly uncomplicated process. In a data warehouse, data is heavily processed during ingestion to ensure it adheres to the schema and its predefined purpose.
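The difference between applying a schema at ingestion and applying it afterward can be sketched in a few lines. This is an illustration only, not any vendor's API: the `ORDER_SCHEMA` fields, the in-memory lists standing in for storage, and all function names are assumptions.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if the record has every schema field with the right type."""
    return all(isinstance(record.get(name), typ) for name, typ in schema.items())


def warehouse_ingest(table: list, record: dict, schema: dict) -> None:
    """Schema-on-write: reject nonconforming records at load time."""
    if not conforms(record, schema):
        raise ValueError(f"record does not match schema: {record}")
    table.append({name: record[name] for name in schema})


def lake_ingest(raw_store: list, payload: str) -> None:
    """Raw ingestion: store the payload untouched, with no validation."""
    raw_store.append(payload)


def lake_read(raw_store: list, schema: dict) -> list:
    """Schema-on-read: parse and project records only when they are queried."""
    rows = []
    for payload in raw_store:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # unparseable payloads stay in the lake but are skipped here
        if isinstance(record, dict) and conforms(record, schema):
            rows.append({name: record[name] for name in schema})
    return rows


raw_store = []
lake_ingest(raw_store, '{"order_id": 1, "amount": 9.5, "note": "gift"}')
lake_ingest(raw_store, '{"sensor": "t-17", "reading": 21.4}')  # no purpose defined yet
lake_ingest(raw_store, "free-form log line")

orders = lake_read(raw_store, ORDER_SCHEMA)
print(orders)
print(len(raw_store), "raw payloads retained")
```

The lake accepts all three payloads without complaint, and the order schema is applied only when someone asks for orders. The warehouse-style loader would have rejected the sensor record and the log line at load time.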

Data lakes specialize in ingesting structured, semistructured and unstructured data. They also provide mechanisms to easily ingest streaming data in addition to batch loads. Although data warehouses can accept many different forms of data, they usually ingest structured data using batch loads.
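One common way to handle both streams and batch loads with the same ingestion path is to group incoming events into small batches. The sketch below assumes an in-memory list as the raw store and a generator as a stand-in for a live stream; the function names and batch sizes are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a (possibly unbounded) event stream into fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest(raw_store: List[str], events: Iterable[str], batch_size: int) -> int:
    """Append events to the raw store batch by batch; return the batch count."""
    count = 0
    for batch in micro_batches(events, batch_size):
        raw_store.extend(batch)
        count += 1
    return count


raw_store: List[str] = []

# A generator stands in for a live stream, consumed in micro batches of 3.
stream = (f"event-{i}" for i in range(7))
stream_batches = ingest(raw_store, stream, batch_size=3)

# A conventional batch load arrives as one large chunk.
bulk_batches = ingest(raw_store, ["a", "b", "c", "d"], batch_size=1000)

print(stream_batches, bulk_batches, len(raw_store))
```

Because the batching function pulls from an iterator, it never needs the whole stream in memory, which is the property that matters when the source does not end.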

How to get started

The first step in data lake implementation is to learn more about data lake architectures, platforms, products and workflows through vendor websites and other resources.

As with any product evaluation, your organization will need to perform a thorough analysis of the competing offerings. Here is a starter list of evaluation criteria to help your analysis:

Technology. Although Apache Hadoop and its suite of supporting products have been the perennial favorites for many organizations, there is a growing number of alternatives. Many vendors that use Hadoop for their data lake offerings provide their own customizations and edge products to simplify, streamline and facilitate administration and analysis.

There is a wide range of platforms available, including Amazon Data Lake Solutions, Microsoft Azure Data Lake, Google Data Lakes, Snowflake for Data Lakes and Oracle Data Lake.

Security and access control. Data lakes hold a treasure trove of information about your business. As with all enterprise data stores, you will need to protect data lakes against unauthorized access.

Data ingestion. Does the platform easily and quickly ingest structured, semistructured and unstructured data? Is it capable of efficiently ingesting data streams, micro batch and mega batch data loads?

Metadata management. Big data specialists use metadata to search, identify and better understand the data sets that are in the data lake. How does the platform capture and store metadata?

Data processing, performance and scalability. What tools and processes does the platform offer users to interact with the data? How does it enable data exploration? What background processes does it execute during the course of daily operations? How fast are those processes and will they scale to meet your workload requirements?

Management and monitoring. Does the platform provide a strong UI for system administration and monitoring? What workload management capabilities does it offer?

Data governance. Does the platform offer mechanisms to ensure the data is consistent and reliable? Does it provide the ability to create sandbox environments that allow users to experiment with data without affecting the contents of the data lake?

Data analysis and accessibility. What mechanisms does the platform provide to analyze the data? Does it allow you to easily incorporate machine learning? What data analytics features does it offer to consumers? Can you easily integrate third-party analysis tools?

Costing strategies. How will the vendor charge you?
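For the metadata management criterion above, it can help to picture what a platform's catalog has to do at minimum: record descriptive facts about each data set and let specialists search them. The following is a toy sketch under that assumption, not a real product's interface; the class names, fields and example data sets are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DatasetEntry:
    """Descriptive metadata for one data set in the lake."""
    name: str
    source_system: str
    data_format: str
    tags: Set[str] = field(default_factory=set)


class Catalog:
    """A minimal in-memory metadata catalog keyed by data set name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        """Add or overwrite the metadata for a data set."""
        self._entries[entry.name] = entry

    def search(self, tag: str) -> List[str]:
        """Return the names of all data sets carrying the tag, sorted."""
        return sorted(
            name for name, entry in self._entries.items() if tag in entry.tags
        )


catalog = Catalog()
catalog.register(DatasetEntry("web_clicks", "web server", "json", {"customer", "stream"}))
catalog.register(DatasetEntry("invoices", "ERP", "csv", {"finance", "customer"}))
catalog.register(DatasetEntry("call_audio", "call center", "wav", {"unstructured"}))

print(catalog.search("customer"))
```

When you evaluate a platform, the questions to ask are how much of this metadata it captures automatically at ingestion and how much your staff must enter by hand.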

Data lake implementation

After platform selection, the next step is to build the organizational infrastructure, processes and procedures to load, govern, administer and analyze data in the data lake.

These are the key steps in a data lake implementation strategy:

Identify the expertise you need to effectively support the platform and analyze the data. Like many complex technologies, data lakes have a steep learning curve. Hire experienced personnel and train internal staff. Your organization will also need to define new roles and reporting structures as part of the data lake implementation.
To execute a well-thought-out data lake implementation strategy and design, your organization will need to develop a traditional project plan with goals, milestones and assigned action items. You will need to identify the criteria your organization will use to evaluate the success of the data lake project. Design the system to foster self-service data analysis. You should also develop data classification standards for data storage and archival.
Virtually any data the organization generates is a potential source for data lake ingestion. The challenge becomes one of prioritization. A good approach is to evaluate the source that generates the data and identify its importance to the organization at a high level.
You should determine whether the information is currently being analyzed and how much analysis is occurring. Highly analyzed data, although still a potential source for ingestion, may have a lower priority than data from a system that is not being evaluated.
Develop, implement and enforce data governance strategies to ensure the data is secure, complete, consistent and accurate.
Establish standards for data exploration, experimentation and analysis. Data scientists should follow a standardized but flexible process to evaluate the data and identify the use cases that will generate the most value to the business. Potential targets for the data are other BI platforms and new and existing business applications.
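The source-prioritization step above can be sketched as a simple scoring exercise. The scales, the weighting formula and the example sources below are assumptions for illustration; your organization would substitute its own criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    name: str
    business_importance: int  # 1 (low) to 5 (high), assigned by the organization
    analysis_level: int       # 0 (not analyzed) to 5 (heavily analyzed)


def priority(source: Source) -> int:
    """Higher importance raises priority; existing analysis lowers it.

    ASSUMPTION: importance is weighted twice as heavily as analysis level.
    """
    return source.business_importance * 2 - source.analysis_level


def rank(sources: List[Source]) -> List[str]:
    """Return source names ordered from highest to lowest ingestion priority."""
    ordered = sorted(sources, key=priority, reverse=True)
    return [s.name for s in ordered]


# Hypothetical candidate sources.
candidates = [
    Source("sales_db", business_importance=5, analysis_level=5),
    Source("clickstream", business_importance=4, analysis_level=0),
    Source("hr_files", business_importance=2, analysis_level=1),
]

print(rank(candidates))
```

In this example the clickstream data outranks the sales database even though it is less important on paper, because nobody is analyzing it yet.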
