10 big data challenges and how to address them

Bringing a big data initiative to fruition requires an array of data skills and best practices. Here are 10 big data challenges enterprises must be ready for.

A well-executed big data strategy can lower operational costs, reduce time to market and enable new products. But enterprises face a variety of big data challenges in moving initiatives from boardroom discussions to practices that work.

IT and data professionals need to build out the physical infrastructure for moving data from different sources and between multiple applications. They also need to meet requirements for performance, scalability, timeliness, security and governance. In addition, implementation costs must be considered upfront, as they can quickly spiral out of control.

Perhaps most importantly, enterprises need to figure out how and why big data matters in the first place.

"One of the greatest challenges around big data projects comes down to successfully applying the insights captured," said Bill Szybillo, business intelligence manager at VAI, an ERP solutions provider.

Today, many applications and systems are capturing data, he explained, but organizations are struggling to understand what is valuable, and from there, to apply those insights in an impactful way.

Here are 10 big data challenges enterprises should be aware of and some pointers on how to address them.

1. Managing large volumes of data

Big data by its very definition involves large volumes of data housed in disparate systems and platforms. Szybillo said the first challenge for enterprises is consolidating the extremely large data sets extracted from CRM, ERP and other systems into a single manageable data warehouse.

Once you have a sense of the data you are capturing, it becomes easier to narrow in on insights by making small adjustments, he said. Plan for an infrastructure that allows for incremental changes. Attempting big changes may end up creating new problems.

Figure: Big data challenges in 2021. To shepherd a big data initiative from boardroom discussions to business insights, enterprises face a number of challenges.

2. Finding and fixing data quality issues

The analytics and artificial intelligence built on big data can generate bad results when quality issues creep into big data infrastructure. These problems can become more significant and harder to audit as teams attempt to pull in more and different types of data. Bunddler, an online marketplace for finding web shipping assistants and companies, experienced these problems firsthand as it scaled to 500,000 customers. A key growth driver for the company was the use of big data to provide a highly personalized experience, reveal upselling opportunities and monitor new trends. Data quality was a key concern.

"You need to monitor and fix any data quality issues constantly," said Pavel Kovalenko, CEO of Bunddler. Duplicates and typos are common, he said, especially when data comes from different sources. To ensure the quality of the data it collected, Kovalenko's team created an intelligent data identifier that matched duplicate records differing only by minor variations and reported any possible typos. This has improved the quality of the company's big data and, with it, the accuracy of its business insights.
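The article does not describe how Bunddler's identifier works. As a rough illustration of the general idea only, the Python sketch below flags pairs of records that are nearly identical, using the standard library's difflib module. The function names, the sample records and the 0.9 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_likely_duplicates(records, threshold=0.9):
    """Return (index_a, index_b, score) for every pair at or above threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


# Made-up sample data: the second record repeats the first with a typo.
customers = [
    "Jane Doe, 12 Harbor Street, Boston",
    "jane doe, 12 Harbor Stret, Boston",
    "John Smith, 4 Elm Road, Denver",
]
duplicates = find_likely_duplicates(customers)
for i, j, score in duplicates:
    print(f"Records {i} and {j} look like duplicates (score {score})")
```

Comparing every pair of records grows quadratically with data volume, so a production system would likely group records by a key such as postal code first and compare only within each group.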

3. Dealing with data integration and preparation complexities

Big data platforms solve the problem of collecting and safely storing large amounts of data of different types, and of quickly retrieving the data that's needed. But the data collection process can still be very challenging, said Rosaria Silipo, Ph.D., principal data scientist at KNIME, a data analytics platform.

The integrity of an enterprise's data stores depends on their being constantly updated. This requires maintaining access to a variety of data sources and having dedicated big data integration strategies.

Some enterprises use data lakes as a catch-all for data collected from diverse sources of truth, without thinking through how the disparate data will be integrated. Various business domains, for example, produce data that is important for joint analysis, but this data often comes with different underlying semantics that must be disambiguated. Silipo cautions against ad hoc integration for projects, which can involve a lot of rework. For the optimal ROI on big data projects, it's generally better to develop a strategic approach for big data integration.

4. Scaling big data systems efficiently and cost effectively

Enterprises can waste a lot of money storing big data if they don't have a strategy for how they want to use it. Organizations need to understand that big data analytics starts at the ingestion of data, said George Kobakhidze, head of enterprise solutions at ZL Tech, a business analytics firm. Curating enterprise repositories also requires consistent retention policies to cycle out old information, especially now that pre-COVID data rarely reflects today's market accurately.
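A retention policy of this kind can be expressed as a simple rule applied at regular intervals. The Python sketch below is one minimal way to do that, not a description of any vendor's product; the category names, retention windows and record layout are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows, in days, for each data category.
RETENTION_DAYS = {"clickstream": 90, "orders": 365 * 7}


def split_by_retention(records, now):
    """Separate records to keep from those past their retention window."""
    keep, expire = [], []
    for record in records:
        max_age = timedelta(days=RETENTION_DAYS[record["category"]])
        if now - record["ingested_at"] > max_age:
            expire.append(record)
        else:
            keep.append(record)
    return keep, expire


# A fixed "now" keeps the example reproducible.
now = datetime(2021, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "category": "clickstream",
     "ingested_at": datetime(2021, 5, 1, tzinfo=timezone.utc)},
    {"id": 2, "category": "clickstream",
     "ingested_at": datetime(2020, 1, 15, tzinfo=timezone.utc)},
    {"id": 3, "category": "orders",
     "ingested_at": datetime(2020, 1, 15, tzinfo=timezone.utc)},
]
keep, expire = split_by_retention(records, now)
print("keep:", [r["id"] for r in keep])
print("expire:", [r["id"] for r in expire])
```

Keeping the windows in one table per category makes the policy easy to review and apply consistently across repositories.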

Thus, teams should plan out the types, schemas and uses of data. This is easier said than done, said Travis Rehl, vice president of product at CloudCheckr, a cloud management platform.

"Oftentimes you start from one data model and expand out but quickly realize the model doesn't fit your new data points and you suddenly have technical debt you need to resolve," he said.

A generic data lake with the appropriate data structure can make it easier to reuse data efficiently and cost effectively. For example, Parquet files often provide a better performance-to-cost ratio than CSV dumps within a data lake.

5. Evaluating and selecting big data technologies

Teams have a wide range of big data technologies to choose from, and these can often overlap in terms of capability.

Lenley Hensarling, chief strategy officer at database company Aerospike, recommends teams start by considering current and future needs of streaming and batch sources, such as mainframe, cloud applications and third-party data services. Enterprise-grade streaming platforms to consider include Apache Kafka, Apache Pulsar, AWS Kinesis and Google Pub/Sub -- all of which provide seamless movement of data between on-premises, hybrid or cloud systems, he said.

Next, teams should start evaluating the complex data preparation capabilities required to feed AI/machine learning and other advanced analytics systems. It's also important to plan for where the data might be processed. For circumstances where latency is an issue, teams need to consider how to run analytics and AI on edge servers, and how to make it easy to update these models. These capabilities need to be balanced against the cost of deploying and managing the equipment and applications run on premises, in the cloud or on the edge.

6. Generating business insights

It's tempting for data teams to focus on the technology of big data, rather than outcomes. Silipo has found that much less attention is placed on what to do with the data.

Generating valuable business insights requires considering scenarios like creating KPI-based reports, identifying useful predictions or making different types of recommendations.

These efforts will require input from a mix of business analytics professionals, statisticians and machine learning experts. Silipo has found that pairing this team with the big data engineering team can make a difference in increasing the ROI of setting up a big data environment.

7. Hiring and retaining workers with big data skills

"One of the biggest challenges regarding big data software development is finding and retaining the workers with big data skills," said Mike O'Malley, senior vice president of strategy at SenecaGlobal, an IT expert sourcing firm.

This talent shortage is not likely to go away soon. A report from S&P Global found that cloud architects and data scientists are among the most in-demand positions in 2021. One strategy for filling these positions is to partner with software development services companies that have already built out these talent pools.

Another strategy is to work with HR to identify and address any gaps in existing big data talent, said Pablo Listingart, founder and owner of ComIT, a charity that provides free IT training.

"Many big data initiatives fail because of incorrect expectations and faulty estimations that are carried forward from the beginning of the project to the end," he said. The right team will be able to estimate risks, evaluate severity and implement solutions for a variety of big data challenges.

It's also important to establish a culture for attracting and retaining the right talent. Vojtech Kurka, CTO at Meiro, a customer data platform, said he started off imagining he could solve every data problem with a few SQL and Python scripts in the right place. Over time, he realized he could get a lot further by hiring the right people and promoting a safe company culture that keeps people happy and motivated.

8. Keeping costs from getting out of control

Another common big data challenge is what David Mariani, founder and CTO of AtScale, a data integration company, refers to as the "cloud bill heart attack." Many enterprises estimate the costs of their new big data infrastructure using existing data consumption metrics. That is a mistake.

One issue is that companies underestimate the sheer demand that expanded access to richer data sets creates. The cloud makes it easier for users to access data and for big data platforms to surface richer, more granular data. This capability drives demand, which in turn drives costs, since the cloud will elastically scale to meet user demand.

Using an on-demand pricing model can also drive up cost. "I've seen several customers where users have written $10,000 queries due to poorly designed SQL," Mariani said.

One good practice is to opt for fixed resource pricing, but that won't completely solve the problem. Although the meter stops at a fixed amount, poorly written apps may still consume resources that other users and workloads need. So, another good practice is to implement fine-grained controls over queries.
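One form such a control could take is a cost check that runs before a query is executed. The Python sketch below is a minimal illustration under stated assumptions: the price per tebibyte scanned, the spending limit and the QueryPlan structure are hypothetical, and real platforms expose their own estimation and quota mechanisms.

```python
from dataclasses import dataclass

TIB = 1024 ** 4              # bytes in one tebibyte
PRICE_PER_TIB_USD = 5.00     # hypothetical on-demand price per TiB scanned


@dataclass
class QueryPlan:
    user: str
    estimated_bytes_scanned: int


def estimated_cost_usd(plan: QueryPlan) -> float:
    """Estimate the on-demand cost of a query from the bytes it would scan."""
    return plan.estimated_bytes_scanned / TIB * PRICE_PER_TIB_USD


def authorize(plan: QueryPlan, per_query_limit_usd: float):
    """Return (allowed, cost); a query over the limit is blocked before it runs."""
    cost = estimated_cost_usd(plan)
    return cost <= per_query_limit_usd, cost


small = QueryPlan(user="analyst", estimated_bytes_scanned=2 * TIB)
huge = QueryPlan(user="analyst", estimated_bytes_scanned=2000 * TIB)

for plan in (small, huge):
    allowed, cost = authorize(plan, per_query_limit_usd=25.00)
    status = "run" if allowed else "blocked"
    print(f"{plan.user}: estimated ${cost:,.2f} -> {status}")
```

Under these assumed prices, the second query is the kind of $10,000 query Mariani describes, and the check stops it before any charge is incurred.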

CloudCheckr's Rehl also recommends that teams raise the issue of cost in their discussions with business and data engineering teams. The business is responsible for defining what it is asking for, software developers are responsible for delivering the data in an efficient format, and DevOps is responsible for ensuring that the right archival policies are in place and that growth rates are monitored and managed.

9. Governing big data environments

Governance issues become harder to address as big data applications grow across more systems. This problem is compounded as new cloud architectures enable enterprises to capture and store all the data they collect in its unaggregated form. Protected information fields can accidentally creep into a variety of applications.

"Without a data governance strategy and controls, much of the benefit of broader, deeper data access can be lost, in my experience," Mariani said.

A good practice is to treat data as a product, with built-in governance rules instituted from the beginning. Investing more time up front in identifying and addressing governance issues will make it easier to provide self-service access that does not require oversight of each new use case.
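To make the idea of built-in governance rules concrete, the Python sketch below tags each field of a data set when it is published and redacts protected fields by default. It is an illustration only; the schema, field names and classification labels are hypothetical.

```python
# Hypothetical catalog entry: each field is classified when the data set
# is published as a product.
CUSTOMER_SCHEMA = {
    "customer_id": "public",
    "region": "public",
    "email": "protected",
    "card_number": "protected",
}


def redact(record, schema, can_view_protected=False):
    """Return a copy of the record that is safe for the caller to see.

    Fields missing from the schema are treated as protected, so a newly
    added column stays hidden until someone classifies it.
    """
    safe = {}
    for field, value in record.items():
        classification = schema.get(field, "protected")
        if classification == "public" or can_view_protected:
            safe[field] = value
        else:
            safe[field] = "***"
    return safe


row = {"customer_id": 42, "region": "EMEA",
       "email": "a@example.com", "notes": "VIP"}
self_service_view = redact(row, CUSTOMER_SCHEMA)
print(self_service_view)
```

Defaulting unclassified fields to protected is the design choice that matters here: it keeps protected information from creeping into new applications without requiring a review of each new use case.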

10. Ensuring data context and use cases are understood

Enterprises also tend to overemphasize the technology without understanding the context of the data and its uses for the business.

"There is often a ton of effort put into thinking about big data storage architectures, security frameworks and ingestion, but very little thought put into onboarding users and use cases," said Adam Wilson, CEO of Trifacta, a data wrangling tools provider.

Teams need to think about who will refine the data and how. Those closest to the business problems need to collaborate with those closest to the technology to manage risk and ensure proper alignment. This involves thinking about how to democratize data engineering. It's also helpful to build out a few simple end-to-end use cases to get early wins, understand the limitations and engage users.