Choosing storage for streaming large files in big data sets
In an email discussion with Krish Krishnan, author of Data Warehousing in the Age of Big Data, SearchDataManagement editorial assistant Emma Preslar asked the author to further elaborate on the challenges of big data and the ways in which it is transforming our strategies for managing and storing data. For more on data warehousing and big data, check out this two-part excerpt from Chapter 10, "Integration of Big Data and Data Warehousing."
How has big data changed the way that we look at information and our strategies in dealing with data?
Krish Krishnan: Big data has become an impactful business decision support platform, both from a technology infrastructure standpoint and from a data architecture standpoint. From an infrastructure perspective, we have seen the emergence of platforms, from Hadoop to NoSQL -- and more emerging by the day -- to handle the challenge of collecting vast volumes of data produced at varied speeds by different sources across multiple formats. The infrastructure platform has become a new mechanism for driving the collaboration between different teams that use data in the enterprise for decision support.
From a data architecture perspective, the big data landscape has challenged the enterprise data strategy and pushed it to evolve from separate silos into an integrated stack. The biggest change that we see today is the revival of enterprise metadata for data discovery and integration challenges, and the use of semantic frameworks and taxonomies for analytics and reporting purposes for big data. We are at the starting edge of enterprise adoption of these platforms, and the next two to three years will be more solution-focused before we see major success stories.
In your opinion, what are the biggest challenges of big data, in terms of working with and storing it?
Krishnan: The biggest challenge in working with big data lies in the complexity and ambiguity of the data. Complexity and ambiguity have always been tough from a strategy and delivery perspective, especially the issues that arise from applications of MDM [master data management], CRM [customer relationship management] and call centers. Big data also creates challenges in understanding the context of analysis, especially when we deal with voice, video and image data. While we are working to make this a nonissue, the volume of data and the variety of formats remain areas of concern.
In the case of processing text data, there are a number of possible solutions that can be implemented to create a more robust and scalable back-end platform for managing the data. From a storage perspective, big data will become more challenging for the enterprise over time; for example, a starter size of 10 TB of unstructured data can grow to 1,000 TB in one year, and managing the lifecycle of that data is an issue, especially from a storage perspective. Companies that are leading the revolution have published several techniques for compressing and managing data volume. Enterprises will need to implement these techniques as they migrate to the world of big data.
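To put the 10 TB to 1,000 TB example in perspective, the short Python sketch below works out the monthly growth rate that such a jump would imply. It assumes steady compound growth over 12 months, which is an illustrative simplification and not something stated in the interview; the function names are hypothetical.

```python
def implied_monthly_growth(start_tb, end_tb, months=12):
    """Monthly compound growth rate needed to go from start_tb to end_tb."""
    # Assumption: growth compounds evenly each month (illustrative only).
    return (end_tb / start_tb) ** (1 / months) - 1


def projected_sizes(start_tb, monthly_rate, months=12):
    """Projected storage size in TB at the end of each month, starting at month 0."""
    return [start_tb * (1 + monthly_rate) ** m for m in range(months + 1)]


# Figures taken from the example in the text: 10 TB growing to 1,000 TB in a year.
rate = implied_monthly_growth(10, 1000)
sizes = projected_sizes(10, rate)

for month, size in enumerate(sizes):
    print(f"month {month:2d}: {size:8.1f} TB")
print(f"implied monthly growth: {rate:.1%}")
```

Under that assumption, the data set would have to grow by roughly 47% every month, which gives a sense of why lifecycle management and compression matter for storage planning.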
For some, the hype around big data technologies such as Hadoop suggests that these technologies will eventually fully replace the traditional data warehouse. Do you think that this will ever happen, or do you think it is more likely that most companies will move forward with integrated strategies using both traditional and big data technologies?
Krishnan: Big data is an evolution, and the technologies that have been created for managing this type of data will become a heterogeneous addition to the world of data warehousing. The future move for enterprises to consider is a Hadoop-based enterprise data repository or enterprise lake that will store all enterprise data. We can create a sandbox for data discovery and analytical processing based on Hadoop or NoSQL types of technologies and use the database for managing the analytical applications and user interface requirements.
At this point in time, I do not see the rationale for an enterprise to move to Hadoop as a data warehouse platform. With the evolution of Hadoop, however, we can see the emergence of SQL interfaces and more real-time interactive capabilities, which may be pivotal if the platform emerges to challenge the database. For now, we are not ready to make that kind of infrastructure and platform change.
What advice do you have for companies looking to integrate big data technologies into their data architecture?
Krishnan: For an enterprise that aspires to evolve into the big data platform from infrastructure and data perspectives, there are a few things to consider:
- Identify the requirements for what you are attempting to accomplish.
- Identify the data requirements for that purpose, along with the metadata and master data capabilities of your teams.
- Identify the sources of data, the volume of data and speed of production for collection and storage.
- Do not select technology prior to doing the first three steps. Once you have a clear idea of the need, then select the technology platform.
- Plan for heterogeneous technology architecture to become a normal setup for the enterprise.
- Plan the security requirements before you start looking at application and data development.
- Plan for the skills needed and adjust for skills gaps.
- Plan the development activities from proof of concept (POC) to implementation.
- Execute a POC with real use cases and share the results.
- Involve the business teams as much as possible in the process.