Investigating Hadoop distributions: Which is right for you?
A collection of articles that takes you from defining technology needs to purchasing options
Many companies are struggling to manage the massive amounts of data they collect. Whereas in the past they may have used a data warehouse platform, such conventional architectures can fall short when dealing with data that originates from numerous internal and external sources and often varies in structure and type of content. But new technologies have emerged to offer help -- most prominently, Hadoop, a distributed processing framework designed to address the volume and complexity of big data environments involving a mix of structured, unstructured and semi-structured data.
Part of Hadoop's allure is that it consists of a variety of open source software components and associated tools for capturing, processing, managing and analyzing data. But, as addressed in a previous article in this series, many vendors offer commercial Hadoop distributions to help users take advantage of the framework. These distributions provide performance and functionality enhancements over the base Apache open source technology and bundle the software with maintenance and support services. As the next step, let's take a look at how a Hadoop distribution could benefit your organization.
Making a case for a Hadoop distribution
Hadoop runs in clusters of commodity servers and is typically used to support data analysis rather than online transaction processing applications. Several increasingly common analytics use cases map nicely to its distributed data processing and parallel computation model. The list includes:
- Operational intelligence applications for capturing streaming data from transaction processing systems and organizational assets, monitoring performance levels, and applying predictive analytics for pre-emptive maintenance or process changes.
- Web analytics, which are intended to help companies understand the demographics and online activities of website visitors, review Web server logs to detect system performance problems, and identify ways to enhance digital marketing efforts.
- Security and risk management, such as running analytical models that compare transactional data to a knowledge base of fraudulent activity patterns, as well as continuous cybersecurity analysis for identifying emerging patterns of suspicious behavior.
- Marketing optimization, including recommendation engines that absorb huge amounts of Internet clickstream and online sales data and blend that information with customer profiles to provide real-time suggestions for product bundling and upselling.
- Internet of Things applications, such as analyzing data from things -- like manufacturing devices, pipelines and so-called smart buildings -- via sensors that continuously generate and broadcast information about their status and performance.
- Sentiment analysis and brand protection, which might involve capturing streaming social media data and analyzing the text to identify unsatisfied customers whose issues can be addressed quickly.
- Massive data ingestion for data collection, processing and integration scenarios such as capturing satellite images and geospatial data.
- Data staging, in which Hadoop is used as an initial landing spot for data that is then integrated, cleansed and transformed into more structured formats in preparation for loading into a data warehouse or analytical database for analysis.
Capabilities supporting the use cases
Applications supporting these usage scenarios can be built on top of Hadoop using some prototypical implementation methodologies, such as:
Data lakes. Because Hadoop delivers linear scalability for processing and storage as new data nodes are incorporated into a cluster architecture, it provides a natural platform for capturing and managing voluminous files of raw data. This has motivated many users to implement Hadoop systems as a catch-all platform for their data, creating a conceptual data lake.
Data warehouse augmentation platform. Hadoop's distributed storage can also be used to expand the data that's accessible for analysis in a data warehouse environment. For example, a temperature-based scheme can be used for allocating data to different levels of the storage hierarchy, depending on its frequency of use. The most frequently accessed "hot" data is kept in the data warehouse, while less-frequently used "cool" data is relegated to higher-latency storage such as the Hadoop Distributed File System. This approach relies on tightly coupled data warehouse integration with Hadoop.
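The temperature-based scheme can be illustrated with a minimal sketch. The article gives no thresholds, so the `HOT_READS_PER_DAY` cutoff, the `DataSet` record and the `assign_tier` helper are all hypothetical names chosen for illustration; a real deployment would derive access statistics from its own query logs.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the article does not specify one.
HOT_READS_PER_DAY = 100.0


@dataclass
class DataSet:
    name: str
    reads_last_30_days: int


def assign_tier(ds: DataSet, hot_threshold: float = HOT_READS_PER_DAY) -> str:
    """Return 'warehouse' for frequently read (hot) data, 'hdfs' for cool data."""
    reads_per_day = ds.reads_last_30_days / 30
    return "warehouse" if reads_per_day >= hot_threshold else "hdfs"


# Illustrative data sets, not taken from the article.
datasets = [
    DataSet("current_quarter_sales", 45_000),
    DataSet("clickstream_2012", 120),
]
placement = {ds.name: assign_tier(ds) for ds in datasets}
```

Here the heavily queried sales data would stay in the data warehouse, while the rarely read clickstream archive would be relegated to the Hadoop Distributed File System.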
Large-scale batch computation engine. When configured with a combination of data and compute nodes, Hadoop becomes a massively parallel processing platform that's suited to batch processing applications for manipulating and analyzing data. One example would be data standardization and transformation jobs applied to data sets to prepare them for analysis. Algorithm-driven analytics applications such as data mining, machine learning, pattern analysis and predictive modeling are also good matches for Hadoop's batch capabilities, as they can be executed in parallel over massive distributed data files with iterations of partial results accumulated until the program completes with a final set of results.
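The pattern of computing partial results in parallel and accumulating them into a final result can be sketched in plain Python. This is not Hadoop's API; it only imitates the map and reduce steps on a single machine, and the function names and sample log lines are illustrative assumptions.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Each inner list stands in for a block of a distributed file.
partitions = [
    ["error disk full", "warn disk slow"],
    ["error network down"],
]

# On a cluster, each partition would be processed on a different node.
partials = [map_partition(p) for p in partitions]

# Partial results are accumulated until a final set of results remains.
totals = reduce(merge, partials, Counter())
```

Because each partition is counted independently, adding nodes lets more partitions be processed at the same time, which is what makes this style of job a good match for batch workloads.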
Event stream analytics processing engine. A Hadoop environment can also be configured to process incoming data streams in real or near real time. As an example, a customer sentiment analysis application can have multiple communication agents running in parallel on a Hadoop cluster, each applying a set of stream processing rules to data feeds from social networks such as Twitter and Facebook.
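As a rough illustration of an agent applying stream processing rules to a feed, the following sketch filters messages as they arrive. The keyword list, the `mentions_negative_term` rule and the `agent` function are hypothetical; a production system would use a stream processing engine and a proper sentiment model rather than keyword matching.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical keyword list used only for illustration.
NEGATIVE_TERMS = {"refund", "broken", "terrible"}

Rule = Callable[[str], bool]


def mentions_negative_term(text: str) -> bool:
    """Rule: true if the message contains any of the negative terms."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & NEGATIVE_TERMS)


def agent(feed: Iterable[str], rules: List[Rule]) -> Iterator[str]:
    """Yield each message that matches at least one rule, as it arrives."""
    for message in feed:
        if any(rule(message) for rule in rules):
            yield message


# A small in-memory list stands in for a live social media feed.
feed = [
    "Love the new release!",
    "My order arrived broken.",
    "Where is my refund?",
]
flagged = list(agent(feed, [mentions_negative_term]))
```

Several such agents could run side by side, each reading a different feed with the same rule set, which mirrors the parallel arrangement described above.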
Advantages of adopting Hadoop: Is it right for you?
A low-cost, high-performance computing framework like Hadoop can address different IT and business motivations for scaling up processing power or expanding data management capabilities in an organization. Let's examine some characteristics of application requirements that suggest the need for a data management platform based on a Hadoop distribution:
Ingestion and processing of large data sets, massive data volumes and streaming data. Examples include capturing Web server logs that contain information about billions of online events; indexing hundreds of millions of documents across different data sets; and continuously pulling in data streams such as social media channels, stock market data, news feeds and content published at expert communities.
A need to eliminate performance impediments. Application performance is often throttled on traditional data warehouse systems as a result of data accessibility, latency and availability issues or bandwidth limits in relation to the amount of data that needs to be processed.
The desire for performance that scales linearly. As data volumes grow and the number of users increases, having an environment in which performance will scale linearly as more computing and storage resources are added can be crucial, especially when applications can benefit from parallel computing.
A mixture of structured and unstructured data. The applications need to use data from different sources that vary in structure, and some -- or much -- of it is unstructured or semi-structured, for example, text or server log data.
IT cost efficiencies. Rather than paying premium prices for high-end servers or specialty hardware appliances, the system architects believe that acceptable performance can be achieved using commodity components.
Considerations for integrating Hadoop into the enterprise
A positive value proposition for using Hadoop still must be balanced, though, with the feasibility of integrating the platform into the enterprise. Because many organizations have made significant investments in traditional data warehouse platforms, there may be some resistance to introducing a newer technology. Before engaging a Hadoop distribution vendor, work to resolve any potential barriers to adoption and assess requirements for cluster sizing and configuration.
For example, determine where a Hadoop cluster fits in your organization's data warehousing and analytics strategy -- whether it's intended to augment existing data warehouses or replace them. Also, identify integration and interoperability issues that need to be addressed, and review configuration alternatives, including whether it's better to implement the Hadoop ecosystem on premises or in a cloud-based or hosted environment. In addition, ensure that you have funding to hire people with the right skills or retrain existing employees, because Hadoop application development differs greatly from conventional database development.
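On the cluster sizing question raised above, a back-of-the-envelope storage estimate can be sketched as follows. The replication default of 3 matches the default HDFS replication factor; the 25% headroom, the per-node capacity and the `nodes_needed` helper are assumptions made for illustration, and the estimate ignores compute, memory and growth requirements.

```python
import math


def nodes_needed(raw_tb: float, usable_tb_per_node: float,
                 replication: int = 3, headroom: float = 0.25) -> int:
    """Estimate data nodes required to store raw_tb terabytes of data.

    replication: copies kept of each block (3 is the HDFS default).
    headroom: assumed fraction of disk left free for temporary and
        intermediate data; this value is hypothetical.
    """
    stored_tb = raw_tb * replication
    required_tb = stored_tb / (1 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)


# Illustrative inputs: 100 TB of raw data, 24 TB usable disk per node.
estimate = nodes_needed(raw_tb=100, usable_tb_per_node=24)
```

With these inputs the estimate comes to 17 nodes. An estimate like this is only a starting point for a conversation with a distribution vendor.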
Answering these types of questions will help in determining the feasibility of a Hadoop deployment. The next step, which will be examined in the third article in this series, is to evaluate the features and functions you need in a commercial Hadoop distribution.
This is the second article in a four-part series on Hadoop distributions for big data management. The first article laid the groundwork, while this article examined specific use cases for buying a Hadoop distribution. Article No. 3 will help you determine your must-have features. The concluding article will examine Hadoop distributions from the leading vendors, comparing and contrasting their features.