
Using big data platforms for data management, access and analytics
-
Article
IT teams face pressing need for streaming analytics platforms
Building an architecture to support real-time analytics applications is becoming a priority for many organizations. But there's a plethora of data streaming platforms to consider. Read Now
-
Article
Big data tools, databases often best used in mixed company
EMA analyst John Myers says that, when evaluating data management technologies, IT teams should look to mix and match processing platforms for their big data workloads. Read Now
-
Article
Companies make Spark a centerpiece in big data environments
The Apache Spark processing engine has pushed its way into the big data spotlight alongside Hadoop, and users are turning to Spark for more than just its batch processing speed. Read Now
-
Article
For big data management needs, NoSQL software may be the answer
This handbook examines the potential role of NoSQL databases in big data applications -- and functionality issues that must be addressed when considering a deployment. Read Now
Editor's note
Big data platforms abound, which has upsides and downsides for prospective users. Hadoop clusters, the Spark processing engine, NoSQL databases, even conventional databases and data warehouses -- these and a variety of other technologies can all be tapped to create a big data architecture. But it's possible to go down the wrong technology path -- or multiple wrong paths.
It's up to IT managers, enterprise architects and others involved in building a big data framework to keep their organization on track to meet the business goals behind the deployment. "You need to make sure your architecture will take you where you want to go," said Ibrahim Itani, an independent consultant who focuses on big data analytics and a former leader of analytics and data warehousing teams at Verizon.
During a panel discussion at the 2017 TDWI Leadership Summit in Las Vegas, Itani compared architecting big data environments to designing bridges with multiple lanes and levels that can handle different traffic needs. In both cases, he said, you have to anticipate future usage so you can reconfigure or expand on top of the same foundations. Modifying a big data architecture "is costly and destructive to business operations if major changes are needed very often," Itani cautioned. He added, though, that big data systems should be able to accommodate new platforms and tools as they emerge or as business needs change.
Edd Wilder-James, a consultant at Silicon Valley Data Science, also pointed to technology agility as a key element of well-designed big data architectures. In addition, he cited related attributes such as linear scale-out and rapid deployment capabilities, plus support for schema-on-read approaches to data modeling, which provide flexibility in how information is organized. "Not all data is equal," Wilder-James said, in a session at the TDWI conference. "We need to treat different data in different ways. The things we have to think about are much more complicated than before."
To help address such challenges, many organizations are deploying multiple big data platforms to handle different parts of the processing pipeline. This guide includes a wide range of content on the available platform options, including Hadoop, Spark and database technologies. In the sections below, you'll find guidance on navigating the technology selection process, real-world examples of big data programs and information on big data management trends and technology developments.
1. Big data platforms and management strategies in action
Like other IT projects, big data applications face a host of hurdles -- only writ larger, in most cases. That starts with planning, designing and building a big data architecture, then continues on to things such as configuring and partitioning data sets, deploying advanced analytics tools, governing data and managing the use of Hadoop clusters and other big data platforms.
The stories in this section provide a window into big data projects at numerous user organizations, with tips from experienced IT managers and other users on tactics and strategies they've used in their deployments.
-
Article
Rise of big data platforms spurs new look at data governance process
Big data systems pose new data governance challenges in organizations. But some are navigating their way through the changes as they move to govern their data lakes effectively. Read Now
-
Article
Big data analytics initiatives find value in a variety of tools
Getting full business value from big data systems often takes a mix of predictive modeling, machine learning and other advanced analytics applications -- and a lot of effort. Read Now
-
Article
Big data architecture development doesn't happen overnight
Although Hadoop and related technologies enable organizations to design big data environments that are a match for their needs, putting all the pieces together isn't an easy feat. Read Now
-
Article
Spark usage on the rise despite gaps in its functionality
Spark still has some growing to do, but that isn't stopping an increasing number of organizations from deploying the technology to boost their big data processing performance. Read Now
-
Article
Real-time data streaming platforms speed up big data analytics
Companies are using real-time data processing and analytics technologies to find information in streams of big data that can help their business operations take action fast. Read Now
-
Article
User priority: Finding the business benefits of Hadoop platforms
More IT managers are looking to deploy Hadoop clusters in their organizations, but first, they have to sell business executives on the value of big data analytics applications. Read Now
-
Article
Monitoring, governance of Hadoop platforms key to big data success
Hadoop is playing a more central role in business operations, which has made managing the distributed processing framework a big priority for IT vendors and big data users alike. Read Now
-
Article
New users face learning curve for managing big data platforms
IT and analytics teams have to learn their way with system configuration, data partitioning and other setup processes to optimize the performance of Hadoop and Spark systems. Read Now
2. Technology developments on big data platforms
Things move quickly in the big data ecosystem, partly because of the open source nature of Hadoop, Spark and other technologies. In addition, many big data platforms and tools are still relatively new, so they get updated with new functionality on a regular basis. The growth of cloud computing and the emergence of technologies such as containers and microservices are also driving changes to big data software and systems.
The stories in this section examine trends affecting big data vendors and users; they also shine a light on new technologies that have been added to the big data mix.
-
Article
Drizzle software pegged to perk up Spark's streaming throughput
Spark's lead developers are looking to gain a performance edge on rival open source stream processing platforms via the addition of a low-latency execution engine called Drizzle. Read Now
-
Article
Real-time processing pipelines bring changes to big data systems
Big data architectures are changing to support the move to real-time processing and faster data analytics, with microservices gaining prominence in the Hadoop development domain. Read Now
-
Article
New components in Hadoop platforms include containers, microservices
In big data environments, microservices running in containers can break processing and analytics jobs up into pieces, easing development and management of Hadoop data flows. Read Now
-
Article
Hadoop vendors look to ease cost, complexity of cloud-based clusters
Big data vendors are moving to simplify the process of running Hadoop platforms in the cloud, partly through metered pricing that lets users set up transient clusters as needed. Read Now
-
Article
Together, IoT and big data increase data management needs
Consultant Andy Hayler says organizations looking to handle the large volumes of data coming from the internet of things may need to start by deploying new big data platforms. Read Now
-
Article
New data management tools lean on graph database technology
Graph databases are increasingly being tapped to help power a variety of new application architectures, including data integration, data governance and master data management tools. Read Now
-
Article
Data modeling techniques must evolve to accommodate big data
The surging adoption of big data platforms is pushing IT teams to adjust the way they approach data modeling, including the process of creating database schemas. Read Now
-
Article
GPUs find a place in graph database, machine learning systems
Familiar to gamers and supercomputer programmers, graphics processing units are now being tapped to power big data systems running graph databases and machine learning applications. Read Now
3. Terms to know related to big data platforms
Read the definitions included in this section to learn the basics about big data and the key technologies for processing, managing and analyzing it.
-
Definition
Apache Hadoop YARN
Apache Hadoop YARN is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. Read Now
-
Definition
Apache Spark
Apache Spark is an open source parallel processing framework for running large-scale data analytics applications across clustered computers. It can handle both batch and real-time analytics and data processing workloads. Read Now
-
Definition
big data
Big data is a combination of structured, semistructured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modeling and other advanced analytics applications. Read Now
-
Definition
big data analytics
Big data analytics is the often complex process of examining large and varied data sets, or big data, to uncover information -- such as hidden patterns, unknown correlations, market trends and customer preferences -- that can help organizations make informed business decisions. Read Now
-
Definition
big data as a service (BDaaS)
Big data as a service (BDaaS) is the delivery of statistical analysis tools or information by an outside provider that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage. Read Now
-
Definition
big data management
Big data management is the organization, administration and governance of large volumes of both structured and unstructured data. Read Now
-
Definition
data engineer
A data engineer is a worker whose primary job responsibilities involve preparing data for analytical or operational uses. Read Now
-
Definition
data scientist
A data scientist is a professional responsible for collecting, analyzing and interpreting extremely large amounts of data. The data scientist role is an offshoot of several traditional technical roles, including mathematician, scientist, statistician and computer professional. Read Now
-
Definition
database management system (DBMS)
A database management system (DBMS) is system software for creating and managing databases. Read Now
-
Definition
Hadoop 2
Apache Hadoop 2 is the second iteration of the Hadoop framework for distributed data processing. Hadoop 2 adds support for running non-batch applications as well as new features to improve system availability. Read Now
-
Definition
Hadoop as a service (HaaS)
Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics framework that stores and analyzes data in the cloud using Hadoop. Read Now
-
Definition
Hadoop cluster
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Read Now
-
Definition
Hadoop data lake
A Hadoop data lake is a data management platform comprising one or more Hadoop clusters. Read Now
-
Definition
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. Read Now
-
Definition
MapReduce
MapReduce is a core component of the Hadoop framework. It distributes tasks across multiple nodes in the cluster (Map), then organizes and aggregates the results from each node to answer a query (Reduce). Read Now
-
Definition
NoSQL
NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases in which data is placed in tables and data schema is carefully designed before the database is built. Read Now
-
Definition
SQL-on-Hadoop
SQL-on-Hadoop is a class of analytical application tools that combine established SQL-style querying with newer Hadoop data framework elements. Read Now
-
Definition
multimodel database
A multimodel database is a data processing platform that supports multiple data models, which define the parameters for how the information in a database is organized and arranged. Read Now
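
The MapReduce entry above describes a Map step that spreads work across nodes and a Reduce step that aggregates each node's results. The sketch below illustrates that flow with a word count run in a single Python process. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are assumptions made for illustration, and each input string stands in for a block of data that a real cluster would hand to a separate node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over a list of text chunks."""
    # Each chunk is mapped independently; on a cluster this runs in parallel.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input: two chunks standing in for two data blocks.
    sample_chunks = ["big data big plans", "data lakes and big clusters"]
    print(word_count(sample_chunks))
```

Because each chunk is mapped without reference to any other chunk, the Map step is the part that can be spread across nodes; only the grouping and Reduce steps need to bring results for the same key together.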