Editor's note: In June 2017, IBM announced that it is ending the development of BigInsights and becoming a reseller of Hortonworks Inc.'s Hadoop distribution, Hortonworks Data Platform (HDP). As part of the deal, IBM will work to migrate all existing IBM BigInsights customers to HDP.
IBM BigInsights combines IBM's enterprise capabilities with industry-standard Hadoop components in a single platform, enabling users to manage and analyze large volumes of structured and unstructured data.
IBM BigInsights features several advanced analytics capabilities, including sophisticated text analytics; BigSheets for advanced data exploration; and Big SQL, which enables SQL access to data in a Hadoop cluster. Added-value enterprise capabilities are designed to enhance and simplify application development and system implementation, as well as provide features that improve performance, scalability, reliability, security and administration.
IBM BigInsights release 4.1 includes the IBM Open Platform with Apache Hadoop, as well as several prepackaged value-add modules containing proprietary advanced enterprise-grade features.
IBM Open Platform with Apache Hadoop, IBM's core distribution of open source Hadoop, includes the following components: Ambari (2.1), Apache Kafka (0.8.2), Flume (1.5.2), Ganglia (3.1.7), Hadoop (2.7.1), HBase (1.1.1), Hive (1.2.1), Knox (0.6.0), Lucene (4.7.0), Nagios (3.5.1), Oozie (4.2.0), Parquet (4.0), Parquet MR/format (1.6.0/2.2), Pig (0.15.0), Slider (0.80.0), Solr (5.1.0), Spark (1.4.1), Sqoop (1.4.6), Teradata Connector for Hadoop (1.4) and ZooKeeper (3.4.6).
BigInsights value-add modules include:
IBM BigInsights Analyst, which provides specific tools for data analysis. The module includes the BigInsights Home service, the primary interface used to launch other BigInsights components, as well as Big SQL and BigSheets:
- Big SQL is an advanced SQL engine, built on massively parallel processing technology, that gives users with standard SQL skills fast query access to data in a Hadoop cluster. A single query can reach data whether it resides in Hive, HBase or the Hadoop Distributed File System. The product also supports federated query access to IBM DB2, Oracle, Teradata and Open Database Connectivity sources.
- BigSheets lets users explore, transform and perform visualizations on large data sets stored in Hadoop through a spreadsheet-like Web interface. The tool supports fast queries against massive data sets by translating user actions into MapReduce functions against the Hadoop cluster.
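To make the idea of translating a user action into MapReduce functions concrete, the following is a minimal sketch in plain Python, not BigSheets code. It assumes a hypothetical "sum one column, grouped by another" spreadsheet action and a small in-memory sample; the names `ROWS`, `map_phase`, `shuffle`, `reduce_phase` and `sum_by_column` are invented for illustration and are not part of any IBM API.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a large data set stored in Hadoop.
ROWS = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 45},
    {"region": "north", "sales": 60},
    {"region": "west", "sales": 30},
]


def map_phase(rows, key_col, value_col):
    """Emit one (key, value) pair per row, as a mapper would."""
    for row in rows:
        yield row[key_col], row[value_col]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single total, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def sum_by_column(rows, key_col, value_col):
    """Hypothetical spreadsheet action: sum value_col grouped by key_col."""
    return reduce_phase(shuffle(map_phase(rows, key_col, value_col)))


if __name__ == "__main__":
    print(sum_by_column(ROWS, "region", "sales"))
```

On a real cluster, the map and reduce steps would run in parallel on the nodes that hold the data, which is what lets a spreadsheet-style action scale to massive data sets.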
IBM BigInsights Data Scientist, which gives users with advanced analytics skills the tools to gain further insight into the data in the cluster. In addition to components provided as part of the Analyst module, the following tools are also included:
- Big R, which provides users familiar with the R language with a set of libraries for developing and running R functions on data residing in the IBM BigInsights cluster. This tool lets users perform complex operations and queries using R against large data sets by hiding some of the complexity of writing MapReduce functions.
- Text Analytics, which is a powerful and intuitive tool for extracting information from unstructured and semi-structured text.
- SystemML, which lets users write statistical functions and machine learning algorithms in an R-like syntax. The tool enables those algorithms to be executed in a distributed fashion across the nodes of a cluster using MapReduce or Spark (in memory). IBM contributed SystemML to the open source community, and it has been accepted as an Apache Incubator project.
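As an illustration of what executing a statistical function in a distributed fashion involves, here is a minimal sketch in plain Python. It does not use SystemML or its syntax. It assumes the data is already split into partitions, one per node, and every function name is hypothetical. Each partition produces a small summary, and only the summaries are combined.

```python
import math


def partial_stats(partition):
    """Per-node work: count, sum and sum of squares for one partition."""
    count = len(partition)
    total = sum(partition)
    total_sq = sum(x * x for x in partition)
    return count, total, total_sq


def combine(a, b):
    """Merge two partial summaries; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions):
    """Combine the per-partition summaries into a global mean and variance."""
    summary = (0, 0.0, 0.0)
    for partition in partitions:
        summary = combine(summary, partial_stats(partition))
    count, total, total_sq = summary
    mean = total / count
    # Population variance via E[x^2] - (E[x])^2; simple, but it can lose
    # precision on very large or badly scaled data.
    variance = total_sq / count - mean * mean
    return mean, variance


if __name__ == "__main__":
    # Hypothetical data spread over three nodes.
    nodes = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
    mean, variance = mean_and_variance(nodes)
    print(mean, math.sqrt(variance))
```

Because the summaries are small and can be merged in any order, the heavy work stays on the nodes holding the data. The same pattern applies whether the underlying engine is MapReduce or Spark.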
IBM Enterprise Management module, which provides enterprise-grade capabilities to support cluster scaling and performance through parallel computing and application grid management. The module also provides other enterprise features to support cluster security and reliability. IBM Enterprise Management includes IBM Spectrum Scale-FPO, a Portable Operating System Interface-compliant file system that can be used in place of Hadoop Distributed File System. This gives administrators more control and improved integration capabilities with other systems in the enterprise. Also included is IBM Platform Symphony, which gives administrators tools to efficiently manage multiple platform instances and supports data isolation in multi-tenant environments.
IBM BigInsights for Apache Hadoop, which includes the contents of the three modules noted above.
BigInsights modules operate on Linux servers. IBM publishes detailed system requirements covering operating systems, hardware and supported software.
While IBM BigInsights modules can be downloaded and installed on-premises, the company also offers BigInsights on Cloud, a Hadoop-as-a-service offering that runs on IBM's global cloud infrastructure. This option provides users with all the features of BigInsights in a 24/7 managed environment.
IBM BigInsights licensing and distribution
While IBM's Open Platform with Apache Hadoop is available as a free-to-use open source distribution, the BigInsights value-add modules require IBM licensing for purposes other than evaluation. Contact IBM or an IBM Business Partner for detailed pricing and support options.
IBM offers the BigInsights Quick Start evaluation edition of its software for nonproduction use.
IBM is a founding member of the Open Data Platform Initiative, a group of big data industry leaders and vendors that promotes open source technologies in the Apache Hadoop ecosystem and works to improve interoperability among big data tools.