Extract, transform and load (ETL) is the most prevalent data integration process used by IT and analytics teams....
ETL jobs extract data from source systems, transform it for downstream processing and analytics uses, and load the transformed data into target systems -- for example, an enterprise data warehouse or, in big data applications, a Hadoop cluster.
ETL tools automate these operations, which can streamline the ETL process compared to hand-coding integration scripts in SQL or other programming languages. However, manual coding is still a widespread practice among data integration developers, data analysts and other users with technical skills. If your organization is looking to get away from manual coding, the good news is that choosing the best ETL tools to meet your data integration needs has become less of a hit-or-miss proposition as the software has matured.
Rick Sherman, founder of consulting company Athena IT Solutions in Maynard, Mass., noted in a 2012 Q&A with SearchDataManagement that ETL software in particular and data integration products as a whole have evolved significantly since first emerging in the 1990s. The initial ETL tools to hit the market moved data in batches "from point A to B, and didn't provide very much functionality, but that's no longer true," Sherman said. For starters, he pointed to the addition of predefined integration routines, ETL process and code management features, and built-in documentation of integration procedures.
ETL tool features to look for
As evidence of this evolution, Sherman and other data management consultants point to a mix of core functions that most ETL vendors now offer -- and that users will typically want to have at their disposal. The list includes the following:
- The ability to connect to and extract data from a variety of sources -- databases of all stripes (relational, NoSQL, etc.), big data systems built around technologies such as Hadoop and Spark, flat file repositories, application-to-application message queues, and more.
- A GUI-based design environment that supports drag and drop development of source to target mappings and ETL workflows.
- Team-based development capabilities for collaborating on integration projects, with associated version control and release management features.
- Basic data transformation functions, such as data type conversion, date reformatting and string handling, plus data mapping and workflow orchestration capabilities.
- Built-in data profiling software that can analyze source data for consistency, dependencies and other attributes before beginning the ETL process.
- Data quality and cleansing functionality for identifying and fixing errors in data sets, plus data synchronization for keeping data consistent in source and target systems.
- Metadata management support for synchronizing integration processes and documenting data transformation and business rules.
- A job scheduler, along with process management controls that provide things like runtime monitoring and error alerting, handling, and logging.
Another common characteristic of modern ETL tools is support for integrating data stored in both on-premises and cloud systems, including hybrid cloud environments that mix the two types of platforms.
Most vendors now also provide the ability to flip the last two stages of the ETL process and instead do extract, load and transform (ELT) integrations. The ELT approach lets users manipulate and transform data after it has been loaded into a target system; that capability is a particularly good fit for big data applications in which large volumes of raw data are often loaded as is into the Hadoop Distributed File System, NoSQL databases or other repositories, and then filtered as needed for different analytics uses.
ETL tools grow into integration suites
Traditional ETL is still the biggest part of the data integration market by a wide margin, according to Gartner -- in its "2016 Gartner Magic Quadrant for Data Integration Tools" report, the consulting and market research company said that "easily more than 80% of all organizations" are running a significant amount of batch jobs involving bulk integration of data. In many cases, though, vendors that started with ETL tools augmented them with other types of technologies to create broader data integration suites.
In addition to batch ETL processes, the suites generally offer data replication, change data capture and other forms of real-time integration; data virtualization software that provides integrated views of data without physically extracting it from sources and loading it into a central data warehouse or other target system; streaming and event processing tools; and application integration functionality via service-oriented architecture and enterprise service bus technologies.
There are dozens of vendors you can consider in your search for the best ETL tools to handle the data integration jobs your organization needs to run. Major vendors selling full integration suites include Informatica, IBM, SAP, Oracle, SAS, Microsoft and Information Builders. Also available, potentially at a lower cost, are open source ETL and integration platforms from companies such as Talend, Pentaho and TIBCO Software's Jaspersoft unit.
In addition, the market contains numerous smaller vendors, including ETL specialists and companies that focus on specific vertical industries or certain types of integration applications. Another option is data warehouse automation software, which generates integration code as part of the data warehouse development process without requiring a separate ETL tool or server -- although some ETL proponents contend that such products aren't as functional as mainstream data integration tools.
Get more advice in our guide to buying ETL and other integration software
More users should use data integration tools, consultant Rick Sherman says
Learn why Pfizer replaced ETL tools with data virtualization software