A data engineer is a worker whose primary job is preparing data for analytical or operational uses. The specific tasks vary from organization to organization but typically include building data pipelines to pull together information from different source systems; integrating, consolidating and cleansing data; and structuring it for use in individual analytics applications.
The data engineer often works as part of an analytics team, providing data in a ready-to-use form to data scientists who are looking to run queries and algorithms against the information for predictive analytics, machine learning and data mining purposes. In many cases, data engineers also work with business units and departments to deliver data aggregations to executives, business analysts and other end users for more basic types of analysis to aid in ongoing operations.
Data engineers commonly deal with both structured and unstructured data sets. As a result, they must be well versed in different approaches to data architecture and applications. A variety of big data technologies, including an ever-growing assortment of open source data ingestion and processing frameworks, are also part of the data engineer's tool kit.
To carry out their duties, data engineers can be expected to have skills in such programming languages as C#, Java, Python, Ruby, Scala and SQL. They also need a good understanding of extract, transform and load (ETL) tools and REST-oriented APIs, both for creating and managing data integration jobs and for giving data analysts and business users simplified access to prepared data sets.
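As a rough illustration of the extract, transform and load pattern, the following minimal Python sketch pulls records from two sources, cleanses and consolidates them, and loads the result into a queryable table. Everything in it is an assumption made for the example: the sample CSV data, the `customers` table and the function names are hypothetical, and the standard-library `csv` and `sqlite3` modules stand in for real source systems and a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for two separate source systems.
CRM_CSV = "customer_id,email\n1, Alice@Example.com \n2,bob@example.com\n"
BILLING_CSV = "customer_id,email\n2,BOB@example.com\n3,carol@example.com\n"


def extract(raw_csv):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: cleanse emails and consolidate duplicates by customer_id."""
    cleaned = {}
    for row in rows:
        customer_id = int(row["customer_id"])
        email = row["email"].strip().lower()
        # If an ID appears more than once, the later row wins.
        cleaned[customer_id] = email
    return sorted(cleaned.items())


def load(records, conn):
    """Load: write the prepared records into a table that analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, email TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run all three stages and return the loaded rows."""
    rows = extract(CRM_CSV) + extract(BILLING_CSV)
    load(transform(rows), conn)
    return conn.execute(
        "SELECT customer_id, email FROM customers ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_pipeline(sqlite3.connect(":memory:")))
```

Production pipelines would typically add scheduling, error handling and incremental loads, which this sketch leaves out to keep the three stages visible.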
Hadoop data lakes, which offload some of the processing and storage work of established enterprise data warehouses, have been a chief area of application for data engineers supporting big data analytics efforts. NoSQL databases and Apache Spark systems are also becoming increasingly common components of the data workflows set up by data engineers. Another area of focus is Lambda architecture, which supports unified data pipelines for both batch and real-time processing.
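The Lambda architecture idea can be sketched in a few lines of Python. In the usual description of the pattern, a batch layer periodically recomputes a view from the full historical data set, a speed layer keeps an incremental view of events that arrived since the last batch run, and a serving layer merges the two at query time. The page-view events, the `SpeedLayer` class and the function names below are hypothetical, and in-memory counters stand in for systems such as Hadoop or Spark.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute page-view counts from the full historical data set."""
    return Counter(event["page"] for event in master_events)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch, speed):
    """Serving layer: merge the batch view and the real-time view for one page."""
    return batch[page] + speed.realtime_view[page]


if __name__ == "__main__":
    # Hypothetical historical events already covered by the last batch run.
    historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
    batch = batch_view(historical)

    # A new event arrives before the next batch run.
    speed = SpeedLayer()
    speed.ingest({"page": "/home"})

    print(query("/home", batch, speed))
```

When the next batch run absorbs the recent events, the speed layer's view for that period would be discarded; the sketch omits that hand-off.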
As the data engineer job has gained more definition, IBM, Hadoop vendor Cloudera Inc. and other organizations have begun offering certifications for data engineering professionals.