feature engineering

Contributor(s): Linda Rosencrance

Feature engineering is the process that takes raw data and transforms it into features that can be used to create a predictive model using machine learning or statistical modeling, such as deep learning. The aim of feature engineering is to prepare an input data set that best fits the machine learning algorithm, as well as to enhance the performance of machine learning models. Feature engineering can help data scientists by reducing the time it takes to extract variables from data, allowing for the extraction of more variables. Automating feature engineering helps organizations and data scientists create models with better accuracy.

How feature engineering works

The feature engineering process may look something like this:

  • Devise features -- examine the available data, study how feature engineering was applied to similar problems and determine what can be reused.
  • Define features -- involves two processes: feature extraction, which consists of defining and extracting a set of features that represent data that's important for the analysis; and feature construction, which entails transforming a particular set of input features to make a new set of more effective features that can be used for prediction. Depending on the problem, users can decide to use automatic feature extraction, manual feature construction or a combination of the two.
  • Select features -- when users know something about the data, and they've defined the potential features, the next step is to choose the right features. This consists of two elements: feature selection, the process of selecting some subset of the features most relevant to a particular task; and feature scoring, an assessment of how useful a feature is for prediction.
  • Evaluate models -- evaluate features by evaluating the accuracy of the model on unseen data using the selected features.
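The steps above can be sketched on a toy data set. In this illustrative example, a new feature is constructed from raw columns, each candidate is scored against the outcome with absolute Pearson correlation, and the highest-scoring one is selected; the column names, the invented house data and the choice of scoring rule are all assumptions for the sake of the sketch, not a prescribed method.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Raw records: each row is a house (invented numbers).
rows = [
    {"sqft": 800,  "beds": 2, "price": 150},
    {"sqft": 1200, "beds": 3, "price": 220},
    {"sqft": 1500, "beds": 3, "price": 260},
    {"sqft": 2000, "beds": 4, "price": 330},
]

# Define: construct a new candidate feature from existing columns.
for r in rows:
    r["sqft_per_bed"] = r["sqft"] / r["beds"]

# Select: score every candidate feature against the outcome.
outcome = [r["price"] for r in rows]
scores = {
    name: abs(pearson([r[name] for r in rows], outcome))
    for name in ("sqft", "beds", "sqft_per_bed")
}
best = max(scores, key=scores.get)
print(best)  # the feature most correlated with price on this toy data
```

In a real project the final step would be to evaluate a model trained on the selected features against held-out data, rather than relying on a correlation score alone.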

Feature engineering techniques

Feature engineering techniques include:

  • Imputation -- missing values in the data sets are a typical problem in machine learning and affect the way machine learning algorithms work. Imputation is the process of replacing missing data with statistical estimates of the missing values, which produces a complete data set to use to train machine learning models.
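As a minimal sketch of imputation, the missing entries in a numeric column can be replaced with the mean of the observed values; the column and its values are made up for illustration.

```python
from statistics import mean

# An "age" column with two missing values (None), invented for the sketch.
ages = [34, None, 28, None, 45, 31]

# Statistical estimate: the mean of the observed values.
observed = [a for a in ages if a is not None]
fill = mean(observed)

# Produce a complete column by filling every gap with the estimate.
complete = [a if a is not None else fill for a in ages]
print(complete)
```

Mean imputation is only one choice; the median, the mode or a model-based estimate can be substituted depending on the distribution of the column.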
  • One-hot encoding -- a process by which categorical data is converted into a form that the machine learning algorithm understands so it can make better predictions.
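A hand-rolled sketch of one-hot encoding, assuming a small invented "color" column: each distinct category becomes its own 0/1 indicator column.

```python
# Categorical column to encode (invented data).
colors = ["red", "green", "blue", "green"]

# One indicator column per distinct category, in a fixed (sorted) order.
categories = sorted(set(colors))
encoded = [[1 if value == cat else 0 for cat in categories] for value in colors]
print(categories)
print(encoded)
```

Each row now contains exactly one 1, marking its category, in a numeric form a machine learning algorithm can consume.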
  • Bag of words -- a counting algorithm that calculates how many times a word is repeated in a document. It can be used to determine similarities and differences in documents for such applications as search and document classification.
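A bag-of-words representation can be sketched with the standard library's `collections.Counter`; the sentence is an arbitrary example.

```python
from collections import Counter

doc = "the cat sat on the mat"

# Count how many times each word appears in the document.
bow = Counter(doc.split())
print(bow["the"], bow["cat"])
```

Comparing these count vectors across documents is the basis for the search and document-classification applications mentioned above.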
  • Automated feature engineering -- this technique pulls out useful and meaningful features using a framework that can be applied to any problem. Automated feature engineering enables data scientists to be more productive by allowing them to spend more time on other components of machine learning. This technique also allows citizen data scientists to do feature engineering using a framework-based approach.
  • Binning -- binning, or grouping data, is key to preparing numerical data for machine learning. This technique can be used to replace a column of numbers with categorical values representing specific ranges.
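A minimal binning sketch: a numeric age column is replaced with categorical range labels. The bin edges and label names are illustrative choices, not fixed conventions.

```python
def bin_age(age):
    """Map a numeric age to a categorical range label (illustrative bins)."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

ages = [12, 30, 70, 45]
binned = [bin_age(a) for a in ages]
print(binned)
```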
  • N-grams -- help predict the next item in a sequence. In sentiment analysis, the n-gram model helps analyze the sentiment of the text or document.
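Word-level n-grams can be extracted with a short sliding-window function; the sentence and the choice of n=2 (bigrams) are for illustration.

```python
def ngrams(tokens, n):
    """Return all consecutive n-length windows over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a good movie".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Note how the bigram ('not', 'a') preserves the negation that single-word features would lose, which is why n-grams help in sentiment analysis.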
  • Feature crosses -- a way to combine two or more categorical features into one. This technique is particularly useful when certain features together denote a property better than they do by themselves.
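A feature cross can be sketched by concatenating two categorical columns into one combined category; the country/device columns are invented for the example.

```python
# Invented rows with two categorical features.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "DE", "device": "mobile"},
]

# Cross the two features into a single combined categorical feature.
for r in rows:
    r["country_x_device"] = f'{r["country"]}_{r["device"]}'

print([r["country_x_device"] for r in rows])
```

The crossed feature lets a model learn an effect specific to, say, mobile users in one country that neither column captures on its own.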

There are open source Python libraries that support feature engineering techniques, including the Featuretools library, which creates features out of a set of related tables using deep feature synthesis, an algorithm that automatically generates features for relational data sets.

Feature engineering use cases

The following are examples of feature engineering use cases:

  • Calculating a person's age from the individual's birth date and the current date
  • Obtaining the average and median retweet count of particular tweets
  • Acquiring word and phrase counts from news articles
  • Extracting pixel information from images
  • Tabulating how frequently teachers enter various grades
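The first use case above -- deriving an age from a birth date -- can be sketched with the standard library's `datetime` module. The reference date is fixed here so the example is reproducible; in practice it would be the current date.

```python
from datetime import date

def age_on(birth, today):
    """Age in whole years on a given reference date."""
    # Subtract one year if the birthday has not yet occurred this year.
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

print(age_on(date(1990, 6, 15), date(2021, 1, 1)))
```

The birthday comparison matters: naively subtracting the years would overstate the age for anyone whose birthday falls later in the year.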

Feature engineering for machine learning

Feature engineering involves applying business knowledge, mathematics and statistics to transform data into a form that machine learning models can use.

Machine learning algorithms are driven by data. A user who understands historical data can detect a pattern and then develop a hypothesis. Then, based on the hypothesis, the user can predict the likely outcome, such as which customers are likely to buy certain products over a certain period of time. Feature engineering is about uncovering the best possible combination of hypotheses.

Feature engineering is critical because if the user provides the wrong hypothesis as an input, machine learning is unable to make accurate predictions. The quality of any hypothesis that's provided to the machine learning algorithm is key to the success of a machine learning model.

In addition, feature engineering influences how machine learning models perform and how accurate they are. It helps uncover the hidden patterns in the data and boosts the predictive power of machine learning.

For machine learning algorithms to work properly, users must input the right data that the algorithms can understand. Feature engineering transforms that input data into a single aggregated form that's optimized for machine learning. Feature engineering enables machine learning to do its job, e.g., predicting churn for retailers or preventing fraud for financial institutions.

Feature engineering in predictive modeling

An effective way to improve predictive models is with feature engineering, the process of creating new input features for machine learning.

One of the main goals of predictive modeling is to find an effective and reliable predictive relationship between an available set of features and an outcome: for example, how likely a customer is to perform a desired action.

Feature engineering is the process of selecting and transforming variables when creating a predictive model using machine learning. It's a good way to enhance predictive models as it involves isolating key information, highlighting patterns and bringing in someone with domain expertise.

The data used to create a predictive model consists of an outcome variable -- which contains data that needs to be predicted -- and a series of predictor variables, i.e., features, that contain data that can predict a particular outcome.

For example, in a model predicting the price of a certain house, the outcome variable is the data showing the actual price. The predictor variables are the data showing such things as the size of the house, the number of bedrooms and the location -- features thought to determine the value of the home.
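The house-price example can be written out as data structures: a feature matrix of predictor variables and an outcome vector of prices. Every number below is invented purely to make the shape of the data concrete.

```python
# Each row: ((size_sqft, bedrooms, location_score), price) -- invented values.
records = [
    ((1100, 2, 7), 185000),
    ((1600, 3, 8), 260000),
    ((2100, 4, 6), 295000),
]

X = [features for features, _ in records]  # predictor variables (features)
y = [price for _, price in records]        # outcome variable to be predicted
print(len(X), len(y))
```

A predictive model is then fit to learn the relationship between each row of X and the corresponding entry of y.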

This was last updated in January 2021
