Feature Engineering in Machine Learning

If you have raw data and want to convert it into meaningful insight, feature engineering is one of the best tools machine learning offers. Imagine a real-life situation in which you want to know how frequently a customer orders food from your restaurant so that you can offer better service. You can take raw data such as the time interval between a customer's orders, the items the customer orders most often, and how your restaurant compares with similar restaurants nearby, and convert it into a meaningful feature such as ‘quality of your service to the customer’. Feature engineering converts unrefined data into meaningful features that a machine learning model can understand better. It can draw on your domain knowledge or on exploratory data analysis, and in some cases additional data provided by an external agency can also help. Feature engineering plays a vital role in finding meaningful insight in real-life situations, and it has also become an exciting tool in the materials research toolbox, where it can considerably speed up research on fundamental and applied aspects of materials science and physics.

Why do you need feature engineering and how does it work?


Data cleaning and preprocessing, scaling, and encoding are the steps needed to make data suitable for an algorithm. Feature engineering brings these steps together and makes the data interpretable and meaningful before we feed it to a machine learning model.

Data scientists often spend a major share of their time cleaning and organizing data, because real-world data looks pretty messy before feature engineering, as the title picture shows. Machine learning models cannot handle such messy data. Feature engineering organizes the data before we feed it to the training model, which keeps the model as simple as possible while improving its performance. Data preprocessing, which includes removing outliers (values that deviate markedly from the rest of the data, often because of errors), and domain understanding are the two important things that have to be done before feature extraction. To be precise, feature engineering is the process of extracting useful information from preprocessed data using mathematical models, statistical approximations and domain knowledge.

Let’s understand data preprocessing, a mandatory step that supports feature engineering, through a real-life example. If you want to purchase some land and are given data like that in the table below, it is difficult to infer what you need to assess the land price.

Area (cent)    Price (lakhs Rs)
18             55
15             50
5              52
3              11

However, you can add another column for price per cent to understand the data in a meaningful way.

Area (cent)    Price (lakhs Rs)    Price/cent (lakhs Rs)
18             55                  3.06
15             50                  3.33
5              52                  10.40
3              11                  3.67
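
The derived column can be reproduced in a few lines. The following is a minimal sketch using NumPy; the "more than twice the median" rule is a hypothetical rule of thumb chosen for illustration, not something the article prescribes, and in practice the threshold would come from domain knowledge.

```python
import numpy as np

# Land listings from the table: area in cents, price in lakhs Rs.
area = np.array([18, 15, 5, 3], dtype=float)
price = np.array([55, 50, 52, 11], dtype=float)

# Derived feature: price per cent.
price_per_cent = price / area

# Hypothetical rule of thumb (an assumption for this sketch): flag rows
# whose price per cent is more than twice the median of the column.
median = np.median(price_per_cent)
is_outlier = price_per_cent > 2 * median

for a, p, ppc, flag in zip(area, price, price_per_cent, is_outlier):
    print(f"{a:>4.0f} cent  {p:>4.0f} lakhs  {ppc:6.2f} lakhs/cent  outlier={flag}")

# Mean price per cent once the flagged rows are left out.
typical = price_per_cent[~is_outlier].mean()
print(f"typical price per cent: {typical:.2f} lakhs Rs")
```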

Combining the price/cent column with domain knowledge, you can conclude that the third row (5 cents for 52 lakhs Rs, or 10.4 lakhs Rs per cent) does not follow the general trend of the listed data and is therefore an outlier. With this outlier removed, the average price per cent in that area works out to about 3.35 lakhs Rs. The calculation in the added price/cent column is a basic example of outlier detection in data preprocessing. Data preprocessing similarly handles missing values and skewness. The mean, median and mode are the statistics most commonly used to fill in missing values. Skewness is a measure of asymmetry in the data distribution; for positive, right-skewed data, transformations such as the log transformation, demonstrated in the code below, are popularly used to make the distribution more symmetric.

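from scipy.stats import skewnorm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# The log is undefined for zero and negative values, so draw a
# right-skewed sample (a=5) and shift it so every value is at least 1.
r = skewnorm.rvs(a=5, size=10000, random_state=0)
r = r - r.min() + 1
r1 = np.log(r)                 # natural log
r2 = np.log(r) / np.log(2)     # log base 2
r3 = np.log(r) / np.log(1.5)   # log base 1.5

# Compare the density of the raw sample with its log-transformed versions.
sns.kdeplot(r, color='blue', label='raw data')
sns.kdeplot(r1, color='green', label='ln(data)')
sns.kdeplot(r2, color='red', label='log2(data)')
sns.kdeplot(r3, color='cyan', label='log1.5(data)')
plt.legend()
plt.show()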
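
Filling missing values with the mean, median or mode, as mentioned earlier, can be sketched in the same way. The column below is made-up illustration data, and which statistic suits a real column depends on its distribution; this is only a minimal NumPy sketch.

```python
import numpy as np

# Hypothetical column with two missing entries marked as NaN.
values = np.array([3.0, 4.0, np.nan, 4.0, 10.0, np.nan, 5.0])
missing = np.isnan(values)
observed = values[~missing]

mean_value = observed.mean()
median_value = np.median(observed)

# Mode: the most frequent observed value.
uniques, counts = np.unique(observed, return_counts=True)
mode_value = uniques[np.argmax(counts)]

# Replace the missing entries with each candidate statistic.
filled_mean = np.where(missing, mean_value, values)
filled_median = np.where(missing, median_value, values)
filled_mode = np.where(missing, mode_value, values)

print("mean fill:  ", filled_mean)
print("median fill:", filled_median)
print("mode fill:  ", filled_mode)
```

The single large value (10.0) pulls the mean above the median here, which is one reason the median is often preferred for skewed columns.
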
Feature engineering

Once the data cleaning and preprocessing steps (removing outliers, handling missing values and addressing skewness) are done, we can move on to the next step of feature engineering, which is scaling. Scaling is not mandatory for every problem. However, imagine data in which one column has values ranging from 10 to 1000 and another has values ranging from 0 to 1. In models that are sensitive to the magnitude of their inputs, the first column gets inappropriately more weight than the second. In such cases scaling maps the features to a range with a predefined minimum and maximum, in most cases [0, 1].
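
As an illustration, min-max scaling to [0, 1] can be written directly in NumPy. This is a minimal sketch: the helper name `min_max_scale` and the sample values are assumptions made for the example, and a constant column is mapped to zeros here simply to avoid dividing by zero.

```python
import numpy as np

def min_max_scale(column):
    """Map a 1-D column linearly onto the range [0, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant column carries no spread; return zeros (a choice
        # made for this sketch) instead of dividing by zero.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

# Hypothetical columns on very different scales.
wide = np.array([10, 250, 500, 1000])
narrow = np.array([0.1, 0.4, 0.5, 0.9])

scaled_wide = min_max_scale(wide)
scaled_narrow = min_max_scale(narrow)

print(scaled_wide)
print(scaled_narrow)
```

After scaling, both columns span the same range, so neither dominates merely because of its units.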

If the data contains categorical features such as text or strings, encoding is used in feature engineering to convert the categorical variables to numerical values.
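
Two common encodings, label encoding and one-hot encoding, can be sketched with NumPy alone. The `land_type` column below is hypothetical, and the article does not say which encoding it has in mind, so treat this as one possible illustration.

```python
import numpy as np

# Hypothetical categorical column.
land_type = np.array(["farm", "residential", "commercial", "residential"])

# Label encoding: each category is replaced by an integer code.
# np.unique returns the categories in sorted order.
categories, labels = np.unique(land_type, return_inverse=True)

# One-hot encoding: one 0/1 column per category.
one_hot = np.eye(len(categories), dtype=int)[labels]

print("categories:", categories)
print("labels:    ", labels)
print("one-hot:")
print(one_hot)
```

Label encoding implies an ordering among the categories, which may or may not be meaningful; one-hot encoding avoids that at the cost of extra columns.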

What is the order of steps in feature engineering for a machine learning model?

  1. Data preparation: Convert the raw data from different sources into a standardized format so that it can be used in a machine learning model.
  2. Exploratory data analysis: Identify and consolidate the characteristic features of the data to decide which statistical or mathematical techniques are most appropriate for the ML model.
  3. Benchmark: Set up a baseline standard of accuracy against which all variables are compared.
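
The benchmark step can be illustrated with a very simple baseline. In this sketch the listings are made-up numbers, the baseline always predicts the mean price, and the candidate is a straight-line fit of price against area; both are scored on the same data for brevity, whereas a real comparison would use held-out data.

```python
import numpy as np

# Hypothetical listings: area in cents, price in lakhs Rs.
area = np.array([18, 15, 12, 3, 8, 20], dtype=float)
price = np.array([55, 50, 41, 11, 27, 66], dtype=float)

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

# Baseline: always predict the mean price, ignoring every feature.
baseline_pred = np.full_like(price, price.mean())
baseline_mae = mean_absolute_error(price, baseline_pred)

# Candidate: least-squares straight line of price against area.
slope, intercept = np.polyfit(area, price, deg=1)
model_mae = mean_absolute_error(price, slope * area + intercept)

print(f"baseline MAE: {baseline_mae:.2f} lakhs Rs")
print(f"model MAE:    {model_mae:.2f} lakhs Rs")
```

A feature is worth keeping only if a model that uses it beats the baseline by a meaningful margin.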

When we aim to get the best out of the raw data we have, how well we handle the features through feature engineering is vital to the success of a machine learning model. Feature engineering should therefore be treated as an art in which we represent the data as elegant features that yield the best and most meaningful insights.
