If you have some raw data and want to convert it into meaningful insight, feature engineering in machine learning is one of the best tools to help you. Imagine a real-life situation where you want to know how frequently a customer orders food from your restaurant so that you can offer the best customer service.
In such a case, you can take raw data such as the time interval between two orders placed by a customer, the items the customer frequently orders, the particulars of your restaurant compared to similar restaurants nearby, etc., and convert it into a meaningful feature like ‘quality of your service to the customer’ using feature engineering. Feature engineering converts unrefined data into meaningful features that a machine learning model can understand better.
Feature engineering can be done either from your domain knowledge or through exploratory data analysis. In some cases, additional data provided by an external agency can also aid feature engineering. Feature engineering plays a vital role in finding meaningful insight in real-life situations, and it has become an exciting tool in the materials research toolbox. Feature engineering in machine learning has the power to considerably speed up research on the fundamental and applied aspects of materials science and physics.
Why do you need feature engineering and how does it work?
Data cleaning and preprocessing, scaling and encoding are the necessary steps to make data ready for an algorithm. Feature engineering brings all these steps together and makes the data interpretable and meaningful before we feed it to a machine learning model.
Data scientists often spend a major share of their time cleaning and organizing data because, in the real world, data looks pretty messy before feature engineering, as shown in the title picture. Machine learning models cannot handle such messy data. Feature engineering helps to organize the data before we feed it into the training model, keeping our machine learning model as simple as possible while improving its performance. Data preprocessing, which removes outliers (errors in the data), and domain understanding are the two important aspects that have to be addressed before feature extraction in feature engineering. To be precise, feature engineering is the process of extracting useful information after data preprocessing using mathematical models, statistical approximations and domain knowledge.
Data Preprocessing for Feature Engineering
Let’s understand data preprocessing, which is a mandatory step to support feature engineering, with a real-life example. Suppose you want to purchase some land and have data similar to the table below; it will be difficult to infer the information you need to assess the land price.
| Area (cent) | Price (lakhs Rs) |
| --- | --- |
| 18 | 55 |
| 15 | 50 |
| 5 | 52 |
| 3 | 11 |
However, you can add another column for price per cent to understand the data in a meaningful way.
| Area (cent) | Price (lakhs Rs) | Price/cent (lakhs Rs) |
| --- | --- | --- |
| 18 | 55 | 3.06 |
| 15 | 50 | 3.33 |
| 5 | 52 | 10.4 |
| 3 | 11 | 3.67 |
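A minimal pandas sketch of how this derived column and the outlier check might be done (the column names and the 2× median threshold are illustrative assumptions, not part of the original example):
import pandas as pd
# Land-price data from the table above
df = pd.DataFrame({'area_cent': [18, 15, 5, 3],
                   'price_lakhs': [55, 50, 52, 11]})
# Derived feature: price per cent of land
df['price_per_cent'] = df['price_lakhs'] / df['area_cent']
# Flag rows whose price per cent deviates strongly from the median (assumed 2x threshold)
df['possible_outlier'] = df['price_per_cent'] > 2 * df['price_per_cent'].median()
print(df)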
By combining the information in the added price/cent column with domain knowledge, you can conclude that the third row (5 cents priced at 52 lakhs, i.e., 10.4 lakhs per cent) doesn’t follow the general trend of the listed data and hence is an outlier. After removing this outlier, we can expect the actual average price per cent in that particular area to be around 3.35 lakhs Rs. The calculation performed in the added price/cent column is a basic example of outlier detection in data preprocessing. Similarly, data preprocessing handles missing values and skewness in the data. The mean, median and mode are the statistics most commonly used to fill missing values. While dealing with skewness, which is a measure of the asymmetry of a data distribution, transformations such as the log transformation demonstrated in the code below are popularly employed to bring the data closer to a symmetric distribution.
Python Code
from scipy.stats import skewnorm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate a right-skewed sample and shift it so every value is positive,
# since the logarithm is only defined for positive numbers
r = skewnorm.rvs(a=5, size=10000)
r = r - r.min() + 0.1
# Log transforms with different bases to reduce the skewness
r1 = np.log(r)              # natural log
r2 = np.log(r) / np.log(2)  # log base 2
r3 = np.log(r) / np.log(1.5)  # log base 1.5
sns.kdeplot(r, color='blue', label='raw data')
sns.kdeplot(r1, color='green', label='ln(data)')
sns.kdeplot(r2, color='red', label='log2(data)')
sns.kdeplot(r3, color='cyan', label='log1.5(data)')
plt.legend()
plt.show()
Once done with data cleaning and preprocessing steps such as removing outliers, handling missing values and addressing skewness, we can move on to the next step of feature engineering, which is scaling. Scaling is not a mandatory step of feature engineering for every problem. However, imagine a scenario where one column of the data has values ranging from 10 to 1000 while another ranges from 0 to 1. Here one column gets an inappropriate weightage over the other. In such cases scaling maps the features to a particular range with a predefined minimum and maximum, in most cases [0, 1].
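As a minimal sketch of this scenario (the two columns and their values are assumed purely for illustration), min-max scaling brings both columns onto the [0, 1] range:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Illustrative data: one column spans 10-1000, the other 0-1
df = pd.DataFrame({'large_scale': [10, 250, 600, 1000],
                   'small_scale': [0.1, 0.4, 0.7, 1.0]})
# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1]
scaled = MinMaxScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))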
If the data contains categorical features such as text or strings, encoding is used in feature engineering to convert the categorical variables to numerical values.
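As a minimal sketch (the ‘city’ column is invented for illustration), each category can be mapped to an integer code using pandas; one-hot encoding, shown in the use cases below, is another common option:
import pandas as pd
# Illustrative categorical column
df = pd.DataFrame({'city': ['Kochi', 'Chennai', 'Kochi', 'Mumbai']})
# Map each distinct category to an integer code
df['city_encoded'] = df['city'].astype('category').cat.codes
print(df)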
Steps in feature engineering for a machine learning model
- Data Collection: Firstly, gather all the relevant data sources needed for the model. This can include structured data from databases, unstructured data from text files or images, or even data from application programming interfaces (APIs).
- Data Cleaning: Next, meticulously clean the collected data. This involves handling missing values, removing duplicates, and dealing with outliers. The aim is to ensure the dataset is consistent and accurate for analysis.
- Feature Selection: After cleaning the data, carefully select the features that are most relevant to the problem at hand. This step helps in reducing dimensionality and improving model performance by focusing only on the essential features.
- Feature Transformation: Transform the selected features as needed to make them more suitable for the model. Techniques such as normalization, scaling, or encoding categorical variables may be applied here to standardize the data and make it more interpretable for the model.
- Feature Creation: Sometimes, the existing features may not capture the underlying patterns effectively. In such cases, new features can be engineered from the existing ones. This could involve mathematical transformations, combining features, or creating interaction terms (a small sketch follows this list).
- Feature Evaluation: Once the features are engineered, evaluate their effectiveness in contributing to the model’s predictive power. Techniques like correlation analysis, feature importance ranking, or domain expertise can help in assessing the significance of each feature.
- Iterative Process: Feature engineering is often an iterative process. After evaluating the features, it’s common to go back and refine the earlier steps based on the insights gained. This iterative approach helps in continuously improving the model’s performance.
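As a minimal sketch of the feature-creation and feature-evaluation steps (the columns and values are invented for illustration), a derived feature can be built from existing ones and its correlation with the target inspected:
import pandas as pd
# Illustrative raw features and a target column 'spend'
df = pd.DataFrame({'orders': [5, 12, 3, 20, 8],
                   'days_active': [10, 30, 15, 40, 20],
                   'spend': [50, 180, 25, 390, 95]})
# Feature creation: orders placed per active day
df['orders_per_day'] = df['orders'] / df['days_active']
# Feature evaluation: correlation of each feature with the target
print(df.corr()['spend'].sort_values(ascending=False))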
Use Cases of Feature Engineering
1. Handling Categorical Data
Categorical data, such as gender or product category, often requires encoding into numerical values for machine learning algorithms. One common approach is one-hot encoding, where each category is represented as a binary vector.
import pandas as pd
# Sample data
data = {'category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)
# One-hot encoding
encoded_df = pd.get_dummies(df['category'], prefix='category')
print(encoded_df)
2. Dealing with Missing Values
Missing values in datasets can adversely affect model performance. Imputation is a common technique used to fill in missing values with meaningful estimates, such as the mean or median of the column.
import pandas as pd
# Sample data
data = {'A': [1, 2, None, 4, 5], 'B': [None, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Imputation
df.fillna(df.mean(), inplace=True)
print(df)
3. Feature Scaling
Features with different scales can lead to biased model training. Scaling techniques like Min-Max scaling or Standardization can bring features to a similar scale, improving model convergence and performance.
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = [[1, 2], [2, 4], [3, 6], [4, 8]]
# Min-Max scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
4. Feature Extraction
In some cases, extracting features from raw data can improve model performance. For text data, techniques like TF-IDF vectorization or word embeddings can be used to represent text documents numerically.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample text data
corpus = ['This is the first document.', 'This document is the second document.', 'And this is the third one.']
# TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
Once we aim to get the best out of what we have in the form of raw data, how well we handle the features through feature engineering is vital to the success of a machine learning model. Hence feature engineering has to be treated as an art in which we represent the data as the most elegant features so as to extract the best and most meaningful insights.
Endnote:
As we conclude this exploration of feature engineering, we invite you to share your thoughts and suggestions with us. Your feedback is invaluable to our ongoing improvement.
If you’re aspiring to pursue data science job roles, we’ve compiled a comprehensive folder of interview questions to support your preparation.
Moreover, for those eager to dive deeper into machine learning, we encourage you to visit python foundation for machine learning and machine learning fundamentals on our website.
Let’s continue our learning journey together, fuelled by curiosity and a passion for innovation in data science and machine learning!