Feature Engineering! As much as we all sigh at the mere utterance of the dreaded phrase, we as data scientists spend the majority of our time on this activity (as shown in fig 1.1). It seems a majority of data scientists find this part of their workload the most tedious and least enjoyable.
As tedious and unenjoyable as it can be, it is vital and can lead to a better-performing model. To many data scientists, the inability to quantify its impact (if any) until late in the project adds to the lack of enjoyment.
The feature engineering process usually goes like this: clean your data, select viable features for your model, extract features (e.g., by combining or one-hot encoding existing ones), or create new features altogether.
Did you know that Python has libraries that can help automate your feature engineering process?

In this blog, I will introduce two Python libraries that help automate the process, show you how to set them up, and present them in action.
Featuretools is a library that works with multiple tables, as in a relational database. You define entities (your data frames) and the relationships between them, and the library can then perform various aggregations across those relationships to create new features.
It uses a technique called deep feature synthesis to aggregate your features and automatically create new columns, and you control which aggregation primitives (the math behind the new columns) are applied. You can access the documentation for this library from this link.
Even though 50+ columns are added, these aggregations do not modify your original data frame:
There are many different aggregations you can make with this library. It is nice because it automates a process that would otherwise require manual aggregation and merging with pandas.
Feature-engine is an amazing library with an array of transformers that help you with things such as: handling missing data, one-hot encoding, outlier capping, discretization, and numerical variable transformation. It pairs very well with scikit-learn; after you split your train and test data sets, you can apply the transformers. You can access the documentation for this library from this link.
There are many ways to deal with missing data. Feature-engine allows you to replace missing values with: the mean or median, an arbitrary number, a value at the tail of the distribution, or a random sample of the variable; it can also add a binary flag marking where values were missing.
Create a transformer with one of these classes, define your parameters, call .fit() on your training data to learn the replacement values, and then .transform() to apply them to a data frame.
Feature-engine has a transformer called Winsorizer() that caps the maximum or minimum values of a variable. To tweak it, you can tell it which capping method to use (for example, limits based on a Gaussian approximation or on interquartile ranges) and which tail of the distribution (right, left, or both) you want to cap.
If you want to decide the outlier limits yourself, you can use the ArbitraryOutlierCapper() transformer, which lets you pass dictionaries of maximum and minimum values per variable.
Feature engineering does not always have to be such a drag. With libraries such as Featuretools and Feature-engine, the dull process can become a little more seamless and a lot less code-intensive.