Feature Engineering Libraries in Python …

Feature Engineering! As much as we all sigh at the mere utterance of the dreaded phrase, we as data scientists spend the majority of our time engaging in this activity (as shown in fig 1.1). It seems a majority of data scientists find this part of their workload to be the most tedious and least enjoyable.

As tedious and unenjoyable as it can be, it is vital and can lead to a better-performing model. To many data scientists, the inability to quantify its impact (if any) until late in the workload adds to the lack of enjoyment.

The feature engineering process usually goes like this: clean your data, select viable features for your model, extract features (by combining columns or one-hot encoding them), and create new features.
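To make the extract and create steps concrete, here is a minimal sketch in plain Python with no libraries. The helper one_hot_encode, the sample rows, and the petal_area column are made up for illustration; they are not part of either library covered in this post.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows, loosely shaped like the Iris data set.
rows = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "virginica", "petal_length": 5.1, "petal_width": 1.9},
]

encoded = one_hot_encode(rows, "species")

# Create a new feature by combining two existing columns.
for row in encoded:
    row["petal_area"] = row["petal_length"] * row["petal_width"]
```

Each row ends up with a species_setosa and a species_virginica column instead of the original text column, plus the combined petal_area feature.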

Did you know that Python has libraries that can help automate your feature engineering process?

In this blog post, I will introduce two Python libraries that help automate the process, show you how to set them up, and present them in action.

Featuretools

Featuretools is a library that works with multiple related tables, such as those in a relational database. It requires you to define entities (tables with a unique index) and the relationships between them, so that it can join your data frames and group them to perform various aggregations that create new features.

It uses an algorithm called deep feature synthesis to aggregate your features and automatically create new columns. You control which aggregation functions are applied. The documentation for this library is linked in the Sources section at the end of this post.

Figure: The process of creating features on the Iris data set
Figure: Added features from the aggregation

Even though more than 50 columns are added, these aggregations do not affect your original data frame.

There are many different aggregations you can make with this library. It automates the process and offers an alternative to writing each aggregation by hand in pandas.
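To give a feel for the kind of work being automated, here is a rough plain-Python sketch of per-group aggregation. This is not the Featuretools API: the function add_group_aggregates, the column naming scheme, and the three sample rows are all assumptions for illustration. It groups rows by species, computes a few aggregations, and returns new rows so the originals stay untouched.

```python
from statistics import mean


def add_group_aggregates(rows, group_key, value_keys, aggregations):
    """Return copies of the rows with per-group aggregates as extra columns."""
    # Collect the values of each numeric column per group.
    groups = {}
    for row in rows:
        bucket = groups.setdefault(row[group_key], {key: [] for key in value_keys})
        for key in value_keys:
            bucket[key].append(row[key])

    enriched = []
    for row in rows:
        new_row = dict(row)  # copy, so the original row is left as it was
        for key in value_keys:
            values = groups[row[group_key]][key]
            for name, func in aggregations.items():
                new_row[f"{name}_{key}_by_{group_key}"] = func(values)
        enriched.append(new_row)
    return enriched


# Hypothetical sample rows, loosely shaped like the Iris data set.
rows = [
    {"species": "setosa", "sepal_length": 5.0, "petal_length": 1.4},
    {"species": "setosa", "sepal_length": 5.4, "petal_length": 1.6},
    {"species": "virginica", "sepal_length": 6.3, "petal_length": 6.0},
]

# You decide which aggregations to compute.
aggregations = {"mean": mean, "max": max, "min": min}

features = add_group_aggregates(
    rows, "species", ["sepal_length", "petal_length"], aggregations
)
```

Two numeric columns times three aggregations gives six new columns per row. With more columns and more aggregation functions, the count grows quickly, which is how a small data set can end up with dozens of added features.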

Feature-engine

Feature-engine is an amazing library with an array of transformers that help you with things such as handling missing data, one-hot encoding, outlier capping, discretization, and numerical variable transformation. It pairs very well with scikit-learn: after you separate your train and test data sets, you fit the transformers on the training set and then apply them to both sets. The documentation for this library is linked in the Sources section at the end of this post.

Missing data

There are many ways to deal with missing data. Feature-engine allows you to replace missing values with the mean or median, an arbitrary number, a value at the tail of the variable's distribution, or a random sample of the variable. It can also add a binary flag that marks where values were missing.

Create a variable with one of the transformers, define your parameters, call .fit() to learn the replacement values from your training data, and then call .transform() to apply them to your data frame.
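In scikit-learn style transformers, fitting learns parameters from the training data and transforming applies them to any data set. Here is a small plain-Python sketch of that pattern for median imputation with a binary missing flag. MedianImputer is a hypothetical toy class written for this post, not Feature-engine's implementation, and the parameter names and sample data are assumptions.

```python
from statistics import median


class MedianImputer:
    """Toy imputer that follows the fit/transform pattern (hypothetical)."""

    def __init__(self, variables, add_indicator=False):
        self.variables = variables
        self.add_indicator = add_indicator

    def fit(self, rows):
        # Learn one median per variable, ignoring missing (None) values.
        self.medians_ = {
            var: median(row[var] for row in rows if row[var] is not None)
            for var in self.variables
        }
        return self

    def transform(self, rows):
        imputed = []
        for row in rows:
            new_row = dict(row)  # leave the input rows unchanged
            for var in self.variables:
                missing = row[var] is None
                if self.add_indicator:
                    new_row[f"{var}_na"] = int(missing)
                if missing:
                    new_row[var] = self.medians_[var]
            imputed.append(new_row)
        return imputed


# Made-up train and test rows with missing ages.
train = [{"age": 20}, {"age": None}, {"age": 30}, {"age": 40}]
test = [{"age": None}, {"age": 55}]

# The median comes from the training data only, then is reused on the test data.
imputer = MedianImputer(["age"], add_indicator=True).fit(train)
test_imputed = imputer.transform(test)
```

Learning the median on the training set alone keeps information from the test set out of the model, which is why the split comes before the transformers.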

Outlier capping

Feature-engine has a transformer called Winsorizer() that caps maximum or minimum values at limits learned from the data. You can choose how the limits are calculated, for example with the interquartile range (IQR) proximity rule or, if the variable is roughly normally distributed, from its mean and standard deviation. You can also tell it which tail (left, right, or both) you want to cap.
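Here is a small plain-Python sketch of IQR-based capping to show the idea. IQRCapper is a hypothetical class written for this post, not Feature-engine's Winsorizer. The tail and fold parameter names, the 1.5 default, and the quartile method (statistics.quantiles with method="inclusive") are my assumptions, so the exact limits may differ from what the library computes.

```python
from statistics import quantiles


class IQRCapper:
    """Toy capper using the IQR proximity rule (hypothetical)."""

    def __init__(self, tail="both", fold=1.5):
        if tail not in ("left", "right", "both"):
            raise ValueError("tail must be 'left', 'right', or 'both'")
        self.tail = tail
        self.fold = fold

    def fit(self, values):
        # Learn the capping limits from the quartiles of the training values.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        self.lower_ = q1 - self.fold * iqr
        self.upper_ = q3 + self.fold * iqr
        return self

    def transform(self, values):
        capped = []
        for value in values:
            if self.tail in ("right", "both"):
                value = min(value, self.upper_)
            if self.tail in ("left", "both"):
                value = max(value, self.lower_)
            capped.append(value)
        return capped


# Made-up training values and new values containing two outliers.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
new_values = [-10, 5, 100]

capper = IQRCapper(tail="both").fit(train)
capped_both = capper.transform(new_values)

# Cap only the right tail, leaving low values as they are.
capped_right = IQRCapper(tail="right").fit(train).transform(new_values)
```

With these training values the quartiles are 3 and 7, so the limits land at -3 and 13 and anything outside that range is pulled back to the nearest limit.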

If you want to set the limits yourself, you can use ArbitraryOutlierCapper(). It lets you pass dictionaries of maximum and minimum values for each variable.
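The dictionary-based approach can be sketched in plain Python as well. The function cap_with_limits, its max_caps and min_caps arguments, and the sample rows are hypothetical and only illustrate the idea; they are not the library's own signature.

```python
def cap_with_limits(rows, max_caps=None, min_caps=None):
    """Cap each listed variable at the limits chosen by hand."""
    max_caps = max_caps or {}
    min_caps = min_caps or {}
    capped = []
    for row in rows:
        new_row = dict(row)  # leave the input rows unchanged
        for var, limit in max_caps.items():
            new_row[var] = min(new_row[var], limit)
        for var, limit in min_caps.items():
            new_row[var] = max(new_row[var], limit)
        capped.append(new_row)
    return capped


# Made-up rows with an impossible age and a negative income.
rows = [{"age": 130, "income": -500}, {"age": 42, "income": 52000}]

capped = cap_with_limits(
    rows,
    max_caps={"age": 100},
    min_caps={"age": 0, "income": 0},
)
```

This is handy when domain knowledge tells you the sensible range, for example that an age above 100 or a negative income is almost certainly a data entry problem.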

Conclusion

Feature engineering does not always have to be such a drag. With libraries such as Featuretools and Feature-engine, the dull process can become a little more seamless and less code-intensive.

Sources:

https://feature-engine.readthedocs.io/en/latest/

https://docs.featuretools.com/en/stable/index.html

https://adtmag.com/articles/2016/03/25/data-science-report.aspx

Junior Data Scientist| Passionate about using data for social good.
