{"id":19122,"date":"2020-07-22T07:54:29","date_gmt":"2020-07-22T05:54:29","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=19122"},"modified":"2022-12-02T08:57:36","modified_gmt":"2022-12-02T07:57:36","slug":"automated-feature-engineering-open-source-libraries","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/","title":{"rendered":"Automated Feature Engineering with Open-Source Libraries"},"content":{"rendered":"<p>In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.<\/p>\n<p><!--more--><\/p>\n<p>In this blog post, we examine the automated feature engineering of these three libraries on four datasets. We look at the following questions: How do the automated feature engineering frameworks differ conceptually and in their results? What is gained by using them, depending on the expressiveness of the predictive model at hand? 
How do these automated features compare to those engineered manually?<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Why-Automated-Feature-Engineering\" >Why (Automated) Feature Engineering?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#How-do-These-Libraries-Work\" >How do These Libraries Work?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Evaluation-Setup\" >Evaluation Setup<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Results\" >Results<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Optimization-of-Features-Over-Time\" >Optimization of Features Over Time<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Output-of-Libraries-and-Comparison-to-Manual-Feature-Engineering\" >Output of Libraries and Comparison to Manual Feature Engineering<\/a><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Performance-of-Libraries\" >Performance of Libraries<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Conclusion%E2%80%94tldr\" >Conclusion\u2014tl;dr<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#Sources\" >Sources<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Why-Automated-Feature-Engineering\"><\/span>Why (Automated) Feature Engineering?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Feature engineering is often described as the extraction and accentuation of implicit information from data [1] in order to construct features that are most amenable to learning [2]. Feature engineering is heavily domain-specific, while machine learning models can be general-purpose [2]. Whether feature engineering also covers data exploration and cleaning tasks such as missing-value imputation or outlier treatment (sometimes referred to as feature preprocessing) is not formally defined; in this blog post, we treat them as part of it.<\/p>\n<p>Feature engineering is worthwhile for two reasons: it can improve prediction metrics such as accuracy, and it can make simpler models a viable alternative to complex ones. As Andrew Ng, a prominent figure in the data science community, puts it: &#8220;Coming up with features is difficult, time-consuming, requires expert knowledge. \u2018Applied machine learning\u2019 is basically feature engineering&#8221; [3]. 
There are at least four reasons for automating this process:<\/p>\n<ol>\n<li>Domain experts may not be readily available, or they may be biased; in both cases, new and useful feature transformations are desirable.<\/li>\n<li>Scalability: automation can produce and test novel features faster and in larger numbers (which in itself does not mean that the features are better).<\/li>\n<li>Feature engineering is a trial-and-error-heavy process: the influence of each feature on the quality of a machine learning model needs to be evaluated. According to Forbes, data scientists spend most of their time on such tasks, although they rank them as the least enjoyable [4].<\/li>\n<li>In striving for fully automated machine learning, every step of the pipeline needs to be automated.<\/li>\n<\/ol>\n<p>As a clarification, deep learning and automated feature engineering are sometimes confused. Depending on the available feature engineering strategies, feature engineering is much more general-purpose than deep learning. It can include the feature preprocessing transformations mentioned above, and the input data can even be discrete or sparse. Moreover, feature engineering is also done for deep learning tasks: architectural decisions that transform the input features could be considered feature engineering by the definition given above. These transformations of the input data are chosen by design and are not the result of model training.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"How-do-These-Libraries-Work\"><\/span>How do These Libraries Work?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In a nutshell, TPOT designs machine learning pipelines with genetic programming. Genetic programming is a subcategory of evolutionary algorithms used to build programs automatically and independently of their domain. Figure 1 shows an example of a TPOT-optimized pipeline. 
The default feature operators (represented as circles in the figure) are mainly preprocessors, transformers and feature selection algorithms implemented in scikit-learn. The set of available operators can easily be extended. Model and hyperparameter selection is also part of TPOT but is deactivated in the following examples. Ultimately, the optimization goal in TPOT is to find a pipeline with as few operators as possible that achieves the best possible prediction accuracy under a given error measure.<\/p>\n<figure id=\"attachment_19128\" aria-describedby=\"caption-attachment-19128\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19128\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-1024x519.png\" alt=\"\" width=\"1024\" height=\"519\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-1024x519.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-300x152.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-768x389.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-1536x779.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-400x203.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline-360x183.png 360w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tree_based_pipeline.png 1826w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-19128\" class=\"wp-caption-text\">Figure 1: Example of a TPOT tree-based pipeline.<\/figcaption><\/figure>\n<p>Auto-sklearn also constructs complete machine learning pipelines, optimizing them for a chosen error measure. Its pipelines are less flexible than TPOT\u2019s and consist of up to four feature operators in front of the machine learning model. 
Auto-sklearn optimizes its pipelines with Bayesian optimization. In a nutshell, Bayesian optimization is a strategy for efficiently finding global extrema. It has proved especially useful for non-convex objective functions. The evidence sampled so far is captured in a probabilistic model of the relationship between tested configurations and their measured performance, and this model is then used to select the next configuration to test.<\/p>\n<p>Autofeat is intended for generating features for linear models in a two-step process. First, it generates tens of thousands of candidate features by applying non-linear transformations to the input features and combining them. Second, it selects the most useful subset based on the correlation of the features with the target residual and on their importance in a linear model. Both steps can be repeated for a configurable number of iterations, which results in more complex and more stable features, respectively.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Evaluation-Setup\"><\/span>Evaluation Setup<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>A blueprint of the setup used to generate the data for the evaluation can be seen in Figure 2.<\/p>\n<figure id=\"attachment_19131\" aria-describedby=\"caption-attachment-19131\" style=\"width: 751px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-19131\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Versuchsaufbau.png\" alt=\"\" width=\"751\" height=\"481\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Versuchsaufbau.png 751w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Versuchsaufbau-300x192.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Versuchsaufbau-400x256.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Versuchsaufbau-360x231.png 360w\" sizes=\"auto, (max-width: 751px) 100vw, 751px\" \/><figcaption id=\"caption-attachment-19131\" class=\"wp-caption-text\">Figure 2: Experimental evaluation 
setup.<\/figcaption><\/figure>\n<p>The experiments conducted for our comparison used four different regression datasets. The first two are synthetic and noiseless, so the true target function is known. The first corresponds to a simple XOR problem (the &#8220;Hello World&#8221; of feature engineering), which cannot be solved by a linear model. The second synthetic dataset corresponds to the target function \\(target = 2 + 15 * x_1 + 3 \/ (x_2 - 1 \/ x_3) + 5 * (x_2 + log(x_1))^2 - x_4^4 \\), with \\(x_1, x_3 \\) sampled from a uniform distribution between 1 and 100 and \\(x_2, x_4 \\) sampled from a normal distribution with mean 0 and standard deviation 100.<\/p>\n<p>Complementary to these, two more datasets have been taken from <a href=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a>. One is the <a href=\"https:\/\/www.kaggle.com\/c\/rossmann-store-sales\">Rossmann Store Sales<\/a> dataset (from now on referred to as \u201cRossmann\u201d), where the goal is to forecast daily sales for each store using store, promotion and competitor data. The other is the <a href=\"https:\/\/www.kaggle.com\/c\/nyc-taxi-trip-duration\">New York City Taxi Trip Duration<\/a> dataset (from now on referred to as \u201cTaxiTrip\u201d), where the goal is to predict the duration of taxi trips. We took a subset of 100,000 samples from each of these datasets to speed up the training process. We also removed outliers from the TaxiTrip dataset using the interquartile range: a few taxi trips took around 24 days, while the average trip takes 11 minutes, and such extreme values have too large an impact on the chosen accuracy metric.<\/p>\n<p>Before initialising the frameworks, the data must be pre-processed according to the requirements of each framework. 
For example, TPOT only needs the data to be cast to numerical values; auto-sklearn and autofeat additionally need a list of the indices of the categorical features; and TPOT and auto-sklearn support data imputation, while autofeat does not.<\/p>\n<p>The pre-processed data is then used for automated feature engineering. TPOT and auto-sklearn evaluate every pipeline with four-fold cross-validation, and TPOT\u2019s population size is set to 50. Autofeat is run with 3 feature generation steps and 5 selection iterations. With TPOT and auto-sklearn, the user may decide which operators can be used in a pipeline. For both, we use the default operators but restrict the pipeline to a fixed machine learning model. At the time of testing, the framework versions were 1.1.2 for autofeat, 0.11.1 for TPOT, and 0.6.0 for auto-sklearn. In addition to the libraries, we compare against the feature engineering and performance of top Kaggle solutions (<a href=\"https:\/\/www.kaggle.com\/elenapetrova\/time-series-analysis-and-forecasts-with-prophet\">Rossmann<\/a>, <a href=\"https:\/\/www.kaggle.com\/gaborfodor\/from-eda-to-the-top-lb-0-367\">TaxiTrip<\/a>) for the real-life datasets. As a naive baseline, we also evaluate the models without any feature engineering.<\/p>\n<p>Using the resulting features, I trained and evaluated four models on holdout sets. The models used are a linear regressor, a LassoLarsCV regressor, a decision tree regressor, and a random forest regressor. In the end, the performance of all models is measured by the root mean squared error (RMSE). 
The whole process is repeated 20 times for each configuration, with 12 hours of CPU time per run, as the initialization of the libraries produces different results.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Results\"><\/span>Results<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Optimization-of-Features-Over-Time\"><\/span>Optimization of Features Over Time<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Let\u2019s look at how the pipeline error improves over time. The x-axis of Figure 3 shows the number of the evaluated generation; each generation comprises 50 individuals with their associated performance. The y-axis shows the RMSE of the best individual of the generation. In cases where a generation did not improve on the best performance so far, the best performance so far is plotted.<\/p>\n<p>To illustrate what a typical optimization history looks like, the median of the 20 runs per model is highlighted in a stronger color than the individual runs, which are drawn with a lower alpha value. In addition, filled intervals in each model\u2019s color show the performance range from the first to the third quartile. The intervals and medians are plotted as long as at least five runs remain. 
When a run is terminated, because the allocated time has run out, it is marked by a red line on the median.<\/p>\n<figure id=\"attachment_19139\" aria-describedby=\"caption-attachment-19139\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-19139 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-1024x768.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-300x225.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-768x576.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-400x300.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann-360x270.png 360w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_rossmann.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-19139\" class=\"wp-caption-text\">Figure 3: TPOT&#8217;s optimisation history for Rossmann.<\/figcaption><\/figure>\n<figure id=\"attachment_19141\" aria-describedby=\"caption-attachment-19141\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-19141 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-1024x768.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-300x225.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-768x576.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-400x300.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip-360x270.png 360w, 
https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/tpot_taxitrip.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-19141\" class=\"wp-caption-text\">Figure 4: TPOT&#8217;s optimization history for TaxiTrip.<\/figcaption><\/figure>\n<p>The more expressive models, especially the random forest, start with better performance. For most models, performance converges after around 20 generations; the linear models are the exception. In cases where TPOT is able to construct good linear features, the improvement stays consistently high until termination. For the more expressive models, which did well without feature engineering because they can already model more of the underlying data structure, TPOT proved less effective in improving performance. In those cases, more computational resources are unlikely to make a difference in performance.<\/p>\n<p>There is a noticeable difference in the number of generations evaluated before termination, which is not only due to the differing complexity of the pipelines. We specified that each generation should have 50 individuals; however, most have only around 10. The populations seem to become homogeneous, and no new combinations of operators are found. In principle there are enough operator combinations, and a pipeline can always have more operators, but a few operators account for most hyperparameter configurations, and pipeline growth is heavily restricted. The issue with pipeline growth is that the first generation is initialized with at most 3 operators and growth is limited by design. As a pipeline\u2019s fitness is measured by both its error and its number of operators, a pipeline with an additional operator is always worse than its predecessor on at least one fitness criterion. This also favors complex operators that do a lot at once, because only the number of operators counts, even though the complexity of individual operators differs considerably. 
For example, the default configuration contains both an operator that counts the occurrences of zeros in a sample and a radial basis function sampler that produces a hundred new features at once. The mutation of a pipeline, a procedure for escaping local optima in evolutionary algorithms, changes, prunes or adds only one operator at a time, which is possibly not enough change to escape a local minimum. The effect is that while 80 percent of the best pipelines had at least 4 operators, only 20 percent of the evaluated pipelines had more than 4. Altering the optimization process to be more growth-friendly, initializing deeper pipelines, and redefining complexity could potentially lead to a greater variety of pipelines and to better ones.<\/p>\n<p>The observations for auto-sklearn\u2019s optimization history in Figure 5 are similar to those for TPOT; the only difference is a more stable convergence for all datasets. This is quite possibly due to the fixed pipeline structure.<\/p>\n<figure id=\"attachment_19145\" aria-describedby=\"caption-attachment-19145\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19145\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_rossmann-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" \/><figcaption id=\"caption-attachment-19145\" class=\"wp-caption-text\">Figure 5: Auto-sklearn&#8217;s optimization history for Rossmann.<\/figcaption><\/figure>\n<figure id=\"attachment_19144\" aria-describedby=\"caption-attachment-19144\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19144\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-1024x768.png 1024w, 
https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-300x225.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-768x576.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-400x300.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip-360x270.png 360w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autosklearn_taxitrip.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-19144\" class=\"wp-caption-text\">Figure 6: Auto-sklearn&#8217;s optimization history for TaxiTrip.<\/figcaption><\/figure>\n<p>Autofeat\u2019s plot in Figure 7 looks different from the others, as there is no improvement over time to analyze, only the final performance for a varying number of generation and selection steps. Note that the feature generation is independent of the final prediction model. The number of selected features is shown in the cells of the heatmap, and the color intensity shows the performance gain compared to doing no feature engineering. Generally, the most important factor for performance was the complexity of the generated features (y-axis), which is limited by the available memory. Large datasets, especially those with many one-hot-encoded categorical features, become very memory-intensive and would need a different evaluation setup with more memory to inspect the effect of additional generation steps. The simpler feature transformations in the first two feature generation steps were not complex enough to yield newly selected features. 
Linear models profit more from feature engineering. This is unsurprising, as the focus of this library is to produce linear features and features are partially selected based on the coefficients of a linear model.<\/p>\n<figure id=\"attachment_19146\" aria-describedby=\"caption-attachment-19146\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19146\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/autofeat_taxitrip-1024x614.png\" alt=\"\" width=\"1024\" height=\"614\" \/><figcaption id=\"caption-attachment-19146\" class=\"wp-caption-text\">Figure 7: Autofeat\u2019s optimization of features for TaxiTrip.<\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Output-of-Libraries-and-Comparison-to-Manual-Feature-Engineering\"><\/span>Output of Libraries and Comparison to Manual Feature Engineering<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Let\u2019s explore the output of the libraries for the simple XOR feature engineering problem, restricted to the linear regression model. The baseline RMSE is 0.5001. Every framework solves this simple problem almost immediately, to within machine precision, but with very different features.<\/p>\n<p>Beginning with TPOT\u2019s output, the following code snippets show two solutions. The generated output is a compact representation of the pipeline TPOT produced. It lists the operators of the pipeline from the last to the first, together with their parameter settings. 
For example, the first pipeline applies an operator called RBFSampler, with the gamma hyperparameter set to 0.1, to the original feature matrix; its output is then the input for the linear regression model.<\/p>\n<pre class=\"lang:python decode:true\" title=\"TPOT pipelines\"># Pipeline 1\r\n\r\nLinearRegression(RBFSampler(input_matrix, gamma=0.1))\r\n\r\n# Pipeline 2\r\n\r\nLinearRegression(ZeroCount(ZeroCount(input_matrix)))\r\n\r\n<\/pre>\n<p>The output shows one simple and one overly complex example. <em>Pipeline 2<\/em> is an example of minimalistic feature construction: two features are added sequentially by the &#8220;ZeroCount&#8221; operator. <em>Pipeline 1<\/em>, with the default settings for the RBFSampler (i.e., radial basis function sampler), computes an additional 100 features, resulting in a total of 102 features for this simple case. This illustrates one of the downsides of the current definition of complexity in TPOT: constructing 100 new features with the RBFSampler is considered less complex than counting the occurrences of zeros twice per sample in <em>Pipeline 2<\/em>.<\/p>\n<p>Auto-sklearn&#8217;s outputs look different but contain the same kind of information. The library also sometimes uses overly complex operators, as in <em>Pipeline 1<\/em>. Because a pipeline can only contain a limited number of operators, very simple operators like \u201cZeroCount\u201d are not part of auto-sklearn by default. <em>Pipeline 2<\/em>, with the polynomial feature operator, shows a more minimalistic approach. 
Note that sometimes additional scaling or imputation strategies are applied that do nothing, but since this library does not enforce the fewest possible operators by design, they remain part of the pipeline.<\/p>\n<pre class=\"lang:default decode:true\" title=\"Auto-sklearn pipelines\"># Pipeline 1\r\n\r\nimputation:strategy, Value: 'median'\r\n\r\npreprocessor:random_trees_embedding: bootstrap= True, max_depth=5, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=1.0, n_estimators=10\r\n\r\nregressor:__choice__, Value: 'OwnLinearReg'\r\n\r\n# Pipeline 2\r\n\r\ncategorical_encoding:one_hot_encoding: use_minimum_fraction=False\r\n\r\nimputation:strategy, Value: 'most_frequent'\r\n\r\npreprocessor:polynomial: degree=2, include_bias=True, interaction_only=False\r\n\r\nregressor:__choice__, Value: 'OwnLinearReg'\r\n\r\nrescaling:__choice__, Value: 'minmax'<\/pre>\n<p>Autofeat solves the XOR dataset consistently with a minimal set of features. With every hyperparameter configuration, it adds a single additional feature such as \\(x_1 * x_2 \\) or \\(|x_1 - x_2|\\).<\/p>\n<p>We will not go into detail about the features constructed by TPOT and auto-sklearn, as their outputs show only the feature operators, not the exact features created. As of now, the output features are not labeled, and especially in TPOT\u2019s potentially big pipelines, features are mixed up in many transformations and their meaning is lost in complex operators. While autofeat\u2019s constructed features are clearly labeled, the complex non-linear transformations are simply not interpretable in the tested real-life scenarios. One interesting observation I want to highlight concerns the complex synthetic function \\(target = 2 + 15 * x_1 + 3 \/ (x_2 - 1 \/ x_3) + 5 * (x_2 + log(x_1))^2 - x_4^4 \\). 
Autofeat outputs the terms\u00a0\\(5 * (x_2 + log(x_1))^2\\) and\u00a0\\(x_4^4 \\) of the target function exactly and thereby achieves an almost perfect error score. To see under which conditions autofeat can still produce these exact features, I introduced a variety of complications. The exact terms could only be engineered consistently when the complications were duplicate features, random features, or a little noise on the target function. In particular, new categorical values that affected part of the other features could not be reconstructed. Even though the features needed for all complications were produced during feature generation, they were simply not selected.<\/p>\n<p>The Kaggle feature engineering shows some similarities. Sometimes the exact operators of the pipeline-generating libraries are used; in other cases, the operators are simply not implemented. In particular, a date-and-time conversion operator is not available in either library. On the other hand, many features built from combinations of features are only generated by autofeat, but the ones used in the Kaggle solutions were not selected. Figure 8 gives an overview of which feature engineering techniques are used in the Kaggle contributions and whether the frameworks can reproduce them. Note that a framework is considered able to make use of a technique if it implements at least one sub-technique. 
For all listed feature engineering techniques, there are multiple potential approaches.<\/p>\n<figure id=\"attachment_19148\" aria-describedby=\"caption-attachment-19148\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19148\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/Screenshot-2020-06-30-at-13.33.40-1024x401.png\" alt=\"\" width=\"1024\" height=\"401\" \/><figcaption id=\"caption-attachment-19148\" class=\"wp-caption-text\">Figure 8: Implemented feature engineering techniques.<\/figcaption><\/figure>\n<p>The most important takeaway is that, with current automated solutions, domain knowledge is not replaceable, as useful feature transformations can be anything depending on the domain. TPOT and auto-sklearn cannot include every conceivable operator, and introducing new external data, which is often part of Kaggle solutions, is currently out of reach for these libraries.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Performance-of-Libraries\"><\/span>Performance of Libraries<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The overall median performance for all datasets except the XOR problem can be seen in Figure 9. The best-performing library for the more expressive models is TPOT, followed by autofeat and lastly auto-sklearn. For simple models, the frameworks perform equally well. 
In simpler scenarios where the data can be made completely linear, as in the synthetic examples, autofeat dominated by constructing parts of the exact target function.<\/p>\n<figure id=\"attachment_19147\" aria-describedby=\"caption-attachment-19147\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-19147\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/06\/total_overall-1024x732.png\" alt=\"\" width=\"1024\" height=\"732\" \/><figcaption id=\"caption-attachment-19147\" class=\"wp-caption-text\">Figure 9: Median RMSE performance for all datasets.<\/figcaption><\/figure>\n<p>Overall, the best human feature engineering solutions outperform the frameworks on all real-life datasets, by as much as a factor of five in the Rossmann example. The Kaggle results for the linear models have to be taken with a grain of salt, as the Kaggle features were not designed for a linear model. For the real-life datasets, the model\u2019s complexity was more important for performance than the feature engineering, although all models benefit from automated feature engineering.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion%E2%80%94tldr\"><\/span>Conclusion\u2014tl;dr<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ol>\n<li>TPOT&#8217;s variably sized pipelines and complex operators outperformed the other libraries on the real-life datasets. 
Linear models could potentially improve with more resources.<\/li>\n<li>Autofeat is best suited to small datasets where the underlying data may be linearly separable.<\/li>\n<li>Automated feature engineering is more effective for simpler models; however, model complexity mattered more than feature engineering.<\/li>\n<li>The automated solutions are clearly inferior to the best Kaggle solutions.<\/li>\n<li>Automated feature engineering is not fully automated yet, as the feature engineering possibilities are limitless and highly dependent on the problem domain.<\/li>\n<\/ol>\n<p>Automated feature engineering can be another tool in a data scientist\u2019s workflow, especially when domain knowledge or human resources are limited, but it certainly does not replace the need for human feature engineering.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Sources\"><\/span>Sources<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>[1] Ray, Sunil: A Comprehensive Guide to Data Exploration. https:\/\/www.analyticsvidhya.com\/blog\/2016\/01\/guide-data-exploration\/, Retrieved: June 9, 2020<\/p>\n<p>[2] Domingos, Pedro: A Few Useful Things to Know about Machine Learning. In: Communications of the ACM 55 (2012), No. 10, pp. 78\u201387. ISSN 0001-0782, 1557-7317<\/p>\n<p>[3] Ng, Andrew: Machine Learning and AI via Brain Simulations. https:\/\/helper.ipam.ucla.edu\/publications\/gss2012\/gss2012_10595.pdf, Retrieved: January 21, 2020<\/p>\n<p>[4] Rencberoglu, Emre: Fundamental Techniques of Feature Engineering for Machine Learning. https:\/\/towardsdatascience.com\/feature-engineering-for-machine-learning-3a5e293a5114, Retrieved: January 7, 
2020<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.<\/p>\n","protected":false},"author":157,"featured_media":19319,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[245,206,327],"service":[431],"coauthors":[{"id":157,"display_name":"Jonas Meier","user_nicename":"jonas-meierinovex-de"}],"class_list":["post-19122","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-automl","tag-data-science","tag-feature-engineering","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Automated Feature Engineering with Open-Source Libraries - inovex GmbH<\/title>\n<meta name=\"description\" content=\"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Automated Feature Engineering with Open-Source Libraries - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.\" 
\/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2020-07-22T05:54:29+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-12-02T07:57:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Jonas Meier\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jonas Meier\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"16\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Jonas Meier\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/\"},\"author\":{\"name\":\"Jonas 
Meier\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/0aff00280d0c64c9bc6a3b1ff24c2b09\"},\"headline\":\"Automated Feature Engineering with Open-Source Libraries\",\"datePublished\":\"2020-07-22T05:54:29+00:00\",\"dateModified\":\"2022-12-02T07:57:36+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/\"},\"wordCount\":3164,\"commentCount\":2,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/automated-feature-engineering.png\",\"keywords\":[\"AutoML\",\"Data Science\",\"Feature Engineering\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/\",\"name\":\"Automated Feature Engineering with Open-Source Libraries - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/automated-feature-engineering.png\",\"datePublished\":\"2020-07-22T05:54:29+00:00\",\"dateModified\":\"2022-12-02T07:57:36+00:00\",\"description\":\"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/automated-feature-engineering.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/07\\\/automated-feature-engineering.png\",\"width\":1920,\"height\":1080,\"caption\":\"A machine assembling features from un-organized iinput\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/automated-feature-engineering-open-source-libraries\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Automated Feature 
Engineering with Open-Source Libraries\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/0aff00280d0c64c9bc6a3b1ff24c2b09\",\"name\":\"Jonas 
Meier\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g9f98d06e6d2a702984ac5049306bccb9\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g\",\"caption\":\"Jonas Meier\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/jonas-meierinovex-de\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Automated Feature Engineering with Open-Source Libraries - inovex GmbH","description":"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/","og_locale":"de_DE","og_type":"article","og_title":"Automated Feature Engineering with Open-Source Libraries - inovex GmbH","og_description":"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.","og_url":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/","og_site_name":"inovex 
GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2020-07-22T05:54:29+00:00","article_modified_time":"2022-12-02T07:57:36+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png","type":"image\/png"}],"author":"Jonas Meier","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Jonas Meier","Gesch\u00e4tzte Lesezeit":"16\u00a0Minuten","Written by":"Jonas Meier"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/"},"author":{"name":"Jonas Meier","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/0aff00280d0c64c9bc6a3b1ff24c2b09"},"headline":"Automated Feature Engineering with Open-Source Libraries","datePublished":"2020-07-22T05:54:29+00:00","dateModified":"2022-12-02T07:57:36+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/"},"wordCount":3164,"commentCount":2,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png","keywords":["AutoML","Data Science","Feature Engineering"],"articleSection":["Analytics","English 
Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/","url":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/","name":"Automated Feature Engineering with Open-Source Libraries - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png","datePublished":"2020-07-22T05:54:29+00:00","dateModified":"2022-12-02T07:57:36+00:00","description":"In the hope of excellent features, without requiring domain experts spending days engineering them, lies this review of automated feature engineering with TPOT, auto-sklearn and autofeat.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/automated-feature-engineering.png","width":1920,"height":1080,"caption":"A machine assembling features from un-organized 
iinput"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/automated-feature-engineering-open-source-libraries\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Automated Feature Engineering with Open-Source Libraries"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/0aff00280d0c64c9bc6a3b1ff24c2b09","name":"Jonas 
Meier","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/secure.gravatar.com\/avatar\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g9f98d06e6d2a702984ac5049306bccb9","url":"https:\/\/secure.gravatar.com\/avatar\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7841440a936288b49258720ee4b6a9157f8abd7dbd857b8f21203807cba65ab4?s=96&d=retro&r=g","caption":"Jonas Meier"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/jonas-meierinovex-de\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19122","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/157"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=19122"}],"version-history":[{"count":1,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19122\/revisions"}],"predecessor-version":[{"id":39797,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19122\/revisions\/39797"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/19319"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=19122"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=19122"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=19122"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=19122"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}