{"id":20915,"date":"2021-06-29T12:17:00","date_gmt":"2021-06-29T11:17:00","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=20915"},"modified":"2022-10-11T12:34:59","modified_gmt":"2022-10-11T10:34:59","slug":"honey-i-shrunk-the-target-variable","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/","title":{"rendered":"Honey, I Shrunk the Target Variable"},"content":{"rendered":"<p>Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong but also shows how a smart transformation can bring you honour &amp; fame in practical applications.<!--more--><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Motivation\" >Motivation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Lets-get-started\" >Let\u2019s get\u00a0started<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Choosing-the-right-error-measure\" >Choosing the right error\u00a0measure<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Distribution-of-the-target-variable\" >Distribution of the target\u00a0variable<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link 
ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Analysis-of-the-residual-distribution\" >Analysis of the residual\u00a0distribution<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Shrinking-the-target-variable\" >Shrinking the target\u00a0variable<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Transforming-the-target-for-fun-and-profit\" >Transforming the target for fun and\u00a0profit<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#Aftermath\" >Aftermath<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Motivation\"><\/span>Motivation<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>For me, it is often an irritating sight to see how inexperienced, up and coming data scientists jump right into the feature engineering when facing some new supervised learning problem \u2026 but it also makes me contemplate about my past when I started with data science. So full of vigour and enthusiasm, I was often completely absorbed by the idea of minimizing whatever error measure I was given or chose myself \u2013 maybe even randomly \u2013, like the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Root-mean-square_deviation\">root mean square error<\/a>. In my drive, I used to construct many derived features using clever transformations and sometimes did not even stop at the target variable. Why should I? If the target variable is for instance non-negative and quite right-skewed, why not transform it using the logarithm to make it more normally distributed? 
Isn\u2019t this better or even required for simple models like linear regression, anyways? A little \\(\\log\\)\u00a0never killed a dog, so what could possibly go\u00a0wrong?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-38837 aligncenter\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/shrunk_meme.jpg\" alt=\"Couple looking at spoon with magnifier\" width=\"649\" height=\"385\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/shrunk_meme.jpg 649w, https:\/\/www.inovex.de\/wp-content\/uploads\/shrunk_meme-300x178.jpg 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/shrunk_meme-400x237.jpg 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/shrunk_meme-360x214.jpg 360w\" sizes=\"auto, (max-width: 649px) 100vw, 649px\" \/><\/p>\n<figure>\n<p align=\"center\">\n<\/figure>\n<p>As you might have guessed from these questions, it is not that easy, and transforming your target variable puts you directly into the\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=siwpn14IE7E\">danger zone<\/a>. In this blog post, I want to elaborate on why this is, from a mathematical perspective but also by demonstrating it in some practical examples. Without spoiling too much I hope, for the too busy or plain lazy readers, the main take-away\u00a0is:<\/p>\n<blockquote><p>TLDR: Applying any non-<a href=\"https:\/\/en.wikipedia.org\/wiki\/Affine_transformation\">affine transformation<\/a>\u00a0to your target variable might have unwanted effects on the error measure you are minimizing. So if you do not know exactly what you are doing, just\u00a0don\u2019t.<\/p><\/blockquote>\n<h2><span class=\"ez-toc-section\" id=\"Lets-get-started\"><\/span>Let\u2019s get\u00a0started<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Before we start with the gory mathematical details, let\u2019s first pick and explore a typical use-case where most inexperienced data scientists might be tempted to transform the target variable without a second thought. 
In order to demonstrate this, I chose the\u00a0<a href=\"https:\/\/www.kaggle.com\/vfsousas\/autos\">used-cars database from Kaggle<\/a>\u00a0and if you want to follow along, you find the code in the notebooks folder of my Github\u00a0<a href=\"https:\/\/github.com\/FlorianWilhelm\/used-cars-log-trans\/\">used-cars-log-trans repository<\/a>. As the name suggests, the data set contains used cars having car features like <span style=\"font-size: 12pt;\"><code>vehicleType<\/code>,\u00a0<code>yearOfRegistration<\/code><\/span> &amp; <code>monthOfRegistration<\/code>,\u00a0\u00a0<code>gearbox<\/code>,\u00a0\u00a0<code>powerPS<\/code>,\u00a0\u00a0<code>model<\/code>,\u00a0\u00a0<code>kilometer<\/code> (mileage),\u00a0<code>fuelType<\/code>,\u00a0\u00a0<code>brand<\/code> and\u00a0<code>price<\/code>.<\/p>\n<p>Let\u2019s say the business unit basically asks us to determine the proper market value of a car given the features above to determine if its price is actually a good deal, fair deal or a bad deal. The obvious way to approach this problem is to create a model that predicts the price of a car, which we assume to be its market value, given its features. Since we have roughly 370,000 cars in our data set, for most cars we will have many similar cars and thus our model will predict a price that is some kind of average of their prices. Consequently, we can think of this predicted price (let\u2019s call it\u00a0pred_price) as the actual market value. To determine if the actual\u00a0price\u00a0of a vehicle is a good, fair or bad deal, we would then calculate for instance the relative\u00a0error<\/p>\n<p style=\"text-align: center;\"><span class=\"h5\">\\(\\frac{\\mathrm{pred\\_price}-\\mathrm{price}}{\\mathrm{price}}\\)<\/span><\/p>\n<p>in the simplest case. If the relative error is close to zero, we would call it fair. If it is much larger than zero, it is a good deal, and a bad deal if it is much smaller than zero. 
For the actual subject of this blog post, this use-case already serves us as a good motivation for the development of some regression model that will predict the price given some car features. The attentive reader has certainly noticed that the prices in our data set will be biased towards a higher price and thus also our predicted \u201cmarket value\u201d. This is due to the fact that we do not know at which price the car was eventually sold. We only know the amount of money the seller wanted to have, which is of course higher than or equal to what he or she gets in the end. For the sake of simplicity, we assume that we have raised this point with the business unit, they noted it duly and we thus neglect it for our\u00a0analysis.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Choosing-the-right-error-measure\"><\/span>Choosing the right error\u00a0measure<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>At this point, a lot of inexperienced data scientists would directly get into the business of feature engineering and build some kind of fancy model. Nowadays, most machine learning frameworks like\u00a0<a href=\"https:\/\/scikit-learn.org\/\">Scikit-Learn<\/a>\u00a0are so easy to use that one might even forget the error measure that is optimized, as in most cases it will be the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_squared_error\">mean square error<\/a>\u00a0(MSE) by default. But does the\u00a0MSE\u00a0really make sense for this use case? First of all, our target is measured in some currency, so why would we try to minimize some squared difference? Squared Euro? Very clearly, even taking the square root in the end, i.e.\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Root-mean-square_deviation\">root mean square error<\/a>\u00a0(RMSE), would not change a thing about this fact. 
Still, we would weight one large residual higher than many small residuals which sum up to the exact same value, as if ten residuals of 10.- \u20ac were somehow less severe than a single residual of 100.- \u20ac. You see what I am getting at. In our use-case an error measure like the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_absolute_error\">mean absolute error<\/a>\u00a0(MAE) might be the more natural choice compared to the\u00a0MSE.<\/p>\n<p>On the other hand, is it really that important if a car costs you 1,000.- \u20ac more or less? It definitely is if you are looking at cars at around 10,000.- \u20ac but it might be negligible if your luxury vehicle is around 100,000.- \u20ac anyway. Consequently, the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mean_absolute_percentage_error\">mean absolute percentage error<\/a>\u00a0(MAPE) might even be a better fit than the\u00a0MAE\u00a0for this use-case. Having said that, we will keep all those error measures in mind but use the default\u00a0MSE\u00a0criterion in our machine-learning algorithm for the sake of simplicity and to help me make the actual point of this blog post.\u00a0\ud83d\ude09<\/p>\n<p>Nevertheless, one crucial aspect should be kept in mind for the rest of this post. In the end, after the fun part of modeling, we, as data scientists, have to communicate the results to business people and the assessment of the quality of the results is going to play an important role in this. This assessment will almost always be conducted using the raw, i.e. untransformed, target as well as the chosen error measure to answer the question if the results are good enough for the use case at hand and consequently if the model can go to production as a first iteration. Practically, that means that even if we decide to train a model on a transformed target, we have to transform the predictions of the model back for evaluation. 
Results are always communicated based on the original\u00a0target.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Distribution-of-the-target-variable\"><\/span>Distribution of the target\u00a0variable<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Our data contains not only cars that are for sale but also cars people are searching for at a certain price. Additionally, we have people offering damaged cars, wanting to trade their car for another or just hoping to get an insanely enormous amount of money. Sometimes you get lucky. For our use case, we are going to keep only real offerings of undamaged cars with a reasonable price between 200.- \u20ac and 50,000.- \u20ac and a first registration not earlier than 1910. This is what the distribution of the price looks\u00a0like.<\/p>\n<figure id=\"attachment_38833\" aria-describedby=\"caption-attachment-38833\" style=\"width: 951px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-38833 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution.png\" alt=\"Depiction of the distribution plot of the price variable\" width=\"951\" height=\"531\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution.png 951w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution-300x168.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution-768x429.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution-400x223.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_distribution-360x201.png 360w\" sizes=\"auto, (max-width: 951px) 100vw, 951px\" \/><figcaption id=\"caption-attachment-38833\" class=\"wp-caption-text\">Distribution plot of the price variable using 1,000.- \u20ac bins.<\/figcaption><\/figure>\n<figure><\/figure>\n<p>It surely does look like a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Log-normal_distribution\">log-normal 
distribution<\/a>\u00a0and just to have a visual check, fitting a log-normal distribution with the help of the wonderful\u00a0<a href=\"https:\/\/www.scipy.org\/\">SciPy<\/a>\u00a0gets us\u00a0this.<\/p>\n<figure id=\"attachment_38835\" aria-describedby=\"caption-attachment-38835\" style=\"width: 952px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-38835 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit.png\" alt=\"depiction of a log-normal distribution fitted to the distribution of prices\" width=\"952\" height=\"530\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit.png 952w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit-300x167.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit-768x428.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit-400x223.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/histtv_price_log-normal_fit-360x200.png 360w\" sizes=\"auto, (max-width: 952px) 100vw, 952px\" \/><figcaption id=\"caption-attachment-38835\" class=\"wp-caption-text\">Log-normal distribution fitted to the distribution of prices.<\/figcaption><\/figure>\n<figure><\/figure>\n<p>Seeing this, you might feel the itch to apply the logarithm to our target variable right away, just to make it look more\u00a0normal. And isn&#8217;t this some basic assumption of a linear model\u00a0anyway?<\/p>\n<p>Well, this is a common misconception. The dependent variable, i.e. target variable, of a linear model does not need to be normally distributed, only the residuals need to be. This can be seen easily by revisiting the formula of a linear model. 
For the observed outcome \\(y_i\\) and some true latent outcome \\(\\mu_i\\) of the \\(i\\)-th sample, we have<\/p>\n<div id=\"1\" style=\"text-align: center;\"><span class=\"h5\">\\(\\begin{array}{lcl} \\mu_i &amp;= &amp;\\sum_{j=1}^M w_j \\phi_j(\\mathbf{x}_i) , \\\\ y_i &amp;= &amp;\\mu_i + \\epsilon, \\label{linear-model} \\end{array}\\)<\/span><\/div>\n<p>where\u00a0\\(\\mathbf{x}_i\\)\u00a0is the original feature vector, \\(\\phi_j\\), \\(j=1, \\ldots, M\\)\u00a0a set of (potentially non-linear) functions,\u00a0\\(w_j\\), \\(j=1, \\ldots, M\\)\u00a0some scalar weights and\u00a0\\(\\epsilon\\)\u00a0some random noise that is normally distributed with mean\u00a0\\(0\\)\u00a0and variance\u00a0\\(\\sigma^2\\)\u00a0(or \\(\\epsilon\\sim\\mathcal{N}(0, \\sigma^2)\\) for short). If you wonder about the \\(\\phi_j\\), that is where all your feature engineering skills and domain knowledge go in order to transform the raw features into more suitable\u00a0ones.<\/p>\n<p>One of the reasons for this common misconception might be that the literature often states that the dependent variable\u00a0\\(y\\) conditioned\u00a0on the predictor\u00a0\\(\\mathbf{x}\\)\u00a0is normally distributed in a linear model. So for a fixed\u00a0\\(\\mathbf{x}\\),\u00a0we have according to (<a href=\"#1\">1<\/a>)\u00a0also a fixed\u00a0\\(\\mu\\)\u00a0and thus\u00a0\\(y\\)\u00a0can be imagined as a realization of a random variable \\(Y\\sim\\mathcal{N}(\\mu, \\sigma^2)\\).<\/p>\n<p>To make it even a tad more illustrative, imagine you want to predict the average alcohol level (in some strange log scale) of a person celebrating Carnival only using a single binary feature, e.g. did the person have a one-night stand over Carnival or not. 
Under these assumptions we simply generate some data using the linear model from above and plot\u00a0it:<\/p>\n<pre class=\"lang:python decode:true \">import numpy as np\r\nimport seaborn as sns\r\nimport matplotlib.pylab as plt\r\nN = 10000  # number of people\r\nx = np.random.randint(2, size=N)\r\ny = x + 0.28*np.random.randn(N)\r\nsns.distplot(y)\r\nplt.xlim(-2, 10)<\/pre>\n<p>Obviously, this results in a bimodal distribution also known as the notorious\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Cologne_Cathedral\">Cologne Cathedral distribution<\/a>\u00a0as some may call it. Thus, although using a linear model, we generated a non-normally distributed target variable with residuals that are normally\u00a0distributed.<\/p>\n<figure id=\"attachment_38831\" aria-describedby=\"caption-attachment-38831\" style=\"width: 1194px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-38831 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution.png\" alt=\"Depiction of a bimodal distribution resembling the cathedral of Cologne.\" width=\"1194\" height=\"666\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution.png 1194w, https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution-300x167.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution-1024x571.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution-768x428.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution-400x223.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/dom_distribution-360x201.png 360w\" sizes=\"auto, (max-width: 1194px) 100vw, 1194px\" \/><figcaption id=\"caption-attachment-38831\" class=\"wp-caption-text\">Bimodal distribution generated with a linear model, which obviously resembles the cathedral of Cologne.<\/figcaption><\/figure>\n<p>Based on common mnemonic techniques, and assuming this example was surprising, physical, sexual and humorous 
enough for you, you will never forget that the residuals of a linear model are normally distributed and\u00a0not\u00a0the target variable in general. Only in the case that you used a linear model having only an intercept, i.e. \\(M=1\\) and \\(\\phi_1(\\mathbf{x})\\equiv 1\\), the target distribution equals the residual distribution (up to some shift) on all data sets. But seriously, who does that in real\u00a0life?<\/p>\n<figure>\n<div class=\"mceTemp\"><\/div>\n<\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Analysis-of-the-residual-distribution\"><\/span>Analysis of the residual\u00a0distribution<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now that we have learnt about the distribution of the residual, we want to analyse it further, especially with respect to the error measure that we are trying to minimize as well as the transformation we apply to the target variable beforehand. Let\u2019s take a look at the definition of the\u00a0MSE\u00a0again,\u00a0i.e.<\/p>\n<p id=\"2\" style=\"text-align: center;\"><span class=\"h5\">\\(\\begin{array}{c} \\frac{1}{n}\\sum_{i=1}^n (y_i - \\hat y_i)^2, \\end{array}\\)<\/span><\/p>\n<p>where \\(\\hat y_i = \\hat y(\\mathbf{x}_i)\\) is our prediction given the feature vector \\(\\mathbf{x}_i\\) and \\(y_i\\) is the observed outcome for the sample \\(i\\). In reality, we might only have a single or maybe a few samples sharing exactly the same feature vector \\(\\mathbf{x}_i\\) and thus also the same model prediction \\(\\hat y_i\\). In order to do some actual analysis, we now assume that we have an infinite number of observed outcomes for a given feature vector. Now assume we keep \\(\\mathbf{x}_i\\) fixed and want to compute (<a href=\"#2\">2<\/a>) having all those observed outcomes. Let&#8217;s drop the index \\(i\\) from \\(\\hat y_i\\) as it depends only on our fixed \\(\\mathbf{x}_i\\). 
Also we can imagine all these outcomes \\(y\\) to be realizations of some random variable \\(Y\\) conditioned on \\(\\mathbf{x}\\).\u00a0To now handle an infinite number of possible realizations, we need to introduce the probability \\(f(y)\\)\u00a0of some realization \\(y\\), or more precisely the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Probability_density_function\">probability density function<\/a>\u00a0(pdf) since \\(Y\\)\u00a0is a\u00a0continuous\u00a0random variable. Consequently, as the summation becomes an integration, the discrete\u00a0MSE\u00a0in (<a href=\"#2\">2<\/a>) becomes<\/p>\n<p style=\"text-align: center;\"><span id=\"3\" class=\"h5\">\\(\\int_{-\\infty}^\\infty (y - \\hat y)^2f(y)\\, \\mathrm{d}y,\\)<\/span><\/p>\n<p>as you might have expected. Now, this is awesome, as it allows us to apply some good, old-school calculus. By the way, when I am talking about the\u00a0residual distribution\u00a0I am actually referring to the distribution of \\(y - \\hat y\\)\u00a0with \\(y\\sim f(y)\\). Thus the residual distribution is determined by \\(f(y)\\)\u00a0except for a shift of \\(\\hat y\\). So what kind of assumptions can we make about it? In case of a linear model as in (<a href=\"#1\">1<\/a>), we assume \\(f(y)\\)\u00a0to be the pdf of a normal distribution but it could also be anything else. In our car pricing use case, we know that \\(y\\)\u00a0will be non-negative as no one is going to give you money for taking a working car. Let me know if you have a counter-example. \ud83d\ude09 This rules out the normal distribution and demands a right-skewed distribution, thus the pdf of the log-normal distribution might be an obvious assumption for \\(f(y)\\)\u00a0but we will come back to that\u00a0later.<\/p>\n<p>For now, we are going to consider (<a href=\"#3\">3<\/a>) again and note that our model, whatever it is, will somehow try to minimize (<a href=\"#2\">2<\/a>) by choosing a proper \\(\\hat y\\). 
So let&#8217;s do that analytically: differentiating (<a href=\"#3\">3<\/a>) with respect to \\(\\hat y\\) and setting the derivative to \\(0\\), we have that<\/p>\n<p style=\"text-align: center;\"><span class=\"h5\">\\(\\frac{d}{d\\hat y}\\int_{-\\infty}^\\infty (y - \\hat y)^2f(y)\\, \\mathrm{d}y = -2\\int_{-\\infty}^\\infty yf(y)\\, \\mathrm{d}y + 2\\hat y \\stackrel{!}{=} 0,\\)<\/span><\/p>\n<p>and subsequently<\/p>\n<p style=\"text-align: center;\"><span id=\"4\" class=\"h5\">\\(\\hat y = \\int_{-\\infty}^\\infty yf(y)\\, \\mathrm{d}y.\\)<\/span><\/p>\n<p>Looks familiar? Yes, this is just the definition of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Expected_value#Absolutely_continuous_case\">expected value in the continuous case!<\/a>\u00a0So whenever we are using the RMSE or MSE as an error measure, we are actually estimating the expected value of \\(y\\) at some fixed \\(\\mathbf{x}\\). So what happens if we do the same for the MAE? In this case, we have<\/p>\n<p style=\"text-align: center;\"><span class=\"h5\">\\(\\int_{-\\infty}^\\infty \\vert y-\\hat y\\vert f(y)\\, \\mathrm{d}y=\\int_{\\hat y}^\\infty(y-\\hat y) f(y)\\, \\mathrm{d}y-\\int_{-\\infty}^{\\hat y} (y-\\hat y)f(y)\\, \\mathrm{d}y,\\)<\/span><\/p>\n<p>and differentiating with respect to \\(\\hat y\\) again, we have<\/p>\n<p style=\"text-align: center;\"><span class=\"h5\">\\(\\int_{-\\infty}^{\\hat y} f(y)\\, \\mathrm{d}y - \\int_{\\hat y}^\\infty f(y)\\,\\mathrm{d}y \\stackrel{!}{=} 0.\\)<\/span><\/p>\n<p>Both integrals must therefore be equal and, since they sum to \\(1\\), we have\u00a0\\(P(Y\\leq \\hat y) = \\frac{1}{2}\\). So \\(\\hat y\\) is, lo and behold, the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Median#Probability_distributions\">median<\/a> of the distribution with pdf \\(f(y)\\)!<\/p>\n<div class=\"entry-content\">\n<p>A small recap at this point: We just learnt that minimizing the\u00a0MSE\u00a0or\u00a0RMSE\u00a0(also\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Sequence_space#%E2%84%93p_spaces\">l2-norm<\/a>\u00a0as a fancier name) leads to the estimation of the 
expected value while minimizing\u00a0MAE\u00a0(also known as l1-norm) gets us the median of some distribution. Also remember that our feature vector \\(\\mathbf{x}\\) is still fixed, so \\(y\\sim f(y)\\)\u00a0just describes the random fluctuations around some true value \\(y^\\star\\), which we just do not know, and \\(\\hat y\\) is our best guess for it. If we assume the normal distribution, there is no reason to abandon all the nice mathematical properties of the l2-norm since the result will be theoretically the same as minimizing the l1-norm. It may make a huge difference though if we are dealing with a non-symmetrical distribution like the log-normal\u00a0distribution.<\/p>\n<p>Let\u2019s just demonstrate this using our used cars example. We have already seen that the distribution of price is log-normal rather than normal. If we now use the simplest model we can think of, having only a single variable (yeah, here comes the linear model with just an intercept again), the target distribution directly determines the residual distribution. Now, we find the minimum point using\u00a0RMSE\u00a0and\u00a0MAE\u00a0to compare the results to the mean and median of the price vector <span style=\"font-size: 12pt;\"><code>y<\/code>,<\/span>\u00a0respectively.<\/p>\n<pre class=\"lang:python decode:true\">&gt;&gt;&gt; def rmse(y_pred, y_true):\r\n&gt;&gt;&gt;     # not taking the root, i.e. 
MSE, would not change the actual result\r\n&gt;&gt;&gt;     return np.sqrt(np.mean((y_true - y_pred)**2))\r\n&gt;&gt;&gt; def mae(y_pred, y_true):\r\n&gt;&gt;&gt;     return np.mean(np.abs(y_true - y_pred))\r\n&gt;&gt;&gt; y = df.price.to_numpy()\r\n&gt;&gt;&gt; sp.optimize.minimize(rmse, 1., args=(y,))\r\n      fun: 7174.003600843465\r\n hess_inv: array([[7052.74958795]])\r\n      jac: array([0.])\r\n  message: 'Optimization terminated successfully.'\r\n     nfev: 36\r\n      nit: 5\r\n     njev: 12\r\n   status: 0\r\n  success: True\r\n        x: array([6703.59325181])\r\n&gt;&gt;&gt; np.mean(y)\r\n6704.024314214464\r\n&gt;&gt;&gt; sp.optimize.minimize(mae, 1., options=dict(gtol=2e-4), args=(y,))\r\n      fun: 4743.492333474732\r\n hess_inv: array([[7862.69627309]])\r\n      jac: array([-0.00018311])\r\n  message: 'Optimization terminated successfully.'\r\n     nfev: 120\r\n      nit: 8\r\n     njev: 40\r\n   status: 0\r\n  success: True\r\n        x: array([4099.9946168])\r\n&gt;&gt;&gt; np.median(y)\r\n4100.0<\/pre>\n<p>As expected, by looking at the\u00a0<span style=\"font-size: 12pt;\"><code>x<\/code><\/span> in the output of\u00a0<span style=\"font-size: 12pt;\"><code>minimize<\/code><\/span>, we approximated the mean by minimizing the\u00a0RMSE\u00a0and the median by minimizing the\u00a0MAE.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Shrinking-the-target-variable\"><\/span>Shrinking the target\u00a0variable<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>There is still the elephant in the room that we haven\u2019t talked about yet. 
What happens now if we shrink our target variable by applying a log transformation and then minimize the\u00a0MSE?<\/p>\n<pre class=\"lang:python decode:true \">&gt;&gt;&gt; y_log = np.log(df.price.to_numpy())\r\n&gt;&gt;&gt; sp.optimize.minimize(rmse, 1., args=(y_log,), tol=1e-16)\r\n      fun: 1.0632889349620418\r\n hess_inv: array([[1.06895454]])\r\n      jac: array([0.])\r\n  message: 'Optimization terminated successfully.'\r\n     nfev: 36\r\n      nit: 6\r\n     njev: 12\r\n   status: 0\r\n  success: True\r\n        x: array([8.31228458])<\/pre>\n<p>So if we now transform the result\u00a0<span style=\"font-size: 12pt;\"><code>x<\/code><\/span>, which is roughly\u00a0<span style=\"font-size: 12pt;\"><code>8.31<\/code><\/span>, back using\u00a0<span style=\"font-size: 12pt;\"><code>np.exp(8.31)<\/code><\/span>, we get a rounded result of\u00a0<span style=\"font-size: 12pt;\"><code>4064<\/code><\/span>. <em>Wait a second! What just happened!?<\/em> We would have expected the final result to be around\u00a0<span style=\"font-size: 12pt;\"><code>6704<\/code><\/span> because that is the mean value we had before. Somehow, transforming the target variable, minimizing the same error measure as before and applying the inverse transformation changed the result. Now our result of\u00a0<span style=\"font-size: 12pt;\"><code>4064<\/code><\/span> looks rather like an approximation of the median &#8230; well &#8230; it actually is, assuming a log-normal distribution, as we will fully understand soon. If we had applied some full-blown machine learning model, the difference would have been much smaller since the variance of the residual distribution would have been much smaller. Still, we would have missed our actual goal of minimizing the (R)MSE on the raw target. Instead we would have unknowingly minimized the MAE, which might actually be better suited for our use-case at hand. 
Nevertheless, being a data <em>scientist<\/em>, we should know what we are doing, and a lucky punch without a clue of what happened just does not suit a scientist.<\/p>\n<p>Before, we showed that the distribution of prices, and thus our target, resembles a log-normal distribution. So let&#8217;s assume now that we have a log-normal distribution, and thus we have<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\log(\\mathrm{price})\\sim\\mathcal{N}(\\mu,\\sigma^2)\\).<\/span><\/div>\n<p>Consequently, the pdf of the price is<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\tilde f(x)=\\frac {1}{x}\\cdot{\\frac {1}{{\\sqrt {2\\pi\\sigma^2 \\,}}}}\\exp\\left(-{\\frac{(\\ln(x) -\\mu)^{2}}{2\\sigma^{2}}}\\right),\\)<\/span><\/div>\n<p>where the only difference to the pdf of the normal distribution is \\(\\ln(x)\\) instead of \\(x\\) and the additional factor \\(\\frac{1}{x}\\). Also note that the parameters \\(\\mu\\) and \\(\\sigma\\) are the well-known parameters of the normal distribution but for the log-transformed target. So when we now minimize the RMSE of the log-transformed prices as we did before, we actually infer the parameter \\(\\mu\\) of the normal distribution, which is the expected value and also the <em>median<\/em>, i.e. \\(\\operatorname {P} (\\log(\\mathrm{price})\\leq \\mu)= 0.5\\). Applying any kind of strictly monotonically increasing transformation \\(\\varphi\\) to the log-transformed price, we see that \\(\\operatorname {P} (\\varphi(\\log(\\mathrm{price}))\\leq \\varphi(\\mu)) = 0.5\\) and thus the median as well as any other quantile is equivariant under the transformation \\(\\varphi\\). In our specific case from above, we have \\(\\varphi(x) = \\exp(x)\\) and thus the result, that we are approximating the median instead of the mean, is not surprising at all from a mathematical point of view.<\/p>\n<p>The expected value is not so well-behaved under transformations as the median. 
Using the definition of the expected value (<a href=\"#4\">4<\/a>), we can easily show that only transformations \\(\\phi\\) of the form \\(\\phi(x)=ax + b\\), with scalars \\(a\\) and \\(b\\), allow us to transform the target, determine the expected value and apply the inverse transformation to get the expected value of the original distribution. In math-speak, a transformation of the form \\(\\phi(x)=ax + b\\) is also called an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Affine_transformation\">affine transformation<\/a>. For the transformed random variable \\(\\phi(X)\\) we have for the expected value that<\/p>\n<div id=\"5\" style=\"text-align: center;\"><span class=\"h5\">\\(\\begin{array}{lcl} E[\\phi(X)] &amp;= &amp;E[aX + b] \\\\ &amp;= &amp;\\int (ax + b)f(x)\\, \\mathrm{d}x \\\\ &amp;= &amp;a\\int xf(x)\\, \\mathrm{d}x + b \\\\ &amp;= &amp;aE[X] + b = \\phi(E[X]), \\end{array}\\)<\/span><\/div>\n<\/div>\n<p>where we used the fact that probability density functions are normalized, i.e. \\(\\int f(x)\\mathrm{d}x=1\\). What a relief! That means at least affine transformations are fine when we minimize the (R)MSE. This is especially important if you belong to the illustrious circle of deep learning specialists. In some cases, the target variable of a regression problem is standardized or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Feature_scaling#Rescaling_(min-max_normalization)\">min-max scaled<\/a> during training and transformed back afterwards. Since these normalization techniques are affine transformations, we are on the safe side though.<\/p>\n<p>Let&#8217;s come back to our example where we know that the distribution is quite log-normal. Can we somehow still recover the mean of the untransformed target variable? Yes we can, actually. Using the parameter \\(\\mu\\) that we already determined above, we just estimate the variance \\(\\sigma^2\\) of the log-transformed target and obtain \\(\\exp(\\mu + \\frac{\\sigma^2}{2})\\) for the mean of the untransformed distribution. 
More details on how to do this can be found in the <a href=\"https:\/\/github.com\/FlorianWilhelm\/used-cars-log-trans\/blob\/master\/notebooks\/used-cars.ipynb\">notebook<\/a> of the <a href=\"https:\/\/github.com\/FlorianWilhelm\/used-cars-log-trans\/\">used-cars-log-trans repository<\/a>. Way more interesting, at least for the mathematically interested reader, is the question <em>Why does this work?<\/em><\/p>\n<div>\n<p>This is easy to see using some calculus. Let \\(\\tilde y = \\log(y)\\), let \\(f(y)\\) be the pdf of the normal distribution and \\(\\tilde f(y)\\) the pdf of the log-normal distribution (<a href=\"#5\">5<\/a>). Using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Integration_by_substitution\">integration by substitution<\/a> and noting that \\(\\mathrm{d}y = e^{\\tilde y}\\mathrm{d}\\tilde y\\), we have<\/p>\n<div id=\"6\" style=\"text-align: center;\"><span class=\"h5\">\\(\\int y \\tilde f(y)\\, \\mathrm{d}y = \\int e^{\\tilde y} \\tilde f(e^{\\tilde y})e^{\\tilde y}\\, \\mathrm{d}\\tilde y = \\int e^{\\tilde y} f(\\tilde y)\\, \\mathrm{d}\\tilde y,\\)<\/span><\/div>\n<p>where in the last equation the additional factor \\(\\frac{1}{y} = e^{-\\tilde y}\\) of the log-normal pdf cancels with \\(e^{\\tilde y}\\), so that what remains is the pdf of the normal distribution. Writing out the exponent of \\(f(\\tilde y)\\), which is \\(-\\frac{(\\tilde y-\\mu)^2}{2\\sigma^2}\\), and completing the square together with \\(\\tilde y\\), we have<\/p>\n<div id=\"7\" style=\"text-align: center;\"><span class=\"h5\">\\(\\begin{array}{lcl} \\tilde y - \\frac{(\\tilde y-\\mu)^2}{2\\sigma^2} &amp;= &amp;-\\frac{\\mu^2 - 2\\mu\\tilde y + \\tilde{y}^2 - 2\\sigma^2\\tilde y}{2\\sigma^2} \\\\ &amp;= &amp;-\\frac{(\\tilde y - (\\mu + \\sigma^2))^2}{2\\sigma^2} + \\mu + \\frac{\\sigma^2}{2}. 
\\end{array}\\)<\/span><\/div>\n<p>Using this result, we can rewrite the last expression of (<a href=\"#6\">6<\/a>) by shifting the parameter \\(\\mu\\) of the normal distribution by \\(\\sigma^2\\). Denoting the shifted pdf by \\(f_s(y)\\), we have<\/p>\n<\/div>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\int e^{\\tilde y} f(\\tilde y)\\, \\mathrm{d}\\tilde y = e^{\\mu + \\frac{1}{2}\\sigma^2}\\int f_s(\\tilde y)\\, \\mathrm{d}\\tilde y = e^{\\mu + \\frac{\\sigma^2}{2}}, \\)<\/span><\/div>\n<p>and thus we have proved that the expected value of the log-normal distribution indeed is \\(\\exp(\\mu + \\frac{\\sigma^2}{2})\\).<\/p>\n<p>Here is a little recap of this section&#8217;s most important points. When minimizing l2, i.e. (R)MSE, only affine transformations allow us to determine the expected value of the original target by applying the inverse transformation to the expected value of the transformed target variable. When minimizing l1, i.e. MAE, all strictly monotonically increasing transformations can be applied to determine the median from the transformed target variable followed by the inverse transformation.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Transforming-the-target-for-fun-and-profit\"><\/span>Transforming the target for fun and\u00a0profit<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>So we have seen, by doing some rather academic analysis, that not everything is as it seems or as we might have expected. But can we somehow use this knowledge in our use case of predicting the market value of a used car? Yeah, this is the point where we come full circle to the beginning of the story. We have already argued that the RMSE might not be the right error measure to minimize. Log-transforming the target and still minimizing the RMSE gave us an approximation to the result we would have gotten if we had minimized the MAE, which quite likely is a more appropriate error measure in this use case than the RMSE. 
This is a neat trick if our regression method only allows minimizing the MSE or if it is too slow or unstable when minimizing the MAE directly. A word of caution again: this only works if the residual distribution approximates a log-normal distribution. So far we have only seen that the target distribution, not the residual distribution, is quite log-normal. But since we are dealing with positive numbers, and since a car seller might be more inclined to start with a higher price, it is reasonable to assume that the residual distribution (in case of a multivariate model) will also approximate a log-normal distribution.<\/p>\n<p>Well, the MAE surely is quite nice, but how about minimizing some relative measure like the MAPE? Assuming that our regression method does not support minimizing it directly, does the log-transformation do any good here? Intuitively, since we know that multiplicative, and thus relative, relationships become additive in log-space, we might expect it to be advantageous, and indeed it is. But before we do some experiments, let&#8217;s first look at some other relative error measure, namely the Root Mean Square Percentage Error (RMSPE), i.e.<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\sqrt{\\frac{1}{n}\\sum_{i=1}^n\\left(\\frac{y_i-\\hat y_i}{y_i}\\right)^2}.\\)<\/span><\/div>\n<p>This error measure was used for evaluation in the\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/rossmann-store-sales\/\">Rossmann Store Sales<\/a>\u00a0Kaggle challenge. Since the\u00a0RMSPE\u00a0is a rather unusual and uncommon error measure, most participants log-transformed the target and minimized the\u00a0RMSE\u00a0without giving it too much thought, just following their intuition. 
Some participants in the challenge dug deeper, like Tim Hochberg who proved in a\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/rossmann-store-sales\/discussion\/17026\">forum\u2019s post<\/a>\u00a0that minimizing the\u00a0RMSE\u00a0of the log-transformed target is a first-order approximation of the\u00a0RMSPE\u00a0using\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Taylor_series\">Taylor series<\/a>\u00a0expansion. Although his result is correct, it only tells us that we are asymptotically doing the right thing, i.e. only if we had some really glorious model that perfectly predicts the target, which of course is never true. So in practice, the residual distribution might be quite narrow but if it was the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Dirac_delta_function\">Dirac delta function<\/a>\u00a0we would have found some deterministic relationship between our feature and the target variable, or more likely made a mistake by evaluating some over-fitted model on the train set. A nice example of being asymptotically right but practically wrong, by the way. In a\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/rossmann-store-sales\/discussion\/17601\">reply post<\/a>\u00a0to Tim\u2019s original post, a guy who just calls himself\u00a0ML\u00a0pointed out the overly optimistic assumption and proved that a correction of \\(-\\frac{3}{2}\\sigma^2\\) is necessary during back-transformation assuming a log-normal residual distribution. Since his post is quite scrambled, and also just for the fun of it, we will also prove this after some more practical applications using the notation we already established. And while we are at it, we will also show that the necessary correction in case of MAPE is \\(-\\sigma^2\\). 
But for now, we will just take for granted the following<\/p>\n<div>\n<table width=\"599\">\n<tbody>\n<tr>\n<td><\/td>\n<td>(R)MSE<\/td>\n<td>MAE<\/td>\n<td>MAPE<\/td>\n<td>RMSPE<\/td>\n<\/tr>\n<tr>\n<td>correction terms, i.e.<\/td>\n<td>\\(+\\frac{1}{2}\\sigma^2\\)<\/td>\n<td>\n<div>\n<div>\n<div>\\(0\\)<\/div>\n<\/div>\n<\/div>\n<\/td>\n<td>\n<div>\n<div>\\(-\\sigma^2\\)<\/div>\n<\/div>\n<\/td>\n<td>\n<div>\n<div>\\(-\\frac{3}{2}\\sigma^2\\)<\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>which need to be added to the minimum point obtained by the\u00a0RMSE\u00a0minimization of the log-transformed target before transforming it back. Needless to say, the correction for\u00a0RMSPE\u00a0was one of the decisive factors to win the Kaggle challenge and thus to make some profit. The winner Gert Jacobusse mentions this in the attached\u00a0PDF\u00a0of his\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/rossmann-store-sales\/discussion\/18024\">model documentation post<\/a>.<\/p>\n<p>What if we do not have a log-normal residual distribution or only a really rough approximation? Can we do better than applying those theoretical correction terms? Sure, we can! In the end, since we are transforming back using \\(\\exp\\), it is only a correction factor close to \\(1\\) that we are applying. So in case of RMSPE and for our approximation \\(\\hat\\mu\\) of the log-normal distribution, we have a factor of \\(c=\\exp(-\\frac{3}{2}\\sigma^2)\\) for the back-transformed target \\(\\exp(\\hat\\mu)\\). We can just treat this as another one-dimensional optimization problem and determine the best correction factor numerically. 
Speaking of numerical computation, we are not going to determine a factor \\(c\\) but equivalently a correction term \\(\\tilde c\\), so that \\(\\exp(\\hat \\mu + \\tilde c)=\\hat y\\), which is numerically much more stable.<\/p>\n<p>At my former employer\u00a0<a href=\"https:\/\/blueyonder.com\/\">Blue Yonder<\/a>, we used to call this the\u00a0Gronbach factor\u00a0after our colleague Moritz Gronbach, who would successfully apply this fitted correction to all kinds of regression problems with non-negative values. The implementation is actually quite easy given the true value, our predicted value in log-space and some error\u00a0measure:<\/p>\n<pre class=\"lang:python decode:true \">import numpy as np\r\nfrom scipy import optimize\r\n\r\n\r\ndef get_corr(y_true, y_pred_log, error_func, **kwargs):\r\n    \"\"\"Determine correction delta for exp transformation\"\"\"\r\n    def cost_func(delta):\r\n        # error_func is called with the back-transformed prediction first, the true value second\r\n        return error_func(np.exp(delta + y_pred_log), y_true)\r\n    # one-dimensional minimization over delta, starting at 0 (i.e. no correction)\r\n    res = optimize.minimize(cost_func, 0., **kwargs)\r\n    if res.success:\r\n        return res.x\r\n    else:\r\n        raise RuntimeError(f\"Finding correction term failed!\\n{res}\")<\/pre>\n<p>Let&#8217;s now get our hands dirty and evaluate how RMSE, MAE, MAPE and RMSPE behave in our use case with the raw as well as the log-transformed target using no correction, the theoretical correction and the fitted correction. To do so, we are going to do some feature engineering and apply some ML method. 
Regarding the former, we just do some extremely basic things like calculating the age of a car and average mileage per year, i.e.<\/p>\n<pre class=\"lang:python decode:true \">from datetime import datetime\r\n\r\n# replace the invalid month 0 by 7 (July) so that a date can be constructed\r\ndf['monthOfRegistration'] = df['monthOfRegistration'].replace(0, 7)\r\ndf['dateOfRegistration'] = df.apply(\r\n    lambda ds: datetime(ds['yearOfRegistration'], ds['monthOfRegistration'], 1), axis=1)\r\ndf['ageInYears'] = df.apply(\r\n    lambda ds: (ds['dateCreated'] - ds['dateOfRegistration']).days \/ 365, axis=1)\r\ndf['mileageOverAge'] = df['kilometer'] \/ df['ageInYears']<\/pre>\n<div>\n<div>In total, combined with the original features, we take as our feature set including the target:<\/div>\n<\/div>\n<pre class=\"lang:python decode:true \">FEATURES = [\"vehicleType\",\r\n            \"ageInYears\",\r\n            \"mileageOverAge\",\r\n            \"gearbox\",\r\n            \"powerPS\",\r\n            \"model\",\r\n            \"kilometer\",\r\n            \"fuelType\",\r\n            \"brand\",\r\n            \"price\"]\r\n# copy so that later column assignments do not act on a view of the original frame\r\ndf = df[FEATURES].copy()<\/pre>\n<p>and transform all categorical features to integers, i.e.<\/p>\n<pre class=\"lang:python decode:true \">for col, dtype in zip(df.columns, df.dtypes):\r\n    if dtype is np.dtype('O'):\r\n        df[col] = df[col].astype('category').cat.codes<\/pre>\n<p>to get our final feature matrix\u00a0<code>X<\/code> and target vector\u00a0<code>y<\/code> with<\/p>\n<pre class=\"lang:python decode:true\">y = df['price'].to_numpy()\r\nX = df.drop(columns='price').to_numpy()<\/pre>\n<p>As the\u00a0ML\u00a0method, let\u2019s just choose a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Random_forest\">Random Forest<\/a>\u00a0as for me it is like the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Volkswagen_Passat\">Volkswagen Passat Variant <\/a>among all\u00a0ML\u00a0algorithms. Although you will not win any competition with it, in most use cases it will do a pretty decent job without much hassle. 
In a real-world scenario, one would rather select and fine-tune some\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Gradient_boosting\">Gradient Boosted Decision Tree<\/a>\u00a0like\u00a0<a href=\"https:\/\/xgboost.readthedocs.io\/\">XGBoost<\/a>,\u00a0<a href=\"https:\/\/lightgbm.readthedocs.io\/\">LightGBM <\/a>or maybe even better\u00a0<a href=\"https:\/\/catboost.ai\/\">CatBoost<\/a>\u00a0since categories (e.g.\u00a0<span style=\"font-size: 12pt;\"><code>vehicleType<\/code> <\/span>and\u00a0<span style=\"font-size: 12pt;\"><code>model<\/code><\/span>) surely play an important part in this use case. We will use the default\u00a0MSE\u00a0criterion of\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html\">Scikit-Learn\u2019s RandomForestRegressor <\/a>implementation for all\u00a0experiments.<\/p>\n<p>To evaluate this model, we are going to use a 10-fold cross-validation and split off a validation set from the training set in each split to calculate \\(\\sigma^2\\) and fit our correction term. The cross-validation will give us some indication about the variance in our results. In each of these 10 splits, we then fit the model and predict using the<\/p>\n<ol>\n<li>raw, i.e. untransformed,\u00a0target,<\/li>\n<li>log-transformed target with no\u00a0correction,<\/li>\n<li>log-transformed target with the corresponding sigma2\u00a0correction,<\/li>\n<li>log-transformed target with the fitted\u00a0correction,<\/li>\n<\/ol>\n<p>and evaluate the results with\u00a0RMSE,\u00a0MAE,\u00a0MAPE\u00a0and\u00a0RMSPE. 
To spare you the trivial implementation, which is to be found in the\u00a0<a href=\"https:\/\/github.com\/FlorianWilhelm\/used-cars-log-trans\/blob\/master\/notebooks\/used-cars.ipynb\">notebook<\/a>, we jump directly to the results of the first of 10\u00a0splits:<\/p>\n<table width=\"448\">\n<thead>\n<tr>\n<th align=\"right\">split<\/th>\n<td align=\"left\">target<\/td>\n<td align=\"right\">RMSE<\/td>\n<td align=\"right\">MAE<\/td>\n<td align=\"right\">MAPE<\/td>\n<td align=\"right\">RMSPE<\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td align=\"right\">0<\/td>\n<td align=\"left\">raw<\/td>\n<td align=\"right\">2368.36<\/td>\n<td align=\"right\">1249.34<\/td>\n<td align=\"right\">0.342704<\/td>\n<td align=\"right\">1.65172<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td align=\"left\">log\u00a0&amp;\u00a0no corr<\/td>\n<td align=\"right\">2464.50<\/td>\n<td align=\"right\">1253.19<\/td>\n<td align=\"right\">0.307301<\/td>\n<td align=\"right\">1.56172<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td align=\"left\">log\u00a0&amp;\u00a0sigma2 corr<\/td>\n<td align=\"right\">2475.48<\/td>\n<td align=\"right\">1253.19<\/td>\n<td align=\"right\">0.305424<\/td>\n<td align=\"right\">1.27903<\/td>\n<\/tr>\n<tr>\n<td align=\"right\">0<\/td>\n<td align=\"left\">log\u00a0&amp;\u00a0fitted corr<\/td>\n<td align=\"right\">2449.23<\/td>\n<td align=\"right\">1251.35<\/td>\n<td align=\"right\">0.299577<\/td>\n<td align=\"right\">0.85879<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>For each split, we take now the errors on the raw target, i.e. the first row, as baseline and calculate the percentage change along each column for the other rows. 
Then, we calculate for each cell the mean and standard deviation over all 10 splits, resulting\u00a0in:<\/p>\n<table>\n<tbody>\n<tr>\n<td><\/td>\n<td colspan=\"2\">RMSE<\/td>\n<td colspan=\"2\">MAE<\/td>\n<td colspan=\"2\">MAPE<\/td>\n<td colspan=\"2\">RMSPE<\/td>\n<\/tr>\n<tr>\n<td>target<\/td>\n<td>mean<\/td>\n<td>std<\/td>\n<td>mean<\/td>\n<td>std<\/td>\n<td>mean<\/td>\n<td>std<\/td>\n<td>mean<\/td>\n<td>std<\/td>\n<\/tr>\n<tr>\n<td>log &amp; no corr<\/td>\n<td>+3.42%<\/td>\n<td>\u00b11.07%p<\/td>\n<td>-0.09%<\/td>\n<td>\u00b10.61%p<\/td>\n<td>-10.99%<\/td>\n<td>\u00b10.65%p<\/td>\n<td>-12.08%<\/td>\n<td>\u00b14.14%p<\/td>\n<\/tr>\n<tr>\n<td>log &amp; sigma2 corr<\/td>\n<td>+4.14%<\/td>\n<td>\u00b10.84%p<\/td>\n<td>-0.09%<\/td>\n<td>\u00b10.61%p<\/td>\n<td>-11.03%<\/td>\n<td>\u00b10.74%p<\/td>\n<td>-28.24%<\/td>\n<td>\u00b13.35%p<\/td>\n<\/tr>\n<tr>\n<td>log &amp; fitted corr<\/td>\n<td>+2.75%<\/td>\n<td>\u00b11.00%p<\/td>\n<td>-0.19%<\/td>\n<td>\u00b10.58%p<\/td>\n<td>-13.23%<\/td>\n<td>\u00b10.68%p<\/td>\n<td>-47.27%<\/td>\n<td>\u00b15.37%p<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Let\u2019s interpret these evaluation results and note that negative percentages mean an improvement over the error on the untransformed target, the lower the better.<\/p>\n<p>The\u00a0RMSE\u00a0column shows us that if we really wanna get the best results for\u00a0RMSE, transforming the target variable leads to a worse result compared to a model trained on the original target. The theoretical sigma2 correction makes it even worse, which tells us that the residuals in log-space are not normally distributed. We can check that using for instance the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Kolmogorov%E2%80%93Smirnov_test\">Kolmogorov\u2013Smirnov test<\/a>. At least the fitted correction improves somewhat over an uncorrected back-transformation. For the\u00a0MAE, we see a slight improvement as expected and we know that theoretically there is no need for a correction, thus the sigma2 correction cell shows exactly the same result. Again, noting that the log-normal assumption is quite idealistic, we can understand that the fitted correction is better than the theoretical one.<\/p>\n<p>Coming now to the more appropriate measures for this use case, we see some nice percentage improvements for\u00a0MAPE. Applying the log-transformation here gets us a huge performance boost even without correction. The sigma2 correction makes it a tad better but is outperformed by the fitted correction. Last but not least,\u00a0RMSPE\u00a0brings us the most pleasing results. Transforming without correction is good, sigma2 makes it even better and the fitted correction is simply outstanding, at least percentage-wise compared to the baseline. In absolute numbers, judged in the respective error measure, we would still need to improve the model a lot to use it in some production use case but that was not the actual point of this\u00a0exercise.<\/p>\n<p>Having proven it mathematically and shown it in our example use case, we can finally conclude that transforming the target variable is a dangerous business. It can be the key to success and wealth in a Kaggle challenge but it can also lead to disaster. It is a bit like wielding a double-handed sword in a fight. Limbs will be cut off, we should just make sure they do not belong to us. The rest of this post is only for the inquisitive reader who wants to know exactly where the correction terms for\u00a0RMSPE\u00a0and\u00a0MAPE\u00a0come from. So let\u2019s wash it all down with some more\u00a0math.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Aftermath\"><\/span>Aftermath<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>So you are still reading? I totally appreciate it and bet you are one of those people who want to know for sure. Let&#8217;s get to what you are still here for, namely proving that \\(-\\frac{3}{2}\\sigma^2\\) is the right correction for RMSPE and \\(-\\sigma^2\\) for MAPE. Let&#8217;s start with the former.<\/p>\n<p>We use again our notation \\(\\tilde \\ast = \\log(\\ast)\\) for our variables and also to differentiate between the normal and log-normal distribution. To minimize the error, we look at the squared error in its continuous form, which has the same minimum point, and have<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\mathrm{RMSPE}(\\hat y) = \\int\\left(\\frac{y-\\hat y}{y}\\right)^2\\,\\tilde f(y)\\, \\mathrm{d}y = 1 -2\\hat y\\int\\frac{\\tilde f(y)}{y}\\, \\mathrm{d}y + {\\hat y}^2\\int\\frac{\\tilde f(y)}{y^2}\\, \\mathrm{d}y.\\)<\/span><\/div>\n<p>To find the minimum, we differentiate with respect to \\(\\hat y\\) and set the derivative to \\(0\\), resulting in<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\hat y=\\frac{\\int\\frac{\\tilde f(y)}{y}\\, \\mathrm{d}y}{\\int\\frac{\\tilde f(y)}{y^2}\\, \\mathrm{d}y}\\)<\/span><\/div>\n<p>Thus, we now need to calculate<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(q_\\alpha = \\int\\frac{\\tilde f(y)}{y^\\alpha}\\, \\mathrm{d}y \\)<\/span><\/div>\n<p>for \\(\\alpha =1,2\\). To that end, we substitute \\(y=\\exp(\\tilde y)\\) and, using \\(\\mathrm{d}y = e^{\\tilde y}\\, \\mathrm{d}\\tilde y\\), we have<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(q_\\alpha = \\int e^{-\\alpha\\tilde y}\\,\\tilde f(e^{\\tilde y})e^{\\tilde y}\\, \\mathrm{d}\\tilde y = \\int e^{-\\alpha\\tilde y}\\,f(\\tilde y)\\, \\mathrm{d}\\tilde y. 
\\)<\/span><\/div>\n<p>Writing out the exponent and completing the square similar to (<a href=\"#7\">7<\/a>), we obtain<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\log(q_\\alpha) = -\\alpha \\mu +\\frac12 \\alpha^2\\sigma^2, \\)<\/span><\/div>\n<p>leading in total to<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\log(\\hat y)=\\log(q_1)-\\log(q_2) = \\mu -\\frac{3}{2}\\sigma^2. \\)<\/span><\/div>\n<p>Consequently, the correction term for RMSPE is \\(-\\frac{3}{2}\\sigma^2\\). For MAPE we have<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\mathrm{MAPE}(\\hat y) = \\int_0^{\\infty}\\frac{\\vert y-\\hat y\\vert}{y}\\,\\tilde f(y)\\, \\mathrm{d}y = \\int_{\\hat y}^{\\infty}\\left(1 - \\frac{\\hat y}{y}\\right)\\tilde f(y)\\, \\mathrm{d}y -\\int_0^{\\hat y}\\left(1-\\frac{\\hat y}{y}\\right)\\tilde f(y)\\, \\mathrm{d}y,\\)<\/span><\/div>\n<p>and after differentiating with respect to \\(\\hat y\\) and setting the derivative to 0, we need to find \\(\\hat y\\) such that<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\int_{\\hat y}^{\\infty}\\frac{1}{y}\\,\\tilde f(y)\\, \\mathrm{d}y - \\int_0^{\\hat y}\\frac{1}{y}\\,\\tilde f(y)\\, \\mathrm{d}y = 0.\\)<\/span><\/div>\n<p>Doing the same substitution as with RMSPE results in<\/p>\n<div style=\"text-align: center;\"><span class=\"h5\">\\(\\int_{\\log(\\hat y)}^{\\infty}e^{-\\tilde y}\\,f(\\tilde y)\\, \\mathrm{d} \\tilde y - \\int_{-\\infty}^{\\log(\\hat y)}e^{-\\tilde y}\\,f(\\tilde y)\\, \\mathrm{d}\\tilde y = 0.\\)<\/span><\/div>\n<p>Again, we complete the square of the exponent similar to (<a href=\"#7\">7<\/a>), resulting in<\/p>\n<div style=\"text-align: center;\"><span id=\"8\" class=\"h5\">\\(e^{-\\mu + \\frac{1}{2}\\sigma^2}\\left(\\int_{\\log(\\hat y)}^{\\infty}f_s(\\tilde y)\\, \\mathrm{d} \\tilde y - \\int_{-\\infty}^{\\log(\\hat y)}f_s(\\tilde y)\\, \\mathrm{d}\\tilde y\\right) = 0,\\)<\/span><\/div>\n<p>where<\/p>\n<div style=\"text-align: center;\"><span 
class=\"h5\">\\(f_s(x) = {\\frac {1}{ {\\sqrt {2\\pi\\sigma^2 \\,}}}}\\exp \\left(-{\\frac {(x - (\\mu - \\sigma^2) )^{2}}{2\\sigma ^{2}}}\\right). \\)<\/span><\/div>\n<p>We need the two integrals in (<a href=\"#8\">8<\/a>) to be equal to fulfill the equation, thus \\(\\log(\\hat y)\\) needs to be the median of the shifted normal distribution \\(f_s(x)\\). This is the case for \\(\\log(\\hat y) = \\mu - \\sigma^2\\). Consequently, the correction term for MAPE is \\(-\\sigma^2\\).<\/p>\n<p><em>This article first appeared on <a href=\"https:\/\/florianwilhelm.info\/2020\/05\/honey_i_shrunk_the_target_variable\/\" target=\"_blank\" rel=\"noopener\">my personal blog<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong but also shows how a smart transformation can bring you honour &amp; fame in practical applications.<\/p>\n","protected":false},"author":52,"featured_media":30613,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[509,206],"service":[431],"coauthors":[{"id":52,"display_name":"Florian Wilhelm","user_nicename":"fwilhelm"}],"class_list":["post-20915","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-ai-2","tag-data-science","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Honey, I Shrunk the Target Variable - inovex GmbH<\/title>\n<meta name=\"description\" content=\"Transforming a target variable can be tricky. 
This blog post elaborates on the mathematical reasons why things can go wrong.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Honey, I Shrunk the Target Variable - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2021-06-29T11:17:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-10-11T10:34:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1366\" \/>\n\t<meta property=\"og:image:height\" content=\"768\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Florian Wilhelm\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Florian Wilhelm\" \/>\n\t<meta name=\"twitter:label2\" 
content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"35\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Florian Wilhelm\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/\"},\"author\":{\"name\":\"Florian Wilhelm\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\"},\"headline\":\"Honey, I Shrunk the Target Variable\",\"datePublished\":\"2021-06-29T11:17:00+00:00\",\"dateModified\":\"2022-10-11T10:34:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/\"},\"wordCount\":6218,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/honey-i-shrunk-the-target-variable-header.png\",\"keywords\":[\"Ai\",\"Data Science\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/\",\"name\":\"Honey, I Shrunk the Target Variable - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/honey-i-shrunk-the-target-variable-header.png\",\"datePublished\":\"2021-06-29T11:17:00+00:00\",\"dateModified\":\"2022-10-11T10:34:59+00:00\",\"description\":\"Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/honey-i-shrunk-the-target-variable-header.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/honey-i-shrunk-the-target-variable-header.png\",\"width\":1366,\"height\":768,\"caption\":\"A shrinking target variable under a magnifying glass\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/honey-i-shrunk-the-target-variable\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Honey, I Shrunk the Target Variable\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex 
GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\",\"name\":\"Florian 
Wilhelm\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"caption\":\"Florian Wilhelm\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/fwilhelm\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Honey, I Shrunk the Target Variable - inovex GmbH","description":"Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/","og_locale":"de_DE","og_type":"article","og_title":"Honey, I Shrunk the Target Variable - inovex GmbH","og_description":"Transforming a target variable can be tricky. 
This blog post elaborates on the mathematical reasons why things can go wrong.","og_url":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2021-06-29T11:17:00+00:00","article_modified_time":"2022-10-11T10:34:59+00:00","og_image":[{"width":1366,"height":768,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png","type":"image\/png"}],"author":"Florian Wilhelm","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Florian Wilhelm","Gesch\u00e4tzte Lesezeit":"35\u00a0Minuten","Written by":"Florian Wilhelm"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/"},"author":{"name":"Florian Wilhelm","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e"},"headline":"Honey, I Shrunk the Target Variable","datePublished":"2021-06-29T11:17:00+00:00","dateModified":"2022-10-11T10:34:59+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/"},"wordCount":6218,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png","keywords":["Ai","Data Science"],"articleSection":["Analytics","English 
Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/","url":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/","name":"Honey, I Shrunk the Target Variable - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png","datePublished":"2021-06-29T11:17:00+00:00","dateModified":"2022-10-11T10:34:59+00:00","description":"Transforming a target variable can be tricky. This blog post elaborates on the mathematical reasons why things can go wrong.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/honey-i-shrunk-the-target-variable-header.png","width":1366,"height":768,"caption":"A shrinking target variable under a magnifying 
glass"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/honey-i-shrunk-the-target-variable\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Honey, I Shrunk the Target Variable"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e","name":"Florian 
Wilhelm","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","caption":"Florian Wilhelm"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/fwilhelm\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20915","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/52"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=20915"}],"version-history":[{"count":6,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20915\/revisions"}],"predecessor-version":[{"id":38839,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20915\/revisions\/38839"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/30613"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=20915"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=20915"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=20915"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=20915"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}