Efficient UD(A)Fs with PySpark

Spark is one of the most prevalent technologies in data science and big data today. Luckily, even though it is developed in Scala and runs on the Java Virtual Machine (JVM), it comes with Python bindings, known as PySpark, whose API was heavily influenced by Pandas. In terms of functionality, modern PySpark has about the same capabilities as Pandas for typical ETL and data wrangling, e.g. groupby, aggregations and so on. As a general rule of thumb, one should consider an alternative to Pandas whenever the data set has more than 10,000,000 rows, which, depending on the number of columns and data types, translates to about 5-10 GB of memory usage. At that point PySpark might be an option that does the job, but of course there are others, for instance Dask, which won’t be addressed in this post.
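To make the rule of thumb a bit more tangible, here is a minimal back-of-the-envelope sketch. It assumes a dense table of 8-byte numeric values (e.g. float64 or int64) and ignores index and string overhead; the helper name `estimate_memory_gb` and the column counts are hypothetical and only chosen for illustration.

```python
def estimate_memory_gb(n_rows: int, n_columns: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a dense numeric table in gigabytes (10**9 bytes).

    Assumption: every value takes `bytes_per_value` bytes; index, string and
    object overhead are ignored, so real usage is typically higher.
    """
    return n_rows * n_columns * bytes_per_value / 1e9


rows = 10_000_000
# Hypothetical column counts that bracket the 5-10 GB range for 8-byte values.
for n_columns in (60, 125):
    size_gb = estimate_memory_gb(rows, n_columns)
    print(f"{rows:,} rows x {n_columns} columns: about {size_gb:.1f} GB")
```

Under these assumptions the 5-10 GB range corresponds to roughly 60 to 125 numeric columns; tables with many string columns can reach it with far fewer.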

Declarative Thinking and Programming

Before we actually dive into this topic, imagine the following: You just moved to a new place and the time is ripe for a little house-warming dinner with your best friends Alice and Bob. Since Alice is really tech-savvy, you just send her a digital invitation with date, time and of course your new address, which she can add to her calendar with a single click. With good old Bob it is a bit more difficult, since he really struggles with modern IT. That’s why you decide to send him an e-mail that includes not only the time and location but also a suggestion for which train to take to your city, details about the trams, the stops, the right street and so on.

Causal Inference and Propensity Score Methods

In the field of machine learning, and particularly in supervised learning, correlation is crucial for predicting the target variable with the help of the feature variables. Rarely do we think about causation and the actual effect of a single feature variable or covariate on the target or response. Some even go so far as to say that "correlation trumps causation", as in the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think" by Viktor Mayer-Schönberger and Kenneth Cukier. Following their reasoning, with Big Data there is no need to think about causation anymore, since nonparametric models will do just fine using correlation alone. For many practical use cases, this point of view may seem acceptable, but surely not for all.