Causal Inference and Propensity Score Methods

In machine learning, and particularly in supervised learning, correlation is what we rely on to predict the target variable from the feature variables. Rarely do we think about causation and the actual effect of a single feature variable, or covariate, on the target or response. Some even go so far as to say that "correlation trumps causation", as in the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think" by Viktor Mayer-Schönberger and Kenneth Cukier. Following their reasoning, with Big Data there is no need to think about causation anymore, since nonparametric models will do just fine using correlation alone. For many practical use cases this point of view may be acceptable, but surely not for all.

Read more

24/7 Spark Streaming on YARN in Production

At a large client in the German food retailing industry, we have been running Spark Streaming on Apache Hadoop™ YARN in production for close to a year now. Overall, Spark Streaming has proved to be a flexible, robust and scalable streaming engine. However, one can tell that streaming was retrofitted into Apache Spark™: many of the default configurations are not suited to a 24/7 streaming application. The same applies to YARN, which was not primarily designed with long-running applications in mind. Read more
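To give a flavour of the kind of tuning this involves, here is a minimal sketch of a long-running streaming job configured for YARN. The property names are real Spark settings, but the values and the placeholder socket source are assumptions for illustration only, not the configuration used in the project described above.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class RoundTheClockStreaming {

    public static void main(String[] args) throws InterruptedException {
        // Illustrative settings for a long-running streaming job on YARN;
        // concrete values depend on the workload and the cluster.
        SparkConf conf = new SparkConf()
                .setAppName("24x7-streaming")
                // allow the application master to be restarted after failures,
                // and reset the attempt counter after a validity interval
                .set("spark.yarn.maxAppAttempts", "4")
                .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
                // throttle ingestion when batches start to queue up
                .set("spark.streaming.backpressure.enabled", "true")
                // finish in-flight batches before shutting down
                .set("spark.streaming.stopGracefullyOnShutdown", "true");

        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source: read lines from a socket and print them,
        // just so the topology has an output operation.
        jssc.socketTextStream("localhost", 9999).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```

The YARN-related properties control how often the application master may be restarted, while the backpressure and graceful-shutdown flags address two common pain points of round-the-clock streaming jobs.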

ELK on Docker (-Compose)

The ELK/Elastic stack is a common open-source solution for collecting and analyzing log data from distributed systems. This article will show you how to run an ELK stack on Docker using Docker Compose. This will enable you to run ELK distributed on your Docker infrastructure or to test it on your local system. Read more

HBase and Phoenix on Azure: adventures in abstraction

One of my favourite essays by Joel Spolsky (he of Stack Overflow fame) is "The Law of Leaky Abstractions". In it he describes how layers of abstraction – be they coding languages, libraries or frameworks – have helped us accelerate our productivity. We don't have to talk directly to a database engine because we can let our SQL do that for us; we don't have to implement MapReduce jobs in Java anymore because we can use Hive; we don't have to… well, you get the idea. Read more

Cloud Wars: Computation [Part 3]

Turning collected data into useful information and added value usually requires some form of processing. The processing methods can be divided into real-time and batch processing. The former only looks at a very recent slice of the data and was already covered in the Collection and Storage part, in the section on streaming services. Batch processing usually takes a larger portion of the data into account, including historical data, in order to derive new insights or analytical models. For processing large volumes of data, the cloud providers mostly offer tools from the Hadoop big data ecosystem. Read more

Storm in a Teacup

I wanted to call this blog article something like "Storm in a Nutshell" but decided against it as

  1. there is probably a book by that name out there somewhere, and I wanted to avoid any unannounced visits in the dead of night from shady-looking types from the copyright police, and
  2. I really wanted to use a corny pun.

So think of a teacup as conceptually similar to a nutshell, but bigger.

On a recent project, we used Apache Storm as the real-time component of a complex, cloud-based environment used for fraud detection. In this article I would like to offer an introductory overview of Storm, showing how to define a simple spout and bolt, as well as highlighting some of the issues that are important when building Storm topologies. Read more
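As a taste of what that looks like, below is a minimal sketch of a spout, a bolt and the topology wiring. The class names (TeacupTopology, WordSpout, UppercaseBolt) are purely illustrative and not taken from the project, and the sketch assumes the org.apache.storm package names of current Storm releases (older releases used backtype.storm).

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class TeacupTopology {

    /** Spout that emits a random word roughly once per second. */
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "teacup", "nutshell"};

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            String word = words[(int) (Math.random() * words.length)];
            collector.emit(new Values(word));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    /** Bolt that uppercases each incoming word. */
    public static class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("word").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("upper_word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());
        builder.setBolt("upper", new UppercaseBolt()).shuffleGrouping("words");

        // Run locally for a quick test; on a real cluster the topology
        // would be submitted via StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("teacup", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```

The spout's nextTuple() is called repeatedly to feed data into the topology, the bolt's execute() is invoked once per incoming tuple, and shuffleGrouping distributes the spout's tuples randomly across the bolt's instances.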