Efficient UD(A)Fs with PySpark

Spark is nowadays one of the most prevalent technologies in data science and big data. Luckily, even though it is developed in Scala and runs in the Java Virtual Machine (JVM), it comes with Python bindings, known as PySpark, whose API was heavily influenced by Pandas. In terms of functionality, modern PySpark has about the same capabilities as Pandas for typical ETL and data wrangling tasks such as groupby and aggregations. As a general rule of thumb, one should consider an alternative to Pandas whenever the data set has more than 10,000,000 rows which, depending on the number of columns and data types, translates to about 5-10 GB of memory usage. At that point PySpark might be an option that does the job, but there are of course others, for instance Dask, which won’t be addressed in this post.
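
To make the rule of thumb above a bit more tangible, here is a minimal back-of-the-envelope sketch. It assumes fixed-width columns only and ignores index and object overhead; the function name `estimate_memory_gb` and the example schema are hypothetical and not taken from the post.

```python
from typing import Sequence


def estimate_memory_gb(n_rows: int, bytes_per_column: Sequence[int]) -> float:
    """Rough estimate: rows times the summed fixed width of all columns, in GB (10^9 bytes)."""
    return n_rows * sum(bytes_per_column) / 1e9


# Hypothetical schema (an assumption for illustration):
# 60 float64 columns (8 bytes each) and 10 int32 columns (4 bytes each).
schema = [8] * 60 + [4] * 10

print(f"{estimate_memory_gb(10_000_000, schema):.1f} GB")  # prints "5.2 GB"
```

Under these assumptions, 10,000,000 rows with roughly 60 to 125 eight-byte columns land in the 5-10 GB range mentioned above; string columns would typically push the figure higher.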

A hybrid supervised/unsupervised approach to network anomaly detection

The previous two posts gave a short introduction to network anomaly detection in general. We also introduced the k-means algorithm as a simple clustering technique and discussed some of its advantages and drawbacks. Furthermore, we gave some general information about techniques other than clustering that can be used for anomaly detection. In this post we want to introduce a hybrid unsupervised/supervised approach. We are going to use Balanced Iterative Reducing and Clustering using Hierarchies, also known as BIRCH, as a pre-clustering step for a subsequent Support Vector Machine (SVM) classifier.

Simplifying the Android developer’s life with Architecture Components

A few weeks ago at I/O 2017, Google announced that they are working on some components to make Android developers’ lives easier. They released a bunch of libraries which are supposed to take care of the lifecycle and suggest an architecture to rely on when developing an app. Although the libraries are not yet considered stable, the community has been pretty excited about them and is already adopting them. And indeed the libraries already work quite reliably, though some APIs might still change due to community feedback. Let’s have a closer look!

Neural Networks in the Browser

Neural networks are the basis of some pretty impressive recent advances in machine learning. From greatly improved translation to automatic transfer of painting styles, and from expert-level Go to Super Smash Bros, neural networks seem to conquer various fields previously dominated by human performance. The ongoing progress in algorithms and techniques allows neural networks to be applied in more and more use cases. Combined with the continuous maturing of the web as an application platform (see progressive web apps), this raises the question of whether neural network applications can be deployed as web apps, with all the advantages that come along with them.

7 Checkpoints for a Good Feedback Loop

Because data can be distributed and copied at no cost, it is sometimes difficult to find a unique selling point for an online service or a website. Using the data that accrues while a website or service is in use, the interaction data, can create such a unique selling point for the product. This makes sense from both a business and a technical perspective, as I explained in my last article on the feedback loop. Today I am making good on my announcement to present a checklist of the most important aspects of building a feedback loop.

Re-usable Web Interfaces with client-side Frameworks and Web Components (Part 2)

In the first part of our series we elaborated on the combined use of Web Components and web frameworks, taking Angular and Polymer as examples. We ended the article with the statement that both suffer from essential compatibility issues which have not been fixed so far. We will now illustrate solutions to these problems and thus enable applications in which Angular and Polymer run side by side.

The Added Value of Cloud Services in Hybrid DWH Architectures

The reasons for moving IT infrastructure from one’s own data center to a (public) cloud are manifold and compelling: lower costs, faster time-to-market and more efficient use of capital, to name only the most important. The complete relocation of a corporate data warehouse to the cloud, however, has so far been the exception. Apart from the data protection debate, which is conducted very seriously in Germany in particular, it would take a great deal of effort and money to synchronize the data of all relevant source systems with the cloud or to move all of the company’s systems there. Hybrid DWH architectures offer a middle way: they use cloud services selectively and intelligently integrate the on-premises and cloud data sets. The advantages of these services for data management and analytics tasks are presented in the following sections, followed by some typical scenarios from practice.

Declarative Thinking and Programming

Before we actually dive into this topic, imagine the following: You just moved to a new place and the time is ripe for a little house-warming dinner with your best friends Alice and Bob. Since Alice is really tech-savvy, you just send her a digital invitation with date, time and of course your new address that she can add to her calendar with a single click. With good old Bob it is a bit more difficult, since he really struggles with modern IT. That’s why you decide to send him an e-mail including not only the time and location but also a suggestion for which train to take to your city, details about the trams, the stops, the right street and so on.