Efficient UD(A)Fs with PySpark

Nowadays, Spark surely is one of the most prevalent technologies in the fields of data science and big data. Luckily, even though it is developed in Scala and runs in the Java Virtual Machine (JVM), it comes with Python bindings also known as PySpark, whose API was heavily influenced by Pandas. With respect to functionality, modern PySpark has about the same capabilities as Pandas when it comes to typical ETL and data wrangling, e.g. groupby, aggregations and so on. As a general rule of thumb, one should consider an alternative to Pandas whenever the data set has more than 10,000,000 rows which, depending on the number of columns and data types, translates to about 5-10 GB of memory usage. At that point PySpark might be an option for you that does the job, but of course there are others like for instance Dask which won’t be addressed in this post. Weiterlesen

A hybrid supervised/unsupervised approach to network anomaly detection

The previous two posts gave a short introduction of network anomaly detection in general. We also introduced the k-means algorithm as a simple clustering technique and discussed some advantages and drawbacks of the algorithm. Furthermore we gave some general information about techniques other than clustering which can be used for anomaly detection. In this post we want to introduce a hybrid unsupervised/supervised approach. We are going to use Balanced Iterative Reducing and Clustering using Hierarchies, also known as BIRCH as a pre-clustering step for a subsequent Support Vector Machine (SVM) classifier. Weiterlesen

Simplifying the Android developer’s life with Architecture Components

Some weeks ago at the I/O 2017 Google announced that they are working on some components to make the Android developers‘ life easier. They released a bunch of libraries which are supposed to take care of the lifecycle and suggest an architecture to rely on when developing an app. Despite the libraries not being considered stable by now, the community has been pretty excited about this and is adapting to it already. And indeed the libraries are working pretty stable already, some APIs might change though due to community feedback. Let’s have a closer look! Weiterlesen

Neural Networks in the Browser

Neural networks are the basis of some pretty impressive recent advances in machine learning. From greatly improved translation to automatic transfer of painting styles and from expert-level Go to Super Smash Bros, neural networks seem to to conquer various fields previously dominated by human performance. The on-going progress in different algorithms and techniques allows the application of neural networks in more and more use cases. Combined with the continuous maturing of the web as an application platform (see progressive web apps) this begs the question whether neural network applications can be deployed as web apps with all the advantages that come along with them. Weiterlesen

Re-usable Web Interfaces with client-side Frameworks and Web Components (Part 2)

In the first part of our series we elaborated on the common use of Web Components and web frameworks taking Angular and Polymer as examples. We ended the article with the statement that both suffer from essential compatibility issues which aren’t fixed so far. So now we will illustrate solutions for these problems and thus enable applications with Angular and Polymer running side by side. Weiterlesen

Declarative Thinking and Programming

Before we actually dive into this topic, imagine the following: You just moved to a new place and the time is ripe for a little house-warming dinner with your best friends Alice and Bob. Since Alice is really tech-savvy you just send her a digital invitation with date, time and of course your new address that she can add to her calendar with a single click. With good old Bob is a bit more difficult, he is having a real struggle with modern IT. That’s why you decide to send him an e-mail not only including the time and location but also a suggestion which train to take to your city, details about the trams, stops, the right street and so. Weiterlesen

Anomaly Detection: (Dis-)advantages of k-means clustering

In the previous post we talked about network anomaly detection in general and introduced a clustering approach using the very popular k-means algorithm. In this blog post we will show you some of the advantages and disadvantages of using k-means. Furthermore we will give a general overview about techniques other than clustering which can be used for anomaly detection. Weiterlesen