Powering a Data Hub at Otto Group BI with Schedoscope

In order to build data services or advanced machine learning models, organizations must integrate large amounts of information from diverse sources. As the central place to consolidate as many data sources as possible, we often find what is fashionably called a data lake. Building a data lake usually starts by collecting as much data in raw form as possible. The idea is to give data scientists simple access to all available data so that they can combine information in ways not yet anticipated. Hadoop is the preferred choice for such a system because it is able to store vast amounts of data in a cost-efficient manner and is largely agnostic to structure.

24/7 Spark Streaming on YARN in Production

At a large client in the German food retailing industry, we have been running Spark Streaming on Apache Hadoop™ YARN in production for close to a year now. Overall, Spark Streaming has proved to be a flexible, robust and scalable streaming engine. However, one can tell that streaming itself has been retrofitted into Apache Spark™. Many of the default configurations are not suited for a 24/7 streaming application. The same applies to YARN, which was not primarily designed with long-running applications in mind.

Apache Mesos: An introduction

One of the biggest challenges in data centers is maintaining multiple clusters for different workloads. Say you want to run Hadoop, Kafka and Storm: that means maintaining three different clusters. These clusters are underutilized most of the time. When you run a Hadoop job, for example, you need many resources to get it done, but for the rest of the day those resources stay idle. A very simple calculation shows how much of the time your resources sit idle, wasting space and money (and we haven't even talked about hardware replacements yet!). Read on for the nitty-gritty details in this first article in our Mesos mini-series.
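The kind of simple calculation mentioned above can be sketched in a few lines of Python. All figures here (node counts and busy hours per day) are hypothetical and chosen purely for illustration; they are not measurements from any real cluster, and the function names are our own.

```python
HOURS_PER_DAY = 24

# Hypothetical clusters: name -> (number of nodes, busy hours per day).
# These numbers are assumptions made up for this example.
clusters = {
    "hadoop": (20, 6),   # batch jobs run for a few hours, then the nodes idle
    "kafka": (5, 24),    # assumed to be busy around the clock
    "storm": (10, 12),
}


def idle_node_hours(nodes, busy_hours):
    """Node-hours per day during which a dedicated cluster does nothing."""
    return nodes * (HOURS_PER_DAY - busy_hours)


def idle_fraction(clusters):
    """Share of the total daily capacity (in node-hours) that stays idle."""
    total = sum(nodes * HOURS_PER_DAY for nodes, _ in clusters.values())
    idle = sum(idle_node_hours(nodes, busy) for nodes, busy in clusters.values())
    return idle / total


for name, (nodes, busy) in clusters.items():
    print(f"{name}: {idle_node_hours(nodes, busy)} idle node-hours per day")

print(f"overall idle share: {idle_fraction(clusters):.0%}")
```

With these made-up inputs, more than half of the paid-for capacity sits unused each day, which is the waste that sharing one pool of machines across workloads is meant to reduce.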

inovex retrospex [July 2015]

Another month passed, another retrospex due. This time we’re talking Android embedded, IPv6 for mobile and women in tech – more precisely Sophie Wilson, the creator of the ARM architecture. Read on for everything new in tech you might have missed in July.

Comparing Apache Flink and Spark: Stream vs. Batch Processing

Flink has its origins in a research project called Stratosphere and was donated to the Apache Software Foundation in 2014. It can be described as a modern, more effective replacement for MapReduce and has quite a few similarities to Apache Spark. For example, its API resembles the Spark API, and both address similar use cases. Furthermore, you will find a counterpart for almost every Spark component in Flink, e.g. for machine learning and graph processing. Read on for a quick comparison!