24/7 Spark Streaming on YARN in Production

At a large client in the German food retailing industry, we have been running Spark Streaming on Apache Hadoop™ YARN in production for close to a year now. Overall, Spark Streaming has proved to be a flexible, robust and scalable streaming engine. However, one can tell that streaming itself has been retrofitted into Apache Spark™. Many of the default configurations are not suited for a 24/7 streaming application. The same applies to YARN, which was not primarily designed with long-running applications in mind.

HBase and Phoenix on Azure: adventures in abstraction

One of my favourite essays by Joel Spolsky (he of Stack Overflow fame) is “The Law of Leaky Abstractions”. In it he describes how the prevalence of layers of abstraction, be they programming languages, libraries or frameworks, has helped us become more productive. We don’t have to talk directly to a database engine because we can let our SQL do that for us; we don’t have to implement MapReduce jobs in Java anymore because we can use Hive; we don’t have to… well, you get the idea.