Comparing Apache Flink and Spark: Stream vs. Batch Processing

Flink has its origins in a research project called Stratosphere and was donated to the Apache Software Foundation in 2014. It can be described as a modern, more effective replacement for MapReduce and has many similarities to Apache Spark. For example, its API resembles the Spark API, and both address similar use cases. Furthermore, you will find a counterpart for almost every Spark component in Flink, e.g. for Machine Learning and Graph Processing. Read on for a quick comparison!

Stream Processing

Under the hood, Flink and Spark are quite different. Spark is a batch-oriented system that operates on chunks of data, called RDDs, whereas Apache Flink is a stream processing system able to process row after row in real time. Spark Streaming, on the other hand, operates on micro-batches, which makes a minimum of latency inevitable.
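The latency difference can be illustrated with a toy sketch in plain Python (this is not the actual Spark or Flink API): a true streaming engine hands each record to the processing function as it arrives, while a micro-batching engine buffers records and only processes them once a full batch has been collected.

```python
def stream_process(events, handle):
    """True streaming (Flink style): each event is handled
    the moment it arrives."""
    for event in events:
        handle(event)

def micro_batch_process(events, handle, batch_size=3):
    """Micro-batching (Spark Streaming style): events are buffered,
    so the first event in a batch waits until the batch is full --
    that waiting time is the unavoidable minimum latency."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            for buffered in batch:
                handle(buffered)
            batch = []
    # flush the last, possibly incomplete batch
    for buffered in batch:
        handle(buffered)
```

Both functions produce the same results in the same order; the difference is *when* each record becomes visible to the handler.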

Iterations

By exploiting its streaming architecture, Flink can iterate over data natively, something Spark supports only as a sequence of separate batch jobs.

Memory Management

In our experience, Spark jobs have to be optimized and adapted to specific datasets, because you need to control partitioning and caching manually to get good performance. Flink, by contrast, manages its own memory: it operates on serialized binary data in managed memory segments, which reduces garbage-collection pressure and the need for this kind of manual tuning.
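As a rough illustration of what "controlling partitioning" means, the following plain-Python sketch (not the Spark API) mimics what a hash partitioner does: it routes each key-value pair to a partition by key hash, so that all records with the same key are co-located and later per-key operations can avoid a full shuffle.

```python
def hash_partition(pairs, num_partitions):
    """Toy model of hash partitioning: every (key, value) pair
    with the same key lands in the same partition, so per-key
    work (joins, aggregations) needs no further data movement."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions
```

In Spark, getting this right (and caching the partitioned dataset so the shuffle is not repeated) is the programmer's job; picking a poor partitioning for a given dataset is a common source of slow jobs.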

Maturity

Flink is still in its infancy and has but a few production deployments. However, because it is tightly integrated into Hadoop and offers a familiar-looking API, it is easy to evaluate Flink whenever fast iterative processing is a requirement.

Data Flow

In contrast to the procedural programming paradigm, Flink follows a distributed data flow approach. For data set operations where intermediate results are required in addition to the regular input of an operation, broadcast variables are used to distribute the precomputed results to all worker nodes. This is useful for auxiliary data sets or data-dependent parameterization. The data set is then accessible to all parallel instances of an operation.
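The idea can be sketched in plain Python (the names and data here are made up, and this is not the Flink API): a small, precomputed lookup table is shipped to every parallel instance once, so each partition of the regular input can be enriched locally instead of joining two large data sets.

```python
# A small, precomputed auxiliary data set that every worker needs,
# e.g. country codes keyed by ID. Broadcasting ships one copy to
# each worker node.
country_by_id = {1: "DE", 2: "FR", 3: "US"}  # the broadcast set

def enrich_partition(records, broadcast):
    """Runs on one worker: `records` is that worker's partition of
    the regular input; `broadcast` is available as side input."""
    return [(name, broadcast.get(cid, "unknown")) for name, cid in records]

# Every parallel instance sees the same broadcast copy:
partition_1 = [("alice", 1), ("bob", 3)]
partition_2 = [("carol", 2)]
out = enrich_partition(partition_1, country_by_id) + \
      enrich_partition(partition_2, country_by_id)
```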

Visualization

Flink offers a web interface for submitting and executing jobs, on which the resulting execution plan can be visualized. Like Spark, Flink is integrated into Apache Zeppelin, which provides data analytics and ingestion as well as discovery, visualization, and collaboration. Zeppelin supports multiple language backends, which makes it possible to write and execute Flink programs and easily visualize the results.

Read on

For more information on Big Data technology and case studies proving their success please refer to our website. For personal contact get in touch via email or call +49 721 619 021-0.
