Comparing Apache Flink and Spark: Stream vs. Batch Processing

Notice:
This post is older than 5 years – the content might be outdated.

Flink has its origins in a research project called Stratosphere and was donated to the Apache Software Foundation in 2014. It can be described as a modern, more effective replacement for MapReduce and has a lot in common with Apache Spark. For example, its API resembles the Spark API, and both address similar use cases. Furthermore, you will find a counterpart in Flink for almost every Spark component, e.g. for machine learning and graph processing. Read on for a quick comparison!

Stream Processing

Under the hood, Flink and Spark are quite different. Spark is a batch-oriented system that operates on chunks of data called RDDs (Resilient Distributed Datasets), whereas Apache Flink is a stream processing system that can process row after row in real time. Streaming with Spark, on the other hand, operates on micro-batches, which makes at least a minimal latency inevitable.
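The latency difference can be illustrated with a small, framework-free simulation. The sketch below is plain Python, not Flink or Spark code: the function names, the arrival timestamps and the batch interval are all assumptions chosen for illustration, and processing cost is ignored. It measures, for each event, how long it waits between arriving and being picked up for processing.

```python
def per_record_latencies(arrival_times_ms):
    """Record-at-a-time model: every event is picked up the moment it arrives."""
    return [0 for _ in arrival_times_ms]


def micro_batch_latencies(arrival_times_ms, batch_interval_ms):
    """Micro-batch model: an event waits until its batch interval closes."""
    latencies = []
    for t in arrival_times_ms:
        # The batch covering t closes at the next multiple of the interval.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end - t)
    return latencies


if __name__ == "__main__":
    arrivals = [100, 400, 900, 1200]  # hypothetical arrival times in ms
    print(per_record_latencies(arrivals))
    print(micro_batch_latencies(arrivals, batch_interval_ms=1000))
```

In this toy model the waiting time of a micro-batched event is bounded by the batch interval, so shrinking the interval reduces latency but never removes it entirely.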

Iterations

By exploiting its streaming architecture, Flink allows you to iterate over data natively, whereas Spark supports iterations only as a sequence of batches.
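To make the idea of an iteration concrete, here is a minimal sketch in plain Python of the feedback structure: a step function is applied to the current data set and its result is fed back in as the next input. It is not Flink or Spark code. The helper name `bulk_iterate`, its parameters and the example step function are assumptions for illustration only, and the sketch says nothing about how either engine schedules the loop internally.

```python
def bulk_iterate(initial, step, max_iterations, converged=None):
    """Apply `step` repeatedly, feeding each result back in as the next input.

    Stops after `max_iterations` rounds, or earlier if the optional
    `converged(previous, current)` predicate returns True.
    Returns the final data set and the number of rounds executed.
    """
    current = initial
    for round_number in range(1, max_iterations + 1):
        following = step(current)
        if converged is not None and converged(current, following):
            return following, round_number
        current = following
    return current, max_iterations


if __name__ == "__main__":
    # Hypothetical step function: halve every value (integer division).
    halve = lambda values: [v // 2 for v in values]
    unchanged = lambda previous, current: previous == current
    print(bulk_iterate([8, 3], halve, max_iterations=10, converged=unchanged))
```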

Memory Management

In our experience, Spark jobs have to be optimized and adapted to specific datasets, because you need to control partitioning and caching manually if you want to get it right.

Maturity

Flink is still in its infancy and has only a few production deployments. However, because it is tightly integrated with Hadoop and has a familiar-looking API, it is easy to evaluate Flink whenever fast iterative processing is a requirement.

Data Flow

In contrast to the procedural programming paradigm, Flink follows a distributed data flow approach. For data set operations that require intermediate results in addition to the regular input of an operation, broadcast variables are used to distribute the precalculated results to all worker nodes. This is useful for auxiliary data sets or data-dependent parameterization. The broadcast data set is then accessible to all parallel instances of an operation.
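The broadcast idea can be sketched without any framework. The following plain Python toy is not Flink code: the function `map_with_broadcast`, the partition layout and the lookup table are assumptions for illustration. Each simulated parallel instance works on its own partition of the regular input, while every instance receives a copy of the same auxiliary data set.

```python
def map_with_broadcast(partitions, broadcast_data, fn):
    """Run `fn(record, broadcast)` over every partition.

    `partitions` stands in for the regular input, split across parallel
    instances; `broadcast_data` is the auxiliary data set each instance sees.
    """
    results = []
    for partition in partitions:
        # Each simulated worker gets its own copy of the broadcast data.
        local_copy = dict(broadcast_data)
        results.append([fn(record, local_copy) for record in partition])
    return results


if __name__ == "__main__":
    # Hypothetical example: enrich country codes with a small lookup table.
    lookup = {"de": "Germany", "fr": "France"}
    partitions = [["de", "fr"], ["fr", "xx"]]
    enrich = lambda code, table: table.get(code, "unknown")
    print(map_with_broadcast(partitions, lookup, enrich))
```

Copying the lookup table per partition mirrors the fact that the auxiliary data is shipped to every worker, which is why this pattern suits small data sets rather than large ones.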

Visualization

Flink offers a web interface for submitting and executing jobs, which can also visualize the resulting execution plan. Like Spark, Flink is integrated into Apache Zeppelin, which provides data ingestion, discovery, analytics, visualization and collaboration. Zeppelin supports a multi-language backend that allows you to add and execute Flink programs and to visualize the results easily.

