{"id":40924,"date":"2023-05-23T14:52:56","date_gmt":"2023-05-23T12:52:56","guid":{"rendered":"https:\/\/www.inovex.de\/?p=40924"},"modified":"2023-05-23T14:52:56","modified_gmt":"2023-05-23T12:52:56","slug":"data-orchestration-is-airflow-still-the-best-part-1","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/","title":{"rendered":"Data Orchestration: Is Airflow Still the Best? (Part 1)"},"content":{"rendered":"<p>Nowadays, we rely heavily on technology, and every second a tremendous amount of data is collected and processed. Companies can only capitalize on this data by building reliable, maintainable, and robust data pipelines. A large company, especially a technology-oriented one, can easily operate more than a thousand data pipelines. How can companies manage so many of them?<!--more--><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_workflow_17.png\" alt=\"Morning routine: brushing your teeth, having breakfast, showering, and working\" width=\"1410\" height=\"534\" \/><\/p>\n<p>This is only possible thanks to data orchestration tools like Airflow, whose primary task is to simplify the management of data pipelines. Data orchestration is a subfield of workflow orchestration, which is concerned with managing related tasks. For instance, when you wake up in the morning, you probably follow a routine: brushing your teeth, making yourself a sandwich, hopping into the shower, and so on. Every activity represents a task. You usually do them in order, but you could also do some of them simultaneously, e.g. toasting your bread while you shower. All of these tasks are related.<\/p>\n<p>The same concepts apply to data orchestration, e.g. 
pulling data out of some database, storing the raw data in a file on a file system, validating the data, aggregating it, and so forth. All of these activities can be viewed as tasks that we need to work through to get our desired result. A typical result could be a cleaned and validated dataset for machine learning model training.<\/p>\n<p>If you had only one pipeline, you would not need any orchestration tool. Orchestration gets interesting when you have several pipelines which need to be managed and run on a schedule: for example, one pipeline should run every hour, a second daily at 6 a.m., and a third every 10 minutes. Of course, you could set up cron jobs that execute your pipelines at the right times, but you would need to manage all of these cron jobs manually. Moreover, the cost of maintenance increases dramatically with every new feature you add to your pipelines, e.g., data governance features, logging, or monitoring. All of these requirements can be handled by a data orchestrator. 
But before diving into the world of data orchestration, let us have a look at the history of workflow orchestration.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Airflows-appearance\" >Airflow&#8217;s appearance<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Cron-jobs\" >Cron jobs<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Spotify-Luigi\" >Spotify: Luigi<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Airbnb-Airflow\" >Airbnb: Airflow<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Airflows-dominance\" >Airflow&#8217;s dominance<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#The-Experiment\" >The Experiment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Prerequisites\" >Prerequisites<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#The-Pipeline\" >The Pipeline<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#DAG-Definition\" >DAG Definition<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Task-Postgres-Ingestion\" >Task: Postgres Ingestion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Task-Plotting-Revenue-vs-Time\" >Task: Plotting Revenue vs. Time<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Task-Aggregation\" >Task: Aggregation<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Task-Plotting-Average-Revenue-vs-Manager\" >Task: Plotting Average Revenue vs. 
Manager<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Finish-the-DAG-Definition\" >Finish the DAG Definition<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Results\" >Results<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Final-Remarks\" >Final Remarks<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Bonus\" >Bonus<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Replaying-Tasks\" >Replaying Tasks<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Looping-over-tasks-multiple-times\" >Looping over tasks multiple times<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#Slack-Integration\" >Slack Integration<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Airflows-appearance\"><\/span>Airflow&#8217;s appearance<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Cron-jobs\"><\/span>Cron jobs<span 
class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Before any open-source workflow orchestration tools were released, developers had to set up cron jobs. On UNIX systems, a cron job is executed by the cron daemon, a process that runs in the background and typically performs maintenance work. Creating a cron job is pretty simple. Imagine you have a bash script named &#8222;example.sh&#8220; and you want to run it every 5 minutes. You could open your crontab configuration via<\/p>\n<pre class=\"lang:default decode:true\">crontab -e<\/pre>\n<p>and append the following line to the configuration file:<\/p>\n<pre class=\"lang:zsh decode:true\">*\/5 * * * * .\/example.sh\r\n<\/pre>\n<p>That&#8217;s it. The original cron daemon, by the way, was written by <a href=\"https:\/\/de.wikipedia.org\/wiki\/Brian_W._Kernighan\" target=\"_blank\" rel=\"noopener\">Brian W. Kernighan<\/a>, and its algorithm is very simple: an infinitely running process checks every minute whether the time condition of any job is fulfilled. If it is, it launches a child process that runs the given command, in our case the bash script. If not, it simply waits for the next minute. A common problem is that when the daemon unexpectedly stops working, you need to restart it manually.<\/p>\n<p>Furthermore, cron can only launch processes; it does not provide any monitoring or logging utilities. The daemon also runs on a single machine, and there is no trivial way to distribute jobs across machines, so scalability is a problem. On top of that, there is no clear separation between local development environments and a production server. 
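The polling loop described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not real cron: the helper below is our own invention and only understands a minute field of the form `*`, `*/n`, or a plain number, while real cron parses all five time fields.

```python
import subprocess
import time
from datetime import datetime


def minute_field_matches(field: str, minute: int) -> bool:
    """Tiny subset of cron's minute field: '*', '*/n', or a single number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return minute % int(field[2:]) == 0
    return minute == int(field)


def cron_daemon(field: str, command: list[str]) -> None:
    """Naive cron loop: wake up once a minute and fork matching jobs."""
    while True:
        if minute_field_matches(field, datetime.now().minute):
            subprocess.Popen(command)  # child process runs the job
        time.sleep(60)


# cron_daemon("*/5", ["./example.sh"])  # would launch example.sh every 5 minutes
```

If this loop dies, nothing runs anymore and nobody is notified, which is exactly the reliability problem mentioned above.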
But it was never the cron job&#8217;s intention to cover all of these aspects; developers simply misused cron jobs to launch workflows.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Spotify-Luigi\"><\/span>Spotify: Luigi<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The need for data orchestration grew as more and more data had to be handled. Spotify was a front-runner and created <a href=\"https:\/\/luigi.readthedocs.io\/en\/stable\/\" target=\"_blank\" rel=\"noopener\">Luigi<\/a>, which launched as an <a href=\"https:\/\/github.com\/spotify\/luigi\" target=\"_blank\" rel=\"noopener\">open-source project<\/a> in late 2012 as the first tool of its kind. Luigi offered a class-based approach to data orchestration and popularized the notion of tasks: you define tasks as classes, define dependencies between them, and let them run together. A classic object-oriented design (OOD) approach. But Luigi lacks features like a pipeline trigger; you still need a cron job to kick off Luigi pipelines at the right time.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Airbnb-Airflow\"><\/span>Airbnb: Airflow<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>In June 2015, it was time for <a href=\"https:\/\/airflow.apache.org\/\" target=\"_blank\" rel=\"noopener\">Airflow<\/a>, developed at Airbnb, to take the stage as an <a href=\"https:\/\/github.com\/apache\/airflow\/releases\/tag\/1.0.0\" target=\"_blank\" rel=\"noopener\">open-source project<\/a>. Airflow was the first workflow orchestration tool with a modern-looking web UI and a scheduler that can kick off pipelines on its own. As time has gone by, the open-source community has grown and more and more features have been implemented in Airflow. 
During that time, Airflow was regarded as the best open-source workflow orchestration tool, which is why so many companies have adopted it into their production environments.<\/p>\n<p>Later on, workflow orchestration tools like Kubeflow, Argo, Flyte, Prefect, Dagster, and many more emerged, challenging Airflow&#8217;s position in this domain.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Airflows-dominance\"><\/span>Airflow&#8217;s dominance<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>You might ask: \u201cWhat made Airflow so popular and dominant?\u201c Answering this question requires a closer look at Airflow, which describes itself on the <a href=\"https:\/\/airflow.apache.org\/\" target=\"_blank\" rel=\"noopener\">official homepage<\/a> as follows:<\/p>\n<blockquote><p>\u201cAirflow is a platform created by the community to programmatically author, schedule and monitor workflows\u201c<\/p><\/blockquote>\n<p>Airflow&#8217;s strength lies in the fact that its open-source community has become fairly large and active. Authoring pipelines in Airflow can only be done in the programming language Python; since this is a very popular language, it is a convenient choice for many developers. Airflow also provides many features that help with developing and monitoring data pipelines. Moreover, since Airflow was one of the first data orchestration tools to go open source and provide a web UI, many developers gravitated towards it, and over time all large public cloud vendors have published a managed Airflow service for production-ready environments. But can Airflow hold its dominant position in the long term?<\/p>\n<p>Let&#8217;s assess Airflow&#8217;s current position by experimenting! 
Let&#8217;s build a toy pipeline in Airflow.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The-Experiment\"><\/span>The Experiment<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This experiment would be boring without competitors, so Airflow will face two other data orchestration tools: Dagster and Prefect! Why these two? In my opinion, they have the highest potential to surpass Airflow in the long term.<\/p>\n<p>To systemize the experiment, I will use 7 assessment categories, and each data orchestration tool can collect up to 5 stars in each category. In the end, we will sum up the stars each tool collected and elect the champion of this little experiment. The categories are: Setup\/Installation, Features, UI\/UX, Pipeline Developer Experience, Unit Testing, Documentation, and Deployment.<\/p>\n<p>Before diving into the code of the pipeline we are going to implement, here is a little backstory: imagine we run a group of franchise restaurants. Each restaurant is managed by a dedicated manager, and at the end of each day, the manager registers the revenue the restaurant made. We, as the CEO, want to know whether our franchises are performing well or badly, so we tell our Data Engineer to build a pipeline that generates two plots: one showing how each franchise&#8217;s daily revenue develops over time, and one showing the average revenue made by each franchise. 
The structure of the pipeline is depicted in figure 1.<\/p>\n<figure style=\"width: 1004px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_pipeline_1.png\" alt=\"Example Pipeline Structure\" width=\"1004\" height=\"286\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 1: Pipeline structure for our franchise experiment<\/strong><\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Prerequisites\"><\/span>Prerequisites<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Before you can follow along with this experiment, you have to install Airflow. A simple method is to use Docker\/Docker Compose; just follow the steps in the official <a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/howto\/docker-compose\/index.html\">documentation<\/a>. There is just one subtlety that is not mentioned in the documentation right away: how to define our own dependencies. We will use pandas for data frame manipulation and plotly for plot generation, and we will need a provider package that allows us to connect to a Postgres database. This step comes before executing the command <em><span class=\"lang:default decode:true crayon-inline \">docker compose up<\/span><\/em>. 
Copy &amp; paste the following requirements.txt file into your Airflow home directory under <span class=\"lang:default decode:true crayon-inline \">$AIRFLOW_HOME<\/span>:<\/p>\n<pre class=\"lang:default decode:true\">pip\r\npandas\r\nplotly\r\napache-airflow-providers-postgres<\/pre>\n<p>After this step, we have to modify our docker-compose.yml file a little bit. Please change the following lines under the x-airflow-common service:<\/p>\n<pre class=\"lang:default decode:true \">x-airflow-common:\r\n  &amp;airflow-common\r\n  # In order to add custom dependencies or upgrade provider packages you can use your extended image.\r\n  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml\r\n  # and uncomment the \"build\" line below. Then run `docker-compose build` to build the images.\r\n  # image: ${AIRFLOW_IMAGE_NAME:-apache\/airflow:2.4.3}\r\n  build: .<\/pre>\n<p>We simply comment out the image attribute and uncomment the build attribute. This instructs Docker Compose not to pull the stock Airflow image but to build an image from our custom Dockerfile. Since we don&#8217;t have a Dockerfile yet, we create one in the same directory with the following content:<\/p>\n<pre class=\"lang:default decode:true \">FROM apache\/airflow:2.4.3\r\nCOPY requirements.txt .\r\nRUN pip install -r requirements.txt<\/pre>\n<p>Now we are ready to launch Airflow with:<\/p>\n<pre class=\"lang:default decode:true \">docker compose up<\/pre>\n<p>This will launch the webserver, scheduler, worker, triggerer, Airflow&#8217;s metadata Postgres database, and Redis. We will also need our own Postgres database where we can store our franchise data, so follow the instructions on the <a href=\"https:\/\/www.postgresql.org\/\">official Postgres webpage<\/a> on how to install one. Alternatively, you can use Docker to set up a Postgres database. 
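If you take the Docker route, a throwaway Postgres container can be started roughly as follows. Note that the container name, user, password, database name, and image tag are placeholders of our choosing, not values prescribed by this post:

```shell
# Disposable Postgres instance on the default port 5432.
# POSTGRES_DB creates the "franchise" database on first startup.
docker run -d --name franchise-postgres \
  -e POSTGRES_USER=franchise_user \
  -e POSTGRES_PASSWORD=change-me \
  -e POSTGRES_DB=franchise \
  -p 5432:5432 \
  postgres:15
```

You can then open a psql shell with `docker exec -it franchise-postgres psql -U franchise_user franchise` and paste the table setup query from the next step.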
You will also need to run the following SQL query to set up the table with toy data after you have created a database:<\/p>\n<pre class=\"lang:pgsql decode:true\">CREATE TABLE stores (\r\n    id BIGSERIAL PRIMARY KEY,\r\n    manager VARCHAR(30) NOT NULL,\r\n    city VARCHAR(20) NOT NULL,\r\n    street VARCHAR(20) NOT NULL,\r\n    street_number INTEGER NOT NULL,\r\n    revenue DOUBLE PRECISION NOT NULL,\r\n    day DATE NOT NULL\r\n);\r\n\r\nINSERT INTO stores (manager, city, street, street_number, revenue, day) VALUES \r\n    ('Raphael Rodriguez', 'Aachen', 'Pontstrasse', 4, 4724.57, '2022-11-01'), \r\n    ('Raphael Rodriguez', 'Aachen', 'Pontstrasse', 4, 2579.35, '2022-11-02'),\r\n    ('Raphael Rodriguez', 'Aachen', 'Pontstrasse', 4, 5804.42, '2022-11-03'), \r\n    ('Joe Merkur', 'Koeln', 'Trankgasse', 24, 5608.32, '2022-11-01'),\r\n    ('Joe Merkur', 'Koeln', 'Trankgasse', 24, 2475.62, '2022-11-02'),\r\n    ('Joe Merkur', 'Koeln', 'Trankgasse', 24, 12843.76, '2022-11-03'), \r\n    ('Alice Mueller', 'Koeln', 'Keupstrasse', 124, 6764.56, '2022-11-01'), \r\n    ('Alice Mueller', 'Koeln', 'Keupstrasse', 124, 4524.35, '2022-11-02'), \r\n    ('Alice Mueller', 'Koeln', 'Keupstrasse', 124, 4792.64, '2022-11-03'), \r\n    ('Berno Goeth', 'Bonn', 'Koblenzer Strasse', 47, 1357.35, '2022-11-01'),\r\n    ('Berno Goeth', 'Bonn', 'Koblenzer Strasse', 47, 2597.25, '2022-11-02'),\r\n    ('Berno Goeth', 'Bonn', 'Koblenzer Strasse', 47, 899.96, '2022-11-03');<\/pre>\n<p>Okay, we are good to go!<\/p>\n<h3><span class=\"ez-toc-section\" id=\"The-Pipeline\"><\/span>The Pipeline<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Airflow works with the concept of Directed Acyclic Graphs (DAGs) where each task represents a node. The edges are given by the dependencies between the tasks. Our pipeline will consist of 4 tasks. 
To create our first DAG, run the following commands in the root directory of the Airflow workspace:<\/p>\n<pre class=\"lang:default decode:true\">cd dags\r\nmkdir stores &amp;&amp; cd stores\r\nmkdir ingestion plotting transformation\r\ntouch pipeline.py\r\n<\/pre>\n<p>This creates a directory named <em>stores<\/em> inside of the dags directory with three subdirectories: <em>ingestion<\/em>, <em>plotting,<\/em> and <em>transformation<\/em>. Inside the <em>stores<\/em> directory, we also create a file named <em>pipeline.py<\/em> which will hold our DAG definition. Your project structure should look similar to figure 2.<\/p>\n<figure style=\"width: 254px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_airflow_dirstruct_2.png\" alt=\"Airflow Project Structure\" width=\"254\" height=\"458\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 2: Airflow Project Structure<\/strong><\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"DAG-Definition\"><\/span>DAG Definition<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Now we are ready to author our first pipeline, so let us start by outlining the DAG inside of <em>pipeline.py<\/em>:<\/p>\n<pre class=\"lang:python decode:true\">import pendulum\r\nfrom airflow.decorators import dag\r\n\r\n@dag(\r\n    dag_id=\"franchise_analysis_pipeline\",\r\n    description=\"Analysing franchise data\",\r\n    schedule=\"0 7 * * *\",\r\n    start_date=pendulum.datetime(\r\n        year=2022,\r\n        month=11,\r\n        day=1,\r\n        hour=7,\r\n        tz='Europe\/Berlin'\r\n    ),\r\n    catchup=False,\r\n    default_args={\r\n        'retries': 0\r\n    },\r\n)\r\ndef franchise_analysis_pipeline():\r\n    pass\r\n\r\nfranchise_analysis_pipeline()<\/pre>\n<p>There are two ways to write DAGs in Airflow but we will use what Airflow refers to as the <a 
href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/tutorial\/taskflow.html\" target=\"_blank\" rel=\"noopener\">TaskFlow API<\/a>. In my opinion, it increases readability. We import the <em>pendulum<\/em> library because Airflow&#8217;s scheduler works with timezone-aware datetimes, and pendulum is the library Airflow recommends for creating them. With the <em>dag<\/em> decorator in place, Airflow knows which function defines our DAG.<\/p>\n<p>Airflow also needs to identify our DAG, so we give it a unique <em>dag_id<\/em>. We add a <em>description<\/em> and tell Airflow that our pipeline should run daily at 7 a.m., starting from the 1st of November 2022 onwards. The timezone is set to our local timezone, and we also define a catchup and a retry parameter. Catchup tells Airflow whether to schedule all the runs we have missed: e.g. if we create our pipeline on the 3rd of November 2022, <span class=\"lang:default decode:true crayon-inline\">catchup=False<\/span> tells Airflow not to backfill the runs from the 1st and 2nd of November 2022 but to schedule runs only from the 3rd of November 2022 onwards. We can also pass a parameter named <em>retries<\/em>: when a task fails, Airflow will re-run it as often as the specified retry count. This is useful when tasks occasionally fail due to, e.g., timeout issues, where a retry might solve the problem, but in our case it is not needed.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Task-Postgres-Ingestion\"><\/span>Task: Postgres Ingestion<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Next, we should write out our tasks. 
Therefore we create a file named <em>postgres.py<\/em> inside of the <em>ingestion<\/em> directory and write the following skeleton into it:<\/p>\n<pre class=\"lang:default decode:true\">from airflow.decorators import task\r\nfrom airflow.providers.postgres.hooks.postgres import PostgresHook\r\n\r\n@task\r\ndef ingest_franchise_data_from_postgres(target_file: str):\r\n    pass<\/pre>\n<p>Defining a task is simple: just apply the <em>task<\/em> decorator on <em>ingest_franchise_data_from_postgres<\/em>. We also already import <em>PostgresHook<\/em> since we will need it to establish a connection to our Postgres database. But before we write out the business logic, we have to tell Airflow where it can find our database. Therefore, log in to the web UI of Airflow and hover over the field <em>Admin<\/em> in the top navigation bar. Then click on <em>Connections<\/em>, which lists all of your connections. Since you haven&#8217;t created any connections yet, the list should be empty. Click on the plus sign and you should see a form like the one illustrated in figure 3.<\/p>\n<figure style=\"width: 2538px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_airflow_connection_list_3.png\" alt=\"Airflow Connection List\" width=\"2538\" height=\"919\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 3: Airflow&#8217;s connection list<\/strong><\/figcaption><\/figure>\n<p>Specify the following parameters:<\/p>\n<ul>\n<li>Connection Id: &#8222;postgres_franchise&#8220;. 
This should specify a unique name for the connection, such that we can unambiguously refer to it in our code<\/li>\n<li>Connection Type: &#8222;Postgres&#8220;<\/li>\n<li>Host: the hostname of your database. Note that Airflow itself runs in Docker here, so if you deployed the database on your host machine, use &#8222;host.docker.internal&#8220; rather than &#8222;localhost&#8220;; from inside a container, &#8222;localhost&#8220; refers to the container itself<\/li>\n<li>Schema: your database name<\/li>\n<li>Login: the database user<\/li>\n<li>Password: your database password<\/li>\n<li>Port: port of your database; the default port of Postgres is 5432.<\/li>\n<\/ul>\n<p>Afterward, click on <em>Test<\/em> to check whether your credentials are correct. Then click <em>Save<\/em> and we can continue with our code.<\/p>\n<p>The business logic of the ingestion task is straightforward. It should query the data from the Postgres database, open a CSV file, and write the data with an appropriate header into it. With the business logic in place, our task looks like this:<\/p>\n<pre class=\"lang:python decode:true\">from airflow.decorators import task\r\nfrom airflow.providers.postgres.hooks.postgres import PostgresHook\r\n\r\n@task\r\ndef ingest_franchise_data_from_postgres(target_file: str):\r\n    import os\r\n    import csv\r\n\r\n    os.makedirs(os.path.dirname(target_file), exist_ok=True)\r\n\r\n    sql_statement = \"\"\"\r\n        select id, manager, city, street, street_number, revenue, day from stores;\r\n    \"\"\"\r\n\r\n    postgres_hook = PostgresHook(postgres_conn_id=\"postgres_franchise\")\r\n    connection = postgres_hook.get_conn()\r\n    cursor = connection.cursor()\r\n    cursor.execute(sql_statement)\r\n    result = cursor.fetchall()\r\n\r\n    with open(target_file, 'w', newline='') as file:\r\n        target_csv = csv.writer(file)\r\n        target_csv.writerow(['id', 'manager', 'city', 'street', 'street_number', 'revenue', 'day'])\r\n        for row in result:\r\n            target_csv.writerow(row)\r\n\r\n    return target_file<\/pre>\n<p>The <em>PostgresHook<\/em> requires one argument: 
<em>postgres_conn_id<\/em>. We pass it the connection id which we specified in the connection list. The task then executes our SQL statement, fetches the records, and writes them into a CSV file. Note that we import the required libraries inside of the task (local imports). This is recommended by Airflow since top-level imports affect the loading time of a DAG; you can find out more in the <a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/best-practices.html\" target=\"_blank\" rel=\"noopener\">Best Practices<\/a> section of the official Airflow documentation. Also, we pass an argument <em>target_file<\/em> to our function which indicates the path where our CSV file should be stored.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Task-Plotting-Revenue-vs-Time\"><\/span>Task: Plotting Revenue vs. Time<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>It&#8217;s time to plot some data. Create a new Python file <em>series.py<\/em> inside of the <em>plotting<\/em> directory. The task should look like this:<\/p>\n<pre class=\"lang:python decode:true \">from airflow.decorators import task\r\n\r\n@task\r\ndef plot_revenue_per_day_per_manager(source_file: str, base_dir: str):\r\n    import os\r\n    import pandas as pd\r\n    import plotly.express as px\r\n\r\n    target_file = f\"{base_dir}\/plots\/revenue_per_day_per_manager.html\"\r\n    os.makedirs(os.path.dirname(target_file), exist_ok=True)\r\n\r\n    df = pd.read_csv(source_file)\r\n    fig = px.line(df, x='day', y='revenue', color='manager', symbol='manager')\r\n    fig.update_layout(\r\n        font=dict(\r\n            size=20\r\n        )\r\n    )\r\n    fig.write_html(target_file)<\/pre>\n<p>Our task accepts the following two arguments:<\/p>\n<ul>\n<li>source_file: Location of the CSV file with our raw 
data<\/li>\n<li>base_dir: Top-level directory where we will store our plots<\/li>\n<\/ul>\n<p>Then again, we import what we need and define our <em>target_file<\/em>, which specifies the storage location of our plot. Note that we create a <em>.html<\/em> file because plotly allows us to create interactive plots. The resulting plot is a line plot with the date on the x-axis and the revenue on the y-axis. Finished! Let&#8217;s aggregate some data next!<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Task-Aggregation\"><\/span>Task: Aggregation<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>We would like to compute the average revenue per franchise\/manager. Therefore, create a new Python file <em>aggregation.py<\/em> inside of the <em>transformation<\/em> directory and copy &amp; paste the following code:<\/p>\n<pre class=\"lang:python decode:true\">from airflow.decorators import task\r\n\r\n@task\r\ndef aggregate_avg_revenue_per_manager(source_file: str, base_dir: str):\r\n    import os\r\n    import pandas as pd\r\n\r\n    target_file = f\"{base_dir}\/storage\/agg_avg_revenue_manager.json\"\r\n    pickle_file = f\"{base_dir}\/storage\/agg_avg_revenue_manager.pkl\"\r\n    os.makedirs(os.path.dirname(target_file), exist_ok=True)\r\n\r\n    df = pd.read_csv(source_file)\r\n    result = df.groupby([\"manager\", \"city\", \"street\", \"street_number\"])[[\"revenue\"]].mean()\r\n    result[\"average_revenue\"] = result[\"revenue\"]\r\n    result = result.drop(columns=[\"revenue\"]).reset_index()\r\n\r\n    result.to_pickle(pickle_file)\r\n    result.to_json(target_file, orient=\"records\")\r\n\r\n    return pickle_file<\/pre>\n<p>We define two file paths:<\/p>\n<ul>\n<li>target_file: a JSON file which holds our aggregated data<\/li>\n<li>pickle_file: holds our transformed data frame<\/li>\n<\/ul>\n<p>To aggregate the revenue, we simply 
have to compute the mean, copy it into an <em>average_revenue<\/em> column, and drop the original revenue column. Moreover, we reset the index such that we obtain a cleanly indexed data frame, which we then pickle to our local file system. Our task returns the path of the pickled file such that the next (downstream) task can plot the data.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Task-Plotting-Average-Revenue-vs-Manager\"><\/span>Task: Plotting Average Revenue vs. Manager<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>We are left with the last task, plotting the aggregated data. Create a new Python file <em>aggregation.py<\/em> inside of the <em>plotting<\/em> directory &#8211; the code should look like this:<\/p>\n<pre class=\"lang:python decode:true\">from airflow.decorators import task\r\n\r\n@task\r\ndef plot_avg_revenue_per_manager(pkl_file: str, base_dir: str):\r\n    import os\r\n    import pandas as pd\r\n    import plotly.express as px\r\n\r\n    target_file = f\"{base_dir}\/plots\/agg_avg_revenue_manager.html\"\r\n    os.makedirs(os.path.dirname(target_file), exist_ok=True)\r\n\r\n    df = pd.read_pickle(pkl_file)\r\n    fig = px.bar(df, x='manager', y='average_revenue',\r\n                 hover_data=['city', 'street', 'street_number'],\r\n                 labels={'average_revenue': \"Average Revenue\", 'manager': \"Manager\"})\r\n\r\n    fig.update_layout(\r\n        font=dict(\r\n            size=20\r\n        )\r\n    )\r\n    fig.write_html(target_file)<\/pre>\n<p>We load the data from the pickled file into a data frame and create a bar plot with the manager on the x-axis and the average revenue on the y-axis. On top of that, we enrich our plot with more information appearing on hover, like the city and street of the respective franchise. 
We also adjust the labels and write our HTML file to its target location defined by <em>target_file<\/em>.<\/p>\n<p>We are almost at the end; we only have to wire all tasks together inside our DAG definition, where we specify the dependencies between the tasks.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Finish-the-DAG-Definition\"><\/span>Finish the DAG Definition<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Add the following lines of code to our Python file\u00a0<em>pipeline.py<\/em>, where our DAG definition resides:<\/p>\n<pre class=\"lang:python decode:true\">from stores.ingestion.postgres import ingest_franchise_data_from_postgres\r\nfrom stores.transformation.aggregation import aggregate_avg_revenue_per_manager\r\nfrom stores.plotting.aggregation import plot_avg_revenue_per_manager\r\nfrom stores.plotting.series import plot_revenue_per_day_per_manager\r\n\r\n@dag(\r\n    ...\r\n)\r\ndef franchise_analysis_pipeline():\r\n    base_dir = \"\/opt\/airflow\/dags\/data\"\r\n    target_file = f\"{base_dir}\/stores.csv\"\r\n\r\n    source_file = ingest_franchise_data_from_postgres(target_file)\r\n    pkl_file = aggregate_avg_revenue_per_manager(source_file, base_dir)\r\n    plot_revenue_per_day_per_manager(source_file, base_dir)\r\n    plot_avg_revenue_per_manager(pkl_file, base_dir)\r\n\r\nfranchise_analysis_pipeline()\r\n<\/pre>\n<p>I left out the other details and only included the relevant lines of code. We pass the return values as input to the appropriate tasks, and Airflow is then able to infer the dependencies. Simple as that.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Results\"><\/span>Results<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>To kick off the pipeline, switch to the web UI and wait for the pipeline to load in the DAG list. If you don&#8217;t see your pipeline, wait a few minutes until the list refreshes.
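Airflow derives a dependency graph (the DAG) from these data flows and runs the tasks in a topologically valid order. As a plain-Python illustration (a sketch using the standard library's `graphlib`, not Airflow itself), the dependencies of our four tasks resolve like this:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on,
# mirroring the data flow in franchise_analysis_pipeline.
dependencies = {
    "ingest_franchise_data_from_postgres": set(),
    "aggregate_avg_revenue_per_manager": {"ingest_franchise_data_from_postgres"},
    "plot_revenue_per_day_per_manager": {"ingest_franchise_data_from_postgres"},
    "plot_avg_revenue_per_manager": {"aggregate_avg_revenue_per_manager"},
}

# A valid execution order: ingestion first, the aggregate plot after the aggregation.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Airflow does the same bookkeeping for us, and additionally runs independent tasks (here the two plotting branches) in parallel.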
Typos and other errors will be caught by Airflow and displayed as an error message. Another way to check whether the pipeline is working is to execute the following command in a terminal:<\/p>\n<pre class=\"lang:python decode:true\">python pipeline.py<\/pre>\n<p>If no errors pop up, our pipeline is correctly defined. When you click on the pipeline name in the web UI, a new view opens where you can click on the play button. This triggers the pipeline, and a new entry should show up in your run history. Wait for the pipeline to complete. I invite you to look around; the UI offers a lot of details to discover. One recommendation is to click on the <em>Gantt <\/em>view. If you click on the graph view, you will see the graph structure of our data pipeline as depicted in figure 4; all tasks are marked as successful, indicated by the green outline.<\/p>\n<figure style=\"width: 1109px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchetsration_dag_graph_15.png\" alt=\"DAG structure of our data pipeline\" width=\"1109\" height=\"334\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 4: DAG Structure of our Data Pipeline<\/strong><\/figcaption><\/figure>\n<p>By the way, figures 5 and 6 show the plots we generated with our pipeline. As you can see, Airflow is pretty powerful, which is why it has become so dominant in our data-driven world.
But Airflow is not perfect: in Part 2 of this blog article, you can read about Airflow&#8217;s weaknesses and how its competitors perform in comparison.<\/p>\n<figure style=\"width: 1562px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_avgrev_manager_plot_4.png\" alt=\"Average Revenue per Manager\" width=\"1562\" height=\"787\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 5: Average Revenue per Manager<\/strong><\/figcaption><\/figure>\n<figure style=\"width: 1542px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_revenue_time_plot_5.png\" alt=\"Revenue per Day\" width=\"1542\" height=\"776\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 6: Revenue of a Franchise per Day<\/strong><\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Final-Remarks\"><\/span>Final Remarks<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><strong>Note<\/strong> that this pipeline implementation in Airflow does not represent a production-grade implementation. When you have lots of data to process, you should refrain from plain Python tasks and use so-called Airflow operators instead. Handle your processing workload on external systems such as a Spark cluster, or use the\u00a0<em>KubernetesPodOperator\u00a0<\/em>to run it on a Kubernetes cluster. This has the following advantage: Airflow then only handles orchestration and does not deal with the workload itself. The business logic is outsourced to, e.g., Kubernetes, where you can scale Pods to satisfy your resource requirements far more easily than in Airflow.
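To make this "orchestrate, don't execute" idea concrete, here is a toy sketch of the pattern that operators like the <em>KubernetesPodOperator</em> implement: the task merely submits a job to an external cluster and polls its status, while the heavy lifting happens elsewhere. All names here (`FakeClusterClient`, `run_remote_job`) are made up for illustration:

```python
import time

class FakeClusterClient:
    """Stand-in for e.g. a Kubernetes or Spark API client (hypothetical)."""
    def __init__(self):
        self._polls = 0

    def submit(self, image: str, command: list) -> str:
        # The cluster does the heavy lifting; we only get a job handle back.
        return "job-1234"

    def status(self, job_id: str) -> str:
        # Pretend the job finishes after a few polls.
        self._polls += 1
        return "running" if self._polls < 3 else "succeeded"

def run_remote_job(client, image, command, poll_interval=0.01):
    # The orchestrator only submits, polls, and reports --
    # it never touches the data itself.
    job_id = client.submit(image, command)
    while (state := client.status(job_id)) == "running":
        time.sleep(poll_interval)
    return state

final_state = run_remote_job(FakeClusterClient(), "my-etl:latest", ["python", "etl.py"])
print(final_state)  # prints "succeeded"
```

In a real deployment, the client would be the Kubernetes or Spark API, and the worker only needs enough resources to wait, not to process the data.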
Since this is an advanced topic, I kept things simple for this blog article so that anyone can find an entry point into the world of Apache Airflow; after all, not everyone knows how to set up a Kubernetes or Spark cluster. At a later point, you will definitely want to check out the <em>KubernetesPodOperator<\/em> and the <em>SparkSubmitOperator<\/em>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Bonus\"><\/span>Bonus<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Of course, this article has limited space and your reading time is precious, so I cannot cover all the features that Airflow has to offer. Still, I want to mention a few more features here which are pretty useful for data engineers.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Replaying-Tasks\"><\/span>Replaying Tasks<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Sometimes one of your tasks may fail due to some unfortunate event, e.g. your database was not available because the database server crashed. To re-run your pipeline, select the failed task and then click on &#8220;Clear&#8221;. When you click on the failed task, you should see a UI view similar to the one illustrated in figure 7.<\/p>\n<figure style=\"width: 1672px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/blogpost_dataorchestration_failed_task_16.png\" alt=\"replaying pipeline due to a failed task\" width=\"1672\" height=\"614\" \/><figcaption class=\"wp-caption-text\"><strong>Figure 7: DAG run with a failed task<\/strong><\/figcaption><\/figure>\n<p>When you click on &#8220;Clear&#8221;, a confirmation message will pop up showing what changes will be made if you continue.
Note that you have several options here. I clicked on <em>Downstream<\/em> and\u00a0<em>Recursive<\/em>, which clears any tasks downstream of the selected task as well as any tasks that are cross-dependent on it, e.g. tasks in other DAGs that rely on the selected task.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Looping-over-tasks-multiple-times\"><\/span>Looping over tasks multiple times<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Sometimes you might want to run your tasks multiple times over some set of values. The following code snippet demonstrates how to accomplish this:<\/p>\n<pre class=\"lang:python decode:true\">from airflow.decorators import dag, task\r\n\r\n@task\r\ndef first_task(variable):\r\n    return variable\r\n\r\n@task\r\ndef second_task(variable):\r\n    return variable**2\r\n\r\n@dag(\r\n    dag_id=\"for_loop_dag\",\r\n    description=\"Showing how for loop works for tasks\",\r\n    ...\r\n)\r\ndef example_dag():\r\n    variables = [1, 2, 3]\r\n    for variable in variables:\r\n        intermediary_result = first_task(variable)\r\n        result = second_task(intermediary_result)\r\n\r\nexample_dag()<\/pre>\n<p>Unnecessary details are left out in this code snippet; only the relevant part of the code is shown. Essentially, we loop over three values and compute their squares. Thus, the tasks <em>first_task\u00a0<\/em>and\u00a0<em>second_task<\/em> each run three times.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Slack-Integration\"><\/span>Slack Integration<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Of course, we can also integrate Airflow with Slack, e.g. you may want to send alerts to Slack when something happens on your Airflow instance. This can be accomplished by using another provider package named <em>apache-airflow-providers-slack<\/em>, which comes with a <em>SlackWebhookHook<\/em>. This hook allows you to send messages to your Slack channels.
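Under the hood, a Slack incoming webhook is just an HTTP POST with a JSON payload, which is what the <em>SlackWebhookHook</em> wraps for you. A minimal dependency-free sketch of that payload (the webhook URL is a placeholder, and `build_slack_request` is a hypothetical helper, not part of any library):

```python
import json
import urllib.request

def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    # Slack incoming webhooks expect a JSON body with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder webhook URL; urllib.request.urlopen(req) would actually send it.
req = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXXXXXX",
    "Task aggregate_avg_revenue_per_manager failed!",
)
```

The hook adds convenience on top of this, such as reading the webhook URL from an Airflow connection instead of hard-coding it.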
If you want to find out more about this integration, have a look at the following link: <a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow-providers-slack\/stable\/_api\/airflow\/providers\/slack\/hooks\/slack_webhook\/index.html\">Slack Integration<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nowadays, we rely a lot on technology. As such every second a tremendous amount of data is being collected and processed. Companies can only utilize this tremendous amount of data by building reliable, maintainable, and robust data pipelines. A large company, especially a technology-oriented company, can have more than a thousand data pipelines. How can [&hellip;]<\/p>\n","protected":false},"author":318,"featured_media":45666,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[783,77,385,377],"service":[411],"coauthors":[{"id":318,"display_name":"Raphael Skuza","user_nicename":"rskuza"}],"class_list":["post-40924","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-airflow","tag-big-data","tag-data-engineering","tag-development","service-data-engineering"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Data Orchestration: Is Airflow Still the Best? (Part 1) - inovex GmbH<\/title>\n<meta name=\"description\" content=\"Apache Airflow is used to orchestrate pipelines. 
New competitors like Dagster and Prefect emerged, can they surpass Airflow?\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Data Orchestration: Is Airflow Still the Best? (Part 1) - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"Apache Airflow is used to orchestrate pipelines. New competitors like Dagster and Prefect emerged, can they surpass Airflow?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2023-05-23T12:52:56+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Raphael Skuza\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header-1024x576.jpg\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Raphael Skuza\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" 
content=\"22\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Raphael Skuza\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\"},\"author\":{\"name\":\"Raphael Skuza\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/126d1a5e3c6b844fa0d4708df41e9cfb\"},\"headline\":\"Data Orchestration: Is Airflow Still the Best? (Part 1)\",\"datePublished\":\"2023-05-23T12:52:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\"},\"wordCount\":3605,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg\",\"keywords\":[\"Airflow\",\"Big Data\",\"Data Engineering\",\"Development\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\",\"Infrastructure\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\",\"url\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\",\"name\":\"Data Orchestration: Is Airflow Still the Best? 
(Part 1) - inovex GmbH\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg\",\"datePublished\":\"2023-05-23T12:52:56+00:00\",\"description\":\"Apache Airflow is used to orchestrate pipelines. New competitors like Dagster and Prefect emerged, can they surpass Airflow?\",\"breadcrumb\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg\",\"width\":1920,\"height\":1080,\"caption\":\"Zeichnung von zwei Frauen, die vor einer Datentafel stehen.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.inovex.de\/de\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Data Orchestration: Is Airflow Still the Best? 
(Part 1)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.inovex.de\/de\/#website\",\"url\":\"https:\/\/www.inovex.de\/de\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.inovex.de\/de\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\/\/www.inovex.de\/de\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/inovexde\",\"https:\/\/x.com\/inovexgmbh\",\"https:\/\/www.instagram.com\/inovexlife\/\",\"https:\/\/www.linkedin.com\/company\/inovex\",\"https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/126d1a5e3c6b844fa0d4708df41e9cfb\",\"name\":\"Raphael Skuza\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/image\/a7fa52311aba9815836e166521178831\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-Raphael-Skuza-Profilbild-96x96.jpg\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-Raphael-Skuza-Profilbild-96x96.jpg\",\"caption\":\"Raphael 
Skuza\"},\"sameAs\":[\"https:\/\/de.linkedin.com\/in\/raphael-skuza\"],\"url\":\"https:\/\/www.inovex.de\/de\/blog\/author\/rskuza\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Data Orchestration: Is Airflow Still the Best? (Part 1) - inovex GmbH","description":"Apache Airflow is used to orchestrate pipelines. New competitors like Dagster and Prefect emerged, can they surpass Airflow?","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/","og_locale":"de_DE","og_type":"article","og_title":"Data Orchestration: Is Airflow Still the Best? (Part 1) - inovex GmbH","og_description":"Apache Airflow is used to orchestrate pipelines. New competitors like Dagster and Prefect emerged, can they surpass Airflow?","og_url":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2023-05-23T12:52:56+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg","type":"image\/jpeg"}],"author":"Raphael Skuza","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header-1024x576.jpg","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Raphael Skuza","Gesch\u00e4tzte Lesezeit":"22\u00a0Minuten","Written by":"Raphael 
Skuza"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/"},"author":{"name":"Raphael Skuza","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/126d1a5e3c6b844fa0d4708df41e9cfb"},"headline":"Data Orchestration: Is Airflow Still the Best? (Part 1)","datePublished":"2023-05-23T12:52:56+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/"},"wordCount":3605,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg","keywords":["Airflow","Big Data","Data Engineering","Development"],"articleSection":["Analytics","English Content","General","Infrastructure"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/","url":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/","name":"Data Orchestration: Is Airflow Still the Best? 
(Part 1) - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg","datePublished":"2023-05-23T12:52:56+00:00","description":"Apache Airflow is used to orchestrate pipelines. New competitors like Dagster and Prefect emerged, can they surpass Airflow?","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/Data-Orchestration-Header.jpg","width":1920,"height":1080,"caption":"Zeichnung von zwei Frauen, die vor einer Datentafel stehen."},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/data-orchestration-is-airflow-still-the-best-part-1\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Data Orchestration: Is Airflow Still the Best? 
(Part 1)"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/126d1a5e3c6b844fa0d4708df41e9cfb","name":"Raphael Skuza","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/image\/a7fa52311aba9815836e166521178831","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-Raphael-Skuza-Profilbild-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-Raphael-Skuza-Profilbild-96x96.jpg","caption":"Raphael 
Skuza"},"sameAs":["https:\/\/de.linkedin.com\/in\/raphael-skuza"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/rskuza\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/40924","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/318"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=40924"}],"version-history":[{"count":6,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/40924\/revisions"}],"predecessor-version":[{"id":45683,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/40924\/revisions\/45683"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/45666"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=40924"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=40924"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=40924"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=40924"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}