{"id":20288,"date":"2020-12-22T10:07:00","date_gmt":"2020-12-22T09:07:00","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=20288"},"modified":"2023-01-04T09:22:25","modified_gmt":"2023-01-04T08:22:25","slug":"airflow-orchestrating-hybrid-workloads-cloud","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/","title":{"rendered":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud"},"content":{"rendered":"<p>At inovex we use Apache Airflow as a scheduling and orchestration tool in a wide range of different applications and use cases. One very common use case is building data pipelines to load data lakes, data platforms, or whatever you want to call them.<\/p>\n<p>When we are working in a public hyperscaler environment (such as AWS) there are many different options to process the data itself. We could do the actual work on Airflow itself, Glue, EMR clusters, Lambda functions, plain EC2 instances, a Kubernetes (K8s) cluster managed by our operations team, and so on.<\/p>\n<p>In this article I want to show some details of a hybrid approach that we use to handle heterogeneous workloads when loading and managing a data lake for one of our personalization projects. 
It definitely helps if you already have some basic knowledge of Apache Airflow, AWS EMR and Kubernetes.<!--more--><\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Our-Use-Cases-Architecture-and-Technologies\" >Our Use-Cases, Architecture and Technologies<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#What-Does-Hybrid-Workload-Actually-Mean\" >What Does Hybrid Workload Actually Mean?<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Our-Hybrid-Pipeline-Approach\" >Our Hybrid Pipeline Approach<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Implementation-Details\" >Implementation Details<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#A-Basic-Airflow-App-Folder-Template\" >A Basic Airflow App Folder Template<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Running-an-EMR-Job\" >Running an EMR Job<\/a><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Running-a-K8s-Job\" >Running a K8s Job<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Overall-Example-DAG\" >Overall Example DAG<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#Summary\" >Summary<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Our-Use-Cases-Architecture-and-Technologies\"><\/span>Our Use-Cases, Architecture and Technologies<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The project I\u2019m talking about is all about personalization of an end-user-facing application, available for a wide range of different platforms. We use different data sources to calculate different styles of recommendations and characteristics of users which are used to personalize the product. The data sources we consume are mainly tracking data that reveals insights into the customers\u2019 behaviour on the platform and metadata about the users themselves (subscriptions, cancellations, \u2026) and about the <em>items<\/em> the users interact with on the platform.<\/p>\n<p>The project was started on AWS with cloud-agnostic, flexible and future-proof development in mind. For these reasons, we gradually moved from high-level AWS services (like Glue) to lower-level services like EMR. 
To give you a basic idea of what we are doing and how we are doing things, the following picture shows a high-level view of our architecture (at least for the classic data lake or \u201cbatch\u201d part of it that I\u2019m talking about in this post):<\/p>\n<figure id=\"attachment_20309\" aria-describedby=\"caption-attachment-20309\" style=\"width: 450px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-20309\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/architecture-2.png\" alt=\"Architecture overview\" width=\"450\" height=\"259\" \/><figcaption id=\"caption-attachment-20309\" class=\"wp-caption-text\">Architecture overview<\/figcaption><\/figure>\n<p>Our data is stored in multiple levels of abstraction in S3. To deliver our data products to the frontends, we use DynamoDB tables (by the way: this is not a good example of being cloud-agnostic). All processes are scheduled and orchestrated by one single Airflow instance per environment (dev, staging, pre-production, production), hosted on a simple EC2 instance.<\/p>\n<p>Before we started to use a more hybrid approach, every job was written as a PySpark job. We decided to schedule one dedicated EMR cluster per job to have flexibility on configuration (for example different hardware configurations, different software versions and independent and isolated environments for every job).<\/p>\n<h2><span class=\"ez-toc-section\" id=\"What-Does-Hybrid-Workload-Actually-Mean\"><\/span>What Does Hybrid Workload Actually Mean?<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>More and more use cases have popped up that needed libraries\/features that aren\u2019t available in Spark or that simply don\u2019t need the power of a large distributed computation framework like Spark. One example is scikit-learn, which was used by our data scientists to build some of the ML models. 
We data engineers faced the challenge that Spark ML doesn\u2019t offer all algorithms\/models. Some of the implementations simply perform really badly in comparison to scikit-learn models or are not flexible enough in terms of configuration. Another use case where Spark seemed to be the wrong tool is pretty small tasks that only process a few megabytes of data.<\/p>\n<p>Some of our jobs still needed to aggregate large amounts of data for pre-processing. One first idea was to keep the heavy lifting (pre-processing) in Spark, transform the Spark dataframes to Pandas and do the actual model training or predictions on the driver node. We experimented with <span class=\"lang:default decode:true crayon-inline\">df.toPandas()<\/span>\u00a0 to handle this. But even in conjunction with <a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-pyspark-pandas-with-arrow.html\">Apache Arrow optimizations<\/a> it led to various memory problems for large workloads (at the time of our experiments, the Apache Arrow integration was not well documented and still felt like a kind of beta; at the time of writing this post, it seems to be more stable and we definitely should have a look at it again).<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Our-Hybrid-Pipeline-Approach\"><\/span>Our Hybrid Pipeline Approach<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We then started to build our first hybrid pipeline that makes use of both worlds: Spark for the hard work, and plain Python scripts for the scikit-learn parts. The only missing part was a runtime environment for the scikit-learn\/Python scripts. Using the master node of the EMR cluster felt like a bad idea because we didn\u2019t want an ML model training or prediction job to interfere with all the processes running on the master. EC2 seemed to be too low-level since we then would\u2019ve needed to configure and manage a lot of stuff we\u2019re not experts in. 
So we decided to switch to tiny and lightweight containers scheduled on a self-hosted Kubernetes cluster which is fully managed by a vertical team (of course you can run your containers on any container platform you like). With that approach, all jobs are independent of each other and can have isolated resources. The necessary data exchange between single stages is done with intermediate S3 buckets, as shown in the following illustration:<\/p>\n<figure id=\"attachment_20310\" aria-describedby=\"caption-attachment-20310\" style=\"width: 2251px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-20310 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/generic_ml_pipeline-1.png\" alt=\"Concept for hybrid data pipeline\" width=\"2251\" height=\"836\" \/><figcaption id=\"caption-attachment-20310\" class=\"wp-caption-text\">Concept for our hybrid data pipeline<\/figcaption><\/figure>\n<p>This gives us more overall calculation speed than transforming from Spark to Pandas and vice versa (at least for really large jobs) and has the advantage of high flexibility regarding the used frameworks or libraries for the intermediate pipeline steps.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Implementation-Details\"><\/span>Implementation Details<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Airflow is a very generic technology and thus gives us the flexibility to build and integrate any runtime we want. 
Still, we settled on three basic Airflow DAG patterns, and all of our jobs follow one of them:<\/p>\n<ul>\n<li>Spark-only processing: Start EMR cluster \u2192 schedule Spark job \u2192 terminate EMR cluster<\/li>\n<li>K8s-only processing: Schedule K8s job<\/li>\n<li>Generic processing: Start EMR cluster \u2192 schedule Spark job (pre-processing) \u2192 schedule K8s job (ML training\/predictions) \u2192 schedule Spark or K8s job (post-processing) \u2192 terminate EMR cluster<\/li>\n<\/ul>\n<p>This has the effect that we need to use only a very small number of different Airflow operators that are all more or less easy to use and high-level in the sense of integration into Airflow and the cloud environment. Furthermore, we strictly separate business logic from scheduling code. That way we could easily switch the scheduling tool, runtime environment or cloud provider. For example, if we wanted to switch from AWS to GCP, the only thing we would need to do on the scheduling side is switch the EMR operators to, for example, Google Dataproc operators. In addition, it leads to faster development cycles because we implemented all the Airflow pipeline logic in generic libraries that can be re-used over and over again. 
One more positive aspect of the code separation is that we can deploy Airflow code and the business logic itself separately.\u00a0<a href=\"https:\/\/medium.com\/bluecore-engineering\/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753\">This post<\/a> describes a similar approach of using only a small number of different Airflow operators and shows more aspects in detail.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"A-Basic-Airflow-App-Folder-Template\"><\/span>A Basic Airflow App Folder Template<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>All files that belong to one specific DAG are collected in one common sub-folder with the following structure:<\/p>\n<pre class=\"lang:default decode:true\">my_app\r\n\u00a0\u00a0\u00a0\u00a0job_configs\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0my_job.json\r\n\u00a0\u00a0\u00a0\u00a0cluster_configs\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0my_cluster.json\r\n\u00a0\u00a0\u00a0\u00a0my_app_dag.py<\/pre>\n<p>This keeps all the files in a clear and repeatable structure and helps developers as an initial starting point for their DAGs.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Running-an-EMR-Job\"><\/span>Running an EMR Job<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>With our own EMR cluster DAG library mentioned in the previous chapter starting an EMR job from Airflow is as simple as:<\/p>\n<pre class=\"lang:python decode:true\">app_folder = configuration.conf.get(\"core\", \"dags_folder\") + \"\/my_app\"\r\njob_config_folder = app_folder + \"\/job_configs\"\r\ncluster_config_folder = app_folder + \"\/cluster_configs\"\r\n\r\ndag = DAG(\r\n  dag_id=\"my_dag\",\r\n  schedule_interval=\"0 1 * * *\",\r\n  template_searchpath=cluster_config_folder\r\n)\r\n\r\nmy_cluster = Cluster(dag=dag, cluster_name=\"my_cluster\")\r\nmy_cluster.add_step(job_config_file=f\"{job_config_folder}\/my_job.json\")\r\nmy_cluster.build_sequential_flow()<\/pre>\n<p>One DAG built this way is shown in the example 
below:<\/p>\n<figure id=\"attachment_20298\" aria-describedby=\"caption-attachment-20298\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-20298 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-1024x233.png\" alt=\"Airflow DAG for simple EMR workflow\" width=\"1024\" height=\"233\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-1024x233.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-300x68.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-768x174.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-1536x349.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-2048x465.png 2048w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-1920x436.png 1920w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-400x91.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_example_dag-360x82.png 360w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-20298\" class=\"wp-caption-text\">Airflow DAG for simple EMR workflow<\/figcaption><\/figure>\n<p>As you can see, there are a few additional tasks around the job itself: We have to actively wait for the cluster to come up before we can add the job, and we have to wait for the job to be finished before terminating the cluster. To ensure that no cluster keeps running idle for a long time without doing work, we add the parameter <span class=\"lang:default decode:true crayon-inline \">trigger_rule=&quot;all_done&quot;<\/span>\u00a0 to the cluster termination task. 
This way, the cluster also terminates after a job has failed.<\/p>\n<p>The cluster config is loaded via Airflow\u2019s template mechanism and is just a JSON file with all the configurations needed for starting the EMR cluster (for example instance types and counts, EMR version, path to bootstrap scripts, \u2026). The job config file is also a simple, templated JSON file which looks similar to this:<\/p>\n<pre class=\"lang:default decode:true\">{\r\n  \"SparkArgs\": {\r\n      \"--deploy-mode\": \"cluster\",\r\n      \"--packages\": \"mysql:mysql-connector-java:8.0.11\"\r\n  },\r\n  \"ScriptPath\": \"{{ config.DEPLOYMENT_BUCKET }}\/my_app\/my_app.py\",\r\n  \"JobParams\": {\r\n      \"--stage\": \"{{ config.ENV }}\",\r\n      \"--execution_date\": \"{{ ds }}\"\r\n  }\r\n}\r\n<\/pre>\n<p>It\u2019s basically parsed into a spark-submit command which is then executed on the EMR cluster as a step with the generic <a href=\"https:\/\/docs.aws.amazon.com\/emr\/latest\/ReleaseGuide\/emr-commandrunner.html\">command-runner<\/a>. 
The parsed spark-submit command for the example above would look like this:<\/p>\n<pre class=\"lang:sh decode:true\">spark-submit \\\r\n  --deploy-mode cluster \\\r\n  --packages mysql:mysql-connector-java:8.0.11 \\\r\n  s3:\/\/my-deployment-bucket\/my_app\/my_app.py \\\r\n  --stage prod \\\r\n  --execution_date 2020-12-24<\/pre>\n<p>We can also plug arbitrary tasks into the automatically built workflow this way:<\/p>\n<pre class=\"lang:python decode:true\">my_cluster = Cluster(dag=dag, cluster_name=\"my_cluster\")\r\nadd_my_job, check_my_job = my_cluster.add_step(job_config_file=f\"{job_config_folder}\/my_job.json\")\r\nmy_cluster.build_sequential_flow()\r\n\r\nsome_upstream_task = BashOperator(...)\r\nsome_downstream_task = BashOperator(...)\r\n\r\nsome_upstream_task &gt;&gt; add_my_job\r\ncheck_my_job &gt;&gt; some_downstream_task\r\n<\/pre>\n<p>The resulting DAG looks like this:<\/p>\n<figure id=\"attachment_20296\" aria-describedby=\"caption-attachment-20296\" style=\"width: 926px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-20296\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-1024x250.png\" alt=\"Airflow DAG for EMR workflow with additional tasks\" width=\"926\" height=\"226\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-1024x250.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-300x73.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-768x187.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-1536x375.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-2048x500.png 2048w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-1920x468.png 1920w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-400x98.png 
400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/emr_extended_example_dag-360x88.png 360w\" sizes=\"auto, (max-width: 926px) 100vw, 926px\" \/><figcaption id=\"caption-attachment-20296\" class=\"wp-caption-text\">Airflow DAG for EMR workflow with additional tasks<\/figcaption><\/figure>\n<h3><span class=\"ez-toc-section\" id=\"Running-a-K8s-Job\"><\/span>Running a K8s Job<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>As you can see in the previous chapter, an EMR workflow involves a lot of boilerplate and overhead before the job itself can be started. For jobs that should be scheduled as a Kubernetes Pod, we use the <a href=\"https:\/\/airflow.apache.org\/docs\/stable\/kubernetes.html#kubernetespodoperator\">KubernetesPodOperator<\/a>. None of the overhead is needed here. It directly launches Pods, continuously checks their state and gathers logs from the containers.<\/p>\n<p>The following code snippet shows how we start jobs as a Kubernetes Pod from Airflow:<\/p>\n<pre class=\"lang:python mark:16,18,21 decode:true\">docker_tag = Variable.get(\"docker_tag\")\r\n\r\nresources = {\r\n  \"request_memory\": \"8000Mi\",\r\n  \"limit_memory\": \"8000Mi\",\r\n}\r\n\r\nmy_k8s_job = KubernetesPodOperator(\r\n  name=\"my_k8s_job\",\r\n  task_id=\"my_k8s_job\",\r\n  dag=dag,\r\n  image=f\"&lt;url-to-our-harbor&gt;\/my_k8s_job:{docker_tag}\",\r\n  arguments=[\"--date\", \"{{ ds }}\", \"--stage\", \"{{ config.ENV }}\"],\r\n  namespace=\"&lt;my-k8s-namespace&gt;\",\r\n  config_file=\"\/home\/ec2-user\/.kube\/config\",\r\n  get_logs=True,\r\n  in_cluster=False,\r\n  do_xcom_push=False,\r\n  resources=resources,\r\n  image_pull_secrets=\"&lt;my_pull_secret&gt;\",\r\n  is_delete_operator_pod=True,\r\n)\r\n<\/pre>\n<p>We build a pretty lightweight Docker image per job that simply contains the Python code and its dependencies, and runs it. 
The latest Docker version tag is injected into Airflow during the deployment process via Airflow variables. Please note: Kubernetes recently <a href=\"https:\/\/kubernetes.io\/blog\/2020\/12\/02\/dont-panic-kubernetes-and-docker\/\">deprecated the Docker runtime<\/a>. That doesn&#8217;t mean you can&#8217;t use Docker containers in Kubernetes or Docker as a development tool. But if you&#8217;re interested in other container solutions, you can find a comparison of alternative tools in <a href=\"https:\/\/www.inovex.de\/blog\/containers-docker-containerd-nabla-kata-firecracker\/\">this article<\/a>.<\/p>\n<p>One interesting parameter is <span class=\"lang:default decode:true crayon-inline\">get_logs=True<\/span>. With that set, you can see all the logs printed in your container in the Airflow UI.<\/p>\n<p>With <span class=\"lang:default decode:true crayon-inline \">do_xcom_push=True<\/span>\u00a0you can pass results from your container to the downstream tasks via the <a href=\"https:\/\/airflow.apache.org\/docs\/stable\/concepts.html?highlight=xcom#xcoms\">XCom concept<\/a>. Please note that this is only suitable for smaller results because XCom values are stored in the meta database of your Airflow installation.<\/p>\n<p>One problem we faced with the KubernetesPodOperator is <a href=\"https:\/\/issues.apache.org\/jira\/browse\/AIRFLOW-3534\">this bug<\/a> which lets your long-running Pod crash when it doesn\u2019t produce any logs for a while. We can overcome this issue with a background thread periodically writing a dummy <em>alive log<\/em> to the logs (as suggested in a Jira ticket).<\/p>\n<p>Another recommendation I can give is to set <span class=\"lang:default decode:true crayon-inline \">is_delete_operator_pod=True<\/span>\u00a0so that the finished Pod gets cleaned up afterwards and doesn&#8217;t pollute your Kubernetes namespace with terminated Pods. Be careful: You lose all the logs and details of failed Pods. 
Most of the time, the collected logs in Airflow are detailed enough for debugging. If you want to keep all those details in Kubernetes itself for a while, you can leave the Pods in place and configure <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/pods\/pod-lifecycle\/#pod-garbage-collection\">Kubernetes garbage collection<\/a> to clean them up automatically.<\/p>\n<p>Kubernetes also has the concept of a Job, which fits our one-time jobs very naturally. It would be a great idea to implement a Pod Operator for Airflow based on the <a href=\"https:\/\/kubernetes.io\/docs\/concepts\/workloads\/controllers\/job\/\">Kubernetes Jobs-API<\/a>. At the time of writing, that\u2019s not available yet.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Overall-Example-DAG\"><\/span>Overall Example DAG<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The following code snippet shows a complete example of a hybrid workflow orchestrated as an Airflow DAG.<\/p>\n<pre class=\"lang:python decode:true \">app_folder = (\r\n  configuration.conf.get(\"core\", \"dags_folder\") + \"\/app_\"\r\n)\r\n\r\njob_config_folder = app_folder + \"\/job_configs\"\r\ncluster_config_folder = app_folder + \"\/cluster_configs\"\r\n\r\ndefault_args = {\r\n  \"start_date\": datetime(2020, 12, 1),\r\n}\r\n\r\ndag = DAG(\r\n  dag_id=\"generic_model_pipeline_example\",\r\n  default_args=default_args,\r\n  schedule_interval=\"0 1 * * *\",\r\n  template_searchpath=cluster_config_folder,\r\n)\r\n\r\n# This is the location in S3 which is used for data exchange between the single pipeline steps\r\njob_params = {\r\n  \"s3_result_path\": \"s3:\/\/intermediate-results-bucket\/recommendation-job\/intermediate_results\"\r\n}\r\n\r\nexample_cluster = Cluster(dag=dag, cluster_name=\"example_app\")\r\n\r\nadd_preprocess_step, check_preprocess_step = example_cluster.add_step(\r\n  job_name=\"example_preprocess\",\r\n  job_config_file=f\"{job_config_folder}\/example_preprocess_job.json\",\r\n  
job_params=job_params,\r\n)\r\n\r\nresources = {\r\n  \"request_memory\": \"8000Mi\",\r\n  \"limit_memory\": \"8000Mi\",\r\n}\r\n\r\ndocker_tag = Variable.get(\"docker_tag\")\r\n\r\nrecommend_step = KubernetesPodOperator(\r\n  name=\"example_recommendation_task\",\r\n  task_id=\"example_recommendation_task\",\r\n  dag=dag,\r\n  image=f\"url.to.our.harbor\/repo_name\/my_recommendation_task:{docker_tag}\",\r\n  arguments=[\r\n    \"--s3_result_path\",\r\n    job_params[\"s3_result_path\"],\r\n    \"--source_path\",\r\n    add_preprocess_step.task_id,\r\n  ],\r\n  namespace=\"my-k8s-namespace\",\r\n  config_file=\"\/home\/ec2-user\/.kube\/config\",\r\n  get_logs=True,\r\n  in_cluster=False,\r\n  do_xcom_push=False,\r\n  resources=resources,\r\n  image_pull_secrets=\"name-of-pull-secret\",\r\n  is_delete_operator_pod=True,\r\n)\r\n\r\nadd_postprocess_step, _ = example_cluster.add_step(\r\n  job_name=\"example_postprocess\",\r\n  job_config_file=f\"{job_config_folder}\/example_postprocess_job.json\",\r\n  job_params={**job_params, **{\"source_path\": recommend_step.task_id}},\r\n)\r\n\r\nexample_cluster.build_sequential_flow()\r\n\r\ncheck_preprocess_step &gt;&gt; recommend_step &gt;&gt; add_postprocess_step\r\n\r\n<\/pre>\n<p>As you can see, every task writes its intermediate results to S3 under a key named after its task ID. We feed every task with the task ID of its predecessor, where it can find the data it needs for its job. 
The rendered DAG looks like this:<\/p>\n<figure id=\"attachment_20299\" aria-describedby=\"caption-attachment-20299\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-20299 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/overall_example_dag-1024x159.png\" alt=\"Airflow DAG for complete, hybrid pipeline\" width=\"1024\" height=\"159\" \/><figcaption id=\"caption-attachment-20299\" class=\"wp-caption-text\">Airflow DAG for complete, hybrid pipeline<\/figcaption><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Summary\"><\/span>Summary<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>In this article I showed that we face different types of workloads that we want to run in the cloud or on other arbitrary runtimes. I presented one possible concept of orchestrating hybrid workloads with Airflow. There are many more ways to handle heterogeneous jobs, but this one works very well for our project setup: we initially started with EMR and needed a quick way to integrate other types of jobs into our pipelines without migrating all the existing Spark jobs to a new runtime environment such as Spark on Kubernetes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At inovex we use Apache Airflow as a scheduling and orchestration tool in a wide range of different applications and use cases. One very common use case is building data pipelines to load data lakes, data platforms or however you want to call it. 
When we are working in a public hyperscaler environment (such as [&hellip;]<\/p>\n","protected":false},"author":206,"featured_media":20403,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[385,206],"service":[414,411,423],"coauthors":[{"id":206,"display_name":"Julian Seither","user_nicename":"jseither"}],"class_list":["post-20288","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-data-engineering","tag-data-science","service-cloud","service-data-engineering","service-kubernetes"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Apache Airflow: Orchestrating Hybrid Workloads in the Cloud<\/title>\n<meta name=\"description\" content=\"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache Spark on EMR.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud\" \/>\n<meta property=\"og:description\" content=\"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache Spark on EMR.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta 
property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2020-12-22T09:07:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-01-04T08:22:25+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Julian Seither\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Julian Seither\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"13\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Julian Seither\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/\"},\"author\":{\"name\":\"Julian Seither\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/0a3268d44503c4d0d32fbeb6f1129b94\"},\"headline\":\"Apache Airflow: Orchestrating Hybrid Workloads in the 
Cloud\",\"datePublished\":\"2020-12-22T09:07:00+00:00\",\"dateModified\":\"2023-01-04T08:22:25+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/\"},\"wordCount\":2058,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/12\\\/apache-airflow-orchestration.png\",\"keywords\":[\"Data Engineering\",\"Data Science\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/\",\"name\":\"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/12\\\/apache-airflow-orchestration.png\",\"datePublished\":\"2020-12-22T09:07:00+00:00\",\"dateModified\":\"2023-01-04T08:22:25+00:00\",\"description\":\"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache 
Spark on EMR.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/12\\\/apache-airflow-orchestration.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2020\\\/12\\\/apache-airflow-orchestration.png\",\"width\":1920,\"height\":1080,\"caption\":\"Apache Airflow in the center orchestrating clouds and local resources\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/airflow-orchestrating-hybrid-workloads-cloud\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex
GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/0a3268d44503c4d0d32fbeb6f1129b94\",\"name\":\"Julian Seither\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg35f978bb618834bfd2353e7390e16e33\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg\",\"caption\":\"Julian Seither\"},\"description\":\"I'm a Data Engineer and Architect, interested in designing and implementing various types of data platforms and streaming applications in the cloud as well as on premise.\",\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/in\\\/julian-seither-34ba40139\\\/\"],\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/jseither\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud","description":"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache Spark on EMR.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/","og_locale":"de_DE","og_type":"article","og_title":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud","og_description":"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache Spark on EMR.","og_url":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2020-12-22T09:07:00+00:00","article_modified_time":"2023-01-04T08:22:25+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png","type":"image\/png"}],"author":"Julian Seither","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Julian Seither","Gesch\u00e4tzte Lesezeit":"13\u00a0Minuten","Written by":"Julian Seither"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/"},"author":{"name":"Julian 
Seither","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/0a3268d44503c4d0d32fbeb6f1129b94"},"headline":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud","datePublished":"2020-12-22T09:07:00+00:00","dateModified":"2023-01-04T08:22:25+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/"},"wordCount":2058,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png","keywords":["Data Engineering","Data Science"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/","url":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/","name":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png","datePublished":"2020-12-22T09:07:00+00:00","dateModified":"2023-01-04T08:22:25+00:00","description":"This article describes a hybrid approach that we use to manage a data lake by handling heterogeneous workloads with the help of Apache Airflow, Kubernetes and Apache Spark on 
EMR.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/12\/apache-airflow-orchestration.png","width":1920,"height":1080,"caption":"Apache Airflow in the center orchestrating clouds and local resources"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/airflow-orchestrating-hybrid-workloads-cloud\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Apache Airflow: Orchestrating Hybrid Workloads in the Cloud"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex
GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/0a3268d44503c4d0d32fbeb6f1129b94","name":"Julian Seither","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg35f978bb618834bfd2353e7390e16e33","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-36713720_1746291525467658_8163086856494252032_n-96x96.jpg","caption":"Julian Seither"},"description":"I'm a Data Engineer and Architect, interested in designing and implementing various types of data platforms and streaming applications in the cloud as well as on 
premise.","sameAs":["https:\/\/www.linkedin.com\/in\/julian-seither-34ba40139\/"],"url":"https:\/\/www.inovex.de\/de\/blog\/author\/jseither\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20288","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/206"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=20288"}],"version-history":[{"count":2,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20288\/revisions"}],"predecessor-version":[{"id":40945,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/20288\/revisions\/40945"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/20403"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=20288"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=20288"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=20288"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=20288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}