{"id":19256,"date":"2020-09-17T07:59:49","date_gmt":"2020-09-17T05:59:49","guid":{"rendered":"https:\/\/www.inovex.de\/blog\/?p=19256"},"modified":"2024-08-29T08:04:43","modified_gmt":"2024-08-29T06:04:43","slug":"isolated-virtual-environments-pyspark","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/","title":{"rendered":"A Case for Isolated Virtual Environments with PySpark"},"content":{"rendered":"<p>This blog post motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.<!--more--><\/p>\n<p>When developing and shipping Python applications, one challenge is dependency management. Unlike in other languages such as C++, it is not possible to install Python packages in many different versions to a central location on the host machine and simply link the correct one into each application. Nor does Python offer Java\u2019s option of building a \u201cfat jar\u201d that ships a package together with all of its dependencies. Starting with the Python version itself, each application needs a specific Python installation, and this installation has to provide the exact package dependencies the code expects. This becomes an issue at the latest when two Python applications with conflicting requirements are supposed to run on the same system.<\/p>\n<figure style=\"width: 492px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/imgs.xkcd.com\/comics\/python_environment.png\" alt=\"python_environment.png\" width=\"492\" height=\"487\" \/><figcaption class=\"wp-caption-text\">https:\/\/xkcd.com\/1987\/<\/figcaption><\/figure>\n<p>To avoid ending up in a situation like the one above, virtual environments are a great way to encapsulate a Python application. 
A virtual environment creates an installation directory that contains Python itself, or a symlink to it, as well as all dependency packages a specific application requires. Ideally, this virtual environment is completely independent of any global Python installation, so all packages installed inside it are isolated from other Python applications on the host system. This not only allows multiple applications and their specific dependencies to be installed side by side, it also helps avoid accidental breaking changes caused by updates to global libraries.<\/p>\n<p>Being able to run multiple Python applications independently of each other is not just important on single-host systems. In distributed computing on large clusters, running a Python application on each worker node requires that the same specific execution environment be available throughout the cluster. In the case of <a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener\">Apache Spark<\/a>, the official Python API \u2013 also known as PySpark \u2013 has grown immensely in popularity in recent years. Spark itself is written in Scala, so each executor in the cluster runs a Java Virtual Machine. The illustration below shows the schematic architecture of a Spark cluster, in this case managed by YARN. To run Python code on the executors, a Python subprocess is launched per executor; it interprets the PySpark code and translates it into Spark RDD operations. While the PySpark API provides a large set of built-in functions, with more and more complex Spark applications being written in Python there is also a growing need for user-defined functions, also known as UDFs. 
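To make this concrete, here is a minimal sketch of a classic row-at-a-time UDF; the function and column names are invented for illustration, and the PySpark registration is shown only in comments so the core runs without a cluster:

```python
# A row-at-a-time Python UDF: Spark serializes each value, ships it to the
# Python worker process on the executor, applies the function, and sends
# the result back to the JVM.
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9.0 / 5.0 + 32.0

# On a cluster this would be registered and used roughly like this
# (hypothetical DataFrame `df` with a `temp_c` column):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   to_f = udf(celsius_to_fahrenheit, DoubleType())
#   df.select(to_f("temp_c"))

print(celsius_to_fahrenheit(100.0))  # 212.0
```

Each row crossing the JVM/Python boundary individually is what made such UDFs slow compared to Scala.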
Additionally, with the introduction of vectorized pandas_udfs in Spark 2.3, the performance gap to the Scala API has been mostly bridged, making it much more attractive to use custom Python code in a distributed manner. This is where the problem begins.<\/p>\n<figure id=\"attachment_19398\" aria-describedby=\"caption-attachment-19398\" style=\"width: 1024px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-19398 size-large\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-1024x487.png\" alt=\"pyspark_yarn_architecture.png\" width=\"1024\" height=\"487\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-1024x487.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-300x143.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-768x365.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-1536x731.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-2048x975.png 2048w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-1920x914.png 1920w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-400x190.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/07\/Screenshot-2020-07-31-at-15.05.19-360x171.png 360w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption id=\"caption-attachment-19398\" class=\"wp-caption-text\">Drawn after https:\/\/miro.medium.com\/max\/3868\/1*BDfKR9VMg-E6twBBJEhC6g.png<\/figcaption><\/figure>\n<p>Since one might want to use a Python UDF that relies on dependencies to other Python packages such as pandas or numpy, these dependencies have to be available on each worker node in the cluster. 
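As an illustration, here is a hedged sketch of a vectorized UDF whose body depends on pandas and NumPy \u2013 exactly the kind of function that fails at runtime if those packages are missing on a worker. The names are invented, and the pandas_udf wrapping is commented out so the core runs locally:

```python
import numpy as np
import pandas as pd

def log1p_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    # This body executes inside the Python worker on every executor,
    # so numpy and pandas must be importable on every node.
    return pd.Series(np.log1p(a / b))

# On the cluster, the function would be wrapped as a vectorized UDF:
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import DoubleType
#   ratio_udf = pandas_udf(log1p_ratio, DoubleType())
#   df.select(ratio_udf("numerator", "denominator"))
```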
As many large clusters are shared between multiple teams and managed by yet another team, it is not always possible to meet every developer\u2019s needs with one common installation. Of course, the most popular packages could come pre-installed, but this would limit the functionality of the cluster, and we would quickly end up in exactly the situation that motivated us to use virtual environments on our local system. Another option would be to grant every Data Scientist SSH access to the cluster, allowing them to install their own virtual environments. What could possibly go wrong?!<\/p>\n<p>There are multiple solutions to this problem, one of which was introduced in this earlier <a href=\"https:\/\/www.inovex.de\/blog\/managing-isolated-environments-with-pyspark\/\" target=\"_blank\" rel=\"noopener\">blog post by Florian Wilhelm<\/a>. He outlines a pragmatic solution that extracts Python wheel files into an HDFS directory, thereby creating an isolated environment. The files are then individually added to the SparkContext using the built-in sc.addFile function, which causes them to be distributed on the cluster as new nodes are spawned. While this solution solves the problem, it comes with several drawbacks that I want to address here. First, creating such an environment can be quite a hassle, especially when dealing with many different dependencies and, in turn, their transitive dependencies. Some packages do not even offer wheel files via pip, in which case a workaround is needed. For many Data Scientists, who might not be very familiar with the command line, this can present a significant hurdle. Environments may be copied from other projects, which might work for a while but causes trouble down the line. These HDFS environments are also quite hard to update, since there is no consistency check when packages are manually replaced. The second issue with this approach is that the extracted environment consists of thousands of tiny files. This is anything but ideal for HDFS, which is optimized for large files: every small file adds metadata overhead on the NameNode and slows down distribution. 
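To get a feeling for the file counts involved, this small sketch counts the files under the currently active Python installation (any interpreter prefix will do; a full conda environment with pandas and friends typically contains far more):

```python
import sys
from pathlib import Path

# Count every regular file below the active interpreter's prefix.
# Shipping each of these to HDFS individually is what makes
# job startup so slow in the approach described above.
n_files = sum(1 for p in Path(sys.prefix).rglob("*") if p.is_file())
print(f"{n_files} files under {sys.prefix}")
```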
There have been cases when Spark jobs took close to an hour to start because very large environments containing thousands of small files had to be distributed on launch.<\/p>\n<p>I want to present a different approach that does not require the isolated environments to be created on HDFS. The spark-submit command offers an option to include an archive when launching a Spark job. The specified archive gets sent to the driver and executor nodes where it is automatically extracted. In combination with a tool called <a href=\"https:\/\/conda.github.io\/conda-pack\/\">conda-pack<\/a>, the same conda environment used locally for development can thus be used on the cluster to run the job. Keep reading to see how it works.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_79_2 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#Creating-a-Virtual-Environment-with-Conda\" >Creating a Virtual Environment with Conda<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#Creating-an-Environment-Archive-with-conda-pack\" >Creating an Environment Archive with conda-pack<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#Distributing-the-environment-on-the-cluster\" >Distributing the environment on the cluster<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-4\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Creating-a-Virtual-Environment-with-Conda\"><\/span>Creating a Virtual Environment with Conda<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><a href=\"https:\/\/docs.conda.io\/projects\/conda\/en\/latest\/index.html\" target=\"_blank\" rel=\"noopener\">Conda<\/a> is an open-source tool that combines extensive virtual environment functionality with package management for all kinds of languages, including Python. This flexibility, and the fact that conda environments come with their own Python installation, makes it my virtual environment framework of choice. The approach proposed here will probably work similarly with virtualenv or pipenv and creating the archive manually.<\/p>\n<p>I recommend <a href=\"https:\/\/docs.conda.io\/projects\/conda\/en\/latest\/user-guide\/install\/\" target=\"_blank\" rel=\"noopener\">installing miniconda<\/a>, a lean distribution of conda that does not contain any packages by default. Anaconda, in comparison, already includes the most popular packages but is also several gigabytes in size. All of the packages we need can simply be installed with the conda package manager.<\/p>\n<p>Once conda is installed, we can create a conda environment by defining an <em>environment.yaml<\/em> file:<\/p>\n<pre class=\"\">name: my_venv\r\nchannels:\r\n  - defaults\r\ndependencies:\r\n  - python=3.7\r\n  - numpy=1.16\r\n  - pandas=0.24\r\n  - pyarrow=0.13\r\n  - pip=19.2\r\n  - pip:\r\n      - conda-pack==0.4.0<\/pre>\n<p>Using this file, we can simply run:<\/p>\n<pre class=\"\">conda env create -f environment.yaml<\/pre>\n<p>and the environment is created. 
Afterwards we activate the environment by running:<\/p>\n<pre class=\"\">conda activate my_venv<\/pre>\n<p>We can now go ahead and develop our own Python package, or just write a simple PySpark job containing the custom code we want to run on the cluster. Once we are finished \u2013 in the case of a proper Python package \u2013 we can install it into the conda environment by running:<\/p>\n<pre class=\"\">pip install .<\/pre>\n<h2><span class=\"ez-toc-section\" id=\"Creating-an-Environment-Archive-with-conda-pack\"><\/span>Creating an Environment Archive with conda-pack<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Conda saves each environment in an isolated installation directory, by default inside the user\u2019s home directory. Under <em>~\/miniconda3\/envs\/my_venv<\/em> you will find a folder structure that looks a lot like the root directory of a Linux machine. To send the conda environment to the Spark nodes without shipping countless small individual files, we pack the whole environment into an archive. For this we use conda-pack, <a href=\"https:\/\/conda.github.io\/conda-pack\/\" target=\"_blank\" rel=\"noopener\">a command line tool for creating relocatable conda environments<\/a>.<\/p>\n<p>The tool packs the whole conda environment directory into a gzipped tarball with a single command:<\/p>\n<pre class=\"\">conda pack -f -o environment.tar.gz<\/pre>\n<p>This can take a couple of minutes, depending on the size of your environment. 
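Before shipping the archive, it is worth checking that it lists and extracts cleanly. A hedged shell sketch \u2013 it builds a tiny stand-in archive so it can run anywhere; with a real conda-pack output you would simply run tar -tzf environment.tar.gz:

```shell
#!/bin/sh
set -e
# Build a stand-in archive the way conda-pack lays one out: paths are
# relative to the environment root, so they extract straight into a folder.
mkdir -p demo_env/bin
printf '#!/bin/sh\necho hello\n' > demo_env/bin/run.sh
tar -czf environment_demo.tar.gz -C demo_env .
# List the contents without extracting
tar -tzf environment_demo.tar.gz
# Extract into a target directory, as Spark does under the archive alias
mkdir -p unpacked
tar -xzf environment_demo.tar.gz -C unpacked
ls unpacked/bin
```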
When it is done, you should see the <em>environment.tar.gz<\/em> file in your current directory.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Distributing-the-environment-on-the-cluster\"><\/span>Distributing the environment on the cluster<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Assuming we have a PySpark script ready to go, we can now launch a Spark job and include our archive using spark-submit:<\/p>\n<pre class=\"\">PYTHON_CONDA='.\/environment\/bin\/python'\r\nexport PYSPARK_PYTHON=${PYTHON_CONDA}\r\nspark-submit --master yarn --deploy-mode cluster \\\r\n--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=${PYTHON_CONDA} \\\r\n--num-executors 5 --executor-cores 5 \\\r\n--driver-memory 8g --executor-memory 16g \\\r\n--archives .\/environment.tar.gz#environment \\\r\nstart.py<\/pre>\n<p>The interesting part here is the <em>--archives<\/em> option.<\/p>\n<p>By appending <em>#environment<\/em> to the archive path, we define an alias under which the contents of the archive will be available on the Spark nodes. That means we not only make sure our archive is extracted on the cluster, we can even use our own Python installation there by pointing the <em>PYSPARK_PYTHON<\/em> variable at it. Since the Python version on the workers has to be identical to the Python version on the driver, this can prevent errors, especially in client mode.<\/p>\n<p>And that\u2019s it!<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As extensive as the PySpark API is, sometimes built-in functionality alone is not enough. While the public cloud is becoming more and more popular for Spark development, and developers have more freedom to spin up their own private clusters in the spirit of DevOps, many companies still run large on-premise clusters. Especially in these setups, it is important for developers to be able to distribute their custom Python environments. 
Conda and conda-pack provide a great way to encapsulate PySpark applications and run them on managed Spark clusters without needing direct access to the nodes. The approach also works for interactive Spark sessions, i.e. when developing a Proof of Concept in <a href=\"https:\/\/jupyter.org\/\" target=\"_blank\" rel=\"noopener\">Jupyter<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This blog post motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.<\/p>\n","protected":false},"author":240,"featured_media":19774,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[77,206,105],"service":[411,431],"coauthors":[{"id":240,"display_name":"Jannis Madison","user_nicename":"jmadison"}],"class_list":["post-19256","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-big-data","tag-data-science","tag-spark","service-data-engineering","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>A Case for Isolated Virtual Environments with PySpark - inovex GmbH<\/title>\n<meta name=\"description\" content=\"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Case for Isolated Virtual Environments with 
PySpark - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2020-09-17T05:59:49+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-08-29T06:04:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Jannis Madison\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jannis Madison\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"8\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Jannis Madison\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\"},\"author\":{\"name\":\"Jannis Madison\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/5b7efc354e22b9ce1919e99e5db4df10\"},\"headline\":\"A Case for Isolated Virtual Environments with PySpark\",\"datePublished\":\"2020-09-17T05:59:49+00:00\",\"dateModified\":\"2024-08-29T06:04:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\"},\"wordCount\":1574,\"commentCount\":8,\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png\",\"keywords\":[\"Big Data\",\"Data Science\",\"Spark\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\",\"url\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\",\"name\":\"A Case for Isolated Virtual Environments with PySpark - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png\",\"datePublished\":\"2020-09-17T05:59:49+00:00\",\"dateModified\":\"2024-08-29T06:04:43+00:00\",\"description\":\"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png\",\"width\":1920,\"height\":1080,\"caption\":\"3 pythons in isolated environments for PySpark\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.inovex.de\/de\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"A Case for Isolated Virtual Environments with PySpark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.inovex.de\/de\/#website\",\"url\":\"https:\/\/www.inovex.de\/de\/\",\"name\":\"inovex 
GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.inovex.de\/de\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.inovex.de\/de\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\/\/www.inovex.de\/de\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/inovexde\",\"https:\/\/x.com\/inovexgmbh\",\"https:\/\/www.instagram.com\/inovexlife\/\",\"https:\/\/www.linkedin.com\/company\/inovex\",\"https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/5b7efc354e22b9ce1919e99e5db4df10\",\"name\":\"Jannis Madison\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/www.inovex.de\/de\/#\/schema\/person\/image\/efda21c91d3b427136672f39cdb3d62f\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/8b5dfc23ae491513ae3e92784bcd8faf598ad0f2bd477aa0db87660f76a45e20?s=96&d=retro&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/8b5dfc23ae491513ae3e92784bcd8faf598ad0f2bd477aa0db87660f76a45e20?s=96&d=retro&r=g\",\"caption\":\"Jannis Madison\"},\"url\":\"https:\/\/www.inovex.de\/de\/blog\/author\/jmadison\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"A Case for Isolated Virtual Environments with PySpark - inovex GmbH","description":"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/","og_locale":"de_DE","og_type":"article","og_title":"A Case for Isolated Virtual Environments with PySpark - inovex GmbH","og_description":"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.","og_url":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2020-09-17T05:59:49+00:00","article_modified_time":"2024-08-29T06:04:43+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png","type":"image\/png"}],"author":"Jannis Madison","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Jannis Madison","Gesch\u00e4tzte Lesezeit":"8\u00a0Minuten","Written by":"Jannis Madison"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/"},"author":{"name":"Jannis 
Madison","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/5b7efc354e22b9ce1919e99e5db4df10"},"headline":"A Case for Isolated Virtual Environments with PySpark","datePublished":"2020-09-17T05:59:49+00:00","dateModified":"2024-08-29T06:04:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/"},"wordCount":1574,"commentCount":8,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png","keywords":["Big Data","Data Science","Spark"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/","url":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/","name":"A Case for Isolated Virtual Environments with PySpark - inovex GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png","datePublished":"2020-09-17T05:59:49+00:00","dateModified":"2024-08-29T06:04:43+00:00","description":"This blogpost motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed 
clusters.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2020\/09\/pyspark-isolated-virtual-environments.png","width":1920,"height":1080,"caption":"3 pythons in isolated environments for PySpark"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/isolated-virtual-environments-pyspark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"A Case for Isolated Virtual Environments with PySpark"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex 
GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/5b7efc354e22b9ce1919e99e5db4df10","name":"Jannis Madison","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/image\/efda21c91d3b427136672f39cdb3d62f","url":"https:\/\/secure.gravatar.com\/avatar\/8b5dfc23ae491513ae3e92784bcd8faf598ad0f2bd477aa0db87660f76a45e20?s=96&d=retro&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/8b5dfc23ae491513ae3e92784bcd8faf598ad0f2bd477aa0db87660f76a45e20?s=96&d=retro&r=g","caption":"Jannis Madison"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/jmadison\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19256","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/240"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=19256"}],"version-history":[{"count":5,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19256\/revisions"}],"predecessor-version":[{"id":57650,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/19256\/revisions\/57650"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/19774"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=19256"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=19256"},{"taxonom
y":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=19256"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=19256"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}