{"id":21083,"date":"2018-04-10T13:30:43","date_gmt":"2018-04-10T11:30:43","guid":{"rendered":"http:\/\/www.inovex.de\/blog\/?p=13029"},"modified":"2023-09-18T07:45:15","modified_gmt":"2023-09-18T05:45:15","slug":"managing-isolated-environments-with-pyspark","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/","title":{"rendered":"Managing isolated Environments with PySpark"},"content":{"rendered":"<p>With the sustained success of the Spark data processing platform even data scientists with a strong focus on the Python ecosystem can no longer ignore it. Fortunately, it is easy to get started with <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/\">PySpark<\/a>\u2014the official Python <span class=\"caps\">API<\/span> for Spark\u2014due to millions of word count tutorials on the web. In contrast to that, resources on how to deploy and use Python packages like Numpy, Pandas, Scikit-Learn in isolated environments with PySpark are scarce. A nice exception to that is a blog post by Eran Kampf. 
Being able to install your own Python libraries is especially important if you want to write user-defined functions (UDFs) as explained in the blog post <a href=\"https:\/\/www.inovex.de\/blog\/efficient-udafs-with-pyspark\/\">Efficient <span class=\"caps\">UD<\/span>(A)Fs with PySpark<\/a>.<\/p>\n<p><!--more--><\/p>\n<p>For most Spark\/Hadoop distributions, Cloudera in my case, there are basically two options for managing isolated\u00a0environments:<\/p>\n<ol>\n<li>You give all your data scientists <span class=\"caps\">SSH<\/span> access to all your cluster\u2019s nodes and let them do whatever they want, such as installing virtual environments with <a href=\"https:\/\/virtualenv.pypa.io\/en\/stable\/\">virtualenv<\/a> or <a href=\"https:\/\/conda.io\/docs\/intro.html\">conda<\/a> as detailed in the <a href=\"https:\/\/www.cloudera.com\/documentation\/enterprise\/5-6-x\/topics\/spark_python.html#spark_python__section_kr2_4zs_b5\">Cloudera documentation<\/a>.<\/li>\n<li>Your sysadmins install Anaconda Parcels using the Cloudera Manager Admin Console to provide the most popular Python packages in a one-size-fits-all fashion for all your data scientists as described in a <a href=\"http:\/\/blog.cloudera.com\/blog\/2016\/02\/making-python-on-apache-hadoop-easier-with-anaconda-and-cdh\/\">Cloudera blog post<\/a>.<\/li>\n<\/ol>\n<p>Both options have drawbacks which are as severe as they are obvious. Do you really want to let a bunch of data scientists run processes on your cluster and fill up the local hard drives? The second option is not even a real isolated environment at all, since all your applications would use the same libraries and might break after a library\u00a0update.<\/p>\n<p>Therefore, we need to empower our data scientists developing a predictive application to manage isolated environments with their dependencies themselves. 
This was also recognized as a problem and several issues (<a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-13587\"><span class=\"caps\">SPARK<\/span>-13587<\/a> <span class=\"amp\">&amp;<\/span> <a href=\"https:\/\/issues.apache.org\/jira\/browse\/SPARK-16367\"><span class=\"caps\">SPARK<\/span>-16367<\/a>) suggest solutions, but none of them have been integrated yet. The most mature solution is actually <a href=\"https:\/\/github.com\/nteract\/coffee_boat\">coffee boat<\/a>, which is still in beta and not meant for production. Therefore, we want to present here a simple but viable solution for this problem that we have been using in production for more than a\u00a0year.<\/p>\n<p>So how can we distribute Python modules and whole packages on our executors? Luckily, PySpark provides the functions <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.SparkContext.addFile.html\">sc.addFile<\/a> and <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.SparkContext.addPyFile.html\">sc.addPyFile<\/a> which allow us to upload files to every node in our cluster, even Python modules and egg files in case of the latter. Unfortunately, there is no way to upload wheel files which are needed for binary Python packages like Numpy, Pandas and so on. As a data scientist you cannot live without\u00a0those.<\/p>\n<p>At first sight this looks pretty bad but thanks to the simplicity of the wheel format it\u2019s not so bad at all. So here is what we do in a nutshell: For a given PySpark application, we will create an isolated environment on <span class=\"caps\">HDFS<\/span> with the help of wheel files. When submitting our PySpark application, we copy the content of our environment to the driver and executors using <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/api\/pyspark.SparkContext.addFile.html\">sc.addFile<\/a>. 
Simple but\u00a0effective.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#Generating-the-environment\" >Generating the\u00a0environment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#Bootstrapping-the-environment\" >Bootstrapping the\u00a0environment<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Generating-the-environment\"><\/span>Generating the\u00a0environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>To create this environment, we start by creating a directory that will contain it, e.g. <span class=\"lang:sh decode:true crayon-inline \">venv<\/span>, on our local Linux machine. Then we populate this directory with the wheel files of all libraries that our PySpark application uses. Since wheel files may contain compiled code, they depend on the exact Python version and platform.<\/p>\n<p>For us this means we have to make sure that we use the same platform and Python version locally as we are going to use on the Spark cluster. In my case the cluster runs Ubuntu Trusty Linux with Python 3.4. 
To replicate this locally it\u2019s best to use a conda\u00a0environment:<\/p>\n<pre class=\"lang:sh decode:true \">conda create -n py34 python=3.4\r\n\r\nsource activate py34<\/pre>\n<p>Having activated the conda environment, we just use <span class=\"lang:sh decode:true crayon-inline \">pip download<\/span> to download all the requirements of our PySpark application as wheel files. In case there is no wheel file available, <span class=\"lang:sh decode:true crayon-inline \">pip<\/span> will download a source-based <span class=\"lang:sh decode:true crayon-inline \">tar.gz<\/span> file instead but we can easily generate a wheel from it. To do so, we just unpack the archive, change into the directory and type <span class=\"lang:sh decode:true crayon-inline \">python setup.py bdist_wheel<\/span>. A wheel file should now reside in the <span class=\"lang:sh decode:true crayon-inline \">dist<\/span> subdirectory. At this point one should also be aware that some wheel files come with low-level Linux dependencies that just need to be installed by a sysadmin on every host, e.g. <span class=\"lang:sh decode:true crayon-inline \">python3-dev<\/span> and <span class=\"lang:sh decode:true crayon-inline \">unixodbc-dev<\/span>.<\/p>\n<p>Now we copy the wheel files of all our PySpark application\u2019s dependencies into the <span class=\"lang:sh decode:true crayon-inline \">venv<\/span> directory. After that, we unpack them with <span class=\"lang:sh decode:true crayon-inline \">unzip<\/span> since they are just normal zip files with a strange suffix. Finally, we push everything to <span class=\"caps\">HDFS<\/span>, e.g. 
<span class=\"lang:sh decode:true crayon-inline \">\/my_venvs\/venv<\/span>, using <span class=\"lang:sh decode:true crayon-inline \">hdfs dfs -put .\/venv \/my_venvs\/venv<\/span> and make sure that the files are readable by\u00a0anyone.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Bootstrapping-the-environment\"><\/span>Bootstrapping the\u00a0environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When our PySpark application runs, the first thing we do is call <span class=\"lang:sh decode:true crayon-inline \">sc.addFile<\/span> on every file in <span class=\"lang:sh decode:true crayon-inline \">\/my_venvs\/venv<\/span>. Since this will also set the <span class=\"lang:sh decode:true crayon-inline \">PYTHONPATH<\/span> correctly, importing any library that resides in <span class=\"lang:sh decode:true crayon-inline \">venv<\/span> will just work. If our Python application itself is also nicely structured as a Python package (maybe using <a href=\"http:\/\/pyscaffold.org\/\">PyScaffold<\/a>), we can also push it to <span class=\"lang:sh decode:true crayon-inline \">\/my_venvs\/venv<\/span>. This allows us to build a full-blown PySpark application and cleanly separate it from the boilerplate code that bootstraps our isolated\u00a0environment.<\/p>\n<p>Let\u2019s assume our PySpark application is a Python package called <span class=\"lang:sh decode:true crayon-inline \">my_pyspark_app<\/span>. The boilerplate code to bootstrap <span class=\"lang:sh decode:true crayon-inline \">my_pyspark_app<\/span>, i.e. to activate the isolated environment on Spark, will be in the module <span class=\"lang:sh decode:true crayon-inline \">activate_env.py<\/span>. 
When we submit our Spark job we will specify this module and specify the environment as an argument,\u00a0e.g.:<\/p>\n<pre class=\"lang:sh decode:true \">PYSPARK_PYTHON=python3.4 \/opt\/spark\/bin\/spark-submit --master yarn --deploy-mode cluster \\\r\n\r\n--num-executors 4 --driver-memory 12g --executor-memory 4g --executor-cores 1 \\\r\n\r\n--files \/etc\/spark\/conf\/hive-site.xml --queue default --conf spark.yarn.maxAppAttempts=1 \\\r\n\r\nactivate_env.py \/my_venvs\/venv<\/pre>\n<p>Easy and quite flexible! We are even able to change from one environment to another by just passing another <span class=\"caps\">HDFS<\/span> directory. Here is what <span class=\"lang:sh decode:true crayon-inline \">activate_env.py<\/span> looks like, which does the actual heavy lifting with <span class=\"lang:sh decode:true crayon-inline \">sc.addFile<\/span>:<\/p>\n<pre class=\"lang:python decode:true \">\"\"\"\r\n\r\nBootstrapping an isolated environment for `my_pyspark_app` on Spark\r\n\r\n\"\"\"\r\n\r\nimport os\r\n\r\nimport sys\r\n\r\nimport logging\r\n\r\nfrom pyspark.context import SparkContext\r\n\r\nfrom pyspark.sql import SparkSession\r\n\r\nfrom pyspark.sql.functions import *\r\n\r\n_logger = logging.getLogger(__name__)\r\n\r\ndef list_path_names(path):\r\n\r\n    \"\"\"List files and directories in an HDFS path\r\n\r\n    Args:\r\n\r\n        path (str): HDFS path to directory\r\n\r\n    Returns:\r\n\r\n        [str]: list of file\/directory names\r\n\r\n    \"\"\"\r\n\r\n    sc = SparkContext.getOrCreate()\r\n\r\n    # low-level access to hdfs driver\r\n\r\n    hadoop = sc._gateway.jvm.org.apache.hadoop\r\n\r\n    path = hadoop.fs.Path(path)\r\n\r\n    config = hadoop.conf.Configuration()\r\n\r\n    status = hadoop.fs.FileSystem.get(config).listStatus(path)\r\n\r\n    return (path_status.getPath().getName() for path_status in status)\r\n\r\ndef distribute_hdfs_files(hdfs_path):\r\n\r\n    \"\"\"Distributes recursively a given directory in HDFS to Spark\r\n\r\n    
Args:\r\n\r\n        hdfs_path (str): path to directory\r\n\r\n    \"\"\"\r\n\r\n    sc = SparkContext.getOrCreate()\r\n\r\n    for path_name in list_path_names(hdfs_path):\r\n\r\n        path = os.path.join(hdfs_path, path_name)\r\n\r\n        _logger.info(\"Distributing {}...\".format(path))\r\n\r\n        sc.addFile(path, recursive=True)\r\n\r\ndef main(args):\r\n\r\n    \"\"\"Main entry point allowing external calls\r\n\r\n    Args:\r\n\r\n      args ([str]): command line parameter list\r\n\r\n    \"\"\"\r\n\r\n    # setup logging for driver\r\n\r\n    logging.basicConfig(level=logging.DEBUG, stream=sys.stdout)\r\n\r\n    _logger = logging.getLogger(__name__)\r\n\r\n    _logger.info(\"Starting up...\")\r\n\r\n    # Create the singleton instance\r\n\r\n    spark = (SparkSession\r\n\r\n             .builder\r\n\r\n             .appName(\"My PySpark App in its own environment\")\r\n\r\n             .enableHiveSupport()\r\n\r\n             .getOrCreate())\r\n\r\n    # For simplicity we assume that the first argument is the environment on HDFS\r\n\r\n    VENV_DIR = args[0]\r\n\r\n    # make sure we have the latest version available on HDFS\r\n\r\n    distribute_hdfs_files('hdfs:\/\/' + VENV_DIR)\r\n\r\n    from my_pyspark_app import main\r\n\r\n    main(args[1:])\r\n\r\ndef run():\r\n\r\n    \"\"\"Entry point for console_scripts\r\n\r\n    \"\"\"\r\n\r\n    main(sys.argv[1:])\r\n\r\nif __name__ == \"__main__\":\r\n\r\n    run()<\/pre>\n<p>It is actually easier than it looks. In the <span class=\"lang:sh decode:true crayon-inline \">main<\/span> function we initialize the <span class=\"lang:sh decode:true crayon-inline \">SparkSession<\/span> for the first time so that later calls to the session builder will reuse this instance. Next, the path argument passed to <span class=\"lang:sh decode:true crayon-inline \">spark-submit<\/span> is extracted. 
This path is then passed to <span class=\"lang:sh decode:true crayon-inline \">distribute_hdfs_files<\/span>, which calls <span class=\"lang:sh decode:true crayon-inline \">sc.addFile<\/span> with <span class=\"lang:sh decode:true crayon-inline \">recursive=True<\/span> on every entry of that directory to set up the isolated environment on the driver and executors. After this we are able to import our <span class=\"lang:sh decode:true crayon-inline \">my_pyspark_app<\/span> package and call, for instance, its <span class=\"lang:sh decode:true crayon-inline \">main<\/span> function. The following figure illustrates the whole\u00a0concept:<\/p>\n<figure><a href=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/pyspark_venv.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-13033\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/pyspark_venv.png\" alt=\"Isolated environment with PySpark\" width=\"641\" height=\"417\" \/><\/a><figcaption><strong>Figure:<\/strong> Executing <em>spark-submit<\/em> uploads our <em>activate_env.py<\/em> module and starts a Spark driver process. Thereafter, <em>activate_env.py<\/em> is executed within the driver and bootstraps our <em>venv<\/em> environment on the Spark driver as well as on the executors. Finally, <em>activate_env.py<\/em> relinquishes control to <em>my_pyspark_app<\/em>.<\/figcaption><\/figure>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Setting up an isolated environment like this is a bit cumbersome and surely also somewhat hacky. Still, in our use case it served us quite well and allowed the data scientists to set up their specific environments without access to the cluster\u2019s nodes. Since the method also works with <a href=\"https:\/\/jupyter.org\/\">Jupyter<\/a>, it is useful not only for production but also for proofs of concept. 
That being said, we still hope that soon there will be an official solution by the Spark project\u00a0itself.<\/p>\n<p><em>This article first appeared on <a href=\"https:\/\/www.florianwilhelm.info\/2018\/03\/isolated_environments_with_pyspark\/\" target=\"_blank\" rel=\"noopener\">Florianwilhelm.info<\/a>.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the sustained success of the Spark data processing platform even data scientists with a strong focus on the Python ecosystem can no longer ignore it. Fortunately, it is easy to get started with PySpark\u2014the official Python API for Spark\u2014due to millions of word count tutorials on the web. In contrast to that, resources on [&hellip;]<\/p>\n","protected":false},"author":52,"featured_media":13422,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"ep_exclude_from_search":false,"footnotes":""},"tags":[206],"service":[431],"coauthors":[{"id":52,"display_name":"Florian Wilhelm","user_nicename":"fwilhelm"}],"class_list":["post-21083","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-data-science","service-data-science"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Managing isolated Environments with PySpark - inovex GmbH<\/title>\n<meta name=\"description\" content=\"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" 
\/>\n<meta property=\"og:title\" content=\"Managing isolated Environments with PySpark - inovex GmbH\" \/>\n<meta property=\"og:description\" content=\"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/\" \/>\n<meta property=\"og:site_name\" content=\"inovex GmbH\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/inovexde\" \/>\n<meta property=\"article:published_time\" content=\"2018-04-10T11:30:43+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-09-18T05:45:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Florian Wilhelm\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1-1024x576.png\" \/>\n<meta name=\"twitter:creator\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:site\" content=\"@inovexgmbh\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"Florian Wilhelm\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"9\u00a0Minuten\" \/>\n\t<meta name=\"twitter:label3\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data3\" content=\"Florian Wilhelm\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/\"},\"author\":{\"name\":\"Florian Wilhelm\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\"},\"headline\":\"Managing isolated Environments with PySpark\",\"datePublished\":\"2018-04-10T11:30:43+00:00\",\"dateModified\":\"2023-09-18T05:45:15+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/\"},\"wordCount\":1224,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2018\\\/04\\\/isolated-environments-1.png\",\"keywords\":[\"Data Science\"],\"articleSection\":[\"Analytics\",\"English Content\",\"General\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/\",\"name\":\"Managing isolated Environments with PySpark - inovex 
GmbH\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2018\\\/04\\\/isolated-environments-1.png\",\"datePublished\":\"2018-04-10T11:30:43+00:00\",\"dateModified\":\"2023-09-18T05:45:15+00:00\",\"description\":\"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#primaryimage\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2018\\\/04\\\/isolated-environments-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2018\\\/04\\\/isolated-environments-1.png\",\"width\":1920,\"height\":1080,\"caption\":\"Three Bubbles holding the logos of pandas, scikit and NumPy\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/managing-isolated-environments-with-pyspark\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Managing isolated Environments with 
PySpark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#website\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"name\":\"inovex GmbH\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#organization\",\"name\":\"inovex GmbH\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/2021\\\/03\\\/inovex-logo-16-9-1.png\",\"width\":1921,\"height\":1081,\"caption\":\"inovex GmbH\"},\"image\":{\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/inovexde\",\"https:\\\/\\\/x.com\\\/inovexgmbh\",\"https:\\\/\\\/www.instagram.com\\\/inovexlife\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/inovex\",\"https:\\\/\\\/www.youtube.com\\\/channel\\\/UC7r66GT14hROB_RQsQBAQUQ\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/#\\\/schema\\\/person\\\/57ad7c24ee7f9ec59ed87598c73fe79e\",\"name\":\"Florian 
Wilhelm\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9\",\"url\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"contentUrl\":\"https:\\\/\\\/www.inovex.de\\\/wp-content\\\/uploads\\\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg\",\"caption\":\"Florian Wilhelm\"},\"url\":\"https:\\\/\\\/www.inovex.de\\\/de\\\/blog\\\/author\\\/fwilhelm\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Managing isolated Environments with PySpark - inovex GmbH","description":"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/","og_locale":"de_DE","og_type":"article","og_title":"Managing isolated Environments with PySpark - inovex GmbH","og_description":"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.","og_url":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/","og_site_name":"inovex GmbH","article_publisher":"https:\/\/www.facebook.com\/inovexde","article_published_time":"2018-04-10T11:30:43+00:00","article_modified_time":"2023-09-18T05:45:15+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png","type":"image\/png"}],"author":"Florian 
Wilhelm","twitter_card":"summary_large_image","twitter_image":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1-1024x576.png","twitter_creator":"@inovexgmbh","twitter_site":"@inovexgmbh","twitter_misc":{"Verfasst von":"Florian Wilhelm","Gesch\u00e4tzte Lesezeit":"9\u00a0Minuten","Written by":"Florian Wilhelm"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#article","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/"},"author":{"name":"Florian Wilhelm","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e"},"headline":"Managing isolated Environments with PySpark","datePublished":"2018-04-10T11:30:43+00:00","dateModified":"2023-09-18T05:45:15+00:00","mainEntityOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/"},"wordCount":1224,"commentCount":0,"publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png","keywords":["Data Science"],"articleSection":["Analytics","English Content","General"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/","url":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/","name":"Managing isolated Environments with PySpark - inovex 
GmbH","isPartOf":{"@id":"https:\/\/www.inovex.de\/de\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#primaryimage"},"image":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#primaryimage"},"thumbnailUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png","datePublished":"2018-04-10T11:30:43+00:00","dateModified":"2023-09-18T05:45:15+00:00","description":"In this article we present a simple solution for managing Isolated Environments with PySpark that we have been using in production for more than a year.","breadcrumb":{"@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#primaryimage","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2018\/04\/isolated-environments-1.png","width":1920,"height":1080,"caption":"Three Bubbles holding the logos of pandas, scikit and NumPy"},{"@type":"BreadcrumbList","@id":"https:\/\/www.inovex.de\/de\/blog\/managing-isolated-environments-with-pyspark\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.inovex.de\/de\/"},{"@type":"ListItem","position":2,"name":"Managing isolated Environments with PySpark"}]},{"@type":"WebSite","@id":"https:\/\/www.inovex.de\/de\/#website","url":"https:\/\/www.inovex.de\/de\/","name":"inovex 
GmbH","description":"","publisher":{"@id":"https:\/\/www.inovex.de\/de\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.inovex.de\/de\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/www.inovex.de\/de\/#organization","name":"inovex GmbH","url":"https:\/\/www.inovex.de\/de\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/2021\/03\/inovex-logo-16-9-1.png","width":1921,"height":1081,"caption":"inovex GmbH"},"image":{"@id":"https:\/\/www.inovex.de\/de\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/inovexde","https:\/\/x.com\/inovexgmbh","https:\/\/www.instagram.com\/inovexlife\/","https:\/\/www.linkedin.com\/company\/inovex","https:\/\/www.youtube.com\/channel\/UC7r66GT14hROB_RQsQBAQUQ"]},{"@type":"Person","@id":"https:\/\/www.inovex.de\/de\/#\/schema\/person\/57ad7c24ee7f9ec59ed87598c73fe79e","name":"Florian Wilhelm","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg5db1abe47435abb84b0b7484ce0890e9","url":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","contentUrl":"https:\/\/www.inovex.de\/wp-content\/uploads\/cropped-florian-1-IMG_5829-800x610-1-96x96.jpg","caption":"Florian 
Wilhelm"},"url":"https:\/\/www.inovex.de\/de\/blog\/author\/fwilhelm\/"}]}},"_links":{"self":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21083","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/users\/52"}],"replies":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/comments?post=21083"}],"version-history":[{"count":3,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21083\/revisions"}],"predecessor-version":[{"id":48574,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/posts\/21083\/revisions\/48574"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media\/13422"}],"wp:attachment":[{"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/media?parent=21083"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/tags?post=21083"},{"taxonomy":"service","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/service?post=21083"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.inovex.de\/de\/wp-json\/wp\/v2\/coauthors?post=21083"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}