{"id":33943,"date":"2022-03-09T10:53:26","date_gmt":"2022-03-09T09:53:26","guid":{"rendered":"https:\/\/www.inovex.de\/?p=33943"},"modified":"2023-11-02T07:57:34","modified_gmt":"2023-11-02T06:57:34","slug":"getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks","status":"publish","type":"post","link":"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/","title":{"rendered":"Getting Started with the Rapids Accelerator for Apache Spark on Azure Databricks"},"content":{"rendered":"<p>In this blog post, we will look at Nvidia&#8217;s Rapids Accelerator for Apache Spark, a plugin for distributed data processing on the GPU. You will learn how to set it up on Azure Databricks and run a toy example.<br \/>\n<!--more--><\/p>\n<p>In the last two decades, we have witnessed the rise of big data and data science. Data is getting bigger and transformations are becoming more complex while CPU processing power is reaching its limits. Due to their high parallelism and bandwidth, GPUs are increasingly adopted in high-performance computing and data science to keep up with the ever-growing demand for computing power.<\/p>\n<p>Apache Spark is currently the de-facto standard framework for building distributed data processing pipelines for big data. Nvidia recently released and open-sourced the \u201cRapids Accelerator for Apache Spark\u201c, a plugin to accelerate Spark applications on the GPU. This is done completely transparently, i.e. the user defines Spark data frames or SQL queries as in regular (CPU) Spark and the plugin performs the respective operations on the GPU or falls back to the CPU if an operation is not supported.<\/p>\n<p>In this series of blog posts, we will get to know and evaluate the Rapids Accelerator. 
This first blog post will be about:<\/p>\n<ul>\n<li>Setting up Azure Databricks as our Spark hosting service, running a Spark job and understanding the associated costs<\/li>\n<li>Setting up the plugin on Databricks and running a toy example<\/li>\n<li>Some runtime measurements with varying degrees of parallelization on a demo provided by Nvidia<\/li>\n<\/ul>\n<p>The plugin is currently under active development with stable releases about every two months. This post is based on Spark-Rapids v.21.06.2. A basic familiarity with regular (CPU) Spark will be assumed.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_83 counter-hierarchy ez-toc-counter ez-toc-custom ez-toc-container-direction\">\n<div class=\"ez-toc-title-container\"><p class=\"ez-toc-title\" style=\"cursor:inherit\"><\/p>\n<\/div><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Setting-up-the-Environment\" >Setting up the Environment<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Setting-up-Databricks-on-Azure\" >Setting up Databricks on Azure<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Create-an-Account-and-a-Subscription\" >Create an Account and a Subscription<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-4\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Create-a-Resource-Group\" >Create a Resource Group<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Create-a-Databricks-Workspace\" >Create a Databricks Workspace<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Adjusting-Azure-VM-core-quota\" >Adjusting Azure VM core quota<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Launching-the-Databricks-Workspace\" >Launching the Databricks Workspace<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Quick-Tour-Through-Databricks\" >Quick Tour Through Databricks<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Running-Spark-Jobs\" >Running Spark Jobs<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Understanding-available-Azure-Databricks-VMs-and-Pricing\" >Understanding available Azure Databricks VMs and 
Pricing<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Setting-up-the-Rapids-Accelerator-on-Databricks\" >Setting up the Rapids Accelerator on Databricks<\/a><ul class='ez-toc-list-level-4' ><li class='ez-toc-heading-level-4'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Limitations-of-using-the-Rapids-Accelerator-on-Databricks\" >Limitations of using the Rapids Accelerator on Databricks<\/a><\/li><\/ul><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Running-a-Demo\" >Running a Demo<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Performance-and-Cost-Results-on-Demo\" >Performance and Cost Results on Demo<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Nvidias-Cost-and-Performance\" >Nvidias Cost and Performance<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Conclusion\" >Conclusion<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-17\" 
href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Appendix\" >Appendix<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Log-Databricks-Cluster-Parameters\" >Log Databricks Cluster Parameters<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/www.inovex.de\/de\/blog\/getting-started-with-the-rapids-accelerator-for-apache-spark-on-azure-databricks\/#Compute-Costs-incl-Driver\" >Compute Costs incl. Driver<\/a><\/li><\/ul><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Setting-up-the-Environment\"><\/span>Setting up the Environment<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Nvidia provides a <a href=\"https:\/\/nvidia.github.io\/spark-rapids\/\" target=\"_blank\" rel=\"noopener\">documentation page<\/a> for the plugin.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-34442 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage.png\" alt=\"Screenshot of Documentation Page of Spark Plugin\" width=\"1257\" height=\"846\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage.png 1257w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-300x202.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-1024x689.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-768x517.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-400x269.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-452x304.png 452w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-360x242.png 360w\" sizes=\"auto, (max-width: 1257px) 100vw, 1257px\" \/><\/p>\n<p>It is the most comprehensive source of 
information for the plugin currently available. The \u201cGetting Started\u201c page describes how to set up the plugin on every major cloud Spark service as well as a Kubernetes and an on-premise setup. Note that usage of the associated cloud services is not free of charge, even for demo purposes. Currently supported GPU families are the HPC data center GPUs Nvidia T4, V100 and the new A-Series (A2, A10, A30, A100).<\/p>\n<p>Let\u2019s set up the plugin on Databricks since it is one of the easiest options to get started. We will walk through a complete example on Azure; however, Databricks is also available on AWS and GCP. If you already have access to Databricks, e.g. via your company, feel free to move on to Section \u201cQuick Tour Through Databricks\u201c. If you want to set up the plugin in another service, follow the instructions on Nvidia&#8217;s documentation page and continue reading at Section \u201cRunning a Demo\u201c.<\/p>\n<p><strong>Note: I would recommend setting everything up with Azure as an institutional client rather than as a private individual.<\/strong> We will need Virtual Machines with GPUs that are in high demand at the time of writing. 
Azure is more likely to approve the required VM core quota (Section \u201cAdjusting Azure VM core quota\u201c below) if you are an institutional client.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Setting-up-Databricks-on-Azure\"><\/span>Setting up Databricks on Azure<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>The official quick start guide on setting up a Databricks Workspace and running a Spark Job can be found <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/databricks\/scenarios\/quickstart-create-databricks-workspace-portal?tabs=azure-portal\" target=\"_blank\" rel=\"noopener\">here<\/a>.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Create-an-Account-and-a-Subscription\"><\/span>Create an Account and a Subscription<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>A valid Microsoft account with an Azure \u201cPay-As-You-Go\u201c subscription will be required. Head to the <a href=\"http:\/\/portal.azure.com\" target=\"_blank\" rel=\"noopener\">Azure Portal<\/a> and create an account or use an existing GitHub or Microsoft account.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34456 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Sign-In-Page.png\" alt=\"Microsoft Sign-In Window\" width=\"515\" height=\"606\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Sign-In-Page.png 515w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Sign-In-Page-255x300.png 255w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Sign-In-Page-400x471.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Sign-In-Page-360x424.png 360w\" sizes=\"auto, (max-width: 515px) 100vw, 515px\" \/><\/p>\n<p>If you do not have an Azure \u201cPay-As-You-Go\u201c subscription (e.g. through your company), create one. 
Search for \u201cSubscriptions\u201c, click \u201cAdd\u201c, select \u201cPay-As-You-Go\u201c, enter a \u201cSubscription name\u201c, click \u201cReview + create\u201c &gt; \u201cCreate\u201c and follow the instructions. Note that after providing credit card information, you will be asked which Azure Support Plan with monthly charges you want to choose. Select the option without support and monthly charges. You will only be charged for the computing resources (VMs) actively used by Databricks and for some storage and virtual networking. In Section \u201cUnderstanding available Azure Databricks VMs and Pricing\u201c you will learn more about pricing.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34462 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page.png\" alt=\"Screenshot of Azure Subscription Plans\" width=\"1065\" height=\"584\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page.png 1065w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-300x165.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-1024x562.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-768x421.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-400x219.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-528x290.png 528w, https:\/\/www.inovex.de\/wp-content\/uploads\/Subscription-Page-360x197.png 360w\" sizes=\"auto, (max-width: 1065px) 100vw, 1065px\" \/><\/p>\n<h4><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34464 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription.png\" alt=\"Screenshot of Azure Subscription Offers\" width=\"1251\" height=\"459\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription.png 1251w, 
https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription-300x110.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription-1024x376.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription-768x282.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription-400x147.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Pay-As-You-Go-Subscription-360x132.png 360w\" sizes=\"auto, (max-width: 1251px) 100vw, 1251px\" \/><\/h4>\n<h4><span class=\"ez-toc-section\" id=\"Create-a-Resource-Group\"><\/span>Create a Resource Group<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Now you need to create a resource group, a container that groups related Azure resources.<\/p>\n<p>Search for \u201cResource groups\u201c, click \u201cCreate\u201c, provide a resource group name and select the respective subscription as well as a region of your choice. Note that not all regions support GPU-accelerated VMs. To be on the safe side, choose a major region such as \u201cWest Europe\u201c. 
Select \u201cReview + create\u201c and \u201cCreate\u201c after validation completes.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34469 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1.png\" alt=\"Screenshot Resource Groups\" width=\"1219\" height=\"559\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1.png 1219w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1-300x138.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1-1024x470.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1-768x352.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1-400x183.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Resource-Group-1-360x165.png 360w\" sizes=\"auto, (max-width: 1219px) 100vw, 1219px\" \/><\/p>\n<h4><span class=\"ez-toc-section\" id=\"Create-a-Databricks-Workspace\"><\/span>Create a Databricks Workspace<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Now you can create a Databricks Workspace, which is the web app to interact with Databricks. Search for \u201cDatabricks\u201c, select \u201cCreate\u201c, select the respective subscription and resource group, provide a name for the workspace, select the same region as in your resource group and \u201cStandard\u201c Pricing Tier. You may also choose \u201cTrial\u201c to not incur the cost for the Databricks Service for 14 days. You will however still pay for the Azure compute resources and the storage! Check out the <a href=\"https:\/\/databricks.com\/product\/pricing\" target=\"_blank\" rel=\"noopener\">Databricks pricing page<\/a>\u00a0for more information on Databricks pricing. 
Click on \u201cReview + create\u201c and on \u201cCreate\u201c after validating.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34471 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks.png\" alt=\"Azure Data Bricks workspace creation\" width=\"1167\" height=\"690\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks.png 1167w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks-300x177.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks-1024x605.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks-768x454.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks-400x237.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Azure-Databricks-360x213.png 360w\" sizes=\"auto, (max-width: 1167px) 100vw, 1167px\" \/><\/p>\n<p>It will take a couple of minutes until the workspace is created.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34473 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment.png\" alt=\"Overview of Deployment status\" width=\"1296\" height=\"438\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment.png 1296w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment-300x101.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment-1024x346.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment-768x260.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment-400x135.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-Databricks-Deployment-360x122.png 360w\" sizes=\"auto, (max-width: 1296px) 100vw, 1296px\" \/><\/p>\n<h4><span class=\"ez-toc-section\" id=\"Adjusting-Azure-VM-core-quota\"><\/span>Adjusting Azure VM core quota<span 
class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Azure limits how many VMs can be deployed in parallel using a CPU core quota. You need to make sure that your quota is high enough and adjust it if necessary. Note that you might not have the proper privileges to increase the core quota, which is usually the case in corporate settings. In this case, you must contact your company\u2019s Azure administrator.<\/p>\n<p>Azure limits the quota per region and per VM family. Thus, you need to increase both.<\/p>\n<p>Search for \u201cSubscriptions\u201c and select your respective subscription. In the left panel select \u201cUsage + quotas\u201c. Filter for your region, e.g. \u201cWest Europe\u201c, set the provider to \u201cMicrosoft.Compute\u201c and search for \u201cTotal Regional vCPUs\u201c.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34476 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs.png\" alt=\"Azure Databricks SubscriptionSettings Quota\" width=\"1193\" height=\"795\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs.png 1193w, https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs-300x200.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs-1024x682.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs-768x512.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs-400x267.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Increase-Azure-Regional-VMs-360x240.png 360w\" sizes=\"auto, (max-width: 1193px) 100vw, 1193px\" \/><\/p>\n<p>Click on the pencil and set a quota higher than 48 to comfortably follow this article. 
It may take a couple of minutes until the quota increase is approved.<\/p>\n<p>Search for and increase the core quota for the following families:<\/p>\n<ul>\n<li>\u201cStandard DASv4 Family vCPUs\u201c to at least 16<\/li>\n<li>\u201cStandard NCASv3_T4 Family vCPUs\u201c (T4 GPU VMs) to at least 32<\/li>\n<li>\u201cStandard NCSv3 Family vCPUs\u201c (V100 GPU VMs) to at least 12<\/li>\n<\/ul>\n<p>Note that if the quota cannot be increased automatically, you will be prompted to open a support ticket with Azure so that a Support Engineer can take care of it. In this case, just click on \u201cCreate a support request\u201c and follow the instructions. Due to high demand, the quota request may not be approved immediately; this is especially likely if you are not an institutional client.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34480 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-a-Azure-Support-Request.png\" alt=\"Databricks warning screen\" width=\"560\" height=\"416\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-a-Azure-Support-Request.png 560w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-a-Azure-Support-Request-300x223.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-a-Azure-Support-Request-400x297.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-a-Azure-Support-Request-360x267.png 360w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><\/p>\n<p>For more information on increasing quotas please refer to the <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/azure-portal\/supportability\/regional-quota-requests\" target=\"_blank\" rel=\"noopener\">respective Azure documentation<\/a>.<\/p>\n<p>You can come back later to increase the quota even further and add additional VM families supported in Databricks.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Launching-the-Databricks-Workspace\"><\/span>Launching the Databricks Workspace<span 
class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>When the quota increase is approved and the workspace deployment is completed, you can search again for \u201cDatabricks\u201c, select the workspace you just created and click on \u201cLaunch Workspace\u201c.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34482 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Launch-Databricks-Workspace.png\" alt=\"Launch Workspace Icon\" width=\"211\" height=\"200\" \/><\/p>\n<h3><span class=\"ez-toc-section\" id=\"Quick-Tour-Through-Databricks\"><\/span>Quick Tour Through Databricks<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">This is the Workspace\u2019s main page.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34484 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page.png\" alt=\"Databricks Workspace Main page\" width=\"1287\" height=\"752\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page.png 1287w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page-300x175.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page-1024x598.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page-768x449.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page-400x234.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Workspace-Main-Page-360x210.png 360w\" sizes=\"auto, (max-width: 1287px) 100vw, 1287px\" \/><\/p>\n<h4><span class=\"ez-toc-section\" id=\"Running-Spark-Jobs\"><\/span>Running Spark Jobs<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>On the left navigation bar, you can select \u201cCompute\u201c and then \u201cCreate Cluster\u201c to create an \u201cAll-Purpose Cluster\u201c. 
An \u201cAll-Purpose Cluster\u201c is a cluster that can be used interactively with Databricks Notebooks (similar to Jupyter Notebooks).<\/p>\n<p>On the \u201cCreate Cluster\u201c page you can configure the cluster. This should be pretty self-explanatory if you are familiar with Spark. If you need some more information, visit the <a href=\"https:\/\/docs.databricks.com\/clusters\/create-cluster.html\" target=\"_blank\" rel=\"noopener\">Create Cluster documentation page<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34486 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster.png\" alt=\"Screenshot of Cluster Creation Page\" width=\"1302\" height=\"1044\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster.png 1302w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster-300x241.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster-1024x821.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster-768x616.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster-400x321.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databrick-Create-Cluster-360x289.png 360w\" sizes=\"auto, (max-width: 1302px) 100vw, 1302px\" \/><\/p>\n<p>When you click on \u201cCreate Cluster\u201c a cluster will be deployed. This may take a few minutes. After the cluster is created you can select \u201cWorkspace\u201c &gt; \u201cShared\u201c &gt; \u201cArrow Down\u201c &gt; \u201cCreate\u201c &gt; \u201cNotebook\u201c to create a Databricks Notebook and execute Spark Queries interactively in Python, Scala, Java or SQL. 
The SparkSession will already be created and accessible via the spark variable.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34488 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook.png\" alt=\"Databricks Workplace, Create a new Notebook\" width=\"1309\" height=\"550\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook.png 1309w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook-300x126.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook-1024x430.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook-768x323.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook-400x168.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Create-Databricks-Notebook-360x151.png 360w\" sizes=\"auto, (max-width: 1309px) 100vw, 1309px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34490 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook.png\" alt=\"Page of a test notebook on Databricks\" width=\"1312\" height=\"649\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook.png 1312w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook-300x148.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook-1024x507.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook-768x380.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook-400x198.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Test-Notebook-360x178.png 360w\" sizes=\"auto, (max-width: 1312px) 100vw, 1312px\" \/><\/p>\n<p>Under \u201cImport\u201c you can upload a file (e.g. .py, Jupyter Notebook etc.) 
from your local file system to the workspace as well.<\/p>\n<p>Alternatively, you can create a Databricks Job by selecting the \u201cJobs\u201c tab &gt; \u201cCreate Job\u201c.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34492 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job.png\" alt=\"Screen for new job creation\" width=\"908\" height=\"701\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job.png 908w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job-300x232.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job-768x593.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job-400x309.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Databricks-Job-360x278.png 360w\" sizes=\"auto, (max-width: 908px) 100vw, 908px\" \/><\/p>\n<p>Here you define a cluster as in the \u201cCompute\u201c section and link a Notebook to it. Instead of running interactively, the job can only run as a whole. However, it only incurs the cheaper \u201cJobs Compute\u201c costs instead of the \u201cAll-Purpose Compute\u201c costs (see next section).<\/p>\n<p>For a complete guide on the Databricks Workspace please refer to the <a href=\"https:\/\/docs.databricks.com\/getting-started\/quick-start.html\" target=\"_blank\" rel=\"noopener\">Databricks quick start documentation<\/a>.<\/p>\n<h4><span class=\"ez-toc-section\" id=\"Understanding-available-Azure-Databricks-VMs-and-Pricing\"><\/span>Understanding available Azure Databricks VMs and Pricing<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>The costs for an Azure Databricks Workspace can be divided into two main categories: Azure resource costs and costs for Databricks Units (DBUs). 
The cost of a DBU depends mostly on whether you are using the Standard or Premium Tier when setting up the Databricks Workspace and whether you are using an \u201cAll-Purpose Cluster\u201c or a \u201cJob Cluster\u201c.<\/p>\n<figure id=\"attachment_34495\" aria-describedby=\"caption-attachment-34495\" style=\"width: 1271px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34495 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices.png\" alt=\"Pricing Plan for virtual machines\" width=\"1271\" height=\"454\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices.png 1271w, https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices-300x107.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices-1024x366.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices-768x274.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices-400x143.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/DBU-Prices-360x129.png 360w\" sizes=\"auto, (max-width: 1271px) 100vw, 1271px\" \/><figcaption id=\"caption-attachment-34495\" class=\"wp-caption-text\">Source: https:\/\/azure.microsoft.com\/en-gb\/pricing\/details\/databricks\/<\/figcaption><\/figure>\n<p>The amount of DBUs consumed per hour depends on the VM type and is displayed on the \u201cCreate Cluster\u201c window in the Databricks Workspace.<\/p>\n<p>The costs for the Azure Resources are dominated by the price Azure charges for its VMs. 
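As a back-of-the-envelope formula, a cluster's hourly cost is the sum over all nodes (driver included) of the VM price plus the DBUs consumed per hour times the DBU price. A small helper makes this concrete; the VM price and DBU consumption below are hypothetical placeholders, only the 0.13 EUR jobs-compute DBU price is taken from the Table 1 caption:

```python
def cluster_cost_per_hour(vm_eur_per_h, dbu_per_h, dbu_eur, num_workers, include_driver=True):
    """Estimated hourly cluster cost: each node costs its VM price
    plus its hourly DBU consumption times the DBU price."""
    nodes = num_workers + (1 if include_driver else 0)
    return nodes * (vm_eur_per_h + dbu_per_h * dbu_eur)

# Hypothetical example: 2 workers + 1 driver, a VM at 0.657 EUR/h
# consuming 1.5 DBU/h, billed at the 0.13 EUR jobs-compute DBU price.
cost = cluster_cost_per_hour(0.657, 1.5, 0.13, num_workers=2)
print(f"{cost:.3f} EUR/h")  # 2.556 EUR/h
```

Plugging in the actual figures for your VM type from Table 1 gives a quick estimate before launching a cluster.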
An overview of the prices can be found on the <a href=\"https:\/\/azure.microsoft.com\/en-gb\/pricing\/details\/virtual-machines\/windows\/\" target=\"_blank\" rel=\"noopener\">VM pricing page.<\/a><\/p>\n<p>Table 1 shows relevant VMs and their associated costs per hour. Note that the price per hour for the \u201cNCas Family\u201c (T4 GPU) is comparable to that of the CPU VMs.<\/p>\n<figure id=\"attachment_34498\" aria-describedby=\"caption-attachment-34498\" style=\"width: 887px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34498 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing.png\" alt=\"VMs and their associated costs per hour\" width=\"887\" height=\"414\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing.png 887w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing-300x140.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing-768x358.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing-400x187.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Azure-VM-Pricing-360x168.png 360w\" sizes=\"auto, (max-width: 887px) 100vw, 887px\" \/><figcaption id=\"caption-attachment-34498\" class=\"wp-caption-text\">Table 1: Virtual Machines and Prices in Databricks. The total price per hour is the price per hour for the VM plus the price for the DBUs per hour. 
One DBU here is 0.13 EUR for standard job compute and 0.36 EUR for all-purpose compute.<br \/>Sources:<br \/>https:\/\/azure.microsoft.com\/en-gb\/pricing\/details\/databricks\/<br \/>https:\/\/azure.microsoft.com\/en-gb\/pricing\/details\/virtual-machines\/linux\/<\/figcaption><\/figure>\n<p>Additionally, Databricks creates an Azure Blob Storage and an Azure Virtual Network during workspace deployment, which incur some additional costs that we will not go into here.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Setting-up-the-Rapids-Accelerator-on-Databricks\"><\/span>Setting up the Rapids Accelerator on Databricks<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Installing the plugin on Databricks mainly consists of configuring a Databricks cluster with appropriate cluster and Spark configs as well as adding an init script that is executed on each worker at every cluster creation. The init script downloads and installs the actual plugin on the cluster.<\/p>\n<p>Thus, the first step is getting the init script into the Databricks File System (DBFS). This has to be done only once. One way is to create a small cluster (e.g. Single Node Standard_DS3_v2) and run a prepared notebook that writes the init script for a given Databricks Runtime from a Python string to DBFS. 
Afterward, you can create the actual GPU cluster, reference the init script and set the configs.<\/p>\n<p>Create a small CPU single node cluster and execute the following Python code in a notebook:<\/p>\n<pre class=\"lang:python decode:true\" title=\"Create Init Script\"># Create init script for Runtime 9.1 LTS ML GPU\r\n\r\ndbutils.fs.mkdirs(\"dbfs:\/databricks\/init_scripts\/\")\r\n \r\ndbutils.fs.put(\"\/databricks\/init_scripts\/gpu_init.sh\",\"\"\"\r\n#!\/bin\/bash\r\nsudo wget -O \/databricks\/jars\/rapids-4-spark_2.12-21.12.0.jar https:\/\/repo1.maven.org\/maven2\/com\/nvidia\/rapids-4-spark_2.12\/21.12.0\/rapids-4-spark_2.12-21.12.0.jar\r\nsudo wget -O \/databricks\/jars\/cudf-21.12.2-cuda11.jar https:\/\/repo1.maven.org\/maven2\/ai\/rapids\/cudf\/21.12.2\/cudf-21.12.2-cuda11.jar\"\"\", True)\r\n\r\n# Source: https:\/\/nvidia.github.io\/spark-rapids\/docs\/demo\/Databricks\/generate-init-script.ipynb<\/pre>\n<p><span style=\"font-weight: 400;\">Terminate the cluster afterwards.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now you will create the actual GPU cluster that runs the plugin. Create a cluster with:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Databricks Runtime 9.1 LTS ML GPU.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Cluster mode \u201cStandard\u201c. Single Node is not supported.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Disabled autoscaling.<\/span><\/li>\n<li><span style=\"font-weight: 400;\">A GPU worker type from <\/span><span style=\"font-weight: 400;\">Table 1<\/span><span style=\"font-weight: 400;\"> e.g. \u201cStandard NC4as T4 v3\u201c. U<\/span><span style=\"font-weight: 400;\">sing multiple GPUs on one worker is not supported in Databricks. <\/span><span style=\"font-weight: 400;\">The GPU architecture must be compatible with the plugin, i.e. 
Nvidia T4, V100 or A Series.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The desired worker count; \u201c1\u201c should do for starters.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">A driver type. I would suggest the cheapest one, \u201cStandard NC4as T4 v3\u201c, unless there is a good reason to use a bigger one.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The path to the init script: <span class=\"lang:default decode:true crayon-inline \">dbfs:\/databricks\/init_scripts\/gpu_init.sh<\/span> (&#8222;Advanced options&#8220; &gt;&gt; &#8222;Init Scripts&#8220;)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The following Spark configs (&#8222;Advanced options&#8220; &gt;&gt; &#8222;Spark&#8220; &gt;&gt; &#8222;Spark configs&#8220;)<\/span><\/li>\n<\/ul>\n<pre class=\"lang:default decode:true\" title=\"Spark Rapids Required Configs\">spark.plugins com.nvidia.spark.SQLPlugin\r\nspark.task.resource.gpu.amount 0.1\r\nspark.rapids.memory.pinnedPool.size 2G\r\nspark.locality.wait 0s\r\nspark.databricks.delta.optimizeWrite.enabled false\r\nspark.sql.adaptive.enabled false\r\nspark.rapids.sql.concurrentGpuTasks 2<\/pre>\n<p><span style=\"font-weight: 400;\"><span class=\"lang:default decode:true crayon-inline\">spark.plugins com.nvidia.spark.SQLPlugin<\/span> activates the plugin and is the most essential config entry. <span class=\"lang:default decode:true crayon-inline\">spark.sql.adaptive.enabled false<\/span> disables adaptive query execution and <span class=\"lang:default decode:true crayon-inline\">spark.databricks.delta.optimizeWrite.enabled false<\/span>\u00a0disables delta optimized write. Disabling these two features is specific to Databricks. The remaining configs influence the performance of Spark and are set to defaults more appropriate for the GPU. 
For now, just set them as suggested and don\u2019t think too much about them. We will look at Spark configs to tune the application in the next blog post.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Finally, certain features such as pandas UDF GPU support require some additional configs to be set. Right now, do not bother with that either. Otherwise, check out the official Nvidia documentation.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34517 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup.png\" alt=\"Overview page of Test GPU Cluster\" width=\"1161\" height=\"1376\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup.png 1161w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup-253x300.png 253w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup-864x1024.png 864w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup-768x910.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup-400x474.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Rapids-Databricks-Setup-360x427.png 360w\" sizes=\"auto, (max-width: 1161px) 100vw, 1161px\" \/><\/p>\n<p>It is important to configure the cluster exactly as intended. Failing to do so may easily lead to the plugin not working or the cluster shutting down without an appropriate error message. For instance, I initially chose \u201cSingle Node\u201c mode in Databricks (driver = worker) although \u201cStandard Mode\u201c must be used. The cluster did start and regular Python and Scala calls could be executed in a notebook. 
However, when running a Spark action, the job started but never progressed.<\/p>\n<p>Therefore, to check if everything is set up correctly, you may run the following Python code to test if the plugin is working as expected.<\/p>\n<pre class=\"lang:python decode:true \" title=\"Check correct Spark-Rapids Installation\"># Execute in a notebook \r\ndf = spark.range(10)\r\ndf.explain()\r\n\r\n# Output should be similar to this: \r\n#\r\n# == Physical Plan ==\r\n# GpuColumnarToRow false\r\n# +- GpuRange (0, 10, step=1, splits=4)\r\n\r\n# In another cell\r\ndf.write.mode(\"overwrite\").parquet(\"\/test\/dummy.parquet\")  # check Spark UI of Job<\/pre>\n<p><span style=\"font-weight: 400;\">The output of explain should show the `GpuRange` operation and the job writing the dummy data should finish successfully. You might also have a look at the execution plan in the Spark UI that should show the GPU operations.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34520 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids.png\" alt=\"Spark UI that shows the GPU operations\" width=\"1390\" height=\"849\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids.png 1390w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-300x183.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-1024x625.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-768x469.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-400x244.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-555x338.png 555w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-UI-for-Spark-Rapids-360x220.png 360w\" sizes=\"auto, (max-width: 1390px) 100vw, 1390px\" \/><\/p>\n<h4><span class=\"ez-toc-section\" id=\"Limitations-of-using-the-Rapids-Accelerator-on-Databricks\"><\/span>Limitations of 
using the Rapids Accelerator on Databricks<span class=\"ez-toc-section-end\"><\/span><\/h4>\n<p>Using the plugin on Databricks comes with several drawbacks compared to other environments. Most importantly, Adaptive Query Execution (AQE) is not supported, which negatively impacts performance. Also, a GPU VM must be used as the Spark Driver Node, which can be a cost disadvantage if the overall cluster size is not too big. Additionally, only a single GPU per node can be effectively utilized. This leaves only 4 VM types that we can effectively use in Azure for our purposes. These are the ones summarized in Table 1 above. Note that in Azure Databricks there is no VM with an A100 GPU available and that the V100 machine has only 6 CPU cores. We will see that the number of CPU cores is still relevant for the Spark plugin, even when using GPUs.<\/p>\n<p>On the other hand, setting up the plugin on Databricks and experimentally working with it is comparatively straightforward, which makes Databricks a good choice to get started.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Running-a-Demo\"><\/span>Running a Demo<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Let&#8217;s have a look at a demo that Nvidia offers. In the \u201cGetting Started\u201c section for Databricks, AWS-EMR and GCP Dataproc there is a reference to a Python Jupyter Notebook. The notebook differs slightly w.r.t. the initial setup in the respective service. You can find the \u201cDatabricks version\u201c of the notebook under the demo tab as well.<\/p>\n<p>The notebook downloads an approx. 4 GB version of the \u201c<a href=\"https:\/\/capitalmarkets.fanniemae.com\/credit-risk-transfer\/single-family-credit-risk-transfer\/fannie-mae-single-family-loan-performance-data\" target=\"_blank\" rel=\"noopener\">Fannie Mae Single-Family Loan Performance Data<\/a>\u201c consisting of two datasets, \u201cacquisition\u201c and \u201cperformance\u201c, partitioned over 4 csv files. 
The datasets are summarized in Table 2.<\/p>\n<figure id=\"attachment_34522\" aria-describedby=\"caption-attachment-34522\" style=\"width: 474px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-34522\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Fannie-Mae-Data-Summary.png\" alt=\"\" width=\"474\" height=\"209\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Fannie-Mae-Data-Summary.png 388w, https:\/\/www.inovex.de\/wp-content\/uploads\/Fannie-Mae-Data-Summary-300x132.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Fannie-Mae-Data-Summary-360x159.png 360w\" sizes=\"auto, (max-width: 474px) 100vw, 474px\" \/><figcaption id=\"caption-attachment-34522\" class=\"wp-caption-text\">Table 2: Small Fannie Mae Single-Family Loan Performance Data<\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">In code, the schema for the datasets is defined and both datasets are transcoded to parquet format in an initial Spark action. The parquet encoded datasets are loaded again, transformed, joined and written out as another parquet file in another Spark action thereafter. For both Spark actions the runtime is measured and printed after each action completes. Before each action, some Spark configs are set to optimize the performance of the GPU runs. 
<\/span><\/p>\n<pre class=\"lang:python decode:true \" title=\"Excerpt from Nvidia Spark-Rapids Demo\"># Excerpt from notebook available at: \r\n\r\nspark.conf.set('spark.rapids.sql.enabled','true')\r\nspark.conf.set('spark.rapids.sql.explain', 'ALL')\r\nspark.conf.set('spark.rapids.sql.incompatibleOps.enabled', 'true')\r\nspark.conf.set('spark.rapids.sql.batchSizeBytes', '512M')\r\nspark.conf.set('spark.rapids.sql.reader.batchSizeBytes', '768M')\r\n\r\n# Let's transcode the data first\r\nstart = time.time()\r\n# we want a few big files instead of lots of small files\r\nspark.conf.set('spark.sql.files.maxPartitionBytes', '200G')\r\nacq = read_acq_csv(spark, orig_acq_path)\r\nacq.repartition(12).write.parquet(tmp_acq_path, mode='overwrite')\r\nperf = read_perf_csv(spark, orig_perf_path)\r\nperf.coalesce(96).write.parquet(tmp_perf_path, mode='overwrite')\r\nend = time.time()\r\nprint(end - start)\r\n\r\n# Now let's actually process the data\r\nstart = time.time()\r\nspark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\r\nspark.conf.set('spark.sql.shuffle.partitions', '192')\r\nperf = spark.read.parquet(tmp_perf_path)\r\nacq = spark.read.parquet(tmp_acq_path)\r\nout = run_mortgage(spark, perf, acq)\r\nout.write.parquet(output_path, mode='overwrite')\r\nend = time.time()\r\nprint(end - start)<\/pre>\n<p>In the beginning, leave these configs as they are and play around with the infrastructure first. Comment these settings out when running on \u201cCPU only\u201c though. You may want to try out different numbers of workers\/instances, workers with a different number of CPU cores (yes, also when running on the GPU), different GPUs (T4, V100) and different numbers of GPUs per worker. Whatever is supported in your setup. It may be helpful to adjust the notebook to log the runtime and the environment settings. In the appendix at the end of this article, you will find some sample code to extract some important metrics to log from the Databricks environment. 
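<\/p>\n<p>A small helper to time a Spark action and repeat the measurement could look like the following sketch, where \u201caction\u201c stands for any callable that triggers the job to be timed (the name is made up for illustration):<\/p>

```python
import time
from statistics import median

def measure_runtimes(action, repetitions=3):
    """Run `action` several times; return all runtimes and their median."""
    runtimes = []
    for _ in range(repetitions):
        start = time.time()
        action()
        runtimes.append(time.time() - start)
    return runtimes, median(runtimes)

# Hypothetical usage inside the notebook:
# runtimes, med = measure_runtimes(
#     lambda: out.write.parquet(output_path, mode='overwrite'))
```

<p>Reporting the median rather than a single run smooths out run-to-run variance.<\/p>\n<p>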
Additionally, the runtime for the GPU has a higher variance than the CPU. Therefore, I would suggest repeating the measurements several times and taking the median.<\/p>\n<p>On this problem, you will probably experience that the biggest performance differences come from the number of CPU cores available (in CPU-only and GPU-accelerated mode) and whether a GPU is available or not. That is, using a more powerful GPU (e.g. V100 instead of T4) will not make much of a difference, and having more GPUs also does not improve the performance if the number of CPU cores stays the same. Since the data is not too big, this might not be too surprising. What I initially underestimated, though, is the impact of the number of CPU cores in GPU mode. Figure 1 summarizes my measurements.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Performance-and-Cost-Results-on-Demo\"><\/span>Performance and Cost Results on Demo<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<figure id=\"attachment_34525\" aria-describedby=\"caption-attachment-34525\" style=\"width: 1532px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34525 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage.png\" alt=\"Diagram of Runtime on Mortgage.\" width=\"1532\" height=\"637\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage.png 1532w, https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage-300x125.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage-1024x426.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage-768x319.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage-400x166.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Runtime-on-Nvida-Mortgage-360x150.png 360w\" sizes=\"auto, (max-width: 1532px) 100vw, 1532px\" \/><figcaption id=\"caption-attachment-34525\" 
class=\"wp-caption-text\">Figure 1: Runtime on Mortgage. <span style=\"font-weight: 400;\">Environment: Azure Databricks. Left: CPU Machines D8as v4. 8 CPU cores. 32 GB RAM. Runtime 8.2. Right: GPU Machines. NCas T4 v3 family. 4,8,16 CPU cores. 28, 56, 112 GB RAM. T4 16 GB GPU. 1 GPU per worker. Runtime 8.2 ML GPU. Spark-Rapids v.21.06.2. \u201cRuntime in sec\u201c as median over 3 runs. Transcode (green bars) denotes the initial csv to parquet transcoding (first Spark action).\u00a0Transform (blue bars) denotes the following transformation and join operation (second Spark action). Note that DB Runtime 8.2 is not supported anymore. Note that the plugin is not available for DB Runtime 8.2 anymore<\/span><\/figcaption><\/figure>\n<p>As you can see on the right of Figure 1 when using 1x worker with 4x CPU cores and a T4 GPU (1&#215;4 GPU on the right side) compared to 1&#215;8 GPU the total runtime drops from 316 sec to 135 sec (2.34x speedup). On the other hand, the 2&#215;4 GPU setup which also gives us 8x CPU cores in total but with 2x T4 GPUs is even slightly slower than the 1&#215;8 GPU cluster. As such the improvement here most probably comes from the additional CPU cores. A similar pattern can be observed when comparing the GPU setup with 8x CPU cores to 16x CPU cores (1&#215;16 GPU and 2&#215;8 GPU).<\/p>\n<p>When examining the Spark UI, we will notice that some operations are actually performed on the CPU which explains the importance of the number of CPU cores. This is caused by the plugin having a cost-based optimizer that decides whether it is beneficial to send the data to the GPU for processing or whether it is better to process on the CPU. 
However, we will explore in the next blog post that this effect also occurs when only GPU operations are included in the execution plan.<\/p>\n<figure id=\"attachment_34527\" aria-describedby=\"caption-attachment-34527\" style=\"width: 237px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34527 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Plan-to-Transcode-Perf.png\" alt=\"Spark plan for executing the transcoding of csv to parquet on the performance dataset\" width=\"237\" height=\"512\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Plan-to-Transcode-Perf.png 237w, https:\/\/www.inovex.de\/wp-content\/uploads\/Spark-Plan-to-Transcode-Perf-139x300.png 139w\" sizes=\"auto, (max-width: 237px) 100vw, 237px\" \/><figcaption id=\"caption-attachment-34527\" class=\"wp-caption-text\">Spark plan for executing the transcoding of csv to parquet on the performance dataset. The reading is executed on the CPU as indicated by the \u201cScan csv\u201c operation (instead of \u201cGpuScan csv\u201c).<\/figcaption><\/figure>\n<p>Coming back to Figure 1: When comparing the CPU to the GPU cluster, i.e. 1&#215;8 CPU (left) to 1&#215;8 GPU (right) and 2&#215;8 CPU to 2&#215;8 GPU, we observe that applying the plugin gives an approx. 2.2x speedup in both cases. Although \u201cTranscode\u201c and \u201cTransform\u201c both improve on their own, the improvement of \u201cTransform\u201c (2.45x speedup on 1&#215;8) is much larger than that of \u201cTranscode\u201c (1.63x speedup on 1&#215;8). 
This is surprising since the FAQ page states that the GPU is especially good at transcoding from csv to parquet.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34529 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ.png\" alt=\"Screenshot of FAQ page of Databricks\" width=\"957\" height=\"314\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ.png 957w, https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ-300x98.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ-768x252.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ-400x131.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Nvidia-Spark-Rapids-FAQ-360x118.png 360w\" sizes=\"auto, (max-width: 957px) 100vw, 957px\" \/><\/p>\n<p>Figure 2 displays the costs for the above runs excl. the cost for the driver. In environments other than Databricks, the driver can be the same for both CPU and GPU clusters. Therefore, its cost is a constant. Also, in real-world setups with more workers, the driver costs become negligible. In the appendix, you can find a figure incl. driver cost.<\/p>\n<figure id=\"attachment_34532\" aria-describedby=\"caption-attachment-34532\" style=\"width: 1523px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34532 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage.png\" alt=\"Cost incl. 
price for Databricks standard job compute.\" width=\"1523\" height=\"604\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage.png 1523w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage-300x119.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage-1024x406.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage-768x305.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage-400x159.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-on-Nvida-Mortgage-360x143.png 360w\" sizes=\"auto, (max-width: 1523px) 100vw, 1523px\" \/><figcaption id=\"caption-attachment-34532\" class=\"wp-caption-text\">Figure 2: Cost incl. price for Databricks standard job compute. Check Table 1 for the prices.<\/figcaption><\/figure>\n<p>We see that for the CPU clusters the costs are almost the same. This is because 2&#215;8 CPU is twice as expensive as 1&#215;8 CPU but also almost twice as fast. The GPU clusters are slightly more expensive in two cases but cheaper in three. When comparing the two cheapest variants for CPU and GPU (1&#215;8 CPU and 1&#215;16 GPU), the GPU is 38% cheaper than the CPU cluster. So the second most expensive setup in terms of price-per-hour, 1&#215;16 GPU (right after 2&#215;8 GPU, which is more expensive because of the additional GPU), is the cheapest in terms of cost-to-solution. 
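<\/p>\n<p>The cost-to-solution comparison boils down to price-per-hour times runtime. A sketch using the runtimes from Figure 1 (316 sec for 1&#215;4 GPU, 135 sec for 1&#215;8 GPU); the prices per hour here are placeholders, not actual Azure rates:<\/p>

```python
def cost_to_solution(price_per_hour_eur, runtime_sec):
    """Cost of one run: price per hour times runtime in hours."""
    return price_per_hour_eur * runtime_sec / 3600.0

runtime_1x4_gpu, runtime_1x8_gpu = 316, 135  # seconds, from Figure 1
print(f"speedup: {runtime_1x4_gpu / runtime_1x8_gpu:.2f}x")  # speedup: 2.34x

# A cluster that is twice as expensive per hour can still win on
# cost-to-solution if it is more than twice as fast:
print(cost_to_solution(1.0, runtime_1x4_gpu) > cost_to_solution(2.0, runtime_1x8_gpu))  # True
```

<p>This is why the pricier setups can still come out cheapest overall.<\/p>\n<p>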
Finally, the relevance of finding the right balance between CPU cores and the number of GPUs in the GPU cluster for the given problem becomes apparent.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Nvidias-Cost-and-Performance\"><\/span>Nvidias Cost and Performance<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Recall the documentation&#8217;s home page from the beginning of this article.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-34442 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage.png\" alt=\"Spark Rapids getting Started Page\" width=\"1257\" height=\"846\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage.png 1257w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-300x202.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-1024x689.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-768x517.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-400x269.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-452x304.png 452w, https:\/\/www.inovex.de\/wp-content\/uploads\/Docu-Homepage-360x242.png 360w\" sizes=\"auto, (max-width: 1257px) 100vw, 1257px\" \/><\/p>\n<p>It shows a figure displaying the runtime and cost improvements of the same ETL transformations measured on GCP Databricks using T4 GPUs as well. However, their results are based on a 200 GB version of the dataset and they are thus using a bigger cluster with 12x workers. They report a CPU to GPU speedup of 3.8x and a cost improvement of 50%. Their better results are probably due to their bigger setup. If there is more work to accelerate, the improvements are obviously bigger. However, it is interesting to witness that significant speedups with no higher costs on this small amount of data are possible. 
If you are interested, watch <a href=\"https:\/\/youtu.be\/4MI_LYah900?t=485\" target=\"_blank\" rel=\"noopener\">this talk<\/a> from Nvidia where they present speedups of up to approx. 25x. <span style=\"font-weight: 400;\">Although such speedups are not what we are used to from training deep learning models, they can still drastically reduce costs and facilitate development iterations for ETL applications.\u00a0<\/span><\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This should be it for the day. In summary, we have explored how to set up the plugin and what pitfalls to avoid when doing so. We have also seen that using GPUs is not necessarily more expensive (in fact it will often be cheaper) and that we can already get speedups on small datasets with low-cost GPUs. Although the computations run on the GPU, the number of CPU cores available is still relevant since not all operations are performed on the GPU due to cost-based optimization. Finally, we looked at what is possible in terms of speedup in larger settings according to Nvidia.<\/p>\n<p>In the next blog post of this series, we will have a look at the most important Spark configs for your GPU Spark application (check out the \u201cTuning\u201c tab on the website if you cannot wait for it), what other operations the plugin is (supposedly) good at, and some experiments with bigger synthetic data created by our own data generator, with up to 9x speedup and 82% cost reductions. 
Stay tuned.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Appendix\"><\/span>Appendix<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Log-Databricks-Cluster-Parameters\"><\/span>Log Databricks Cluster Parameters<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<pre class=\"lang:python decode:true\" title=\"Log Databricks Cluster Parameters\"># Run from Databricks Notebook\r\n\r\nimport requests  \r\nimport json\r\n\r\nctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()\r\nhost_name = ctx.tags().get(\"browserHostName\").get()\r\nhost_token = ctx.apiToken().get()\r\ncluster_id = ctx.tags().get(\"clusterId\").get()\r\n\r\nresponse = requests.get(\r\n    f'https:\/\/{host_name}\/api\/2.0\/clusters\/get?cluster_id={cluster_id}',\r\n    headers={'Authorization': f'Bearer {host_token}'}\r\n  ).json()\r\n\r\nlog = {\r\n    \"cluster_name\": response[\"cluster_name\"],\r\n    \"spark_version\": response[\"spark_version\"],\r\n    \"driver_node_type_id\": response[\"driver_node_type_id\"],\r\n    \"node_type_id\": response[\"node_type_id\"],\r\n    \"num_workers\": str(response[\"num_workers\"]), \r\n    \"cluster_cores\": str(response[\"cluster_cores\"]),  # includes no of driver cores as well\r\n    \"cluster_memory_mb\": str(response[\"cluster_memory_mb\"]), # includes driver memory as well\r\n    \"spark_conf\": json.dumps(response[\"spark_conf\"])\r\n}<\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Compute-Costs-incl-Driver\"><\/span>Compute Costs incl. Driver<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<figure id=\"attachment_34534\" aria-describedby=\"caption-attachment-34534\" style=\"width: 1562px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-34534 size-full\" src=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage.png\" alt=\"Compute Cost as Figure 2 incl. 
Cost for Cheapest CPU\/GPU Driver.\" width=\"1562\" height=\"623\" srcset=\"https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage.png 1562w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-300x120.png 300w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-1024x408.png 1024w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-768x306.png 768w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-1536x613.png 1536w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-400x160.png 400w, https:\/\/www.inovex.de\/wp-content\/uploads\/Costs-inkl.-Driver-on-Nvida-Mortgage-360x144.png 360w\" sizes=\"auto, (max-width: 1562px) 100vw, 1562px\" \/><figcaption id=\"caption-attachment-34534\" class=\"wp-caption-text\">Figure 3: Compute Cost as Figure 2 incl. Cost for Cheapest CPU\/GPU Driver.<\/figcaption><\/figure>\n<p>In Databricks we must use a GPU driver node, which adds significantly to the total cost-to-solution since we have a small setup with 2 workers at most. For the CPU, the additional cost is negligible since the cheapest CPU node is very cheap compared to the worker nodes. In the best case, we are still slightly cheaper with GPU.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we will look at Nvidia&#8217;s Rapids Accelerator for Apache Spark, a plugin for distributed data processing on the GPU. 