A graphics card with 1 active core

Data Processing Scaled Up and Out with Dask and RAPIDS: Setting up a Kubernetes Cluster with GPU Support (1/3)

Lesezeit
12 ​​min

This blog post tutorial shows how a scalable and high-performance environment for machine learning can be set up using the ingredients GPUs, Kubernetes clusters, Dask, Rapids and Jupyter. The whole blog post series consists of three parts. In this part, we will prepare a  Kubernetes cluster where containerized applications can utilize the processing power of GPUs. In the second part, we take a look at how to prepare docker images for our notebooks in order to utilize Dask, Rapids (GPUs) and Dask-Rapids. In the third part, we then evaluate a simple random forest implementation using only CPUs of a single machine (Sklearn), multiple machines (Dask), and then a single GPU (Rapids) and multiple GPUs (Dask-Rapids).

A few Words Before We Begin …

In times when datasets are getting larger and larger, there is an increasing need for efficient ways of processing them. Computer clusters – together with appropriate tools – allow distributing large amounts of data across many machines and effectively reduce the execution time for data processing workloads through parallelization. On the other hand, using the General Purpose Computation on Graphics Processing Unit (GPGPU) – or simply said, using GPUs for data processing – may lead to enormous speed-ups compared to the execution on CPUs. Combining cluster-computing with GPGPU seems to be a natural choice in terms of fast and parallel processing of extensive amounts of data, and it is getting more and more popular.

There are quite a few tools that allow us to scale and parallelize applications on a cluster. One of them is Dask. It is an open-source project that aims at parallelizing code and seamless scaling. Its API is based on Pandas and NumPy which tremendously shortens the learning curve for people already familiar with these Python libraries. Dask allows processing data volumes, for example as a dataframe, exceeding the RAM of a single machine by distributing them across many workers. For the end-user however, there is only one dataframe that he accesses just like he would do with a Pandas dataframe. In this case, many workers can work on a single dataframe in parallel.

Another important feature is how easy it is to parallelize already existing code. This can be done with minimal changes only. With Dask it is possible to execute code in parallel on a local machine when Dask is allowed to use all available cores of the CPU. In this article we will focus on a different approach: We will set up a Kubernetes cluster and execute Dask in this context.

GPU Computing

GPU programming is nothing new and, for example, CUDA programming is an already established way of using GPUs for different applications. However, learning efficient CUDA programming may take a while. Fortunately, new tools are being developed which are more user friendly.

One of them is the RAPIDS library from NVIDIA. It aims at executing end-to-end data science and analytics pipelines completely on GPUs. It consists of two main parts. We have the cuDF library which basically adopts the Pandas API for GPU-based DataFrame access and then we have the cuML library which focuses on machine learning. Its API is very similar to the one of Scikit-learn. Interestingly, the RAPIDS library can be combined with Dask, resulting in Dask-cuDF and Dask-cuML! Both allow us GPU-Cluster programming which is quite user-friendly, especially to people who are familiar with Pandas, NumPy and Scikit-Learn.

When deciding for a cluster, there are many possibilities, like a bare-metal cluster, the Google Cloud Platform (GCP), AWS and so on. Which one we use does not matter, as long as we use Kubernetes. Citing Wikipedia: “Kubernetes is an open-source container-orchestration system for automating application deployment, scaling and management across clusters of hosts“. Finally, we need an environment to start our Dask (Dask-Rapids) applications. We will deploy a JupyterHub on our Kubernetes cluster. This will provide us with a platform for data science projects where we can test and experiment with Dask and Dask-Rapids.

Summing up, the goal of this blog post series is to demonstrate how to set up a data science platform on Kubernetes, which will allow us to use Dask and Dask-Rapids libraries. We will start from scratch by creating a cluster with a few GPUs on the GCP. Then we configure the cluster in order to use these GPUs, to deploy JupyterHub and to use Dask on it. We will have a short discussion on how to store data on the GCloud. We will go through the Dockerfiles for Jupyter, which installs Dask and (Dask-)Rapids libraries on it. And at the end, we will have a short comparison between CPU and GPU based programming in terms of efficiency.

Kubernetes Cluster Setup

Before we can create our cluster, there are a few things that we are gonna need. First, we need to create a new project on the GCP, install (in our local terminal) the gcloud command-line tool and make sure it is set to our project. Then, in order to manage Kubernetes, we need to have Kubectl installed and set up. Having Docker installed is also necessary since we want to build our own images for Jupyter or Dask workers.

Setup Kubernetes

Assuming the above requirements are fulfilled, we can create a new cluster on the GCP. Let’s set it up with two nodes, each with two GPUs Tesla T4 and the node type n1-highmem-4 with 26GB RAM each. Keep in mind, not all regions offer the T4 GPUs.

After a while, the cluster should be up and running and we can create a namespace for our project-pods:

… and make it default:

Now, we need a package manager for Kubernetes to deploy JupyerHub later. Helm is exactly what we need. While the new version of Helm, Helm3, is not officially supported by JupyterHub, Helm2 is the safest way to go. However, Helm2 has compatibility problems with Kubernetes 1.16 and above. If you work with Kubernetes <1.16, go with the below description, which is based on Helm2. If you work with the newer Kubernetes versions, skip the description below and go to the section “Helm3 and Kubernetes >=1.16”.

Helm2 consists of a Helm client, that runs locally on your machine, and a Helm server (Tiller), that runs on your Kubernetes cluster.  Installing the client is straightforward, for example with snap (Ubuntu):

Before installing Tiller, we need to take care of the role-based access control (RBAC), meaning creating a service account for the Tiller with the right roles and permissions to access resources:

Now, install Tiller on cluster:

Check if helm works by checking the version:

This line ensures the Tiller is secure from access inside the cluster:

Helm3 and Kubernetes >=1.16

Although not officially supported, Helm3 can be used to deploy JupyterHub. The setup is even simpler as the one for Helm2. You basically only need to install Helm3 and the Tiller setup is absent, since Helm3 does not use one for installing charts. It rather relies on Kubernetes’ internal mechanisms. This offers advantages in terms of security. You only need to install Helm3 on your local machine, for example with curl (Ubuntu), using the following command:

And that’s it. No need for extra configuration of service account or initiating Tiller. You can check the installation with:

Storage

If we want to deploy Dask workers to process our data, first we need to have a “place” to store said data. It is important that it is accessible not only from Dask Client (Jupyter) but also from every Dask worker we deploy.

We could store the data in every Dask worker (and Client) image. But, for obvious reasons, that is a bad idea. It would be a lot smarter to have the dataset stored at one place accessible for both, Dask Client and workers. There are two possibilities to achieve that – an NFS server or the Buckets. Now, which one is more suitable, depends on your setting.

I went for the Buckets from Google Storage, since we are looking for big data processing and NFS scales badly. Both Dask and Dask-Rapids support reading from Buckets so it is a valid and good solution.  Buckets are easy to create but one has to keep the access rights in mind. Firstly, let’s create a Bucket. In the GCloud GUI go to Navigation Menu -> Storage -> Browser. From there click on Create Bucket. Give it a name, choose a zone/region (preferably the same as your cluster to avoid unnecessary latency), and specify a few other details.

Finally,  under Navigation Menu -> Storage -> Browser you can see your Bucket and upload files to it. I uploaded a single .csv file which we will use later.

We have to keep in mind that access rights are important. We want to access the Bucket (the .csv file) from Jupyter and for that we need sufficient permissions. One way to handle it is to create a Service Account with Storage, Create & View rights and mount it as a configmap to our pods (Jupyter and for example Dask workers). Firstly, let’s create the Service Account from the GCloud UI and download the JSON key.

Go to Navigation Menu -> IAM & Admin -> Service Accounts.  From there click on CREATE SERVICE ACCOUNT.  Give your Service Account a name (only field necessary for our needs) and click on CREATE.  After this, you can add roles to it. We need two of them: Storage Object Creator and Storage Object Viewer. Click CONTINUE and DONE afterward.

Now we want to generate a JSON key: find the previously created Service Account and under the Actions menu (three points) you will see the CREATE KEY button. Click on it, select JSON and CREATE. The download should start.  Now, we need to create a configmap from that JSON file: In your local terminal, go to the directory with the JSON credentials and execute the following command (provide the correct name of your config file):

That is it for now. Later we will mount this configmap to our pods.

GPU DaemonSet

The GPUs should be available in our cluster (you can check it by using kubectl describe nodes and see the allocated resources) but, since there are no drivers installed, they are not usable yet. To install the drivers, run:

This deploys a DaemonSet provided by Google that takes care of installing device drivers to the nodes. A DaemonSet ensures that all (or selected) nodes run a copy of a pod. No matter how many nodes we have, no matter how many GPUs, we have to do this only once.

Outlook on Part 2

By now you should have a Kubernetes cluster with GPUs enabled and ready to use.  We have bucket storage and we are ready to deploy applications with Helm. In the next article, we prepare a proper notebook image from which we can use GPUs and deploy JupyterHub.

Hat dir der Beitrag gefallen?

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.

Ähnliche Artikel