Run Scalable GPU Workloads on IBM Cloud VPC with Ray and CUDA

Authors: Erez Hadad, Cohen J Omer — IBM Research


In Short

Large-scale computations often require a lot of hustling beyond programming. You need resources to get your algorithm running, typically a compute cluster, often with GPUs (e.g., for HPC or ML workloads). If you get your cluster from a public cloud service, as many do, there are two important aspects to consider. First, security and privacy: you want to make sure that the cluster resources can be used only by you, to protect your code and data and to avoid paying for unsolicited use. Second, cost efficiency: you want to use the cluster resources efficiently to minimize your costs, e.g., by shutting down unused nodes. In this blog we present an easy and practical process, starting from scratch, for provisioning a cloud cluster on IBM Cloud VPC and deploying a scalable Python workload on it, securely and cost-efficiently.

Background

IBM Cloud VPC is a public cloud service that allows users to quickly provision clusters of powerful Virtual Server Instances (VSIs, i.e., VMs) for demanding workloads, with plenty of CPU cores, memory, storage, and one or more high-end GPUs such as the Nvidia V100. One popular GPU-intensive workload type is Machine Learning (ML), where AI models are trained and/or served. Our example workload is going to be a model server, and we will test it with a matching remote client.

Our workload is built on Ray, a popular framework for easily scaling out Python applications across multiple CPUs and machines, which also supports GPU resources (via CUDA) and specializes in integrating with ML platforms. A Ray cluster (see image below) consists of multiple servers used for compute (mostly via Python). One of them is the head node, designated for cluster management and service access; the Ray cluster service typically includes submitting workloads for execution and monitoring cluster behavior via the Ray dashboard. The rest are worker nodes used for compute. Ray clusters can automatically scale out/in in response to workload resource demand, which we will show how to exploit to save costs.

Ray Architecture
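To give a feel for the programming model before diving into deployment, here is a minimal, self-contained sketch (not taken from the demo repo) of how Ray fans a Python function out across a cluster; the num_gpus annotation is what lets Ray place work on GPU nodes:

import ray

ray.init()  # starts a local Ray instance; use address="auto" on a cluster node to join it

# Each invocation reserves one GPU on some worker node.
# On a machine without GPUs, drop num_gpus to try this locally.
@ray.remote(num_gpus=1)
def infer(prompt: str) -> str:
    # ... run a CUDA-backed model on the reserved GPU ...
    return prompt.upper()

# Ray schedules these calls in parallel across whatever nodes are available.
futures = [infer.remote(f"sample {i}") for i in range(4)]
print(ray.get(futures))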

In the next part of this blog, we walk you through the installation and deployment process. In the following parts, we discuss the deployment considerations and dive a little deeper into the application itself to understand how the deployment choices affect the actual behavior and cost of workload execution.

Installation Process

We now go through the process of getting the workload to execute on IBM Cloud VPC, starting from scratch. The only prerequisites you need in order to follow this process are a (non-free) IBM Cloud account and a minimal environment: a Linux bash shell with git and Python. To get through the process quickly and observe the results, simply execute the commands at the beginning of each step.

Step 1: Set up a local development environment

git clone https://github.com/erezh16/ibm-ray-vpc-demo

Start by cloning the example workload’s git repo to a local folder on your laptop. You can experiment with the workload locally by following the instructions in the included README.md.

Step 2: Get a VSI image with CUDA and Ray

pip install vpc-img-inst
vpc-img-inst -f ray -f cuda [-r <region>]

Using GPUs in your IBM Cloud VPC cluster requires that Nvidia's CUDA (library and drivers) be installed in the VSIs (VMs) of the cluster. Additionally, for quick provisioning, it is best to have Ray pre-installed in the VSIs. If you don't already have a VSI image with CUDA and Ray, you can easily create one by installing and running the VPC image installer tool, as shown in the commands above. The optional IBM Cloud region argument (-r) should be the region where you plan to create your Ray cluster, since the custom image needs to be created there. Be sure to keep the resulting image name for the next steps.

Step 3: Install required components

pip install ray[default] ibm-ray-config

To use Ray on IBM Cloud VPC, you need to install several Python components in your dev environment (it is best to create a virtual environment and install and work inside it). Beyond Ray itself, the above command actually installs two VPC-related components: a driver library for Ray called ibm-vpc-ray-connector and a configuration tool called ibm-ray-config.

Step 4: Create Ray cluster & configuration

ibm-ray-config

The tool is interactive, taking you through a small set of questions to gather the required information for your cluster. Most questions come with reasonable default answers, so you can just press ENTER to quickly move forward. The screenshots below demonstrate the construction of a cluster with GPUs. Let's take a closer look to understand the main parts.

ibm-ray-config cloud questions

The first part covers cloud-level settings. Start by providing your IBM Cloud API key. If you don't have an API key, you can create one. Next, choose the VPC in which to create the Ray cluster: select a cloud region, choose to create a new VPC, select an availability zone for the new VPC, and finally name your VPC. From a management perspective, it's better to provision each Ray cluster in a separate VPC, but this also means you're paying for an extra VPC.

ibm-ray-config security questions

The next few questions deal with VPC security. First, select whether to open the Ray service to public access by exposing its ports (recommended, and chosen in this demo: no). Next, choose to create a new SSH keypair to be registered in the VPC, allowing secure private access to Ray. It is recommended to create a new keypair for each new cluster, to reduce exposure of your default SSH keys, at no extra cost.

ibm-ray-config VSI questions

The next questions configure the VSIs in the VPC. First, select the image to be used for the VSIs, out of those available in the region. You should select the image that you prepared in Step 2. Next, assign a floating (publicly routable) IP to the Ray cluster's head node (VSI) so that it can serve as the access point. If there are no free floating IPs in your selected region, choose to reserve a new one.

ibm-ray-config Ray questions

The last set of questions involves configuring Ray itself. First, name your Ray cluster. Then, select a VSI profile for the head node. In the next questions, you can choose to use that profile for worker nodes as well, and to enable additional profiles for workers. Each profile used for workers also requires setting a minimum and maximum number of instances.

For the purpose of this demo, select bx2-2x8 as the profile for the head node, not to be used for workers, and select gx2-8x64x1v100 as an additional profile for workers, with a minimum of 0 and a maximum of 3. Finally, you can choose to cap the total number of workers below the default sum of maxima across all profiles, e.g., for budget purposes. In our demo, press ENTER to keep the default.

The tool finishes by creating a new folder under the current folder, containing the Ray cluster configuration, SSH keys, and a set of convenience scripts. You can find more details on this output in the Appendix further below.

Step 5: Launch the cluster

<cluster folder>/scripts/up.sh
<cluster folder>/scripts/connect.sh

Now that everything is set up, let’s launch the cluster and connect to it. The easiest way to go about it is to execute the auto-generated scripts listed above. These scripts will start the cluster and securely connect to it via port forwarding, yielding output similar to the screenshots below.

Output of up.sh [1/2]
Output of up.sh [2/2]
Output of connect.sh

Note that the connect.sh script finishes successfully with an endless loop (the result of ray dashboard), so either put it in the background or just open another terminal to continue working with the cluster. Let's start by viewing the cluster's dashboard. Just open a browser and enter the URL http://localhost:8265. Remember that local port 8265 is actually port 8265 of the cluster's head node, forwarded to your laptop. The screen below demonstrates what you should get. With this console, you can see the current Ray cluster nodes, monitor running jobs, look at logs, and much more.

Ray Dashboard

In your IBM Cloud console, select “VPC Infrastructure” in the left side menu to switch to the VPC console, and then select “VPCs” in the menu to list the active VPCs. In the table view, select the region where you configured your Ray cluster and you should see its VPC listed, similar to below:

VPC on IBM Cloud console

Next, select "Virtual server instances" in the left side menu and you should see the head node of the Ray cluster, with its name matching the host name of the head node in the Ray console shown previously. There should be no worker nodes for the time being, based on the configuration we chose above: workers will be added only when workloads require additional resources.

VSIs on IBM Cloud console

Step 6: Deploy the application

<cluster folder>/scripts/submit.sh <path_to>/tutorial_batch.py
<cluster folder>/scripts/tunnel.sh 8000

The cluster is running and the laptop is connected to it. Let’s deploy our GPU-hungry workload in the cluster. This requires a Ray command (ray submit) which you can invoke using a script in the cluster folder — the first command above. <path_to>/tutorial_batch.py is the full path to the file tutorial_batch.py in the git repo that you cloned in step 1. This file contains the full application.

Output of submit.sh
Output of tunnel.sh

The output you will see is similar to what's shown in the submit.sh screenshot above. The command keeps running until the workload finishes, generating additional log output as the workload executes. In our case of a model service, it should run indefinitely unless we force Ray Serve to stop or delete the cluster. The service port that Ray Serve sets up is 8000, so we need to tunnel this port from the laptop as well; that's the second command above, with the respective tunnel.sh output shown above.

Step 7: Test the service

python <path_to>/batch_client.py [-r concurrent_requests]

Now, let's put the deployed service to the test. The cloned git repo includes a client file called batch_client.py. The client itself is built on Ray, allowing it to invoke the service concurrently (default: 10 concurrent requests), create a predefined load, and measure the request round-trip time (RTT). To invoke the client, run the above command. You should see output similar to the example below.

Output of test client [1/2]
Output of test client [2/2]
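For reference, here is a rough, hypothetical sketch of how such a Ray-based client can be structured; the actual batch_client.py in the repo may differ, e.g., in the request format (the 'text' query parameter below is an assumption borrowed from Ray's batching tutorial):

import time
import ray
import requests

ray.init()  # a local Ray instance on the laptop is enough to fan out requests

@ray.remote
def send_request(prompt: str) -> float:
    """Send one request to the tunneled service and return its round-trip time."""
    start = time.time()
    requests.get("http://localhost:8000/", params={"text": prompt})
    return time.time() - start

# Fire 10 requests concurrently, mirroring the client's default concurrency.
rtts = ray.get([send_request.remote("Once upon a time") for _ in range(10)])
print(f"average RTT: {sum(rtts) / len(rtts):.2f} s")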

Done!

Considerations During Deployment

Security and Privacy

An important security question is whether to expose the Ray service ports in the created cluster. If those ports are exposed, the Ray service can be accessed insecurely through port 8265 by running ray job submit from any machine outside the cluster, so exposing these ports is strongly discouraged by default. Instead, users can get secure access to the Ray service by tunneling port 8265 over SSH, as done by connect.sh in Step 5 above. You should choose to expose these ports only if this is not a security risk, for example, if both the client machine and the Ray cluster are in the same VPC, isolated from the outside world.
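As an illustration of what tunneled access looks like from code, a job can be submitted over the forwarded port with Ray's job submission SDK; this is a sketch, not what the repo's convenience scripts literally do, and the working_dir path is hypothetical:

from ray.job_submission import JobSubmissionClient

# localhost:8265 is the head node's dashboard/job port, reachable only through the
# SSH tunnel opened by connect.sh, i.e., only by holders of the cluster's private key.
client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python tutorial_batch.py",
    runtime_env={"working_dir": "./ibm-ray-vpc-demo"},  # ship the cloned repo to the cluster
)
print(client.get_job_status(job_id))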

Cost-Efficiency

A naïve setting is a fixed cluster of uniform-profile nodes (servers), e.g., all nodes being GPU nodes. However, note the following. First, the head node of a Ray cluster keeps running (and being charged for) as long as the cluster is up, even if no workloads are deployed. Second, a node with a powerful GPU may cost more than 10x as much as the same node without a GPU (this is true in other clouds as well). This means you can save significant costs by scaling in the cluster (removing nodes) during periods when no workloads require GPUs, or when no workloads run at all, by leveraging Ray's dynamic auto-scaling.

The way we recommend doing that for GPU workloads is to select a no-GPU profile for the head node (in our example, bx2-2x8), and a separate GPU-enabled profile (e.g., gx2-8x64x1v100) for worker nodes, with a minimum count of zero and a maximum count that matches your considerations (budget, performance, etc.). That way, GPU nodes are provisioned only when needed.

If your cluster runs a mix of CPU and GPU workloads, it might make sense to define additional CPU-only profiles for workers, with a minimum of zero and a maximum of your choosing. This means that when the cluster is idle, only a single cheap CPU-profile head node is running. When CPU-only workloads are running, Ray will provision additional CPU-only worker nodes, saving you the cost of the more expensive GPU nodes. If you already submitted a GPU workload and got GPU nodes running, deploying another CPU workload will use the spare CPU resources on the GPU nodes, and provision additional CPU nodes only if capacity is missing, up to the limits you set. All additional worker nodes are removed within minutes after the last workload completes.
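The resource annotations on your Ray tasks and actors are what drive this behavior. As a hedged sketch (hypothetical function names), a mixed workload declares its needs like this, and only the GPU-annotated work can trigger provisioning of the expensive GPU profile:

import ray

ray.init(address="auto")  # when run on the cluster, e.g., via submit.sh

@ray.remote(num_cpus=2)  # satisfied by the head node or any CPU-only worker
def cpu_task(n: int) -> int:
    return sum(range(n))

@ray.remote(num_gpus=1)  # makes the autoscaler bring up a GPU-profile worker if none is free
def gpu_task(prompt: str) -> str:
    # ... CUDA-backed inference ...
    return prompt

cheap = [cpu_task.remote(10_000) for _ in range(8)]
pricey = [gpu_task.remote("hello") for _ in range(2)]
ray.get(cheap + pricey)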

You can test a mixed cluster yourself as follows. First, create a cluster using ibm-ray-config with settings similar to those described in Step 4 above, except that you choose to use the head-node profile for worker nodes as well (since it's a CPU profile), giving it a minimum of 0 and a maximum of 3. Continue the process. At Step 6, you have two different workloads to deploy: one is the demo GPU workload tutorial_batch.py that we already discussed, and the other is a CPU-only workload called ray-pi.py from the same git repo. If you want to see GPU nodes provisioned, be sure to use the test client, as explained in Step 7.

Understanding The Workload

Now that you managed to deploy and execute the application, let’s look a little deeper into it, and understand its behavior in the context of the configuration choices we made earlier. Our example workload is serving the GPT2 model from OpenAI. This model is a Natural Language Processing (NLP) model that, among other uses, can be used to answer text questions or complete sentences. Its “knowledge” is based on a large textual dataset collected from the world wide web. Serving this model means deploying it inside a service workload which can be accessed by remote clients. When a client sends a request, the service invokes the model to process the request data (inference) and returns the result.

To create the model service, the code leverages Ray Serve, which lets application developers define only the logic for handling service requests while it internally handles all the service plumbing. To easily load and use the model, the code relies on Hugging Face Transformers, which automatically retrieves the model resources from the global Hugging Face repository and creates a simple interface for model inference. Under the hood, Hugging Face wraps one of the major ML platforms, which performs the actual model inference; in this blog, the platform we use is PyTorch.
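As a standalone illustration of that single-line setup (separate from the repo code shown below), loading and invoking GPT2 through a Transformers pipeline looks roughly like this:

from transformers import pipeline

# Downloads the GPT2 weights from the Hugging Face hub on first use.
# Runs on the PyTorch backend; pass device=0 to place the model on a GPU.
generator = pipeline("text-generation", model="gpt2")

print(generator("IBM Cloud VPC lets you", max_length=30)[0]["generated_text"])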

Code of tutorial_batch.py (except the imports)

The main component in the code is the BatchTextGenerator class (lines 17–34). An object of this class takes a batch of input samples (gathered from received requests) and performs GPT2 model inference on each of the samples in the batch. Note the data member model, created using a Hugging Face pipeline() in line 24, in the object constructor. This single line creates an entire workflow around the GPT2 model that allows it to process text input and generate text output. Then, in the handle_batch() method, which handles the received requests, the model member is used to actually perform the inference on the input samples (line 30).

The next important piece is the @serve.deployment() decorator in lines 11–16, which takes the BatchTextGenerator class and turns it into a micro-service (managed by Ray Serve) that actually handles input requests received over a network port (port 8000, mentioned above) and sends back results. Note the min_replicas and max_replicas fields: they control the number of BatchTextGenerator instances to be dynamically created in response to request load, from 0 up to 3. When there is no load, there are no (0) instances of BatchTextGenerator, and the count can grow up to 3 instances at maximum load, each processing requests concurrently with the others. The last important field is ray_actor_options, which defines the resources needed by each instance; we can see it requires 1 GPU. Thus, under maximum load the application may consume up to 3 GPUs.

Recall that the maximum number of GPU-bearing nodes in the cluster was also chosen to be 3 in the configuration (step 4). This is, of course, not coincidental. Unless the cluster is able to scale out to this many 1-GPU nodes, there won’t be enough GPUs to accommodate the service scale-out as defined in the decorator.
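For readers who prefer text to screenshots, here is a simplified sketch of what the deployment roughly looks like. It is adapted from Ray's public batching tutorial rather than copied from the repo, so details such as exact parameter names, batch size and routing may differ:

from ray import serve
from transformers import pipeline

@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # each replica reserves one GPU
    autoscaling_config={"min_replicas": 0, "max_replicas": 3},  # scale with request load
)
class BatchTextGenerator:
    def __init__(self):
        # One line builds the whole text-generation workflow around GPT2.
        self.model = pipeline("text-generation", model="gpt2", device=0)

    @serve.batch(max_batch_size=10)
    async def handle_batch(self, inputs):
        # Run a single batched inference over all prompts gathered from concurrent requests.
        results = self.model(list(inputs))
        return [r[0]["generated_text"] for r in results]

    async def __call__(self, request):
        return await self.handle_batch(request.query_params["text"])

serve.run(BatchTextGenerator.bind())  # exposes the service on port 8000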

The resulting behavior of the application running on the cluster is completely elastic, scaling resource requirements (incl. GPUs) as needed. When the application is launched, you can see that the cluster remains at 1 node (the head node only), because there is no request load yet. When we run the client with the default concurrency of 10 (the maximum batch size), a new instance of BatchTextGenerator needs to be created, requiring 1 GPU, so one additional worker node with a GPU is created and used to serve the requests. When we run the client with higher concurrency values, additional instances and worker nodes are created to match the higher load, all automatically, up to the specified maximum of 3. A few minutes after all requests are served, the BatchTextGenerator instances and their respective nodes are removed and the cluster scales back in to the head node only. You pay (almost) only for what you use.

Conclusion

In this blog we demonstrated how to leverage GPUs (and any additional resource) at scale for demanding applications on IBM Cloud VPC using Ray, in a simple deployment process. The resulting cluster is secure, accessible and usable only by the user holding the configuration folder. We further explained how to configure the cluster so that the expensive resources, namely GPUs, are charged for only during application execution, and in proportion to the load (for elastic applications).

The above process could easily be adapted for workloads requiring fixed resource settings, such as model training or simulations using a fixed number of GPUs for the execution. You may not even need to change the cluster configuration. Just specify the fixed resource requirements (min=max) of your application’s components in the appropriate Ray decorators and let the magic happen, with nodes being provisioned for the execution and cleaned up afterwards.
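For example, a fixed-size variant of the same service could simply pin the replica count instead of autoscaling; a hypothetical sketch:

from ray import serve

@serve.deployment(
    num_replicas=2,                     # fixed: exactly two replicas, no autoscaling
    ray_actor_options={"num_gpus": 1},  # so two GPU workers are provisioned for the deployment
)
class FixedTextGenerator:
    async def __call__(self, request):
        return "..."  # model inference as before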

Acknowledgement

The research leading to these results has received funding from the European Union’s Horizon Europe Programme under the EXTRACT Project https://extract-project.eu, grant agreement n° 101093110.

Appendix: Output of ibm-ray-config

After running ibm-ray-config, it creates a sub-folder under the current folder where it was run. The name of the sub-folder is structured as follows: <cluster>-XXXXX-at-<vpc>-YYYYY-in-<region>, where XXXXX and YYYYY are the UUIDs of the cluster and the VPC, embedded in their names, respectively. Hence, this folder name essentially links all the pieces together by recording which VPC your Ray cluster belongs to and in which region that VPC resides, so you can easily keep track of all your clusters.

Inside the folder there are several files. First, there is the YAML file for Ray to operate the cluster, called config.yaml. Then, there are the SSH keys — both private and public — for securely communicating with the cluster, in the files called <username>-XXXXX and <username>-XXXXX.pub, respectively (XXXXX being a UUID). These keys are also registered in the cloud region, and their names help to tell your keys from those of other users sharing the same account, resource group and region.

Last, there is an internal sub-folder called scripts that contains a toolbox of scripts for working with the Ray cluster. Those scripts can be run from anywhere, and they use the configuration and SSH keys of this specific cluster where needed.

  1. up.sh — sets up the cluster (ray up) by creating the head node and the minimum number of worker nodes of each profile.
  2. down.sh — tears down the cluster (ray down).
  3. down-vpc.sh — tears down the cluster along with the VPC that contains it and its associated resources.
  4. connect.sh — establishes a secure connection from your laptop to the running Ray cluster (ray dashboard) by forwarding the service port 8265 over an SSH tunnel to the Ray head node.
  5. disconnect.sh — disconnects from the Ray cluster (kills the tunnel).
  6. submit.sh — submits computation jobs (Ray Python programs) for execution in the created Ray cluster.
  7. tunnel.sh — securely tunnels additional ports from your laptop to the created Ray cluster. This is required, e.g., for using Ray Serve-based services, as shown in Step 6 above.
  8. ray.sh — a drop-in replacement for the ray CLI tool that covers all its commands, in the context of the current cluster. For commands that require the cluster YAML as an argument, just type config.yaml and it will use the cluster's specific configuration. For commands requiring the service address (e.g., ray job submit), use localhost:8265 when connected to the cluster. All file arguments (except config.yaml) must use an absolute path.
