Multi-Process Service (MPS)
Cast AI supports GPU sharing through NVIDIA Multi-Process Service (MPS), which enables multiple CUDA processes to concurrently utilize a single GPU with improved performance for compute-bound workloads.
Monitor your GPU sharing efficiency with GPU utilization metrics once configured.
What is NVIDIA MPS?
NVIDIA Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA API that enables multiple CUDA applications to share a single GPU simultaneously.
Unlike GPU time-slicing, which rapidly switches execution between processes, MPS allows truly concurrent execution of GPU kernels from different processes through a client-server architecture, where:
- Multiple processes run on the GPU at the same time
- The MPS server shares a single set of GPU scheduling resources across all clients
For more information, see the NVIDIA MPS documentation.
Supported configurations
| Provider | MPS support | Notes |
|---|---|---|
| GCP GKE | ✓ | - |
| AWS EKS | Not yet supported | - |
| Azure AKS | Not yet supported | - |
How Cast AI provisions MPS nodes
- Configuration: Enable GPU sharing in your node template with the MPS strategy and sharing parameters.
- Resource calculation: Cast AI calculates extended GPU capacity as `GPU_COUNT * SHARED_CLIENTS_PER_GPU`.
- Node provisioning: The autoscaler provisions nodes with MPS configured.
- Workload scheduling: Pods continue to request `nvidia.com/gpu: 1`. On Volta and newer GPUs (compute capability ≥ 7.0), no changes to pod specifications are required. On pre-Volta GPUs, pods must set `hostIPC: true` to communicate with the MPS control daemon.
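The resource calculation step can be illustrated with a short sketch. The function name `extended_gpu_capacity` is hypothetical and exists only for this example; it mirrors the `GPU_COUNT * SHARED_CLIENTS_PER_GPU` formula and is not Cast AI's implementation.

```python
def extended_gpu_capacity(gpu_count: int, shared_clients_per_gpu: int) -> int:
    """Return the extended GPU capacity of a node under MPS sharing.

    Hypothetical helper: applies GPU_COUNT * SHARED_CLIENTS_PER_GPU.
    """
    if gpu_count < 0:
        raise ValueError("gpu_count must not be negative")
    if shared_clients_per_gpu < 1:
        raise ValueError("shared_clients_per_gpu must be at least 1")
    return gpu_count * shared_clients_per_gpu


# A node with 2 physical GPUs and 4 shared clients per GPU
print(extended_gpu_capacity(2, 4))  # 8
```

Under this formula, a node with 2 physical GPUs and 4 shared clients per GPU would have room for 8 pods that each request `nvidia.com/gpu: 1`.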
Configuring GPU MPS
GPU sharing with MPS can be configured through multiple methods.
API
Use the Node Templates API to configure GPU sharing programmatically.
Include the gpu object with sharingStrategy set to GPU_SHARING_STRATEGY_MPS:
```json
{
  "gpu": {
    "sharingStrategy": "GPU_SHARING_STRATEGY_MPS",
    "defaultSharedClientsPerGpu": 4,
    "sharingConfiguration": {
      "nvidia-tesla-t4": {
        "sharedClientsPerGpu": 4
      },
      "nvidia-tesla-a100": {
        "sharedClientsPerGpu": 8
      }
    }
  }
}
```
Terraform
Configure GPU sharing using the Cast AI Terraform provider.
Add the gpu block with sharing_strategy set to mps:
```hcl
resource "castai_node_template" "example" {
  # ... other configuration
  gpu {
    sharing_strategy               = "mps"
    default_shared_clients_per_gpu = 4
    sharing_configuration = {
      "nvidia-tesla-t4" = {
        shared_clients_per_gpu = 4
      }
      "nvidia-tesla-a100" = {
        shared_clients_per_gpu = 8
      }
    }
  }
}
```
Workload configuration
When using GPU MPS, pods continue to request GPUs using the standard nvidia.com/gpu resource. The key difference is targeting nodes with MPS enabled through node selectors and tolerations.
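As a rough illustration of how these pieces fit together, the following sketch assembles a pod spec as a plain Python dictionary. The function `mps_pod_spec` and the constant `VOLTA_COMPUTE_CAPABILITY` are hypothetical names used only here. The toleration assumes the node template taint uses the `scheduling.cast.ai/node-template` key, so adjust it to match the taint your template actually applies. The `hostIPC` branch follows the pre-Volta rule stated earlier.

```python
import json

# Volta GPUs start at compute capability 7.0 (the threshold this page uses)
VOLTA_COMPUTE_CAPABILITY = 7.0


def mps_pod_spec(template_name: str, image: str, compute_capability: float) -> dict:
    """Build a pod spec dict targeting an MPS-enabled node template.

    Hypothetical helper for illustration only.
    """
    spec = {
        # Target nodes created from the MPS-enabled node template
        "nodeSelector": {"scheduling.cast.ai/node-template": template_name},
        # Assumed taint key; replace it if your template uses a custom taint
        "tolerations": [
            {
                "key": "scheduling.cast.ai/node-template",
                "operator": "Exists",
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": "gpu-workload",
                "image": image,
                # The GPU request is unchanged when sharing through MPS
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    }
    # Pre-Volta GPUs need the host IPC namespace to reach the MPS control daemon
    if compute_capability < VOLTA_COMPUTE_CAPABILITY:
        spec["hostIPC"] = True
    return spec


print(json.dumps(mps_pod_spec("gpu-mps-template", "nvidia/samples:nbody", 6.1), indent=2))
```

Generating the spec from the GPU's compute capability keeps `hostIPC: true` off Volta and newer nodes, where it is not needed.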
Basic MPS workload
```yaml
spec:
  nodeSelector:
    scheduling.cast.ai/node-template: "gpu-sharing-template"
  tolerations:
    - key: "gpu-sharing-template"
      value: "template-affinity"
      operator: "Equal"
      effect: "NoSchedule"
  containers:
    - image: my-image
      name: gpu-test
      resources:
        requests:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          memory: 1Gi
          nvidia.com/gpu: 1
```
Volta and newer GPUs (compute capability ≥ 7.0)
No additional pod configuration is required beyond standard GPU resource requests.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-mps-workload
spec:
  replicas: 4 # Can schedule 4 pods on a single GPU with 4x sharing
  selector:
    matchLabels:
      app: gpu-mps-workload
  template:
    metadata:
      labels:
        app: gpu-mps-workload
    spec:
      nodeSelector:
        scheduling.cast.ai/node-template: "gpu-mps-template"
      tolerations:
        - key: "scheduling.cast.ai/node-template"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: gpu-workload
          image: nvidia/samples:nbody
          resources:
            limits:
              nvidia.com/gpu: 1
```
Pre-Volta GPUs (compute capability < 7.0)
Pre-Volta GPUs require hostIPC: true so that containers can communicate with the MPS control daemon through the host's Inter-Process Communication namespace.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-mps-workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-mps-workload
  template:
    metadata:
      labels:
        app: gpu-mps-workload
    spec:
      hostIPC: true
      nodeSelector:
        scheduling.cast.ai/node-template: "gpu-mps-template"
      tolerations:
        - key: "scheduling.cast.ai/node-template"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: gpu-workload
          image: nvidia/samples:nbody
          resources:
            limits:
              nvidia.com/gpu: 1
```
Node labels and taints
Cast AI automatically applies the following labels to MPS-enabled nodes:
| Label | Example value | Description |
|---|---|---|
| scheduling.cast.ai/gpu-shared | 4 | Maximum number of shared clients per GPU |
| scheduling.cast.ai/gpu-sharing-strategy | mps | Indicates that the MPS sharing strategy is configured |
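Tooling that inspects nodes can use these labels to detect MPS sharing. The sketch below assumes the labels have already been read into a dictionary, for example from a node's `metadata.labels`. The function name `mps_shared_clients` is hypothetical.

```python
from typing import Mapping, Optional

SHARING_STRATEGY_LABEL = "scheduling.cast.ai/gpu-sharing-strategy"
SHARED_CLIENTS_LABEL = "scheduling.cast.ai/gpu-shared"


def mps_shared_clients(labels: Mapping[str, str]) -> Optional[int]:
    """Return the shared clients per GPU for an MPS node, or None otherwise.

    Hypothetical helper: `labels` is a node's label map.
    """
    # Only nodes labeled with the "mps" strategy count as MPS-enabled
    if labels.get(SHARING_STRATEGY_LABEL) != "mps":
        return None
    value = labels.get(SHARED_CLIENTS_LABEL)
    if value is None:
        return None
    # Label values are strings, so convert to an integer
    return int(value)


node_labels = {
    "scheduling.cast.ai/gpu-sharing-strategy": "mps",
    "scheduling.cast.ai/gpu-shared": "4",
}
print(mps_shared_clients(node_labels))  # 4
```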
Monitoring GPU utilization
Once GPU MPS is configured, monitor your GPU sharing efficiency with GPU utilization metrics.
These metrics help you:
- Track GPU compute utilization across shared workloads
- Identify GPU memory waste
- Analyze cost efficiency of your sharing configuration
- Optimize sharing multipliers based on actual usage patterns
