Runtime Zero

Running AI Workloads on vSphere: A Practical Guide

GPU passthrough, vGPU profiles, and NUMA-aware scheduling for ML training jobs on vSphere 8. What works, what doesn't, and how to avoid burning GPU hours on misconfigured VMs.

MF

The demand to run AI and ML workloads on existing vSphere infrastructure is real and growing. Research teams don't want to wait for cloud accounts, and finance teams don't want to pay cloud GPU prices for long-running training jobs. Here's the honest guide to making it work.

GPU Access: Passthrough vs vGPU

You have two options for exposing NVIDIA GPUs to VMs: DirectPath I/O passthrough and NVIDIA vGPU (via the GRID software stack).

Aspect            | Passthrough                  | vGPU
Isolation         | Full: one GPU per VM         | Shared: multiple VMs per GPU
vMotion           | Not supported                | Supported (with vGPU-aware vMotion)
Driver complexity | Low (standard NVIDIA driver) | High (GRID guest driver + vGPU manager on host)
Best for          | Large training jobs          | Inference serving, dev environments

For training workloads (the dominant use case in most shops), passthrough is simpler and avoids the GRID licensing cost. For inference serving where GPU utilisation is bursty, vGPU's time-slicing makes better use of hardware.

NUMA Alignment is Non-Negotiable

A VM pinned to a NUMA node that doesn't own the GPU it's using will saturate the interconnect with cross-NUMA memory traffic. On a dual-socket server with 4x A100s (2 per socket), a misconfigured VM can lose 30-40% of theoretical throughput.

Configure vSphere NUMA topology in the VM's advanced settings:

numa.nodeAffinity = "0"          # Pin to NUMA node 0 (the node that owns the GPU)
numa.autosize = "FALSE"          # Don't let vSphere auto-size the virtual NUMA topology

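Pair this with a VM group + host affinity rule. The rule pins the VM to a host rather than to a NUMA node, but it keeps the VM on hardware where the affinity setting is still valid, and in some vSphere versions it survives reboots more reliably than advanced settings alone.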
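A quick way to catch misalignment before a job starts is to compare each VM's configured affinity with the NUMA node that owns its GPU. The sketch below assumes you already have that inventory as plain Python data; the VM names, the `GpuVm` record, and the `find_misaligned` helper are hypothetical, and nothing here queries vCenter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuVm:
    name: str
    gpu_numa_node: int            # NUMA node that owns the passed-through GPU
    node_affinity: Optional[str]  # value of numa.nodeAffinity, or None if unset


def find_misaligned(vms):
    """Return (vm name, reason) for every VM whose affinity does not match its GPU's node."""
    problems = []
    for vm in vms:
        if vm.node_affinity is None:
            problems.append((vm.name, "numa.nodeAffinity is not set"))
            continue
        # numa.nodeAffinity is a comma-separated list of node numbers.
        allowed = {int(n) for n in vm.node_affinity.split(",") if n.strip()}
        if allowed != {vm.gpu_numa_node}:
            problems.append(
                (vm.name, f"affinity {sorted(allowed)} but GPU is on node {vm.gpu_numa_node}")
            )
    return problems


# Hypothetical inventory for a dual-socket host.
inventory = [
    GpuVm("train-01", gpu_numa_node=0, node_affinity="0"),
    GpuVm("train-02", gpu_numa_node=1, node_affinity="0"),
    GpuVm("train-03", gpu_numa_node=1, node_affinity=None),
]

for name, reason in find_misaligned(inventory):
    print(f"{name}: {reason}")
```

With this sample inventory the check flags train-02 (pinned to the wrong node) and train-03 (no pinning at all). How you populate the inventory depends on your tooling and is left out on purpose.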

Storage for Training Data

NVMe-backed vSAN ESA datastores can deliver enough throughput to keep a GPU's PCIe link busy when training data is read directly, but most frameworks put a preprocessing pipeline between storage and the GPU. In practice the bottleneck is usually the data loader, not raw storage throughput.

For datasets larger than local storage, mount an NFS export from a high-performance NAS directly inside the VM. Avoid vSAN stretched clusters for training data: cross-site latency adds up when a data loader is reading millions of small files.
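To see whether small-file reads are what is holding a data loader back, it helps to measure them in isolation. The following is a minimal sketch, not a benchmark tool: `measure_small_file_reads` is a hypothetical helper, and the demo reads a throwaway temporary directory rather than a real dataset mount.

```python
import os
import tempfile
import time


def measure_small_file_reads(directory):
    """Read every file under `directory` once; return (files per second, MiB per second)."""
    files = 0
    total_bytes = 0
    start = time.perf_counter()
    for root, _dirs, names in os.walk(directory):
        for name in names:
            with open(os.path.join(root, name), "rb") as fh:
                total_bytes += len(fh.read())
            files += 1
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return files / elapsed, total_bytes / (1024 * 1024) / elapsed


# Demo against generated files; point the function at your dataset mount instead.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(200):
        with open(os.path.join(tmp, f"sample_{i}.bin"), "wb") as fh:
            fh.write(os.urandom(4096))
    files_per_s, mib_per_s = measure_small_file_reads(tmp)
    print(f"{files_per_s:.0f} files/s, {mib_per_s:.1f} MiB/s")
```

The demo files were just written, so they will be served from the page cache and the numbers will look flattering. On a real mount, run it against data that has not been read recently and compare files per second across local disk, NFS, and vSAN.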

Scheduling Large Jobs with DRS

vSphere DRS doesn't natively understand GPU topology, so a GPU-assigned VM might be placed at power-on (or, for vGPU VMs, migrated) onto a host without the correct GPU type. Use VM-to-Host affinity rules to lock GPU VMs to the correct host group:

Host Group: gpu-hosts-a100
VM Group: ml-training-vms
Rule: ml-training-vms MUST run on gpu-hosts-a100
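The rule's intent can be audited outside vCenter as well. As a rough sketch, assuming you can export current VM placement into a dictionary, the hypothetical `rule_violations` helper below lists VMs in the group that are running on a host outside the host group. The host and VM names are made up; only the two group names come from the rule above.

```python
def rule_violations(vm_hosts, vm_group, host_group):
    """Return VMs in `vm_group` that are running on a host outside `host_group`.

    vm_hosts maps VM name -> the host it is currently running on.
    VMs that are not in vm_hosts (for example, powered off) are skipped.
    """
    allowed = set(host_group)
    return sorted(
        vm for vm in vm_group
        if vm in vm_hosts and vm_hosts[vm] not in allowed
    )


# Hypothetical inventory.
gpu_hosts_a100 = ["esx-gpu-01", "esx-gpu-02"]
ml_training_vms = ["train-01", "train-02", "train-03"]
current_placement = {
    "train-01": "esx-gpu-01",
    "train-02": "esx-cpu-07",  # wrong host: not in the A100 host group
    "train-03": "esx-gpu-02",
    "web-01": "esx-cpu-07",    # not in the VM group, so the rule ignores it
}

print(rule_violations(current_placement, ml_training_vms, gpu_hosts_a100))
# prints ['train-02']
```

A check like this is most useful for "should" rules or after manual changes, where a VM can end up outside the intended group without anything failing loudly.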

Monitoring GPU Utilisation

NVIDIA's DCGM (Data Center GPU Manager) runs inside each GPU VM (with passthrough, the guest is the only place the GPU is visible), and its dcgm-exporter component exposes the metrics in Prometheus format. Scrape these into your existing monitoring stack to track GPU utilisation, memory usage, and error counters without additional VMware tooling.

The most useful alert to configure early: DCGM_FI_DEV_GPU_UTIL < 20 on a VM that's supposed to be training. It usually means your CUDA environment isn't seeing the GPU.
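Before wiring that condition into an alerting rule, you can sanity-check it against raw exporter output. The sketch below parses Prometheus text exposition and reports any DCGM_FI_DEV_GPU_UTIL sample under the threshold. The `low_utilisation` helper is hypothetical, and the sample text, including its `gpu` and `Hostname` label values, is illustrative rather than captured from a real exporter.

```python
import re

METRIC = "DCGM_FI_DEV_GPU_UTIL"
# Matches lines such as: DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="train-01"} 97
LINE_RE = re.compile(r"^" + METRIC + r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def low_utilisation(exposition_text, threshold=20.0):
    """Return (labels, value) for every GPU_UTIL sample below `threshold`."""
    hits = []
    for line in exposition_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skips # HELP / # TYPE comments and other metrics
        value = float(match.group("value"))
        if value < threshold:
            labels = dict(LABEL_RE.findall(match.group("labels")))
            hits.append((labels, value))
    return hits


# Illustrative sample, not real exporter output.
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="train-01"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="train-02"} 3
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="train-02"} 512
"""

for labels, value in low_utilisation(sample):
    print(f'{labels["Hostname"]} gpu {labels["gpu"]}: {value:.0f}% utilisation')
```

This looks at a single scrape. A production alert would normally live in your Prometheus rules with a hold duration, so that a brief dip between epochs does not page anyone.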