The demand to run AI and ML workloads on existing vSphere infrastructure is real and growing. Research teams don't want to wait for cloud accounts, and finance teams don't want to pay cloud GPU prices for long-running training jobs. Here's the honest guide to making it work.
GPU Access: Passthrough vs vGPU
You have two options for exposing NVIDIA GPUs to VMs: DirectPath I/O passthrough and NVIDIA vGPU (via the GRID software stack).
| | Passthrough | vGPU |
|---|---|---|
| Isolation | Full — one GPU per VM | Shared — multiple VMs per GPU |
| vMotion | Not supported | Supported (with vGPU-aware vMotion) |
| Driver complexity | Low (standard NVIDIA driver) | High (GRID guest driver + vGPU manager on host) |
| Best for | Large training jobs | Inference serving, dev environments |
For training workloads (the dominant use case in most shops), passthrough is simpler and avoids the GRID licensing cost. For inference serving where GPU utilisation is bursty, vGPU's time-slicing makes better use of hardware.
NUMA Alignment is Non-Negotiable
A VM pinned to a NUMA node that doesn't own the GPU it's using sends every host-to-device transfer across the inter-socket interconnect. On a dual-socket server with 4x A100s (2 per socket), a misconfigured VM can lose 30-40% of theoretical throughput.
Configure vSphere NUMA topology in the VM's advanced settings:
numa.nodeAffinity = "0" # Pin to NUMA node 0
numa.autosize = "FALSE" # Prevent vSphere from overriding
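Or enforce placement via a VM group + host affinity rule, which survives reboots more reliably than advanced settings in some vSphere versions. Keep in mind that an affinity rule pins the VM to a host, not to a NUMA node within that host.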
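A quick guest-side sanity check can catch obvious misalignment. The sketch below assumes a Linux guest and reads the `numa_node` attribute that the kernel publishes for each PCI device under sysfs. The function names are illustrative, not part of any NVIDIA or VMware tool. Inside a VM the reported value depends on what the virtual hardware exposes (it may be `-1`, meaning no locality information), and guest node numbers are virtual, so treat this as a hint rather than proof of host-side placement.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"   # PCI vendor ID assigned to NVIDIA
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class for display / 3D controllers


def gpu_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA GPU PCI address to the NUMA node the guest kernel reports."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        vendor_file = dev / "vendor"
        class_file = dev / "class"
        numa_file = dev / "numa_node"
        if not (vendor_file.is_file() and class_file.is_file() and numa_file.is_file()):
            continue
        if vendor_file.read_text().strip().lower() != NVIDIA_VENDOR_ID:
            continue
        # Skip non-GPU functions on the same card, such as the HDMI audio device.
        if not class_file.read_text().strip().lower().startswith(DISPLAY_CLASS_PREFIX):
            continue
        # -1 means the kernel has no NUMA locality information for this device.
        result[dev.name] = int(numa_file.read_text().strip())
    return result


def misaligned(gpu_nodes, expected_node):
    """Return the devices whose reported node differs from the node the VM is pinned to."""
    return {addr: node for addr, node in gpu_nodes.items() if node != expected_node}
```

Calling `misaligned(gpu_numa_nodes(), expected_node=0)` in a VM pinned to node 0 should return an empty dict when the guest sees its GPUs on that node.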
Storage for Training Data
NVMe-backed vSAN ESA datastores can deliver enough read throughput to keep a GPU fed when training data is read directly, but most frameworks route data through a CPU-side preprocessing pipeline first. The practical bottleneck is usually the data loader, not raw storage throughput.
For datasets larger than local storage, mount an NFS export from a high-performance NAS directly inside the VM. Avoid vSAN stretched clusters for training data: cross-site latency adds up when a data loader is reading millions of small files.
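To see whether small-file reads are the limiting factor on a given mount, a rough measurement is often enough. The following is a minimal sketch using only the Python standard library. The function name and the `max_files` cap are assumptions for illustration, and it measures sequential single-threaded reads, which is a pessimistic stand-in for a real multi-worker data loader.

```python
import os
import time


def measure_small_file_reads(directory, max_files=10_000):
    """Read up to max_files files under directory once and report the read rates.

    Note: a second run over the same files is likely served from the guest's
    page cache, so only the first (cold) run reflects the storage path.
    """
    count = 0
    total_bytes = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            if count >= max_files:
                break
            with open(os.path.join(dirpath, name), "rb") as fh:
                total_bytes += len(fh.read())
            count += 1
        if count >= max_files:
            break
    # Guard against a zero interval on very small samples.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "files": count,
        "bytes": total_bytes,
        "seconds": elapsed,
        "files_per_s": count / elapsed,
        "mib_per_s": total_bytes / elapsed / 2**20,
    }
```

If `files_per_s` on the NFS mount is far below what the training loop consumes per second, the fix is more likely packing samples into larger shard files than buying faster storage.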
Scheduling Large Jobs with DRS
vSphere DRS doesn't natively understand GPU topology, so a GPU-assigned VM might be placed on, or migrated to, a host without the correct GPU type. Use VM-to-Host affinity rules to lock GPU VMs to the correct host group:
Host Group: gpu-hosts-a100
VM Group: ml-training-vms
Rule: ml-training-vms MUST run on gpu-hosts-a100
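The semantics of a "must run on" rule can be audited outside vCenter. The sketch below is plain Python over a hypothetical inventory snapshot. The group contents and VM-to-host placements are made-up inputs that would in practice be exported from vCenter (for example with PowerCLI or pyVmomi), and the function name is illustrative.

```python
def audit_must_run_on(vm_group, host_group, placements):
    """Return the VMs in vm_group that are running on a host outside host_group.

    placements maps VM name -> current host name, or None for a VM that is
    powered off or not placed. Such VMs are skipped, since the rule only
    constrains where a VM runs.
    """
    allowed = set(host_group)
    violations = {}
    for vm in vm_group:
        host = placements.get(vm)
        if host is not None and host not in allowed:
            violations[vm] = host
    return violations
```

With `gpu-hosts-a100` and `ml-training-vms` as the two groups, an empty result means every running training VM is on an A100 host. Anything else is worth checking before the next DRS pass or HA restart.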
Monitoring GPU Utilisation
NVIDIA's DCGM (Data Center GPU Manager) runs inside the VMs that own the GPUs, and its companion dcgm-exporter exposes the metrics in Prometheus format. Scrape these into your existing monitoring stack to track GPU utilisation, memory usage, and error counters without additional VMware tooling.
The most useful alert to configure early: DCGM_FI_DEV_GPU_UTIL < 20, sustained for several minutes, on a VM that's supposed to be training. It usually means your CUDA environment isn't seeing the GPU.
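Before wiring the alert into Prometheus, the same threshold check can be tried against a single scrape. This is a minimal sketch that parses Prometheus text-format output with the standard library only. The function name, the 20% default, and the label names used in the usage note are assumptions for illustration, so check them against what your exporter actually emits.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def low_utilisation_gpus(metrics_text, threshold=20.0, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return (labels, value) pairs for samples of `metric` below `threshold`."""
    flagged = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        value = float(match.group("value"))
        if value < threshold:
            flagged.append((labels, value))
    return flagged
```

Feeding it the body of an exporter scrape returns one entry per under-utilised GPU, with labels such as `gpu` or `Hostname` if the exporter provides them. A single scrape is noisy, so a production alert should still require the condition to hold over a time window.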