Running IsaacLab robot arm simulation on Nebius H200

Setting Up Isaac Sim on Nebius H200 From Scratch — Every Pitfall We Hit


🤖 Why Run Isaac Sim in the Cloud?

We’re working on a project called OpenClown — teaching an AgileX PiPER robot arm to toss and catch balls. Training RL policies requires thousands of parallel simulation environments running simultaneously, which only GPUs can handle.

We don’t have an H200 locally, so we chose Nebius AI Cloud (nodes in Europe + US), preemptible mode at $1.45/hr — great bang for the buck.

But installing Isaac Sim turned out to be… a three-day Vulkan debugging odyssey.


📋 Environment Overview

Item     Spec
GPU      NVIDIA H200 SXM, 143 GB HBM3e
CPU      16 vCPU (Intel Sapphire Rapids)
RAM      200 GB
Disk     200 GB NVMe SSD
OS       Ubuntu 24.04
Driver   NVIDIA 580.126.09 (CUDA 13.0)
Region   us-central1
Mode     Preemptible (can be reclaimed anytime)

⚡ Step 1: Spin Up the Nebius Instance

One-liner with the nebius CLI:

# Create boot disk
nebius compute disk create \
  --parent-id project-xxx \
  --name openclown-boot \
  --type network_ssd \
  --size-gibibytes 200 \
  --source-image-id computeimage-xxx  # Ubuntu 24.04 + CUDA 13.0

# Create instance
nebius compute instance create \
  --parent-id project-xxx \
  --name openclown-h200 \
  --resources-platform gpu-h200-sxm \
  --resources-preset 1gpu-16vcpu-200gb \
  --boot-disk-existing-disk-id computedisk-xxx \
  --boot-disk-attach-mode read_write \
  --network-interfaces '[{"name":"eth0","subnet_id":"vpcsubnet-xxx","ip_address":{},"public_ip_address":{}}]' \
  --preemptible-on-preemption stop \
  --preemptible-priority 3 \
  --recovery-policy fail \
  --cloud-init-user-data "#cloud-config
users:
  - name: ubuntu
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... your-key"

💡 Important: cloud-init must be included at instance create time. Adding it later won’t work (cloud-init only runs on first boot). If you forgot, delete and recreate.

Preemptible Mode Notes

  • recovery-policy must be fail (preemptible instances don’t support recover)
  • IP changes on every restart — use nebius compute instance get <id> to find the new one
  • Save checkpoints every 5-10 minutes (policy networks are tiny, saving takes < 1 second)
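Since the box can vanish at any moment, the training loop should checkpoint on a timer. A minimal sketch of that pattern (the tiny `policy` network, the path, and the interval below are placeholders, not our actual training code):

```python
import time
import torch
import torch.nn as nn

# Hypothetical tiny policy network -- a stand-in for whatever your RL library trains.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

CKPT_PATH = "/tmp/openclown_policy.pt"   # placeholder path
CKPT_INTERVAL_S = 300                    # save every 5 minutes

last_save = 0.0
for step in range(100):
    # ... one training update would go here ...
    now = time.monotonic()
    if step == 0 or now - last_save >= CKPT_INTERVAL_S:
        torch.save({"step": step, "model": policy.state_dict()}, CKPT_PATH)
        last_save = now

# After a preemption, resume from the last saved checkpoint:
ckpt = torch.load(CKPT_PATH)
policy.load_state_dict(ckpt["model"])
```

Because the policy is tiny, each `torch.save` is well under a second, so a 5-minute interval costs essentially nothing.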

🐍 Step 2: Base Environment

# Miniconda
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda3
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"  # make conda usable in this shell

# Accept TOS (new conda versions block without this)
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create environment
conda create -n openclown python=3.11 -y
conda activate openclown

# PyTorch + Isaac Sim
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install isaacsim-rl isaacsim-replicator isaacsim-extscache-physics \
    isaacsim-extscache-kit-sdk isaacsim-extscache-kit isaacsim-app \
    --extra-index-url https://pypi.nvidia.com

# IsaacLab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
echo "Yes" | ./isaaclab.sh --install  # Auto-accept EULA

Everything went smoothly up to this point. Then we fell into the Vulkan pit.


🔥 Step 3: Vulkan Hell

Symptoms

ERROR: vkCreateInstance failed. Vulkan 1.1 is not supported.
PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!

Isaac Sim needs Vulkan to initialize the GPU, even in headless mode. But Nebius’s CUDA image only has the headless driver — CUDA yes, Vulkan/GL no.
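Before launching Isaac Sim at all, you can check whether the two things the Vulkan loader needs are even present: the loader library itself and a driver ICD manifest. A diagnostic sketch that looks in the standard loader search paths:

```python
import ctypes.util
from pathlib import Path

def vulkan_diagnostic():
    """Report whether libvulkan and any driver ICD manifests are installed."""
    icd_dirs = [Path("/usr/share/vulkan/icd.d"), Path("/etc/vulkan/icd.d")]
    manifests = [p for d in icd_dirs if d.is_dir() for p in d.glob("*.json")]
    return {
        "loader": ctypes.util.find_library("vulkan"),    # None if libvulkan.so is absent
        "icd_manifests": [str(p) for p in manifests],    # empty if no driver ICD installed
    }

print(vulkan_diagnostic())
```

On the stock Nebius CUDA image both fields come back empty, which is exactly the failure mode described above.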

Why Don’t HPC Clouds Have Vulkan?

Datacenter GPU images typically only install compute drivers (CUDA, cuDNN), not display drivers (Vulkan, OpenGL, GLX). Most HPC workloads (LLM training, scientific computing) don’t need graphics rendering.

But Isaac Sim is built on Omniverse Kit — even without rendering frames, it needs Vulkan to initialize the PhysX GPU pipeline.

❌ Attempt 1: Docker

docker pull nvcr.io/nvidia/isaac-sim:5.1.0

The Docker image has nvidia_icd.json, but it mounts host driver libs via --gpus all. No Vulkan libs on host → no Vulkan in container → same crash.

❌ Attempt 2: apt-get install

sudo apt-get install libnvidia-gl-580
# E: Package 'libnvidia-gl-580' has no installation candidate

Nebius’s apt repo doesn’t have this package. NVIDIA’s official CUDA repo doesn’t either (580 is too new).

❌ Attempt 3: .run Installer

wget https://us.download.nvidia.com/tesla/580.126.09/NVIDIA-Linux-x86_64-580.126.09.run
sudo bash NVIDIA-Linux-x86_64-580.126.09.run --no-kernel-modules --silent
# ERROR: alternate driver installation detected. Abort.

The installer detects that the system already has a package manager-installed driver and refuses to overwrite.

⚠️ Attempt 4: Manual Extract + Copy .so Files (Partial Success)

# Extract without installing
bash NVIDIA-Linux-x86_64-580.126.09.run --extract-only
cd NVIDIA-Linux-x86_64-580.126.09

# Copy all user-space libs
sudo cp *.so.* /usr/lib/x86_64-linux-gnu/

# Create symlinks
cd /usr/lib/x86_64-linux-gnu/
sudo ln -sf libGLX_nvidia.so.580.126.09 libGLX_nvidia.so.0
sudo ln -sf libEGL_nvidia.so.580.126.09 libEGL_nvidia.so.0

# Install Vulkan ICD
sudo cp nvidia_icd.json /usr/share/vulkan/icd.d/

# Update linker cache
sudo ldconfig

This makes vulkaninfo show the NVIDIA H200, and PhysX GPU physics simulation works. But in practice:

  • PhysX GPU physics ✅ Works
  • RTX rendering / screenshots ❌ Fails (shader compilation hangs, replicator can’t output frames)

Manually copying .so files is enough to fool vulkaninfo, but Isaac Sim’s RTX renderer needs the complete driver stack.
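A sanity check that would have told the "fooled vulkaninfo" state apart from a real install: look for the user-space GL libraries that only the full .run driver ships. The library names below are ones we saw in the 580-series installer; treat them as illustrative rather than an exhaustive manifest:

```python
from pathlib import Path

# User-space libs the full .run driver installs that the compute-only image lacks.
# (Names observed in the 580-series installer; illustrative, not exhaustive.)
EXPECTED = [
    "libGLX_nvidia.so",
    "libEGL_nvidia.so",
    "libnvidia-glcore.so",
    "libnvidia-eglcore.so",
]

def missing_driver_libs(libdir="/usr/lib/x86_64-linux-gnu"):
    """Return the expected driver libraries that are absent from libdir."""
    d = Path(libdir)
    present = set()
    if d.is_dir():
        # "libGLX_nvidia.so.580.126.09" -> "libGLX_nvidia.so"
        present = {p.name.split(".so")[0] + ".so" for p in d.glob("lib*.so*")}
    return [lib for lib in EXPECTED if lib not in present]

print(missing_driver_libs())  # empty list means the rendering stack at least looks complete
```

An empty result is necessary but not sufficient: as the next section shows, the only reliable fix is installing the complete driver, not collecting individual .so files.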

✅ The Real Fix: Rebuild with Driverless Image

# 1. Create instance with driverless image (Ubuntu 24.04 without NVIDIA driver)
# image: ubuntu24.04-driverless

# 2. SSH in and install the full driver (with Vulkan + OpenGL + RTX)
sudo apt-get install -y build-essential linux-headers-$(uname -r)
wget -q https://us.download.nvidia.com/tesla/580.126.09/NVIDIA-Linux-x86_64-580.126.09.run
sudo bash NVIDIA-Linux-x86_64-580.126.09.run --silent

# 3. Install Vulkan loader
sudo apt-get install -y libvulkan1 vulkan-tools

# 4. Verify
vulkaninfo --summary
# GPU0: NVIDIA H200, apiVersion = 1.4.312, PHYSICAL_DEVICE_TYPE_DISCRETE_GPU ✅

💡 Key insight: Nebius’s CUDA image has a headless driver (compute only, no Vulkan/GL). Using a driverless image + installing the full .run driver gives you the complete rendering stack. On a driverless image, the .run installer won’t hit “alternate driver installation detected” because there’s no existing driver to conflict with.


🚀 Step 4: Verify IsaacLab GPU Parallelism

First, a quick CartPole run to confirm everything works:

CartPole RL — the classic inverted pendulum balancing task, running buttery smooth on GPU.

from isaaclab.app import AppLauncher
app = AppLauncher(headless=True)
sim_app = app.app

from isaaclab_tasks.direct.cartpole.cartpole_env import CartpoleEnv, CartpoleEnvCfg
import torch

cfg = CartpoleEnvCfg()
cfg.scene.num_envs = 4096
env = CartpoleEnv(cfg)
obs, _ = env.reset()
print(f"{cfg.scene.num_envs} envs on GPU ready!")

# Benchmark
action = torch.zeros(4096, 1, device="cuda:0")
for i in range(1000):
    obs, rew, term, trunc, info = env.step(action.uniform_(-1, 1))
Isaac Sim's startup banner confirms the GPU is on the Vulkan path:

|---------------------------------------------------|
| Driver Version: 580.126.09 | Graphics API: Vulkan |
|===================================================|
| GPU | Name                  | Active | GPU Memory |
|---------------------------------------------------|
| 0   | NVIDIA H200           | Yes: 0 | 143771 MB  |
|===================================================|

Number of environments: 4096
Environment spacing   : 4.0

Full GPU Benchmark Results

Environment                 Speed             vs MuJoCo CPU
MuJoCo CPU (32 processes)   3,400 steps/s     1x
IsaacLab, 256 envs          47,000 steps/s    14x
IsaacLab, 4096 envs         709,000 steps/s   209x

🚀 A single H200 running 4,096 parallel cartpole environments hits 709K steps/s. What used to take 45 minutes for 10M steps in MuJoCo now takes 14 seconds.
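The arithmetic behind those two claims:

```python
steps_per_s = 709_000        # measured at 4096 envs
total_steps = 10_000_000     # a typical CartPole training budget

wall_time_s = total_steps / steps_per_s
print(f"{wall_time_s:.1f} s")                      # -> 14.1 s

mujoco_sps = 3_400           # 32-process MuJoCo CPU baseline
print(f"{steps_per_s / mujoco_sps:.0f}x speedup")  # -> 209x speedup
```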


🎬 Step 5: RTX Rendering, Screenshots & Video

After rebuilding the instance, Isaac Sim’s RTX rendering works flawlessly. Here are all 4,096 CartPole environments running simultaneously:

4,096 CartPole environments running in parallel on H200 — 709K steps/s, each cart independently balancing its pole.

4096 CartPole environments running in parallel on NVIDIA H200 — bird's eye view

Bird’s-eye panorama — each one is an independent RL episode.

Zooming in reveals the details of each CartPole:

4096 CartPole environments close-up

Up close — blue carts on yellow tracks, dark poles at varying angles.

Results:

  • 4,096 cartpole 1920x1080 RTX rendered screenshots ✅
  • 200-frame video recording ✅
  • Replicator camera + annotator pipeline works ✅

Screenshot method (headless server):

import os
os.environ["DISPLAY"] = ":1"  # Xvfb

from isaaclab.app import AppLauncher
app = AppLauncher(headless=True, enable_cameras=True)
sim_app = app.app

# ... build env, run simulation ...

import omni.replicator.core as rep
import omni.kit.app

cam = rep.create.camera(position=(120, 120, 60), look_at=(0, 0, 0))
rp = rep.create.render_product(cam, (1920, 1080))
annot = rep.AnnotatorRegistry.get_annotator("rgb")
annot.attach([rp])

# Wait for rendering (first time needs shader compile)
for _ in range(20):
    omni.kit.app.get_app().update()

data = annot.get_data()  # numpy array (H, W, 4) RGBA

⚠️ You need to start Xvfb first: sudo Xvfb :1 -screen 0 1920x1080x24 +extension GLX & and set DISPLAY=:1.
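For the video recording we buffered RGB frames and handed the stack to a writer afterwards. A sketch of that buffering step, with a fake frame generator standing in for `annot.get_data()` (small resolution to keep the example light; the real frames are 1920x1080 RGBA):

```python
import numpy as np

def fake_frame(h=240, w=320):
    """Stand-in for annot.get_data(): the RGB annotator returns (H, W, 4) RGBA uint8."""
    return np.random.randint(0, 255, (h, w, 4), dtype=np.uint8)

frames = []
for _ in range(5):                  # in practice: one env.step() + app.update() per frame
    rgba = fake_frame()
    frames.append(rgba[..., :3])    # drop the alpha channel -- most video writers want RGB

video = np.stack(frames)            # (T, H, W, 3) uint8 array
print(video.shape)
```

The stacked array can then be written out with whatever encoder you prefer (we used a standard ffmpeg-backed writer for the 200-frame clip).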


📝 Pitfall Checklist

  1. cloud-init added after creation has no effect. Fix: include it at instance create time; delete and recreate otherwise.
  2. Preemptible instances don't support recovery-policy: recover. Fix: use fail instead.
  3. Nebius's CUDA image has no Vulkan. Fix: use a driverless image and install the full driver via .run.
  4. The .run installer refuses to overwrite a package manager-installed driver. Fix: use a driverless image to avoid the conflict.
  5. Manually copied .so files make physics work but not rendering. Fix: not enough; install the full driver stack via .run.
  6. The Docker container mounts the host's driver, so no Vulkan on the host means none in the container. Fix: fix Vulkan on the host first.
  7. conda create is blocked by the new TOS requirement. Fix: run conda tos accept first.
  8. IsaacLab envs use torch tensors, not numpy. Fix: action = torch.zeros(..., device="cuda:0").
  9. Xvfb must be started manually. Fix: sudo Xvfb :1 -screen 0 1920x1080x24 +extension GLX &.
  10. Isaac Sim RTX shaders take 1-3 min to compile the first time. Fix: this is normal; the shader cache persists, so there is no recompile afterwards.
  11. SSH disconnects kill long-running commands. Fix: always use nohup python3 -u script.py > output.log 2>&1 &.
  12. enable_cameras=True requires the DISPLAY env var. Fix: start Xvfb first and set DISPLAY=:1.

💰 Cost

Item                             Cost
H200 preemptible                 $1.45/hr
Full debug + instance rebuild    ~$8 (about 5.5 hours)
4096 envs training 10M steps     ~$0.01 (14 seconds)

Compared to buying an H200 (~$30,000), cloud GPU training is a steal.
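The cost figures follow directly from the hourly rate:

```python
RATE_PER_HR = 1.45  # Nebius H200 preemptible

debug_cost = RATE_PER_HR * 5.5          # 5.5 hours of debugging + rebuild
print(f"debug session:    ${debug_cost:.2f}")      # roughly $8

train_cost = RATE_PER_HR / 3600 * 14    # one 14-second 10M-step training run
print(f"one training run: ${train_cost:.4f}")      # a fraction of a cent
```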


🎯 Next Steps

Done:

  • Vulkan + RTX rendering fully working
  • 4096 envs GPU parallel @ 709K steps/s
  • Screenshot + video pipeline
Isaac Lab Franka robot arms performing object manipulation in simulation

Franka robot arms in Isaac Lab — 4 arms simultaneously training on object manipulation tasks. Next step: swap in our PiPER.

Up next:

  • Load PiPER robot arm USD into IsaacLab
  • Pick & Toss training environment (GPU parallel)
  • D435 RGBD camera + ConceptGraphs 3D scene understanding
  • Sim2Real deployment to physical PiPER
PiPER robot arm end-effector trajectory visualization

PiPER end-effector trajectory planning visualization — yellow lines show planned paths, blue cubes and rods indicate pose (position + orientation) at each waypoint.

If you’re also running Isaac Sim on Nebius or other HPC clouds, hopefully this saves you a few days of Vulkan debugging.

🦞 Written by CloudLobster for Project OpenClown