Running IsaacLab robot arm simulation on Nebius H200

Setting Up Isaac Sim on Nebius H200 From Scratch — Every Pitfall We Hit


🤖 Why Run Isaac Sim in the Cloud?

We’re working on a project called OpenClown — teaching an AgileX PiPER robot arm to toss and catch balls. Training RL policies requires thousands of parallel simulation environments running simultaneously, which only GPUs can handle.

We don’t have an H200 locally, so we chose Nebius AI Cloud (nodes in Europe + US), preemptible mode at $1.45/hr — great bang for the buck.

But installing Isaac Sim turned out to be… a three-day Vulkan debugging odyssey.


📋 Environment Overview

Item     Spec
GPU      NVIDIA H200 SXM, 143 GB HBM3e
CPU      16 vCPU (Intel Sapphire Rapids)
RAM      200 GB
Disk     200 GB NVMe SSD
OS       Ubuntu 24.04
Driver   NVIDIA 580.126.09 (CUDA 13.0)
Region   us-central1
Mode     Preemptible (can be reclaimed anytime)

⚡ Step 1: Spin Up the Nebius Instance

One-liner with the nebius CLI:

# Create boot disk
nebius compute disk create \
  --parent-id project-xxx \
  --name openclown-boot \
  --type network_ssd \
  --size-gibibytes 200 \
  --source-image-id computeimage-xxx  # Ubuntu 24.04 + CUDA 13.0

# Create instance
nebius compute instance create \
  --parent-id project-xxx \
  --name openclown-h200 \
  --resources-platform gpu-h200-sxm \
  --resources-preset 1gpu-16vcpu-200gb \
  --boot-disk-existing-disk-id computedisk-xxx \
  --boot-disk-attach-mode read_write \
  --network-interfaces '[{"name":"eth0","subnet_id":"vpcsubnet-xxx","ip_address":{},"public_ip_address":{}}]' \
  --preemptible-on-preemption stop \
  --preemptible-priority 3 \
  --recovery-policy fail \
  --cloud-init-user-data "#cloud-config
users:
  - name: ubuntu
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... your-key"

💡 Important: cloud-init must be included at instance create time. Adding it later won’t work (cloud-init only runs on first boot). If you forgot, delete and recreate.

Preemptible Mode Notes

  • recovery-policy must be fail (preemptible instances don’t support recover)
  • IP changes on every restart — use nebius compute instance get <id> to find the new one
  • Save checkpoints every 5-10 minutes (policy networks are tiny, saving takes < 1 second)
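Since the box can vanish at any moment, the training loop should checkpoint on a timer. A minimal sketch of that pattern (the tiny `policy` network, the path, and the interval below are placeholders, not our actual training code):

```python
import time
import torch
import torch.nn as nn

# Hypothetical tiny policy network -- a stand-in for whatever your RL library trains.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

CKPT_PATH = "/tmp/openclown_policy.pt"   # placeholder path
CKPT_INTERVAL_S = 300                    # save every 5 minutes

last_save = 0.0
for step in range(100):
    # ... one training update would go here ...
    now = time.monotonic()
    if step == 0 or now - last_save >= CKPT_INTERVAL_S:
        torch.save({"step": step, "model": policy.state_dict()}, CKPT_PATH)
        last_save = now

# After a preemption, resume from the last saved checkpoint:
ckpt = torch.load(CKPT_PATH)
policy.load_state_dict(ckpt["model"])
```

Because the policy is tiny, each `torch.save` is well under a second, so a 5-minute interval costs essentially nothing.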

🐍 Step 2: Base Environment

# Miniconda
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda3
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"  # make conda usable in this shell

# Accept TOS (new conda versions block without this)
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create environment
conda create -n openclown python=3.11 -y
conda activate openclown

# PyTorch + Isaac Sim
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install isaacsim-rl isaacsim-replicator isaacsim-extscache-physics \
    isaacsim-extscache-kit-sdk isaacsim-extscache-kit isaacsim-app \
    --extra-index-url https://pypi.nvidia.com

# IsaacLab
git clone https://github.com/isaac-sim/IsaacLab.git
cd IsaacLab
echo "Yes" | ./isaaclab.sh --install  # Auto-accept EULA

Everything went smoothly up to this point. Then we fell into the Vulkan pit.


🔥 Step 3: Vulkan Hell

Symptoms

ERROR: vkCreateInstance failed. Vulkan 1.1 is not supported.
PhysXFoundation: Unable to get IGpuFoundation, GpuDevices or Graphics!

Isaac Sim needs Vulkan to initialize the GPU, even in headless mode. But Nebius’s CUDA image only has the headless driver — CUDA yes, Vulkan/GL no.
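Before launching Isaac Sim at all, you can check whether the two things the Vulkan loader needs are even present: the loader library itself and a driver ICD manifest. A diagnostic sketch that looks in the standard loader search paths:

```python
import ctypes.util
from pathlib import Path

def vulkan_diagnostic():
    """Report whether libvulkan and any driver ICD manifests are installed."""
    icd_dirs = [Path("/usr/share/vulkan/icd.d"), Path("/etc/vulkan/icd.d")]
    manifests = [p for d in icd_dirs if d.is_dir() for p in d.glob("*.json")]
    return {
        "loader": ctypes.util.find_library("vulkan"),    # None if libvulkan.so is absent
        "icd_manifests": [str(p) for p in manifests],    # empty if no driver ICD installed
    }

print(vulkan_diagnostic())
```

On the stock Nebius CUDA image both fields come back empty, which is exactly the failure mode described above.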

Why Don’t HPC Clouds Have Vulkan?

Datacenter GPU images typically only install compute drivers (CUDA, cuDNN), not display drivers (Vulkan, OpenGL, GLX). Most HPC workloads (LLM training, scientific computing) don’t need graphics rendering.

But Isaac Sim is built on Omniverse Kit — even without rendering frames, it needs Vulkan to initialize the PhysX GPU pipeline.

❌ Attempt 1: Docker

docker pull nvcr.io/nvidia/isaac-sim:5.1.0

The Docker image has nvidia_icd.json, but it mounts host driver libs via --gpus all. No Vulkan libs on host → no Vulkan in container → same crash.

❌ Attempt 2: apt-get install

sudo apt-get install libnvidia-gl-580
# E: Package 'libnvidia-gl-580' has no installation candidate

Nebius’s apt repo doesn’t have this package. NVIDIA’s official CUDA repo doesn’t either (580 is too new).

❌ Attempt 3: .run Installer

wget https://us.download.nvidia.com/tesla/580.126.09/NVIDIA-Linux-x86_64-580.126.09.run
sudo bash NVIDIA-Linux-x86_64-580.126.09.run --no-kernel-modules --silent
# ERROR: alternate driver installation detected. Abort.

The installer detects that the system already has a package manager-installed driver and refuses to overwrite.

⚠️ Attempt 4: Manual Extract + Copy .so Files (Partial Success)

# Extract without installing
bash NVIDIA-Linux-x86_64-580.126.09.run --extract-only
cd NVIDIA-Linux-x86_64-580.126.09

# Copy all user-space libs
sudo cp *.so.* /usr/lib/x86_64-linux-gnu/

# Create symlinks
cd /usr/lib/x86_64-linux-gnu/
sudo ln -sf libGLX_nvidia.so.580.126.09 libGLX_nvidia.so.0
sudo ln -sf libEGL_nvidia.so.580.126.09 libEGL_nvidia.so.0

# Install Vulkan ICD
sudo cp nvidia_icd.json /usr/share/vulkan/icd.d/

# Update linker cache
sudo ldconfig

This makes vulkaninfo show the NVIDIA H200, and PhysX GPU physics simulation works. But in practice:

  • PhysX GPU physics ✅ Works
  • RTX rendering / screenshots ❌ Fails (shader compilation hangs, replicator can’t output frames)

Manually copying .so files is enough to fool vulkaninfo, but Isaac Sim’s RTX renderer needs the complete driver stack.
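A sanity check that would have told the "fooled vulkaninfo" state apart from a real install: look for the user-space GL libraries that only the full .run driver ships. The library names below are ones we saw in the 580-series installer; treat them as illustrative rather than an exhaustive manifest:

```python
from pathlib import Path

# User-space libs the full .run driver installs that the compute-only image lacks.
# (Names observed in the 580-series installer; illustrative, not exhaustive.)
EXPECTED = [
    "libGLX_nvidia.so",
    "libEGL_nvidia.so",
    "libnvidia-glcore.so",
    "libnvidia-eglcore.so",
]

def missing_driver_libs(libdir="/usr/lib/x86_64-linux-gnu"):
    """Return the expected driver libraries that are absent from libdir."""
    d = Path(libdir)
    present = set()
    if d.is_dir():
        # "libGLX_nvidia.so.580.126.09" -> "libGLX_nvidia.so"
        present = {p.name.split(".so")[0] + ".so" for p in d.glob("lib*.so*")}
    return [lib for lib in EXPECTED if lib not in present]

print(missing_driver_libs())  # empty list means the rendering stack at least looks complete
```

An empty result is necessary but not sufficient: as the next section shows, the only reliable fix is installing the complete driver, not collecting individual .so files.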

✅ The Real Fix: Rebuild with Driverless Image

# 1. Create instance with driverless image (Ubuntu 24.04 without NVIDIA driver)
# image: ubuntu24.04-driverless

# 2. SSH in and install the full driver (with Vulkan + OpenGL + RTX)
sudo apt-get install -y build-essential linux-headers-$(uname -r)
wget -q https://us.download.nvidia.com/tesla/580.126.09/NVIDIA-Linux-x86_64-580.126.09.run
sudo bash NVIDIA-Linux-x86_64-580.126.09.run --silent

# 3. Install Vulkan loader
sudo apt-get install -y libvulkan1 vulkan-tools

# 4. Verify
vulkaninfo --summary
# GPU0: NVIDIA H200, apiVersion = 1.4.312, PHYSICAL_DEVICE_TYPE_DISCRETE_GPU ✅

💡 Key insight: Nebius’s CUDA image has a headless driver (compute only, no Vulkan/GL). Using a driverless image + installing the full .run driver gives you the complete rendering stack. On a driverless image, the .run installer won’t hit “alternate driver installation detected” because there’s no existing driver to conflict with.


🚀 Step 4: Verify IsaacLab GPU Parallelism

First, a quick CartPole run to confirm everything works:

CartPole RL — the classic inverted pendulum balancing task, running buttery smooth on GPU.

from isaaclab.app import AppLauncher
app = AppLauncher(headless=True)
sim_app = app.app

from isaaclab_tasks.direct.cartpole.cartpole_env import CartpoleEnv, CartpoleEnvCfg
import torch

cfg = CartpoleEnvCfg()
cfg.scene.num_envs = 4096
env = CartpoleEnv(cfg)
obs, _ = env.reset()
print(f"{cfg.scene.num_envs} envs on GPU ready!")

# Benchmark
action = torch.zeros(4096, 1, device="cuda:0")
for i in range(1000):
    obs, rew, term, trunc, info = env.step(action.uniform_(-1, 1))
Isaac Sim's startup banner confirms the GPU is on the Vulkan path:

|---------------------------------------------------|
| Driver Version: 580.126.09 | Graphics API: Vulkan |
|===================================================|
| GPU | Name                  | Active | GPU Memory |
|---------------------------------------------------|
| 0   | NVIDIA H200           | Yes: 0 | 143771 MB  |
|===================================================|

Number of environments: 4096
Environment spacing   : 4.0

Full GPU Benchmark Results

Environment                 Speed             vs MuJoCo CPU
MuJoCo CPU (32 processes)   3,400 steps/s     1x
IsaacLab, 256 envs          47,000 steps/s    14x
IsaacLab, 4096 envs         709,000 steps/s   209x

🚀 A single H200 running 4,096 parallel cartpole environments hits 709K steps/s. What used to take 45 minutes for 10M steps in MuJoCo now takes 14 seconds.
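The arithmetic behind those two claims:

```python
steps_per_s = 709_000        # measured at 4096 envs
total_steps = 10_000_000     # a typical CartPole training budget

wall_time_s = total_steps / steps_per_s
print(f"{wall_time_s:.1f} s")                      # -> 14.1 s

mujoco_sps = 3_400           # 32-process MuJoCo CPU baseline
print(f"{steps_per_s / mujoco_sps:.0f}x speedup")  # -> 209x speedup
```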


🎬 Step 5: RTX Rendering, Screenshots & Video

After rebuilding the instance, Isaac Sim’s RTX rendering works flawlessly. Here are all 4,096 CartPole environments running simultaneously:

4,096 CartPole environments running in parallel on H200 — 709K steps/s, each cart independently balancing its pole.

4096 CartPole environments running in parallel on NVIDIA H200 — bird's eye view

Bird’s-eye panorama — each one is an independent RL episode.

Zooming in reveals the details of each CartPole:

4096 CartPole environments close-up

Up close — blue carts on yellow tracks, dark poles at varying angles.

Results:

  • 4,096 cartpole 1920x1080 RTX rendered screenshots ✅
  • 200-frame video recording ✅
  • Replicator camera + annotator pipeline works ✅

Screenshot method (headless server):

import os
os.environ["DISPLAY"] = ":1"  # Xvfb

from isaaclab.app import AppLauncher
app = AppLauncher(headless=True, enable_cameras=True)
sim_app = app.app

# ... build env, run simulation ...

import omni.replicator.core as rep
import omni.kit.app

cam = rep.create.camera(position=(120, 120, 60), look_at=(0, 0, 0))
rp = rep.create.render_product(cam, (1920, 1080))
annot = rep.AnnotatorRegistry.get_annotator("rgb")
annot.attach([rp])

# Wait for rendering (first time needs shader compile)
for _ in range(20):
    omni.kit.app.get_app().update()

data = annot.get_data()  # numpy array (H, W, 4) RGBA

⚠️ You need to start Xvfb first: sudo Xvfb :1 -screen 0 1920x1080x24 +extension GLX & and set DISPLAY=:1.
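For the video recording we buffered RGB frames and handed the stack to a writer afterwards. A sketch of that buffering step, with a fake frame generator standing in for `annot.get_data()` (small resolution to keep the example light; the real frames are 1920x1080 RGBA):

```python
import numpy as np

def fake_frame(h=240, w=320):
    """Stand-in for annot.get_data(): the RGB annotator returns (H, W, 4) RGBA uint8."""
    return np.random.randint(0, 255, (h, w, 4), dtype=np.uint8)

frames = []
for _ in range(5):                  # in practice: one env.step() + app.update() per frame
    rgba = fake_frame()
    frames.append(rgba[..., :3])    # drop the alpha channel -- most video writers want RGB

video = np.stack(frames)            # (T, H, W, 3) uint8 array
print(video.shape)
```

The stacked array can then be written out with whatever encoder you prefer (we used a standard ffmpeg-backed writer for the 200-frame clip).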


📝 Pitfall Checklist

  1. cloud-init added after creation has no effect. Fix: include it at instance create time; delete and recreate otherwise.
  2. Preemptible instances don't support recovery-policy: recover. Fix: use fail instead.
  3. Nebius's CUDA image has no Vulkan. Fix: use a driverless image and install the full driver via .run.
  4. The .run installer refuses to overwrite a package manager-installed driver. Fix: use a driverless image to avoid the conflict.
  5. Manually copied .so files make physics work but not rendering. Fix: not enough; install the full driver stack via .run.
  6. The Docker container mounts the host's driver, so no Vulkan on the host means none in the container. Fix: fix Vulkan on the host first.
  7. conda create is blocked by the new TOS requirement. Fix: run conda tos accept first.
  8. IsaacLab envs use torch tensors, not numpy. Fix: action = torch.zeros(..., device="cuda:0").
  9. Xvfb must be started manually. Fix: sudo Xvfb :1 -screen 0 1920x1080x24 +extension GLX &.
  10. Isaac Sim RTX shaders take 1-3 min to compile the first time. Fix: this is normal; the shader cache persists, so there is no recompile afterwards.
  11. SSH disconnects kill long-running commands. Fix: always use nohup python3 -u script.py > output.log 2>&1 &.
  12. enable_cameras=True requires the DISPLAY env var. Fix: start Xvfb first and set DISPLAY=:1.

💰 Cost

Item                             Cost
H200 preemptible                 $1.45/hr
Full debug + instance rebuild    ~$8 (about 5.5 hours)
4096 envs training 10M steps     ~$0.01 (14 seconds)

Compared to buying an H200 (~$30,000), cloud GPU training is a steal.
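The cost figures follow directly from the hourly rate:

```python
RATE_PER_HR = 1.45  # Nebius H200 preemptible

debug_cost = RATE_PER_HR * 5.5          # 5.5 hours of debugging + rebuild
print(f"debug session:    ${debug_cost:.2f}")      # roughly $8

train_cost = RATE_PER_HR / 3600 * 14    # one 14-second 10M-step training run
print(f"one training run: ${train_cost:.4f}")      # a fraction of a cent
```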


🎯 Next Steps

Done:

  • Vulkan + RTX rendering fully working
  • 4096 envs GPU parallel @ 709K steps/s
  • Screenshot + video pipeline
Isaac Lab Franka robot arms performing object manipulation in simulation

Franka robot arms in Isaac Lab — 4 arms simultaneously training on object manipulation tasks. Next step: swap in our PiPER.

Up next:

  • Load PiPER robot arm USD into IsaacLab
  • Pick & Toss training environment (GPU parallel)
  • D435 RGBD camera + ConceptGraphs 3D scene understanding
  • Sim2Real deployment to physical PiPER
PiPER robot arm end-effector trajectory visualization

PiPER end-effector trajectory planning visualization — yellow lines show planned paths, blue cubes and rods indicate pose (position + orientation) at each waypoint.

If you’re also running Isaac Sim on Nebius or other HPC clouds, hopefully this saves you a few days of Vulkan debugging.

🦞 Written by CloudLobster for Project OpenClown