Image batch VRAM auto-sizing on the current worker (interim throughput win)

provenance · append-only

Trace

live

parent set 1d ago

Grouped under epic #201.
agent · conductor-claude
plan proposed 1d ago

# Image batch VRAM auto-sizing on the current worker (interim throughput win)

Repo: **legit-embedding** (branch `runpod-container-runtime`). Conductor companion config already done. Basis: conductor `docs/spec-vram-aware-tuning.md` (but see the corrected diagnosis below; the spec's root cause is wrong).

## Goal

Auto-size `IMAGE_BATCH_SIZE` to the pod's **free** VRAM so big cards run large batches (A40 46GB → ~80+ vs pinned 32) instead of wasting the card: ~2.5× per-pod throughput on the live lounge image backfill. Interim win on the current (A) multi-process worker while Track B (#205) is built; the nvidia-smi detection + batch memory model are reused by #206.

**Scope: batch size ONLY. Do NOT auto-size embed-worker count here (superseded by Track B's single-process design).**

## CORRECTED root cause of the 1766f48 regression (read first)

NOT "CUDA-after-fork": the worker is N `subprocess.Popen` `--pool=solo` processes (fresh execve, no inherited CUDA). The actual bug:

1. The **CPU-only prep worker** (`--queue=preparation`, no GPU model) eagerly creates a phantom CUDA context because it reads `config.worker.image_batch_size` at `preparation.py:113`, and that property calls `torch.cuda.get_device_properties(0)` at `config.py:180`.
2. The formula sizes off `total_memory`, not **free** VRAM (`config.py:180`) → phantom contexts + the model's ~6.5GB eat the fixed 3500MB reserve → OOM.

## Design

1. **Resolve batch once PRE-Popen in `start_workers.py`** (the parent, before any child exists, CUDA-free). Call `nvidia-smi --query-gpu=memory.free,memory.total,name --format=csv,noheader,nounits` (subprocess → no torch/CUDA in-process). Export `IMAGE_BATCH_SIZE=<int>` into the child env so `config.py:117` pins it and the auto path never runs in any child (especially the CPU prep worker).
2. **Keep the `_detect_total_vram_mb` seam** (`config.py:175-184`) so `tests/test_batch_sizing.py` mocks still hold: swap only its body torch→nvidia-smi, and use **free** VRAM. Keep the formula/knobs (`config.py:129-173`): `clamp((free_mb*fraction - reserve)/mb_per_item, MIN, MAX)`; fraction ~0.75–0.80, reserve, MIN 8 / MAX 128. Explicit positive `IMAGE_BATCH_SIZE` still pins (`config.py:138-154`).
3. Log the resolved `(gpu_name, free_mb, total_mb, batch_size)` at startup.

## Conductor side

`IMAGE_BATCH_SIZE` is **already** `'auto'` in `config/conductor-runtimes.php` (L161 runpod / L231 DO, commit 0790693). No conductor change for this brief.

**Remove the `.73` `.env` `RUNPOD_IMAGE_BATCH_SIZE=32` pin ONLY after the fixed image is built + single-task smoke passes.**

## Files

- `start_workers.py` (add pre-Popen resolve + export; spawn loop L54-83)
- `src/embeddings/config.py` (`_detect_total_vram_mb` L175-184 → nvidia-smi + free; formula L156-173 → free)
- `tests/test_batch_sizing.py` (mocks patch `_detect_total_vram_mb`; keep the seam)

## Rollout (do NOT skip: this is what the regression taught us)

1. `py_compile` + `test_batch_sizing.py` (mock nvidia-smi).
2. Build a new image tag; point conductor at it.
3. **SINGLE-TASK SMOKE FIRST:** queue 1 image task → confirm (a) a completed Embedding row and (b) a telemetry row with the auto batch + `gpu_mem_used_mb` well under total. Only then bulk.
4. Read telemetry `gpu_mem_used/total` at the auto batch; tune `mb_per_item`/fraction if headroom.
5. Watch worker Sentry (`GPU_WORKER_SENTRY_DSN`, project conductor) for OOM.

## Acceptance

- 46GB pod → batch ~80+, ~70-80% VRAM, no OOM.
- 20-24GB pod sizes down, no OOM.
- Explicit `IMAGE_BATCH_SIZE` still pins.
- Per-pod items/s up materially vs batch 32 (telemetry), bounded by the per-pod xadd floor.

## Status

Draft (idea). **Promote-ready**: safe to build/ship independently of Track B; the one to set `planned` first if the image backfill is turned on soon.
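
The pre-Popen resolve described under Design could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in `start_workers.py` or `config.py`: the function names (`parse_gpu_line`, `query_gpu`, `resolve_batch_size`, `child_env`) are hypothetical, `mb_per_item=350` is a placeholder rather than a measured value, `fraction=0.75` is the low end of the range quoted above, `reserve_mb=3500` is the figure from the regression diagnosis, and only the first GPU line reported by nvidia-smi is read.

```python
import subprocess

# The query named in the Design section; it runs in a subprocess, so the
# parent never imports torch or creates a CUDA context.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.free,memory.total,name",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one CSV line such as '45000, 46068, NVIDIA A40'."""
    free_mb, total_mb, name = [field.strip() for field in line.split(",", 2)]
    return int(free_mb), int(total_mb), name


def query_gpu():
    """Return (free_mb, total_mb, name) for the first GPU listed (assumption)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True, timeout=10
    )
    return parse_gpu_line(result.stdout.strip().splitlines()[0])


def resolve_batch_size(free_mb, fraction=0.75, reserve_mb=3500,
                       mb_per_item=350, min_batch=8, max_batch=128):
    """clamp((free_mb * fraction - reserve) / mb_per_item, MIN, MAX).

    The default knob values are illustrative placeholders, not the
    project's tuned settings.
    """
    raw = (free_mb * fraction - reserve_mb) / mb_per_item
    return max(min_batch, min(max_batch, int(raw)))


def child_env(base_env, query=query_gpu):
    """Build the environment handed to every Popen child.

    An explicit positive IMAGE_BATCH_SIZE is left untouched (it still pins);
    anything else, such as 'auto', is replaced by the resolved integer.
    """
    env = dict(base_env)
    pinned = env.get("IMAGE_BATCH_SIZE", "")
    if pinned.isdigit() and int(pinned) > 0:
        return env
    free_mb, total_mb, name = query()
    batch = resolve_batch_size(free_mb)
    # Startup log of the resolved tuple, as Design step 3 asks for.
    print(f"gpu_name={name} free_mb={free_mb} total_mb={total_mb} batch_size={batch}")
    env["IMAGE_BATCH_SIZE"] = str(batch)
    return env
```

In the spawn loop the parent would call something like `child_env(os.environ)` once and pass the result as `env=` to each `subprocess.Popen`, so the CPU-only prep worker receives a plain integer and never reaches the auto path. Passing `query` as a parameter keeps the nvidia-smi call mockable in tests, in the same spirit as the `_detect_total_vram_mb` seam.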

agent · conductor-claude
note added 1d ago

Auto-size IMAGE_BATCH_SIZE to the pod's FREE VRAM (nvidia-smi, resolved pre-Popen in start_workers.py) so big cards run ~80+ instead of pinned 32 — ~2.5x on the live image backfill. Interim win on the current worker; core reused by Track B. Includes the corrected 1766f48 regression diagnosis. Full spec to follow.
agent · conductor-claude
participant joined 1d ago
system · conductor-claude

epic · dependencies

Relationships

epic parent

Epic: Conductor multi-model + multi-app embedding (the vodmanager foundation)

depends on

No dependencies — dispatchable once planned.

agents · waves

Participants

conductor-claude participant · active

scope

Projects

conductor · consumer
legit-embedding · primary

dogfood · read-only

Agent’s-eye view

The literal recall_brief payload an agent gets — same service path as the MCP tool.