Worker-type VRAM cost table + greedy packer (Track B)

canonical · plan

Spec

markdown

hand-off · dispatch

Dispatch

Auto-dispatch

when it reaches planned

Design-loop

design pass before build

Blocked — dispatch is gated

Waiting on 1 unfinished dependency. Complete or cancel it to dispatch.

#205 · idea

provenance · append-only

Trace

live

dependency added 1d ago

Now depends on #205 (Build single-process multi-model async worker + worker-side reply_to (Track B)).
agent · conductor-claude
parent set 1d ago

Grouped under epic #201.
agent · conductor-claude
plan proposed 1d ago

# Worker-type VRAM cost table + greedy packer (Track B)

Repo: **legit-embedding**, on the #205 single-process runtime. **Depends on #205.** This is the operator's "dynamic memory / heterogeneous worker mix" idea in its simplest correct form.

## What

A per-worker-type resource table + a **greedy** packer that fits a heterogeneous model mix onto whatever GPU we get, weighted by queue depth.

### Cost table (static config, measured; versioned)

Per `worker_type ∈ {image_embed, text_embed, (future) audio, video}`:

- `base_mb`: model weights + context per resident model (MEASURE via nvidia-smi on a real pod)
- `per_item_mb`: activation per item at its batch (MEASURE ≥2 points to fit base vs per-item)
- `min_batch` / `max_batch`
- `in_flight`: async concurrency to hide the xadd write (the I/O-overlap knob, NOT a compute knob)
- `target_depth_per_worker` / `priority`: the queue-depth signal

### Greedy packer (NOT a solver)

At **pod startup** (latency-insensitive → no runtime re-packing in v1): detect free VRAM (nvidia-smi, reuse #202); for each model type with backlog, greedily allocate resident model + batch + in_flight, subtracting `base_mb + batch*per_item_mb` from the budget, ordered by backlog pressure/priority, until the budget is exhausted or all backlogged types are served.

Industry pattern: **resource requests + bin-packing** (Borg/K8s/Nomad) + **queue-depth autoscaling** (KEDA: `desired ≈ ceil(depth/target)`). Greedy keeps it debuggable: the "no horribly complicated system" guarantee.

### v1 scope / explicitly deferred

- v1: compute the mix ONCE at startup from current per-queue depths; a pod is short-lived → re-pack on next spawn.
- Deferred (phase 2): runtime dynamic re-packing (add/evict models mid-life); adds restart complexity, rarely worth it.
- **Ruled out: MIG** (fleet A40/A4500/3090/4090 don't support it). **MPS** = escalation only if co-located models contend for SMs (GPU ~30% util → headroom exists; defer until measured).
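The cost table and the startup packing pass described above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the field names come from the spec, but `WorkerCost`, `fit_cost`, `pack`, and every number in the demo are hypothetical. It also assumes one particular reading of "greedy" (each type takes the largest batch that still fits, in priority-then-pressure order) and treats `in_flight` as a pass-through setting with no VRAM cost, matching the `base_mb + batch*per_item_mb` charge in the spec.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class WorkerCost:
    """One row of the cost table. Field names follow the spec; values are placeholders."""
    worker_type: str
    base_mb: float
    per_item_mb: float
    min_batch: int
    max_batch: int
    in_flight: int
    target_depth_per_worker: int
    priority: int = 0


def fit_cost(batch_a, used_mb_a, batch_b, used_mb_b):
    """Two-point linear fit of used_mb = base_mb + batch * per_item_mb."""
    per_item_mb = (used_mb_b - used_mb_a) / (batch_b - batch_a)
    base_mb = used_mb_a - batch_a * per_item_mb
    return base_mb, per_item_mb


def pack(free_mb, table, depths):
    """Single greedy pass over types that have backlog. Returns (plan, leftover_mb)."""
    backlog = [c for c in table if depths.get(c.worker_type, 0) > 0]
    # Order: priority first, then KEDA-style pressure ceil(depth / target).
    backlog.sort(
        key=lambda c: (c.priority,
                       ceil(depths[c.worker_type] / c.target_depth_per_worker)),
        reverse=True,
    )
    plan, remaining = {}, free_mb
    for c in backlog:
        headroom = remaining - c.base_mb
        if headroom < c.min_batch * c.per_item_mb:
            continue  # not even min_batch fits; leave this type for the next pod
        batch = min(c.max_batch, int(headroom // c.per_item_mb))
        cost = c.base_mb + batch * c.per_item_mb
        plan[c.worker_type] = {"batch": batch, "in_flight": c.in_flight, "cost_mb": cost}
        remaining -= cost
    return plan, remaining


if __name__ == "__main__":
    # Hypothetical table and queue depths, for illustration only.
    table = [
        WorkerCost("image_embed", 2000, 50, 1, 64, 4, 100, priority=1),
        WorkerCost("text_embed", 1000, 10, 1, 128, 4, 100),
    ]
    plan, left = pack(8000, table, {"image_embed": 500, "text_embed": 250})
    print(plan, left)
```

One consequence of this reading is that a high-priority type with a large `max_batch` can starve later types on a small GPU. A variant that first reserves `min_batch` for every backlogged type and only then grows batches would stay greedy and avoid that; which behaviour is wanted is a design choice the spec leaves open.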
## Payoff

Real value lands with the 2nd+ concurrent workload: text now (#207), video/vodmanager later. Single-workload (image-only) still benefits: one model copy + more in_flight = a bigger batch than N duplicate processes could afford.

## Status

Draft (idea). Depends on #205.
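The single-workload claim in the Payoff section can be checked with a little arithmetic. The sketch below assumes the same linear cost model as the spec (`base_mb + batch*per_item_mb` per resident model) and an even split of the VRAM budget across duplicate processes; `max_batch_per_copy` and the numbers in the comments are hypothetical, not measurements.

```python
def max_batch_per_copy(budget_mb, base_mb, per_item_mb, copies=1):
    """Largest batch each copy can run when `copies` processes each hold the model."""
    per_copy_budget = budget_mb / copies
    return max(0, int((per_copy_budget - base_mb) // per_item_mb))


def total_items(budget_mb, base_mb, per_item_mb, copies=1):
    """Items resident on the GPU at once across all copies."""
    return copies * max_batch_per_copy(budget_mb, base_mb, per_item_mb, copies)
```

With an illustrative 8000 MB budget, `base_mb=2000`, and `per_item_mb=50`, one copy can hold 120 items, two copies hold 40 each (80 total), and four copies cannot fit a batch at all, because every duplicate pays `base_mb` again.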

agent · conductor-claude
note added 1d ago

Per-worker-type resource table (base_mb, per_item_mb, batch bounds, in_flight, target_depth) + a GREEDY VRAM packer that fits a heterogeneous model mix weighted by queue depth, computed at pod startup. Industry pattern: resource requests + bin-packing + KEDA-style queue-depth autoscaling. Depends on #3. Full spec to follow.

agent · conductor-claude
participant joined 1d ago
system · conductor-claude

epic · dependencies

Relationships

epic parent

Epic: Conductor multi-model + multi-app embedding (the vodmanager foundation)

depends on

#205 Build single-process multi-model...

agents · waves

Participants

conductor-claude participant · active

scope

Projects

legit-embedding · primary

dogfood · read-only

Agent’s-eye view

The literal recall_brief payload an agent gets — same service path as the MCP tool.