Spike: choose the single-process multi-model serving substrate (Track B)

provenance · append-only

Trace

live

parent set 1d ago

Grouped under epic #201.
agent · conductor-claude
plan proposed 1d ago

# Spike: choose the single-process multi-model serving substrate (Track B)

Repo: **legit-embedding** (isolated worktree). A decision gate that feeds #205. Latency-insensitive workload → optimize throughput/$.

## Decide

Substrate for the single-process, multi-model async worker:

- **Ray Serve**: Python-native, arbitrary pre/post (our fetch→embed→XADD), dynamic batching, multi-model, replica autoscale. Likely front-runner.
- **Custom asyncio loop**: simplest, no heavy dep, but reimplements dynamic batching + lifecycle. Viable given how contained our loop is.
- **Triton**: evaluate but expect to REJECT. Rigid for our Redis-stream I/O + pre/post (Python backend/ensemble is fiddly).

## Prototype must prove

1. Load EVA02 image model + the text model **once** in one process (drop per-process duplication: `models.py:77` SingletonEmbeddingModel, `text_models.py:13` ModelManager).
2. **Async overlap**: a batch's XADD write (async) overlaps the next batch's GPU compute. This is the core win. Enough in-flight to hide ~1235ms xadd behind ~530ms compute (~4 in-flight to saturate; HANDOFF §5) with ONE model copy + N buffers, not N copies.
3. Per-model dynamic batching (size from a VRAM budget; feeds #206).
4. Clean Redis consumer-group draining + a **single, consolidated telemetry identity**: current names embed hostname+PID (`cli.py:67`, `embedding.py:23`, `preparation.py:68`, `stats_publisher.py`). A single process must consolidate these or conductor dashboards/telemetry keys change shape.

## Constraints / carry-overs

- Keep the wire contract unchanged: `encode_embedding` (`encoding.py:22`, base64_fp32) + `StreamConfig` (`config.py`).
- Must be able to honor `reply_to` (feeds #205).
- Must drain the same task streams as the current prep→embed handoff (replacing the `/dev/shm` + Celery `send_task` hop, `preparation.py:429`).

## Deliverable

A short decision doc (chosen substrate + why) + a runnable prototype + a concrete architecture for #205.

## Status

Draft (idea). Track B entry point; independent.
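
As a rough illustration of the async-overlap requirement (point 2 under "Prototype must prove"), here is a minimal asyncio sketch. It is not the repo's worker: `fake_embed`, `fake_xadd`, and `run_pipeline` are hypothetical names, both stages are simulated with `asyncio.sleep`, and the timings are the spec's ~530ms and ~1235ms figures scaled down 10x. It shows one serial compute stage (one model copy) with a bounded number of writes in flight.

```python
import asyncio
import time

# Assumed stand-ins for the spec's figures, scaled down 10x for a quick run.
COMPUTE_S = 0.053    # ~530 ms GPU compute per batch
XADD_S = 0.1235      # ~1235 ms XADD write per batch
MAX_IN_FLIGHT = 4    # in-flight budget quoted in the spec


async def fake_embed(batch_id: int) -> bytes:
    """Hypothetical stand-in for one batch of GPU compute."""
    await asyncio.sleep(COMPUTE_S)
    return f"embedding-{batch_id}".encode()


async def fake_xadd(payload: bytes, written: list) -> None:
    """Hypothetical stand-in for an async Redis XADD."""
    await asyncio.sleep(XADD_S)
    written.append(payload)


async def run_pipeline(n_batches: int, max_in_flight: int = MAX_IN_FLIGHT):
    """Run compute serially while up to max_in_flight writes proceed in the background."""
    written: list = []
    slots = asyncio.Semaphore(max_in_flight)
    tasks: set = set()

    async def write(payload: bytes) -> None:
        try:
            await fake_xadd(payload, written)
        finally:
            slots.release()  # free the buffer slot once the write lands

    start = time.perf_counter()
    for batch_id in range(n_batches):
        payload = await fake_embed(batch_id)  # serial: a single model copy
        await slots.acquire()                 # backpressure if writes fall behind
        task = asyncio.create_task(write(payload))
        tasks.add(task)                       # keep a reference until done
        task.add_done_callback(tasks.discard)
    await asyncio.gather(*tasks)              # drain the remaining writes
    return written, time.perf_counter() - start


if __name__ == "__main__":
    done, elapsed = asyncio.run(run_pipeline(8))
    serial = 8 * (COMPUTE_S + XADD_S)
    print(f"{len(done)} batches, overlapped {elapsed:.2f}s vs serial {serial:.2f}s")
```

The semaphore is the "N buffers" part: it caps how many finished batches can be waiting on a write, so memory stays bounded without duplicating the model. A real prototype would replace the sleeps with the actual embed call and Redis client, and would need to check that GPU compute does not block the event loop.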

agent · conductor-claude
note added 1d ago

Decide the substrate for the single-process multi-model async worker: Ray Serve vs custom asyncio (Triton likely too rigid). Prototype must prove load-once + async xadd/compute overlap with one model copy. Decision gate feeding #205. Full spec to follow.
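
The load-once requirement in this note could be sketched roughly as below. This is an illustrative assumption, not the repo's `SingletonEmbeddingModel` or `ModelManager`: `ModelRegistry` and the loader callables are hypothetical, and the "models" are plain objects so the snippet runs without a GPU.

```python
import threading
from typing import Any, Callable, Dict


class ModelRegistry:
    """Hypothetical registry: each named model is loaded at most once per process."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def register(self, name: str, loader: Callable[[], Any]) -> None:
        """Record how to build a model without building it yet."""
        self._loaders[name] = loader

    def get(self, name: str) -> Any:
        """Return the shared instance, loading it on first use only."""
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loaders[name]()
            return self._models[name]


# Usage with stand-in loaders that count how often they are invoked.
load_counts = {"image": 0, "text": 0}


def load_image_model() -> object:
    load_counts["image"] += 1
    return object()  # placeholder for the image model


def load_text_model() -> object:
    load_counts["text"] += 1
    return object()  # placeholder for the text model


registry = ModelRegistry()
registry.register("image", load_image_model)
registry.register("text", load_text_model)

first = registry.get("image")
second = registry.get("image")
text = registry.get("text")
```

Both calls to `registry.get("image")` return the same object, so every consumer in the process shares one copy of the weights. The lock only guards first-time loading; whether inference itself needs serializing is a separate question for the prototype.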
agent · conductor-claude
participant joined 1d ago
system · conductor-claude

epic · dependencies

Relationships

epic parent

Epic: Conductor multi-model + multi-app embedding (the vodmanager foundation)

depends on

No dependencies — dispatchable once planned.

blocking

#205 Build single-process multi-model async worker + worker-side reply_to (Track B)

agents · waves

Participants

conductor-claude participant · active

trace · graph

Links

No links yet — they accrue as agents work the brief.

scope

Projects

legit-embedding · primary

dogfood · read-only
