review · segments

Design central GPU task processing service

claude 3295 events 13 segments main

segment 1 of 13

Design central GPU task processing service (conductor)

Done

Explored lounge's GPU Runtime code, interviewed the user, and iteratively wrote a comprehensive plan document for extracting a headless watcher+thin API service (conductor) with a Laravel client package. Settled on naming conventions, scope boundaries, storage ownership decisions, and 6 rollout phases. Recorded 22+ architectural decisions in the plan.

outcome

Comprehensive plan document (_conductor-plan.md) exists with all architectural decisions, scope boundaries, and a 6-phase rollout.

next steps

key decisions

  • Service shape: headless watcher + thin API; data plane = Redis streams, control plane = small daemon + minimal HTTP API
  • Hosting: dedicated Redis on .73 (enm-storage) alongside MinIO
  • Worker connectivity: Tailscale tailnet canonical
  • Compute: cloud-only (RunPod primary, DigitalOcean secondary); homelab not in production pool
  • Runtime duration: no hard cap — runs while work exists, idle teardown only; Sentry alert at 24h
  • Scaling: fixed max 1 pod in v1; depth gates spawn/no-spawn
  • Cutover: fresh empty conductor-Redis; unfinished work re-enqueues from embeddings table cursor
  • conductor-client owns opinionated storage layer (model+registry+trait+migrations+ResultHandler)
  • conductor:make-embeddable scaffold command is fast-follow after lounge adoption
  • Env convention: CONDUCTOR_REDIS_*, CONDUCTOR_APP, CONDUCTOR_TARGET_DEPTH, CONDUCTOR_API_*
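
The scaling decision above (one pod at most, with queue depth gating spawn/no-spawn) can be sketched as a pure function. This is an illustration, not conductor's code: `QueueState`, `should_spawn`, and the `>=` comparison against the target depth are assumptions; only the one-pod cap and the idea of a depth threshold (CONDUCTOR_TARGET_DEPTH) come from the plan.

```python
from dataclasses import dataclass

MAX_PODS = 1  # v1 decision: fixed maximum of one pod


@dataclass(frozen=True)
class QueueState:
    """Hypothetical snapshot the watcher would take on each tick."""
    depth: int        # pending tasks across the watched streams
    active_pods: int  # pods currently counted as running


def should_spawn(state: QueueState, target_depth: int) -> bool:
    """Spawn only when the backlog reaches the threshold and no pod is up.

    Assumption: the threshold is inclusive (depth >= target_depth).
    """
    if state.active_pods >= MAX_PODS:
        return False
    return state.depth >= target_depth
```

Keeping the gate free of I/O makes it straightforward to unit-test separately from the daemon loop that reads stream depth.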

open questions

3 weeks ago 3 weeks ago

segment 2 of 13

Build first-cut conductor service and client repos

Done

Used a 12-agent workflow to scaffold and build both conductor (Laravel 12 service app) and conductor-client (Laravel package) repos locally. Verified workflow results and identified a missing GpuRuntimeHealthService gap. Launched refinement wave 1 via Solo Codex workers to fix the health service and enqueue row-creation. Verified and closed both refinements; documented carry-forward items.

outcome

Both repos built and refined; GpuRuntimeHealthService and enqueue row-creation fixed; carry-forward documented.

next steps

key decisions

  • Use Codex workers for refinement with self-contained task specs
  • Worker findings must be verified against ground truth before acceptance
  • Carry orphan reconciliation forward to wave 2; defer per-app ACL users until before Phase 4

open questions

3 weeks ago 2 weeks ago

segment 3 of 13

Set up Phase 0 infrastructure on .73

Done

Performed read-only recon of .73 (enm-storage), harvested GPU credentials from local lounge .env files, and finalized decisions (12GB conductor-redis with noeviction, light RDB, Tailscale fresh install). Executed Phase 0 steps 1-4: pushed repos to GitHub, installed conductor-redis on port 6390, installed Tailscale (node enm-storage-conductor at 100.102.178.49), deployed conductor service with all 44 env keys, and set up supervisor daemons (watch, janitor), staged but stopped.

outcome

conductor-redis running on .73:6390, Tailscale node joined, conductor service deployed and verified.

next steps

key decisions

  • conductor-redis maxmemory 12GB with noeviction (preserve RAM for future vector index)
  • RDB snapshots every 5min if >=1000 changes, no AOF (transport data re-enqueueable)
  • Sentry project 'conductor' in legitphp org
  • Single requirepass for v1; ACL users deferred
  • Exclude bootstrap/cache from rsync to avoid stale provider cache
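
As a rough illustration, the Redis decisions above map onto standard redis.conf directives (`port`, `maxmemory`, `maxmemory-policy`, `save`, `appendonly`, `requirepass`). The helper below only renders them as text; `render_redis_conf` and the placeholder password are hypothetical, and the actual config file on .73 is not shown in this log.

```python
def render_redis_conf(port: int = 6390,
                      maxmemory: str = "12gb",
                      password: str = "<set-me>") -> str:
    """Render the Phase 0 decisions as redis.conf lines (sketch only)."""
    directives = [
        ("port", str(port)),
        ("maxmemory", maxmemory),
        ("maxmemory-policy", "noeviction"),  # refuse writes rather than evict
        ("save", "300 1000"),                # RDB: every 5 min if >= 1000 changes
        ("appendonly", "no"),                # no AOF; transport data can be re-enqueued
        ("requirepass", password),           # single shared password in v1
    ]
    return "\n".join(f"{key} {value}" for key, value in directives) + "\n"
```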

open questions

2 weeks ago 2 weeks ago

segment 4 of 13

Complete end-to-end smoke test and fix pod spawn issues

Done

Initiated smoke test by spawning RunPod pod, diagnosed worker failures (CUDA mismatch, legacy GraphQL API limitation), switched RunPod adapter to REST API, fixed tailnet connectivity (REDIS_TAILSCALE_PROXY=true), and ran successful end-to-end test producing 3 text embeddings. Autonomous daemons remained stopped due to classifier restriction.

outcome

Smoke test passed; REST spawn and tailnet proxy fix deployed; embeddings verified at 384-dim.

next steps

key decisions

  • Default allowed CUDA versions changed to 12.8,12.7,12.6 only (removed 12.5,12.4)
  • Redis password rotated after exposure in transcript
  • Use REST /v1/pods with list-based gpuTypeIds instead of single-type GraphQL mutation for spawn
  • REDIS_TAILSCALE_PROXY must be true for userspace-networking Tailscale containers
  • Full GPU availability list enabled (not just cheapest) to avoid capacity delays
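
A minimal sketch of the spawn request body implied by the decisions above: a list of GPU types rather than a single type, plus the narrowed CUDA allow-list. Only `gpuTypeIds` and `allowedCudaVersions` are named in this log; `build_pod_request`, the `extra` parameter, and the GPU id strings used in any example are hypothetical, and no HTTP call is made here.

```python
ALLOWED_CUDA_VERSIONS = ["12.8", "12.7", "12.6"]  # 12.5 and 12.4 removed


def build_pod_request(gpu_type_ids, extra=None):
    """Build a JSON-serializable body for a REST pod-creation call.

    gpu_type_ids: ordered list of acceptable GPU types (the full list,
                  not just the cheapest).
    extra:        any other fields the API needs (image, name, ...);
                  left to the caller because they are not recorded here.
    """
    if not gpu_type_ids:
        raise ValueError("at least one GPU type id is required")
    body = dict(extra or {})
    body["gpuTypeIds"] = list(gpu_type_ids)
    body["allowedCudaVersions"] = list(ALLOWED_CUDA_VERSIONS)
    return body
```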

open questions

2 weeks ago 2 weeks ago

segment 5 of 13

Authorize go-live, validate autonomous auto-spawn, and plan lounge Phase 2 adoption

Done

Persisted REDIS_TAILSCALE_PROXY default to true, received user authorization for go-live, started conductor daemons, validated autonomous auto-spawn (safety check passed; spawn trigger worked but RunPod capacity blocked full cycle). Mapped lounge's embedding pipeline, identified global key prefix conflict, and wrote lounge-conductor-adoption-spec.md with toggle-gated, additive approach.

outcome

Conductor daemons live and autonomous on .73; safety check passed; adoption spec documented.

next steps

key decisions

  • Lounge integration uses conductor-client package (not simple repoint) to avoid global key prefix collision
  • Lounge's old GPU-Runtime machinery removed after text cutover proven
  • Lounge integration is additive, toggle-gated (default OFF)

open questions

2 weeks ago 2 weeks ago

segment 6 of 13

Implement lounge conductor-client adoption and consolidation

Done

Spawned Codex worker to install conductor-client in lounge. Fixed package Laravel 11 compatibility by widening illuminate constraints to ^11.0|^12.0. Verified prefix-free connection. Completed adoption (commit cdfe1b9e). Designed consolidation: remove GpuRuntime machinery unconditionally (never live in production). Implemented clip-vit-b-32 as selectable image model. Verified consolidation and closed.

outcome

Lounge text embedding unconditionally routed through conductor; clip-vit-b-32 implemented; GpuRuntime removed.

next steps

key decisions

  • conductor-client illuminate constraints widened to ^11.0|^12.0
  • Skip package's create-table migrations for embeddings/embedding_models (lounge already has tables)
  • Add explicit options.prefix='' to conductor connection to ensure bare key storage
  • Image model selection mirrors text path's model_name-based approach
  • Image preparation groups tasks by resolved model_name to avoid mixing dimensions
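
The last decision above (group image tasks by resolved model_name so one batch never mixes embedding dimensions) could look roughly like this. The task shape (a dict with an optional `model_name` key) and the `default_model` fallback are assumptions for illustration; lounge's actual implementation is PHP and is not reproduced here.

```python
from collections import defaultdict


def group_by_model(tasks, default_model):
    """Bucket tasks by resolved model name, preserving order within a bucket."""
    groups = defaultdict(list)
    for task in tasks:
        # Resolve the model: explicit name wins, otherwise the default.
        model = task.get("model_name") or default_model
        groups[model].append(task)
    return dict(groups)
```

Each bucket can then be enqueued as its own batch, so every vector in a batch shares one dimension.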

open questions

2 weeks ago 2 weeks ago

segment 7 of 13

Wire image embedding through conductor

Done

Researched codebase and wrote image conductor integration spec. Spawned Codex worker to reroute image XADDs, add image result consumer, and retire legacy process-results. Corrected dimension registry from 768 to 1024 (corpus-proven). Swapped daemons, enqueued test tasks, and discovered that the spawn threshold of 50 blocks small batches.

outcome

Image routing committed and deployed; dimension verified 1024 via corpus ground truth.

next steps

key decisions

  • Use existing conductor config streams for image (no new config keys)
  • Image embedding dimension is 1024, not 768 (corpus proven from 5 completed embeddings)

open questions

2 weeks ago 2 weeks ago

segment 8 of 13

Fix autonomous GPU spawn bugs and validate image pipeline at scale

Done

Diagnosed lack of spawn due to threshold and dormant API. Wired lounge API credentials and served conductor API on .73. Enqueued 55 tasks but RunPod reported no_capacity. Switched from COMMUNITY to SECURE cloud type, fixed broken availability gate, corrected GPU type IDs to enum-valid values. Discovered zero-results bug was premature idle-teardown during model loading. Validated 300/300 image embeddings at 1024-dim. Tore down the debug pod and documented the telemetry follow-up.

outcome

Autonomous spawn working on SECURE; image pipeline validated at scale (300 embeddings, 0 failures).

next steps

key decisions

  • When datacenter availability query returns empty, assume available and let REST pod creation decide
  • Use only GPU types that appear in RunPod REST API's gpuTypeIds enum
  • Default cloud_type changed from COMMUNITY to SECURE after repeated placement errors
  • Raise idle teardown window temporarily for debugging; restore after validation
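
The first two decisions above can be sketched as two small helpers: an availability gate that treats an empty answer as "unknown, so try anyway", and a filter that keeps only GPU types present in the REST enum. The row shape (`{"available": bool}`) and both function names are hypothetical.

```python
def availability_gate(availability_rows) -> bool:
    """Return True when a spawn attempt should proceed.

    An empty result is treated as unknown rather than as "no capacity";
    the REST pod-creation call then makes the real decision.
    """
    if not availability_rows:
        return True
    return any(row.get("available") for row in availability_rows)


def filter_gpu_types(candidates, rest_enum):
    """Keep candidate GPU type ids that the REST API's enum accepts."""
    return [gpu for gpu in candidates if gpu in rest_enum]
```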

open questions

2 weeks ago 2 weeks ago

segment 9 of 13

Design and implement batch telemetry system

Done

Proposed and documented batch telemetry architecture (stream contract, worker emit, conductor schema/consumer/API). Spawned Codex agent to build Phase 1+2 across conductor (migration, BatchTelemetry model, consume command, /api/stats endpoint) and legit-embedding (publish_batch_telemetry helper, emit from image and text paths). Verified both commits with full test suites green.

outcome

Telemetry system built and committed locally for both conductor and legit-embedding repos.

next steps

key decisions

  • Conductor is the natural home for telemetry persistence and analysis
  • Telemetry emit is best-effort: never block embedding on a telemetry failure
  • Worker emit catches and logs failures without raising; gated by WORKER_TELEMETRY_ENABLED (default true)
  • Telemetry_sent guard prevents double-emit on per-model groups in text pipeline
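
A sketch of the best-effort emit described above, assuming a `publish` callable stands in for the worker's publish_batch_telemetry helper and a plain dict carries the telemetry_sent flag. The set of strings accepted as "disabled" is also an assumption.

```python
import logging
import os

logger = logging.getLogger("worker.telemetry")

_FALSE_VALUES = {"0", "false", "no", "off"}  # assumed spellings of "disabled"


def telemetry_enabled(env=os.environ) -> bool:
    """WORKER_TELEMETRY_ENABLED defaults to true when unset."""
    value = env.get("WORKER_TELEMETRY_ENABLED", "true")
    return value.strip().lower() not in _FALSE_VALUES


def emit_best_effort(publish, payload, state, env=os.environ) -> bool:
    """Try to emit once per batch; never let a failure reach the caller.

    Returns True only when the payload was actually published.
    """
    if not telemetry_enabled(env) or state.get("telemetry_sent"):
        return False
    try:
        publish(payload)
    except Exception:
        # Embedding work must continue even if telemetry is broken.
        logger.exception("telemetry emit failed; continuing")
        return False
    state["telemetry_sent"] = True
    return True
```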

open questions

2 weeks ago 2 weeks ago

segment 10 of 13

Bake models into worker image, publish container, and create deploy process

Done

Created prefetch script to download EVA02, e5, and CLIP models at Docker build time (CPU loading) to eliminate runtime cold-start. Published container to GHCR. Verified fast cold-start (~4min) and 1024-dim embedding. Created repeatable deploy script (deploy/deploy.sh) and clean-deployed conductor to .73 with telemetry changes. Reset master branch to runpod-container-runtime to align history.

outcome

Baked-worker container published to GHCR; deploy script created; conductor clean-deployed with telemetry; master reset.

next steps

key decisions

  • Bake models into image instead of network volume (avoids DC pinning)
  • Prefetch models at build time using worker's own model loaders on CPU
  • Add disk-freeing step to publish workflow to avoid runner disk exhaustion
  • Use rsync-based deploy script (not git-ify-in-place) to preserve live SQLite state

open questions

2 weeks ago 2 weeks ago

segment 11 of 13

Complete prod go-live: fix slow query and validate full pipeline

Done

Validated connectivity from prod (.74) to conductor-redis (fixed missing CONDUCTOR_* env keys). Started daemons (consume-image-results, process-downloads). Seeded backfill but discovered expensive count query over 70M rows causing multi-second hang. Removed the count query, committed and pushed. Full pipeline cycle completed: 120 embeddings in 20min with 0 failures. Recorded go-live milestone.

outcome

Prod go-live proven: 120 completed embeddings at 1024-dim, 0 failures, unattended GPU embed with auto-spawn/teardown.

next steps

key decisions

  • Drop the full-corpus count query entirely; dispatch loop already handles zero backlog via empty-page break
  • Disable backfill app setting while deciding forward path to avoid continuous GPU cost
  • CONDUCTOR_REDIS_HOST on prod uses 10.0.0.2 (private 10G link) rather than tailnet IP
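
Why the count query could be dropped: a paged dispatch loop that stops on the first empty page already handles a zero backlog, so nothing needs to know the total up front. The sketch below assumes keyset pagination on an `id` column; `fetch_page`, `enqueue`, and the page size are hypothetical stand-ins for lounge's dispatch code.

```python
def dispatch_backlog(fetch_page, enqueue, page_size=500):
    """Enqueue every pending row without counting the corpus first.

    fetch_page(cursor, limit) -> list of rows with id > cursor (None = start).
    Returns the number of rows dispatched.
    """
    dispatched = 0
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        if not page:
            break  # empty page: backlog exhausted (or never existed)
        for row in page:
            enqueue(row)
        dispatched += len(page)
        cursor = page[-1]["id"]  # resume after the last row seen
    return dispatched
```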

open questions

2 weeks ago 1 week ago

segment 12 of 13

Set up telemetry, fix watcher idle-clock, and pipeline result XADDs

Done

Deployed conductor telemetry daemon but it crashed with phpredis XREADGROUP arity bug; fixed by calling native xReadGroup(). Identified that XADD is 81% of batch time (1923ms of 2378ms) due to sequential round-trips. Pipelined both image and text result XADDs to collapse N round-trips into one. Deployed pipelined image and discovered zombie-instance bug: teardown left is_active=true, blocking future spawns. Fixed touchFromStatus to deactivate on non-running states and created shared NON_RUNNING_STATES constant.
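
The pipelining fix can be sketched as follows. With sequential XADDs, each result costs one round-trip (1923 ms of a 2378 ms batch, about 81%); queuing them on a pipeline sends all N in one go. The function assumes a redis-py style client exposing `pipeline(transaction=False)` with `xadd` and `execute`; the function name and stream name are illustrative, and the worker's real code is not shown in this log.

```python
XADD_SHARE = 1923 / 2378  # about 0.81 of batch time before pipelining


def xadd_pipelined(client, stream, payloads):
    """Queue one XADD per payload and flush them in a single round-trip.

    client:   redis-py style client (assumption)
    payloads: iterable of flat field dicts, one per result
    Returns the list of replies from execute() (the new entry ids).
    """
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    for fields in payloads:
        pipe.xadd(stream, fields)
    return pipe.execute()
```

Calling this once per model group keeps a separate timing per model available, which is what the per-model xadd_ms telemetry relies on.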

outcome

Telemetry daemon running; result XADDs pipelined (theoretical ~1.9x throughput gain); zombie-instance bug fixed.

next steps

key decisions

  • Call phpredis xReadGroup() natively instead of passing raw RESP tokens to command()
  • Broaden activity detection to task-depth movement + PEL, not just result-stream XLEN deltas
  • Add runtimeAgeMinutes min-runtime grace (>= idleMinutes) to prevent premature teardown of young pods
  • Anchor idle clock to pod birth on spawn
  • Group result payloads by model for pipelining to preserve per-model xadd_ms telemetry
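
The idle-clock decisions above combine into a single teardown predicate. This is a sketch under assumptions: `PodSnapshot` and its field names are hypothetical, and the grace period is set equal to idle_minutes (the decision only requires it to be at least that long).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodSnapshot:
    """Hypothetical per-tick view of one pod."""
    runtime_age_minutes: float     # time since spawn
    minutes_since_activity: float  # idle clock, anchored to pod birth
    task_depth_delta: int          # change in task-stream depth since last tick
    pending_entries: int           # size of the consumer group's PEL


def should_teardown(pod: PodSnapshot, idle_minutes: float) -> bool:
    """Tear down only pods that are old enough and genuinely idle."""
    if pod.runtime_age_minutes < idle_minutes:
        return False  # min-runtime grace: young pods may still be loading models
    if pod.task_depth_delta != 0 or pod.pending_entries > 0:
        return False  # tasks are moving or in flight, even with no results yet
    return pod.minutes_since_activity >= idle_minutes
```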

open questions

1 week ago 1 week ago

segment 13 of 13

Implement base64-fp32 wire format, fix spawn resume, analyze performance, and document handoff

Done

Implemented base64-fp32 wire format for embedding results (~3x smaller than JSON), deployed across all three repos with backward compatibility. Discovered per-batch fixed latency dominates (~880ms), making wire format gain marginal. Reverted to SECURE cloud, fixed stale pod resume bug by making teardown destroy by default. Attempted VRAM-aware batch auto-sizing, but it caused a CUDA-after-fork regression; rolled back and pinned batch size at 32. Reclaimed orphaned pending embeddings. Created handoff documents (HANDOFF.md, spec-vram-aware-tuning.md, spec-multi-pod.md) and set up Solo project with remaining todos.

outcome

base64-fp32 wire format deployed; teardown fixed to destroy; batch size pinned at 32; handoff documented; VRAM tuning deferred.

next steps

  • Start VRAM-aware tuning work using nvidia-smi subprocess instead of torch.cuda
  • Consider multi-pod parallelism for realistic backfill timeline

key decisions

  • Wire format is base64-fp32 (little-endian float32) for zero precision loss and near-pass-through storage
  • Consumer is backwards-compatible: accepts both 'base64_fp32' and legacy 'json' via embedding_format marker
  • Default teardown mode is 'destroy' because 'stop' causes resume stalls on RunPod with pinned allowedCudaVersions
  • VRAM detection via torch.cuda is unsafe; use nvidia-smi subprocess instead
  • Batch 32 is a safe no-rebuild win; it fits every GPU in our enum
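
A minimal sketch of the wire format and the backward-compatible consumer described above, using only the standard library. The `embedding_format` marker and its two values come from this log; the `embedding` field name, the function names, and treating a missing marker as legacy JSON are assumptions.

```python
import base64
import json
import struct


def encode_fp32(vector) -> str:
    """Pack floats as little-endian float32, then base64-encode."""
    raw = struct.pack(f"<{len(vector)}f", *vector)
    return base64.b64encode(raw).decode("ascii")


def decode_fp32(text: str) -> list:
    """Inverse of encode_fp32; rejects payloads that are not whole floats."""
    raw = base64.b64decode(text)
    if len(raw) % 4:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def decode_result(fields: dict) -> list:
    """Accept both the new format and legacy JSON results."""
    fmt = fields.get("embedding_format", "json")  # assumed default: legacy
    if fmt == "base64_fp32":
        return decode_fp32(fields["embedding"])
    if fmt == "json":
        return json.loads(fields["embedding"])
    raise ValueError(f"unknown embedding_format: {fmt!r}")
```

A 384-dim vector is 1536 bytes of float32, which base64 expands to 2048 characters.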

open questions

  • What is the correct nvidia-smi command to get GPU memory total in a suitable format?
  • Should we adjust batch size further based on VRAM headroom?
  • Can a single pod handle multiple concurrent batches to hide per-batch latency?
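
One candidate answer to the first question, offered as a sketch rather than a settled choice: nvidia-smi's `--query-gpu=memory.total` with `--format=csv,noheader,nounits` prints one integer per GPU, in MiB. Running it in a subprocess keeps CUDA out of the parent process, which is the point of avoiding torch.cuda. The function names are hypothetical, and nothing here has been run on the worker image.

```python
import subprocess

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_total_mib(output: str) -> list:
    """Parse one integer (MiB) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def gpu_memory_total_mib() -> list:
    """Query total VRAM per GPU without initializing CUDA in this process."""
    result = subprocess.run(
        NVIDIA_SMI_QUERY, check=True, capture_output=True, text=True
    )
    return parse_memory_total_mib(result.stdout)
```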

1 week ago 5 days ago