Segments — flower

segment 1 of 3

Research and plan GLM-5.2 deployment strategy on RunPod

Done

Analyzed the GLM-5.2 model specs (744B total / 40B active parameters, MoE with MLA attention, 1M context), quantization options (FP8 on Hopper, NVFP4 on Blackwell, AWQ-INT4), and RunPod pricing and stock. Decided on a two-phase plan: a cheap dress rehearsal first, then the real run. Fetched the official vLLM and SGLang recipes to verify the launch commands.

outcome

Verified model specs, quant sizes, and canonical serve commands for both vLLM (FP8) and SGLang (NVFP4). Established two-phase plan.

next steps

—

key decisions

Use INT4/AWQ first on 4×H200 for cheap pipeline shakeout, then FP8 on 8×H200 for faithful run
Target OpenAI-compatible endpoint for Pi integration (no proxy needed)
Use official zai-org/GLM-5.2-FP8 checkpoint (not community unsloth FP8)
Use purpose-built images (vllm/vllm-openai:glm52 and lmsysorg/sglang:latest)
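
The GPU counts in the first decision follow from simple memory arithmetic. A rough sketch, assuming weights dominate memory, 141 GB of HBM per H200, and nominal widths of 1 byte per parameter for FP8 and 0.5 bytes for 4-bit formats. Real checkpoints run larger because of quantization scales and layers kept in higher precision, and the KV cache and activations need headroom on top. The function names are illustrative and not part of any harness.

```javascript
// Nominal bytes per parameter (assumption: ignores scales and unquantized layers).
const BYTES_PER_PARAM = { fp8: 1, int4: 0.5, nvfp4: 0.5 };

// Weight-only footprint in GB. Billions of params × bytes per param maps
// straight to GB because the 1e9 factors cancel.
function weightGB(paramsBillions, format) {
  return paramsBillions * BYTES_PER_PARAM[format];
}

// Compare the weight footprint with the total memory of a GPU group.
function fits(paramsBillions, format, gpuCount, gbPerGpu) {
  const needed = weightGB(paramsBillions, format);
  const available = gpuCount * gbPerGpu;
  return { needed, available, fits: needed < available };
}

const H200_GB = 141; // HBM per H200

console.log(fits(744, "int4", 4, H200_GB)); // { needed: 372, available: 564, fits: true }
console.log(fits(744, "fp8", 4, H200_GB));  // { needed: 744, available: 564, fits: false }
console.log(fits(744, "fp8", 8, H200_GB));  // { needed: 744, available: 1128, fits: true }
```

On these numbers a 4-bit quant leaves about 190 GB of headroom on 4× H200, while FP8 only fits once the group grows to 8 GPUs.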

open questions

—

1 week ago → 1 week ago

segment 2 of 3

Build reusable RunPod orchestration harness and validate with dress rehearsal

Done

Created runpod.mjs — a reusable Node.js CLI for RunPod deployment (deploy/wait/status/test/piconfig/teardown/list). Wrote README.md runbook with all verified facts. Ran Phase 0 dress rehearsal on a single RTX 3090 ($0.22/hr) with Qwen2.5-7B-Instruct via vLLM: deployed, waited for /health, validated chat + tool-call round-trips over the public proxy with Bearer auth, then tore down. Balance remained $29.89.
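
The wait-then-test part of that sequence could look roughly like this. A minimal sketch, not the actual runpod.mjs: it assumes RunPod's `https://<podId>-<port>.proxy.runpod.net` proxy URL pattern, vLLM's `/health` and `/v1/chat/completions` routes on its default port 8000, and Node 18+ for the global `fetch`. The function names and the injectable `fetchImpl` option are hypothetical.

```javascript
// Build the public proxy base URL for a pod port (assumed RunPod URL pattern).
function proxyBase(podId, port = 8000) {
  return `https://${podId}-${port}.proxy.runpod.net`;
}

// Build an OpenAI-style chat request that carries the Bearer token.
function chatRequest(base, apiKey, model, prompt) {
  return {
    url: `${base}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
        max_tokens: 64,
      }),
    },
  };
}

// Poll /health until it answers 2xx or the deadline passes.
// Non-2xx answers (for example a 502 from the proxy) mean "not ready yet".
async function waitForHealth(
  base,
  { fetchImpl = fetch, intervalMs = 5000, timeoutMs = 600000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetchImpl(`${base}/health`);
      if (res.ok) return true;
    } catch {
      // Network errors while the pod boots are expected; keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

Once `waitForHealth(base)` resolves to `true`, the object returned by `chatRequest` can be passed directly as `fetch(url, init)`. Passing a stub as `fetchImpl` lets the polling logic be exercised without a live pod.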

outcome

Full pipeline validated end-to-end on cheap hardware. Harness and runbook written to ~/Documents/code/_glm/.

next steps

—

key decisions

Use RTX 3090 (24GB, $0.22/hr, Medium stock) for dress rehearsal
Use vLLM OpenAI server with Bearer auth to match Pi's expected connection pattern
Write all artifacts to ~/Documents/code/_glm/ for future reuse
Add memory pointer so future sessions find the experiment

open questions

—

1 week ago → 1 week ago

segment 3 of 3

Deploy and evaluate GLM-5.2 NVFP4 on 4× B200

In progress

After the user topped up the balance to $79.85, extended the harness with deployNvfp4(), targeting 4× B200 (secure cloud, $23.56/hr) with Mapika/GLM-5.2-NVFP4 on SGLang. Pod apei2qqlr2id24 was deployed, and the 440 GB model was downloaded and loaded into VRAM (119 GB/GPU, 67% utilization across all 4 B200s). At session end, the server was still in the CUDA-graph capture and kernel-compilation phase: the proxy returned HTTP 502 and volume usage was climbing to 230 GB. The eval function was written but not yet executed.
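
The figures in this summary imply a tight runway. As a quick check, assuming the whole balance goes to this one pod at a flat rate and ignoring storage charges and any other running pods (helper names are illustrative):

```javascript
// Hours of runtime a balance buys at a flat hourly rate.
function runwayHours(balanceUsd, ratePerHourUsd) {
  return balanceUsd / ratePerHourUsd;
}

// Device memory implied by a used amount and a utilization fraction.
function impliedCapacityGB(usedGB, utilization) {
  return usedGB / utilization;
}

const hours = runwayHours(79.85, 23.56);
console.log(hours.toFixed(2));                         // "3.39" hours
console.log(Math.floor(hours * 60));                   // 203 minutes
console.log(4 * 119);                                  // 476 GB in use across 4 GPUs
console.log(Math.round(impliedCapacityGB(119, 0.67))); // 178 GB per GPU
```

Roughly 3.4 hours of pod time means that time spent on download, load, and graph capture comes directly out of the evaluation budget.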

outcome

Pod running on 4× B200 (secure), NVFP4 weights loaded into VRAM, awaiting graph-capture completion to serve on port 30000.

next steps

Wait for SGLang to finish graph capture and bind port 30000
Run evalModel() to test correctness, coding, reasoning, tool calls, and throughput
Wire into Pi agent for real usage testing
Consider pointing the HF_HOME cache path at the /workspace volume so downloads persist
Clean up pre-existing lounge-embedding-worker pod if desired
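
For the HF_HOME item, one option is to set the cache location in the pod's environment before the server starts. A sketch under assumptions: `HF_HOME` is the Hugging Face cache root variable, `/workspace` is the volume mount path, and the `/workspace/hf` subdirectory and the helper name are hypothetical. How the environment map reaches the pod depends on the harness.

```javascript
// Add a volume-backed Hugging Face cache path to a pod environment map.
// Caller-supplied values win, so an explicit HF_HOME override is preserved.
function withVolumeCache(env = {}, volumeMount = "/workspace") {
  return { HF_HOME: `${volumeMount}/hf`, ...env };
}

console.log(withVolumeCache({ HF_TOKEN: "<token>" }));
// { HF_HOME: '/workspace/hf', HF_TOKEN: '<token>' }
```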

key decisions

Target 4× B200 secure cloud ($23.56/hr) for NVFP4 run (NVSwitch topology guaranteed)
Use Mapika/GLM-5.2-NVFP4 (440 GB) with SGLang v0.5.13.post1-cu130
Use --context-length 32768 for initial testing (not full 1M)
Deploy with 500 GB volume disk for model storage
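
These decisions map onto an SGLang launch command along the following lines. This is a sketch, not the command the harness ran: it sticks to widely documented `sglang.launch_server` options (`--model-path`, `--tp-size`, `--context-length`, `--host`, `--port`) and deliberately omits quantization and MoE-backend flags, which should be taken from the official recipe. The helper name is hypothetical.

```javascript
// Assemble an argv array for sglang.launch_server from the chosen settings.
function sglangArgs({ modelPath, tpSize, contextLength, port = 30000 }) {
  return [
    "python3", "-m", "sglang.launch_server",
    "--model-path", modelPath,
    "--tp-size", String(tpSize),            // one tensor-parallel rank per GPU
    "--context-length", String(contextLength),
    "--host", "0.0.0.0",                    // listen on all interfaces so the proxy can reach it
    "--port", String(port),
  ];
}

const args = sglangArgs({
  modelPath: "Mapika/GLM-5.2-NVFP4",
  tpSize: 4,
  contextLength: 32768,
});
console.log(args.join(" "));
```

Keeping the arguments as an array rather than a single string avoids shell-quoting problems if the harness passes them to a process spawner or a container start command.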

open questions

Why did the model download land on container disk (60 GB) instead of /workspace volume (500 GB)? Need to fix HF_HOME routing.
Will the flashinfer_cutlass MoE backend work correctly on B200 for NVFP4?
What is the actual throughput (tok/s) and quality of the NVFP4 quant vs FP8?
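
The throughput question can be answered to a first approximation from a single timed request. A minimal sketch, assuming the server returns the OpenAI-style `usage.completion_tokens` field and Node 18+ globals (`fetch`, `performance`); the helper names are hypothetical and this is not the harness's evalModel(). Timing one non-streamed request folds prefill time into the result, so it understates pure decode speed.

```javascript
// Output tokens per second from a usage object and wall-clock time.
// Returns null when usage is missing or the elapsed time is not positive.
function tokensPerSecond(usage, elapsedMs) {
  if (!usage || !(elapsedMs > 0)) return null;
  return usage.completion_tokens / (elapsedMs / 1000);
}

// Time one chat-completion request and derive throughput from its usage block.
async function timedCompletion(url, init, fetchImpl = fetch) {
  const start = performance.now();
  const res = await fetchImpl(url, init);
  const json = await res.json();
  const elapsedMs = performance.now() - start;
  return { tokPerSec: tokensPerSecond(json.usage, elapsedMs), elapsedMs };
}

console.log(tokensPerSecond({ completion_tokens: 256 }, 4000)); // 64
```

Comparing NVFP4 with FP8 fairly would need the same prompts, the same `max_tokens`, and several repetitions per configuration, since a single request is noisy.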

1 week ago → 1 week ago

Set up GLM-5.2 on RunPod with H200 GPUs