Set up GLM-5.2 on RunPod with H200 GPUs
claude 496 events 3 segments main
segment 1 of 3
Research and plan GLM-5.2 deployment strategy on RunPod
Analyzed GLM-5.2 model specs (744B total / 40B active MoE, MLA attention, 1M context), quantization options (FP8 on Hopper, NVFP4 on Blackwell, AWQ-INT4), and RunPod pricing and stock. Decided on a two-phase plan: a cheap dress rehearsal first, then the real run. Fetched the official vLLM and SGLang recipes to verify launch commands.
outcome
Verified model specs, quant sizes, and canonical serve commands for both vLLM (FP8) and SGLang (NVFP4). Established two-phase plan.
next steps
—
key decisions
- Use AWQ-INT4 first on 4×H200 for a cheap pipeline shakeout, then FP8 on 8×H200 for the faithful run
- Target OpenAI-compatible endpoint for Pi integration (no proxy needed)
- Use official zai-org/GLM-5.2-FP8 checkpoint (not community unsloth FP8)
- Use purpose-built images (vllm/vllm-openai:glm52 and lmsysorg/sglang:latest)
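The GPU counts in the two-phase decision can be sanity-checked with weight-only arithmetic. This is a rough sketch, not the sizing method used in the session: it assumes 141 GB per H200 and a 15% headroom for KV cache and activations (both assumptions), and it ignores quantization scale overhead. The 440 GB NVFP4 checkpoint recorded in segment 3 is larger than the 372 GB 4-bit estimate, so these figures are best read as lower bounds.

```javascript
// Rough weight-only footprint: parameters (billions) × bits per weight / 8 = GB.
// Ignores KV cache, activations, and quantization scale overhead.
function weightGB(totalParamsBillions, bitsPerWeight) {
  return (totalParamsBillions * bitsPerWeight) / 8;
}

// Does the estimate fit in pooled VRAM while leaving a fractional headroom?
function fits(weightsGB, gpuCount, vramPerGpuGB, headroom = 0.15) {
  const usableGB = gpuCount * vramPerGpuGB * (1 - headroom);
  return weightsGB <= usableGB;
}

const PARAMS_B = 744; // total parameters, from the spec summary above
const H200_GB = 141;  // assumption: nominal H200 memory capacity

const fp8 = weightGB(PARAMS_B, 8);  // 8 bits per weight
const int4 = weightGB(PARAMS_B, 4); // 4 bits per weight

console.log({ fp8, int4 });
console.log('INT4 on 4×H200 fits:', fits(int4, 4, H200_GB));
console.log('FP8 on 4×H200 fits:', fits(fp8, 4, H200_GB));
console.log('FP8 on 8×H200 fits:', fits(fp8, 8, H200_GB));
```

Under these assumptions FP8 does not fit on 4×H200 but does on 8×H200, which is consistent with the plan above.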
open questions
—
1 week ago → 1 week ago
segment 2 of 3
Build reusable RunPod orchestration harness and validate with dress rehearsal
Created runpod.mjs, a reusable Node.js CLI for RunPod deployment (deploy/wait/status/test/piconfig/teardown/list). Wrote a README.md runbook with all verified facts. Ran the Phase 0 dress rehearsal on a single RTX 3090 ($0.22/hr) with Qwen2.5-7B-Instruct via vLLM: deployed, waited for /health, validated chat and tool-call round-trips over the public proxy with Bearer auth, then tore down. The balance remained $29.89.
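The chat and tool-call round-trip described above could look roughly like the following. This is an illustrative sketch, not the contents of runpod.mjs: the pod ID, API key, and the `get_time` tool are placeholders, and port 8000 is assumed because it is vLLM's default. The request is built but not sent, so the shape can be inspected offline.

```javascript
// RunPod exposes a pod's HTTP port at https://<podId>-<port>.proxy.runpod.net
function proxyBase(podId, port) {
  return `https://${podId}-${port}.proxy.runpod.net`;
}

// Build (but do not send) an OpenAI-style chat request that offers one tool.
function chatRequest(base, apiKey, model, prompt) {
  return {
    url: `${base}/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: 'user', content: prompt }],
        tools: [{
          type: 'function',
          function: {
            name: 'get_time', // hypothetical tool, for illustration only
            description: 'Return the current UTC time.',
            parameters: { type: 'object', properties: {} },
          },
        }],
      }),
    },
  };
}

// True when the first choice of a chat response carries at least one tool call.
function hasToolCall(response) {
  const msg = response?.choices?.[0]?.message;
  return Array.isArray(msg?.tool_calls) && msg.tool_calls.length > 0;
}

const base = proxyBase('abc123', 8000); // placeholder pod ID
const req = chatRequest(base, 'sk-example', 'Qwen/Qwen2.5-7B-Instruct', 'What time is it?');
console.log(req.url);
// To send it: const res = await fetch(req.url, req.init);
```

Keeping request construction separate from the network call makes the Bearer header and tool schema easy to check before any GPU time is spent.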
outcome
Full pipeline validated end-to-end on cheap hardware. Harness and runbook written to ~/Documents/code/_glm/.
next steps
—
key decisions
- Use RTX 3090 (24GB, $0.22/hr, Medium stock) for dress rehearsal
- Use vLLM OpenAI server with Bearer auth to match Pi's expected connection pattern
- Write all artifacts to ~/Documents/code/_glm/ for future reuse
- Add memory pointer so future sessions find the experiment
open questions
—
1 week ago → 1 week ago
segment 3 of 3
Deploy and evaluate GLM-5.2 NVFP4 on 4× B200
After the user topped up to $79.85, extended the harness with deployNvfp4() targeting 4× B200 (secure cloud, $23.56/hr) with Mapika/GLM-5.2-NVFP4 on SGLang. Pod apei2qqlr2id24 was deployed, and the 440 GB model was downloaded and loaded into VRAM (119 GB/GPU, 67% utilization across all 4 B200s). At session end, the server was still in the CUDA-graph capture/kernel compilation phase: the proxy returned 502 and volume usage was climbing to 230 GB. The eval function was written but not yet executed.
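At $23.56/hr the balance bounds how long the pod can stay up, so a runway estimate is worth keeping next to the deploy step. The helpers below are a hypothetical sketch, not part of the harness. They use the figures quoted above and ignore storage charges and any other running pods.

```javascript
// Whole minutes of runtime a balance covers at a given hourly rate.
function runwayMinutes(balanceUSD, hourlyUSD) {
  if (!(hourlyUSD > 0)) throw new RangeError('hourly rate must be positive');
  return Math.floor((balanceUSD / hourlyUSD) * 60);
}

// Cost of running for a number of minutes, rounded to cents.
function spendUSD(minutes, hourlyUSD) {
  return Math.round((minutes / 60) * hourlyUSD * 100) / 100;
}

const BALANCE_USD = 79.85; // balance after top-up, from the text above
const RATE_USD_HR = 23.56; // 4× B200 secure cloud rate, from the text above

console.log('runway (min):', runwayMinutes(BALANCE_USD, RATE_USD_HR));
console.log('cost of 30 min:', spendUSD(30, RATE_USD_HR));
```

Download, load, and graph capture all draw from the same runway, which is one reason the persistence question about the model cache matters.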
outcome
Pod running on 4× B200 (secure), NVFP4 weights loaded into VRAM, awaiting graph-capture completion to serve on port 30000.
next steps
- Wait for SGLang to finish graph capture and bind port 30000
- Run evalModel() to test correctness, coding, reasoning, tool calls, and throughput
- Wire into Pi agent for real usage testing
- Consider pointing the HF_HOME cache path at the /workspace volume so downloads persist
- Clean up pre-existing lounge-embedding-worker pod if desired
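The first next step, waiting for the server to bind its port, can be sketched as a polling loop that treats the proxy's 502 as "still starting" rather than as a failure. This is an assumed design, not the harness's actual wait command: the status mapping, the interval and timeout defaults, and the injectable `fetchImpl` parameter are illustrative choices.

```javascript
// Map an HTTP status from the proxy to a readiness state.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return 'ready';
  if (status === 502 || status === 503) return 'starting'; // upstream not bound yet
  return 'error';
}

// Poll a health URL until it is ready, the timeout passes, or an unexpected status appears.
// Resolves true when ready and false on timeout.
async function waitUntilReady(url, { intervalMs = 15000, timeoutMs = 30 * 60 * 1000, fetchImpl = fetch } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    let status = null;
    try {
      status = (await fetchImpl(url)).status;
    } catch {
      status = null; // network errors while the pod boots count as not ready
    }
    const state = status === null ? 'starting' : classifyStatus(status);
    if (state === 'ready') return true;
    if (state === 'error') throw new Error(`unexpected status ${status} from ${url}`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}

console.log(classifyStatus(502)); // the state observed at session end
```

A generous timeout is deliberate here, since graph capture for a model of this size had not finished by the end of the session.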
key decisions
- Target 4× B200 secure cloud ($23.56/hr) for NVFP4 run (NVSwitch topology guaranteed)
- Use Mapika/GLM-5.2-NVFP4 (440 GB) with SGLang v0.5.13.post1-cu130
- Use --context-length 32768 for initial testing (not full 1M)
- Deploy with 500 GB volume disk for model storage
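These decisions translate into a launch command along the following lines. The exact command used in the session is not recorded here, so this is a reconstruction under assumptions: only well-known SGLang flags are used (`--model-path`, `--tp`, `--context-length`, `--host`, `--port`), tensor parallelism of 4 is inferred from the GPU count, and the `HF_HOME` value is a hypothetical path on the volume.

```javascript
// Build the environment and argv for an SGLang server launch.
function sglangLaunch({ modelPath, tp, contextLength, port }) {
  return {
    env: {
      HF_HOME: '/workspace/hf', // assumption: keep the model cache on the volume
    },
    argv: [
      'python3', '-m', 'sglang.launch_server',
      '--model-path', modelPath,
      '--tp', String(tp),
      '--context-length', String(contextLength),
      '--host', '0.0.0.0',
      '--port', String(port),
    ],
  };
}

const launch = sglangLaunch({
  modelPath: 'Mapika/GLM-5.2-NVFP4',
  tp: 4,                // assumption: one tensor-parallel rank per B200
  contextLength: 32768, // initial testing value from the decisions above
  port: 30000,
});

console.log(launch.env);
console.log(launch.argv.join(' '));
```

Returning the environment and argv as data makes it easy to diff the NVFP4 launch against a later FP8 one.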
open questions
- Why did the model download land on the container disk (60 GB) instead of the /workspace volume (500 GB)? HF_HOME routing needs fixing.
- Will the flashinfer_cutlass MoE backend work correctly on B200 for NVFP4?
- What is the actual throughput (tok/s) and quality of the NVFP4 quant vs FP8?
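For the throughput question, one simple measure is completion tokens divided by wall-clock time, using the `usage` object that OpenAI-compatible servers return. The sketch below is not the evalModel() written in the session, and the sample numbers are placeholders, not measurements. The figure it produces includes prefill and network latency, so it understates pure decode speed.

```javascript
// End-to-end tokens per second from a response's usage block and elapsed time.
function tokensPerSecond(usage, elapsedMs) {
  if (!(elapsedMs > 0)) throw new RangeError('elapsedMs must be positive');
  return usage.completion_tokens / (elapsedMs / 1000);
}

// Mean throughput over several runs, for comparing two quants on the same prompts.
function meanTokensPerSecond(runs) {
  if (runs.length === 0) throw new RangeError('need at least one run');
  const total = runs.reduce((sum, r) => sum + tokensPerSecond(r.usage, r.elapsedMs), 0);
  return total / runs.length;
}

// Placeholder data, not results from the pod.
const sampleRuns = [
  { usage: { completion_tokens: 512 }, elapsedMs: 8000 },
  { usage: { completion_tokens: 256 }, elapsedMs: 4000 },
];

console.log(meanTokensPerSecond(sampleRuns));
```

Running the same prompt set against the NVFP4 and FP8 deployments would give comparable numbers for the speed half of the question. Quality needs separate task-level checks.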
1 week ago → 1 week ago