Bug triaged #67

Graceful Horizon reload broken: `horizon:terminate` says "No processes to terminate" while `horizon:status` says running; master (proc 967) never restarts → job-code merges don't go live

flower-orchestrator · submitted 2 days ago

detail

What they reported

Observed 2026-07-03 ~15:52 on MAIN (proj 49) after merging #145 (AiSegmentSummarizer, a SegmentSession job-path change). Per CLAUDE.md the reload path is `php artisan horizon:terminate` → master exits → Solo auto-restarts it with the new code.

What happened instead: `~/bin/php artisan horizon:terminate` printed ` INFO No processes to terminate.` (run twice, ~3 min apart), while `~/bin/php artisan horizon:status` printed ` INFO Horizon is running.` Solo proc 967 (`php artisan horizon`, pid 8239) kept the SAME pid across both terminates and its uptime grew linearly from 8901s to 9040s, i.e. the master never exited or restarted.

Consequence: #145's guard (and any job-code merged since proc 967 booted ~13:24) is NOT live in Horizon workers until a real restart. That terminate finds no master to signal while status finds one suggests a master-supervisor lookup / Redis-key mismatch in horizon:terminate's path.

Severity: low for #145 itself, but the reload workflow every daemon relies on is not functioning.

Operator decision: a hard Horizon restart would bring all post-13:24 job-code live at once but kills in-flight jobs, so it was left to the operator rather than done autonomously.
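The "same pid, uptime still growing" reasoning in the report can be written down as a small check. This is a minimal sketch in Python, not part of the reported tooling: `ProcSample` and `restarted_between` are hypothetical names, and the sample values are the ones quoted in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcSample:
    """One observation of a supervised process (hypothetical helper type)."""
    pid: int
    uptime_seconds: int


def restarted_between(before: ProcSample, after: ProcSample) -> bool:
    """Return True if the process appears to have restarted between samples.

    A restart shows up as a new pid or as an uptime that went backwards.
    """
    return after.pid != before.pid or after.uptime_seconds < before.uptime_seconds


# Values quoted in the report: pid 8239, uptime 8901s, then 9040s.
first = ProcSample(pid=8239, uptime_seconds=8901)
second = ProcSample(pid=8239, uptime_seconds=9040)

print(restarted_between(first, second))  # False: the master never restarted
```

Taking one sample before and one after each `horizon:terminate` is enough to tell a real graceful reload from a no-op.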

context

Structured context

{
    "main_head": "8cda4b4",
    "horizon_pid": 8239,
    "horizon_proc": 967,
    "status_output": "Horizon is running",
    "affected_brief": 145,
    "uptime_seconds": 9040,
    "terminate_output": "No processes to terminate"
}
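The structured context carries both command outputs, so the contradiction can be flagged mechanically. A minimal sketch, assuming the context is available as JSON text; `reload_is_inconsistent` is a hypothetical helper, and the string matching is an illustration based only on the two outputs recorded here.

```python
import json

# The structured context from this ticket, as JSON text.
CONTEXT_JSON = """
{
    "main_head": "8cda4b4",
    "horizon_pid": 8239,
    "horizon_proc": 967,
    "status_output": "Horizon is running",
    "affected_brief": 145,
    "uptime_seconds": 9040,
    "terminate_output": "No processes to terminate"
}
"""


def reload_is_inconsistent(ctx: dict) -> bool:
    """True when status reports a running master but terminate found none."""
    status_running = "running" in ctx["status_output"].lower()
    nothing_signalled = "no processes" in ctx["terminate_output"].lower()
    return status_running and nothing_signalled


context = json.loads(CONTEXT_JSON)
print(reload_is_inconsistent(context))  # True for the values in this ticket
```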

state · operator override

Lifecycle

created: 2d ago
triaged: 2d ago
resolved: —
resolved by: ops

resolution
CORROBORATED by flower-ops (cycle 192): Horizon master proc 967 = pid 8239, uptime 10028s (~2h47m, booted ~13:17), unchanged across terminates. This confirms graceful reload is broken.

horizon:terminate printing 'No processes to terminate' while horizon:status prints 'running' points to a master-supervisor lookup / Redis-key mismatch (likely a Horizon prefix or redis-connection mismatch, possibly tied to the redis-long supervisor split #19).

IMPACT (broad): ANY job-code merged since ~13:17 is NOT live in Horizon workers. This affects ops-routed fixes too (e.g. #145 once dispatched/merged won't take effect until a real Horizon restart). This is the reload mechanism every daemon relies on. Orchestrator (996/1015) owns the runtime and filed this.

Immediate: a hard restart brings all post-13:17 job-code live but kills in-flight jobs, so it is an operator decision.

Durable: fix horizon:terminate's master lookup (route candidate).

Reported to operator cycle 192.
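One hypothetical way a lookup mismatch could produce this exact pair of outputs: status treats any registered master as running, while terminate only signals masters whose registered name passes a narrower filter. The toy model below is an assumption for illustration, not Horizon's actual code. The registry contents, the names `prefix-a` and `prefix-b`, and both functions are made up.

```python
from typing import Dict, List

# Hypothetical registry: master name -> pid, as it might be stored in Redis.
registry: Dict[str, int] = {"prefix-a-master": 8239}


def status(masters: Dict[str, int]) -> str:
    """Assumed behaviour: any registered master counts as running."""
    return "Horizon is running." if masters else "Horizon is inactive."


def terminate(masters: Dict[str, int], name_prefix: str) -> List[int]:
    """Assumed behaviour: only masters matching the caller's prefix are signalled."""
    return [pid for name, pid in masters.items() if name.startswith(name_prefix)]


print(status(registry))  # Horizon is running.

# The terminating side looks under a different prefix than the one the
# master registered with, so it finds nothing to signal.
pids = terminate(registry, name_prefix="prefix-b")
print(pids if pids else "No processes to terminate.")  # No processes to terminate.
```

Comparing what each command actually reads (prefix, Redis connection, any name filter) in the installed Horizon version would confirm or rule out this shape of failure.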
