Diagnose + repair the Horizon graceful-reload path (horizon:terminate no-ops while status=running)

canonical · plan

Spec

hand-off · dispatch

Dispatch

This brief is complete — dispatch is closed.

#33 done fresh flower · flower/127-horizon-reload-diag

agent: codex

You are being dispatched from flower Brief #127: Diagnose + repair the Horizon graceful-reload path (horizon:terminate no-ops while status=running)

Recall pointer:
- Use recall_brief with id 127 for the full folder if you need provenance.

Target:
- project: flower (/Users/mikeferrara/Documents/code/flower)
- branch: flower/127-horizon-reload-diag
- worktree: not specified
- kind: fresh

Current brief spec:
(no spec yet)

This is a direct request, not a fully-specced plan. If it's clear, resolve it. If you hit a blocking ambiguity, call brief_ask (or brief_append) with your questions and flip the brief to `refining` before proceeding — don't guess.

Recent/key trace events:
[1] participant_joined flower-orchestrator: (no body)
[2] note_added flower-orchestrator: Operator-reported flaky behavior (2026-07-03): `php artisan horizon:terminate` intermittently reports "No processes to terminate" while `php artisan horizon:status` reports "Horizon is running" (observed live: Solo proc 967 / pid unchanged after terminate → the graceful reload silently did nothing). Consequence: the documented graceful pipeline reload does NOT reliably reload Horizon, so every job/pipeline code deploy currently needs a manual hard `restart_process(967)` (against the intended workflow, hard-kills in-flight jobs). Operator: "sometimes it works, sometimes it doesn't — seems like it might be a Solo thing."

⚠️ SAFETY — READ FIRST. MAIN's Horizon (Solo proc 967) is the LIVE pipeline and shares MAIN's Redis. This brief is DIAGNOSTIC-ONLY. Do NOT run `php artisan horizon`, `horizon:terminate`, `horizon:pause/continue/clear`, or any restart against MAIN or its shared Redis — you could disrupt live ingest/embed/summarize on the operator's real data. READ-ONLY inspection only. If you conclude a live REPRODUCTION is required, STOP and report exactly what isolated setup you need (a worktree with its OWN isolated Redis instance — a separate redis, not just a key prefix on the shared one). Do NOT improvise isolation.

INVESTIGATE (read-only):
1. How Laravel Horizon's `terminate` locates the master to signal (it reads the master supervisor record/PID from cache/Redis — `MasterSupervisor`, `horizon:` keys). Determine WHY terminate finds "no processes" while `status` says running. Candidate causes: (a) cache-store / Redis connection / prefix mismatch between the CLI invocation's env and the running Horizon's env (Solo launched 967 ~14h before the observation — has .env / HORIZON_PREFIX / cache prefix drifted since boot?); (b) the master PID record expired/evicted from cache while the process still lives; (c) terminate relies on a POSIX signal to a PID that Solo's process supervision masks/reparents; (d) a Horizon version nuance in this app's installed version.
2. Inspect flower's config read-only: `config/horizon.php` (prefix/use/waits), the cache + redis connections Horizon uses, and how Solo launches proc 967 (env, cwd, prefix) — cross-check the running env vs a fresh CLI env.
3. Read-only Redis inspection of the horizon master keys (e.g. via the app's redis connection / redis-cli GET/KEYS on `horizon:*` master keys) to see whether the master record exists and matches what `terminate` looks up. Do NOT write/delete keys.

DELIVERABLE (post to this brief via brief_append): root-cause finding + a concrete recommended fix — e.g. align the cache/prefix, a Solo-native reliable reload, a small `flower:horizon-reload` wrapper command that signals reliably, or a documented reliable reload procedure. If the fix is code, propose it as a Phase-2 build (separate go). If it's an ops/config issue, document the reliable procedure. Diagnosis may be REPORT-ONLY (no commit) — that is a fine outcome; if you do commit, `Brief: #<this-id>` trailer, `php artisan test` green + pint.

Provenance: operator-requested review/debug (2026-07-03), after the orchestrator hit this live merging Brief #114's job-code change and had to fall back to authorized restart_process(967).
[3] status_change flower-orchestrator: (no body)

Recommended linked context:
{
    "todos": [],
    "scratchpads": []
}

Execution notes:
- Treat the brief as the source of truth.
- Keep work scoped to this dispatch request.
- Use brief_append / brief_update_status when reporting material progress; as your final dispatched-worker step, call brief_dispatch_complete with dispatch_request_id (or brief_id) and actor_ref.
- Codex workers should verify mutating Flower tools with tool_search query `brief_append brief_dispatch_complete flower_feedback` (limit 20) when tool availability is in doubt; report raw SEE/LOAD vs NOT visible instead of silently using local fallbacks.
- Add a git commit trailer `Brief: #127` to every commit for this brief so flower can exact-link commits back to the brief.

provenance · append-only

Trace

live

status change 2d ago
agent · flower-foundation-127
dispatched 2d ago

Dispatch request #33 marked done.
agent · flower-foundation-127
comment 2d ago

ROOT-CAUSE DIAGNOSTIC (read-only; no code changes)

Scope/safety: I did not run `php artisan horizon`, `horizon:terminate`, pause/continue/clear, or any restart. Redis inspection was read-only only. Operator clarified the failed terminate happened earlier against the prior master (pid 92337, ~14h uptime) and that master has since been restarted away; the currently visible master is the healthy post-restart one and cannot reproduce the old failure.

Findings:
- Installed Horizon is laravel/horizon v5.47.2 (`composer.lock` source tree v5.47.2).
- Exact vendor-code divergence:
  - `Laravel\Horizon\Console\StatusCommand::handle()` calls `MasterSupervisorRepository::all()`. If that returns any masters, and none are paused, it reports `Horizon is running.` It does not filter by current machine basename.
  - `Laravel\Horizon\Console\TerminateCommand::handle()` also calls `MasterSupervisorRepository::all()`, but then filters to masters where `Str::startsWith($master->name, MasterSupervisor::basename())`. Only after that filter does it pluck PIDs and call `posix_kill($pid, SIGTERM)`. If the filtered set is empty, it prints `No processes to terminate.`
- Therefore `status=running` + `terminate=No processes to terminate` is possible when a fresh master record exists but its name does not start with the basename resolved by the terminating CLI process. A PID/signal problem would not produce this message; it would reach the PID list and report a per-process `posix_kill` failure.
- Horizon master records are short-lived Redis metadata. `RedisMasterSupervisorRepository::names()` returns members from `masters` scored within the last 14s; `update()` writes `master:<name>`, zadds `masters`, and sets the master hash TTL to 15s. Horizon uses the `horizon` Redis connection configured from `config('horizon.use') = default` and `config('horizon.prefix')`, which resolves here to `laravel_horizon:` unless `HORIZON_PREFIX` is set.
- Current post-restart master evidence (healthy, not the failing old pid): Solo project 49 process 967 is `php artisan horizon`, OS pid 23668, cwd `/Users/mikeferrara/Documents/code/flower`. Redis has `laravel_horizon:masters` and `laravel_horizon:master:alargepaperweight-Bnuu`; the hash says `name=alargepaperweight-Bnuu`, `pid=23668`, `status=running`, `environment=local`, supervisors fast+long, TTL 15s. The ZSET score refreshed during inspection (1783080043 -> 1783080224, i.e. 07:00:43 -> 07:03:44 CDT). `hostname` is `ALargePaperWeight`, whose Horizon basename slug is `alargepaperweight`, matching the current master name prefix.
- Config/env check: MAIN and the worktree both have `APP_NAME=Laravel`, `CACHE_STORE=redis`, `QUEUE_CONNECTION=redis`, and no visible `HORIZON_PREFIX`. The worktree has `REDIS_PREFIX=wt_foundation` while MAIN does not, but Horizon metadata is not using that queue/cache prefix; Horizon overrides the `horizon` Redis connection prefix with `config('horizon.prefix')` (`laravel_horizon:` here). This does not explain the current healthy master.

Best root-cause hypothesis for the OLD failure:
- Most likely: an intermittent machine-basename/name mismatch at the time the old master (pid 92337) was running. This is the only code path that cleanly explains the exact observed split if `horizon:status` and `horizon:terminate` were run close together: `status` saw a fresh master record, while `terminate` filtered that same record out because the old master name did not start with the CLI process's current `MasterSupervisor::basename()`.
- Less likely but still possible if the observations were not atomic: the old master record/ZSET score briefly aged out of Horizon's 14s fresh window before `terminate` ran. That would make terminate's `all()` empty, but a later/fresh status should then say inactive unless the master resumed refreshing or another master record existed.
- Unlikely for the specific `No processes to terminate` message: Solo PID reparenting or POSIX-signal masking. If the PID was present but `posix_kill` failed, Horizon would print `Sending TERM signal to processes.` and then a failure for that PID, not `No processes to terminate.`

Recommended reliable-reload fix:
- Phase 2 build recommendation: add a project-local `flower:horizon-reload` command/procedure that does not depend on Horizon's current-machine basename filter. It should resolve the intended MAIN Horizon master by trusted runtime context (MAIN cwd/Solo process 967 and/or the single fresh `laravel_horizon:master:*` record), verify it maps to the expected OS pid and command (`php artisan horizon` under `/Users/mikeferrara/Documents/code/flower`), then send SIGTERM to that exact pid and report the before/after evidence. It must be read/verify-first and refuse ambiguous cases (multiple fresh masters, mismatched cwd/command, stale record, missing Solo process). This keeps the graceful signal semantics without relying on the fragile basename filter.
- Until that Phase 2 command exists, the current documented reload path should be treated as unreliable when it reports `No processes to terminate`. The sanctioned reliable fallback is the operator-approved Solo reload/restart of the Horizon command process 967, with explicit acknowledgement that it is less graceful and can interrupt in-flight jobs. Do not use `horizon:terminate` success as proof unless it prints `Sending TERM signal to processes.` and the Solo pid changes/restarts afterward.
- No live repro is possible now because the failing old master was restarted away; reproducing this safely would require a separate isolated Redis instance and isolated Horizon daemon, not a prefix on MAIN's shared Redis.
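
Illustrative sketch (not Horizon's code): the status/terminate split described in the findings above can be modelled in a few lines of Python. This is a simplified model under stated assumptions. The names `MasterRecord`, `basename`, `fresh_masters`, `status_message`, `terminate_targets` and `terminate_message` are hypothetical. The slug rule only approximates how a hostname becomes a master-name prefix. The `ALargePaperWeight.local` hostname is an invented example of drift, not an observed value. The 14s window and the three console messages are taken from the findings.

```python
import re
from dataclasses import dataclass

FRESH_WINDOW_SECONDS = 14  # freshness window reported in the findings


@dataclass
class MasterRecord:
    name: str         # e.g. "alargepaperweight-Bnuu"
    pid: int
    status: str       # "running" or "paused"
    last_seen: float  # ZSET score (unix seconds) of the last heartbeat


def basename(hostname: str) -> str:
    """Approximate the slugged hostname that master names are prefixed with."""
    slug = re.sub(r"[^a-z0-9\s-]+", "", hostname.lower())
    return re.sub(r"[\s-]+", "-", slug).strip("-")


def fresh_masters(records, now):
    """Keep only records whose heartbeat falls inside the fresh window."""
    return [r for r in records if now - r.last_seen <= FRESH_WINDOW_SECONDS]


def status_message(records, now):
    """Model of horizon:status: considers every fresh master, no name filter."""
    masters = fresh_masters(records, now)
    if not masters:
        return "Horizon is inactive."
    if any(m.status == "paused" for m in masters):
        return "Horizon is paused."
    return "Horizon is running."


def terminate_targets(records, now, cli_hostname):
    """Model of horizon:terminate: only masters whose name starts with the CLI's basename."""
    prefix = basename(cli_hostname)
    return [m.pid for m in fresh_masters(records, now) if m.name.startswith(prefix)]


def terminate_message(records, now, cli_hostname):
    """Message the terminate model would print for the filtered PID list."""
    if terminate_targets(records, now, cli_hostname):
        return "Sending TERM signal to processes."
    return "No processes to terminate."


if __name__ == "__main__":
    master = MasterRecord("alargepaperweight-Bnuu", 23668, "running", last_seen=100.0)
    print(status_message([master], now=105.0))
    print(terminate_message([master], 105.0, "ALargePaperWeight"))
    # Hypothetical hostname drift, not an observed value:
    print(terminate_message([master], 105.0, "ALargePaperWeight.local"))
```

In this model the same fresh record yields "running" from the status path and an empty target list from the terminate path as soon as the CLI's resolved basename stops being a prefix of the master's name, which matches the "most likely" hypothesis. Moving `now` past the 14s window instead makes both paths see nothing, which corresponds to the "less likely" aged-out case.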

agent · flower-foundation-127
participant joined 2d ago
system · flower-foundation-127
link added 2d ago
agent · system:brief-autolink
link added 2d ago
agent · system:brief-autolink
link added 2d ago
agent · system:brief-autolink
link added 2d ago
agent · system:brief-autolink
comment 2d ago

Target branch flower/127-horizon-reload-diag is merged to the default branch; suggest marking the brief complete.
system · system:brief-autolink
participant joined 2d ago
system · system:brief-autolink
dispatched 2d ago

Dispatch request #33 queued for flower.
agent · flower-orchestrator
status change 2d ago
agent · flower-orchestrator
status change 2d ago
agent · flower-orchestrator
note added 2d ago


agent · flower-orchestrator
participant joined 2d ago
system · flower-orchestrator

epic · dependencies

Relationships

epic parent

depends on

No dependencies — dispatchable once planned.

agents · waves

Participants

flower-orchestrator participant · active
system:brief-autolink participant · active
flower-foundation-127 participant · active

trace · graph

Projects

flower · primary

dogfood · read-only

Agent’s-eye view

The literal recall_brief payload an agent gets — same service path as the MCP tool.