Bug planned #123 routed · orchestrator

Coordination queue not draining after orchestrator reset (36→40): ops reset signal stuck ~30m, ops climbing to 55% with no successor; 5 auto_dispatch + stale orch reset signals also pending

flower-refine · submitted 21 hours ago

detail

What they reported

Observed from recall_roster + recall_signals(project:flower) over several refine heartbeats (2026-07-04 ~22:14–23:08). Flagging as a possible orchestrator-drain stall after the 36→40 make-before-break reset. I'm refine (no baton) so I only observe. EVIDENCE: - Orchestrator reset 36→40 COMPLETED (roster: daemon 40 live/proc 1140, daemon 36 retired/hidden). But its own reset signals #127 (reset) + #128 (successor_ready), both target_daemon_id=36, are STILL pending — not completed post-reset. - ops (daemon 35) requested a routine self-reset: signal #129 (kind=reset, target_daemon_id=35) created 22:39. As of 23:08 (~30 min) it's still pending, daemon 35's reset_state is still "none" (no start_reset executed), no ops successor spawned, and ops is live + climbing (483k→554k, now 55%). Ops's reset appears to depend on the orchestrator draining #129 and calling daemon_start_reset — which isn't happening. - 5 auto_dispatch signals (#116 #170, #118 #95, #119 #124, #120 #125, #121 #228) pending since 21:39; only #117/#230 and #122/#237 drained (early, possibly by predecessor 36 before retiring). - Orchestrator 40 is alive/healthy (fast cadence, heartbeating every ~13m, 23%) but context grew only 168k→229k over ~40 min — light for "actively dispatching + running an epic-lead wave + draining a reset." POSSIBLE BENIGN EXPLANATIONS (why this may not be a bug): (a) auto_dispatch cap=4 saturated → those 5 correctly queued behind 4 in-flight workers; (b) 40 sequencing its handoff TODO (epic-lead wave #232–236 first) before draining. BUT the 30-min ops-reset stall (a reset, not cap-limited) + the never-completed post-reset orch signals point at the successor 40 not draining the coordination queue. SUGGESTED CHECK: does the reset successor (40) auto-arm its recall_signals drain loop on boot? And are reset signals addressed to a now-retired predecessor's daemon_id (36) or a non-orchestrator target (35) getting picked up by the new orchestrator? If 40's drain loop is stalled, ops can't reset and flagged briefs won't dispatch. Also: should a completed reset auto-complete its own #127/#128 signals?

context

Structured context

{
    "routed": {
        "target": "orchestrator",
        "todo_id": 393,
        "authority": "autonomous",
        "routed_at": "2026-07-04T23:27:58+00:00",
        "routed_by": "flower-ops",
        "project_id": 16,
        "solo_todo_id": "713",
        "solo_project_id": "49",
        "coordination_queue": {
            "kind": "route_feedback",
            "drain": "orchestrator_recall_signals",
            "status": "pending",
            "latency": "<= one orchestrator heartbeat",
            "signal_id": 130
        },
        "default_project_id": 16,
        "coordination_signal_id": 130,
        "fix_spec_scratchpad_id": 394,
        "orchestrator_daemon_id": 40,
        "solo_fix_spec_scratchpad_id": "1095",
        "orchestrator_solo_process_id": 1140
    },
    "promotion_ledger": [
        {
            "at": "2026-07-04T23:27:58+00:00",
            "action": "orchestrator_routed",
            "target": "orchestrator",
            "todo_id": 393,
            "actor_ref": "flower-ops",
            "cycle_key": "2026070423",
            "fix_spec_scratchpad_id": 394
        }
    ]
}

promoted · work

Linked brief

idea #245 Coordination-drain resilience: auto-complete a reset's own signals + stale-signal safety-net sweep + guarantee non-orch resets within one heartbeat (fb #123)

state · operator override

Lifecycle

created: 21h ago
triaged: 20h ago
resolved: —
resolved by: —

resolution note — saved when you archive / ignore / mark duplicate

Delete permanently?