recall_health false-positive CRITICAL + broadcaster payload-too-large swallow (ops cycle 128)

hand-off · dispatch

Dispatch

This brief is complete — dispatch is closed.

provenance · append-only

Trace

live

link added 3d ago
agent · system:commit-trailer
link added 3d ago
agent · system:commit-trailer
participant joined 3d ago
system · system:commit-trailer
status change 3d ago
agent · flower-orchestrator
link added 3d ago
agent · flower-orchestrator
status change 3d ago
agent · flower-orchestrator
note added 3d ago

Two small robustness fixes flagged by ops cycle 128. Both bugs (autonomous). Independent files.

## Fix 1: recall_health false-positive CRITICAL on ingest-freshness (feedback #45)

An EXTERNAL claude session (different project) got recall_health severity=CRITICAL from a ~7-min ingest gap ("Last ingest 7m ago; flower daemons may be down") during a quiet 3am window, but daemons were UP + actively processing; there were simply no NEW sessions to ingest. The ingest_freshness CRITICAL threshold is too tight AND conflates "no new sessions" with "daemons down" → a false-positive that erodes trust in the health signal for every agent that calls recall_health.

- Find the health logic (RecallService / a HealthService / the recall_health tool).
- (a) LOOSEN the ingest_freshness CRITICAL threshold to a sane, config-overridable default (7min is far too tight for a quiet window; e.g. 30–60min).
- (b) GATE the "daemons may be down" escalation on ACTUAL daemon liveness: check the daemon roster (daemon_agents heartbeat / recall_roster liveness) instead of last-ingest-age alone. If daemons are heartbeating live, an ingest gap is "quiet / no new sessions" (ok/info), NOT critical. Only escalate to CRITICAL when the roster shows daemons actually stale/dead.
- Tests: live-daemons + ingest-gap → NOT critical; stale/dead daemons + gap → critical; threshold config respected.

## Fix 2: BestEffortPusherBroadcaster re-throws payload-too-large (FLOWER-1G)

BestEffortPusherBroadcaster::broadcast swallows ConnectException (~line 43) but RE-THROWS ApiErrorException (~line 58), so "Pusher error: Payload too large" (an oversized giant-session broadcast) hits Sentry instead of degrading gracefully. Broadcasts are best-effort by design.

- Fix: also swallow the ApiErrorException / payload-too-large path (log a warning, don't throw).
- Optional: cap/skip oversized broadcast payloads before sending.
- Test: an ApiErrorException from the underlying broadcaster is swallowed (no throw, logged).
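The swallow-and-log behaviour asked for in Fix 2 can be sketched as below. This is a language-neutral illustration in Python, not the project's code: `ConnectError` and `ApiError` are hypothetical stand-ins for the ConnectException and ApiErrorException named in the brief, `MAX_PAYLOAD_BYTES` is an assumed cap rather than a known Pusher limit, and the boolean return value exists only to make the sketch easy to test.

```python
from __future__ import annotations

import json
import logging
from typing import Any, Callable

logger = logging.getLogger("broadcast")


class ConnectError(Exception):
    """Hypothetical stand-in for the transport failure (ConnectException)."""


class ApiError(Exception):
    """Hypothetical stand-in for an API rejection (ApiErrorException)."""


# Assumed cap for illustration; the real limit depends on the Pusher setup.
MAX_PAYLOAD_BYTES = 10_000

InnerBroadcast = Callable[[list, str, dict], None]


class BestEffortBroadcaster:
    """Wraps an inner broadcast callable so delivery failures never escape."""

    def __init__(self, inner: InnerBroadcast,
                 max_payload_bytes: int = MAX_PAYLOAD_BYTES) -> None:
        self._inner = inner
        self._max_payload_bytes = max_payload_bytes

    def broadcast(self, channels: list, event: str, payload: dict[str, Any]) -> bool:
        """Return True if the inner broadcaster accepted the event, else False."""
        # Optional guard: skip payloads that would be rejected as too large.
        size = len(json.dumps(payload).encode("utf-8"))
        if size > self._max_payload_bytes:
            logger.warning("skipping oversized broadcast %s (%d bytes)", event, size)
            return False
        try:
            self._inner(channels, event, payload)
        except (ConnectError, ApiError) as exc:
            # Best-effort by design: log a warning and degrade, do not re-throw.
            logger.warning("broadcast %s dropped: %s", event, exc)
            return False
        return True
```

Only the two delivery-failure types are caught, so an unrelated error (a genuine bug in the inner broadcaster) still propagates and stays visible.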
## Acceptance

- recall_health returns no CRITICAL for an ingest gap while daemons are live (roster-liveness-aware) + threshold loosened + config-overridable; tests cover both directions.
- BestEffortPusherBroadcaster swallows payload-too-large (no re-throw / no Sentry); test.
- Suite green; pint. `Brief:` trailer with this id. No Horizon reload (recall_health is request-time; broadcaster is a class used by events; no job code changed).

## Guardrails

sqlite tests only; no live DB writes from the worktree.

## Provenance

ops cycle 128: feedback #45 (recall_health false-positive, external dap-fos-enrichment session, 03:07Z) + FLOWER-1G (kv sentry:triaged:flower-1g). Both bugs → autonomous per operator authority rule.
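The roster-liveness-aware recall_health criterion can be sketched as a small decision function. This is a language-neutral illustration in Python under stated assumptions, not the project's health logic: `HealthConfig`, `ingest_freshness`, the 45-minute gap default (picked from the 30–60min range in the brief), the 5-minute heartbeat staleness window, and the "any one live daemon counts as live" rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Iterable, Tuple


class Severity(Enum):
    OK = "ok"
    INFO = "info"
    CRITICAL = "critical"


@dataclass(frozen=True)
class HealthConfig:
    # Hypothetical defaults; both are meant to be overridable from config.
    ingest_gap_threshold: timedelta = timedelta(minutes=45)
    heartbeat_stale_after: timedelta = timedelta(minutes=5)


def ingest_freshness(
    now: datetime,
    last_ingest: datetime,
    daemon_heartbeats: Iterable[datetime],
    config: HealthConfig = HealthConfig(),
) -> Tuple[Severity, str]:
    """Classify an ingest gap, escalating only when no daemon is live."""
    gap = now - last_ingest
    if gap <= config.ingest_gap_threshold:
        return Severity.OK, "ingest is fresh"

    # Assumption: one recent heartbeat is enough to call the roster live.
    any_live = any(
        now - beat <= config.heartbeat_stale_after for beat in daemon_heartbeats
    )
    if any_live:
        # Daemons are up, so the gap means there was nothing new to ingest.
        return Severity.INFO, "quiet: no new sessions, daemons are heartbeating"

    # Gap exceeded and every heartbeat is stale (or the roster is empty).
    return Severity.CRITICAL, "ingest gap with no live daemons; daemons may be down"
```

The three test directions in the brief map onto the three return paths: a gap with live daemons yields INFO, a gap with stale or missing heartbeats yields CRITICAL, and passing a different `HealthConfig` moves the threshold.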

agent · flower-orchestrator
participant joined 3d ago
system · flower-orchestrator

epic · dependencies

Relationships

epic parent

depends on

No dependencies — dispatchable once planned.

agents · waves

Participants

flower-orchestrator participant · active
system:commit-trailer participant · active

trace · graph

Projects

flower · primary

dogfood · read-only

Agent’s-eye view

The literal recall_brief payload an agent gets — same service path as the MCP tool.