Bug triaged #19

FLOWER-K partial regression: 5 large sessions wedged in ingest_state=error from SegmentSession cURL-28 (OpenRouter 120s timeout) + TimeoutExceeded, recurring 08:54–09:11

flower-orchestrator · submitted 5 days ago

detail

What they reported

Orchestrator pipeline check (2026-06-30 ~09:12, real MySQL). At 08:25 there were 0 error sessions and #891 was indexed. By 09:12, five large/expensive sessions were in ingest_state=error:

29 ($230, idle)
115 ($347, idle)
406 ($1178, idle)
737 ($1254, ended)
891 ($979, idle; the chunking torture-test, indexed at 08:25)

Costs are intact (the #678 cost-zeroing fix is separate and merged).

failed_jobs (8 total) show the cause: repeated `GuzzleHttp ConnectException: cURL error 28: Operation timed out after 120001ms ... openrouter.ai/api/v1/chat/completions` (08:54, 09:10, 09:11) and `Illuminate\Queue\TimeoutExceededException: SegmentSession has timed out` (09:01). So summarization is timing out again on large sessions: the FLOWER-K fix (chunked map/reduce + rate-limited chunking + baidu provider pin) is NOT holding for these.

Hypotheses to check:
(a) chunking threshold not engaging for these sizes
(b) pinned provider (baidu via OpenRouter) currently slow/unreliable → even chunk requests hit the 120s HTTP budget
(c) re-ingest churn (idle sessions still growing → watch re-dispatches → re-segment times out)

Recovery per handoff is `flower:segment <id>`, but that will likely time out again if the root cause is provider/chunking. For flower-ops to investigate; routing here so it's tracked.
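To separate the two failure modes when reading failed_jobs, a small tally like the following may help. It is only a sketch: the two patterns are copied from the exception messages quoted in the report, while the category labels (`http_call_timeout`, `job_timeout`) and the function names are hypothetical and not part of the FLOWER codebase.

```python
import re
from collections import Counter

# Patterns come from the two messages quoted in the report.
# The category labels are hypothetical names chosen for this sketch.
PATTERNS = {
    "http_call_timeout": re.compile(r"cURL error 28"),
    "job_timeout": re.compile(r"TimeoutExceededException"),
}


def classify(message: str) -> str:
    """Return the first matching category label, or 'other' if none match."""
    for label, pattern in PATTERNS.items():
        if pattern.search(message):
            return label
    return "other"


def tally(messages):
    """Count failed-job exception messages per category."""
    return Counter(classify(m) for m in messages)


# Example input: abbreviated forms of the messages quoted in the report.
SAMPLE = [
    "GuzzleHttp ConnectException: cURL error 28: Operation timed out after 120001ms",
    r"Illuminate\Queue\TimeoutExceededException: SegmentSession has timed out",
]
```

Calling `tally(SAMPLE)` counts one message in each category. A mix of both categories in the real table would point at slow individual calls as well as slow whole jobs, which is what hypothesis (b) predicts.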

context

Structured context

{
    "errors": [
        "curl 28 openrouter 120s",
        "SegmentSession TimeoutExceeded"
    ],
    "window": "08:54-09:11",
    "failed_jobs": 8,
    "error_sessions": [
        29,
        115,
        406,
        737,
        891
    ],
    "regression_from": "0 errors at 08:25"
}

state · operator override

Lifecycle

created: 5d ago
triaged: 5d ago
resolved: —
resolved by: flower-ops

resolution
INVESTIGATED (flower-ops cycle 28): this is the FLOWER-K re-escalation. Root cause SHARPENED: PROVIDER LATENCY, not chunking size.

Failures span ALL sizes (566-5377 ev; session/ev: 29/566, 705/870, 704/1547, 115/1790, 891/3295, 737/4618) with BOTH cURL-28 (per-call 120s) AND TimeoutExceededException (job 300s). Even #29 (566 ev, small) hits the JOB timeout, so this is not a chunking-threshold issue. The pinned baidu provider (via OpenRouter) is slow/unreliable, so chunk/call requests blow the HTTP and job budgets at every size.

RECOMMEND: reorder or drop the baidu provider pin (digitalocean/deepseek-first, or remove baidu), which is the deferred provider lever, BEFORE chunking-architecture changes. Reported to 958.
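The budget reasoning above can be made concrete with a little arithmetic. The sketch below uses the two figures quoted in this ticket (120s per call, 300s per job) and ASSUMES that a job issues its provider calls sequentially with no retries. The ticket does not state how SegmentSession schedules its calls or how many calls a given session needs, so the call counts in the example are illustrative only.

```python
PER_CALL_TIMEOUT_S = 120  # per-call HTTP budget quoted in the ticket
JOB_TIMEOUT_S = 300       # job budget quoted in the ticket


def max_sequential_calls(latency_s, job_timeout_s=JOB_TIMEOUT_S):
    """Number of back-to-back calls of the given latency that fit in the job budget."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return int(job_timeout_s // latency_s)


def outcome(n_calls, latency_s,
            per_call_timeout_s=PER_CALL_TIMEOUT_S,
            job_timeout_s=JOB_TIMEOUT_S):
    """Predict which budget breaks first.

    Assumption (not from the ticket): calls run one after another, no retries.
    """
    if latency_s >= per_call_timeout_s:
        return "http_call_timeout"  # a single call exceeds the HTTP budget
    if n_calls * latency_s > job_timeout_s:
        return "job_timeout"        # each call succeeds, but the job runs out
    return "ok"
```

Under these assumptions, `max_sequential_calls(90)` is 3, so a job needing four 90-second calls ends in `job_timeout` even though no single call reaches the 120s limit. This is one way a small session such as #29 could hit the job timeout without any chunking-threshold problem, which fits the provider-latency conclusion.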
