Refactor summarize_all to run LLM summarization across a thread pool (default 8
workers) while keeping all SQLite reads/writes on the main thread (the single
connection is never shared across threads). Extract _summarize_transcript
(transcript -> gist, no DB) for the worker.
The MI50 proved far too slow for the large-transcript backfill (~29 summaries in
9h due to gfx906 prefill); on cloud gpt-4o-mini with concurrency this runs at
~30 summaries/minute (~17 min for the full backfill, ~$2). MI50 stays the chat
backend where small prompts make it snappy.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MI50 llama.cpp server OOM-killed (LXC RAM limit + 8GB prompt cache) mid-run,
and summarize_all had no error handling, so one APIConnectionError killed the
whole batch. Add retry-with-backoff around the summarization LLM call, and
try/except per session in summarize_all (log + skip; unsummarized sessions get
retried on the next run). (Server-side: CT202 RAM raised + prompt cache disabled.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Harden summarize_session to chunk + merge long sessions (imported convos can
exceed the local model's context), and add summarize_all: idempotent, resumable
batch that summarizes every session needing it (skips up-to-date ones), with
progress logged to the live log. `lyra-summarize [limit]` CLI.
This is the first consolidation stage feeding the profile (semantic memory) and
era-rollup tiers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Older sessions fade to a general idea; details stay retrievable.
- memory: summaries table (one compacted gist per session, embedded), plus
store_summary/get_summary/recall_summaries and unsummarized_count (tracks
exchanges newer than the current summary)
- lyra/summary.py: summarize_session compacts a session's raw turns into a
third-person gist (default SUMMARY_BACKEND=local, so compaction is free);
maybe_summarize re-summarizes once SUMMARIZE_AFTER new turns accumulate
- chat.build_messages now layers context in tiers: persona -> gists of other
sessions -> a few sharp raw cross-session details -> current session raw
turns -> new message; respond() compacts the session after each turn
- web: POST /sessions/{id}/summarize to compact on demand
- summarization activity surfaces in the live log
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>