docs updated for v0.5.1

This commit is contained in:
serversdwn
2025-12-11 03:49:23 -05:00
parent e45cdbe54e
commit d5d7ea3469
2 changed files with 1227 additions and 157 deletions

PROJECT_SUMMARY.md

@@ -1,71 +1,925 @@
# Project Lyra — Comprehensive AI Context Summary
**Version:** v0.5.1 (2025-12-11)
**Status:** Production-ready modular AI companion system
**Purpose:** Memory-backed conversational AI with multi-stage reasoning, persistent context, and modular LLM backend architecture
---
## Executive Summary
Project Lyra is a **self-hosted AI companion system** designed to overcome the limitations of typical chatbots by providing:
- **Persistent long-term memory** (NeoMem: PostgreSQL + Neo4j graph storage)
- **Multi-stage reasoning pipeline** (Cortex: reflection → reasoning → refinement → persona)
- **Short-term context management** (Intake: session-based summarization embedded in Cortex)
- **Flexible LLM backend routing** (supports llama.cpp, Ollama, OpenAI, custom endpoints)
- **OpenAI-compatible API** (drop-in replacement for chat applications)
**Core Philosophy:** Like a human brain has different regions for different functions, Lyra has specialized modules that work together. She's not just a chatbot—she's a notepad, schedule, database, co-creator, and collaborator with her own executive function.
---
## Quick Context for AI Assistants
If you're an AI being given this project to work on, here's what you need to know:
### What This Project Does
Lyra is a conversational AI system that **remembers everything** across sessions. When a user says something in passing, Lyra stores it, contextualizes it, and can recall it later. She can:
- Track project progress over time
- Remember user preferences and past conversations
- Reason through complex questions using multiple LLM calls
- Apply a consistent personality across all interactions
- Integrate with multiple LLM backends (local and cloud)
### Current Architecture (v0.5.1)
```
User → Relay (Express/Node.js, port 7078)
         ↓
       Cortex (FastAPI/Python, port 7081)
         ├─ Intake module (embedded, in-memory SESSIONS)
         ├─ 4-stage reasoning pipeline
         └─ Multi-backend LLM router
         ↓
       NeoMem (FastAPI/Python, port 7077)
         ├─ PostgreSQL (vector storage)
         └─ Neo4j (graph relationships)
```
### Key Files You'll Work With
**Backend Services:**
- [cortex/router.py](cortex/router.py) - Main Cortex routing logic (306 lines, `/reason`, `/ingest` endpoints)
- [cortex/intake/intake.py](cortex/intake/intake.py) - Short-term memory module (367 lines, SESSIONS management)
- [cortex/reasoning/reasoning.py](cortex/reasoning/reasoning.py) - Draft answer generation
- [cortex/reasoning/refine.py](cortex/reasoning/refine.py) - Answer refinement
- [cortex/reasoning/reflection.py](cortex/reasoning/reflection.py) - Meta-awareness notes
- [cortex/persona/speak.py](cortex/persona/speak.py) - Personality layer
- [cortex/llm/llm_router.py](cortex/llm/llm_router.py) - LLM backend selector
- [core/relay/server.js](core/relay/server.js) - Main orchestrator (Node.js)
- [neomem/main.py](neomem/main.py) - Long-term memory API
**Configuration:**
- [.env](.env) - Root environment variables (LLM backends, databases, API keys)
- [cortex/.env](cortex/.env) - Cortex-specific overrides
- [docker-compose.yml](docker-compose.yml) - Service definitions (152 lines)
**Documentation:**
- [CHANGELOG.md](CHANGELOG.md) - Complete version history (836 lines, chronological format)
- [README.md](README.md) - User-facing documentation (610 lines)
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - This file
### Recent Critical Fixes (v0.5.1)
The most recent work fixed a critical bug where Intake's SESSIONS buffer wasn't persisting:
1. **Fixed**: `bg_summarize()` was only a TYPE_CHECKING stub → implemented as logging stub
2. **Fixed**: `/ingest` endpoint had unreachable code → removed early return, added lenient error handling
3. **Added**: `cortex/intake/__init__.py` → proper Python package structure
4. **Added**: Diagnostic endpoints `/debug/sessions` and `/debug/summary` for troubleshooting
**Key Insight**: Intake is no longer a standalone service—it's embedded in Cortex as a Python module. SESSIONS must persist in a single Uvicorn worker (no multi-worker support without Redis).
---
## Architecture Deep Dive
### Service Topology (Docker Compose)
**Active Containers:**
1. **relay** (Node.js/Express, port 7078)
- Entry point for all user requests
- OpenAI-compatible `/v1/chat/completions` endpoint
- Routes to Cortex for reasoning
- Async calls to Cortex `/ingest` after response
2. **cortex** (Python/FastAPI, port 7081)
- Multi-stage reasoning pipeline
- Embedded Intake module (no HTTP, direct Python imports)
- Endpoints: `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary`
3. **neomem-api** (Python/FastAPI, port 7077)
- Long-term memory storage
- Fork of Mem0 OSS (fully local, no external SDK)
- Endpoints: `/memories`, `/search`, `/health`
4. **neomem-postgres** (PostgreSQL + pgvector, port 5432)
- Vector embeddings storage
- Memory history records
5. **neomem-neo4j** (Neo4j, ports 7474/7687)
- Graph relationships between memories
- Entity extraction and linking
**Disabled Services:**
- `intake` - No longer needed (embedded in Cortex as of v0.5.1)
- `rag` - Beta Lyrae RAG service (planned re-enablement)
### External LLM Backends (HTTP APIs)
**PRIMARY Backend** - llama.cpp @ `http://10.0.0.44:8080`
- AMD MI50 GPU-accelerated inference
- Model: `/model` (path-based routing)
- Used for: Reasoning, refinement, summarization
**SECONDARY Backend** - Ollama @ `http://10.0.0.3:11434`
- RTX 3090 GPU-accelerated inference
- Model: `qwen2.5:7b-instruct-q4_K_M`
- Used for: Configurable per-module
**CLOUD Backend** - OpenAI @ `https://api.openai.com/v1`
- Cloud-based inference
- Model: `gpt-4o-mini`
- Used for: Reflection, persona layers
**FALLBACK Backend** - Local @ `http://10.0.0.41:11435`
- CPU-based inference
- Model: `llama-3.2-8b-instruct`
- Used for: Emergency fallback
### Data Flow (Request Lifecycle)
```
1. User sends message → Relay (/v1/chat/completions)
2. Relay → Cortex (/reason)
3. Cortex calls Intake module (internal Python)
- Intake.summarize_context(session_id, exchanges)
- Returns L1/L5/L10/L20/L30 summaries
4. Cortex 4-stage pipeline:
a. reflection.py → Meta-awareness notes (CLOUD backend)
- "What is the user really asking?"
- Returns JSON: {"notes": [...]}
b. reasoning.py → Draft answer (PRIMARY backend)
- Uses context from Intake
- Integrates reflection notes
- Returns draft text
c. refine.py → Refined answer (PRIMARY backend)
- Polishes draft for clarity
- Ensures factual consistency
- Returns refined text
d. speak.py → Persona layer (CLOUD backend)
- Applies Lyra's personality
- Natural, conversational tone
- Returns final answer
5. Cortex → Relay (returns persona answer)
6. Relay → Cortex (/ingest) [async, non-blocking]
- Sends (session_id, user_msg, assistant_msg)
- Cortex calls add_exchange_internal()
- Appends to SESSIONS[session_id]["buffer"]
7. Relay → User (returns final response)
8. [Planned] Relay → NeoMem (/memories) [async]
- Store conversation in long-term memory
```
### Intake Module Architecture (v0.5.1)
**Location:** `cortex/intake/`
**Key Change:** Intake is now **embedded in Cortex** as a Python module, not a standalone service.
**Import Pattern:**
```python
from intake.intake import add_exchange_internal, SESSIONS, summarize_context
```
**Core Data Structure:**
```python
SESSIONS: dict[str, dict] = {}
# Structure:
SESSIONS[session_id] = {
    "buffer": deque(maxlen=200),   # Circular buffer of exchanges
    "created_at": datetime
}
# Each exchange in buffer:
{
    "session_id": "...",
    "user_msg": "...",
    "assistant_msg": "...",
    "timestamp": "2025-12-11T..."
```
**Functions:**
1. **`add_exchange_internal(exchange: dict)`**
- Adds exchange to SESSIONS buffer
- Creates new session if needed
- Calls `bg_summarize()` stub
- Returns `{"ok": True, "session_id": "..."}`
2. **`summarize_context(session_id: str, exchanges: list[dict])`** [async]
- Generates L1/L5/L10/L20/L30 summaries via LLM
- Called during `/reason` endpoint
- Returns multi-level summary dict
3. **`bg_summarize(session_id: str)`**
- **Stub function** - logs only, no actual work
- Defers summarization to `/reason` call
- Exists to prevent NameError
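The sketch below shows how these three pieces fit together. It is illustrative only — the real `cortex/intake/intake.py` is ~367 lines and backs `summarize_context()` with LLM calls — but the names and the SESSIONS shape follow the description above.
```python
# Minimal sketch of the Intake module surface described above (not the actual intake.py).
from collections import deque
from datetime import datetime, timezone

SESSIONS: dict[str, dict] = {}  # module-level singleton; requires a single Uvicorn worker

def add_exchange_internal(exchange: dict) -> dict:
    """Append one user/assistant exchange to the session's circular buffer."""
    session_id = exchange.get("session_id", "default")
    session = SESSIONS.setdefault(
        session_id,
        {"buffer": deque(maxlen=200), "created_at": datetime.now(timezone.utc)},
    )
    session["buffer"].append(
        {
            "session_id": session_id,
            "user_msg": exchange.get("user_msg", ""),
            "assistant_msg": exchange.get("assistant_msg", ""),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    bg_summarize(session_id)  # logging stub; real summarization happens during /reason
    return {"ok": True, "session_id": session_id}

def bg_summarize(session_id: str) -> None:
    """Stub: log only. Summarization is deferred to summarize_context() during /reason."""
    print(f"[intake] bg_summarize deferred for session={session_id} "
          f"(buffer={len(SESSIONS[session_id]['buffer'])}, SESSIONS id={id(SESSIONS)})")
```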
**Critical Constraint:** SESSIONS is a module-level global dict. This requires **single-worker Uvicorn** mode. Multi-worker deployments need Redis or shared storage.
**Diagnostic Endpoints:**
- `GET /debug/sessions` - Inspect all SESSIONS (object ID, buffer sizes, recent exchanges)
- `GET /debug/summary?session_id=X` - Test summarization for a session
---
## Environment Configuration
### LLM Backend Registry (Multi-Backend Strategy)
**Root `.env` defines all backend OPTIONS:**
```bash
# PRIMARY Backend (llama.cpp)
LLM_PRIMARY_PROVIDER=llama.cpp
LLM_PRIMARY_URL=http://10.0.0.44:8080
LLM_PRIMARY_MODEL=/model
# SECONDARY Backend (Ollama)
LLM_SECONDARY_PROVIDER=ollama
LLM_SECONDARY_URL=http://10.0.0.3:11434
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
# CLOUD Backend (OpenAI)
LLM_OPENAI_PROVIDER=openai
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-proj-...
# FALLBACK Backend
LLM_FALLBACK_PROVIDER=openai_completions
LLM_FALLBACK_URL=http://10.0.0.41:11435
LLM_FALLBACK_MODEL=llama-3.2-8b-instruct
```
**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY # Cortex uses Ollama
INTAKE_LLM=PRIMARY # Intake uses llama.cpp
SPEAK_LLM=OPENAI # Persona uses OpenAI
NEOMEM_LLM=PRIMARY # NeoMem uses llama.cpp
UI_LLM=OPENAI # UI uses OpenAI
RELAY_LLM=PRIMARY # Relay uses llama.cpp
```
**Philosophy:** Root `.env` provides all backend OPTIONS. Each service chooses which backend to USE via `{MODULE}_LLM` variable. This eliminates URL duplication while preserving flexibility.
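As a rough illustration of that philosophy, a module could resolve its backend from the `{MODULE}_LLM` convention as in the sketch below; the actual selection logic lives in `cortex/llm/llm_router.py` and may differ in detail.
```python
# Sketch of env-driven backend selection for the {MODULE}_LLM convention described above.
import os

def resolve_backend(module: str) -> dict:
    """Map e.g. CORTEX_LLM=SECONDARY onto the LLM_SECONDARY_* variables."""
    choice = os.getenv(f"{module.upper()}_LLM", "PRIMARY").upper()  # PRIMARY | SECONDARY | OPENAI | FALLBACK
    prefix = f"LLM_{choice}"
    return {
        "name": choice,
        "provider": os.getenv(f"{prefix}_PROVIDER", ""),
        "url": os.getenv(f"{prefix}_URL", ""),
        "model": os.getenv(f"{prefix}_MODEL", ""),
    }

# Example: with CORTEX_LLM=SECONDARY set, this returns the Ollama URL and model.
print(resolve_backend("cortex"))
```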
### Database Configuration
```bash
# PostgreSQL (vector storage)
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
# Neo4j (graph storage)
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
### Service URLs (Docker Internal Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```
### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
---
## Code Structure Overview
### Cortex Service (`cortex/`)
**Main Files:**
- `main.py` - FastAPI app initialization
- `router.py` - Route definitions (`/reason`, `/ingest`, `/health`, `/debug/*`)
- `context.py` - Context aggregation (Intake summaries, session state)
**Reasoning Pipeline (`reasoning/`):**
- `reflection.py` - Meta-awareness notes (Cloud LLM)
- `reasoning.py` - Draft answer generation (Primary LLM)
- `refine.py` - Answer refinement (Primary LLM)
**Persona Layer (`persona/`):**
- `speak.py` - Personality application (Cloud LLM)
- `identity.py` - Persona loader
**Intake Module (`intake/`):**
- `__init__.py` - Package exports (SESSIONS, add_exchange_internal, summarize_context)
- `intake.py` - Core logic (367 lines)
- SESSIONS dictionary
- add_exchange_internal()
- summarize_context()
- bg_summarize() stub
**LLM Integration (`llm/`):**
- `llm_router.py` - Backend selector and HTTP client
- call_llm() function
- Environment-based routing
- Payload formatting per backend type
**Utilities (`utils/`):**
- Helper functions for common operations
**Configuration:**
- `Dockerfile` - Single-worker constraint documented
- `requirements.txt` - Python dependencies
- `.env` - Service-specific overrides
### Relay Service (`core/relay/`)
**Main Files:**
- `server.js` - Express.js server (Node.js)
- `/v1/chat/completions` - OpenAI-compatible endpoint
- `/chat` - Internal endpoint
- `/_health` - Health check
- `package.json` - Node.js dependencies
**Key Logic:**
- Receives user messages
- Routes to Cortex `/reason`
- Async calls to Cortex `/ingest` after response
- Returns final answer to user
### NeoMem Service (`neomem/`)
**Main Files:**
- `main.py` - FastAPI app (memory API)
- `memory.py` - Memory management logic
- `embedder.py` - Embedding generation
- `graph.py` - Neo4j graph operations
- `Dockerfile` - Container definition
- `requirements.txt` - Python dependencies
**API Endpoints:**
- `POST /memories` - Add new memory
- `POST /search` - Semantic search
- `GET /health` - Service health
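For reference, a minimal Python client for these endpoints might look like the sketch below; it assumes the `httpx` library and the request shapes listed in the API reference later in this document.
```python
# Sketch of a NeoMem client using the request shapes from the API reference below.
import httpx

NEOMEM_API = "http://neomem-api:7077"  # or http://localhost:7077 from the host

def add_memory(user_id: str, user_msg: str, assistant_msg: str) -> dict:
    payload = {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ],
        "user_id": user_id,
        "metadata": {},
    }
    return httpx.post(f"{NEOMEM_API}/memories", json=payload, timeout=30).json()

def search_memories(user_id: str, query: str, limit: int = 10) -> dict:
    payload = {"query": query, "user_id": user_id, "limit": limit}
    return httpx.post(f"{NEOMEM_API}/search", json=payload, timeout=30).json()
```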
---
## Common Development Tasks
### Adding a New Endpoint to Cortex
**Example: Add `/debug/buffer` endpoint**
1. **Edit `cortex/router.py`:**
```python
@cortex_router.get("/debug/buffer")
async def debug_buffer(session_id: str, limit: int = 10):
    """Return last N exchanges from a session buffer."""
    from intake.intake import SESSIONS
    session = SESSIONS.get(session_id)
    if not session:
        return {"error": "session not found", "session_id": session_id}
    buffer = session["buffer"]
    recent = list(buffer)[-limit:]
    return {
        "session_id": session_id,
        "total_exchanges": len(buffer),
        "recent_exchanges": recent,
    }
```
2. **Restart Cortex:**
```bash
docker-compose restart cortex
```
3. **Test:**
```bash
curl "http://localhost:7081/debug/buffer?session_id=test&limit=5"
```
### Modifying LLM Backend for a Module
**Example: Switch Cortex to use PRIMARY backend**
1. **Edit `.env`:**
```bash
CORTEX_LLM=PRIMARY # Change from SECONDARY to PRIMARY
```
2. **Restart Cortex:**
```bash
docker-compose restart cortex
```
3. **Verify in logs:**
```bash
docker logs cortex | grep "Backend"
```
### Adding Diagnostic Logging
**Example: Log every exchange addition**
1. **Edit `cortex/intake/intake.py`:**
```python
def add_exchange_internal(exchange: dict):
    session_id = exchange.get("session_id")
    # Add detailed logging
    print(f"[DEBUG] Adding exchange to {session_id}")
    print(f"[DEBUG] User msg: {exchange.get('user_msg', '')[:100]}")
    print(f"[DEBUG] Assistant msg: {exchange.get('assistant_msg', '')[:100]}")
    # ... rest of function
```
2. **View logs:**
```bash
docker logs cortex -f | grep DEBUG
```
---
## Debugging Guide
### Problem: SESSIONS Not Persisting
**Symptoms:**
- `/debug/sessions` shows empty or only 1 exchange
- Summaries always return empty
- Buffer size doesn't increase
**Diagnosis Steps:**
1. Check Cortex logs for SESSIONS object ID:
```bash
docker logs cortex | grep "SESSIONS object id"
```
- Should show same ID across all calls
- If IDs differ → module reloading issue
2. Verify single-worker mode:
```bash
docker exec cortex cat Dockerfile | grep uvicorn
```
- Should either omit the `--workers` flag entirely or use `--workers 1` (never more than one worker)
3. Check `/debug/sessions` endpoint:
```bash
curl http://localhost:7081/debug/sessions | jq
```
- Should show sessions_object_id and current sessions
4. Inspect `__init__.py` exists:
```bash
docker exec cortex ls -la intake/__init__.py
```
**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists with proper exports
- Verify `bg_summarize()` is implemented (not just TYPE_CHECKING stub)
- Check `/ingest` endpoint doesn't have early return
- Rebuild Cortex container: `docker-compose build cortex && docker-compose restart cortex`
### Problem: LLM Backend Timeout
**Symptoms:**
- Cortex `/reason` hangs
- 504 Gateway Timeout errors
- Logs show "waiting for LLM response"
**Diagnosis Steps:**
1. Test backend directly:
```bash
# llama.cpp
curl http://10.0.0.44:8080/health
# Ollama
curl http://10.0.0.3:11434/api/tags
# OpenAI
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
```
2. Check network connectivity:
```bash
docker exec cortex ping -c 3 10.0.0.44
```
3. Review Cortex logs:
```bash
docker logs cortex -f | grep "LLM"
```
**Solutions:**
- Verify backend URL in `.env` is correct and accessible
- Check firewall rules for backend ports
- Increase timeout in `cortex/llm/llm_router.py`
- Switch to different backend temporarily: `CORTEX_LLM=CLOUD`
### Problem: Docker Compose Won't Start
**Symptoms:**
- `docker-compose up -d` fails
- Container exits immediately
- "port already in use" errors
**Diagnosis Steps:**
1. Check port conflicts:
```bash
netstat -tulpn | grep -E '7078|7081|7077|5432'
```
2. Check container logs:
```bash
docker-compose logs --tail=50
```
3. Verify environment file:
```bash
cat .env | grep -v "^#" | grep -v "^$"
```
**Solutions:**
- Stop conflicting services: `docker-compose down`
- Check `.env` syntax (no quotes unless necessary)
- Rebuild containers: `docker-compose build --no-cache`
- Check Docker daemon: `systemctl status docker`
---
## Testing Checklist
### After Making Changes to Cortex
**1. Build and restart:**
```bash
docker-compose build cortex
docker-compose restart cortex
```
**2. Verify service health:**
```bash
curl http://localhost:7081/health
```
**3. Test /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
```
**4. Verify SESSIONS updated:**
```bash
curl http://localhost:7081/debug/sessions | jq '.sessions.test.buffer_size'
```
- Should show 1 (or increment if already populated)
**5. Test summarization:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test" | jq '.summary'
```
- Should return L1/L5/L10/L20/L30 summaries
**6. Test full pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Test message"}],
"session_id": "test"
}' | jq '.choices[0].message.content'
```
**7. Check logs for errors:**
```bash
docker logs cortex --tail=50
```
---
## Project History & Context
### Evolution Timeline
**v0.1.x (2025-09-23 to 2025-09-25)**
- Initial MVP: Relay + Mem0 + Ollama
- Basic memory storage and retrieval
- Simple UI with session support
**v0.2.x (2025-09-24 to 2025-09-30)**
- Migrated to mem0ai SDK
- Added sessionId support
- Created standalone Lyra-Mem0 stack
**v0.3.x (2025-09-26 to 2025-10-28)**
- Forked Mem0 → NVGRAM → NeoMem
- Added salience filtering
- Integrated Cortex reasoning VM
- Built RAG system (Beta Lyrae)
- Established multi-backend LLM support
**v0.4.x (2025-11-05 to 2025-11-13)**
- Major architectural rewire
- Implemented 4-stage reasoning pipeline
- Added reflection, refinement stages
- RAG integration
- LLM router with per-stage backend selection
**Infrastructure v1.0.0 (2025-11-26)**
- Consolidated 9 `.env` files into single source of truth
- Multi-backend LLM strategy
- Docker Compose consolidation
- Created security templates
**v0.5.0 (2025-11-28)**
- Fixed all critical API wiring issues
- Added OpenAI-compatible Relay endpoint
- Fixed Cortex → Intake integration
- End-to-end flow verification
**v0.5.1 (2025-12-11) - CURRENT**
- **Critical fix**: SESSIONS persistence bug
- Implemented `bg_summarize()` stub
- Fixed `/ingest` unreachable code
- Added `cortex/intake/__init__.py`
- Embedded Intake in Cortex (no longer standalone)
- Added diagnostic endpoints
- Lenient error handling
- Documented single-worker constraint
### Architectural Philosophy
**Modular Design:**
- Each service has a single, clear responsibility
- Services communicate via well-defined HTTP APIs
- Configuration is centralized but allows per-service overrides
**Local-First:**
- No reliance on external services (except optional OpenAI)
- All data stored locally (PostgreSQL + Neo4j)
- Can run entirely air-gapped with local LLMs
**Flexible LLM Backend:**
- Not tied to any single LLM provider
- Can mix local and cloud models
- Per-stage backend selection for optimal performance/cost
**Error Handling:**
- Lenient mode: Never fail the chat pipeline
- Log errors but continue processing
- Graceful degradation
**Observability:**
- Diagnostic endpoints for debugging
- Verbose logging mode
- Object ID tracking for singleton verification
---
## Known Issues & Limitations
### Fixed in v0.5.1
- ✅ Intake SESSIONS not persisting → **FIXED**
- ✅ `bg_summarize()` NameError → **FIXED**
- ✅ `/ingest` endpoint unreachable code → **FIXED**
### Current Limitations
**1. Single-Worker Constraint**
- Cortex must run with single Uvicorn worker
- SESSIONS is in-memory module-level global
- Multi-worker support requires Redis or shared storage
- Documented in `cortex/Dockerfile` lines 7-8
**2. NeoMem Integration Incomplete**
- Relay doesn't yet push to NeoMem after responses
- Memory storage planned for v0.5.2
- Currently all memory is short-term (SESSIONS only)
**3. RAG Service Disabled**
- Beta Lyrae (RAG) commented out in docker-compose.yml
- Awaiting re-enablement after Intake stabilization
- Code exists but not currently integrated
**4. Session Management**
- No session cleanup/expiration (one possible approach is sketched after this list)
- SESSIONS grows without bound (each buffer is capped at maxlen=200, but the number of sessions is unlimited)
- No session list endpoint in Relay
**5. Persona Integration**
- `PERSONA_ENABLED=false` in `.env`
- Persona Sidecar not fully wired
- Identity loaded but not consistently applied
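One possible direction for the session-management gap noted above is a simple TTL sweep over SESSIONS. This is an assumption, not current behavior — nothing like it exists as of v0.5.1, and the field names follow the SESSIONS layout documented earlier.
```python
# Hypothetical cleanup helper (not implemented): drop sessions older than a TTL.
from datetime import datetime, timedelta, timezone

def prune_sessions(sessions: dict, max_age_hours: int = 24) -> int:
    """Remove sessions whose created_at is older than max_age_hours; return count removed."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    stale = [sid for sid, s in sessions.items() if s.get("created_at", cutoff) < cutoff]
    for sid in stale:
        del sessions[sid]
    return len(stale)
```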
### Future Enhancements
**Short-term (v0.5.2):**
- Enable NeoMem integration in Relay
- Add session cleanup/expiration
- Session list endpoint
- NeoMem health monitoring
**Medium-term (v0.6.x):**
- Re-enable RAG service
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs
- Comprehensive health checks
**Long-term (v0.7.x+):**
- Persona Sidecar full integration
- Autonomous "dream" cycles (self-reflection)
- Verifier module for factual grounding
- Advanced RAG with hybrid search
- Memory consolidation strategies
---
## Troubleshooting Quick Reference
| Problem | Quick Check | Solution |
|---------|-------------|----------|
| SESSIONS empty | `curl localhost:7081/debug/sessions` | Rebuild Cortex, verify `__init__.py` exists |
| LLM timeout | `curl http://10.0.0.44:8080/health` | Check backend connectivity, increase timeout |
| Port conflict | `netstat -tulpn \| grep 7078` | Stop conflicting service or change port |
| Container crash | `docker logs cortex` | Check logs for Python errors, verify .env syntax |
| Missing package | `docker exec cortex pip list` | Rebuild container, check requirements.txt |
| 502 from Relay | `curl localhost:7081/health` | Verify Cortex is running, check docker network |
---
## API Reference (Quick)
### Relay (Port 7078)
**POST /v1/chat/completions** - OpenAI-compatible chat
```json
{
"messages": [{"role": "user", "content": "..."}],
"session_id": "..."
}
```
**GET /_health** - Service health
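Equivalent Python client call (assumes the default localhost port mapping and the response shape used in the testing checklist above):
```python
# Calling Relay's OpenAI-compatible endpoint with the request shape shown above.
import httpx

resp = httpx.post(
    "http://localhost:7078/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What did we decide about the RAG service?"}],
        "session_id": "demo",
    },
    timeout=120,  # multi-stage reasoning can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```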
### Cortex (Port 7081)
**POST /reason** - Main reasoning pipeline
```json
{
"session_id": "...",
"user_prompt": "...",
"temperature": 0.7 // optional
}
```
**POST /ingest** - Add exchange to SESSIONS
```json
{
"session_id": "...",
"user_msg": "...",
"assistant_msg": "..."
}
```
**GET /debug/sessions** - Inspect SESSIONS state
**GET /debug/summary?session_id=X** - Test summarization
**GET /health** - Service health
### NeoMem (Port 7077)
**POST /memories** - Add memory
```json
{
"messages": [{"role": "...", "content": "..."}],
"user_id": "...",
"metadata": {}
}
```
**POST /search** - Semantic search
```json
{
"query": "...",
"user_id": "...",
"limit": 10
}
```
**GET /health** - Service health
---
## File Manifest (Key Files Only)
```
project-lyra/
├── .env # Root environment variables
├── docker-compose.yml # Service definitions (152 lines)
├── CHANGELOG.md # Version history (836 lines)
├── README.md # User documentation (610 lines)
├── PROJECT_SUMMARY.md # This file (AI context)
├── cortex/ # Reasoning engine
│ ├── Dockerfile # Single-worker constraint documented
│ ├── requirements.txt
│ ├── .env # Cortex overrides
│ ├── main.py # FastAPI initialization
│ ├── router.py # Routes (306 lines)
│ ├── context.py # Context aggregation
│ │
│ ├── intake/ # Short-term memory (embedded)
│ │ ├── __init__.py # Package exports
│ │ └── intake.py # Core logic (367 lines)
│ │
│ ├── reasoning/ # Reasoning pipeline
│ │ ├── reflection.py # Meta-awareness
│ │ ├── reasoning.py # Draft generation
│ │ └── refine.py # Refinement
│ │
│ ├── persona/ # Personality layer
│ │ ├── speak.py # Persona application
│ │ └── identity.py # Persona loader
│ │
│ └── llm/ # LLM integration
│ └── llm_router.py # Backend selector
├── core/relay/ # Orchestrator
│ ├── server.js # Express server (Node.js)
│ └── package.json
├── neomem/ # Long-term memory
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── .env # NeoMem overrides
│ └── main.py # Memory API
└── rag/ # RAG system (disabled)
├── rag_api.py
├── rag_chat_import.py
└── chromadb/
```
---
## Final Notes for AI Assistants
### What You Should Know Before Making Changes
1. **SESSIONS is sacred** - It's a module-level global in `cortex/intake/intake.py`. Don't move it, don't duplicate it, don't make it a class attribute. It must remain a singleton.
2. **Single-worker is mandatory** - Until SESSIONS is migrated to Redis, Cortex MUST run with a single Uvicorn worker. Multi-worker will cause SESSIONS to be inconsistent.
3. **Lenient error handling** - The `/ingest` endpoint and other parts of the pipeline use lenient error handling: log errors but always return success. Never fail the chat pipeline (see the sketch after this list).
4. **Backend routing is environment-driven** - Don't hardcode LLM URLs. Use the `{MODULE}_LLM` environment variables and the llm_router.py system.
5. **Intake is embedded** - Don't try to make HTTP calls to Intake. Use direct Python imports: `from intake.intake import ...`
6. **Test with diagnostic endpoints** - Always use `/debug/sessions` and `/debug/summary` to verify SESSIONS behavior after changes.
7. **Follow the changelog format** - When documenting changes, use the chronological format established in CHANGELOG.md v0.5.1. Group by version, then by change type (Fixed, Added, Changed, etc.).
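A minimal sketch of the lenient pattern from point 3 (illustrative only; the real handler lives in `cortex/router.py` and differs in detail):
```python
# Lenient /ingest sketch: log failures, never break the chat pipeline.
from fastapi import APIRouter, Request

cortex_router = APIRouter()

@cortex_router.post("/ingest")
async def ingest(request: Request) -> dict:
    try:
        exchange = await request.json()
        from intake.intake import add_exchange_internal
        return add_exchange_internal(exchange)  # {"ok": True, "session_id": ...}
    except Exception as exc:
        # Lenient mode: log and still report success so Relay's flow is never interrupted.
        print(f"[ingest] non-fatal error: {exc}")
        return {"ok": True, "warning": str(exc)}
```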
### When You Need Help
- **SESSIONS issues**: Check `cortex/intake/intake.py` lines 11-14 for initialization, lines 325-366 for `add_exchange_internal()`
- **Routing issues**: Check `cortex/router.py` lines 65-189 for `/reason`, lines 201-233 for `/ingest`
- **LLM backend issues**: Check `cortex/llm/llm_router.py` for backend selection logic
- **Environment variables**: Check `.env` lines 13-40 for LLM backends, lines 28-34 for module selection
### Most Important Thing
**This project values reliability over features.** It's better to have a simple, working system than a complex, broken one. When in doubt, keep it simple, log everything, and never fail silently.
---
**End of AI Context Summary**
*This document is maintained to provide complete context for AI assistants working on Project Lyra. Last updated: v0.5.1 (2025-12-11)*

README.md

@@ -1,9 +1,11 @@
# Project Lyra - README v0.5.1
Lyra is a modular persistent AI companion system with advanced reasoning capabilities.
It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.
**Current Version:** v0.5.1 (2025-12-11)
## Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator, all with its own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.
@@ -22,7 +24,7 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through Cortex reasoning pipeline
- Manages async calls to NeoMem and Cortex ingest
**2. UI** (Static HTML)
- Browser-based chat interface with cyberpunk theme
@@ -41,38 +43,48 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- **Includes embedded Intake module** (no separate service as of v0.5.1)
- **4-Stage Processing:**
  1. **Reflection** - Generates meta-awareness notes about conversation
  2. **Reasoning** - Creates initial draft answer using context
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context via internal Python imports
- Flexible LLM router supporting multiple backends via HTTP
- **Endpoints:**
- `POST /reason` - Main reasoning pipeline
- `POST /ingest` - Receives conversation exchanges from Relay
- `GET /health` - Service health check
- `GET /debug/sessions` - Inspect in-memory SESSIONS state
- `GET /debug/summary` - Test summarization for a session
**5. Intake** (Python Module) - **Embedded in Cortex**
- **No longer a standalone service** - runs as Python module inside Cortex container
- Short-term memory management with session-based circular buffer
- In-memory SESSIONS dictionary: `session_id → {buffer: deque(maxlen=200), created_at: timestamp}`
- Multi-level summarization (L1/L5/L10/L20/L30) produced by `summarize_context()`
- Deferred summarization - actual summary generation happens during `/reason` call
- Internal Python API:
  - `add_exchange_internal(exchange)` - Direct function call from Cortex
  - `summarize_context(session_id, exchanges)` - Async LLM-based summarization
  - `SESSIONS` - Module-level global state (requires single Uvicorn worker)
### LLM Backends (HTTP-based)
**All LLM communication is done via HTTP APIs:**
- **PRIMARY**: llama.cpp server (`http://10.0.0.44:8080`) - AMD MI50 GPU backend
- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
  - Model: gpt-4o-mini
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
  - Model: llama-3.2-8b-instruct
Each module can be configured to use a different backend via environment variables.
---
## Data Flow Architecture (v0.5.1)
### Normal Message Flow:
@@ -82,43 +94,44 @@ User (UI) → POST /v1/chat/completions
Relay (7078)
   ↓ POST /reason
Cortex (7081)
   ↓ (internal Python call)
Intake module → summarize_context()
   ↓
Cortex processes (4 stages):
   1. reflection.py → meta-awareness notes (CLOUD backend)
   2. reasoning.py → draft answer (PRIMARY backend)
   3. refine.py → refined answer (PRIMARY backend)
   4. persona/speak.py → Lyra personality (CLOUD backend)
   ↓
Returns persona answer to Relay
   ↓
Relay → POST /ingest (async)
   ↓
Cortex → add_exchange_internal() → SESSIONS buffer
   ↓
Relay → NeoMem /memories (async, planned)
   ↓
Relay → UI (returns final response)
```
### Cortex 4-Stage Reasoning Pipeline:
1. **Reflection** (`reflection.py`) - Cloud LLM (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"
2. **Reasoning** (`reasoning.py`) - Primary LLM (llama.cpp)
   - Retrieves short-term context from Intake module
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt
3. **Refinement** (`refine.py`) - Primary LLM (llama.cpp)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency
4. **Persona** (`speak.py`) - Cloud LLM (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to user
@@ -134,7 +147,7 @@ Relay → UI (returns final response)
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex
- Shared request handler for code reuse
- Comprehensive error handling
@@ -154,73 +167,70 @@ Relay → UI (returns final response)
### Reasoning Layer
**Cortex** (v0.5.1):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- Embedded Intake module for short-term context
- `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` endpoints
- Lenient error handling - never fails the chat pipeline
**Intake** (Embedded Module):
- **Architectural change**: Now runs as Python module inside Cortex container
- In-memory SESSIONS management (session_id → buffer)
- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
- Deferred summarization strategy - summaries generated during `/reason` call
- `bg_summarize()` is a logging stub - actual work deferred
- **Single-worker constraint**: SESSIONS requires single Uvicorn worker or Redis/shared storage
**LLM Router**:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for llama.cpp, Ollama, OpenAI, custom endpoints
- Per-module backend preferences:
- `CORTEX_LLM=SECONDARY` (Ollama for reasoning)
- `INTAKE_LLM=PRIMARY` (llama.cpp for summarization)
- `SPEAK_LLM=OPENAI` (Cloud for persona)
- `NEOMEM_LLM=PRIMARY` (llama.cpp for memory operations)
### Beta Lyrae (RAG Memory DB) - Currently Disabled
- **RAG Knowledge DB - Beta Lyrae (sheliak)**
- This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
- **Status**: Disabled in docker-compose.yml (v0.5.1)
The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
Directory Layout:
```
rag/
├── rag_chat_import.py   # imports JSON chat logs
├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
├── rag_build.py         # legacy single-folder builder
├── rag_query.py         # command-line query helper
├── rag_api.py           # FastAPI service providing /rag/search
├── chromadb/            # persistent vector store
├── chatlogs/            # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log           # progress log for batch runs
```
**OpenAI chatlog importer features:**
- Recursive folder indexing with **category detection** from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with `nohup … &`
---
@@ -228,13 +238,16 @@ Relay → UI (returns final response)
All services run in a single docker-compose stack with the following containers:
**Active Services:**
- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
- **neomem-api** - NeoMem memory service (port 7077)
- **relay** - Main orchestrator (port 7078)
- **cortex** - Reasoning engine with embedded Intake (port 7081)
**Disabled Services:**
- **intake** - No longer needed (embedded in Cortex as of v0.5.1)
- **rag** - Beta Lyrae RAG service (port 7090) - currently disabled
All containers communicate via the `lyra_net` Docker bridge network.
@@ -242,10 +255,10 @@ All containers communicate via the `lyra_net` Docker bridge network.
The following LLM backends are accessed via HTTP (not part of docker-compose):
- **llama.cpp Server** (`http://10.0.0.44:8080`)
  - AMD MI50 GPU-accelerated inference
  - Primary backend for reasoning and refinement stages
  - Model path: `/model`
- **Ollama Server** (`http://10.0.0.3:11434`)
  - RTX 3090 GPU-accelerated inference
@@ -265,16 +278,38 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
## Version History
### v0.5.1 (2025-12-11) - Current Release
**Critical Intake Integration Fixes:**
- ✅ Fixed `bg_summarize()` NameError preventing SESSIONS persistence
- ✅ Fixed `/ingest` endpoint unreachable code
- ✅ Added `cortex/intake/__init__.py` for proper package structure
- ✅ Added diagnostic logging to verify SESSIONS singleton behavior
- ✅ Added `/debug/sessions` and `/debug/summary` endpoints
- ✅ Documented single-worker constraint in Dockerfile
- ✅ Implemented lenient error handling (never fails chat pipeline)
- ✅ Intake now embedded in Cortex - no longer standalone service
**Architecture Changes:**
- Intake module runs inside Cortex container as pure Python import
- No HTTP calls between Cortex and Intake (internal function calls)
- SESSIONS persist correctly in Uvicorn worker
- Deferred summarization strategy (summaries generated during `/reason`)
### v0.5.0 (2025-11-28)
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working
### Infrastructure v1.0.0 (2025-11-26)
- Consolidated 9 scattered `.env` files into single source of truth
- Multi-backend LLM strategy implemented
- Docker Compose consolidation
- Created `.env.example` security templates
### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- LLM router with multi-backend support
- Major architectural restructuring
@@ -285,19 +320,30 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
---
## Known Issues (v0.5.1)
### Critical (Fixed in v0.5.1)
- ~~Intake SESSIONS not persisting~~ ✅ **FIXED**
- ~~`bg_summarize()` NameError~~ ✅ **FIXED**
- ~~`/ingest` endpoint unreachable code~~ ✅ **FIXED**
### Non-Critical
- Session management endpoints not fully implemented in Relay
- RAG service currently disabled in docker-compose.yml
- NeoMem integration in Relay not yet active (planned for v0.5.2)
### Operational Notes
- **Single-worker constraint**: Cortex must run with single Uvicorn worker to maintain SESSIONS state
- Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
- Diagnostic endpoints (`/debug/sessions`, `/debug/summary`) available for troubleshooting
### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs for tracing
- Comprehensive health checks across all services
- NeoMem integration in Relay
---
@@ -305,21 +351,39 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
### Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)
### Setup
1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys:
```bash
# Required: Configure at least one LLM backend
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
OPENAI_API_KEY=sk-... # OpenAI
```
2. Start all services with docker-compose:
```bash
docker-compose up -d
```
3. Check service health:
```bash
# Relay health
curl http://localhost:7078/_health
# Cortex health
curl http://localhost:7081/health
# NeoMem health
curl http://localhost:7077/health
```
4. Access the UI at `http://localhost:7078`
### Test
**Test Relay → Cortex pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
@@ -329,15 +393,130 @@ curl -X POST http://localhost:7078/v1/chat/completions \
  }'
```
**Test Cortex /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
```
**Inspect SESSIONS state:**
```bash
curl http://localhost:7081/debug/sessions
```
**Get summary for a session:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test"
```
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
---
## Environment Variables
### LLM Backend Configuration
**Backend URLs (Full API endpoints):**
```bash
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_PRIMARY_MODEL=/model
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
```
**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY # Use Ollama for reasoning
INTAKE_LLM=PRIMARY # Use llama.cpp for summarization
SPEAK_LLM=OPENAI # Use OpenAI for persona
NEOMEM_LLM=PRIMARY # Use llama.cpp for memory
UI_LLM=OPENAI # Use OpenAI for UI
RELAY_LLM=PRIMARY # Use llama.cpp for relay
```
### Database Configuration
```bash
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
### Service URLs (Internal Docker Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```
### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
For complete environment variable reference, see [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md).
---
## Documentation
- [CHANGELOG.md](CHANGELOG.md) - Detailed version history
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - Comprehensive project overview for AI context
- [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md) - Environment variable reference
- [DEPRECATED_FILES.md](DEPRECATED_FILES.md) - Deprecated files and migration guide
---
## Troubleshooting
### SESSIONS not persisting
**Symptom:** Intake buffer always shows 0 exchanges, summaries always empty.
**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists
- Check Cortex logs for `[Intake Module Init]` message showing SESSIONS object ID
- Verify single-worker mode (Dockerfile: `uvicorn main:app --workers 1`)
- Use `/debug/sessions` endpoint to inspect current state
### Cortex connection errors
**Symptom:** Relay can't reach Cortex, 502 errors.
**Solution:**
- Verify Cortex container is running: `docker ps | grep cortex`
- Check Cortex health: `curl http://localhost:7081/health`
- Verify environment variables: `CORTEX_REASON_URL=http://cortex:7081/reason`
- Check docker network: `docker network inspect lyra_net`
### LLM backend timeouts
**Symptom:** Reasoning stage hangs or times out.
**Solution:**
- Verify LLM backend is running and accessible
- Check LLM backend health: `curl http://10.0.0.44:8080/health`
- Increase timeout in llm_router.py if using slow models
- Check logs for specific backend errors
---
@@ -356,6 +535,8 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`
- Intake module is imported internally by Cortex (no HTTP communication)
- SESSIONS state is maintained in-memory within Cortex container
---
@@ -391,3 +572,38 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
  }'
```
---
## Development Notes
### Cortex Architecture (v0.5.1)
- Cortex contains embedded Intake module at `cortex/intake/`
- Intake is imported as: `from intake.intake import add_exchange_internal, SESSIONS`
- SESSIONS is a module-level global dictionary (singleton pattern)
- Single-worker constraint required to maintain SESSIONS state
- Diagnostic endpoints available for debugging: `/debug/sessions`, `/debug/summary`
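A quick way to sanity-check the singleton from inside the container (e.g. `docker exec -it cortex python`), using the import pattern above; names follow the documentation and may differ slightly from the actual module:
```python
# Verify SESSIONS behaves as a singleton inside the Cortex container.
from intake.intake import SESSIONS, add_exchange_internal

add_exchange_internal({"session_id": "dev", "user_msg": "ping", "assistant_msg": "pong"})
# The object id should match the one logged at module init and reported by /debug/sessions.
print(id(SESSIONS), len(SESSIONS["dev"]["buffer"]))
```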
### Adding New LLM Backends
1. Add backend URL to `.env`:
```bash
LLM_CUSTOM_URL=http://your-backend:port
LLM_CUSTOM_MODEL=model-name
```
2. Configure module to use new backend:
```bash
CORTEX_LLM=CUSTOM
```
3. Restart Cortex container:
```bash
docker-compose restart cortex
```
### Debugging Tips
- Enable verbose logging: `VERBOSE_DEBUG=true` in `.env`
- Check Cortex logs: `docker logs cortex -f`
- Inspect SESSIONS: `curl http://localhost:7081/debug/sessions`
- Test summarization: `curl "http://localhost:7081/debug/summary?session_id=test"`
- Check Relay logs: `docker logs relay -f`
- Monitor Docker network: `docker network inspect lyra_net`