docs updated for v0.5.1

2025-12-11 03:49:23 -05:00
parent e45cdbe54e
commit d5d7ea3469
2 changed files with 1227 additions and 157 deletions
--- a/README.md
+++ b/README.md
@@ -1,9 +1,11 @@
-# Project Lyra - README v0.5.0
+# Project Lyra - README v0.5.1

 Lyra is a modular persistent AI companion system with advanced reasoning capabilities.
 It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
 with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.

+**Current Version:** v0.5.1 (2025-12-11)
+
 ## Mission Statement

 The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot has. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator in one, with her own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.
@@ -22,7 +24,7 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
 - OpenAI-compatible endpoint: `POST /v1/chat/completions`
 - Internal endpoint: `POST /chat`
 - Routes messages through Cortex reasoning pipeline
-- Manages async calls to Intake and NeoMem
+- Manages async calls to NeoMem and Cortex ingest

 **2. UI** (Static HTML)
 - Browser-based chat interface with cyberpunk theme
@@ -41,38 +43,48 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do

 **4. Cortex** (Python/FastAPI) - Port 7081
 - Primary reasoning engine with multi-stage pipeline
+- **Includes embedded Intake module** (no separate service as of v0.5.1)
 - **4-Stage Processing:**
  1. **Reflection** - Generates meta-awareness notes about conversation
  2. **Reasoning** - Creates initial draft answer using context
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
-- Integrates with Intake for short-term context
+- Integrates with Intake for short-term context via internal Python imports
 - Flexible LLM router supporting multiple backends via HTTP
+- **Endpoints:**
+  - `POST /reason` - Main reasoning pipeline
+  - `POST /ingest` - Receives conversation exchanges from Relay
+  - `GET /health` - Service health check
+  - `GET /debug/sessions` - Inspect in-memory SESSIONS state
+  - `GET /debug/summary` - Test summarization for a session

-**5. Intake v0.2** (Python/FastAPI) - Port 7080
-- Simplified short-term memory summarization
-- Session-based circular buffer (deque, maxlen=200)
-- Single-level simple summarization (no cascading)
-- Background async processing with FastAPI BackgroundTasks
-- Pushes summaries to NeoMem automatically
-- **API Endpoints:**
-  - `POST /add_exchange` - Add conversation exchange
-  - `GET /summaries?session_id={id}` - Retrieve session summary
-  - `POST /close_session/{id}` - Close and cleanup session
+**5. Intake** (Python Module) - **Embedded in Cortex**
+- **No longer a standalone service** - runs as Python module inside Cortex container
+- Short-term memory management with session-based circular buffer
+- In-memory SESSIONS dictionary: `session_id → {buffer: deque(maxlen=200), created_at: timestamp}`
+- Multi-level summarization (L1/L5/L10/L20/L30) produced by `summarize_context()`
+- Deferred summarization - actual summary generation happens during `/reason` call
+- Internal Python API:
+  - `add_exchange_internal(exchange)` - Direct function call from Cortex
+  - `summarize_context(session_id, exchanges)` - Async LLM-based summarization
+  - `SESSIONS` - Module-level global state (requires single Uvicorn worker)

 ### LLM Backends (HTTP-based)

 **All LLM communication is done via HTTP APIs:**
-- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend
+- **PRIMARY**: llama.cpp server (`http://10.0.0.44:8080`) - AMD MI50 GPU backend
 - **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
+  - Model: qwen2.5:7b-instruct-q4_K_M
 - **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
+  - Model: gpt-4o-mini
 - **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
+  - Model: llama-3.2-8b-instruct
+
+Each module can be configured to use a different backend via environment variables.

-Each module can be configured to use a different backend via environment variables. 
-			
 ---

-## Data Flow Architecture (v0.5.0)
+## Data Flow Architecture (v0.5.1)

 ### Normal Message Flow:

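The Intake session buffer described in this hunk can be sketched in a few lines. This is a minimal illustration, not the code in `cortex/intake/`: the names `SESSIONS` and `add_exchange_internal`, the `buffer`/`created_at` keys, and `maxlen=200` come from the README text, while the exchange shape (a dict carrying its own `session_id`) and the return value are assumptions.

```python
from collections import deque
from datetime import datetime, timezone

# Module-level global state, as the README describes:
# session_id -> {"buffer": deque(maxlen=200), "created_at": timestamp}
SESSIONS: dict = {}

MAX_EXCHANGES = 200  # buffer size stated in the README


def add_exchange_internal(exchange: dict) -> int:
    """Append one exchange to its session's circular buffer.

    Assumption: the exchange dict carries a "session_id" key.
    Returns the buffer length after the append (illustrative only).
    """
    session_id = exchange["session_id"]
    if session_id not in SESSIONS:
        SESSIONS[session_id] = {
            "buffer": deque(maxlen=MAX_EXCHANGES),
            "created_at": datetime.now(timezone.utc),
        }
    buffer = SESSIONS[session_id]["buffer"]
    buffer.append(exchange)  # the oldest entry is dropped once the buffer is full
    return len(buffer)
```

Because `SESSIONS` lives in the memory of one process, a second Uvicorn worker would hold its own, separate dictionary; this is the reason the README asks for a single worker or a shared store such as Redis.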
@@ -82,43 +94,44 @@ User (UI) → POST /v1/chat/completions
 Relay (7078)
  ↓ POST /reason
 Cortex (7081)
-  ↓ GET /summaries?session_id=xxx
-Intake (7080) [RETURNS SUMMARY]
+  ↓ (internal Python call)
+Intake module → summarize_context()
  ↓
 Cortex processes (4 stages):
-  1. reflection.py → meta-awareness notes
-  2. reasoning.py → draft answer (uses LLM)
-  3. refine.py → refined answer (uses LLM)
-  4. persona/speak.py → Lyra personality (uses LLM)
+  1. reflection.py → meta-awareness notes (CLOUD backend)
+  2. reasoning.py → draft answer (PRIMARY backend)
+  3. refine.py → refined answer (PRIMARY backend)
+  4. persona/speak.py → Lyra personality (CLOUD backend)
  ↓
 Returns persona answer to Relay
  ↓
-Relay → Cortex /ingest (async, stub)
-Relay → Intake /add_exchange (async)
+Relay → POST /ingest (async)
  ↓
-Intake → Background summarize → NeoMem
+Cortex → add_exchange_internal() → SESSIONS buffer
+  ↓
+Relay → NeoMem /memories (async, planned)
  ↓
 Relay → UI (returns final response)
 ```

 ### Cortex 4-Stage Reasoning Pipeline:

-1. **Reflection** (`reflection.py`) - Configurable LLM via HTTP
+1. **Reflection** (`reflection.py`) - Cloud LLM (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"

-2. **Reasoning** (`reasoning.py`) - Configurable LLM via HTTP
-   - Retrieves short-term context from Intake
+2. **Reasoning** (`reasoning.py`) - Primary LLM (llama.cpp)
+   - Retrieves short-term context from Intake module
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt

-3. **Refinement** (`refine.py`) - Configurable LLM via HTTP
+3. **Refinement** (`refine.py`) - Primary LLM (llama.cpp)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency

-4. **Persona** (`speak.py`) - Configurable LLM via HTTP
+4. **Persona** (`speak.py`) - Cloud LLM (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to user
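The four stages above can be read as a chain in which each stage's output feeds the next. The sketch below shows only that chaining, with the per-stage backend labels taken from the flow diagram in this diff; `call_llm`, the prompt strings, and the `STAGES` table are hypothetical stand-ins, not Cortex's actual functions, and the real reasoning stage also merges Intake context and reflection notes.

```python
import asyncio

# Stage name -> backend label, as listed in the flow diagram of this diff.
STAGES = [
    ("reflection", "CLOUD"),
    ("reasoning", "PRIMARY"),
    ("refinement", "PRIMARY"),
    ("persona", "CLOUD"),
]


async def call_llm(backend: str, prompt: str) -> str:
    """Hypothetical stand-in for an HTTP call to the selected backend."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return f"[{backend}] {prompt}"


async def run_pipeline(user_prompt: str, context: str = "") -> dict:
    """Run the stages in order, feeding each stage's output into the next."""
    outputs = {}
    previous = f"{context}\n{user_prompt}".strip()
    for stage, backend in STAGES:
        previous = await call_llm(backend, f"{stage}: {previous}")
        outputs[stage] = previous
    return outputs


if __name__ == "__main__":
    result = asyncio.run(run_pipeline("What did we decide yesterday?"))
    print(result["persona"])
```

Keeping the stage-to-backend mapping in one table makes it easy to see which stages share a backend and to swap one without touching the loop.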
@@ -134,7 +147,7 @@ Relay → UI (returns final response)
 - OpenAI-compatible endpoint: `POST /v1/chat/completions`
 - Internal endpoint: `POST /chat`
 - Health check: `GET /_health`
-- Async non-blocking calls to Cortex and Intake
+- Async non-blocking calls to Cortex
 - Shared request handler for code reuse
 - Comprehensive error handling

@@ -154,73 +167,70 @@ Relay → UI (returns final response)

 ### Reasoning Layer

-**Cortex** (v0.5):
+**Cortex** (v0.5.1):
 - Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
 - Flexible LLM backend routing via HTTP
 - Per-stage backend selection
 - Async processing throughout
-- IntakeClient integration for short-term context
-- `/reason`, `/ingest` (stub), `/health` endpoints
+- Embedded Intake module for short-term context
+- `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` endpoints
+- Lenient error handling - never fails the chat pipeline

-**Intake** (v0.2):
-- Simplified single-level summarization
-- Session-based circular buffer (200 exchanges max)
-- Background async summarization
-- Automatic NeoMem push
-- No persistent log files (memory-only)
-- **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30)
+**Intake** (Embedded Module):
+- **Architectural change**: Now runs as Python module inside Cortex container
+- In-memory SESSIONS management (session_id → buffer)
+- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
+- Deferred summarization strategy - summaries generated during `/reason` call
+- `bg_summarize()` is a logging stub - actual work deferred
+- **Single-worker constraint**: SESSIONS requires a single Uvicorn worker; multi-worker setups need Redis or other shared storage

 **LLM Router**:
 - Dynamic backend selection via HTTP
 - Environment-driven configuration
-- Support for vLLM, Ollama, OpenAI, custom endpoints
-- Per-module backend preferences
+- Support for llama.cpp, Ollama, OpenAI, custom endpoints
+- Per-module backend preferences:
+  - `CORTEX_LLM=SECONDARY` (Ollama for reasoning)
+  - `INTAKE_LLM=PRIMARY` (llama.cpp for summarization)
+  - `SPEAK_LLM=OPENAI` (Cloud for persona)
+  - `NEOMEM_LLM=PRIMARY` (llama.cpp for memory operations)
+
+### Beta Lyrae (RAG Memory DB) - Currently Disabled

-# Beta Lyrae (RAG Memory DB) - added 11-3-25
 - **RAG Knowledge DB - Beta Lyrae (sheliak)**
-  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.  
+  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
-		The system uses:
-  - **ChromaDB** for persistent vector storage  
-  - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity  
-  - **FastAPI** (port 7090) for the `/rag/search` REST endpoint  
-  - Directory Layout
-		rag/
-		├── rag_chat_import.py # imports JSON chat logs
-		├── rag_docs_import.py # (planned) PDF/EPUB/manual importer
-		├── rag_build.py # legacy single-folder builder
-		├── rag_query.py # command-line query helper
-		├── rag_api.py # FastAPI service providing /rag/search
-		├── chromadb/ # persistent vector store
-		├── chatlogs/ # organized source data
-		│ ├── poker/
-		│ ├── work/
-		│ ├── lyra/
-		│ ├── personal/
-		│ └── ...
-		└── import.log # progress log for batch runs
-  - **OpenAI chatlog importer.
-	  - Takes JSON formatted chat logs and imports it to the RAG.
-	  - **fetures include:**
-	    - Recursive folder indexing with **category detection** from directory name  
-		- Smart chunking for long messages (5 000 chars per slice)  
-		- Automatic deduplication using SHA-1 hash of file + chunk
-		- Timestamps for both file modification and import time
-		- Full progress logging via tqdm
-		- Safe to run in background with nohup … &
-		- Metadata per chunk:
-		  ```json
-		  {
-			"chat_id": "<sha1 of filename>",
-			"chunk_index": 0,
-			"source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
-			"title": "cortex LLMs 11-1-25",
-			"role": "assistant",
-			"category": "lyra",
-			"type": "chat",
-			"file_modified": "2025-11-06T23:41:02",
-			"imported_at": "2025-11-07T03:55:00Z"
-		  }```
+  - **Status**: Disabled in docker-compose.yml (v0.5.1)
+
+The system uses:
+- **ChromaDB** for persistent vector storage
+- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
+- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
+
+Directory Layout:
+```
+rag/
+├── rag_chat_import.py    # imports JSON chat logs
+├── rag_docs_import.py    # (planned) PDF/EPUB/manual importer
+├── rag_build.py          # legacy single-folder builder
+├── rag_query.py          # command-line query helper
+├── rag_api.py            # FastAPI service providing /rag/search
+├── chromadb/             # persistent vector store
+├── chatlogs/             # organized source data
+│   ├── poker/
+│   ├── work/
+│   ├── lyra/
+│   ├── personal/
+│   └── ...
+└── import.log            # progress log for batch runs
+```
+
+**OpenAI chatlog importer features:**
+- Recursive folder indexing with **category detection** from directory name
+- Smart chunking for long messages (5,000 chars per slice)
+- Automatic deduplication using SHA-1 hash of file + chunk
+- Timestamps for both file modification and import time
+- Full progress logging via tqdm
+- Safe to run in background with `nohup … &`

 ---

@@ -228,13 +238,16 @@ Relay → UI (returns final response)

 All services run in a single docker-compose stack with the following containers:

+**Active Services:**
 - **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
 - **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
 - **neomem-api** - NeoMem memory service (port 7077)
 - **relay** - Main orchestrator (port 7078)
-- **cortex** - Reasoning engine (port 7081)
-- **intake** - Short-term memory summarization (port 7080) - currently disabled
-- **rag** - RAG search service (port 7090) - currently disabled
+- **cortex** - Reasoning engine with embedded Intake (port 7081)
+
+**Disabled Services:**
+- **intake** - No longer needed (embedded in Cortex as of v0.5.1)
+- **rag** - Beta Lyrae RAG service (port 7090) - currently disabled

 All containers communicate via the `lyra_net` Docker bridge network.

@@ -242,10 +255,10 @@ All containers communicate via the `lyra_net` Docker bridge network.

 The following LLM backends are accessed via HTTP (not part of docker-compose):

-- **vLLM Server** (`http://10.0.0.43:8000`)
+- **llama.cpp Server** (`http://10.0.0.44:8080`)
  - AMD MI50 GPU-accelerated inference
-  - Custom ROCm-enabled vLLM build
  - Primary backend for reasoning and refinement stages
+  - Model path: `/model`

 - **Ollama Server** (`http://10.0.0.3:11434`)
  - RTX 3090 GPU-accelerated inference
@@ -265,16 +278,38 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):

 ## Version History

-### v0.5.0 (2025-11-28) - Current Release
+### v0.5.1 (2025-12-11) - Current Release
+**Critical Intake Integration Fixes:**
+- ✅ Fixed `bg_summarize()` NameError preventing SESSIONS persistence
+- ✅ Fixed `/ingest` endpoint unreachable code
+- ✅ Added `cortex/intake/__init__.py` for proper package structure
+- ✅ Added diagnostic logging to verify SESSIONS singleton behavior
+- ✅ Added `/debug/sessions` and `/debug/summary` endpoints
+- ✅ Documented single-worker constraint in Dockerfile
+- ✅ Implemented lenient error handling (never fails chat pipeline)
+- ✅ Intake now embedded in Cortex - no longer standalone service
+
+**Architecture Changes:**
+- Intake module runs inside Cortex container as pure Python import
+- No HTTP calls between Cortex and Intake (internal function calls)
+- SESSIONS persist correctly in Uvicorn worker
+- Deferred summarization strategy (summaries generated during `/reason`)
+
+### v0.5.0 (2025-11-28)
 - ✅ Fixed all critical API wiring issues
 - ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
 - ✅ Fixed Cortex → Intake integration
 - ✅ Added missing Python package `__init__.py` files
 - ✅ End-to-end message flow verified and working

+### Infrastructure v1.0.0 (2025-11-26)
+- Consolidated 9 scattered `.env` files into single source of truth
+- Multi-backend LLM strategy implemented
+- Docker Compose consolidation
+- Created `.env.example` security templates
+
 ### v0.4.x (Major Rewire)
 - Cortex multi-stage reasoning pipeline
-- Intake v0.2 simplification
 - LLM router with multi-backend support
 - Major architectural restructuring

@@ -285,19 +320,30 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):

 ---

-## Known Issues (v0.5.0)
+## Known Issues (v0.5.1)
+
+### Critical (Fixed in v0.5.1)
+- ~~Intake SESSIONS not persisting~~ ✅ **FIXED**
+- ~~`bg_summarize()` NameError~~ ✅ **FIXED**
+- ~~`/ingest` endpoint unreachable code~~ ✅ **FIXED**

 ### Non-Critical
 - Session management endpoints not fully implemented in Relay
-- Intake service currently disabled in docker-compose.yml
 - RAG service currently disabled in docker-compose.yml
-- Cortex `/ingest` endpoint is a stub
+- NeoMem integration in Relay not yet active (planned for v0.5.2)
+
+### Operational Notes
+- **Single-worker constraint**: Cortex must run with a single Uvicorn worker to maintain SESSIONS state
+  - Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
+- Diagnostic endpoints (`/debug/sessions`, `/debug/summary`) available for troubleshooting

 ### Future Enhancements
 - Re-enable RAG service integration
 - Implement full session persistence
+- Migrate SESSIONS to Redis for multi-worker support
 - Add request correlation IDs for tracing
-- Comprehensive health checks
+- Comprehensive health checks across all services
+- NeoMem integration in Relay

 ---

@@ -305,21 +351,39 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):

 ### Prerequisites
 - Docker + Docker Compose
-- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key)
+- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)

 ### Setup
-1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys
+1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys:
+   ```bash
+   # Required: Configure at least one LLM backend
+   LLM_PRIMARY_URL=http://10.0.0.44:8080       # llama.cpp
+   LLM_SECONDARY_URL=http://10.0.0.3:11434     # Ollama
+   OPENAI_API_KEY=sk-...                        # OpenAI
+   ```
+
 2. Start all services with docker-compose:
   ```bash
   docker-compose up -d
   ```
+
 3. Check service health:
   ```bash
+   # Relay health
   curl http://localhost:7078/_health
+
+   # Cortex health
+   curl http://localhost:7081/health
+
+   # NeoMem health
+   curl http://localhost:7077/health
   ```
+
 4. Access the UI at `http://localhost:7078`

 ### Test
+
+**Test Relay → Cortex pipeline:**
 ```bash
 curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
@@ -329,15 +393,130 @@ curl -X POST http://localhost:7078/v1/chat/completions \
  }'
 ```

+**Test Cortex /ingest endpoint:**
+```bash
+curl -X POST http://localhost:7081/ingest \
+  -H "Content-Type: application/json" \
+  -d '{
+    "session_id": "test",
+    "user_msg": "Hello",
+    "assistant_msg": "Hi there!"
+  }'
+```
+
+**Inspect SESSIONS state:**
+```bash
+curl http://localhost:7081/debug/sessions
+```
+
+**Get summary for a session:**
+```bash
+curl "http://localhost:7081/debug/summary?session_id=test"
+```
+
 All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.

 ---

+## Environment Variables
+
+### LLM Backend Configuration
+
+**Backend URLs (Full API endpoints):**
+```bash
+LLM_PRIMARY_URL=http://10.0.0.44:8080           # llama.cpp
+LLM_PRIMARY_MODEL=/model
+
+LLM_SECONDARY_URL=http://10.0.0.3:11434         # Ollama
+LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
+
+LLM_OPENAI_URL=https://api.openai.com/v1
+LLM_OPENAI_MODEL=gpt-4o-mini
+OPENAI_API_KEY=sk-...
+```
+
+**Module-specific backend selection:**
+```bash
+CORTEX_LLM=SECONDARY      # Use Ollama for reasoning
+INTAKE_LLM=PRIMARY        # Use llama.cpp for summarization
+SPEAK_LLM=OPENAI          # Use OpenAI for persona
+NEOMEM_LLM=PRIMARY        # Use llama.cpp for memory
+UI_LLM=OPENAI             # Use OpenAI for UI
+RELAY_LLM=PRIMARY         # Use llama.cpp for relay
+```
+
+### Database Configuration
+```bash
+POSTGRES_USER=neomem
+POSTGRES_PASSWORD=neomempass
+POSTGRES_DB=neomem
+POSTGRES_HOST=neomem-postgres
+POSTGRES_PORT=5432
+
+NEO4J_URI=bolt://neomem-neo4j:7687
+NEO4J_USERNAME=neo4j
+NEO4J_PASSWORD=neomemgraph
+```
+
+### Service URLs (Internal Docker Network)
+```bash
+NEOMEM_API=http://neomem-api:7077
+CORTEX_API=http://cortex:7081
+CORTEX_REASON_URL=http://cortex:7081/reason
+CORTEX_INGEST_URL=http://cortex:7081/ingest
+RELAY_URL=http://relay:7078
+```
+
+### Feature Flags
+```bash
+CORTEX_ENABLED=true
+MEMORY_ENABLED=true
+PERSONA_ENABLED=false
+DEBUG_PROMPT=true
+VERBOSE_DEBUG=true
+```
+
+For complete environment variable reference, see [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md).
+
+---
+
 ## Documentation

-- See [CHANGELOG.md](CHANGELOG.md) for detailed version history
-- See `ENVIRONMENT_VARIABLES.md` for environment variable reference
-- Additional information available in the Trilium docs
+- [CHANGELOG.md](CHANGELOG.md) - Detailed version history
+- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - Comprehensive project overview for AI context
+- [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md) - Environment variable reference
+- [DEPRECATED_FILES.md](DEPRECATED_FILES.md) - Deprecated files and migration guide
+
+---
+
+## Troubleshooting
+
+### SESSIONS not persisting
+**Symptom:** Intake buffer always shows 0 exchanges, summaries always empty.
+
+**Solution (Fixed in v0.5.1):**
+- Ensure `cortex/intake/__init__.py` exists
+- Check Cortex logs for `[Intake Module Init]` message showing SESSIONS object ID
+- Verify single-worker mode (Dockerfile: `uvicorn main:app --workers 1`)
+- Use `/debug/sessions` endpoint to inspect current state
+
+### Cortex connection errors
+**Symptom:** Relay can't reach Cortex, 502 errors.
+
+**Solution:**
+- Verify Cortex container is running: `docker ps | grep cortex`
+- Check Cortex health: `curl http://localhost:7081/health`
+- Verify environment variables: `CORTEX_REASON_URL=http://cortex:7081/reason`
+- Check docker network: `docker network inspect lyra_net`
+
+### LLM backend timeouts
+**Symptom:** Reasoning stage hangs or times out.
+
+**Solution:**
+- Verify LLM backend is running and accessible
+- Check LLM backend health: `curl http://10.0.0.44:8080/health`
+- Increase timeout in llm_router.py if using slow models
+- Check logs for specific backend errors

 ---

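The health checks from the Setup and Troubleshooting sections in this hunk can be run in one pass. This is a minimal sketch using only the Python standard library; the URLs are the ones shown in the README, while the `check` helper and its 5-second default timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Health endpoints as listed in the Setup and Troubleshooting sections.
HEALTH_URLS = {
    "relay": "http://localhost:7078/_health",
    "cortex": "http://localhost:7081/health",
    "neomem": "http://localhost:7077/health",
}


def check(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs.
        return False


if __name__ == "__main__":
    for name, url in HEALTH_URLS.items():
        print(f"{name}: {'ok' if check(url) else 'unreachable'}")
```

Run it from the Docker host; inside the `lyra_net` network the service hostnames (`relay`, `cortex`, `neomem-api`) would replace `localhost`.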
@@ -356,6 +535,8 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
 - All services communicate via Docker internal networking on the `lyra_net` bridge
 - History and entity graphs are managed via PostgreSQL + Neo4j
 - LLM backends are accessed via HTTP and configured in `.env`
+- Intake module is imported internally by Cortex (no HTTP communication)
+- SESSIONS state is maintained in-memory within Cortex container

 ---

@@ -391,3 +572,38 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
     }'
   ```

+---
+
+## Development Notes
+
+### Cortex Architecture (v0.5.1)
+- Cortex contains embedded Intake module at `cortex/intake/`
+- Intake is imported as: `from intake.intake import add_exchange_internal, SESSIONS`
+- SESSIONS is a module-level global dictionary (singleton pattern)
+- Single-worker constraint required to maintain SESSIONS state
+- Diagnostic endpoints available for debugging: `/debug/sessions`, `/debug/summary`
+
+### Adding New LLM Backends
+1. Add backend URL to `.env`:
+   ```bash
+   LLM_CUSTOM_URL=http://your-backend:port
+   LLM_CUSTOM_MODEL=model-name
+   ```
+
+2. Configure module to use new backend:
+   ```bash
+   CORTEX_LLM=CUSTOM
+   ```
+
+3. Restart Cortex container:
+   ```bash
+   docker-compose restart cortex
+   ```
+
+### Debugging Tips
+- Enable verbose logging: `VERBOSE_DEBUG=true` in `.env`
+- Check Cortex logs: `docker logs cortex -f`
+- Inspect SESSIONS: `curl http://localhost:7081/debug/sessions`
+- Test summarization: `curl "http://localhost:7081/debug/summary?session_id=test"`
+- Check Relay logs: `docker logs relay -f`
+- Monitor Docker network: `docker network inspect lyra_net`
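The naming convention in "Adding New LLM Backends" (a module variable such as `CORTEX_LLM=CUSTOM` selecting `LLM_CUSTOM_URL` and `LLM_CUSTOM_MODEL`) can be sketched as a small lookup. This is an assumption about how the pattern resolves, not the actual `llm_router.py`: `resolve_backend`, its `PRIMARY` default, and its error behavior are hypothetical.

```python
import os


def resolve_backend(module: str, env=None) -> dict:
    """Map e.g. CORTEX_LLM=CUSTOM to LLM_CUSTOM_URL / LLM_CUSTOM_MODEL.

    Hypothetical helper: the variable names follow the README, while the
    lookup logic and the PRIMARY default are assumptions.
    """
    env = os.environ if env is None else env
    backend = env.get(f"{module.upper()}_LLM", "PRIMARY").upper()
    url = env.get(f"LLM_{backend}_URL")
    if not url:
        raise KeyError(f"LLM_{backend}_URL is not set for module {module!r}")
    return {"backend": backend, "url": url, "model": env.get(f"LLM_{backend}_MODEL")}
```

For example, with `CORTEX_LLM=CUSTOM` and `LLM_CUSTOM_URL=http://your-backend:port` set, `resolve_backend("cortex")` would return the custom URL and model; failing fast on a missing URL surfaces configuration mistakes at startup instead of at the first request.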