From 5ed3fd0982a5f56e3386218f76d5f3cdd7f05605 Mon Sep 17 00:00:00 2001 From: serversdwn Date: Thu, 11 Dec 2025 02:50:23 -0500 Subject: [PATCH] cortex rework continued. --- CHANGELOG.md | 1304 +++++++++++++++++++------------------ cortex/Dockerfile | 2 + cortex/context.py | 39 +- cortex/intake/__init__.py | 18 + cortex/intake/intake.py | 138 ++-- cortex/router.py | 106 ++- vllm-mi50.md | 416 ------------ 7 files changed, 910 insertions(+), 1113 deletions(-) create mode 100644 cortex/intake/__init__.py delete mode 100644 vllm-mi50.md diff --git a/CHANGELOG.md b/CHANGELOG.md index b634cc9..ab30ad6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,18 +1,72 @@ -# Project Lyra β€” Modular Changelog -All notable changes to Project Lyra are organized by component. -The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) -and adheres to [Semantic Versioning](https://semver.org/). -# Last Updated: 11-28-25 +# Project Lyra Changelog + +All notable changes to Project Lyra. +Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/). + --- -## 🧠 Lyra-Core ############################################################################## +## [Unreleased] -## [Project Lyra v0.5.0] - 2025-11-28 +--- + +## [0.5.1] - 2025-12-11 + +### Fixed - Intake Integration +- **Critical**: Fixed `bg_summarize()` function not defined error + - Was only a `TYPE_CHECKING` stub, now implemented as logging stub + - Eliminated `NameError` preventing SESSIONS from persisting correctly + - Function now logs exchange additions and defers summarization to `/reason` endpoint +- **Critical**: Fixed `/ingest` endpoint unreachable code in [router.py:201-233](cortex/router.py#L201-L233) + - Removed early return that prevented `update_last_assistant_message()` from executing + - Removed duplicate `add_exchange_internal()` call + - Implemented lenient error handling (each operation wrapped in try/except) +- **Intake**: Added missing `__init__.py` to make intake a proper Python package [cortex/intake/__init__.py](cortex/intake/__init__.py) + - Prevents namespace package issues + - Enables proper module imports + - Exports `SESSIONS`, `add_exchange_internal`, `summarize_context` + +### Added - Diagnostics & Debugging +- Added diagnostic logging to verify SESSIONS singleton behavior + - Module initialization logs SESSIONS object ID [intake.py:14](cortex/intake/intake.py#L14) + - Each `add_exchange_internal()` call logs object ID and buffer state [intake.py:343-358](cortex/intake/intake.py#L343-L358) +- Added `/debug/sessions` HTTP endpoint [router.py:276-305](cortex/router.py#L276-L305) + - Inspect SESSIONS from within running Uvicorn worker + - Shows total sessions, session count, buffer sizes, recent exchanges + - Returns SESSIONS object ID for verification +- Added `/debug/summary` HTTP endpoint [router.py:238-271](cortex/router.py#L238-L271) + - Test `summarize_context()` for any session + - Returns L1/L5/L10/L20/L30 summaries + - Includes buffer size and exchange preview + +### Changed - Intake Architecture +- **Intake no longer standalone service** - runs inside Cortex container as pure Python module + - Imported as `from intake.intake import add_exchange_internal, SESSIONS` + - No HTTP calls between Cortex and Intake + - Eliminates network latency and dependency on Intake service being up +- **Deferred summarization**: `bg_summarize()` is now a no-op stub [intake.py:318-325](cortex/intake/intake.py#L318-L325) + - Actual summarization happens during 
`/reason` call via `summarize_context()` + - Simplifies async/sync complexity + - Prevents NameError when called from `add_exchange_internal()` +- **Lenient error handling**: `/ingest` endpoint always returns success [router.py:201-233](cortex/router.py#L201-L233) + - Each operation wrapped in try/except + - Logs errors but never fails to avoid breaking chat pipeline + - User requirement: never fail chat pipeline + +### Documentation +- Added single-worker constraint note in [cortex/Dockerfile:7-8](cortex/Dockerfile#L7-L8) + - Documents that SESSIONS requires single Uvicorn worker + - Notes that multi-worker scaling requires Redis or shared storage +- Updated plan documentation with root cause analysis + +--- + +## [0.5.0] - 2025-11-28 + +### Fixed - Critical API Wiring & Integration -### πŸ”§ Fixed - Critical API Wiring & Integration After the major architectural rewire (v0.4.x), this release fixes all critical endpoint mismatches and ensures end-to-end system connectivity. -#### Cortex β†’ Intake Integration βœ… +#### Cortex β†’ Intake Integration - **Fixed** `IntakeClient` to use correct Intake v0.2 API endpoints - Changed `GET /context/{session_id}` β†’ `GET /summaries?session_id={session_id}` - Updated JSON response parsing to extract `summary_text` field @@ -20,7 +74,7 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - Corrected default port: `7083` β†’ `7080` - Added deprecation warning to `summarize_turn()` method (endpoint removed in Intake v0.2) -#### Relay β†’ UI Compatibility βœ… +#### Relay β†’ UI Compatibility - **Added** OpenAI-compatible endpoint `POST /v1/chat/completions` - Accepts standard OpenAI format with `messages[]` array - Returns OpenAI-compatible response structure with `choices[]` @@ -31,13 +85,13 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - Eliminates code duplication - Consistent error handling across endpoints -#### Relay β†’ Intake Connection βœ… +#### Relay β†’ Intake Connection - **Fixed** Intake URL fallback in Relay server configuration - Corrected port: `7082` β†’ `7080` - Updated endpoint: `/summary` β†’ `/add_exchange` - Now properly sends exchanges to Intake for summarization -#### Code Quality & Python Package Structure βœ… +#### Code Quality & Python Package Structure - **Added** missing `__init__.py` files to all Cortex subdirectories - `cortex/llm/__init__.py` - `cortex/reasoning/__init__.py` @@ -48,7 +102,8 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - **Removed** unused import in `cortex/router.py`: `from unittest import result` - **Deleted** empty file `cortex/llm/resolve_llm_url.py` (was 0 bytes, never implemented) -### βœ… Verified Working +### Verified Working + Complete end-to-end message flow now operational: ``` UI β†’ Relay (/v1/chat/completions) @@ -72,26 +127,26 @@ Intake β†’ NeoMem (background memory storage) Relay β†’ UI (final response) ``` -### πŸ“ Documentation -- **Added** this CHANGELOG entry with comprehensive v0.5.0 notes +### Documentation +- **Added** comprehensive v0.5.0 changelog entry - **Updated** README.md to reflect v0.5.0 architecture - Documented new endpoints - Updated data flow diagrams - Clarified Intake v0.2 changes - Corrected service descriptions -### πŸ› Issues Resolved +### Issues Resolved - ❌ Cortex could not retrieve context from Intake (wrong endpoint) - ❌ UI could not send messages to Relay (endpoint mismatch) - ❌ Relay could not send summaries to Intake (wrong port/endpoint) - ❌ Python 
package imports were implicit (missing __init__.py) -### ⚠️ Known Issues (Non-Critical) +### Known Issues (Non-Critical) - Session management endpoints not implemented in Relay (`GET/POST /sessions/:id`) - RAG service currently disabled in docker-compose.yml - Cortex `/ingest` endpoint is a stub returning `{"status": "ok"}` -### 🎯 Migration Notes +### Migration Notes If upgrading from v0.4.x: 1. Pull latest changes from git 2. Verify environment variables in `.env` files: @@ -104,45 +159,48 @@ If upgrading from v0.4.x: ## [Infrastructure v1.0.0] - 2025-11-26 -### Changed -- **Environment Variable Consolidation** - Major reorganization to eliminate duplication and improve maintainability - - Consolidated 9 scattered `.env` files into single source of truth architecture - - Root `.env` now contains all shared infrastructure (LLM backends, databases, API keys, service URLs) - - Service-specific `.env` files minimized to only essential overrides: - - `cortex/.env`: Reduced from 42 to 22 lines (operational parameters only) - - `neomem/.env`: Reduced from 26 to 14 lines (LLM naming conventions only) - - `intake/.env`: Kept at 8 lines (already minimal) - - **Result**: ~24% reduction in total configuration lines (197 β†’ ~150) +### Changed - Environment Variable Consolidation -- **Docker Compose Consolidation** - - All services now defined in single root `docker-compose.yml` - - Relay service updated with complete configuration (env_file, volumes) - - Removed redundant `core/docker-compose.yml` (marked as DEPRECATED) - - Standardized network communication to use Docker container names +**Major reorganization to eliminate duplication and improve maintainability** -- **Service URL Standardization** - - Internal services use container names: `http://neomem-api:7077`, `http://cortex:7081` - - External services use IP addresses: `http://10.0.0.43:8000` (vLLM), `http://10.0.0.3:11434` (Ollama) - - Removed IP/container name inconsistencies across files +- Consolidated 9 scattered `.env` files into single source of truth architecture +- Root `.env` now contains all shared infrastructure (LLM backends, databases, API keys, service URLs) +- Service-specific `.env` files minimized to only essential overrides: + - `cortex/.env`: Reduced from 42 to 22 lines (operational parameters only) + - `neomem/.env`: Reduced from 26 to 14 lines (LLM naming conventions only) + - `intake/.env`: Kept at 8 lines (already minimal) +- **Result**: ~24% reduction in total configuration lines (197 β†’ ~150) -### Added -- **Security Templates** - Created `.env.example` files for all services - - Root `.env.example` with sanitized credentials - - Service-specific templates: `cortex/.env.example`, `neomem/.env.example`, `intake/.env.example`, `rag/.env.example` - - All `.env.example` files safe to commit to version control +**Docker Compose Consolidation** +- All services now defined in single root `docker-compose.yml` +- Relay service updated with complete configuration (env_file, volumes) +- Removed redundant `core/docker-compose.yml` (marked as DEPRECATED) +- Standardized network communication to use Docker container names -- **Documentation** - - `ENVIRONMENT_VARIABLES.md`: Comprehensive reference for all environment variables - - Variable descriptions, defaults, and usage examples - - Multi-backend LLM strategy documentation - - Troubleshooting guide - - Security best practices - - `DEPRECATED_FILES.md`: Deletion guide for deprecated files with verification steps +**Service URL Standardization** +- Internal services use 
container names: `http://neomem-api:7077`, `http://cortex:7081` +- External services use IP addresses: `http://10.0.0.43:8000` (vLLM), `http://10.0.0.3:11434` (Ollama) +- Removed IP/container name inconsistencies across files -- **Enhanced .gitignore** - - Ignores all `.env` files (including subdirectories) - - Tracks `.env.example` templates for documentation - - Ignores `.env-backups/` directory +### Added - Security & Documentation + +**Security Templates** - Created `.env.example` files for all services +- Root `.env.example` with sanitized credentials +- Service-specific templates: `cortex/.env.example`, `neomem/.env.example`, `intake/.env.example`, `rag/.env.example` +- All `.env.example` files safe to commit to version control + +**Documentation** +- `ENVIRONMENT_VARIABLES.md`: Comprehensive reference for all environment variables + - Variable descriptions, defaults, and usage examples + - Multi-backend LLM strategy documentation + - Troubleshooting guide + - Security best practices +- `DEPRECATED_FILES.md`: Deletion guide for deprecated files with verification steps + +**Enhanced .gitignore** +- Ignores all `.env` files (including subdirectories) +- Tracks `.env.example` templates for documentation +- Ignores `.env-backups/` directory ### Removed - `core/.env` - Redundant with root `.env`, now deleted @@ -154,13 +212,15 @@ If upgrading from v0.4.x: - Eliminated duplicate database credentials across 3+ files - Resolved Cortex `environment:` section override in docker-compose (now uses env_file) -### Architecture -- **Multi-Backend LLM Strategy**: Root `.env` provides all backend OPTIONS (PRIMARY, SECONDARY, CLOUD, FALLBACK), services choose which to USE - - Cortex β†’ vLLM (PRIMARY) for autonomous reasoning - - NeoMem β†’ Ollama (SECONDARY) + OpenAI embeddings - - Intake β†’ vLLM (PRIMARY) for summarization - - Relay β†’ Fallback chain with user preference -- Preserves per-service flexibility while eliminating URL duplication +### Architecture - Multi-Backend LLM Strategy + +Root `.env` provides all backend OPTIONS (PRIMARY, SECONDARY, CLOUD, FALLBACK), services choose which to USE: +- **Cortex** β†’ vLLM (PRIMARY) for autonomous reasoning +- **NeoMem** β†’ Ollama (SECONDARY) + OpenAI embeddings +- **Intake** β†’ vLLM (PRIMARY) for summarization +- **Relay** β†’ Fallback chain with user preference + +Preserves per-service flexibility while eliminating URL duplication. ### Migration - All original `.env` files backed up to `.env-backups/` with timestamp `20251126_025334` @@ -169,637 +229,607 @@ If upgrading from v0.4.x: --- -## [Lyra_RAG v0.1.0] 2025-11-07 -### Added -- Initial standalone RAG module for Project Lyra. -- Persistent ChromaDB vector store (`./chromadb`). -- Importer `rag_chat_import.py` with: - - Recursive folder scanning and category tagging. - - Smart chunking (~5 k chars). - - SHA-1 deduplication and chat-ID metadata. - - Timestamp fields (`file_modified`, `imported_at`). - - Background-safe operation (`nohup`/`tmux`). -- 68 Lyra-category chats imported: - - **6 556 new chunks added** - - **1 493 duplicates skipped** - - **7 997 total vectors** now stored. +## [0.4.x] - 2025-11-13 -### API -- `/rag/search` FastAPI endpoint implemented (port 7090). -- Supports natural-language queries and returns top related excerpts. -- Added answer synthesis step using `gpt-4o-mini`. +### Added - Multi-Stage Reasoning Pipeline -### Verified -- Successful recall of Lyra-Core development history (v0.3.0 snapshot). -- Correct metadata and category tagging for all new imports. 
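The multi-backend strategy in the Infrastructure entry above (the root `.env` exposes PRIMARY, SECONDARY, CLOUD, and FALLBACK options and each service picks the ones it actually uses) can be illustrated with a small sketch. This is not code from the repo; the variable names mirror ones referenced elsewhere in this changelog, and the helper itself is hypothetical.

```python
# Hypothetical sketch of the "options vs. use" pattern described above:
# the root .env defines every backend, each service picks one by role name.
import os

BACKEND_ENV_KEYS = {
    "primary": "LLM_PRIMARY_URL",      # vLLM (MI50)
    "secondary": "LLM_SECONDARY_URL",  # Ollama (3090)
    "cloud": "LLM_CLOUD_URL",          # OpenAI
    "fallback": "LLM_FALLBACK_URL",    # llama.cpp CPU (assumed variable name)
}

def pick_backend(role: str) -> str:
    """Return the full endpoint URL for the backend role a service chooses."""
    key = BACKEND_ENV_KEYS[role]
    url = os.getenv(key)
    if not url:
        raise RuntimeError(f"{key} is not set in the shared root .env")
    return url

# Example: Cortex and Intake would pick "primary" (vLLM), NeoMem "secondary" (Ollama).
if __name__ == "__main__":
    os.environ.setdefault("LLM_PRIMARY_URL", "http://10.0.0.43:8000/v1/completions")
    print(pick_backend("primary"))
```
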
+**Cortex v0.5 - Complete architectural overhaul** -### Next Planned -- Optional `where` filter parameter for category/date queries. -- Graceful β€œno results” handler for empty retrievals. -- `rag_docs_import.py` for PDFs and other document types. +- **New `reasoning.py` module** + - Async reasoning engine + - Accepts user prompt, identity, RAG block, and reflection notes + - Produces draft internal answers + - Uses primary backend (vLLM) -## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 +- **New `reflection.py` module** + - Fully async meta-awareness layer + - Produces actionable JSON "internal notes" + - Enforces strict JSON schema and fallback parsing + - Forces cloud backend (`backend_override="cloud"`) -### Added -- ** New UI ** - - Cleaned up UI look and feel. - -- ** Added "sessions" ** - - Now sessions persist over time. - - Ability to create new sessions or load sessions from a previous instance. - - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). - - Relay is correctly wired in. +- **Integrated `refine.py` into pipeline** + - New stage between reflection and persona + - Runs exclusively on primary vLLM backend (MI50) + - Produces final, internally consistent output for downstream persona layer -## [Lyra-Core 0.3.1] - 2025-10-09 +- **Backend override system** + - Each LLM call can now select its own backend + - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary -### Added -- **NVGRAM Integration (Full Pipeline Reconnected)** - - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). - - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. - - Added `.env` variable: - ``` - NVGRAM_API=http://nvgram-api:7077 - ``` - - Verified end-to-end Lyra conversation persistence: - - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` - - βœ… Memories stored, retrieved, and re-injected successfully. +- **Identity loader** + - Added `identity.py` with `load_identity()` for consistent persona retrieval -### Changed -- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. -- Updated Docker Compose service dependency order: - - `relay` now depends on `nvgram-api` healthcheck. - - Removed `mem0` references and volumes. -- Minor cleanup to Persona fetch block (null-checks and safer default persona string). +- **Ingest handler** + - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline + +**Cortex v0.4.1 - RAG Integration** + +- **RAG integration** + - Added `rag.py` with `query_rag()` and `format_rag_block()` + - Cortex now queries local RAG API (`http://10.0.0.41:7090/rag/search`) + - Synthesized answers and top excerpts injected into reasoning prompt + +### Changed - Unified LLM Architecture + +**Cortex v0.5** + +- **Unified LLM backend URL handling across Cortex** + - ENV variables must now contain FULL API endpoints + - Removed all internal path-appending (e.g. 
`.../v1/completions`) + - `llm_router.py` rewritten to use env-provided URLs as-is + - Ensures consistent behavior between draft, reflection, refine, and persona + +- **Rebuilt `main.py`** + - Removed old annotation/analysis logic + - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes + - Routes now clean and minimal (`/reason`, `/ingest`, `/health`) + - Async path throughout Cortex + +- **Refactored `llm_router.py`** + - Removed old fallback logic during overrides + - OpenAI requests now use `/v1/chat/completions` + - Added proper OpenAI Authorization headers + - Distinct payload format for vLLM vs OpenAI + - Unified, correct parsing across models + +- **Simplified Cortex architecture** + - Removed deprecated "context.py" and old reasoning code + - Relay completely decoupled from smart behavior + +- **Updated environment specification** + - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions` + - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama) + - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions` + +**Cortex v0.4.1** + +- **Revised `/reason` endpoint** + - Now builds unified context blocks: [Intake] β†’ recent summaries, [RAG] β†’ contextual knowledge, [User Message] β†’ current input + - Calls `call_llm()` for first pass, then `reflection_loop()` for meta-evaluation + - Returns `cortex_prompt`, `draft_output`, `final_output`, and normalized reflection + +- **Reflection Pipeline Stability** + - Cleaned parsing to normalize JSON vs. text reflections + - Added fallback handling for malformed or non-JSON outputs + - Log system improved to show raw JSON, extracted fields, and normalized summary + +- **Async Summarization (Intake v0.2.1)** + - Intake summaries now run in background threads to avoid blocking Cortex + - Summaries (L1–L∞) logged asynchronously with [BG] tags + +- **Environment & Networking Fixes** + - Verified `.env` variables propagate correctly inside Cortex container + - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG + - Adjusted localhost calls to service-IP mapping + +- **Behavioral Updates** + - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers) + - RAG context successfully grounds reasoning outputs + - Intake and NeoMem confirmed receiving summaries via `/add_exchange` + - Log clarity pass: all reflective and contextual blocks clearly labeled ### Fixed -- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. -- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. -- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). -### Goals / Next Steps -- Add salience visualization (e.g., memory weights displayed in injected system message). -- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. -- Add relay auto-retry for transient 500 responses from NVGRAM. 
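A minimal sketch of the full-URL dispatch behaviour described in the Changed entries above: environment variables carry complete endpoints, a vLLM-style `/v1/completions` URL gets a prompt payload, and an OpenAI-style `/v1/chat/completions` URL gets a `messages[]` payload plus an Authorization header. This illustrates the pattern only; it is not the actual `llm_router.py`, and the function signature is assumed.

```python
# Hypothetical sketch of full-URL backend dispatch (not the real llm_router.py).
import os
import httpx

async def call_llm(prompt: str, backend_override: str = "primary") -> str:
    # Env vars are used as-is: they must already contain the full endpoint path.
    url = os.environ[f"LLM_{backend_override.upper()}_URL"]
    model = os.environ.get(f"LLM_{backend_override.upper()}_MODEL", "")
    headers = {}

    if "chat/completions" in url:
        # OpenAI-style chat endpoint: messages[] payload + Authorization header.
        headers["Authorization"] = f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"
        payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    else:
        # vLLM-style completions endpoint: plain prompt payload.
        payload = {"model": model, "prompt": prompt, "max_tokens": 512}

    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(url, json=payload, headers=headers)
        resp.raise_for_status()
        data = resp.json()

    choice = data["choices"][0]
    # Chat endpoints return message.content, completion endpoints return text.
    return choice.get("message", {}).get("content") or choice.get("text", "")
```
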
+**Cortex v0.5** + +- Resolved endpoint conflict where router expected base URLs and refine expected full URLs + - Fixed by standardizing full-URL behavior across entire system +- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax) +- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints +- No more double-routing through vLLM during reflection +- Corrected async/sync mismatch in multiple locations +- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic + +### Removed + +**Cortex v0.5** + +- Legacy `annotate`, `reason_check` glue logic from old architecture +- Old backend probing junk code +- Stale imports and unused modules leftover from previous prototype + +### Verified + +**Cortex v0.5** + +- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly +- Refine shows `used_primary_backend: true` and no fallback +- Manual curl test confirms endpoint accuracy + +### Known Issues + +**Cortex v0.5** + +- Refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this +- Hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned) + +**Cortex v0.4.1** + +- NeoMem tuning needed - improve retrieval latency and relevance +- Need dedicated `/reflections/recent` endpoint for Cortex +- Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem) +- Add persistent reflection recall (use prior reflections as meta-context) +- Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields) +- Tighten temperature and prompt control for factual consistency +- RAG optimization: add source ranking, filtering, multi-vector hybrid search +- Cache RAG responses per session to reduce duplicate calls + +### Notes + +**Cortex v0.5** + +This is the largest structural change to Cortex so far. It establishes: +- Multi-model cognition +- Clean layering +- Identity + reflection separation +- Correct async code +- Deterministic backend routing +- Predictable JSON reflection + +The system is now ready for: +- Refinement loops +- Persona-speaking layer +- Containerized RAG +- Long-term memory integration +- True emergent-behavior experiments --- -## [Lyra-Core] v0.3.1 - 2025-09-27 -### Changed -- Removed salience filter logic; Cortex is now the default annotator. -- All user messages stored in Mem0; no discard tier applied. +## [0.3.x] - 2025-10-28 to 2025-09-26 ### Added -- Cortex annotations (`metadata.cortex`) now attached to memories. 
-- Debug logging improvements: + +**[Lyra Core v0.3.2 + Web UI v0.2.0] - 2025-10-28** + +- **New UI** + - Cleaned up UI look and feel + +- **Sessions** + - Sessions now persist over time + - Ability to create new sessions or load sessions from previous instance + - When changing session, updates what the prompt sends to relay (doesn't prompt with messages from other sessions) + - Relay correctly wired in + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- **NVGRAM Integration (Full Pipeline Reconnected)** + - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077) + - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search` + - Added `.env` variable: `NVGRAM_API=http://nvgram-api:7077` + - Verified end-to-end Lyra conversation persistence: `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` + - βœ… Memories stored, retrieved, and re-injected successfully + +**[Lyra-Core v0.3.0] - 2025-09-26** + +- **Salience filtering** in Relay + - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL` + - Supports `heuristic` and `llm` classification modes + - LLM-based salience filter integrated with Cortex VM running `llama-server` +- Logging improvements + - Added debug logs for salience mode, raw LLM output, and unexpected outputs + - Fail-closed behavior for unexpected LLM responses +- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers +- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply + +**[Cortex v0.3.0] - 2025-10-31** + +- **Cortex Service (FastAPI)** + - New standalone reasoning engine (`cortex/main.py`) with endpoints: + - `GET /health` – reports active backend + NeoMem status + - `POST /reason` – evaluates `{prompt, response}` pairs + - `POST /annotate` – experimental text analysis + - Background NeoMem health monitor (5-minute interval) + +- **Multi-Backend Reasoning Support** + - Environment-driven backend selection via `LLM_FORCE_BACKEND` + - Supports: Primary (vLLM MI50), Secondary (Ollama 3090), Cloud (OpenAI), Fallback (llama.cpp CPU) + - Per-backend model variables: `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL` + +- **Response Normalization Layer** + - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON + - Handles Ollama's multi-line streaming and Mythomax's missing punctuation issues + - Prints concise debug previews of merged content + +- **Environment Simplification** + - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file + - Removed reliance on shared/global env file to prevent cross-contamination + - Verified Docker Compose networking across containers + +**[NeoMem 0.1.2] - 2025-10-27** (formerly NVGRAM) + +- **Renamed NVGRAM to NeoMem** + - All future updates under name NeoMem + - Features unchanged + +**[NVGRAM 0.1.1] - 2025-10-08** + +- **Async Memory Rewrite (Stability + Safety Patch)** + - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes + - Added input sanitation to prevent embedding errors (`'list' object has no attribute 'replace'`) + - Implemented `flatten_messages()` helper in API layer to clean malformed payloads + - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware) + - Health endpoint (`/health`) returns structured JSON `{status, version, service}` + - Startup logs include sanitized 
embedder config with masked API keys + +**[NVGRAM 0.1.0] - 2025-10-07** + +- **Initial fork of Mem0 β†’ NVGRAM** + - Created fully independent local-first memory engine based on Mem0 OSS + - Renamed all internal modules, Docker services, environment variables from `mem0` β†’ `nvgram` + - New service name: `nvgram-api`, default port 7077 + - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility + - Uses FastAPI, Postgres, and Neo4j as persistent backends + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- **Ollama LLM reasoning** alongside OpenAI embeddings + - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090` + - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M` + - Split processing: Embeddings β†’ OpenAI `text-embedding-3-small`, LLM β†’ Local Ollama +- Added `.env.3090` template for self-hosted inference nodes +- Integrated runtime diagnostics and seeder progress tracking + - File-level + message-level progress bars + - Retry/back-off logic for timeouts (3 attempts) + - Event logging (`ADD / UPDATE / NONE`) for every memory record +- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers +- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090) + +**[Lyra-Mem0 0.3.1] - 2025-10-03** + +- HuggingFace TEI integration (local 3090 embedder) +- Dual-mode environment switch between OpenAI cloud and local +- CSV export of memories from Postgres (`payload->>'data'`) + +**[Lyra-Mem0 0.3.0]** + +- **Ollama embeddings** in Mem0 OSS container + - Configure `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, `OLLAMA_HOST` via `.env` + - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG` + - Installed `ollama` Python client into custom API container image +- `.env.3090` file for external embedding mode (3090 machine) +- Workflow for multiple embedding modes: LAN-based 3090/Ollama, Local-only CPU, OpenAI fallback + +**[Lyra-Mem0 v0.2.1]** + +- **Seeding pipeline** + - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0 + - Implemented incremental seeding option (skip existing memories, only add new ones) + - Verified insert process with Postgres-backed history DB + +**[Intake v0.1.0] - 2025-10-27** + +- Receives messages from relay and summarizes them in cascading format +- Continues to summarize smaller amounts of exchanges while generating large-scale conversational summaries (L20) +- Currently logs summaries to .log file in `/project-lyra/intake-logs/` + +**[Lyra-Cortex v0.2.0] - 2025-09-26** + +- Integrated **llama-server** on dedicated Cortex VM (Proxmox) +- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs +- Benchmarked Phi-3.5-mini performance: ~18 tokens/sec CPU-only on Ryzen 7 7800X +- Salience classification functional but sometimes inconsistent +- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier + - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval) + - More responsive but over-classifies messages as "salient" +- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models + +### Changed + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs +- Updated Docker Compose service dependency order + - `relay` now depends on `nvgram-api` healthcheck + - Removed `mem0` references and volumes +- Minor cleanup to Persona fetch block (null-checks and safer default persona string) + 
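The NVGRAM 0.1.1 entry earlier in this section describes a `flatten_messages()` helper that sanitizes payloads before embedding (the fix for `'list' object has no attribute 'replace'`). A rough sketch of that kind of sanitation, with the caveat that the real helper may differ:

```python
# Rough sketch of input sanitation along the lines of flatten_messages()
# (illustrative only; the actual NVGRAM helper may differ).
from typing import Any, Dict, List

def flatten_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Coerce every message 'content' field into a plain string so the
    embedder never receives a list or dict where it expects text."""
    flattened = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # e.g. content split into parts: join their text pieces
            content = " ".join(
                part.get("text", "") if isinstance(part, dict) else str(part)
                for part in content
            )
        elif not isinstance(content, str):
            content = str(content)
        flattened.append({"role": str(msg.get("role", "user")), "content": content.strip()})
    return flattened

# Example: a list-valued content no longer crashes the embedder call.
print(flatten_messages([{"role": "user", "content": [{"text": "hello"}, "world"]}]))
```
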
+**[Lyra-Core v0.3.1] - 2025-09-27** + +- Removed salience filter logic; Cortex is now default annotator +- All user messages stored in Mem0; no discard tier applied +- Cortex annotations (`metadata.cortex`) now attached to memories +- Debug logging improvements - Pretty-print Cortex annotations - Injected prompt preview - Memory search hit list with scores -- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. +- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed + +**[Lyra-Core v0.3.0] - 2025-09-26** + +- Refactored `server.js` to gate `mem.add()` calls behind salience filter +- Updated `.env` to support `SALIENCE_MODEL` + +**[Cortex v0.3.0] - 2025-10-31** + +- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend +- Enhanced startup logs to announce active backend, model, URL, and mode +- Improved error handling with clearer "Reasoning error" messages + +**[NVGRAM 0.1.1] - 2025-10-08** + +- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes +- Normalized indentation and cleaned duplicate `main.py` references +- Removed redundant `FastAPI()` app reinitialization +- Updated internal logging to INFO-level timing format +- Deprecated `@app.on_event("startup")` β†’ will migrate to `lifespan` handler in v0.1.2 + +**[NVGRAM 0.1.0] - 2025-10-07** + +- Removed dependency on external `mem0ai` SDK β€” all logic now local +- Re-pinned requirements: fastapi==0.115.8, uvicorn==0.34.0, pydantic==2.10.4, python-dotenv==1.0.1, psycopg>=3.2.8, ollama +- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- Updated `main.py` configuration block to load `LLM_PROVIDER`, `LLM_MODEL`, `OLLAMA_BASE_URL` + - Fallback to OpenAI if Ollama unavailable +- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py` +- Normalized `.env` loading so `mem0-api` and host environment share identical values +- Improved seeder logging and progress telemetry +- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` + +**[Lyra-Mem0 0.3.0]** + +- `docker-compose.yml` updated to mount local `main.py` and `.env.3090` +- Built custom Dockerfile (`mem0-api-server:latest`) extending base image with `pip install ollama` +- Updated `requirements.txt` to include `ollama` package +- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` +- Tested new embeddings path with curl `/memories` API call + +**[Lyra-Mem0 v0.2.1]** + +- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends +- Mounted host `main.py` into container so local edits persist across rebuilds +- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles +- Built custom Dockerfile (`mem0-api-server:latest`) including `pip install ollama` +- Updated `requirements.txt` with `ollama` dependency +- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP) +- Added logging to confirm model pulls and embedding requests ### Fixed -- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. -- Relay no longer β€œhangs” on malformed Cortex outputs. ---- +**[Lyra-Core 0.3.1] - 2025-10-09** -### [Lyra-Core] v0.3.0 β€” 2025-09-26 -#### Added -- Implemented **salience filtering** in Relay: - - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. 
- - Supports `heuristic` and `llm` classification modes. - - LLM-based salience filter integrated with Cortex VM running `llama-server`. -- Logging improvements: - - Added debug logs for salience mode, raw LLM output, and unexpected outputs. - - Fail-closed behavior for unexpected LLM responses. -- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. -- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. +- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling +- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500` +- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON) -#### Changed -- Refactored `server.js` to gate `mem.add()` calls behind salience filter. -- Updated `.env` to support `SALIENCE_MODEL`. +**[Lyra-Core v0.3.1] - 2025-09-27** -#### Known Issues -- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". -- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). -- CPU-only inference is functional but limited; larger models recommended once GPU is available. +- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner +- Relay no longer "hangs" on malformed Cortex outputs ---- +**[Cortex v0.3.0] - 2025-10-31** -### [Lyra-Core] v0.2.0 β€” 2025-09-24 -#### Added -- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. -- Implemented `sessionId` support (client-supplied, fallback to `default`). -- Added debug logs for memory add/search. -- Cleaned up Relay structure for clarity. +- Corrected broken vLLM endpoint routing (`/v1/completions`) +- Stabilized cross-container health reporting for NeoMem +- Resolved JSON parse failures caused by streaming chunk delimiters ---- +**[NVGRAM 0.1.1] - 2025-10-08** -### [Lyra-Core] v0.1.0 β€” 2025-09-23 -#### Added -- First working MVP of **Lyra Core Relay**. -- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). -- Memory integration with Mem0: - - `POST /memories` on each user message. - - `POST /search` before LLM call. -- Persona Sidecar integration (`GET /current`). -- OpenAI GPT + Ollama (Mythomax) support in Relay. -- Simple browser-based chat UI (talks to Relay at `http://:7078`). -- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. -- Working Neo4j + Postgres backing stores for Mem0. -- Initial MVP relay service with raw fetch calls to Mem0. -- Dockerized with basic healthcheck. +- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content +- Masked API key leaks from boot logs +- Ensured Neo4j reconnects gracefully on first retry -#### Fixed -- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). -- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. +**[Lyra-Mem0 0.3.2] - 2025-10-05** -#### Known Issues -- No feedback loop (thumbs up/down) yet. -- Forget/delete flow is manual (via memory IDs). -- Memory latency ~1–4s depending on embedding model. 
+- Resolved crash during startup: `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'` +- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors +- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests +- "Unknown event" warnings now safely ignored (no longer break seeding loop) +- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`) ---- +**[Lyra-Mem0 0.3.1] - 2025-10-03** -## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## +- `.env` CRLF vs LF line ending issues +- Local seeding now possible via HuggingFace server -## [NeoMem 0.1.2] - 2025-10-27 -### Changed -- **Renamed NVGRAM to neomem** - - All future updates will be under the name NeoMem. - - Features have not changed. +**[Lyra-Mem0 0.3.0]** -## [NVGRAM 0.1.1] - 2025-10-08 -### Added -- **Async Memory Rewrite (Stability + Safety Patch)** - - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. - - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). - - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. - - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). - - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. - - Startup logs now include **sanitized embedder config** with API keys masked for safety: - ``` - >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} - βœ… Connected to Neo4j on attempt 1 - 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized - ``` +- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`) +- Fixed config overwrite issue where rebuilding container restored stock `main.py` +- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes -### Changed -- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. -- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. -- Removed redundant `FastAPI()` app reinitialization. -- Updated internal logging to INFO-level timing format: - 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) -- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. +**[Lyra-Mem0 v0.2.1]** -### Fixed -- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. -- Masked API key leaks from boot logs. -- Ensured Neo4j reconnects gracefully on first retry. - -### Goals / Next Steps -- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. -- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. -- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. - ---- - -## [NVGRAM 0.1.0] - 2025-10-07 -### Added -- **Initial fork of Mem0 β†’ NVGRAM**: - - Created a fully independent local-first memory engine based on Mem0 OSS. - - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. - - New service name: **`nvgram-api`**, default port **7077**. 
- - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. - - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. - - Verified clean startup: - ``` - βœ… Connected to Neo4j on attempt 1 - INFO: Uvicorn running on http://0.0.0.0:7077 - ``` - - `/docs` and `/openapi.json` confirmed reachable and functional. - -### Changed -- Removed dependency on the external `mem0ai` SDK β€” all logic now local. -- Re-pinned requirements: - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama -- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. - -### Goals / Next Steps -- Integrate NVGRAM as the new default backend in Lyra Relay. -- Deprecate remaining Mem0 references and archive old configs. -- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). - ---- - -## [Lyra-Mem0 0.3.2] - 2025-10-05 -### Added -- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: - - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. - - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. - - Split processing pipeline: - - Embeddings β†’ OpenAI `text-embedding-3-small` - - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). -- Added `.env.3090` template for self-hosted inference nodes. -- Integrated runtime diagnostics and seeder progress tracking: - - File-level + message-level progress bars. - - Retry/back-off logic for timeouts (3 attempts). - - Event logging (`ADD / UPDATE / NONE`) for every memory record. -- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. -- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). - -### Changed -- Updated `main.py` configuration block to load: - - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. - - Fallback to OpenAI if Ollama unavailable. -- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. -- Normalized `.env` loading so `mem0-api` and host environment share identical values. -- Improved seeder logging and progress telemetry for clearer diagnostics. -- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. - -### Fixed -- Resolved crash during startup: - `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. -- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. -- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. -- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). -- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). - -### Observations -- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. -- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). - ---- - -## [Lyra-Mem0 0.3.1] - 2025-10-03 -### Added -- HuggingFace TEI integration (local 3090 embedder). -- Dual-mode environment switch between OpenAI cloud and local. -- CSV export of memories from Postgres (`payload->>'data'`). - -### Fixed -- `.env` CRLF vs LF line ending issues. 
-- Local seeding now possible via huggingface server running - ---- - -## [Lyra-mem0 0.3.0] -### Added -- Support for **Ollama embeddings** in Mem0 OSS container: - - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. - - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. - - Installed `ollama` Python client into custom API container image. -- `.env.3090` file created for external embedding mode (3090 machine): - - EMBEDDER_PROVIDER=ollama - - EMBEDDER_MODEL=mxbai-embed-large - - OLLAMA_HOST=http://10.0.0.3:11434 -- Workflow to support **multiple embedding modes**: - 1. Fast LAN-based 3090/Ollama embeddings - 2. Local-only CPU embeddings (Lyra Cortex VM) - 3. OpenAI fallback embeddings - -### Changed -- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. -- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. -- Updated `requirements.txt` to include `ollama` package. -- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). -- Tested new embeddings path with curl `/memories` API call. - -### Fixed -- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). -- Fixed config overwrite issue where rebuilding container restored stock `main.py`. -- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. - --- - -## [Lyra-mem0 v0.2.1] - -### Added -- **Seeding pipeline**: - - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. - - Implemented incremental seeding option (skip existing memories, only add new ones). - - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. -- **Ollama embedding support** in Mem0 OSS container: - - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. - - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. - - Set up three embedding modes: - 1. Fast LAN-based 3090/Ollama - 2. Local-only CPU model (Lyra Cortex VM) - 3. OpenAI fallback - -### Changed -- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. -- Mounted host `main.py` into container so local edits persist across rebuilds. -- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. -- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. -- Updated `requirements.txt` with `ollama` dependency. -- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). -- Added logging to confirm model pulls and embedding requests. - -### Fixed -- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. -- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. -- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. -- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. - -### Notes -- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). 
-- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. -- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. - ---- - -## [Lyra-Mem0 v0.2.0] - 2025-09-30 -### Added -- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` - - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. - - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. -- Verified REST API functionality: - - `POST /memories` works for adding memories. - - `POST /search` works for semantic search. -- Successful end-to-end test with persisted memory: - *"Likes coffee in the morning"* β†’ retrievable via search. βœ… - -### Changed -- Split architecture into **modular stacks**: - - `~/lyra-core` (Relay, Persona-Sidecar, etc.) - - `~/lyra-mem0` (Mem0 OSS memory stack) -- Removed old embedded mem0 containers from the Lyra-Core compose file. -- Added Lyra-Mem0 section in README.md. - -### Next Steps -- Wire **Relay β†’ Mem0 API** (integration not yet complete). -- Add integration tests to verify persistence and retrieval from within Lyra-Core. - ---- - -## 🧠 Lyra-Cortex ############################################################################## - -## [ Cortex - v0.5] -2025-11-13 - -### Added -- **New `reasoning.py` module** - - Async reasoning engine. - - Accepts user prompt, identity, RAG block, and reflection notes. - - Produces draft internal answers. - - Uses primary backend (vLLM). -- **New `reflection.py` module** - - Fully async. - - Produces actionable JSON β€œinternal notes.” - - Enforces strict JSON schema and fallback parsing. - - Forces cloud backend (`backend_override="cloud"`). -- Integrated `refine.py` into Cortex reasoning pipeline: - - New stage between reflection and persona. - - Runs exclusively on primary vLLM backend (MI50). - - Produces final, internally consistent output for downstream persona layer. -- **Backend override system** - - Each LLM call can now select its own backend. - - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. - -- **identity loader** - - Added `identity.py` with `load_identity()` for consistent persona retrieval. - -- **ingest_handler** - - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. - -### Changed -- Unified LLM backend URL handling across Cortex: - - ENV variables must now contain FULL API endpoints. - - Removed all internal path-appending (e.g. `.../v1/completions`). - - `llm_router.py` rewritten to use env-provided URLs as-is. - - Ensures consistent behavior between draft, reflection, refine, and persona. -- **Rebuilt `main.py`** - - Removed old annotation/analysis logic. - - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. - - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). - - Async path throughout Cortex. - -- **Refactored `llm_router.py`** - - Removed old fallback logic during overrides. - - OpenAI requests now use `/v1/chat/completions`. - - Added proper OpenAI Authorization headers. - - Distinct payload format for vLLM vs OpenAI. - - Unified, correct parsing across models. - -- **Simplified Cortex architecture** - - Removed deprecated β€œcontext.py” and old reasoning code. - - Relay completely decoupled from smart behavior. - -- Updated environment specification: - - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. 
- - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). - - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. - -### Fixed -- Resolved endpoint conflict where: - - Router expected base URLs. - - Refine expected full URLs. - - Refine always fell back due to hitting incorrect endpoint. - - Fixed by standardizing full-URL behavior across entire system. -- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). -- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. -- No more double-routing through vLLM during reflection. -- Corrected async/sync mismatch in multiple locations. -- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. - -### Removed -- Legacy `annotate`, `reason_check` glue logic from old architecture. -- Old backend probing junk code. -- Stale imports and unused modules leftover from previous prototype. - -### Verified -- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. -- refine shows `used_primary_backend: true` and no fallback. -- Manual curl test confirms endpoint accuracy. +- Seeder process originally failed on old memories β€” now skips duplicates and continues batch +- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image +- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild +- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch ### Known Issues -- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. -- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). -### Pending / Known Issues -- **RAG service does not exist** β€” requires containerized FastAPI service. -- Reasoning layer lacks self-revision loop (deliberate thought cycle). -- No speak/persona generation layer yet (`speak.py` planned). -- Intake summaries not yet routing into RAG or reflection layer. -- No refinement engine between reasoning and speak. +**[Lyra-Core v0.3.0] - 2025-09-26** -### Notes -This is the largest structural change to Cortex so far. -It establishes: -- multi-model cognition -- clean layering -- identity + reflection separation -- correct async code -- deterministic backend routing -- predictable JSON reflection +- Small models (e.g. 
Qwen2-0.5B) tend to over-classify as "salient" +- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi") +- CPU-only inference is functional but limited; larger models recommended once GPU available -The system is now ready for: -- refinement loops -- persona-speaking layer -- containerized RAG -- long-term memory integration -- true emergent-behavior experiments +**[Lyra-Cortex v0.2.0] - 2025-09-26** +- Small models tend to drift or over-classify +- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models +- Need to set up `systemd` service for `llama-server` to auto-start on VM reboot +### Observations + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- Stable GPU utilization: ~8 GB VRAM @ 92% load, β‰ˆ 67Β°C under sustained seeding +- Next revision will re-format seed JSON to preserve `role` context (user vs assistant) + +**[Lyra-Mem0 v0.2.1]** + +- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI's schema) +- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors +- Seeder workflow validated but should be wrapped in repeatable weekly run for full Cloudβ†’Local sync + +### Next Steps + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- Add salience visualization (e.g., memory weights displayed in injected system message) +- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring +- Add relay auto-retry for transient 500 responses from NVGRAM + +**[NVGRAM 0.1.1] - 2025-10-08** + +- Integrate salience scoring and embedding confidence weight fields in Postgres schema +- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall +- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2 + +**[NVGRAM 0.1.0] - 2025-10-07** + +- Integrate NVGRAM as new default backend in Lyra Relay +- Deprecate remaining Mem0 references and archive old configs +- Begin versioning as standalone project (`nvgram-core`, `nvgram-api`, etc.) + +**[Intake v0.1.0] - 2025-10-27** + +- Feed intake into NeoMem +- Generate daily/hourly overall summary (IE: Today Brian and Lyra worked on x, y, and z) +- Generate session-aware summaries with own intake hopper + +--- + +## [0.2.x] - 2025-09-30 to 2025-09-24 -## [ Cortex - v0.4.1] - 2025-11-5 ### Added -- **RAG intergration** - - Added rag.py with query_rag() and format_rag_block(). - - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. - - Synthesized answers and top excerpts are injected into the reasoning prompt. -### Changed ### -- **Revised /reason endpoint.** - - Now builds unified context blocks: - - [Intake] β†’ recent summaries - - [RAG] β†’ contextual knowledge - - [User Message] β†’ current input - - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. - - Returns cortex_prompt, draft_output, final_output, and normalized reflection. -- **Reflection Pipeline Stability** - - Cleaned parsing to normalize JSON vs. text reflections. - - Added fallback handling for malformed or non-JSON outputs. - - Log system improved to show raw JSON, extracted fields, and normalized summary. -- **Async Summarization (Intake v0.2.1)** - - Intake summaries now run in background threads to avoid blocking Cortex. - - Summaries (L1–L∞) logged asynchronously with [BG] tags. -- **Environment & Networking Fixes** - - Verified .env variables propagate correctly inside the Cortex container. 
- - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). - - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). - -- **Behavioral Updates** - - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). - - RAG context successfully grounds reasoning outputs. - - Intake and NeoMem confirmed receiving summaries via /add_exchange. - - Log clarity pass: all reflective and contextual blocks clearly labeled. -- **Known Gaps / Next Steps** - - NeoMem Tuning - - Improve retrieval latency and relevance. - - Implement a dedicated /reflections/recent endpoint for Cortex. - - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). -- **Cortex Enhancements** - - Add persistent reflection recall (use prior reflections as meta-context). - - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). - - Tighten temperature and prompt control for factual consistency. -- **RAG Optimization** - -Add source ranking, filtering, and multi-vector hybrid search. - -Cache RAG responses per session to reduce duplicate calls. -- **Documentation / Monitoring** - -Add health route for RAG and Intake summaries. - -Include internal latency metrics in /health endpoint. +**[Lyra-Mem0 v0.2.0] - 2025-09-30** -Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. +- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` + - Includes Postgres (pgvector), Qdrant, Neo4j, and SQLite for history tracking + - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building Mem0 API server +- Verified REST API functionality + - `POST /memories` works for adding memories + - `POST /search` works for semantic search +- Successful end-to-end test with persisted memory: *"Likes coffee in the morning"* β†’ retrievable via search βœ… -## [Cortex - v0.3.0] – 2025-10-31 -### Added -- **Cortex Service (FastAPI)** - - New standalone reasoning engine (`cortex/main.py`) with endpoints: - - `GET /health` – reports active backend + NeoMem status. - - `POST /reason` – evaluates `{prompt, response}` pairs. - - `POST /annotate` – experimental text analysis. - - Background NeoMem health monitor (5-minute interval). +**[Lyra-Core v0.2.0] - 2025-09-24** -- **Multi-Backend Reasoning Support** - - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. - - Supports: - - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) - - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) - - **Cloud** β†’ OpenAI API - - **Fallback** β†’ llama.cpp (CPU) - - Introduced per-backend model variables: - `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. - -- **Response Normalization Layer** - - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. - - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. - - Prints concise debug previews of merged content. - -- **Environment Simplification** - - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. - - Removed reliance on shared/global env file to prevent cross-contamination. - - Verified Docker Compose networking across containers. 
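The Response Normalization Layer above (`normalize_llm_response()`) merges Ollama's line-delimited streaming output into a single string. A simplified sketch of that merging step, assuming each streamed line is a small JSON object with a `response` field; the real function also repairs malformed JSON and handles other backends, which is omitted here.

```python
# Simplified sketch of merging Ollama's newline-delimited streaming output
# into one response string (the real normalize_llm_response() does more).
import json

def normalize_llm_response(raw: str) -> str:
    merged = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            # Keep unparseable fragments rather than dropping content.
            merged.append(line)
            continue
        merged.append(chunk.get("response", ""))
    return "".join(merged)

# Example with two streamed chunks followed by the terminal "done" record.
raw = '{"response": "Hello"}\n{"response": " world"}\n{"done": true}'
print(normalize_llm_response(raw))  # -> "Hello world"
```
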
+- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls +- Implemented `sessionId` support (client-supplied, fallback to `default`) +- Added debug logs for memory add/search +- Cleaned up Relay structure for clarity ### Changed -- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. -- Enhanced startup logs to announce active backend, model, URL, and mode. -- Improved error handling with clearer β€œReasoning error” messages. + +**[Lyra-Mem0 v0.2.0] - 2025-09-30** + +- Split architecture into modular stacks: + - `~/lyra-core` (Relay, Persona-Sidecar, etc.) + - `~/lyra-mem0` (Mem0 OSS memory stack) +- Removed old embedded mem0 containers from Lyra-Core compose file +- Added Lyra-Mem0 section in README.md + +### Next Steps + +**[Lyra-Mem0 v0.2.0] - 2025-09-30** + +- Wire **Relay β†’ Mem0 API** (integration not yet complete) +- Add integration tests to verify persistence and retrieval from within Lyra-Core + +--- + +## [0.1.x] - 2025-09-25 to 2025-09-23 + +### Added + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Initial standalone RAG module for Project Lyra +- Persistent ChromaDB vector store (`./chromadb`) +- Importer `rag_chat_import.py` with: + - Recursive folder scanning and category tagging + - Smart chunking (~5k chars) + - SHA-1 deduplication and chat-ID metadata + - Timestamp fields (`file_modified`, `imported_at`) + - Background-safe operation (`nohup`/`tmux`) +- 68 Lyra-category chats imported: + - 6,556 new chunks added + - 1,493 duplicates skipped + - 7,997 total vectors stored + +**[Lyra_RAG v0.1.0 API] - 2025-11-07** + +- `/rag/search` FastAPI endpoint implemented (port 7090) +- Supports natural-language queries and returns top related excerpts +- Added answer synthesis step using `gpt-4o-mini` + +**[Lyra-Core v0.1.0] - 2025-09-23** + +- First working MVP of **Lyra Core Relay** +- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible) +- Memory integration with Mem0: + - `POST /memories` on each user message + - `POST /search` before LLM call +- Persona Sidecar integration (`GET /current`) +- OpenAI GPT + Ollama (Mythomax) support in Relay +- Simple browser-based chat UI (talks to Relay at `http://:7078`) +- `.env` standardization for Relay + Mem0 + Postgres + Neo4j +- Working Neo4j + Postgres backing stores for Mem0 +- Initial MVP relay service with raw fetch calls to Mem0 +- Dockerized with basic healthcheck + +**[Lyra-Cortex v0.1.0] - 2025-09-25** + +- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD) +- Built **llama.cpp** with `llama-server` target via CMake +- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model +- Verified API compatibility at `/v1/chat/completions` +- Local test successful via `curl` β†’ ~523 token response generated +- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X) +- Confirmed usable for salience scoring, summarization, and lightweight reasoning ### Fixed -- Corrected broken vLLM endpoint routing (`/v1/completions`). -- Stabilized cross-container health reporting for NeoMem. -- Resolved JSON parse failures caused by streaming chunk delimiters. 
+ +**[Lyra-Core v0.1.0] - 2025-09-23** + +- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only) +- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env` + +### Verified + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Successful recall of Lyra-Core development history (v0.3.0 snapshot) +- Correct metadata and category tagging for all new imports + +### Known Issues + +**[Lyra-Core v0.1.0] - 2025-09-23** + +- No feedback loop (thumbs up/down) yet +- Forget/delete flow is manual (via memory IDs) +- Memory latency ~1–4s depending on embedding model + +### Next Planned + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Optional `where` filter parameter for category/date queries +- Graceful "no results" handler for empty retrievals +- `rag_docs_import.py` for PDFs and other document types --- - -## Next Planned – [v0.4.0] -### Planned Additions -- **Reflection Mode** - - Introduce `REASONING_MODE=factcheck|reflection`. - - Output schema: - ```json - { "insight": "...", "evaluation": "...", "next_action": "..." } - ``` - -- **Cortex-First Pipeline** - - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. - - Allows Lyra to β€œthink before speaking.” - -- **Verifier Stub** - - New `/verify` endpoint for search-based factual grounding. - - Asynchronous external truth checking. - -- **Memory Integration** - - Feed reflective outputs into NeoMem. - - Enable β€œdream” cycles for autonomous self-review. - ---- - -**Status:** 🟒 Stable Core – Multi-backend reasoning operational. -**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.* - ---- - -### [Intake] v0.1.0 - 2025-10-27 - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Currently logs summaries to .log file in /project-lyra/intake-logs/ - ** Next Steps ** - - Feed intake into neomem. - - Generate a daily/hourly/etc overall summary, (IE: Today Brian and Lyra worked on x, y, and z) - - Generate session aware summaries, with its own intake hopper. - - -### [Lyra-Cortex] v0.2.0 β€” 2025-09-26 -**Added -- Integrated **llama-server** on dedicated Cortex VM (Proxmox). -- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs. -- Benchmarked Phi-3.5-mini performance: - - ~18 tokens/sec CPU-only on Ryzen 7 7800X. - - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming"). -- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier: - - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval). - - More responsive but over-classifies messages as β€œsalient.” -- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models. - -** Known Issues -- Small models tend to drift or over-classify. -- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models. -- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot. - ---- - -### [Lyra-Cortex] v0.1.0 β€” 2025-09-25 -#### Added -- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD). -- Built **llama.cpp** with `llama-server` target via CMake. -- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model. -- Verified **API compatibility** at `/v1/chat/completions`. -- Local test successful via `curl` β†’ ~523 token response generated. -- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X). 
-- Confirmed usable for salience scoring, summarization, and lightweight reasoning. diff --git a/cortex/Dockerfile b/cortex/Dockerfile index 784f720..77cd233 100644 --- a/cortex/Dockerfile +++ b/cortex/Dockerfile @@ -4,4 +4,6 @@ COPY requirements.txt . RUN pip install -r requirements.txt COPY . . EXPOSE 7081 +# NOTE: Running with single worker to maintain SESSIONS global state in Intake. +# If scaling to multiple workers, migrate SESSIONS to Redis or shared storage. CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7081"] diff --git a/cortex/context.py b/cortex/context.py index aff3327..341946d 100644 --- a/cortex/context.py +++ b/cortex/context.py @@ -84,6 +84,7 @@ def _init_session(session_id: str) -> Dict[str, Any]: "mood": "neutral", # Future: mood tracking "active_project": None, # Future: project context "message_count": 0, + "message_history": [], } @@ -275,6 +276,13 @@ async def collect_context(session_id: str, user_prompt: str) -> Dict[str, Any]: state["last_user_message"] = user_prompt state["last_timestamp"] = now state["message_count"] += 1 + # Save user turn to history + state["message_history"].append({ + "user": user_prompt, + "assistant": "" # assistant reply filled later by update_last_assistant_message() + }) + + # F. Assemble unified context context_state = { @@ -311,20 +319,27 @@ async def collect_context(session_id: str, user_prompt: str) -> Dict[str, Any]: # ----------------------------- def update_last_assistant_message(session_id: str, message: str) -> None: """ - Update session state with assistant's response. - - Called by router.py after persona layer completes. - - Args: - session_id: Session identifier - message: Assistant's final response text + Update session state with assistant's response and complete + the last turn inside message_history. """ - if session_id in SESSION_STATE: - SESSION_STATE[session_id]["last_assistant_message"] = message - SESSION_STATE[session_id]["last_timestamp"] = datetime.now() - logger.debug(f"Updated assistant message for session {session_id}") - else: + session = SESSION_STATE.get(session_id) + if not session: logger.warning(f"Attempted to update non-existent session: {session_id}") + return + + # Update last assistant message + timestamp + session["last_assistant_message"] = message + session["last_timestamp"] = datetime.now() + + # Fill in assistant reply for the most recent turn + history = session.get("message_history", []) + if history: + # history entry already contains {"user": "...", "assistant": "...?"} + history[-1]["assistant"] = message + + if VERBOSE_DEBUG: + logger.debug(f"Updated assistant message for session {session_id}") + def get_session_state(session_id: str) -> Optional[Dict[str, Any]]: diff --git a/cortex/intake/__init__.py b/cortex/intake/__init__.py new file mode 100644 index 0000000..c967d4a --- /dev/null +++ b/cortex/intake/__init__.py @@ -0,0 +1,18 @@ +""" +Intake module - short-term memory summarization. + +Runs inside the Cortex container as a pure Python module. +No standalone API server - called internally by Cortex. 
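+
+Illustrative usage (a minimal sketch; "dev" and the message strings are
+placeholder values, but the exchange keys mirror what router.py's /ingest
+endpoint passes in):
+
+    from intake import SESSIONS, add_exchange_internal, summarize_context
+
+    add_exchange_internal({
+        "session_id": "dev",
+        "user_msg": "hello",
+        "assistant_msg": "hi there",
+    })
+    len(SESSIONS["dev"]["buffer"])  # -> 1 in a fresh session
+
+    # Later, during /reason, the buffered exchanges can be summarized:
+    #   summaries = await summarize_context("dev", list(SESSIONS["dev"]["buffer"]))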
+""" + +from .intake import ( + SESSIONS, + add_exchange_internal, + summarize_context, +) + +__all__ = [ + "SESSIONS", + "add_exchange_internal", + "summarize_context", +] diff --git a/cortex/intake/intake.py b/cortex/intake/intake.py index 897acf8..50b192d 100644 --- a/cortex/intake/intake.py +++ b/cortex/intake/intake.py @@ -1,18 +1,29 @@ import os +import json from datetime import datetime from typing import List, Dict, Any, TYPE_CHECKING from collections import deque +from llm.llm_router import call_llm +# ------------------------------------------------------------------- +# Global Short-Term Memory (new Intake) +# ------------------------------------------------------------------- +SESSIONS: dict[str, dict] = {} # session_id β†’ { buffer: deque, created_at: timestamp } + +# Diagnostic: Verify module loads only once +print(f"[Intake Module Init] SESSIONS object id: {id(SESSIONS)}, module: {__name__}") + +# L10 / L20 history lives here too +L10_HISTORY: Dict[str, list[str]] = {} +L20_HISTORY: Dict[str, list[str]] = {} + +from llm.llm_router import call_llm # Use Cortex's shared LLM router if TYPE_CHECKING: + # Only for type hints β€” do NOT redefine SESSIONS here from collections import deque as _deque - SESSIONS: dict - L10_HISTORY: dict - L20_HISTORY: dict def bg_summarize(session_id: str) -> None: ... -from llm.llm_router import call_llm # use Cortex's shared router - # ───────────────────────────── # Config # ───────────────────────────── @@ -220,20 +231,24 @@ def push_to_neomem(summary: str, session_id: str, level: str) -> None: # ───────────────────────────── # Main entrypoint for Cortex # ───────────────────────────── - -async def summarize_context( - session_id: str, - exchanges: List[Dict[str, Any]], -) -> Dict[str, Any]: +async def summarize_context(session_id: str, exchanges: list[dict]): """ - Main API used by Cortex: + Internal summarizer that uses Cortex's LLM router. + Produces L1 / L5 / L10 / L20 / L30 summaries. - summaries = await summarize_context(session_id, exchanges) - - `exchanges` should be the recent conversation buffer for that session. + Args: + session_id: The conversation/session ID + exchanges: A list of {"user_msg": ..., "assistant_msg": ..., "timestamp": ...} """ - buf = list(exchanges) - if not buf: + + # Build raw conversation text + convo_lines = [] + for ex in exchanges: + convo_lines.append(f"User: {ex.get('user_msg','')}") + convo_lines.append(f"Assistant: {ex.get('assistant_msg','')}") + convo_text = "\n".join(convo_lines) + + if not convo_text.strip(): return { "session_id": session_id, "exchange_count": 0, @@ -242,31 +257,72 @@ async def summarize_context( "L10": "", "L20": "", "L30": "", - "last_updated": None, + "last_updated": datetime.now().isoformat() } - # Base levels - L1 = await summarize_L1(buf) - L5 = await summarize_L5(buf) - L10 = await summarize_L10(session_id, buf) - L20 = await summarize_L20(session_id) - L30 = await summarize_L30(session_id) + # Prompt the LLM (internal β€” no HTTP) + prompt = f""" +Summarize the conversation below into multiple compression levels. 
- # Push the "interesting" tiers into NeoMem - push_to_neomem(L10, session_id, "L10") - push_to_neomem(L20, session_id, "L20") - push_to_neomem(L30, session_id, "L30") +Conversation: +---------------- +{convo_text} +---------------- - return { - "session_id": session_id, - "exchange_count": len(buf), - "L1": L1, - "L5": L5, - "L10": L10, - "L20": L20, - "L30": L30, - "last_updated": datetime.now().isoformat(), - } +Output strictly in JSON with keys: +L1 β†’ ultra short summary (1–2 sentences max) +L5 β†’ short summary +L10 β†’ medium summary +L20 β†’ detailed overview +L30 β†’ full detailed summary + +JSON only. No text outside JSON. +""" + + try: + llm_response = await call_llm( + prompt, + temperature=0.2 + ) + + + # LLM should return JSON, parse it + summary = json.loads(llm_response) + + return { + "session_id": session_id, + "exchange_count": len(exchanges), + "L1": summary.get("L1", ""), + "L5": summary.get("L5", ""), + "L10": summary.get("L10", ""), + "L20": summary.get("L20", ""), + "L30": summary.get("L30", ""), + "last_updated": datetime.now().isoformat() + } + + except Exception as e: + return { + "session_id": session_id, + "exchange_count": len(exchanges), + "L1": f"[Error summarizing: {str(e)}]", + "L5": "", + "L10": "", + "L20": "", + "L30": "", + "last_updated": datetime.now().isoformat() + } + +# ───────────────────────────────── +# Background summarization stub +# ───────────────────────────────── +def bg_summarize(session_id: str): + """ + Placeholder for background summarization. + Actual summarization happens during /reason via summarize_context(). + + This function exists to prevent NameError when called from add_exchange_internal(). + """ + print(f"[Intake] Exchange added for {session_id}. Will summarize on next /reason call.") # ───────────────────────────── # Internal entrypoint for Cortex @@ -283,15 +339,23 @@ def add_exchange_internal(exchange: dict): exchange["timestamp"] = datetime.now().isoformat() + # DEBUG: Verify we're using the module-level SESSIONS + print(f"[add_exchange_internal] SESSIONS object id: {id(SESSIONS)}, current sessions: {list(SESSIONS.keys())}") + # Ensure session exists if session_id not in SESSIONS: SESSIONS[session_id] = { "buffer": deque(maxlen=200), "created_at": datetime.now() } + print(f"[add_exchange_internal] Created new session: {session_id}") + else: + print(f"[add_exchange_internal] Using existing session: {session_id}") # Append exchange into the rolling buffer SESSIONS[session_id]["buffer"].append(exchange) + buffer_len = len(SESSIONS[session_id]["buffer"]) + print(f"[add_exchange_internal] Added exchange to {session_id}, buffer now has {buffer_len} items") # Trigger summarization immediately try: diff --git a/cortex/router.py b/cortex/router.py index 0beb457..e6ba161 100644 --- a/cortex/router.py +++ b/cortex/router.py @@ -197,26 +197,110 @@ class IngestPayload(BaseModel): user_msg: str assistant_msg: str + @cortex_router.post("/ingest") -async def ingest_stub(): - # Intake is internal now β€” this endpoint is only for compatibility. - return {"status": "ok", "note": "intake is internal now"} +async def ingest(payload: IngestPayload): + """ + Receives (session_id, user_msg, assistant_msg) from Relay + and pushes directly into Intake's in-memory buffer. - - # 1. Update Cortex session state - update_last_assistant_message(payload.session_id, payload.assistant_msg) - - # 2. Feed Intake internally (no HTTP) + Uses lenient error handling - always returns success to avoid + breaking the chat pipeline. + """ try: + # 1. 
Update Cortex session state + update_last_assistant_message(payload.session_id, payload.assistant_msg) + except Exception as e: + logger.warning(f"[INGEST] Failed to update session state: {e}") + # Continue anyway (lenient mode) + + try: + # 2. Feed Intake internally (no HTTP) add_exchange_internal({ "session_id": payload.session_id, "user_msg": payload.user_msg, "assistant_msg": payload.assistant_msg, }) - logger.debug(f"[INGEST] Added exchange to Intake for {payload.session_id}") except Exception as e: - logger.warning(f"[INGEST] Failed to add exchange to Intake: {e}") + logger.warning(f"[INGEST] Failed to add to Intake: {e}") + # Continue anyway (lenient mode) - return {"ok": True, "session_id": payload.session_id} + # Always return success (user requirement: never fail chat pipeline) + return { + "status": "ok", + "session_id": payload.session_id + } + +# ----------------------------- +# Debug endpoint: summarized context +# ----------------------------- +@cortex_router.get("/debug/summary") +async def debug_summary(session_id: str): + """ + Diagnostic endpoint that runs Intake's summarize_context() for a session. + + Shows exactly what L1/L5/L10/L20/L30 summaries would look like + inside the actual Uvicorn worker, using the real SESSIONS buffer. + """ + from intake.intake import SESSIONS, summarize_context + + # Validate session + session = SESSIONS.get(session_id) + if not session: + return {"error": "session not found", "session_id": session_id} + + # Convert deque into the structure summarize_context expects + buffer = session["buffer"] + exchanges = [ + { + "user_msg": ex.get("user_msg", ""), + "assistant_msg": ex.get("assistant_msg", ""), + } + for ex in buffer + ] + + # πŸ”₯ CRITICAL FIX β€” summarize_context is async + summary = await summarize_context(session_id, exchanges) + + return { + "session_id": session_id, + "buffer_size": len(buffer), + "exchanges_preview": exchanges[-5:], # last 5 items + "summary": summary + } + +# ----------------------------- +# Debug endpoint for SESSIONS +# ----------------------------- +@cortex_router.get("/debug/sessions") +async def debug_sessions(): + """ + Diagnostic endpoint to inspect SESSIONS from within the running Uvicorn worker. + This shows the actual state of the in-memory SESSIONS dict. + """ + from intake.intake import SESSIONS + + sessions_data = {} + for session_id, session_info in SESSIONS.items(): + buffer = session_info["buffer"] + sessions_data[session_id] = { + "created_at": session_info["created_at"].isoformat(), + "buffer_size": len(buffer), + "buffer_maxlen": buffer.maxlen, + "recent_exchanges": [ + { + "user_msg": ex.get("user_msg", "")[:100], + "assistant_msg": ex.get("assistant_msg", "")[:100], + "timestamp": ex.get("timestamp", "") + } + for ex in list(buffer)[-5:] # Last 5 exchanges + ] + } + + return { + "sessions_object_id": id(SESSIONS), + "total_sessions": len(SESSIONS), + "sessions": sessions_data + } diff --git a/vllm-mi50.md b/vllm-mi50.md deleted file mode 100644 index c8f6fd4..0000000 --- a/vllm-mi50.md +++ /dev/null @@ -1,416 +0,0 @@ -Here you go β€” a **clean, polished, ready-to-drop-into-Trilium or GitHub** Markdown file. - -If you want, I can also auto-generate a matching `/docs/vllm-mi50/` folder structure and a mini-ToC. 
- ---- - -# **MI50 + vLLM + Proxmox LXC Setup Guide** - -### *End-to-End Field Manual for gfx906 LLM Serving* - -**Version:** 1.0 -**Last updated:** 2025-11-17 - ---- - -## **πŸ“Œ Overview** - -This guide documents how to run a **vLLM OpenAI-compatible server** on an -**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN, -and wire it into **Project Lyra's Cortex reasoning layer**. - -This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again. - ---- - -## **1. What This Stack Looks Like** - -``` -Proxmox Host - β”œβ”€ AMD Instinct MI50 (gfx906) - β”œβ”€ AMDGPU + ROCm stack - └─ LXC Container (CT 201: cortex-gpu) - β”œβ”€ Ubuntu 24.04 - β”œβ”€ Docker + docker compose - β”œβ”€ vLLM inside Docker (nalanzeyu/vllm-gfx906) - β”œβ”€ GPU passthrough via /dev/kfd + /dev/dri + PCI bind - └─ vLLM API exposed on :8000 -Lyra Cortex (VM/Server) - └─ LLM_PRIMARY_URL=http://10.0.0.43:8000 -``` - ---- - -## **2. Proxmox Host β€” GPU Setup** - -### **2.1 Confirm MI50 exists** - -```bash -lspci -nn | grep -i 'vega\|instinct\|radeon' -``` - -You should see something like: - -``` -0a:00.0 Display controller: AMD Instinct MI50 (gfx906) -``` - -### **2.2 Load AMDGPU driver** - -The main pitfall after **any host reboot**. - -```bash -modprobe amdgpu -``` - -If you skip this, the LXC container won't see the GPU. - ---- - -## **3. LXC Container Configuration (CT 201)** - -The container ID is **201**. -Config file is at: - -``` -/etc/pve/lxc/201.conf -``` - -### **3.1 Working 201.conf** - -Paste this *exact* version: - -```ini -arch: amd64 -cores: 4 -hostname: cortex-gpu -memory: 16384 -swap: 512 -ostype: ubuntu -onboot: 1 -startup: order=2,up=10,down=10 -net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth -rootfs: local-lvm:vm-201-disk-0,size=200G -unprivileged: 0 - -# Docker in LXC requires this -features: keyctl=1,nesting=1 -lxc.apparmor.profile: unconfined -lxc.cap.drop: - -# --- GPU passthrough for ROCm (MI50) --- -lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666 -lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir -lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir -lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir - -# Bind the MI50 PCI device -lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file - -# Allow GPU-related character devices -lxc.cgroup2.devices.allow: c 226:* rwm -lxc.cgroup2.devices.allow: c 29:* rwm -lxc.cgroup2.devices.allow: c 189:* rwm -lxc.cgroup2.devices.allow: c 238:* rwm -lxc.cgroup2.devices.allow: c 241:* rwm -lxc.cgroup2.devices.allow: c 242:* rwm -lxc.cgroup2.devices.allow: c 243:* rwm -lxc.cgroup2.devices.allow: c 244:* rwm -lxc.cgroup2.devices.allow: c 245:* rwm -lxc.cgroup2.devices.allow: c 246:* rwm -lxc.cgroup2.devices.allow: c 247:* rwm -lxc.cgroup2.devices.allow: c 248:* rwm -lxc.cgroup2.devices.allow: c 249:* rwm -lxc.cgroup2.devices.allow: c 250:* rwm -lxc.cgroup2.devices.allow: c 510:0 rwm -``` - -### **3.2 Restart sequence** - -```bash -pct stop 201 -modprobe amdgpu -pct start 201 -pct enter 201 -``` - ---- - -## **4. Inside CT 201 β€” Verifying ROCm + GPU Visibility** - -### **4.1 Check device nodes** - -```bash -ls -l /dev/kfd -ls -l /dev/dri -ls -l /opt/rocm -``` - -All must exist. 
- -### **4.2 Validate GPU via rocminfo** - -```bash -/opt/rocm/bin/rocminfo | grep -i gfx -``` - -You need to see: - -``` -gfx906 -``` - -If you see **nothing**, the GPU isn’t passed through β€” restart and re-check the host steps. - ---- - -## **5. Install Docker in the LXC (Ubuntu 24.04)** - -This container runs Docker inside LXC (nesting enabled). - -```bash -apt update -apt install -y ca-certificates curl gnupg - -install -m 0755 -d /etc/apt/keyrings -curl -fsSL https://download.docker.com/linux/ubuntu/gpg \ - | gpg --dearmor -o /etc/apt/keyrings/docker.gpg -chmod a+r /etc/apt/keyrings/docker.gpg - -echo \ - "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \ - https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \ - > /etc/apt/sources.list.d/docker.list - -apt update -apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -``` - -Check: - -```bash -docker --version -docker compose version -``` - ---- - -## **6. Running vLLM Inside CT 201 via Docker** - -### **6.1 Create directory** - -```bash -mkdir -p /root/vllm -cd /root/vllm -``` - -### **6.2 docker-compose.yml** - -Save this exact file as `/root/vllm/docker-compose.yml`: - -```yaml -version: "3.9" - -services: - vllm-mi50: - image: nalanzeyu/vllm-gfx906:latest - container_name: vllm-mi50 - restart: unless-stopped - ports: - - "8000:8000" - environment: - VLLM_ROLE: "APIServer" - VLLM_MODEL: "/model" - VLLM_LOGGING_LEVEL: "INFO" - command: > - vllm serve /model - --host 0.0.0.0 - --port 8000 - --dtype float16 - --max-model-len 4096 - --api-type openai - devices: - - "/dev/kfd:/dev/kfd" - - "/dev/dri:/dev/dri" - volumes: - - /opt/rocm:/opt/rocm:ro -``` - -### **6.3 Start vLLM** - -```bash -docker compose up -d -docker compose logs -f -``` - -When healthy, you’ll see: - -``` -(APIServer) Application startup complete. -``` - -and periodic throughput logs. - ---- - -## **7. Test vLLM API** - -### **7.1 From Proxmox host** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping","max_tokens":5}' -``` - -Should respond like: - -```json -{"choices":[{"text":"-pong"}]} -``` - -### **7.2 From Cortex machine** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}' -``` - ---- - -## **8. Wiring into Lyra Cortex** - -In `cortex` container’s `docker-compose.yml`: - -```yaml -environment: - LLM_PRIMARY_URL: http://10.0.0.43:8000 -``` - -Not `/v1/completions` because the router appends that automatically. - -In `cortex/.env`: - -```env -LLM_FORCE_BACKEND=primary -LLM_MODEL=/model -``` - -Test: - -```bash -curl -X POST http://10.0.0.41:7081/reason \ - -H "Content-Type: application/json" \ - -d '{"prompt":"test vllm","session_id":"dev"}' -``` - -If you get a meaningful response: **Cortex β†’ vLLM is online**. - ---- - -## **9. Common Failure Modes (And Fixes)** - -### **9.1 β€œFailed to infer device type”** - -vLLM cannot see any ROCm devices. - -Fix: - -```bash -# On host -modprobe amdgpu -pct stop 201 -pct start 201 -# In container -/opt/rocm/bin/rocminfo | grep -i gfx -docker compose up -d -``` - -### **9.2 GPU disappears after reboot** - -Same fix: - -```bash -modprobe amdgpu -pct stop 201 -pct start 201 -``` - -### **9.3 Invalid image name** - -If you see pull errors: - -``` -pull access denied for nalanzeuy... 
-``` - -Use: - -``` -image: nalanzeyu/vllm-gfx906 -``` - -### **9.4 Double `/v1` in URL** - -Ensure: - -``` -LLM_PRIMARY_URL=http://10.0.0.43:8000 -``` - -Router appends `/v1/completions`. - ---- - -## **10. Daily / Reboot Ritual** - -### **On Proxmox host** - -```bash -modprobe amdgpu -pct stop 201 -pct start 201 -``` - -### **Inside CT 201** - -```bash -/opt/rocm/bin/rocminfo | grep -i gfx -cd /root/vllm -docker compose up -d -docker compose logs -f -``` - -### **Test API** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping","max_tokens":5}' -``` - ---- - -## **11. Summary** - -You now have: - -* **MI50 (gfx906)** correctly passed into LXC -* **ROCm** inside the container via bind mounts -* **vLLM** running inside Docker in the LXC -* **OpenAI-compatible API** on port 8000 -* **Lyra Cortex** using it automatically as primary backend - -This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime. - ---- - -If you want, I can generate: - -* A `/docs/vllm-mi50/README.md` -* A "vLLM Gotchas" document -* A quick-reference cheat sheet -* A troubleshooting decision tree - -Just say the word.