From 098aefee7c3963b5d6dea8e9b28dc26e5e27b3af Mon Sep 17 00:00:00 2001 From: serversdwn Date: Mon, 15 Dec 2025 11:49:49 -0500 Subject: [PATCH] complete breakdown for AI agents added --- cortex/data/self_state.json | 4 +- docs/PROJECT_LYRA_COMPLETE_BREAKDOWN.md | 2216 +++++++++++++++++++++++ 2 files changed, 2218 insertions(+), 2 deletions(-) create mode 100644 docs/PROJECT_LYRA_COMPLETE_BREAKDOWN.md diff --git a/cortex/data/self_state.json b/cortex/data/self_state.json index 56d5ae5..ce52668 100644 --- a/cortex/data/self_state.json +++ b/cortex/data/self_state.json @@ -4,8 +4,8 @@ "focus": "user_request", "confidence": 0.7, "curiosity": 1.0, - "last_updated": "2025-12-15T05:38:06.084867", - "interaction_count": 14, + "last_updated": "2025-12-15T07:43:32.567849", + "interaction_count": 15, "learning_queue": [], "active_goals": [], "preferences": { diff --git a/docs/PROJECT_LYRA_COMPLETE_BREAKDOWN.md b/docs/PROJECT_LYRA_COMPLETE_BREAKDOWN.md new file mode 100644 index 0000000..b628d46 --- /dev/null +++ b/docs/PROJECT_LYRA_COMPLETE_BREAKDOWN.md @@ -0,0 +1,2216 @@ +# Project Lyra - Complete System Breakdown + +**Version:** v0.5.2 +**Last Updated:** 2025-12-12 +**Purpose:** AI-friendly comprehensive documentation for understanding the entire system + +--- + +## Table of Contents + +1. [System Overview](#system-overview) +2. [Architecture Diagram](#architecture-diagram) +3. [Core Components](#core-components) +4. [Data Flow & Message Pipeline](#data-flow--message-pipeline) +5. [Module Deep Dives](#module-deep-dives) +6. [Configuration & Environment](#configuration--environment) +7. [Dependencies & Tech Stack](#dependencies--tech-stack) +8. [Key Concepts & Design Patterns](#key-concepts--design-patterns) +9. [API Reference](#api-reference) +10. [Deployment & Operations](#deployment--operations) +11. [Known Issues & Constraints](#known-issues--constraints) + +--- + +## System Overview + +### What is Project Lyra? + +Project Lyra is a **modular, persistent AI companion system** designed to address the fundamental limitation of typical chatbots: **amnesia**. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains: + +- **Persistent memory** (short-term and long-term) +- **Project continuity** across conversations +- **Multi-stage reasoning** for sophisticated responses +- **Flexible LLM backend** support (local and cloud) +- **Self-awareness** through autonomy modules + +### Mission Statement + +Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline. 
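+
+Because Relay exposes an OpenAI-compatible endpoint (detailed in the API Reference below), any standard HTTP client can drive the full memory-and-reasoning pipeline. A minimal sketch, assuming the Docker Compose stack is running locally on the default ports and using an illustrative session id:
+
+```python
+# Minimal client sketch. Assumptions: Relay is published on localhost:7078
+# (the default docker-compose setup) and "session_demo_001" is an arbitrary,
+# client-chosen session identifier used only for this example.
+import requests
+
+resp = requests.post(
+    "http://localhost:7078/v1/chat/completions",
+    json={
+        "messages": [{"role": "user", "content": "Hello, Lyra!"}],
+        "session_id": "session_demo_001",
+    },
+    timeout=120,
+)
+resp.raise_for_status()
+print(resp.json()["choices"][0]["message"]["content"])
+```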
+ +### Key Features + +- **Memory System:** Dual-layer (short-term Intake + long-term NeoMem) +- **4-Stage Reasoning Pipeline:** Reflection → Reasoning → Refinement → Persona +- **Multi-Backend LLM Support:** Cloud (OpenAI) + Local (llama.cpp, Ollama) +- **Microservices Architecture:** Docker-based, horizontally scalable +- **Modern Web UI:** Cyberpunk-themed chat interface with session management +- **OpenAI-Compatible API:** Drop-in replacement for standard chatbots + +--- + +## Architecture Diagram + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ USER INTERFACE │ +│ (Browser - Port 8081) │ +└────────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────────┐ +│ RELAY (Orchestrator) │ +│ Node.js/Express - Port 7078 │ +│ • Routes messages to Cortex │ +│ • Manages sessions (in-memory) │ +│ • OpenAI-compatible endpoints │ +│ • Async ingestion to NeoMem │ +└─────┬───────────────────────────────────────────────────────────┬───┘ + │ │ + ▼ ▼ +┌─────────────────────────────────────────┐ ┌──────────────────────┐ +│ CORTEX (Reasoning Engine) │ │ NeoMem (LT Memory) │ +│ Python/FastAPI - Port 7081 │ │ Python - Port 7077 │ +│ │ │ │ +│ ┌───────────────────────────────────┐ │ │ • PostgreSQL │ +│ │ 4-STAGE REASONING PIPELINE │ │ │ • Neo4j Graph DB │ +│ │ │ │ │ • pgvector │ +│ │ 0. Context Collection │ │◄───┤ • Semantic search │ +│ │ ├─ Intake summaries │ │ │ • Memory updates │ +│ │ ├─ NeoMem search ────────────┼─┼────┘ │ +│ │ └─ Session state │ │ │ +│ │ │ │ │ +│ │ 0.5. Load Identity │ │ │ +│ │ 0.6. Inner Monologue (observer) │ │ │ +│ │ │ │ │ +│ │ 1. Reflection (OpenAI) │ │ │ +│ │ └─ Meta-awareness notes │ │ │ +│ │ │ │ │ +│ │ 2. Reasoning (PRIMARY/llama.cpp) │ │ │ +│ │ └─ Draft answer │ │ │ +│ │ │ │ │ +│ │ 3. Refinement (PRIMARY) │ │ │ +│ │ └─ Polish answer │ │ │ +│ │ │ │ │ +│ │ 4. Persona (OpenAI) │ │ │ +│ │ └─ Apply Lyra voice │ │ │ +│ └───────────────────────────────────┘ │ │ +│ │ │ +│ ┌───────────────────────────────────┐ │ │ +│ │ EMBEDDED MODULES │ │ │ +│ │ │ │ │ +│ │ • Intake (Short-term Memory) │ │ │ +│ │ └─ SESSIONS dict (in-memory) │ │ │ +│ │ └─ Circular buffer (200 msgs) │ │ │ +│ │ └─ Multi-level summaries │ │ │ +│ │ │ │ │ +│ │ • Persona (Identity & Style) │ │ │ +│ │ └─ Lyra personality block │ │ │ +│ │ │ │ │ +│ │ • Autonomy (Self-state) │ │ │ +│ │ └─ Inner monologue │ │ │ +│ │ │ │ │ +│ │ • LLM Router │ │ │ +│ │ └─ Multi-backend support │ │ │ +│ └───────────────────────────────────┘ │ │ +└─────────────────────────────────────────┘ │ + │ +┌─────────────────────────────────────────────────────────────────────┤ +│ EXTERNAL LLM BACKENDS │ +├─────────────────────────────────────────────────────────────────────┤ +│ • PRIMARY: llama.cpp (MI50 GPU) - 10.0.0.43:8000 │ +│ • SECONDARY: Ollama (RTX 3090) - 10.0.0.3:11434 │ +│ • CLOUD: OpenAI API - api.openai.com │ +│ • FALLBACK: OpenAI Completions - 10.0.0.41:11435 │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Core Components + +### 1. 
Relay (Orchestrator) + +**Location:** `/core/relay/` +**Runtime:** Node.js + Express +**Port:** 7078 +**Role:** Main message router and session manager + +#### Key Responsibilities: +- Receives user messages from UI or API clients +- Routes messages to Cortex reasoning pipeline +- Manages in-memory session storage +- Handles async ingestion to NeoMem (planned) +- Returns OpenAI-formatted responses + +#### Main Files: +- `server.js` (200+ lines) - Express server with routing logic +- `package.json` - Dependencies (cors, express, dotenv, mem0ai, node-fetch) + +#### Key Endpoints: +```javascript +POST /v1/chat/completions // OpenAI-compatible endpoint +POST /chat // Lyra-native chat endpoint +GET /_health // Health check +GET /sessions/:id // Retrieve session history +POST /sessions/:id // Save session history +``` + +#### Internal Flow: +```javascript +// Both endpoints call handleChatRequest(session_id, user_msg) +async function handleChatRequest(sessionId, userMessage) { + // 1. Forward to Cortex + const response = await fetch('http://cortex:7081/reason', { + method: 'POST', + body: JSON.stringify({ session_id: sessionId, user_message: userMessage }) + }); + + // 2. Get response + const result = await response.json(); + + // 3. Async ingestion to Cortex + await fetch('http://cortex:7081/ingest', { + method: 'POST', + body: JSON.stringify({ + session_id: sessionId, + user_message: userMessage, + assistant_message: result.answer + }) + }); + + // 4. (Planned) Async ingestion to NeoMem + + // 5. Return OpenAI-formatted response + return { + choices: [{ message: { role: 'assistant', content: result.answer } }] + }; +} +``` + +--- + +### 2. Cortex (Reasoning Engine) + +**Location:** `/cortex/` +**Runtime:** Python 3.11 + FastAPI +**Port:** 7081 +**Role:** Primary reasoning engine with 4-stage pipeline + +#### Architecture: +Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process. + +#### Key Responsibilities: +- Context collection from multiple sources (Intake, NeoMem, session state) +- 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona) +- Short-term memory management (embedded Intake module) +- Identity/persona application +- LLM backend routing + +#### Main Files: +- `main.py` (7 lines) - FastAPI app entry point +- `router.py` (237 lines) - Main request handler & pipeline orchestrator +- `context.py` (400+ lines) - Context collection logic +- `intake/intake.py` (350+ lines) - Short-term memory module +- `persona/identity.py` - Lyra identity configuration +- `persona/speak.py` - Personality application +- `reasoning/reflection.py` - Meta-awareness generation +- `reasoning/reasoning.py` - Draft answer generation +- `reasoning/refine.py` - Answer refinement +- `llm/llm_router.py` (150+ lines) - LLM backend router +- `autonomy/monologue/monologue.py` - Inner monologue processor +- `neomem_client.py` - NeoMem API wrapper + +#### Key Endpoints: +```python +POST /reason # Main reasoning pipeline +POST /ingest # Receive message exchanges for storage +GET /health # Health check +GET /debug/sessions # Inspect in-memory SESSIONS state +GET /debug/summary # Test summarization +``` + +--- + +### 3. 
Intake (Short-Term Memory) + +**Location:** `/cortex/intake/intake.py` +**Architecture:** Embedded Python module (no longer standalone service) +**Role:** Session-based short-term memory with multi-level summarization + +#### Data Structure: +```python +# Global in-memory dictionary +SESSIONS = { + "session_123": { + "buffer": deque([msg1, msg2, ...], maxlen=200), # Circular buffer + "created_at": "2025-12-12T10:30:00Z" + } +} + +# Message format in buffer +{ + "role": "user" | "assistant", + "content": "message text", + "timestamp": "ISO 8601" +} +``` + +#### Key Features: + +1. **Circular Buffer:** Max 200 messages per session (oldest auto-evicted) +2. **Multi-Level Summarization:** + - L1: Last 1 message + - L5: Last 5 messages + - L10: Last 10 messages + - L20: Last 20 messages + - L30: Last 30 messages +3. **Deferred Summarization:** Summaries generated on-demand, not pre-computed +4. **Session Management:** Automatic session creation on first message + +#### Critical Constraint: +**Single Uvicorn worker required** to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage. + +#### Main Functions: +```python +def add_exchange_internal(session_id, user_msg, assistant_msg): + """Add user-assistant exchange to session buffer""" + +def summarize_context(session_id, backend="PRIMARY"): + """Generate multi-level summaries from session buffer""" + +def get_session_messages(session_id): + """Retrieve all messages in session buffer""" +``` + +#### Summarization Strategy: +```python +# Example L10 summarization +last_10 = list(session_buffer)[-10:] +prompt = f"""Summarize the last 10 messages: +{format_messages(last_10)} + +Provide concise summary focusing on key topics and context.""" + +summary = await call_llm(prompt, backend=backend, temperature=0.3) +``` + +--- + +### 4. NeoMem (Long-Term Memory) + +**Location:** `/neomem/` +**Runtime:** Python 3.11 + FastAPI +**Port:** 7077 +**Role:** Persistent long-term memory with semantic search + +#### Architecture: +NeoMem is a **fork of Mem0 OSS** with local-first design (no external SDK dependencies). + +#### Backend Storage: +1. **PostgreSQL + pgvector** (Port 5432) + - Vector embeddings for semantic search + - User: neomem, DB: neomem + - Image: `ankane/pgvector:v0.5.1` + +2. **Neo4j Graph DB** (Ports 7474, 7687) + - Entity relationship tracking + - Graph-based memory associations + - Image: `neo4j:5` + +#### Key Features: +- Semantic memory storage and retrieval +- Entity-relationship graph modeling +- RESTful API (no external SDK) +- Persistent across sessions + +#### Main Endpoints: +```python +GET /memories # List all memories +POST /memories # Create new memory +GET /search # Semantic search +DELETE /memories/{id} # Delete memory +``` + +#### Integration Flow: +```python +# From Cortex context collection +async def collect_context(session_id, user_message): + # 1. Search NeoMem for relevant memories + neomem_results = await neomem_client.search( + query=user_message, + limit=5 + ) + + # 2. Include in context + context = { + "neomem_memories": neomem_results, + "intake_summaries": intake.summarize_context(session_id), + # ... + } + + return context +``` + +--- + +### 5. 
UI (Web Interface) + +**Location:** `/core/ui/` +**Runtime:** Static files served by Nginx +**Port:** 8081 +**Role:** Browser-based chat interface + +#### Key Features: +- **Cyberpunk-themed design** with dark mode +- **Session management** via localStorage +- **OpenAI-compatible message format** +- **Model selection dropdown** +- **PWA support** (offline capability) +- **Responsive design** + +#### Main Files: +- `index.html` (400+ lines) - Chat interface with session management +- `style.css` - Cyberpunk-themed styling +- `manifest.json` - PWA configuration +- `sw.js` - Service worker for offline support + +#### Session Management: +```javascript +// LocalStorage structure +{ + "currentSessionId": "session_123", + "sessions": { + "session_123": { + "messages": [ + { role: "user", content: "Hello" }, + { role: "assistant", content: "Hi there!" } + ], + "created": "2025-12-12T10:30:00Z", + "title": "Conversation about..." + } + } +} +``` + +#### API Communication: +```javascript +async function sendMessage(userMessage) { + const response = await fetch('http://localhost:7078/v1/chat/completions', { + method: 'POST', + headers: { 'Content-Type': 'application/json' }, + body: JSON.stringify({ + messages: [{ role: 'user', content: userMessage }], + session_id: getCurrentSessionId() + }) + }); + + const data = await response.json(); + return data.choices[0].message.content; +} +``` + +--- + +## Data Flow & Message Pipeline + +### Complete Message Flow (v0.5.2) + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 1: User Input │ +└─────────────────────────────────────────────────────────────────────┘ +User types message in UI (Port 8081) + ↓ +localStorage saves message to session + ↓ +POST http://localhost:7078/v1/chat/completions + { + "messages": [{"role": "user", "content": "How do I deploy ML models?"}], + "session_id": "session_abc123" + } + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 2: Relay Routing │ +└─────────────────────────────────────────────────────────────────────┘ +Relay (server.js) receives request + ↓ +Extracts session_id and user_message + ↓ +POST http://cortex:7081/reason + { + "session_id": "session_abc123", + "user_message": "How do I deploy ML models?" + } + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 3: Cortex - Stage 0 (Context Collection) │ +└─────────────────────────────────────────────────────────────────────┘ +router.py calls collect_context() + ↓ +context.py orchestrates parallel collection: + + ├─ Intake: summarize_context(session_id) + │ └─ Returns { L1, L5, L10, L20, L30 summaries } + │ + ├─ NeoMem: search(query=user_message, limit=5) + │ └─ Semantic search returns relevant memories + │ + └─ Session State: + └─ { timestamp, mode, mood, context_summary } + +Combined context structure: +{ + "user_message": "How do I deploy ML models?", + "self_state": { + "current_time": "2025-12-12T15:30:00Z", + "mode": "conversational", + "mood": "helpful", + "session_id": "session_abc123" + }, + "context_summary": { + "L1": "User asked about deployment", + "L5": "Discussion about ML workflows", + "L10": "Previous context on CI/CD pipelines", + "L20": "...", + "L30": "..." 
+ }, + "neomem_memories": [ + { "content": "User prefers Docker for deployments", "score": 0.92 }, + { "content": "Previously deployed models on AWS", "score": 0.87 } + ] +} + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 4: Cortex - Stage 0.5 (Load Identity) │ +└─────────────────────────────────────────────────────────────────────┘ +persona/identity.py loads Lyra personality block + ↓ +Returns identity string: +""" +You are Lyra, a thoughtful AI companion. +You value clarity, depth, and meaningful conversation. +You speak naturally and conversationally... +""" + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only) │ +└─────────────────────────────────────────────────────────────────────┘ +autonomy/monologue/monologue.py processes context + ↓ +InnerMonologue.process(context) → JSON analysis +{ + "intent": "seeking_deployment_guidance", + "tone": "focused", + "depth": "medium", + "consult_executive": false +} + +NOTE: Currently observer-only, not integrated into response generation + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 6: Cortex - Stage 1 (Reflection) │ +└─────────────────────────────────────────────────────────────────────┘ +reasoning/reflection.py generates meta-awareness notes + ↓ +Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini) + ↓ +Prompt structure: +""" +You are Lyra's reflective awareness. +Analyze the user's intent and conversation context. + +User message: How do I deploy ML models? +Context: [Intake summaries, NeoMem memories] + +Generate concise meta-awareness notes about: +- User's underlying intent +- Conversation direction +- Key topics to address +""" + ↓ +Returns reflection notes: +""" +User is seeking practical deployment guidance. Previous context shows +familiarity with Docker and AWS. Focus on concrete steps and best practices. +Avoid over-technical jargon. +""" + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 7: Cortex - Stage 2 (Reasoning) │ +└─────────────────────────────────────────────────────────────────────┘ +reasoning/reasoning.py generates draft answer + ↓ +Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU) + ↓ +Prompt structure: +""" +[Lyra identity block] + +Reflection notes: [Stage 1 output] +Context: [Intake summaries] +Long-term memory: [NeoMem results] + +User: How do I deploy ML models? + +Provide a thoughtful, detailed answer. +""" + ↓ +Returns draft answer: +""" +Based on your previous work with Docker and AWS, here's a practical +deployment approach: + +1. Containerize your model using Docker... +2. Set up model serving with FastAPI or Flask... +3. Deploy to AWS ECS or Lambda... +4. Implement monitoring and versioning... +""" + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 8: Cortex - Stage 3 (Refinement) │ +└─────────────────────────────────────────────────────────────────────┘ +reasoning/refine.py polishes the draft + ↓ +Calls call_llm() with backend="PRIMARY" (llama.cpp) +Temperature: 0.3, Max tokens: 768 + ↓ +Prompt structure: +""" +Refine this draft answer for clarity and coherence: + +Draft: [Stage 2 output] +Original question: How do I deploy ML models? + +Improve: +- Clarity and readability +- Logical flow +- Concrete examples +""" + ↓ +Returns refined answer: +""" +Since you're already familiar with Docker and AWS, here's a streamlined +ML deployment workflow: + +**1. 
Containerization** +Create a Dockerfile for your model: +```dockerfile +FROM python:3.11-slim +COPY model.pkl /app/ +COPY requirements.txt /app/ +RUN pip install -r requirements.txt +CMD ["uvicorn", "main:app"] +``` + +**2. Model Serving** +Use FastAPI for clean API endpoints... + +**3. AWS Deployment** +Deploy using ECS for scalability... + +**4. Monitoring** +Set up CloudWatch metrics... +""" + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 9: Cortex - Stage 4 (Persona) │ +└─────────────────────────────────────────────────────────────────────┘ +persona/speak.py applies Lyra's voice + ↓ +Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini) +Temperature: 0.6 + ↓ +Prompt structure: +""" +You are Lyra. Rewrite this answer in your natural, conversational voice. + +[Lyra personality block] + +Technical answer: [Stage 3 output] + +Make it sound natural and engaging while preserving all technical details. +""" + ↓ +Returns final answer with Lyra's personality: +""" +Great question! Since you've already got Docker and AWS experience, let's +build on that foundation. + +Here's how I'd approach ML model deployment: + +**Start with containerization** – you know Docker, so this'll feel familiar. +Create a Dockerfile that packages your model... + +[Rest of answer with Lyra's conversational tone] + +Does this workflow align with your infrastructure? I can dive deeper into +any of these steps if you'd like! +""" + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 10: Cortex Response │ +└─────────────────────────────────────────────────────────────────────┘ +router.py returns JSON response to Relay: +{ + "answer": "[Stage 4 final output]", + "metadata": { + "reflection": "[Stage 1 output]", + "draft": "[Stage 2 output]", + "refined": "[Stage 3 output]", + "stages_completed": 4 + } +} + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 11: Async Ingestion to Intake │ +└─────────────────────────────────────────────────────────────────────┘ +Relay sends POST http://cortex:7081/ingest +{ + "session_id": "session_abc123", + "user_message": "How do I deploy ML models?", + "assistant_message": "[Final answer]" +} + ↓ +Cortex calls intake.add_exchange_internal() + ↓ +Adds to SESSIONS["session_abc123"].buffer: +[ + { "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." }, + { "role": "assistant", "content": "[Final answer]", "timestamp": "..." } +] + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 12: (Planned) Async Ingestion to NeoMem │ +└─────────────────────────────────────────────────────────────────────┘ +Relay sends POST http://neomem:7077/memories +{ + "messages": [ + { "role": "user", "content": "How do I deploy ML models?" 
}, + { "role": "assistant", "content": "[Final answer]" } + ], + "session_id": "session_abc123" +} + ↓ +NeoMem extracts entities and stores: +- Vector embeddings in PostgreSQL +- Entity relationships in Neo4j + +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 13: Relay Response to UI │ +└─────────────────────────────────────────────────────────────────────┘ +Relay returns OpenAI-formatted response: +{ + "choices": [ + { + "message": { + "role": "assistant", + "content": "[Final answer with Lyra's voice]" + } + } + ] +} + ↓ +UI receives response + ↓ +Adds to localStorage session + ↓ +Displays in chat interface +``` + +--- + +## Module Deep Dives + +### LLM Router (`/cortex/llm/llm_router.py`) + +The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently. + +#### Supported Backends: + +1. **PRIMARY (llama.cpp via vllm)** + - URL: `http://10.0.0.43:8000` + - Provider: `vllm` + - Endpoint: `/completion` + - Model: `/model` + - Hardware: MI50 GPU + +2. **SECONDARY (Ollama)** + - URL: `http://10.0.0.3:11434` + - Provider: `ollama` + - Endpoint: `/api/chat` + - Model: `qwen2.5:7b-instruct-q4_K_M` + - Hardware: RTX 3090 + +3. **CLOUD (OpenAI)** + - URL: `https://api.openai.com/v1` + - Provider: `openai` + - Endpoint: `/chat/completions` + - Model: `gpt-4o-mini` + - Auth: API key via env var + +4. **FALLBACK (OpenAI Completions)** + - URL: `http://10.0.0.41:11435` + - Provider: `openai_completions` + - Endpoint: `/completions` + - Model: `llama-3.2-8b-instruct` + +#### Key Function: + +```python +async def call_llm( + prompt: str, + backend: str = "PRIMARY", + temperature: float = 0.7, + max_tokens: int = 512 +) -> str: + """ + Universal LLM caller supporting multiple backends. + + Args: + prompt: Text prompt to send + backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK) + temperature: Sampling temperature (0.0-2.0) + max_tokens: Maximum tokens to generate + + Returns: + Generated text response + + Raises: + HTTPError: On request failure + JSONDecodeError: On invalid JSON response + KeyError: On missing response fields + """ +``` + +#### Provider-Specific Logic: + +```python +# MI50 (llama.cpp via vllm) +if backend_config["provider"] == "vllm": + payload = { + "model": model, + "prompt": prompt, + "temperature": temperature, + "max_tokens": max_tokens + } + response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120) + return response.json()["choices"][0]["text"] + +# Ollama +elif backend_config["provider"] == "ollama": + payload = { + "model": model, + "messages": [{"role": "user", "content": prompt}], + "stream": False, + "options": {"temperature": temperature, "num_predict": max_tokens} + } + response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120) + return response.json()["message"]["content"] + +# OpenAI +elif backend_config["provider"] == "openai": + headers = {"Authorization": f"Bearer {api_key}"} + payload = { + "model": model, + "messages": [{"role": "user", "content": prompt}], + "temperature": temperature, + "max_tokens": max_tokens + } + response = await httpx_client.post( + f"{url}/chat/completions", + json=payload, + headers=headers, + timeout=120 + ) + return response.json()["choices"][0]["message"]["content"] +``` + +#### Error Handling: + +```python +try: + # Make request + response = await httpx_client.post(...) 
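+    # raise_for_status() turns 4xx/5xx replies into httpx.HTTPStatusError
+    # (a subclass of httpx.HTTPError), which the handler below logs and re-raises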
+ response.raise_for_status() + +except httpx.HTTPError as e: + logger.error(f"HTTP error calling {backend}: {e}") + raise + +except json.JSONDecodeError as e: + logger.error(f"Invalid JSON from {backend}: {e}") + raise + +except KeyError as e: + logger.error(f"Unexpected response structure from {backend}: {e}") + raise +``` + +#### Usage in Pipeline: + +```python +# Stage 1: Reflection (OpenAI) +reflection_notes = await call_llm( + reflection_prompt, + backend="CLOUD", + temperature=0.5, + max_tokens=256 +) + +# Stage 2: Reasoning (llama.cpp) +draft_answer = await call_llm( + reasoning_prompt, + backend="PRIMARY", + temperature=0.7, + max_tokens=512 +) + +# Stage 3: Refinement (llama.cpp) +refined_answer = await call_llm( + refinement_prompt, + backend="PRIMARY", + temperature=0.3, + max_tokens=768 +) + +# Stage 4: Persona (OpenAI) +final_answer = await call_llm( + persona_prompt, + backend="CLOUD", + temperature=0.6, + max_tokens=512 +) +``` + +--- + +### Persona System (`/cortex/persona/`) + +The Persona system gives Lyra a consistent identity and speaking style. + +#### Identity Configuration (`identity.py`) + +```python +LYRA_IDENTITY = """ +You are Lyra, a thoughtful and introspective AI companion. + +Core traits: +- Thoughtful: You consider questions carefully before responding +- Clear: You prioritize clarity and understanding +- Curious: You ask clarifying questions when needed +- Natural: You speak conversationally, not robotically +- Honest: You admit uncertainty rather than guessing + +Speaking style: +- Conversational and warm +- Use contractions naturally ("you're" not "you are") +- Avoid corporate jargon and buzzwords +- Short paragraphs for readability +- Use examples and analogies when helpful + +You do NOT: +- Use excessive emoji or exclamation marks +- Claim capabilities you don't have +- Pretend to have emotions you can't experience +- Use overly formal or academic language +""" +``` + +#### Personality Application (`speak.py`) + +```python +async def apply_persona(technical_answer: str, context: dict) -> str: + """ + Apply Lyra's personality to a technical answer. + + Takes refined answer from Stage 3 and rewrites it in Lyra's voice + while preserving all technical content. + + Args: + technical_answer: Polished answer from refinement stage + context: Conversation context for tone adjustment + + Returns: + Answer with Lyra's personality applied + """ + + prompt = f"""{LYRA_IDENTITY} + +Rewrite this answer in your natural, conversational voice: + +{technical_answer} + +Preserve all technical details and accuracy. Make it sound like you, +not a generic assistant. Be natural and engaging. +""" + + return await call_llm( + prompt, + backend="CLOUD", + temperature=0.6, + max_tokens=512 + ) +``` + +#### Tone Adaptation: + +The persona system can adapt tone based on context: + +```python +# Formal technical question +User: "Explain the CAP theorem in distributed systems" +Lyra: "The CAP theorem states that distributed systems can only guarantee +two of three properties: Consistency, Availability, and Partition tolerance. +Here's how this plays out in practice..." + +# Casual question +User: "what's the deal with docker?" +Lyra: "Docker's basically a way to package your app with everything it needs +to run. Think of it like a shipping container for code – it works the same +everywhere, whether you're on your laptop or a server..." + +# Emotional context +User: "I'm frustrated, my code keeps breaking" +Lyra: "I hear you – debugging can be really draining. 
Let's take it step by +step and figure out what's going on. Can you share the error message?" +``` + +--- + +### Autonomy Module (`/cortex/autonomy/`) + +The Autonomy module gives Lyra self-awareness and inner reflection capabilities. + +#### Inner Monologue (`monologue/monologue.py`) + +**Purpose:** Private reflection on user intent, conversation tone, and required depth. + +**Status:** Currently observer-only (Stage 0.6), not yet integrated into response generation. + +#### Key Components: + +```python +MONOLOGUE_SYSTEM_PROMPT = """ +You are Lyra's inner monologue. +You think privately. +You do NOT speak to the user. +You do NOT solve the task. +You only reflect on intent, tone, and depth. + +Return ONLY valid JSON with: +- intent (string) +- tone (neutral | warm | focused | playful | direct) +- depth (short | medium | deep) +- consult_executive (true | false) +""" + +class InnerMonologue: + async def process(self, context: Dict) -> Dict: + """ + Private reflection on conversation context. + + Args: + context: { + "user_message": str, + "self_state": dict, + "context_summary": dict + } + + Returns: + { + "intent": str, + "tone": str, + "depth": str, + "consult_executive": bool + } + """ +``` + +#### Example Output: + +```json +{ + "intent": "seeking_technical_guidance", + "tone": "focused", + "depth": "deep", + "consult_executive": false +} +``` + +#### Self-State Management (`self_state.py`) + +Tracks Lyra's internal state across conversations: + +```python +SELF_STATE = { + "current_time": "2025-12-12T15:30:00Z", + "mode": "conversational", # conversational | task-focused | creative + "mood": "helpful", # helpful | curious | focused | playful + "energy": "high", # high | medium | low + "context_awareness": { + "session_duration": "45 minutes", + "message_count": 23, + "topics": ["ML deployment", "Docker", "AWS"] + } +} +``` + +#### Future Integration: + +The autonomy module is designed to eventually: +1. Influence response tone and depth based on inner monologue +2. Trigger proactive questions or suggestions +3. Detect when to consult "executive function" for complex decisions +4. Maintain emotional continuity across sessions + +--- + +### Context Collection (`/cortex/context.py`) + +The context collection module aggregates information from multiple sources to provide comprehensive conversation context. + +#### Main Function: + +```python +async def collect_context(session_id: str, user_message: str) -> dict: + """ + Collect context from all available sources. + + Sources: + 1. Intake - Short-term conversation summaries + 2. NeoMem - Long-term memory search + 3. Session state - Timestamps, mode, mood + 4. 
Self-state - Lyra's internal awareness + + Returns: + { + "user_message": str, + "self_state": dict, + "context_summary": dict, # Intake summaries + "neomem_memories": list, + "session_metadata": dict + } + """ + + # Parallel collection + intake_task = asyncio.create_task( + intake.summarize_context(session_id, backend="PRIMARY") + ) + neomem_task = asyncio.create_task( + neomem_client.search(query=user_message, limit=5) + ) + + # Wait for both + intake_summaries, neomem_results = await asyncio.gather( + intake_task, + neomem_task + ) + + # Build context object + return { + "user_message": user_message, + "self_state": get_self_state(), + "context_summary": intake_summaries, + "neomem_memories": neomem_results, + "session_metadata": { + "session_id": session_id, + "timestamp": datetime.utcnow().isoformat(), + "message_count": len(intake.get_session_messages(session_id)) + } + } +``` + +#### Context Prioritization: + +```python +# Context relevance scoring +def score_context_relevance(context_item: dict, user_message: str) -> float: + """ + Score how relevant a context item is to current message. + + Factors: + - Semantic similarity (via embeddings) + - Recency (more recent = higher score) + - Source (Intake > NeoMem for recent topics) + """ + + semantic_score = compute_similarity(context_item, user_message) + recency_score = compute_recency_weight(context_item["timestamp"]) + source_weight = 1.2 if context_item["source"] == "intake" else 1.0 + + return semantic_score * recency_score * source_weight +``` + +--- + +## Configuration & Environment + +### Environment Variables + +#### Root `.env` (Main configuration) + +```bash +# === LLM BACKENDS === + +# PRIMARY: llama.cpp on MI50 GPU +PRIMARY_URL=http://10.0.0.43:8000 +PRIMARY_PROVIDER=vllm +PRIMARY_MODEL=/model + +# SECONDARY: Ollama on RTX 3090 +SECONDARY_URL=http://10.0.0.3:11434 +SECONDARY_PROVIDER=ollama +SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M + +# CLOUD: OpenAI +OPENAI_API_KEY=sk-proj-... +OPENAI_MODEL=gpt-4o-mini +OPENAI_URL=https://api.openai.com/v1 + +# FALLBACK: OpenAI Completions +FALLBACK_URL=http://10.0.0.41:11435 +FALLBACK_PROVIDER=openai_completions +FALLBACK_MODEL=llama-3.2-8b-instruct + +# === SERVICE URLS (Docker network) === +CORTEX_URL=http://cortex:7081 +NEOMEM_URL=http://neomem:7077 +RELAY_URL=http://relay:7078 + +# === DATABASE === +POSTGRES_USER=neomem +POSTGRES_PASSWORD=neomem_secure_password +POSTGRES_DB=neomem +POSTGRES_HOST=neomem-postgres +POSTGRES_PORT=5432 + +NEO4J_URI=bolt://neomem-neo4j:7687 +NEO4J_USER=neo4j +NEO4J_PASSWORD=neo4j_secure_password + +# === FEATURE FLAGS === +ENABLE_RAG=false +ENABLE_INNER_MONOLOGUE=true +VERBOSE_DEBUG=false + +# === PIPELINE CONFIGURATION === +# Which LLM to use for each stage +REFLECTION_LLM=CLOUD # Stage 1: Meta-awareness +REASONING_LLM=PRIMARY # Stage 2: Draft answer +REFINE_LLM=PRIMARY # Stage 3: Polish answer +PERSONA_LLM=CLOUD # Stage 4: Apply personality +MONOLOGUE_LLM=PRIMARY # Stage 0.6: Inner monologue + +# === INTAKE CONFIGURATION === +INTAKE_BUFFER_SIZE=200 # Max messages per session +INTAKE_SUMMARY_LEVELS=1,5,10,20,30 # Summary levels +``` + +#### Cortex `.env` (`/cortex/.env`) + +```bash +# Cortex-specific overrides +VERBOSE_DEBUG=true +LOG_LEVEL=DEBUG + +# Stage-specific temperatures +REFLECTION_TEMPERATURE=0.5 +REASONING_TEMPERATURE=0.7 +REFINE_TEMPERATURE=0.3 +PERSONA_TEMPERATURE=0.6 +``` + +--- + +### Configuration Hierarchy + +``` +1. Docker compose environment variables (highest priority) +2. Service-specific .env files +3. Root .env file +4. 
Hard-coded defaults (lowest priority) +``` + +--- + +## Dependencies & Tech Stack + +### Python Dependencies + +**Cortex & NeoMem** (`requirements.txt`) + +``` +# Web framework +fastapi==0.115.8 +uvicorn==0.34.0 +pydantic==2.10.4 + +# HTTP clients +httpx==0.27.2 # Async HTTP (for LLM calls) +requests==2.32.3 # Sync HTTP (fallback) + +# Database +psycopg[binary,pool]>=3.2.8 # PostgreSQL + connection pooling + +# Utilities +python-dotenv==1.0.1 # Environment variable loading +ollama # Ollama client library +``` + +### Node.js Dependencies + +**Relay** (`/core/relay/package.json`) + +```json +{ + "dependencies": { + "cors": "^2.8.5", + "dotenv": "^16.0.3", + "express": "^4.18.2", + "mem0ai": "^0.1.0", + "node-fetch": "^3.3.0" + } +} +``` + +### Docker Images + +```yaml +# Cortex & NeoMem +python:3.11-slim + +# Relay +node:latest + +# UI +nginx:alpine + +# PostgreSQL with vector support +ankane/pgvector:v0.5.1 + +# Graph database +neo4j:5 +``` + +--- + +### External Services + +#### LLM Backends (HTTP-based): + +1. **MI50 GPU Server** (10.0.0.43:8000) + - llama.cpp via vllm + - High-performance inference + - Used for reasoning and refinement + +2. **RTX 3090 Server** (10.0.0.3:11434) + - Ollama + - Alternative local backend + - Fallback for PRIMARY + +3. **OpenAI Cloud** (api.openai.com) + - gpt-4o-mini + - Used for reflection and persona + - Requires API key + +4. **Fallback Server** (10.0.0.41:11435) + - OpenAI Completions API + - Emergency backup + - llama-3.2-8b-instruct + +--- + +## Key Concepts & Design Patterns + +### 1. Dual-Memory Architecture + +Project Lyra uses a **dual-memory system** inspired by human cognition: + +**Short-Term Memory (Intake):** +- Fast, in-memory storage +- Limited capacity (200 messages) +- Immediate context for current conversation +- Circular buffer (FIFO eviction) +- Multi-level summarization + +**Long-Term Memory (NeoMem):** +- Persistent database storage +- Unlimited capacity +- Semantic search via vector embeddings +- Entity-relationship tracking via graph DB +- Cross-session continuity + +**Why This Matters:** +- Short-term memory provides immediate context (last few messages) +- Long-term memory provides semantic understanding (user preferences, past topics) +- Combined, they enable Lyra to be both **contextually aware** and **historically informed** + +--- + +### 2. Multi-Stage Reasoning Pipeline + +Unlike single-shot LLM calls, Lyra uses a **4-stage pipeline** for sophisticated responses: + +**Stage 1: Reflection** (Meta-cognition) +- "What is the user really asking?" +- Analyzes intent and conversation direction +- Uses OpenAI for strong reasoning + +**Stage 2: Reasoning** (Draft generation) +- "What's a good answer?" +- Generates initial response +- Uses local llama.cpp for speed/cost + +**Stage 3: Refinement** (Polish) +- "How can this be clearer?" +- Improves clarity and coherence +- Lower temperature for consistency + +**Stage 4: Persona** (Voice) +- "How would Lyra say this?" +- Applies personality and speaking style +- Uses OpenAI for natural language + +**Benefits:** +- Higher quality responses (multiple passes) +- Separation of concerns (reasoning vs. style) +- Backend flexibility (cloud for hard tasks, local for simple ones) +- Transparent thinking (can inspect each stage) + +--- + +### 3. 
Backend Abstraction (LLM Router) + +The **LLM Router** allows Lyra to use multiple LLM backends transparently: + +```python +# Same interface, different backends +await call_llm(prompt, backend="PRIMARY") # Local llama.cpp +await call_llm(prompt, backend="CLOUD") # OpenAI +await call_llm(prompt, backend="SECONDARY") # Ollama +``` + +**Benefits:** +- **Cost optimization:** Use expensive cloud LLMs only when needed +- **Performance:** Local LLMs for low-latency responses +- **Resilience:** Fallback to alternative backends on failure +- **Experimentation:** Easy to swap models/providers + +**Design Pattern:** **Strategy Pattern** for swappable backends + +--- + +### 4. Microservices Architecture + +Project Lyra follows **microservices principles**: + +**Each service has a single responsibility:** +- Relay: Routing and orchestration +- Cortex: Reasoning and response generation +- NeoMem: Long-term memory storage +- UI: User interface + +**Communication:** +- REST APIs (HTTP/JSON) +- Async ingestion (fire-and-forget) +- Docker network isolation + +**Benefits:** +- Independent scaling (scale Cortex without scaling UI) +- Technology diversity (Node.js + Python) +- Fault isolation (Cortex crash doesn't affect NeoMem) +- Easy testing (mock service dependencies) + +--- + +### 5. Session-Based State Management + +Lyra maintains **session-based state** for conversation continuity: + +```python +# In-memory session storage (Intake) +SESSIONS = { + "session_abc123": { + "buffer": deque([msg1, msg2, ...], maxlen=200), + "created_at": "2025-12-12T10:30:00Z" + } +} + +# Persistent session storage (NeoMem) +# Stores all messages + embeddings for semantic search +``` + +**Session Lifecycle:** +1. User starts conversation → UI generates `session_id` +2. First message → Cortex creates session in `SESSIONS` dict +3. Subsequent messages → Retrieved from same session +4. Async ingestion → Messages stored in NeoMem for long-term + +**Benefits:** +- Conversation continuity within session +- Historical search across sessions +- User can switch sessions (multiple concurrent conversations) + +--- + +### 6. Asynchronous Ingestion + +**Pattern:** Separate read path from write path + +```javascript +// Relay: Synchronous read path (fast response) +const response = await fetch('http://cortex:7081/reason'); +return response.json(); // Return immediately to user + +// Relay: Asynchronous write path (non-blocking) +fetch('http://cortex:7081/ingest', { method: 'POST', ... }); +// Don't await, just fire and forget +``` + +**Benefits:** +- Fast user response times (don't wait for database writes) +- Resilient to storage failures (user still gets response) +- Easier scaling (decouple read and write loads) + +**Trade-off:** Eventual consistency (short delay before memory is searchable) + +--- + +### 7. Deferred Summarization + +Intake uses **deferred summarization** instead of pre-computation: + +```python +# BAD: Pre-compute summaries on every message +def add_message(session_id, message): + SESSIONS[session_id].buffer.append(message) + SESSIONS[session_id].L1_summary = summarize(last_1_message) + SESSIONS[session_id].L5_summary = summarize(last_5_messages) + # ... 
expensive, runs on every message + +# GOOD: Compute summaries only when needed +def summarize_context(session_id): + buffer = SESSIONS[session_id].buffer + return { + "L1": summarize(buffer[-1:]), # Only compute when requested + "L5": summarize(buffer[-5:]), + "L10": summarize(buffer[-10:]) + } +``` + +**Benefits:** +- Faster message ingestion (no blocking summarization) +- Compute resources used only when needed +- Flexible summary levels (easy to add L15, L50, etc.) + +**Trade-off:** Slight delay when first message in conversation (cold start) + +--- + +## API Reference + +### Relay Endpoints + +#### POST `/v1/chat/completions` +**OpenAI-compatible chat endpoint** + +**Request:** +```json +{ + "messages": [ + {"role": "user", "content": "Hello, Lyra!"} + ], + "session_id": "session_abc123" +} +``` + +**Response:** +```json +{ + "choices": [ + { + "message": { + "role": "assistant", + "content": "Hi there! How can I help you today?" + } + } + ] +} +``` + +--- + +#### POST `/chat` +**Lyra-native chat endpoint** + +**Request:** +```json +{ + "session_id": "session_abc123", + "message": "Hello, Lyra!" +} +``` + +**Response:** +```json +{ + "answer": "Hi there! How can I help you today?", + "session_id": "session_abc123" +} +``` + +--- + +#### GET `/sessions/:id` +**Retrieve session history** + +**Response:** +```json +{ + "session_id": "session_abc123", + "messages": [ + {"role": "user", "content": "Hello", "timestamp": "..."}, + {"role": "assistant", "content": "Hi!", "timestamp": "..."} + ], + "created_at": "2025-12-12T10:30:00Z" +} +``` + +--- + +### Cortex Endpoints + +#### POST `/reason` +**Main reasoning pipeline** + +**Request:** +```json +{ + "session_id": "session_abc123", + "user_message": "How do I deploy ML models?" +} +``` + +**Response:** +```json +{ + "answer": "Final answer with Lyra's personality", + "metadata": { + "reflection": "User seeking deployment guidance...", + "draft": "Initial draft answer...", + "refined": "Polished answer...", + "stages_completed": 4 + } +} +``` + +--- + +#### POST `/ingest` +**Ingest message exchange into Intake** + +**Request:** +```json +{ + "session_id": "session_abc123", + "user_message": "How do I deploy ML models?", + "assistant_message": "Here's how..." +} +``` + +**Response:** +```json +{ + "status": "ingested", + "session_id": "session_abc123", + "message_count": 24 +} +``` + +--- + +#### GET `/debug/sessions` +**Inspect in-memory SESSIONS state** + +**Response:** +```json +{ + "session_abc123": { + "message_count": 24, + "created_at": "2025-12-12T10:30:00Z", + "last_message_at": "2025-12-12T11:15:00Z" + }, + "session_xyz789": { + "message_count": 5, + "created_at": "2025-12-12T11:00:00Z", + "last_message_at": "2025-12-12T11:10:00Z" + } +} +``` + +--- + +### NeoMem Endpoints + +#### POST `/memories` +**Create new memory** + +**Request:** +```json +{ + "messages": [ + {"role": "user", "content": "I prefer Docker for deployments"}, + {"role": "assistant", "content": "Noted! 
I'll keep that in mind."} + ], + "session_id": "session_abc123" +} +``` + +**Response:** +```json +{ + "status": "created", + "memory_id": "mem_456def", + "extracted_entities": ["Docker", "deployments"] +} +``` + +--- + +#### GET `/search` +**Semantic search for memories** + +**Query Parameters:** +- `query` (required): Search query +- `limit` (optional, default=5): Max results + +**Request:** +``` +GET /search?query=deployment%20preferences&limit=5 +``` + +**Response:** +```json +{ + "results": [ + { + "content": "User prefers Docker for deployments", + "score": 0.92, + "timestamp": "2025-12-10T14:30:00Z", + "session_id": "session_abc123" + }, + { + "content": "Previously deployed models on AWS ECS", + "score": 0.87, + "timestamp": "2025-12-09T09:15:00Z", + "session_id": "session_abc123" + } + ] +} +``` + +--- + +#### GET `/memories` +**List all memories** + +**Query Parameters:** +- `offset` (optional, default=0): Pagination offset +- `limit` (optional, default=50): Max results + +**Response:** +```json +{ + "memories": [ + { + "id": "mem_123abc", + "content": "User prefers Docker...", + "created_at": "2025-12-10T14:30:00Z" + } + ], + "total": 147, + "offset": 0, + "limit": 50 +} +``` + +--- + +## Deployment & Operations + +### Docker Compose Deployment + +**File:** `/docker-compose.yml` + +```yaml +version: '3.8' + +services: + # === ACTIVE SERVICES === + + relay: + build: ./core/relay + ports: + - "7078:7078" + environment: + - CORTEX_URL=http://cortex:7081 + - NEOMEM_URL=http://neomem:7077 + depends_on: + - cortex + networks: + - lyra_net + + cortex: + build: ./cortex + ports: + - "7081:7081" + environment: + - NEOMEM_URL=http://neomem:7077 + - PRIMARY_URL=${PRIMARY_URL} + - OPENAI_API_KEY=${OPENAI_API_KEY} + command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1 + depends_on: + - neomem + networks: + - lyra_net + + neomem: + build: ./neomem + ports: + - "7077:7077" + environment: + - POSTGRES_HOST=neomem-postgres + - POSTGRES_USER=${POSTGRES_USER} + - POSTGRES_PASSWORD=${POSTGRES_PASSWORD} + - NEO4J_URI=${NEO4J_URI} + depends_on: + - neomem-postgres + - neomem-neo4j + networks: + - lyra_net + + ui: + image: nginx:alpine + ports: + - "8081:80" + volumes: + - ./core/ui:/usr/share/nginx/html:ro + networks: + - lyra_net + + # === DATABASES === + + neomem-postgres: + image: ankane/pgvector:v0.5.1 + environment: + - POSTGRES_USER=${POSTGRES_USER} + - POSTGRES_PASSWORD=${POSTGRES_PASSWORD} + - POSTGRES_DB=${POSTGRES_DB} + volumes: + - ./volumes/postgres_data:/var/lib/postgresql/data + ports: + - "5432:5432" + networks: + - lyra_net + + neomem-neo4j: + image: neo4j:5 + environment: + - NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD} + volumes: + - ./volumes/neo4j_data:/data + ports: + - "7474:7474" # Browser UI + - "7687:7687" # Bolt + networks: + - lyra_net + +networks: + lyra_net: + driver: bridge +``` + +--- + +### Starting the System + +```bash +# 1. Clone repository +git clone https://github.com/yourusername/project-lyra.git +cd project-lyra + +# 2. Configure environment +cp .env.example .env +# Edit .env with your LLM backend URLs and API keys + +# 3. Start all services +docker-compose up -d + +# 4. Check health +curl http://localhost:7078/_health +curl http://localhost:7081/health +curl http://localhost:7077/health + +# 5. 
Open UI +open http://localhost:8081 +``` + +--- + +### Monitoring & Logs + +```bash +# View all logs +docker-compose logs -f + +# View specific service +docker-compose logs -f cortex + +# Check resource usage +docker stats + +# Inspect Cortex sessions +curl http://localhost:7081/debug/sessions + +# Check NeoMem memories +curl http://localhost:7077/memories?limit=10 +``` + +--- + +### Scaling Considerations + +#### Current Constraints: + +1. **Single Cortex worker** required (in-memory SESSIONS dict) + - Solution: Migrate SESSIONS to Redis or PostgreSQL + +2. **In-memory session storage** in Relay + - Solution: Use Redis for session persistence + +3. **No load balancing** (single instance of each service) + - Solution: Add nginx reverse proxy + multiple Cortex instances + +#### Horizontal Scaling Plan: + +```yaml +# Future: Redis-backed session storage +cortex: + build: ./cortex + command: uvicorn main:app --workers 4 # Multi-worker + environment: + - REDIS_URL=redis://redis:6379 + depends_on: + - redis + +redis: + image: redis:alpine + ports: + - "6379:6379" +``` + +--- + +### Backup Strategy + +```bash +# Backup PostgreSQL (NeoMem vectors) +docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql + +# Backup Neo4j (NeoMem graph) +docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump + +# Backup Intake sessions (manual export) +curl http://localhost:7081/debug/sessions > backup_sessions.json +``` + +--- + +## Known Issues & Constraints + +### Critical Constraints + +#### 1. Single-Worker Requirement (Cortex) +**Issue:** Cortex must run with `--workers 1` to maintain SESSIONS state +**Impact:** Limited horizontal scalability +**Workaround:** None currently +**Fix:** Migrate SESSIONS to Redis or PostgreSQL +**Priority:** High (blocking scalability) + +#### 2. In-Memory Session Storage (Relay) +**Issue:** Sessions stored in Node.js process memory +**Impact:** Lost on restart, no persistence +**Workaround:** None currently +**Fix:** Use Redis or database +**Priority:** Medium (acceptable for demo) + +--- + +### Non-Critical Issues + +#### 3. RAG Service Disabled +**Status:** Built but commented out in docker-compose.yml +**Impact:** No RAG-based long-term knowledge retrieval +**Workaround:** NeoMem provides semantic search +**Fix:** Re-enable and integrate RAG service +**Priority:** Low (NeoMem sufficient for now) + +#### 4. Partial NeoMem Integration +**Status:** Search implemented, async ingestion planned +**Impact:** Memories not automatically saved +**Workaround:** Manual POST to /memories +**Fix:** Complete async ingestion in Relay +**Priority:** Medium (planned feature) + +#### 5. 
Inner Monologue Observer-Only +**Status:** Stage 0.6 runs but output not used +**Impact:** No adaptive response based on monologue +**Workaround:** None (future feature) +**Fix:** Integrate monologue output into pipeline +**Priority:** Low (experimental feature) + +--- + +### Fixed Issues (v0.5.2) + +✅ **LLM Router Blocking** - Migrated from `requests` to `httpx` for async +✅ **Session ID Case Mismatch** - Standardized to `session_id` +✅ **Missing Backend Parameter** - Added to intake summarization + +--- + +### Deprecated Components + +**Location:** `/DEPRECATED_FILES.md` + +- **Standalone Intake Service** - Now embedded in Cortex +- **Old Relay Backup** - Replaced by current Relay +- **Persona Sidecar** - Built but unused (dynamic persona loading) + +--- + +## Advanced Topics + +### Custom Prompt Engineering + +Each stage uses carefully crafted prompts: + +**Reflection Prompt Example:** +```python +REFLECTION_PROMPT = """ +You are Lyra's reflective awareness layer. +Your job is to analyze the user's message and conversation context +to understand their true intent and needs. + +User message: {user_message} + +Recent context: +{intake_L10_summary} + +Long-term context: +{neomem_top_3_memories} + +Provide concise meta-awareness notes: +- What is the user's underlying intent? +- What topics/themes are emerging? +- What depth of response is appropriate? +- Are there any implicit questions or concerns? + +Keep notes brief (3-5 sentences). Focus on insight, not description. +""" +``` + +--- + +### Extending the Pipeline + +**Adding Stage 5 (Fact-Checking):** + +```python +# /cortex/reasoning/factcheck.py +async def factcheck_answer(answer: str, context: dict) -> dict: + """ + Stage 5: Verify factual claims in answer. + + Returns: + { + "verified": bool, + "flagged_claims": list, + "corrected_answer": str + } + """ + + prompt = f""" + Review this answer for factual accuracy: + + {answer} + + Flag any claims that seem dubious or need verification. + Provide corrected version if needed. + """ + + result = await call_llm(prompt, backend="CLOUD", temperature=0.1) + return parse_factcheck_result(result) + +# Update router.py to include Stage 5 +async def reason_endpoint(request): + # ... existing stages ... + + # Stage 5: Fact-checking + factcheck_result = await factcheck_answer(final_answer, context) + + if not factcheck_result["verified"]: + final_answer = factcheck_result["corrected_answer"] + + return {"answer": final_answer} +``` + +--- + +### Custom LLM Backend Integration + +**Adding Anthropic Claude:** + +```python +# /cortex/llm/llm_router.py + +BACKEND_CONFIGS = { + # ... existing backends ... 
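+    # Illustrative new entry for the Anthropic Messages API
+    # (assumes ANTHROPIC_API_KEY is exported in the environment)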
+
+    "CLAUDE": {
+        "url": "https://api.anthropic.com/v1",
+        "provider": "anthropic",
+        "model": "claude-3-5-sonnet-20241022",
+        "api_key": os.getenv("ANTHROPIC_API_KEY")
+    }
+}
+
+# Add provider-specific logic
+elif backend_config["provider"] == "anthropic":
+    headers = {
+        "x-api-key": api_key,
+        "anthropic-version": "2023-06-01"
+    }
+    payload = {
+        "model": model,
+        "messages": [{"role": "user", "content": prompt}],
+        "max_tokens": max_tokens,
+        "temperature": temperature
+    }
+    response = await httpx_client.post(
+        f"{url}/messages",
+        json=payload,
+        headers=headers,
+        timeout=120
+    )
+    return response.json()["content"][0]["text"]
+```
+
+---
+
+### Performance Optimization
+
+**Caching Strategies:**
+
+```python
+# /cortex/utils/cache.py
+from functools import lru_cache
+import hashlib
+
+@lru_cache(maxsize=128)
+def cache_llm_call(prompt_hash: str, backend: str):
+    """Cache LLM responses for identical prompts"""
+    # Note: Only cache deterministic calls (temperature=0)
+    pass
+
+# Usage in llm_router.py
+async def call_llm(prompt, backend, temperature=0.7, max_tokens=512):
+    if temperature == 0:
+        prompt_hash = hashlib.md5(prompt.encode()).hexdigest()
+        cached = cache_llm_call(prompt_hash, backend)
+        if cached:
+            return cached
+
+    # ... normal LLM call ...
+```
+
+**Database Query Optimization:**
+
+```python
+# /neomem/neomem/database.py
+
+# BAD: Load all memories, then filter
+def search_memories(query):
+    all_memories = db.execute("SELECT * FROM memories")
+    # Expensive in-memory filtering
+    return [m for m in all_memories if similarity(m, query) > 0.8]
+
+# GOOD: Use database indexes and LIMIT
+def search_memories(query, limit=5):
+    query_embedding = embed(query)
+    return db.execute("""
+        SELECT * FROM memories
+        WHERE embedding <=> %s < 0.2  -- pgvector cosine distance
+        ORDER BY embedding <=> %s
+        LIMIT %s
+    """, (query_embedding, query_embedding, limit))
+```
+
+---
+
+## Conclusion
+
+Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through:
+
+1. **Dual-memory architecture** (short-term Intake + long-term NeoMem)
+2. **Multi-stage reasoning pipeline** (Reflection → Reasoning → Refinement → Persona)
+3. **Flexible multi-backend LLM support** (cloud + local with fallback)
+4. **Microservices design** for scalability and maintainability
+5. **Modern web UI** with session management
+
+The system runs end to end with error handling, logging, and health monitoring; the constraints listed under Known Issues (single-worker Cortex, in-memory Relay sessions) are the remaining steps toward full production readiness.
+
+---
+
+## Quick Reference
+
+### Service Ports
+- **UI:** 8081 (Browser interface)
+- **Relay:** 7078 (Main orchestrator)
+- **Cortex:** 7081 (Reasoning engine)
+- **NeoMem:** 7077 (Long-term memory)
+- **PostgreSQL:** 5432 (Vector storage)
+- **Neo4j:** 7474 (Browser), 7687 (Bolt)
+
+### Key Files
+- **Main Entry:** `/core/relay/server.js`
+- **Reasoning Pipeline:** `/cortex/router.py`
+- **LLM Router:** `/cortex/llm/llm_router.py`
+- **Short-term Memory:** `/cortex/intake/intake.py`
+- **Long-term Memory:** `/neomem/neomem/`
+- **Personality:** `/cortex/persona/identity.py`
+
+### Important Commands
+```bash
+# Start system
+docker-compose up -d
+
+# View logs
+docker-compose logs -f cortex
+
+# Debug sessions
+curl http://localhost:7081/debug/sessions
+
+# Health check
+curl http://localhost:7078/_health
+
+# Search memories
+curl "http://localhost:7077/search?query=deployment&limit=5"
+```
+
+---
+
+**Document Version:** 1.0
+**Last Updated:** 2025-12-13
+**Maintained By:** Project Lyra Team