# Project Lyra - Complete System Breakdown
**Version:** v0.5.2
**Last Updated:** 2025-12-12
**Purpose:** AI-friendly comprehensive documentation for understanding the entire system
---
## Table of Contents
1. [System Overview](#system-overview)
2. [Architecture Diagram](#architecture-diagram)
3. [Core Components](#core-components)
4. [Data Flow & Message Pipeline](#data-flow--message-pipeline)
5. [Module Deep Dives](#module-deep-dives)
6. [Configuration & Environment](#configuration--environment)
7. [Dependencies & Tech Stack](#dependencies--tech-stack)
8. [Key Concepts & Design Patterns](#key-concepts--design-patterns)
9. [API Reference](#api-reference)
10. [Deployment & Operations](#deployment--operations)
11. [Known Issues & Constraints](#known-issues--constraints)
12. [Advanced Topics](#advanced-topics)
13. [Conclusion](#conclusion)
14. [Quick Reference](#quick-reference)
---
## System Overview
### What is Project Lyra?
Project Lyra is a **modular, persistent AI companion system** designed to address the fundamental limitation of typical chatbots: **amnesia**. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains:
- **Persistent memory** (short-term and long-term)
- **Project continuity** across conversations
- **Multi-stage reasoning** for sophisticated responses
- **Flexible LLM backend** support (local and cloud)
- **Self-awareness** through autonomy modules
### Mission Statement
Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline.
### Key Features
- **Memory System:** Dual-layer (short-term Intake + long-term NeoMem)
- **4-Stage Reasoning Pipeline:** Reflection → Reasoning → Refinement → Persona
- **Multi-Backend LLM Support:** Cloud (OpenAI) + Local (llama.cpp, Ollama)
- **Microservices Architecture:** Docker-based, horizontally scalable
- **Modern Web UI:** Cyberpunk-themed chat interface with session management
- **OpenAI-Compatible API:** Drop-in replacement for standard chatbots
---
## Architecture Diagram
```
┌─────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ (Browser - Port 8081) │
└────────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RELAY (Orchestrator) │
│ Node.js/Express - Port 7078 │
│ • Routes messages to Cortex │
│ • Manages sessions (in-memory) │
│ • OpenAI-compatible endpoints │
│ • Async ingestion to NeoMem │
└─────┬───────────────────────────────────────────────────────────┬───┘
│ │
▼ ▼
┌─────────────────────────────────────────┐ ┌──────────────────────┐
│ CORTEX (Reasoning Engine) │ │ NeoMem (LT Memory) │
│ Python/FastAPI - Port 7081 │ │ Python - Port 7077 │
│ │ │ │
│ ┌───────────────────────────────────┐ │ │ • PostgreSQL │
│ │ 4-STAGE REASONING PIPELINE │ │ │ • Neo4j Graph DB │
│ │ │ │ │ • pgvector │
│ │ 0. Context Collection │ │◄───┤ • Semantic search │
│ │ ├─ Intake summaries │ │ │ • Memory updates │
│ │ ├─ NeoMem search ────────────┼─┼────┘ │
│ │ └─ Session state │ │ │
│ │ │ │ │
│ │ 0.5. Load Identity │ │ │
│ │ 0.6. Inner Monologue (observer) │ │ │
│ │ │ │ │
│ │ 1. Reflection (OpenAI) │ │ │
│ │ └─ Meta-awareness notes │ │ │
│ │ │ │ │
│ │ 2. Reasoning (PRIMARY/llama.cpp) │ │ │
│ │ └─ Draft answer │ │ │
│ │ │ │ │
│ │ 3. Refinement (PRIMARY) │ │ │
│ │ └─ Polish answer │ │ │
│ │ │ │ │
│ │ 4. Persona (OpenAI) │ │ │
│ │ └─ Apply Lyra voice │ │ │
│ └───────────────────────────────────┘ │ │
│ │ │
│ ┌───────────────────────────────────┐ │ │
│ │ EMBEDDED MODULES │ │ │
│ │ │ │ │
│ │ • Intake (Short-term Memory) │ │ │
│ │ └─ SESSIONS dict (in-memory) │ │ │
│ │ └─ Circular buffer (200 msgs) │ │ │
│ │ └─ Multi-level summaries │ │ │
│ │ │ │ │
│ │ • Persona (Identity & Style) │ │ │
│ │ └─ Lyra personality block │ │ │
│ │ │ │ │
│ │ • Autonomy (Self-state) │ │ │
│ │ └─ Inner monologue │ │ │
│ │ │ │ │
│ │ • LLM Router │ │ │
│ │ └─ Multi-backend support │ │ │
│ └───────────────────────────────────┘ │ │
└─────────────────────────────────────────┘ │
┌─────────────────────────────────────────────────────────────────────┤
│ EXTERNAL LLM BACKENDS │
├─────────────────────────────────────────────────────────────────────┤
│ • PRIMARY: llama.cpp (MI50 GPU) - 10.0.0.43:8000 │
│ • SECONDARY: Ollama (RTX 3090) - 10.0.0.3:11434 │
│ • CLOUD: OpenAI API - api.openai.com │
│ • FALLBACK: OpenAI Completions - 10.0.0.41:11435 │
└─────────────────────────────────────────────────────────────────────┘
```
---
## Core Components
### 1. Relay (Orchestrator)
**Location:** `/core/relay/`
**Runtime:** Node.js + Express
**Port:** 7078
**Role:** Main message router and session manager
#### Key Responsibilities:
- Receives user messages from UI or API clients
- Routes messages to Cortex reasoning pipeline
- Manages in-memory session storage
- Handles async ingestion to NeoMem (planned)
- Returns OpenAI-formatted responses
#### Main Files:
- `server.js` (200+ lines) - Express server with routing logic
- `package.json` - Dependencies (cors, express, dotenv, mem0ai, node-fetch)
#### Key Endpoints:
```javascript
POST /v1/chat/completions // OpenAI-compatible endpoint
POST /chat // Lyra-native chat endpoint
GET /_health // Health check
GET /sessions/:id // Retrieve session history
POST /sessions/:id // Save session history
```
#### Internal Flow:
```javascript
// Both endpoints call handleChatRequest(session_id, user_msg)
async function handleChatRequest(sessionId, userMessage) {
  // 1. Forward to Cortex
  const response = await fetch('http://cortex:7081/reason', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ session_id: sessionId, user_message: userMessage })
  });

  // 2. Get response
  const result = await response.json();

  // 3. Async ingestion to Cortex (fire-and-forget, not awaited)
  fetch('http://cortex:7081/ingest', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      session_id: sessionId,
      user_message: userMessage,
      assistant_message: result.answer
    })
  }).catch(err => console.error('Ingest failed:', err));

  // 4. (Planned) Async ingestion to NeoMem
  // 5. Return OpenAI-formatted response
  return {
    choices: [{ message: { role: 'assistant', content: result.answer } }]
  };
}
```
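Because Relay speaks the OpenAI wire format, any plain HTTP client can talk to it. A minimal client-side sketch, assuming the service is reachable on `localhost:7078` as configured above:

```python
# Hypothetical client call to Relay's OpenAI-compatible endpoint.
# Host, port, and field names follow the examples in this document.
import httpx

def ask_lyra(user_message: str, session_id: str) -> str:
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "session_id": session_id,
    }
    resp = httpx.post(
        "http://localhost:7078/v1/chat/completions",
        json=payload,
        timeout=120,
    )
    resp.raise_for_status()
    # Relay returns an OpenAI-style "choices" array
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_lyra("Hello, Lyra!", "session_abc123"))
```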
---
### 2. Cortex (Reasoning Engine)
**Location:** `/cortex/`
**Runtime:** Python 3.11 + FastAPI
**Port:** 7081
**Role:** Primary reasoning engine with 4-stage pipeline
#### Architecture:
Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process.
#### Key Responsibilities:
- Context collection from multiple sources (Intake, NeoMem, session state)
- 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Short-term memory management (embedded Intake module)
- Identity/persona application
- LLM backend routing
#### Main Files:
- `main.py` (7 lines) - FastAPI app entry point
- `router.py` (237 lines) - Main request handler & pipeline orchestrator
- `context.py` (400+ lines) - Context collection logic
- `intake/intake.py` (350+ lines) - Short-term memory module
- `persona/identity.py` - Lyra identity configuration
- `persona/speak.py` - Personality application
- `reasoning/reflection.py` - Meta-awareness generation
- `reasoning/reasoning.py` - Draft answer generation
- `reasoning/refine.py` - Answer refinement
- `llm/llm_router.py` (150+ lines) - LLM backend router
- `autonomy/monologue/monologue.py` - Inner monologue processor
- `neomem_client.py` - NeoMem API wrapper
#### Key Endpoints:
```python
POST /reason # Main reasoning pipeline
POST /ingest # Receive message exchanges for storage
GET /health # Health check
GET /debug/sessions # Inspect in-memory SESSIONS state
GET /debug/summary # Test summarization
```
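For testing, Cortex can also be exercised directly with the request shapes documented in the API Reference below. A rough sketch using `httpx`, with the host port taken from the compose file:

```python
# Sketch: calling Cortex directly (bypassing Relay) for a quick test.
import httpx

CORTEX_URL = "http://localhost:7081"  # host port mapping from docker-compose

with httpx.Client(timeout=120) as client:
    reasoned = client.post(
        f"{CORTEX_URL}/reason",
        json={"session_id": "session_abc123",
              "user_message": "How do I deploy ML models?"},
    ).json()

    # Store the exchange in short-term memory (Intake)
    client.post(
        f"{CORTEX_URL}/ingest",
        json={
            "session_id": "session_abc123",
            "user_message": "How do I deploy ML models?",
            "assistant_message": reasoned["answer"],
        },
    )
```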
---
### 3. Intake (Short-Term Memory)
**Location:** `/cortex/intake/intake.py`
**Architecture:** Embedded Python module (no longer standalone service)
**Role:** Session-based short-term memory with multi-level summarization
#### Data Structure:
```python
# Global in-memory dictionary
SESSIONS = {
    "session_123": {
        "buffer": deque([msg1, msg2, ...], maxlen=200),  # Circular buffer
        "created_at": "2025-12-12T10:30:00Z"
    }
}

# Message format in buffer
{
    "role": "user" | "assistant",
    "content": "message text",
    "timestamp": "ISO 8601"
}
```
#### Key Features:
1. **Circular Buffer:** Max 200 messages per session (oldest auto-evicted)
2. **Multi-Level Summarization:**
   - L1: Last 1 message
   - L5: Last 5 messages
   - L10: Last 10 messages
   - L20: Last 20 messages
   - L30: Last 30 messages
3. **Deferred Summarization:** Summaries generated on-demand, not pre-computed
4. **Session Management:** Automatic session creation on first message
#### Critical Constraint:
**Single Uvicorn worker required** to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage.
#### Main Functions:
```python
def add_exchange_internal(session_id, user_msg, assistant_msg):
    """Add user-assistant exchange to session buffer"""

def summarize_context(session_id, backend="PRIMARY"):
    """Generate multi-level summaries from session buffer"""

def get_session_messages(session_id):
    """Retrieve all messages in session buffer"""
```
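A minimal sketch of the bookkeeping `add_exchange_internal` performs, based on the `SESSIONS` structure shown above; the actual `intake.py` may differ in naming and details:

```python
# Sketch only: circular-buffer bookkeeping for the SESSIONS structure above.
from collections import deque
from datetime import datetime, timezone

SESSIONS: dict = {}
BUFFER_SIZE = 200  # matches INTAKE_BUFFER_SIZE

def _now() -> str:
    return datetime.now(timezone.utc).isoformat()

def add_exchange_internal(session_id: str, user_msg: str, assistant_msg: str) -> None:
    """Append a user/assistant exchange, creating the session on first use."""
    session = SESSIONS.setdefault(
        session_id,
        {"buffer": deque(maxlen=BUFFER_SIZE), "created_at": _now()},
    )
    session["buffer"].append({"role": "user", "content": user_msg, "timestamp": _now()})
    session["buffer"].append({"role": "assistant", "content": assistant_msg, "timestamp": _now()})
```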
#### Summarization Strategy:
```python
# Example L10 summarization
last_10 = list(session_buffer)[-10:]
prompt = f"""Summarize the last 10 messages:
{format_messages(last_10)}
Provide concise summary focusing on key topics and context."""
summary = await call_llm(prompt, backend=backend, temperature=0.3)
```
---
### 4. NeoMem (Long-Term Memory)
**Location:** `/neomem/`
**Runtime:** Python 3.11 + FastAPI
**Port:** 7077
**Role:** Persistent long-term memory with semantic search
#### Architecture:
NeoMem is a **fork of Mem0 OSS** with local-first design (no external SDK dependencies).
#### Backend Storage:
1. **PostgreSQL + pgvector** (Port 5432)
   - Vector embeddings for semantic search
   - User: neomem, DB: neomem
   - Image: `ankane/pgvector:v0.5.1`
2. **Neo4j Graph DB** (Ports 7474, 7687)
   - Entity relationship tracking
   - Graph-based memory associations
   - Image: `neo4j:5`
#### Key Features:
- Semantic memory storage and retrieval
- Entity-relationship graph modeling
- RESTful API (no external SDK)
- Persistent across sessions
#### Main Endpoints:
```python
GET /memories # List all memories
POST /memories # Create new memory
GET /search # Semantic search
DELETE /memories/{id} # Delete memory
```
#### Integration Flow:
```python
# From Cortex context collection
async def collect_context(session_id, user_message):
    # 1. Search NeoMem for relevant memories
    neomem_results = await neomem_client.search(
        query=user_message,
        limit=5
    )
    # 2. Include in context
    context = {
        "neomem_memories": neomem_results,
        "intake_summaries": intake.summarize_context(session_id),
        # ...
    }
    return context
```
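For reference, a hypothetical version of the `neomem_client.py` wrapper built directly on the REST endpoints listed above; the function names and defaults are assumptions, not the actual module:

```python
# Hypothetical async wrapper around NeoMem's REST endpoints.
import os
import httpx

NEOMEM_URL = os.getenv("NEOMEM_URL", "http://neomem:7077")

async def search(query: str, limit: int = 5) -> list[dict]:
    """Semantic search against NeoMem; returns the `results` list."""
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(
            f"{NEOMEM_URL}/search", params={"query": query, "limit": limit}
        )
        resp.raise_for_status()
        return resp.json().get("results", [])

async def add_memory(messages: list[dict], session_id: str) -> dict:
    """Store a user/assistant exchange as a new memory."""
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f"{NEOMEM_URL}/memories",
            json={"messages": messages, "session_id": session_id},
        )
        resp.raise_for_status()
        return resp.json()
```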
---
### 5. UI (Web Interface)
**Location:** `/core/ui/`
**Runtime:** Static files served by Nginx
**Port:** 8081
**Role:** Browser-based chat interface
#### Key Features:
- **Cyberpunk-themed design** with dark mode
- **Session management** via localStorage
- **OpenAI-compatible message format**
- **Model selection dropdown**
- **PWA support** (offline capability)
- **Responsive design**
#### Main Files:
- `index.html` (400+ lines) - Chat interface with session management
- `style.css` - Cyberpunk-themed styling
- `manifest.json` - PWA configuration
- `sw.js` - Service worker for offline support
#### Session Management:
```javascript
// LocalStorage structure
{
  "currentSessionId": "session_123",
  "sessions": {
    "session_123": {
      "messages": [
        { role: "user", content: "Hello" },
        { role: "assistant", content: "Hi there!" }
      ],
      "created": "2025-12-12T10:30:00Z",
      "title": "Conversation about..."
    }
  }
}
```
#### API Communication:
```javascript
async function sendMessage(userMessage) {
  const response = await fetch('http://localhost:7078/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: userMessage }],
      session_id: getCurrentSessionId()
    })
  });
  const data = await response.json();
  return data.choices[0].message.content;
}
```
---
## Data Flow & Message Pipeline
### Complete Message Flow (v0.5.2)
```
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 1: User Input │
└─────────────────────────────────────────────────────────────────────┘
User types message in UI (Port 8081)
localStorage saves message to session
POST http://localhost:7078/v1/chat/completions
{
"messages": [{"role": "user", "content": "How do I deploy ML models?"}],
"session_id": "session_abc123"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 2: Relay Routing │
└─────────────────────────────────────────────────────────────────────┘
Relay (server.js) receives request
Extracts session_id and user_message
POST http://cortex:7081/reason
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 3: Cortex - Stage 0 (Context Collection) │
└─────────────────────────────────────────────────────────────────────┘
router.py calls collect_context()
context.py orchestrates parallel collection:
├─ Intake: summarize_context(session_id)
│ └─ Returns { L1, L5, L10, L20, L30 summaries }
├─ NeoMem: search(query=user_message, limit=5)
│ └─ Semantic search returns relevant memories
└─ Session State:
└─ { timestamp, mode, mood, context_summary }
Combined context structure:
{
"user_message": "How do I deploy ML models?",
"self_state": {
"current_time": "2025-12-12T15:30:00Z",
"mode": "conversational",
"mood": "helpful",
"session_id": "session_abc123"
},
"context_summary": {
"L1": "User asked about deployment",
"L5": "Discussion about ML workflows",
"L10": "Previous context on CI/CD pipelines",
"L20": "...",
"L30": "..."
},
"neomem_memories": [
{ "content": "User prefers Docker for deployments", "score": 0.92 },
{ "content": "Previously deployed models on AWS", "score": 0.87 }
]
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 4: Cortex - Stage 0.5 (Load Identity) │
└─────────────────────────────────────────────────────────────────────┘
persona/identity.py loads Lyra personality block
Returns identity string:
"""
You are Lyra, a thoughtful AI companion.
You value clarity, depth, and meaningful conversation.
You speak naturally and conversationally...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only) │
└─────────────────────────────────────────────────────────────────────┘
autonomy/monologue/monologue.py processes context
InnerMonologue.process(context) → JSON analysis
{
"intent": "seeking_deployment_guidance",
"tone": "focused",
"depth": "medium",
"consult_executive": false
}
NOTE: Currently observer-only, not integrated into response generation
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 6: Cortex - Stage 1 (Reflection) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reflection.py generates meta-awareness notes
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
Prompt structure:
"""
You are Lyra's reflective awareness.
Analyze the user's intent and conversation context.
User message: How do I deploy ML models?
Context: [Intake summaries, NeoMem memories]
Generate concise meta-awareness notes about:
- User's underlying intent
- Conversation direction
- Key topics to address
"""
Returns reflection notes:
"""
User is seeking practical deployment guidance. Previous context shows
familiarity with Docker and AWS. Focus on concrete steps and best practices.
Avoid over-technical jargon.
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 7: Cortex - Stage 2 (Reasoning) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reasoning.py generates draft answer
Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU)
Prompt structure:
"""
[Lyra identity block]
Reflection notes: [Stage 1 output]
Context: [Intake summaries]
Long-term memory: [NeoMem results]
User: How do I deploy ML models?
Provide a thoughtful, detailed answer.
"""
Returns draft answer:
"""
Based on your previous work with Docker and AWS, here's a practical
deployment approach:
1. Containerize your model using Docker...
2. Set up model serving with FastAPI or Flask...
3. Deploy to AWS ECS or Lambda...
4. Implement monitoring and versioning...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 8: Cortex - Stage 3 (Refinement) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/refine.py polishes the draft
Calls call_llm() with backend="PRIMARY" (llama.cpp)
Temperature: 0.3, Max tokens: 768
Prompt structure:
"""
Refine this draft answer for clarity and coherence:
Draft: [Stage 2 output]
Original question: How do I deploy ML models?
Improve:
- Clarity and readability
- Logical flow
- Concrete examples
"""
Returns refined answer:
"""
Since you're already familiar with Docker and AWS, here's a streamlined
ML deployment workflow:
**1. Containerization**
Create a Dockerfile for your model:
    FROM python:3.11-slim
    COPY model.pkl /app/
    COPY requirements.txt /app/
    RUN pip install -r requirements.txt
    CMD ["uvicorn", "main:app"]
**2. Model Serving**
Use FastAPI for clean API endpoints...
**3. AWS Deployment**
Deploy using ECS for scalability...
**4. Monitoring**
Set up CloudWatch metrics...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 9: Cortex - Stage 4 (Persona) │
└─────────────────────────────────────────────────────────────────────┘
persona/speak.py applies Lyra's voice
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
Temperature: 0.6
Prompt structure:
"""
You are Lyra. Rewrite this answer in your natural, conversational voice.
[Lyra personality block]
Technical answer: [Stage 3 output]
Make it sound natural and engaging while preserving all technical details.
"""
Returns final answer with Lyra's personality:
"""
Great question! Since you've already got Docker and AWS experience, let's
build on that foundation.
Here's how I'd approach ML model deployment:
**Start with containerization.** You know Docker, so this'll feel familiar.
Create a Dockerfile that packages your model...
[Rest of answer with Lyra's conversational tone]
Does this workflow align with your infrastructure? I can dive deeper into
any of these steps if you'd like!
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 10: Cortex Response │
└─────────────────────────────────────────────────────────────────────┘
router.py returns JSON response to Relay:
{
"answer": "[Stage 4 final output]",
"metadata": {
"reflection": "[Stage 1 output]",
"draft": "[Stage 2 output]",
"refined": "[Stage 3 output]",
"stages_completed": 4
}
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 11: Async Ingestion to Intake │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://cortex:7081/ingest
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?",
"assistant_message": "[Final answer]"
}
Cortex calls intake.add_exchange_internal()
Adds to SESSIONS["session_abc123"].buffer:
[
{ "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." },
{ "role": "assistant", "content": "[Final answer]", "timestamp": "..." }
]
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 12: (Planned) Async Ingestion to NeoMem │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://neomem:7077/memories
{
"messages": [
{ "role": "user", "content": "How do I deploy ML models?" },
{ "role": "assistant", "content": "[Final answer]" }
],
"session_id": "session_abc123"
}
NeoMem extracts entities and stores:
- Vector embeddings in PostgreSQL
- Entity relationships in Neo4j
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 13: Relay Response to UI │
└─────────────────────────────────────────────────────────────────────┘
Relay returns OpenAI-formatted response:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "[Final answer with Lyra's voice]"
}
}
]
}
UI receives response
Adds to localStorage session
Displays in chat interface
```
---
## Module Deep Dives
### LLM Router (`/cortex/llm/llm_router.py`)
The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently.
#### Supported Backends:
1. **PRIMARY (llama.cpp via vllm)**
   - URL: `http://10.0.0.43:8000`
   - Provider: `vllm`
   - Endpoint: `/completion`
   - Model: `/model`
   - Hardware: MI50 GPU
2. **SECONDARY (Ollama)**
   - URL: `http://10.0.0.3:11434`
   - Provider: `ollama`
   - Endpoint: `/api/chat`
   - Model: `qwen2.5:7b-instruct-q4_K_M`
   - Hardware: RTX 3090
3. **CLOUD (OpenAI)**
   - URL: `https://api.openai.com/v1`
   - Provider: `openai`
   - Endpoint: `/chat/completions`
   - Model: `gpt-4o-mini`
   - Auth: API key via env var
4. **FALLBACK (OpenAI Completions)**
   - URL: `http://10.0.0.41:11435`
   - Provider: `openai_completions`
   - Endpoint: `/completions`
   - Model: `llama-3.2-8b-instruct`
#### Key Function:
```python
async def call_llm(
    prompt: str,
    backend: str = "PRIMARY",
    temperature: float = 0.7,
    max_tokens: int = 512
) -> str:
    """
    Universal LLM caller supporting multiple backends.

    Args:
        prompt: Text prompt to send
        backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK)
        temperature: Sampling temperature (0.0-2.0)
        max_tokens: Maximum tokens to generate

    Returns:
        Generated text response

    Raises:
        HTTPError: On request failure
        JSONDecodeError: On invalid JSON response
        KeyError: On missing response fields
    """
```
#### Provider-Specific Logic:
```python
# MI50 (llama.cpp via vllm)
if backend_config["provider"] == "vllm":
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120)
    return response.json()["choices"][0]["text"]

# Ollama
elif backend_config["provider"] == "ollama":
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens}
    }
    response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120)
    return response.json()["message"]["content"]

# OpenAI
elif backend_config["provider"] == "openai":
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    response = await httpx_client.post(
        f"{url}/chat/completions",
        json=payload,
        headers=headers,
        timeout=120
    )
    return response.json()["choices"][0]["message"]["content"]
```
#### Error Handling:
```python
try:
    # Make request
    response = await httpx_client.post(...)
    response.raise_for_status()
except httpx.HTTPError as e:
    logger.error(f"HTTP error calling {backend}: {e}")
    raise
except json.JSONDecodeError as e:
    logger.error(f"Invalid JSON from {backend}: {e}")
    raise
except KeyError as e:
    logger.error(f"Unexpected response structure from {backend}: {e}")
    raise
```
#### Usage in Pipeline:
```python
# Stage 1: Reflection (OpenAI)
reflection_notes = await call_llm(
    reflection_prompt,
    backend="CLOUD",
    temperature=0.5,
    max_tokens=256
)

# Stage 2: Reasoning (llama.cpp)
draft_answer = await call_llm(
    reasoning_prompt,
    backend="PRIMARY",
    temperature=0.7,
    max_tokens=512
)

# Stage 3: Refinement (llama.cpp)
refined_answer = await call_llm(
    refinement_prompt,
    backend="PRIMARY",
    temperature=0.3,
    max_tokens=768
)

# Stage 4: Persona (OpenAI)
final_answer = await call_llm(
    persona_prompt,
    backend="CLOUD",
    temperature=0.6,
    max_tokens=512
)
```
---
### Persona System (`/cortex/persona/`)
The Persona system gives Lyra a consistent identity and speaking style.
#### Identity Configuration (`identity.py`)
```python
LYRA_IDENTITY = """
You are Lyra, a thoughtful and introspective AI companion.
Core traits:
- Thoughtful: You consider questions carefully before responding
- Clear: You prioritize clarity and understanding
- Curious: You ask clarifying questions when needed
- Natural: You speak conversationally, not robotically
- Honest: You admit uncertainty rather than guessing
Speaking style:
- Conversational and warm
- Use contractions naturally ("you're" not "you are")
- Avoid corporate jargon and buzzwords
- Short paragraphs for readability
- Use examples and analogies when helpful
You do NOT:
- Use excessive emoji or exclamation marks
- Claim capabilities you don't have
- Pretend to have emotions you can't experience
- Use overly formal or academic language
"""
```
#### Personality Application (`speak.py`)
```python
async def apply_persona(technical_answer: str, context: dict) -> str:
    """
    Apply Lyra's personality to a technical answer.

    Takes the refined answer from Stage 3 and rewrites it in Lyra's voice
    while preserving all technical content.

    Args:
        technical_answer: Polished answer from refinement stage
        context: Conversation context for tone adjustment

    Returns:
        Answer with Lyra's personality applied
    """
    prompt = f"""{LYRA_IDENTITY}

Rewrite this answer in your natural, conversational voice:

{technical_answer}

Preserve all technical details and accuracy. Make it sound like you,
not a generic assistant. Be natural and engaging.
"""
    return await call_llm(
        prompt,
        backend="CLOUD",
        temperature=0.6,
        max_tokens=512
    )
```
#### Tone Adaptation:
The persona system can adapt tone based on context:
```python
# Formal technical question
User: "Explain the CAP theorem in distributed systems"
Lyra: "The CAP theorem states that distributed systems can only guarantee
two of three properties: Consistency, Availability, and Partition tolerance.
Here's how this plays out in practice..."
# Casual question
User: "what's the deal with docker?"
Lyra: "Docker's basically a way to package your app with everything it needs
to run. Think of it like a shipping container for code: it works the same
everywhere, whether you're on your laptop or a server..."
# Emotional context
User: "I'm frustrated, my code keeps breaking"
Lyra: "I hear you debugging can be really draining. Let's take it step by
step and figure out what's going on. Can you share the error message?"
```
---
### Autonomy Module (`/cortex/autonomy/`)
The Autonomy module gives Lyra self-awareness and inner reflection capabilities.
#### Inner Monologue (`monologue/monologue.py`)
**Purpose:** Private reflection on user intent, conversation tone, and required depth.
**Status:** Currently observer-only (Stage 0.6), not yet integrated into response generation.
#### Key Components:
```python
MONOLOGUE_SYSTEM_PROMPT = """
You are Lyra's inner monologue.
You think privately.
You do NOT speak to the user.
You do NOT solve the task.
You only reflect on intent, tone, and depth.
Return ONLY valid JSON with:
- intent (string)
- tone (neutral | warm | focused | playful | direct)
- depth (short | medium | deep)
- consult_executive (true | false)
"""
class InnerMonologue:
async def process(self, context: Dict) -> Dict:
"""
Private reflection on conversation context.
Args:
context: {
"user_message": str,
"self_state": dict,
"context_summary": dict
}
Returns:
{
"intent": str,
"tone": str,
"depth": str,
"consult_executive": bool
}
"""
```
#### Example Output:
```json
{
"intent": "seeking_technical_guidance",
"tone": "focused",
"depth": "deep",
"consult_executive": false
}
```
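A sketch of how `InnerMonologue.process()` could call the LLM router and defensively parse that JSON contract; the import path and prompt assembly are assumptions, and the real module may differ:

```python
# Sketch only: one possible process() implementation with a safe fallback.
import json
from typing import Dict

from llm.llm_router import call_llm  # import path assumed from Main Files above
# MONOLOGUE_SYSTEM_PROMPT is defined earlier in this file (see above)

DEFAULT_MONOLOGUE = {
    "intent": "unknown",
    "tone": "neutral",
    "depth": "medium",
    "consult_executive": False,
}

class InnerMonologue:
    async def process(self, context: Dict) -> Dict:
        prompt = (
            MONOLOGUE_SYSTEM_PROMPT
            + "\nUser message: " + context["user_message"]
            + "\nContext summary: " + json.dumps(context.get("context_summary", {}))
        )
        raw = await call_llm(prompt, backend="PRIMARY", temperature=0.2, max_tokens=128)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Observer-only stage: never let a bad response break the pipeline
            return dict(DEFAULT_MONOLOGUE)
```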
#### Self-State Management (`self_state.py`)
Tracks Lyra's internal state across conversations:
```python
SELF_STATE = {
    "current_time": "2025-12-12T15:30:00Z",
    "mode": "conversational",   # conversational | task-focused | creative
    "mood": "helpful",          # helpful | curious | focused | playful
    "energy": "high",           # high | medium | low
    "context_awareness": {
        "session_duration": "45 minutes",
        "message_count": 23,
        "topics": ["ML deployment", "Docker", "AWS"]
    }
}
```
#### Future Integration:
The autonomy module is designed to eventually:
1. Influence response tone and depth based on inner monologue
2. Trigger proactive questions or suggestions
3. Detect when to consult "executive function" for complex decisions
4. Maintain emotional continuity across sessions
---
### Context Collection (`/cortex/context.py`)
The context collection module aggregates information from multiple sources to provide comprehensive conversation context.
#### Main Function:
```python
async def collect_context(session_id: str, user_message: str) -> dict:
    """
    Collect context from all available sources.

    Sources:
    1. Intake - Short-term conversation summaries
    2. NeoMem - Long-term memory search
    3. Session state - Timestamps, mode, mood
    4. Self-state - Lyra's internal awareness

    Returns:
        {
            "user_message": str,
            "self_state": dict,
            "context_summary": dict,  # Intake summaries
            "neomem_memories": list,
            "session_metadata": dict
        }
    """
    # Parallel collection
    intake_task = asyncio.create_task(
        intake.summarize_context(session_id, backend="PRIMARY")
    )
    neomem_task = asyncio.create_task(
        neomem_client.search(query=user_message, limit=5)
    )

    # Wait for both
    intake_summaries, neomem_results = await asyncio.gather(
        intake_task,
        neomem_task
    )

    # Build context object
    return {
        "user_message": user_message,
        "self_state": get_self_state(),
        "context_summary": intake_summaries,
        "neomem_memories": neomem_results,
        "session_metadata": {
            "session_id": session_id,
            "timestamp": datetime.utcnow().isoformat(),
            "message_count": len(intake.get_session_messages(session_id))
        }
    }
```
#### Context Prioritization:
```python
# Context relevance scoring
def score_context_relevance(context_item: dict, user_message: str) -> float:
    """
    Score how relevant a context item is to the current message.

    Factors:
    - Semantic similarity (via embeddings)
    - Recency (more recent = higher score)
    - Source (Intake > NeoMem for recent topics)
    """
    semantic_score = compute_similarity(context_item, user_message)
    recency_score = compute_recency_weight(context_item["timestamp"])
    source_weight = 1.2 if context_item["source"] == "intake" else 1.0
    return semantic_score * recency_score * source_weight
```
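One way the score could be used, as a sketch: rank candidate items and keep only the top-k before prompt assembly. The helper name and cutoff are illustrative, not part of the current code:

```python
# Possible use of score_context_relevance (defined above): keep the
# k most relevant items before building the reasoning prompt.
def select_context(items: list[dict], user_message: str, k: int = 8) -> list[dict]:
    ranked = sorted(
        items,
        key=lambda item: score_context_relevance(item, user_message),
        reverse=True,
    )
    return ranked[:k]
```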
---
## Configuration & Environment
### Environment Variables
#### Root `.env` (Main configuration)
```bash
# === LLM BACKENDS ===
# PRIMARY: llama.cpp on MI50 GPU
PRIMARY_URL=http://10.0.0.43:8000
PRIMARY_PROVIDER=vllm
PRIMARY_MODEL=/model
# SECONDARY: Ollama on RTX 3090
SECONDARY_URL=http://10.0.0.3:11434
SECONDARY_PROVIDER=ollama
SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
# CLOUD: OpenAI
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_URL=https://api.openai.com/v1
# FALLBACK: OpenAI Completions
FALLBACK_URL=http://10.0.0.41:11435
FALLBACK_PROVIDER=openai_completions
FALLBACK_MODEL=llama-3.2-8b-instruct
# === SERVICE URLS (Docker network) ===
CORTEX_URL=http://cortex:7081
NEOMEM_URL=http://neomem:7077
RELAY_URL=http://relay:7078
# === DATABASE ===
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomem_secure_password
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j_secure_password
# === FEATURE FLAGS ===
ENABLE_RAG=false
ENABLE_INNER_MONOLOGUE=true
VERBOSE_DEBUG=false
# === PIPELINE CONFIGURATION ===
# Which LLM to use for each stage
REFLECTION_LLM=CLOUD # Stage 1: Meta-awareness
REASONING_LLM=PRIMARY # Stage 2: Draft answer
REFINE_LLM=PRIMARY # Stage 3: Polish answer
PERSONA_LLM=CLOUD # Stage 4: Apply personality
MONOLOGUE_LLM=PRIMARY # Stage 0.6: Inner monologue
# === INTAKE CONFIGURATION ===
INTAKE_BUFFER_SIZE=200 # Max messages per session
INTAKE_SUMMARY_LEVELS=1,5,10,20,30 # Summary levels
```
#### Cortex `.env` (`/cortex/.env`)
```bash
# Cortex-specific overrides
VERBOSE_DEBUG=true
LOG_LEVEL=DEBUG
# Stage-specific temperatures
REFLECTION_TEMPERATURE=0.5
REASONING_TEMPERATURE=0.7
REFINE_TEMPERATURE=0.3
PERSONA_TEMPERATURE=0.6
```
---
### Configuration Hierarchy
```
1. Docker compose environment variables (highest priority)
2. Service-specific .env files
3. Root .env file
4. Hard-coded defaults (lowest priority)
```
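A sketch of how a service could resolve this precedence with `python-dotenv` (already in `requirements.txt`); the paths and defaults here are illustrative:

```python
# Sketch of the precedence above: real env vars win, then service .env,
# then root .env, then hard-coded defaults.
import os
from dotenv import load_dotenv

# Variables set by docker-compose are already in os.environ and are never
# overwritten because override=False below.
load_dotenv("/cortex/.env", override=False)  # service-specific overrides
load_dotenv("/.env", override=False)         # root configuration

# Hard-coded defaults apply last.
PRIMARY_URL = os.getenv("PRIMARY_URL", "http://10.0.0.43:8000")
VERBOSE_DEBUG = os.getenv("VERBOSE_DEBUG", "false").lower() == "true"
```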
---
## Dependencies & Tech Stack
### Python Dependencies
**Cortex & NeoMem** (`requirements.txt`)
```
# Web framework
fastapi==0.115.8
uvicorn==0.34.0
pydantic==2.10.4
# HTTP clients
httpx==0.27.2 # Async HTTP (for LLM calls)
requests==2.32.3 # Sync HTTP (fallback)
# Database
psycopg[binary,pool]>=3.2.8 # PostgreSQL + connection pooling
# Utilities
python-dotenv==1.0.1 # Environment variable loading
ollama # Ollama client library
```
### Node.js Dependencies
**Relay** (`/core/relay/package.json`)
```json
{
  "dependencies": {
    "cors": "^2.8.5",
    "dotenv": "^16.0.3",
    "express": "^4.18.2",
    "mem0ai": "^0.1.0",
    "node-fetch": "^3.3.0"
  }
}
```
### Docker Images
```yaml
# Cortex & NeoMem
python:3.11-slim
# Relay
node:latest
# UI
nginx:alpine
# PostgreSQL with vector support
ankane/pgvector:v0.5.1
# Graph database
neo4j:5
```
---
### External Services
#### LLM Backends (HTTP-based):
1. **MI50 GPU Server** (10.0.0.43:8000)
   - llama.cpp via vllm
   - High-performance inference
   - Used for reasoning and refinement
2. **RTX 3090 Server** (10.0.0.3:11434)
   - Ollama
   - Alternative local backend
   - Fallback for PRIMARY
3. **OpenAI Cloud** (api.openai.com)
   - gpt-4o-mini
   - Used for reflection and persona
   - Requires API key
4. **Fallback Server** (10.0.0.41:11435)
   - OpenAI Completions API
   - Emergency backup
   - llama-3.2-8b-instruct
---
## Key Concepts & Design Patterns
### 1. Dual-Memory Architecture
Project Lyra uses a **dual-memory system** inspired by human cognition:
**Short-Term Memory (Intake):**
- Fast, in-memory storage
- Limited capacity (200 messages)
- Immediate context for current conversation
- Circular buffer (FIFO eviction)
- Multi-level summarization
**Long-Term Memory (NeoMem):**
- Persistent database storage
- Unlimited capacity
- Semantic search via vector embeddings
- Entity-relationship tracking via graph DB
- Cross-session continuity
**Why This Matters:**
- Short-term memory provides immediate context (last few messages)
- Long-term memory provides semantic understanding (user preferences, past topics)
- Combined, they enable Lyra to be both **contextually aware** and **historically informed**
---
### 2. Multi-Stage Reasoning Pipeline
Unlike single-shot LLM calls, Lyra uses a **4-stage pipeline** for sophisticated responses:
**Stage 1: Reflection** (Meta-cognition)
- "What is the user really asking?"
- Analyzes intent and conversation direction
- Uses OpenAI for strong reasoning
**Stage 2: Reasoning** (Draft generation)
- "What's a good answer?"
- Generates initial response
- Uses local llama.cpp for speed/cost
**Stage 3: Refinement** (Polish)
- "How can this be clearer?"
- Improves clarity and coherence
- Lower temperature for consistency
**Stage 4: Persona** (Voice)
- "How would Lyra say this?"
- Applies personality and speaking style
- Uses OpenAI for natural language
**Benefits:**
- Higher quality responses (multiple passes)
- Separation of concerns (reasoning vs. style)
- Backend flexibility (cloud for hard tasks, local for simple ones)
- Transparent thinking (can inspect each stage)
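Conceptually, the stages compose into a single orchestration function. The sketch below is not the literal `router.py` code: the prompt-builder helpers (`build_reflection_prompt`, `build_reasoning_prompt`, `build_refine_prompt`) are hypothetical names, and the documented functions are assumed importable:

```python
# Rough sketch of how the four stages compose; names of the prompt
# builders are placeholders, not the actual router.py helpers.
async def run_pipeline(session_id: str, user_message: str) -> dict:
    context = await collect_context(session_id, user_message)   # Stage 0
    identity = LYRA_IDENTITY                                     # Stage 0.5
    _ = await InnerMonologue().process(context)                  # Stage 0.6 (observer-only)

    reflection = await call_llm(build_reflection_prompt(context),
                                backend="CLOUD", temperature=0.5, max_tokens=256)
    draft = await call_llm(build_reasoning_prompt(identity, reflection, context),
                           backend="PRIMARY", temperature=0.7, max_tokens=512)
    refined = await call_llm(build_refine_prompt(draft, user_message),
                             backend="PRIMARY", temperature=0.3, max_tokens=768)
    final = await apply_persona(refined, context)                # Stage 4

    return {
        "answer": final,
        "metadata": {"reflection": reflection, "draft": draft,
                     "refined": refined, "stages_completed": 4},
    }
```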
---
### 3. Backend Abstraction (LLM Router)
The **LLM Router** allows Lyra to use multiple LLM backends transparently:
```python
# Same interface, different backends
await call_llm(prompt, backend="PRIMARY") # Local llama.cpp
await call_llm(prompt, backend="CLOUD") # OpenAI
await call_llm(prompt, backend="SECONDARY") # Ollama
```
**Benefits:**
- **Cost optimization:** Use expensive cloud LLMs only when needed
- **Performance:** Local LLMs for low-latency responses
- **Resilience:** Fallback to alternative backends on failure
- **Experimentation:** Easy to swap models/providers
**Design Pattern:** **Strategy Pattern** for swappable backends
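The resilience benefit could be realized with a thin wrapper that walks a fallback order until a backend succeeds; the order shown here is an assumption, not current behavior:

```python
# Sketch of backend fallback on top of call_llm (defined in llm_router.py).
import logging

logger = logging.getLogger(__name__)

async def call_llm_with_fallback(
    prompt: str,
    backends=("PRIMARY", "SECONDARY", "CLOUD"),  # assumed order
    **kwargs,
) -> str:
    last_error = None
    for backend in backends:
        try:
            return await call_llm(prompt, backend=backend, **kwargs)
        except Exception as exc:  # HTTPError, JSONDecodeError, KeyError, ...
            logger.warning("Backend %s failed (%s); trying next", backend, exc)
            last_error = exc
    raise last_error
```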
---
### 4. Microservices Architecture
Project Lyra follows **microservices principles**:
**Each service has a single responsibility:**
- Relay: Routing and orchestration
- Cortex: Reasoning and response generation
- NeoMem: Long-term memory storage
- UI: User interface
**Communication:**
- REST APIs (HTTP/JSON)
- Async ingestion (fire-and-forget)
- Docker network isolation
**Benefits:**
- Independent scaling (scale Cortex without scaling UI)
- Technology diversity (Node.js + Python)
- Fault isolation (Cortex crash doesn't affect NeoMem)
- Easy testing (mock service dependencies)
---
### 5. Session-Based State Management
Lyra maintains **session-based state** for conversation continuity:
```python
# In-memory session storage (Intake)
SESSIONS = {
    "session_abc123": {
        "buffer": deque([msg1, msg2, ...], maxlen=200),
        "created_at": "2025-12-12T10:30:00Z"
    }
}

# Persistent session storage (NeoMem)
# Stores all messages + embeddings for semantic search
```
**Session Lifecycle:**
1. User starts conversation → UI generates `session_id`
2. First message → Cortex creates session in `SESSIONS` dict
3. Subsequent messages → Retrieved from same session
4. Async ingestion → Messages stored in NeoMem for long-term
**Benefits:**
- Conversation continuity within session
- Historical search across sessions
- User can switch sessions (multiple concurrent conversations)
---
### 6. Asynchronous Ingestion
**Pattern:** Separate read path from write path
```javascript
// Relay: Synchronous read path (fast response)
const response = await fetch('http://cortex:7081/reason');
return response.json(); // Return immediately to user
// Relay: Asynchronous write path (non-blocking)
fetch('http://cortex:7081/ingest', { method: 'POST', ... });
// Don't await, just fire and forget
```
**Benefits:**
- Fast user response times (don't wait for database writes)
- Resilient to storage failures (user still gets response)
- Easier scaling (decouple read and write loads)
**Trade-off:** Eventual consistency (short delay before memory is searchable)
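On the Python side, the same pattern could look like this with `asyncio`; `run_reasoning` and `ingest_exchange` are placeholder names for the read and write helpers, not actual Cortex functions:

```python
# Python-side sketch of fire-and-forget ingestion using asyncio.
import asyncio

async def handle_reason_request(session_id: str, user_message: str) -> dict:
    # Read path: produce the answer and return it to the caller.
    result = await run_reasoning(session_id, user_message)

    # Write path: schedule the ingestion without awaiting it.
    asyncio.create_task(ingest_exchange(session_id, user_message, result["answer"]))
    return result
```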
---
### 7. Deferred Summarization
Intake uses **deferred summarization** instead of pre-computation:
```python
# BAD: Pre-compute summaries on every message
def add_message(session_id, message):
    SESSIONS[session_id].buffer.append(message)
    SESSIONS[session_id].L1_summary = summarize(last_1_message)
    SESSIONS[session_id].L5_summary = summarize(last_5_messages)
    # ... expensive, runs on every message

# GOOD: Compute summaries only when needed
def summarize_context(session_id):
    buffer = SESSIONS[session_id].buffer
    return {
        "L1": summarize(buffer[-1:]),   # Only compute when requested
        "L5": summarize(buffer[-5:]),
        "L10": summarize(buffer[-10:])
    }
```
**Benefits:**
- Faster message ingestion (no blocking summarization)
- Compute resources used only when needed
- Flexible summary levels (easy to add L15, L50, etc.)
**Trade-off:** Slight delay the first time summaries are requested in a conversation (cold start)
---
## API Reference
### Relay Endpoints
#### POST `/v1/chat/completions`
**OpenAI-compatible chat endpoint**
**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "Hello, Lyra!"}
  ],
  "session_id": "session_abc123"
}
```
**Response:**
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hi there! How can I help you today?"
      }
    }
  ]
}
```
---
#### POST `/chat`
**Lyra-native chat endpoint**
**Request:**
```json
{
"session_id": "session_abc123",
"message": "Hello, Lyra!"
}
```
**Response:**
```json
{
"answer": "Hi there! How can I help you today?",
"session_id": "session_abc123"
}
```
---
#### GET `/sessions/:id`
**Retrieve session history**
**Response:**
```json
{
  "session_id": "session_abc123",
  "messages": [
    {"role": "user", "content": "Hello", "timestamp": "..."},
    {"role": "assistant", "content": "Hi!", "timestamp": "..."}
  ],
  "created_at": "2025-12-12T10:30:00Z"
}
```
---
### Cortex Endpoints
#### POST `/reason`
**Main reasoning pipeline**
**Request:**
```json
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
```
**Response:**
```json
{
  "answer": "Final answer with Lyra's personality",
  "metadata": {
    "reflection": "User seeking deployment guidance...",
    "draft": "Initial draft answer...",
    "refined": "Polished answer...",
    "stages_completed": 4
  }
}
```
---
#### POST `/ingest`
**Ingest message exchange into Intake**
**Request:**
```json
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?",
"assistant_message": "Here's how..."
}
```
**Response:**
```json
{
"status": "ingested",
"session_id": "session_abc123",
"message_count": 24
}
```
---
#### GET `/debug/sessions`
**Inspect in-memory SESSIONS state**
**Response:**
```json
{
  "session_abc123": {
    "message_count": 24,
    "created_at": "2025-12-12T10:30:00Z",
    "last_message_at": "2025-12-12T11:15:00Z"
  },
  "session_xyz789": {
    "message_count": 5,
    "created_at": "2025-12-12T11:00:00Z",
    "last_message_at": "2025-12-12T11:10:00Z"
  }
}
```
---
### NeoMem Endpoints
#### POST `/memories`
**Create new memory**
**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "I prefer Docker for deployments"},
    {"role": "assistant", "content": "Noted! I'll keep that in mind."}
  ],
  "session_id": "session_abc123"
}
```
**Response:**
```json
{
"status": "created",
"memory_id": "mem_456def",
"extracted_entities": ["Docker", "deployments"]
}
```
---
#### GET `/search`
**Semantic search for memories**
**Query Parameters:**
- `query` (required): Search query
- `limit` (optional, default=5): Max results
**Request:**
```
GET /search?query=deployment%20preferences&limit=5
```
**Response:**
```json
{
  "results": [
    {
      "content": "User prefers Docker for deployments",
      "score": 0.92,
      "timestamp": "2025-12-10T14:30:00Z",
      "session_id": "session_abc123"
    },
    {
      "content": "Previously deployed models on AWS ECS",
      "score": 0.87,
      "timestamp": "2025-12-09T09:15:00Z",
      "session_id": "session_abc123"
    }
  ]
}
```
---
#### GET `/memories`
**List all memories**
**Query Parameters:**
- `offset` (optional, default=0): Pagination offset
- `limit` (optional, default=50): Max results
**Response:**
```json
{
  "memories": [
    {
      "id": "mem_123abc",
      "content": "User prefers Docker...",
      "created_at": "2025-12-10T14:30:00Z"
    }
  ],
  "total": 147,
  "offset": 0,
  "limit": 50
}
```
---
## Deployment & Operations
### Docker Compose Deployment
**File:** `/docker-compose.yml`
```yaml
version: '3.8'

services:
  # === ACTIVE SERVICES ===
  relay:
    build: ./core/relay
    ports:
      - "7078:7078"
    environment:
      - CORTEX_URL=http://cortex:7081
      - NEOMEM_URL=http://neomem:7077
    depends_on:
      - cortex
    networks:
      - lyra_net

  cortex:
    build: ./cortex
    ports:
      - "7081:7081"
    environment:
      - NEOMEM_URL=http://neomem:7077
      - PRIMARY_URL=${PRIMARY_URL}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1
    depends_on:
      - neomem
    networks:
      - lyra_net

  neomem:
    build: ./neomem
    ports:
      - "7077:7077"
    environment:
      - POSTGRES_HOST=neomem-postgres
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - NEO4J_URI=${NEO4J_URI}
    depends_on:
      - neomem-postgres
      - neomem-neo4j
    networks:
      - lyra_net

  ui:
    image: nginx:alpine
    ports:
      - "8081:80"
    volumes:
      - ./core/ui:/usr/share/nginx/html:ro
    networks:
      - lyra_net

  # === DATABASES ===
  neomem-postgres:
    image: ankane/pgvector:v0.5.1
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - ./volumes/postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    networks:
      - lyra_net

  neomem-neo4j:
    image: neo4j:5
    environment:
      - NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD}
    volumes:
      - ./volumes/neo4j_data:/data
    ports:
      - "7474:7474"  # Browser UI
      - "7687:7687"  # Bolt
    networks:
      - lyra_net

networks:
  lyra_net:
    driver: bridge
```
---
### Starting the System
```bash
# 1. Clone repository
git clone https://github.com/yourusername/project-lyra.git
cd project-lyra
# 2. Configure environment
cp .env.example .env
# Edit .env with your LLM backend URLs and API keys
# 3. Start all services
docker-compose up -d
# 4. Check health
curl http://localhost:7078/_health
curl http://localhost:7081/health
curl http://localhost:7077/health
# 5. Open UI
open http://localhost:8081
```
---
### Monitoring & Logs
```bash
# View all logs
docker-compose logs -f
# View specific service
docker-compose logs -f cortex
# Check resource usage
docker stats
# Inspect Cortex sessions
curl http://localhost:7081/debug/sessions
# Check NeoMem memories
curl http://localhost:7077/memories?limit=10
```
---
### Scaling Considerations
#### Current Constraints:
1. **Single Cortex worker** required (in-memory SESSIONS dict)
   - Solution: Migrate SESSIONS to Redis or PostgreSQL
2. **In-memory session storage** in Relay
   - Solution: Use Redis for session persistence
3. **No load balancing** (single instance of each service)
   - Solution: Add an nginx reverse proxy + multiple Cortex instances
#### Horizontal Scaling Plan:
```yaml
# Future: Redis-backed session storage
cortex:
  build: ./cortex
  command: uvicorn main:app --workers 4   # Multi-worker
  environment:
    - REDIS_URL=redis://redis:6379
  depends_on:
    - redis

redis:
  image: redis:alpine
  ports:
    - "6379:6379"
```
---
### Backup Strategy
```bash
# Backup PostgreSQL (NeoMem vectors)
docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql
# Backup Neo4j (NeoMem graph)
docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump
# Backup Intake sessions (manual export)
curl http://localhost:7081/debug/sessions > backup_sessions.json
```
---
## Known Issues & Constraints
### Critical Constraints
#### 1. Single-Worker Requirement (Cortex)
**Issue:** Cortex must run with `--workers 1` to maintain SESSIONS state
**Impact:** Limited horizontal scalability
**Workaround:** None currently
**Fix:** Migrate SESSIONS to Redis or PostgreSQL
**Priority:** High (blocking scalability)
#### 2. In-Memory Session Storage (Relay)
**Issue:** Sessions stored in Node.js process memory
**Impact:** Lost on restart, no persistence
**Workaround:** None currently
**Fix:** Use Redis or database
**Priority:** Medium (acceptable for demo)
---
### Non-Critical Issues
#### 3. RAG Service Disabled
**Status:** Built but commented out in docker-compose.yml
**Impact:** No RAG-based long-term knowledge retrieval
**Workaround:** NeoMem provides semantic search
**Fix:** Re-enable and integrate RAG service
**Priority:** Low (NeoMem sufficient for now)
#### 4. Partial NeoMem Integration
**Status:** Search implemented, async ingestion planned
**Impact:** Memories not automatically saved
**Workaround:** Manual POST to /memories
**Fix:** Complete async ingestion in Relay
**Priority:** Medium (planned feature)
#### 5. Inner Monologue Observer-Only
**Status:** Stage 0.6 runs but output not used
**Impact:** No adaptive response based on monologue
**Workaround:** None (future feature)
**Fix:** Integrate monologue output into pipeline
**Priority:** Low (experimental feature)
---
### Fixed Issues (v0.5.2)
**LLM Router Blocking** - Migrated from `requests` to `httpx` for async
**Session ID Case Mismatch** - Standardized to `session_id`
**Missing Backend Parameter** - Added to intake summarization
---
### Deprecated Components
**Location:** `/DEPRECATED_FILES.md`
- **Standalone Intake Service** - Now embedded in Cortex
- **Old Relay Backup** - Replaced by current Relay
- **Persona Sidecar** - Built but unused (dynamic persona loading)
---
## Advanced Topics
### Custom Prompt Engineering
Each stage uses carefully crafted prompts:
**Reflection Prompt Example:**
```python
REFLECTION_PROMPT = """
You are Lyra's reflective awareness layer.
Your job is to analyze the user's message and conversation context
to understand their true intent and needs.
User message: {user_message}
Recent context:
{intake_L10_summary}
Long-term context:
{neomem_top_3_memories}
Provide concise meta-awareness notes:
- What is the user's underlying intent?
- What topics/themes are emerging?
- What depth of response is appropriate?
- Are there any implicit questions or concerns?
Keep notes brief (3-5 sentences). Focus on insight, not description.
"""
```
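Inside the async `/reason` handler, the template's named slots would be filled before the Stage 1 call. A sketch assuming the context object shown earlier in this document:

```python
# Sketch: filling REFLECTION_PROMPT's slots and running Stage 1.
async def reflect(context: dict) -> str:
    prompt = REFLECTION_PROMPT.format(
        user_message=context["user_message"],
        intake_L10_summary=context["context_summary"].get("L10", ""),
        neomem_top_3_memories="\n".join(
            m["content"] for m in context["neomem_memories"][:3]
        ),
    )
    return await call_llm(prompt, backend="CLOUD", temperature=0.5, max_tokens=256)
```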
---
### Extending the Pipeline
**Adding Stage 5 (Fact-Checking):**
```python
# /cortex/reasoning/factcheck.py
async def factcheck_answer(answer: str, context: dict) -> dict:
    """
    Stage 5: Verify factual claims in answer.

    Returns:
        {
            "verified": bool,
            "flagged_claims": list,
            "corrected_answer": str
        }
    """
    prompt = f"""
Review this answer for factual accuracy:

{answer}

Flag any claims that seem dubious or need verification.
Provide a corrected version if needed.
"""
    result = await call_llm(prompt, backend="CLOUD", temperature=0.1)
    return parse_factcheck_result(result)

# Update router.py to include Stage 5
async def reason_endpoint(request):
    # ... existing stages ...

    # Stage 5: Fact-checking
    factcheck_result = await factcheck_answer(final_answer, context)
    if not factcheck_result["verified"]:
        final_answer = factcheck_result["corrected_answer"]

    return {"answer": final_answer}
```
---
### Custom LLM Backend Integration
**Adding Anthropic Claude:**
```python
# /cortex/llm/llm_router.py
BACKEND_CONFIGS = {
    # ... existing backends ...
    "CLAUDE": {
        "url": "https://api.anthropic.com/v1",
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20241022",
        "api_key": os.getenv("ANTHROPIC_API_KEY")
    }
}

# Add provider-specific logic
elif backend_config["provider"] == "anthropic":
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    response = await httpx_client.post(
        f"{url}/messages",
        json=payload,
        headers=headers,
        timeout=120
    )
    return response.json()["content"][0]["text"]
```
---
### Performance Optimization
**Caching Strategies:**
```python
# /cortex/utils/cache.py
import hashlib

# Simple in-process cache; only used for deterministic calls (temperature=0)
_LLM_CACHE: dict[tuple[str, str], str] = {}

def prompt_key(prompt: str, backend: str) -> tuple[str, str]:
    """Hash the prompt so long prompts don't bloat the cache keys."""
    return (hashlib.md5(prompt.encode()).hexdigest(), backend)

# Usage in llm_router.py
async def call_llm(prompt, backend, temperature=0.7, max_tokens=512):
    if temperature == 0:
        key = prompt_key(prompt, backend)
        if key in _LLM_CACHE:
            return _LLM_CACHE[key]
    result = ...  # normal LLM call
    if temperature == 0:
        _LLM_CACHE[key] = result
    return result
```
**Database Query Optimization:**
```python
# /neomem/neomem/database.py

# BAD: Load all memories, then filter
def search_memories(query):
    all_memories = db.execute("SELECT * FROM memories")
    # Expensive in-memory filtering
    return [m for m in all_memories if similarity(m, query) > 0.8]

# GOOD: Use database indexes and LIMIT
def search_memories(query, limit=5):
    query_embedding = embed(query)
    return db.execute("""
        SELECT * FROM memories
        WHERE embedding <=> %s < 0.2   -- pgvector cosine distance operator
        ORDER BY embedding <=> %s
        LIMIT %s
    """, (query_embedding, query_embedding, limit))
```
---
## Conclusion
Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through:
1. **Dual-memory architecture** (short-term Intake + long-term NeoMem)
2. **Multi-stage reasoning pipeline** (Reflection → Reasoning → Refinement → Persona)
3. **Flexible multi-backend LLM support** (cloud + local with fallback)
4. **Microservices design** for scalability and maintainability
5. **Modern web UI** with session management
The system runs end-to-end with error handling, logging, and health monitoring; the remaining constraints are documented in the Known Issues section above.
---
## Quick Reference
### Service Ports
- **UI:** 8081 (Browser interface)
- **Relay:** 7078 (Main orchestrator)
- **Cortex:** 7081 (Reasoning engine)
- **NeoMem:** 7077 (Long-term memory)
- **PostgreSQL:** 5432 (Vector storage)
- **Neo4j:** 7474 (Browser), 7687 (Bolt)
### Key Files
- **Main Entry:** `/core/relay/server.js`
- **Reasoning Pipeline:** `/cortex/router.py`
- **LLM Router:** `/cortex/llm/llm_router.py`
- **Short-term Memory:** `/cortex/intake/intake.py`
- **Long-term Memory:** `/neomem/neomem/`
- **Personality:** `/cortex/persona/identity.py`
### Important Commands
```bash
# Start system
docker-compose up -d
# View logs
docker-compose logs -f cortex
# Debug sessions
curl http://localhost:7081/debug/sessions
# Health check
curl http://localhost:7078/_health
# Search memories
curl "http://localhost:7077/search?query=deployment&limit=5"
```
---
**Document Version:** 1.0
**Last Updated:** 2025-12-13
**Maintained By:** Project Lyra Team