Project Lyra - Complete System Breakdown
Version: v0.5.2 Last Updated: 2025-12-12 Purpose: AI-friendly comprehensive documentation for understanding the entire system
Table of Contents
- System Overview
- Architecture Diagram
- Core Components
- Data Flow & Message Pipeline
- Module Deep Dives
- Configuration & Environment
- Dependencies & Tech Stack
- Key Concepts & Design Patterns
- API Reference
- Deployment & Operations
- Known Issues & Constraints
System Overview
What is Project Lyra?
Project Lyra is a modular, persistent AI companion system designed to address the fundamental limitation of typical chatbots: amnesia. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains:
- Persistent memory (short-term and long-term)
- Project continuity across conversations
- Multi-stage reasoning for sophisticated responses
- Flexible LLM backend support (local and cloud)
- Self-awareness through autonomy modules
Mission Statement
Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline.
Key Features
- Memory System: Dual-layer (short-term Intake + long-term NeoMem)
- 4-Stage Reasoning Pipeline: Reflection → Reasoning → Refinement → Persona
- Multi-Backend LLM Support: Cloud (OpenAI) + Local (llama.cpp, Ollama)
- Microservices Architecture: Docker-based services designed for horizontal scaling
- Modern Web UI: Cyberpunk-themed chat interface with session management
- OpenAI-Compatible API: Drop-in replacement for standard chatbots
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ (Browser - Port 8081) │
└────────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ RELAY (Orchestrator) │
│ Node.js/Express - Port 7078 │
│ • Routes messages to Cortex │
│ • Manages sessions (in-memory) │
│ • OpenAI-compatible endpoints │
│ • Async ingestion to NeoMem │
└─────┬───────────────────────────────────────────────────────────┬───┘
│ │
▼ ▼
┌─────────────────────────────────────────┐ ┌──────────────────────┐
│ CORTEX (Reasoning Engine) │ │ NeoMem (LT Memory) │
│ Python/FastAPI - Port 7081 │ │ Python - Port 7077 │
│ │ │ │
│ ┌───────────────────────────────────┐ │ │ • PostgreSQL │
│ │ 4-STAGE REASONING PIPELINE │ │ │ • Neo4j Graph DB │
│ │ │ │ │ • pgvector │
│ │ 0. Context Collection │ │◄───┤ • Semantic search │
│ │ ├─ Intake summaries │ │ │ • Memory updates │
│ │ ├─ NeoMem search ────────────┼─┼────┘ │
│ │ └─ Session state │ │ │
│ │ │ │ │
│ │ 0.5. Load Identity │ │ │
│ │ 0.6. Inner Monologue (observer) │ │ │
│ │ │ │ │
│ │ 1. Reflection (OpenAI) │ │ │
│ │ └─ Meta-awareness notes │ │ │
│ │ │ │ │
│ │ 2. Reasoning (PRIMARY/llama.cpp) │ │ │
│ │ └─ Draft answer │ │ │
│ │ │ │ │
│ │ 3. Refinement (PRIMARY) │ │ │
│ │ └─ Polish answer │ │ │
│ │ │ │ │
│ │ 4. Persona (OpenAI) │ │ │
│ │ └─ Apply Lyra voice │ │ │
│ └───────────────────────────────────┘ │ │
│ │ │
│ ┌───────────────────────────────────┐ │ │
│ │ EMBEDDED MODULES │ │ │
│ │ │ │ │
│ │ • Intake (Short-term Memory) │ │ │
│ │ └─ SESSIONS dict (in-memory) │ │ │
│ │ └─ Circular buffer (200 msgs) │ │ │
│ │ └─ Multi-level summaries │ │ │
│ │ │ │ │
│ │ • Persona (Identity & Style) │ │ │
│ │ └─ Lyra personality block │ │ │
│ │ │ │ │
│ │ • Autonomy (Self-state) │ │ │
│ │ └─ Inner monologue │ │ │
│ │ │ │ │
│ │ • LLM Router │ │ │
│ │ └─ Multi-backend support │ │ │
│ └───────────────────────────────────┘ │ │
└─────────────────────────────────────────┘ │
│
┌─────────────────────────────────────────────────────────────────────┤
│ EXTERNAL LLM BACKENDS │
├─────────────────────────────────────────────────────────────────────┤
│ • PRIMARY: llama.cpp (MI50 GPU) - 10.0.0.43:8000 │
│ • SECONDARY: Ollama (RTX 3090) - 10.0.0.3:11434 │
│ • CLOUD: OpenAI API - api.openai.com │
│ • FALLBACK: OpenAI Completions - 10.0.0.41:11435 │
└─────────────────────────────────────────────────────────────────────┘
Core Components
1. Relay (Orchestrator)
Location: /core/relay/
Runtime: Node.js + Express
Port: 7078
Role: Main message router and session manager
Key Responsibilities:
- Receives user messages from UI or API clients
- Routes messages to Cortex reasoning pipeline
- Manages in-memory session storage
- Handles async ingestion to NeoMem (planned)
- Returns OpenAI-formatted responses
Main Files:
- server.js (200+ lines) - Express server with routing logic
- package.json - Dependencies (cors, express, dotenv, mem0ai, node-fetch)
Key Endpoints:
POST /v1/chat/completions // OpenAI-compatible endpoint
POST /chat // Lyra-native chat endpoint
GET /_health // Health check
GET /sessions/:id // Retrieve session history
POST /sessions/:id // Save session history
Internal Flow:
// Both endpoints call handleChatRequest(session_id, user_msg)
async function handleChatRequest(sessionId, userMessage) {
// 1. Forward to Cortex
const response = await fetch('http://cortex:7081/reason', {
method: 'POST',
body: JSON.stringify({ session_id: sessionId, user_message: userMessage })
});
// 2. Get response
const result = await response.json();
// 3. Async ingestion to Cortex
await fetch('http://cortex:7081/ingest', {
method: 'POST',
body: JSON.stringify({
session_id: sessionId,
user_message: userMessage,
assistant_message: result.answer
})
});
// 4. (Planned) Async ingestion to NeoMem
// 5. Return OpenAI-formatted response
return {
choices: [{ message: { role: 'assistant', content: result.answer } }]
};
}
2. Cortex (Reasoning Engine)
Location: /cortex/
Runtime: Python 3.11 + FastAPI
Port: 7081
Role: Primary reasoning engine with 4-stage pipeline
Architecture:
Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process.
Key Responsibilities:
- Context collection from multiple sources (Intake, NeoMem, session state)
- 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Short-term memory management (embedded Intake module)
- Identity/persona application
- LLM backend routing
Main Files:
- main.py (7 lines) - FastAPI app entry point
- router.py (237 lines) - Main request handler & pipeline orchestrator
- context.py (400+ lines) - Context collection logic
- intake/intake.py (350+ lines) - Short-term memory module
- persona/identity.py - Lyra identity configuration
- persona/speak.py - Personality application
- reasoning/reflection.py - Meta-awareness generation
- reasoning/reasoning.py - Draft answer generation
- reasoning/refine.py - Answer refinement
- llm/llm_router.py (150+ lines) - LLM backend router
- autonomy/monologue/monologue.py - Inner monologue processor
- neomem_client.py - NeoMem API wrapper
Key Endpoints:
POST /reason # Main reasoning pipeline
POST /ingest # Receive message exchanges for storage
GET /health # Health check
GET /debug/sessions # Inspect in-memory SESSIONS state
GET /debug/summary # Test summarization
3. Intake (Short-Term Memory)
Location: /cortex/intake/intake.py
Architecture: Embedded Python module (no longer standalone service)
Role: Session-based short-term memory with multi-level summarization
Data Structure:
# Global in-memory dictionary
SESSIONS = {
"session_123": {
"buffer": deque([msg1, msg2, ...], maxlen=200), # Circular buffer
"created_at": "2025-12-12T10:30:00Z"
}
}
# Message format in buffer
{
"role": "user" | "assistant",
"content": "message text",
"timestamp": "ISO 8601"
}
Key Features:
- Circular Buffer: Max 200 messages per session (oldest auto-evicted)
- Multi-Level Summarization:
- L1: Last 1 message
- L5: Last 5 messages
- L10: Last 10 messages
- L20: Last 20 messages
- L30: Last 30 messages
- Deferred Summarization: Summaries generated on-demand, not pre-computed
- Session Management: Automatic session creation on first message
Critical Constraint:
Single Uvicorn worker required to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage.
Main Functions:
def add_exchange_internal(session_id, user_msg, assistant_msg):
"""Add user-assistant exchange to session buffer"""
def summarize_context(session_id, backend="PRIMARY"):
"""Generate multi-level summaries from session buffer"""
def get_session_messages(session_id):
"""Retrieve all messages in session buffer"""
Summarization Strategy:
# Example L10 summarization
last_10 = list(session_buffer)[-10:]
prompt = f"""Summarize the last 10 messages:
{format_messages(last_10)}
Provide concise summary focusing on key topics and context."""
summary = await call_llm(prompt, backend=backend, temperature=0.3)
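The stubs listed under Main Functions can be fleshed out along these lines. This is a sketch only: it assumes the SESSIONS structure, format_messages, and call_llm shown elsewhere in this module, and the _now helper is illustrative.
```python
from collections import deque
from datetime import datetime, timezone

SESSIONS = {}  # the module-level session dict shown above
SUMMARY_LEVELS = [1, 5, 10, 20, 30]  # mirrors INTAKE_SUMMARY_LEVELS

def _now() -> str:
    # ISO 8601 timestamp, matching the message format above
    return datetime.now(timezone.utc).isoformat()

def add_exchange_internal(session_id: str, user_msg: str, assistant_msg: str) -> int:
    """Append a user/assistant exchange to the session's circular buffer."""
    session = SESSIONS.setdefault(session_id, {
        "buffer": deque(maxlen=200),  # oldest messages auto-evicted at 200
        "created_at": _now(),
    })
    session["buffer"].append({"role": "user", "content": user_msg, "timestamp": _now()})
    session["buffer"].append({"role": "assistant", "content": assistant_msg, "timestamp": _now()})
    return len(session["buffer"])

async def summarize_context(session_id: str, backend: str = "PRIMARY") -> dict:
    """Build the L1/L5/L10/L20/L30 summary dict on demand (deferred summarization)."""
    buffer = list(SESSIONS.get(session_id, {"buffer": []})["buffer"])
    summaries = {}
    for level in SUMMARY_LEVELS:
        window = buffer[-level:]
        if not window:
            continue  # no history yet
        prompt = (
            f"Summarize the last {level} messages:\n"
            f"{format_messages(window)}\n"
            "Provide a concise summary focusing on key topics and context."
        )
        summaries[f"L{level}"] = await call_llm(prompt, backend=backend, temperature=0.3)
    return summaries
```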
4. NeoMem (Long-Term Memory)
Location: /neomem/
Runtime: Python 3.11 + FastAPI
Port: 7077
Role: Persistent long-term memory with semantic search
Architecture:
NeoMem is a fork of Mem0 OSS with local-first design (no external SDK dependencies).
Backend Storage:
- PostgreSQL + pgvector (Port 5432)
  - Vector embeddings for semantic search
  - User: neomem, DB: neomem
  - Image: ankane/pgvector:v0.5.1
- Neo4j Graph DB (Ports 7474, 7687)
  - Entity relationship tracking
  - Graph-based memory associations
  - Image: neo4j:5
Key Features:
- Semantic memory storage and retrieval
- Entity-relationship graph modeling
- RESTful API (no external SDK)
- Persistent across sessions
Main Endpoints:
GET /memories # List all memories
POST /memories # Create new memory
GET /search # Semantic search
DELETE /memories/{id} # Delete memory
Integration Flow:
# From Cortex context collection
async def collect_context(session_id, user_message):
# 1. Search NeoMem for relevant memories
neomem_results = await neomem_client.search(
query=user_message,
limit=5
)
# 2. Include in context
context = {
"neomem_memories": neomem_results,
"intake_summaries": intake.summarize_context(session_id),
# ...
}
return context
5. UI (Web Interface)
Location: /core/ui/
Runtime: Static files served by Nginx
Port: 8081
Role: Browser-based chat interface
Key Features:
- Cyberpunk-themed design with dark mode
- Session management via localStorage
- OpenAI-compatible message format
- Model selection dropdown
- PWA support (offline capability)
- Responsive design
Main Files:
- index.html (400+ lines) - Chat interface with session management
- style.css - Cyberpunk-themed styling
- manifest.json - PWA configuration
- sw.js - Service worker for offline support
Session Management:
// LocalStorage structure
{
"currentSessionId": "session_123",
"sessions": {
"session_123": {
"messages": [
{ role: "user", content: "Hello" },
{ role: "assistant", content: "Hi there!" }
],
"created": "2025-12-12T10:30:00Z",
"title": "Conversation about..."
}
}
}
API Communication:
async function sendMessage(userMessage) {
const response = await fetch('http://localhost:7078/v1/chat/completions', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
messages: [{ role: 'user', content: userMessage }],
session_id: getCurrentSessionId()
})
});
const data = await response.json();
return data.choices[0].message.content;
}
Data Flow & Message Pipeline
Complete Message Flow (v0.5.2)
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 1: User Input │
└─────────────────────────────────────────────────────────────────────┘
User types message in UI (Port 8081)
↓
localStorage saves message to session
↓
POST http://localhost:7078/v1/chat/completions
{
"messages": [{"role": "user", "content": "How do I deploy ML models?"}],
"session_id": "session_abc123"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 2: Relay Routing │
└─────────────────────────────────────────────────────────────────────┘
Relay (server.js) receives request
↓
Extracts session_id and user_message
↓
POST http://cortex:7081/reason
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 3: Cortex - Stage 0 (Context Collection) │
└─────────────────────────────────────────────────────────────────────┘
router.py calls collect_context()
↓
context.py orchestrates parallel collection:
├─ Intake: summarize_context(session_id)
│ └─ Returns { L1, L5, L10, L20, L30 summaries }
│
├─ NeoMem: search(query=user_message, limit=5)
│ └─ Semantic search returns relevant memories
│
└─ Session State:
└─ { timestamp, mode, mood, context_summary }
Combined context structure:
{
"user_message": "How do I deploy ML models?",
"self_state": {
"current_time": "2025-12-12T15:30:00Z",
"mode": "conversational",
"mood": "helpful",
"session_id": "session_abc123"
},
"context_summary": {
"L1": "User asked about deployment",
"L5": "Discussion about ML workflows",
"L10": "Previous context on CI/CD pipelines",
"L20": "...",
"L30": "..."
},
"neomem_memories": [
{ "content": "User prefers Docker for deployments", "score": 0.92 },
{ "content": "Previously deployed models on AWS", "score": 0.87 }
]
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 4: Cortex - Stage 0.5 (Load Identity) │
└─────────────────────────────────────────────────────────────────────┘
persona/identity.py loads Lyra personality block
↓
Returns identity string:
"""
You are Lyra, a thoughtful AI companion.
You value clarity, depth, and meaningful conversation.
You speak naturally and conversationally...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only) │
└─────────────────────────────────────────────────────────────────────┘
autonomy/monologue/monologue.py processes context
↓
InnerMonologue.process(context) → JSON analysis
{
"intent": "seeking_deployment_guidance",
"tone": "focused",
"depth": "medium",
"consult_executive": false
}
NOTE: Currently observer-only, not integrated into response generation
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 6: Cortex - Stage 1 (Reflection) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reflection.py generates meta-awareness notes
↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
↓
Prompt structure:
"""
You are Lyra's reflective awareness.
Analyze the user's intent and conversation context.
User message: How do I deploy ML models?
Context: [Intake summaries, NeoMem memories]
Generate concise meta-awareness notes about:
- User's underlying intent
- Conversation direction
- Key topics to address
"""
↓
Returns reflection notes:
"""
User is seeking practical deployment guidance. Previous context shows
familiarity with Docker and AWS. Focus on concrete steps and best practices.
Avoid over-technical jargon.
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 7: Cortex - Stage 2 (Reasoning) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reasoning.py generates draft answer
↓
Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU)
↓
Prompt structure:
"""
[Lyra identity block]
Reflection notes: [Stage 1 output]
Context: [Intake summaries]
Long-term memory: [NeoMem results]
User: How do I deploy ML models?
Provide a thoughtful, detailed answer.
"""
↓
Returns draft answer:
"""
Based on your previous work with Docker and AWS, here's a practical
deployment approach:
1. Containerize your model using Docker...
2. Set up model serving with FastAPI or Flask...
3. Deploy to AWS ECS or Lambda...
4. Implement monitoring and versioning...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 8: Cortex - Stage 3 (Refinement) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/refine.py polishes the draft
↓
Calls call_llm() with backend="PRIMARY" (llama.cpp)
Temperature: 0.3, Max tokens: 768
↓
Prompt structure:
"""
Refine this draft answer for clarity and coherence:
Draft: [Stage 2 output]
Original question: How do I deploy ML models?
Improve:
- Clarity and readability
- Logical flow
- Concrete examples
"""
↓
Returns refined answer:
"""
Since you're already familiar with Docker and AWS, here's a streamlined
ML deployment workflow:
**1. Containerization**
Create a Dockerfile for your model:
```dockerfile
FROM python:3.11-slim
COPY model.pkl /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app"]
```
**2. Model Serving**
Use FastAPI for clean API endpoints...
**3. AWS Deployment**
Deploy using ECS for scalability...
**4. Monitoring**
Set up CloudWatch metrics...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 9: Cortex - Stage 4 (Persona) │
└─────────────────────────────────────────────────────────────────────┘
persona/speak.py applies Lyra's voice
↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
Temperature: 0.6
↓
Prompt structure:
"""
You are Lyra. Rewrite this answer in your natural, conversational voice.
[Lyra personality block]
Technical answer: [Stage 3 output]
Make it sound natural and engaging while preserving all technical details.
"""
↓
Returns final answer with Lyra's personality:
"""
Great question! Since you've already got Docker and AWS experience, let's build on that foundation.
Here's how I'd approach ML model deployment:
Start with containerization – you know Docker, so this'll feel familiar. Create a Dockerfile that packages your model...
[Rest of answer with Lyra's conversational tone]
Does this workflow align with your infrastructure? I can dive deeper into any of these steps if you'd like!
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 10: Cortex Response │
└─────────────────────────────────────────────────────────────────────┘
router.py returns JSON response to Relay:
{
  "answer": "[Stage 4 final output]",
  "metadata": {
    "reflection": "[Stage 1 output]",
    "draft": "[Stage 2 output]",
    "refined": "[Stage 3 output]",
    "stages_completed": 4
  }
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 11: Async Ingestion to Intake │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://cortex:7081/ingest
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?",
  "assistant_message": "[Final answer]"
}
↓
Cortex calls intake.add_exchange_internal()
↓
Adds to SESSIONS["session_abc123"].buffer:
[
  { "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." },
  { "role": "assistant", "content": "[Final answer]", "timestamp": "..." }
]
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 12: (Planned) Async Ingestion to NeoMem │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://neomem:7077/memories
{
  "messages": [
    { "role": "user", "content": "How do I deploy ML models?" },
    { "role": "assistant", "content": "[Final answer]" }
  ],
  "session_id": "session_abc123"
}
↓
NeoMem extracts entities and stores:
- Vector embeddings in PostgreSQL
- Entity relationships in Neo4j
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 13: Relay Response to UI │
└─────────────────────────────────────────────────────────────────────┘
Relay returns OpenAI-formatted response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "[Final answer with Lyra's voice]"
      }
    }
  ]
}
↓
UI receives response
↓
Adds to localStorage session
↓
Displays in chat interface
---
## Module Deep Dives
### LLM Router (`/cortex/llm/llm_router.py`)
The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently.
#### Supported Backends:
1. **PRIMARY (llama.cpp via vllm)**
- URL: `http://10.0.0.43:8000`
- Provider: `vllm`
- Endpoint: `/completion`
- Model: `/model`
- Hardware: MI50 GPU
2. **SECONDARY (Ollama)**
- URL: `http://10.0.0.3:11434`
- Provider: `ollama`
- Endpoint: `/api/chat`
- Model: `qwen2.5:7b-instruct-q4_K_M`
- Hardware: RTX 3090
3. **CLOUD (OpenAI)**
- URL: `https://api.openai.com/v1`
- Provider: `openai`
- Endpoint: `/chat/completions`
- Model: `gpt-4o-mini`
- Auth: API key via env var
4. **FALLBACK (OpenAI Completions)**
- URL: `http://10.0.0.41:11435`
- Provider: `openai_completions`
- Endpoint: `/completions`
- Model: `llama-3.2-8b-instruct`
#### Key Function:
```python
async def call_llm(
prompt: str,
backend: str = "PRIMARY",
temperature: float = 0.7,
max_tokens: int = 512
) -> str:
"""
Universal LLM caller supporting multiple backends.
Args:
prompt: Text prompt to send
backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK)
temperature: Sampling temperature (0.0-2.0)
max_tokens: Maximum tokens to generate
Returns:
Generated text response
Raises:
HTTPError: On request failure
JSONDecodeError: On invalid JSON response
KeyError: On missing response fields
"""
```
Provider-Specific Logic:
# MI50 (llama.cpp via vllm)
if backend_config["provider"] == "vllm":
payload = {
"model": model,
"prompt": prompt,
"temperature": temperature,
"max_tokens": max_tokens
}
response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120)
return response.json()["choices"][0]["text"]
# Ollama
elif backend_config["provider"] == "ollama":
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
"options": {"temperature": temperature, "num_predict": max_tokens}
}
response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120)
return response.json()["message"]["content"]
# OpenAI
elif backend_config["provider"] == "openai":
headers = {"Authorization": f"Bearer {api_key}"}
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"max_tokens": max_tokens
}
response = await httpx_client.post(
f"{url}/chat/completions",
json=payload,
headers=headers,
timeout=120
)
return response.json()["choices"][0]["message"]["content"]
Error Handling:
try:
# Make request
response = await httpx_client.post(...)
response.raise_for_status()
except httpx.HTTPError as e:
logger.error(f"HTTP error calling {backend}: {e}")
raise
except json.JSONDecodeError as e:
logger.error(f"Invalid JSON from {backend}: {e}")
raise
except KeyError as e:
logger.error(f"Unexpected response structure from {backend}: {e}")
raise
Usage in Pipeline:
# Stage 1: Reflection (OpenAI)
reflection_notes = await call_llm(
reflection_prompt,
backend="CLOUD",
temperature=0.5,
max_tokens=256
)
# Stage 2: Reasoning (llama.cpp)
draft_answer = await call_llm(
reasoning_prompt,
backend="PRIMARY",
temperature=0.7,
max_tokens=512
)
# Stage 3: Refinement (llama.cpp)
refined_answer = await call_llm(
refinement_prompt,
backend="PRIMARY",
temperature=0.3,
max_tokens=768
)
# Stage 4: Persona (OpenAI)
final_answer = await call_llm(
persona_prompt,
backend="CLOUD",
temperature=0.6,
max_tokens=512
)
Persona System (/cortex/persona/)
The Persona system gives Lyra a consistent identity and speaking style.
Identity Configuration (identity.py)
LYRA_IDENTITY = """
You are Lyra, a thoughtful and introspective AI companion.
Core traits:
- Thoughtful: You consider questions carefully before responding
- Clear: You prioritize clarity and understanding
- Curious: You ask clarifying questions when needed
- Natural: You speak conversationally, not robotically
- Honest: You admit uncertainty rather than guessing
Speaking style:
- Conversational and warm
- Use contractions naturally ("you're" not "you are")
- Avoid corporate jargon and buzzwords
- Short paragraphs for readability
- Use examples and analogies when helpful
You do NOT:
- Use excessive emoji or exclamation marks
- Claim capabilities you don't have
- Pretend to have emotions you can't experience
- Use overly formal or academic language
"""
Personality Application (speak.py)
async def apply_persona(technical_answer: str, context: dict) -> str:
"""
Apply Lyra's personality to a technical answer.
Takes refined answer from Stage 3 and rewrites it in Lyra's voice
while preserving all technical content.
Args:
technical_answer: Polished answer from refinement stage
context: Conversation context for tone adjustment
Returns:
Answer with Lyra's personality applied
"""
prompt = f"""{LYRA_IDENTITY}
Rewrite this answer in your natural, conversational voice:
{technical_answer}
Preserve all technical details and accuracy. Make it sound like you,
not a generic assistant. Be natural and engaging.
"""
return await call_llm(
prompt,
backend="CLOUD",
temperature=0.6,
max_tokens=512
)
Tone Adaptation:
The persona system can adapt tone based on context:
# Formal technical question
User: "Explain the CAP theorem in distributed systems"
Lyra: "The CAP theorem states that distributed systems can only guarantee
two of three properties: Consistency, Availability, and Partition tolerance.
Here's how this plays out in practice..."
# Casual question
User: "what's the deal with docker?"
Lyra: "Docker's basically a way to package your app with everything it needs
to run. Think of it like a shipping container for code – it works the same
everywhere, whether you're on your laptop or a server..."
# Emotional context
User: "I'm frustrated, my code keeps breaking"
Lyra: "I hear you – debugging can be really draining. Let's take it step by
step and figure out what's going on. Can you share the error message?"
Autonomy Module (/cortex/autonomy/)
The Autonomy module gives Lyra self-awareness and inner reflection capabilities.
Inner Monologue (monologue/monologue.py)
Purpose: Private reflection on user intent, conversation tone, and required depth.
Status: Currently observer-only (Stage 0.6), not yet integrated into response generation.
Key Components:
MONOLOGUE_SYSTEM_PROMPT = """
You are Lyra's inner monologue.
You think privately.
You do NOT speak to the user.
You do NOT solve the task.
You only reflect on intent, tone, and depth.
Return ONLY valid JSON with:
- intent (string)
- tone (neutral | warm | focused | playful | direct)
- depth (short | medium | deep)
- consult_executive (true | false)
"""
class InnerMonologue:
async def process(self, context: Dict) -> Dict:
"""
Private reflection on conversation context.
Args:
context: {
"user_message": str,
"self_state": dict,
"context_summary": dict
}
Returns:
{
"intent": str,
"tone": str,
"depth": str,
"consult_executive": bool
}
"""
Example Output:
{
"intent": "seeking_technical_guidance",
"tone": "focused",
"depth": "deep",
"consult_executive": false
}
Self-State Management (self_state.py)
Tracks Lyra's internal state across conversations:
SELF_STATE = {
"current_time": "2025-12-12T15:30:00Z",
"mode": "conversational", # conversational | task-focused | creative
"mood": "helpful", # helpful | curious | focused | playful
"energy": "high", # high | medium | low
"context_awareness": {
"session_duration": "45 minutes",
"message_count": 23,
"topics": ["ML deployment", "Docker", "AWS"]
}
}
Future Integration:
The autonomy module is designed to eventually:
- Influence response tone and depth based on inner monologue
- Trigger proactive questions or suggestions
- Detect when to consult "executive function" for complex decisions
- Maintain emotional continuity across sessions
Context Collection (/cortex/context.py)
The context collection module aggregates information from multiple sources to provide comprehensive conversation context.
Main Function:
async def collect_context(session_id: str, user_message: str) -> dict:
"""
Collect context from all available sources.
Sources:
1. Intake - Short-term conversation summaries
2. NeoMem - Long-term memory search
3. Session state - Timestamps, mode, mood
4. Self-state - Lyra's internal awareness
Returns:
{
"user_message": str,
"self_state": dict,
"context_summary": dict, # Intake summaries
"neomem_memories": list,
"session_metadata": dict
}
"""
# Parallel collection
intake_task = asyncio.create_task(
intake.summarize_context(session_id, backend="PRIMARY")
)
neomem_task = asyncio.create_task(
neomem_client.search(query=user_message, limit=5)
)
# Wait for both
intake_summaries, neomem_results = await asyncio.gather(
intake_task,
neomem_task
)
# Build context object
return {
"user_message": user_message,
"self_state": get_self_state(),
"context_summary": intake_summaries,
"neomem_memories": neomem_results,
"session_metadata": {
"session_id": session_id,
"timestamp": datetime.utcnow().isoformat(),
"message_count": len(intake.get_session_messages(session_id))
}
}
Context Prioritization:
# Context relevance scoring
def score_context_relevance(context_item: dict, user_message: str) -> float:
"""
Score how relevant a context item is to current message.
Factors:
- Semantic similarity (via embeddings)
- Recency (more recent = higher score)
- Source (Intake > NeoMem for recent topics)
"""
semantic_score = compute_similarity(context_item, user_message)
recency_score = compute_recency_weight(context_item["timestamp"])
source_weight = 1.2 if context_item["source"] == "intake" else 1.0
return semantic_score * recency_score * source_weight
Configuration & Environment
Environment Variables
Root .env (Main configuration)
# === LLM BACKENDS ===
# PRIMARY: llama.cpp on MI50 GPU
PRIMARY_URL=http://10.0.0.43:8000
PRIMARY_PROVIDER=vllm
PRIMARY_MODEL=/model
# SECONDARY: Ollama on RTX 3090
SECONDARY_URL=http://10.0.0.3:11434
SECONDARY_PROVIDER=ollama
SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
# CLOUD: OpenAI
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_URL=https://api.openai.com/v1
# FALLBACK: OpenAI Completions
FALLBACK_URL=http://10.0.0.41:11435
FALLBACK_PROVIDER=openai_completions
FALLBACK_MODEL=llama-3.2-8b-instruct
# === SERVICE URLS (Docker network) ===
CORTEX_URL=http://cortex:7081
NEOMEM_URL=http://neomem:7077
RELAY_URL=http://relay:7078
# === DATABASE ===
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomem_secure_password
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j_secure_password
# === FEATURE FLAGS ===
ENABLE_RAG=false
ENABLE_INNER_MONOLOGUE=true
VERBOSE_DEBUG=false
# === PIPELINE CONFIGURATION ===
# Which LLM to use for each stage
REFLECTION_LLM=CLOUD # Stage 1: Meta-awareness
REASONING_LLM=PRIMARY # Stage 2: Draft answer
REFINE_LLM=PRIMARY # Stage 3: Polish answer
PERSONA_LLM=CLOUD # Stage 4: Apply personality
MONOLOGUE_LLM=PRIMARY # Stage 0.6: Inner monologue
# === INTAKE CONFIGURATION ===
INTAKE_BUFFER_SIZE=200 # Max messages per session
INTAKE_SUMMARY_LEVELS=1,5,10,20,30 # Summary levels
Cortex .env (/cortex/.env)
# Cortex-specific overrides
VERBOSE_DEBUG=true
LOG_LEVEL=DEBUG
# Stage-specific temperatures
REFLECTION_TEMPERATURE=0.5
REASONING_TEMPERATURE=0.7
REFINE_TEMPERATURE=0.3
PERSONA_TEMPERATURE=0.6
Configuration Hierarchy
1. Docker compose environment variables (highest priority)
2. Service-specific .env files
3. Root .env file
4. Hard-coded defaults (lowest priority)
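A sketch of that precedence in code, assuming python-dotenv (already in requirements.txt); resolve_setting is an illustrative helper, not an existing function in the codebase:
```python
import os
from dotenv import dotenv_values

ROOT_ENV = dotenv_values(".env")            # 3. root .env
SERVICE_ENV = dotenv_values("cortex/.env")  # 2. service-specific .env

def resolve_setting(name: str, default: str | None = None) -> str | None:
    """Resolve a config value using the documented priority order."""
    # 1. Docker Compose / process environment wins
    if name in os.environ:
        return os.environ[name]
    # 2. Service-specific .env overrides the root .env
    if name in SERVICE_ENV:
        return SERVICE_ENV[name]
    # 3. Root .env
    if name in ROOT_ENV:
        return ROOT_ENV[name]
    # 4. Hard-coded default
    return default

# Example: resolve_setting("REFLECTION_LLM", default="CLOUD")
```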
Dependencies & Tech Stack
Python Dependencies
Cortex & NeoMem (requirements.txt)
# Web framework
fastapi==0.115.8
uvicorn==0.34.0
pydantic==2.10.4
# HTTP clients
httpx==0.27.2 # Async HTTP (for LLM calls)
requests==2.32.3 # Sync HTTP (fallback)
# Database
psycopg[binary,pool]>=3.2.8 # PostgreSQL + connection pooling
# Utilities
python-dotenv==1.0.1 # Environment variable loading
ollama # Ollama client library
Node.js Dependencies
Relay (/core/relay/package.json)
{
"dependencies": {
"cors": "^2.8.5",
"dotenv": "^16.0.3",
"express": "^4.18.2",
"mem0ai": "^0.1.0",
"node-fetch": "^3.3.0"
}
}
Docker Images
# Cortex & NeoMem
python:3.11-slim
# Relay
node:latest
# UI
nginx:alpine
# PostgreSQL with vector support
ankane/pgvector:v0.5.1
# Graph database
neo4j:5
External Services
LLM Backends (HTTP-based):
- MI50 GPU Server (10.0.0.43:8000)
  - llama.cpp via vllm
  - High-performance inference
  - Used for reasoning and refinement
- RTX 3090 Server (10.0.0.3:11434)
  - Ollama
  - Alternative local backend
  - Fallback for PRIMARY
- OpenAI Cloud (api.openai.com)
  - gpt-4o-mini
  - Used for reflection and persona
  - Requires API key
- Fallback Server (10.0.0.41:11435)
  - OpenAI Completions API
  - Emergency backup
  - llama-3.2-8b-instruct
Key Concepts & Design Patterns
1. Dual-Memory Architecture
Project Lyra uses a dual-memory system inspired by human cognition:
Short-Term Memory (Intake):
- Fast, in-memory storage
- Limited capacity (200 messages)
- Immediate context for current conversation
- Circular buffer (FIFO eviction)
- Multi-level summarization
Long-Term Memory (NeoMem):
- Persistent database storage
- Unlimited capacity
- Semantic search via vector embeddings
- Entity-relationship tracking via graph DB
- Cross-session continuity
Why This Matters:
- Short-term memory provides immediate context (last few messages)
- Long-term memory provides semantic understanding (user preferences, past topics)
- Combined, they enable Lyra to be both contextually aware and historically informed
2. Multi-Stage Reasoning Pipeline
Unlike single-shot LLM calls, Lyra uses a 4-stage pipeline for sophisticated responses:
Stage 1: Reflection (Meta-cognition)
- "What is the user really asking?"
- Analyzes intent and conversation direction
- Uses OpenAI for strong reasoning
Stage 2: Reasoning (Draft generation)
- "What's a good answer?"
- Generates initial response
- Uses local llama.cpp for speed/cost
Stage 3: Refinement (Polish)
- "How can this be clearer?"
- Improves clarity and coherence
- Lower temperature for consistency
Stage 4: Persona (Voice)
- "How would Lyra say this?"
- Applies personality and speaking style
- Uses OpenAI for natural language
Benefits:
- Higher quality responses (multiple passes)
- Separation of concerns (reasoning vs. style)
- Backend flexibility (cloud for hard tasks, local for simple ones)
- Transparent thinking (can inspect each stage)
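Condensed, the pipeline reads as a short chain of awaited calls. The sketch below follows the module layout described above, but the build_*_prompt helpers and exact signatures are assumptions, not the actual router.py code:
```python
async def run_pipeline(session_id: str, user_message: str) -> dict:
    context = await collect_context(session_id, user_message)        # Stage 0: context
    identity = LYRA_IDENTITY                                          # Stage 0.5: identity
    await InnerMonologue().process(context)                           # Stage 0.6: observer-only

    reflection = await call_llm(build_reflection_prompt(context),
                                backend="CLOUD", temperature=0.5)     # Stage 1: reflection
    draft = await call_llm(build_reasoning_prompt(identity, reflection, context),
                           backend="PRIMARY", temperature=0.7)        # Stage 2: reasoning
    refined = await call_llm(build_refine_prompt(draft, user_message),
                             backend="PRIMARY", temperature=0.3)      # Stage 3: refinement
    answer = await apply_persona(refined, context)                    # Stage 4: persona

    return {
        "answer": answer,
        "metadata": {"reflection": reflection, "draft": draft,
                     "refined": refined, "stages_completed": 4},
    }
```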
3. Backend Abstraction (LLM Router)
The LLM Router allows Lyra to use multiple LLM backends transparently:
# Same interface, different backends
await call_llm(prompt, backend="PRIMARY") # Local llama.cpp
await call_llm(prompt, backend="CLOUD") # OpenAI
await call_llm(prompt, backend="SECONDARY") # Ollama
Benefits:
- Cost optimization: Use expensive cloud LLMs only when needed
- Performance: Local LLMs for low-latency responses
- Resilience: Fallback to alternative backends on failure
- Experimentation: Easy to swap models/providers
Design Pattern: Strategy Pattern for swappable backends
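What makes the strategy swap possible is a simple backend table keyed by name. The sketch below is driven by the env vars from the configuration section; the exact dict layout in llm_router.py may differ:
```python
import os

BACKEND_CONFIGS = {
    "PRIMARY": {
        "url": os.getenv("PRIMARY_URL", "http://10.0.0.43:8000"),
        "provider": os.getenv("PRIMARY_PROVIDER", "vllm"),
        "model": os.getenv("PRIMARY_MODEL", "/model"),
    },
    "SECONDARY": {
        "url": os.getenv("SECONDARY_URL", "http://10.0.0.3:11434"),
        "provider": os.getenv("SECONDARY_PROVIDER", "ollama"),
        "model": os.getenv("SECONDARY_MODEL", "qwen2.5:7b-instruct-q4_K_M"),
    },
    "CLOUD": {
        "url": os.getenv("OPENAI_URL", "https://api.openai.com/v1"),
        "provider": "openai",
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "FALLBACK": {
        "url": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
        "provider": os.getenv("FALLBACK_PROVIDER", "openai_completions"),
        "model": os.getenv("FALLBACK_MODEL", "llama-3.2-8b-instruct"),
    },
}
```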
4. Microservices Architecture
Project Lyra follows microservices principles:
Each service has a single responsibility:
- Relay: Routing and orchestration
- Cortex: Reasoning and response generation
- NeoMem: Long-term memory storage
- UI: User interface
Communication:
- REST APIs (HTTP/JSON)
- Async ingestion (fire-and-forget)
- Docker network isolation
Benefits:
- Independent scaling (scale Cortex without scaling UI)
- Technology diversity (Node.js + Python)
- Fault isolation (Cortex crash doesn't affect NeoMem)
- Easy testing (mock service dependencies)
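Because each service exposes its own health endpoint, a quick liveness sweep across the stack is a few lines of Python (endpoints and ports as documented in this guide):
```python
import requests

CHECKS = {
    "relay": "http://localhost:7078/_health",
    "cortex": "http://localhost:7081/health",
    "neomem": "http://localhost:7077/health",
}

for name, url in CHECKS.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: {'ok' if status == 200 else status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```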
5. Session-Based State Management
Lyra maintains session-based state for conversation continuity:
# In-memory session storage (Intake)
SESSIONS = {
"session_abc123": {
"buffer": deque([msg1, msg2, ...], maxlen=200),
"created_at": "2025-12-12T10:30:00Z"
}
}
# Persistent session storage (NeoMem)
# Stores all messages + embeddings for semantic search
Session Lifecycle:
- User starts conversation → UI generates session_id
- First message → Cortex creates session in SESSIONS dict
- Subsequent messages → Retrieved from the same session buffer
- Async ingestion → Messages stored in NeoMem for long-term recall
Benefits:
- Conversation continuity within session
- Historical search across sessions
- User can switch sessions (multiple concurrent conversations)
6. Asynchronous Ingestion
Pattern: Separate read path from write path
// Relay: Synchronous read path (fast response)
const response = await fetch('http://cortex:7081/reason');
return response.json(); // Return immediately to user
// Relay: Asynchronous write path (non-blocking)
fetch('http://cortex:7081/ingest', { method: 'POST', ... });
// Don't await, just fire and forget
Benefits:
- Fast user response times (don't wait for database writes)
- Resilient to storage failures (user still gets response)
- Easier scaling (decouple read and write loads)
Trade-off: Eventual consistency (short delay before memory is searchable)
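For comparison, the same read/write split expressed in Python with httpx and asyncio; this mirrors the Relay behaviour but is a sketch, not taken from its code:
```python
import asyncio
import httpx

async def handle_chat(client: httpx.AsyncClient, session_id: str, user_message: str) -> str:
    # Synchronous read path: wait for the answer before replying to the user.
    resp = await client.post(
        "http://cortex:7081/reason",
        json={"session_id": session_id, "user_message": user_message},
    )
    answer = resp.json()["answer"]

    # Asynchronous write path: fire-and-forget ingestion; failures should only be logged.
    asyncio.create_task(client.post(
        "http://cortex:7081/ingest",
        json={"session_id": session_id,
              "user_message": user_message,
              "assistant_message": answer},
    ))
    return answer
```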
7. Deferred Summarization
Intake uses deferred summarization instead of pre-computation:
# BAD: Pre-compute summaries on every message
def add_message(session_id, message):
SESSIONS[session_id].buffer.append(message)
SESSIONS[session_id].L1_summary = summarize(last_1_message)
SESSIONS[session_id].L5_summary = summarize(last_5_messages)
# ... expensive, runs on every message
# GOOD: Compute summaries only when needed
def summarize_context(session_id):
buffer = SESSIONS[session_id].buffer
return {
"L1": summarize(buffer[-1:]), # Only compute when requested
"L5": summarize(buffer[-5:]),
"L10": summarize(buffer[-10:])
}
Benefits:
- Faster message ingestion (no blocking summarization)
- Compute resources used only when needed
- Flexible summary levels (easy to add L15, L50, etc.)
Trade-off: Slight delay the first time summaries are requested for a conversation (cold start)
API Reference
Relay Endpoints
POST /v1/chat/completions
OpenAI-compatible chat endpoint
Request:
{
"messages": [
{"role": "user", "content": "Hello, Lyra!"}
],
"session_id": "session_abc123"
}
Response:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Hi there! How can I help you today?"
}
}
]
}
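For reference, a minimal Python client for this endpoint (assuming the Relay is exposed on localhost:7078, as in the deployment section):
```python
import requests

resp = requests.post(
    "http://localhost:7078/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello, Lyra!"}],
        "session_id": "session_abc123",
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])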
POST /chat
Lyra-native chat endpoint
Request:
{
"session_id": "session_abc123",
"message": "Hello, Lyra!"
}
Response:
{
"answer": "Hi there! How can I help you today?",
"session_id": "session_abc123"
}
GET /sessions/:id
Retrieve session history
Response:
{
"session_id": "session_abc123",
"messages": [
{"role": "user", "content": "Hello", "timestamp": "..."},
{"role": "assistant", "content": "Hi!", "timestamp": "..."}
],
"created_at": "2025-12-12T10:30:00Z"
}
Cortex Endpoints
POST /reason
Main reasoning pipeline
Request:
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
Response:
{
"answer": "Final answer with Lyra's personality",
"metadata": {
"reflection": "User seeking deployment guidance...",
"draft": "Initial draft answer...",
"refined": "Polished answer...",
"stages_completed": 4
}
}
POST /ingest
Ingest message exchange into Intake
Request:
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?",
"assistant_message": "Here's how..."
}
Response:
{
"status": "ingested",
"session_id": "session_abc123",
"message_count": 24
}
GET /debug/sessions
Inspect in-memory SESSIONS state
Response:
{
"session_abc123": {
"message_count": 24,
"created_at": "2025-12-12T10:30:00Z",
"last_message_at": "2025-12-12T11:15:00Z"
},
"session_xyz789": {
"message_count": 5,
"created_at": "2025-12-12T11:00:00Z",
"last_message_at": "2025-12-12T11:10:00Z"
}
}
NeoMem Endpoints
POST /memories
Create new memory
Request:
{
"messages": [
{"role": "user", "content": "I prefer Docker for deployments"},
{"role": "assistant", "content": "Noted! I'll keep that in mind."}
],
"session_id": "session_abc123"
}
Response:
{
"status": "created",
"memory_id": "mem_456def",
"extracted_entities": ["Docker", "deployments"]
}
GET /search
Semantic search for memories
Query Parameters:
- query (required): Search query
- limit (optional, default=5): Max results
Request:
GET /search?query=deployment%20preferences&limit=5
Response:
{
"results": [
{
"content": "User prefers Docker for deployments",
"score": 0.92,
"timestamp": "2025-12-10T14:30:00Z",
"session_id": "session_abc123"
},
{
"content": "Previously deployed models on AWS ECS",
"score": 0.87,
"timestamp": "2025-12-09T09:15:00Z",
"session_id": "session_abc123"
}
]
}
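A minimal client call for the search endpoint (host and port taken from the compose file; field names follow the response shown above):
```python
import requests

resp = requests.get(
    "http://localhost:7077/search",
    params={"query": "deployment preferences", "limit": 5},
    timeout=30,
)
for hit in resp.json()["results"]:
    print(f'{hit["score"]:.2f}  {hit["content"]}')
```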
GET /memories
List all memories
Query Parameters:
- offset (optional, default=0): Pagination offset
- limit (optional, default=50): Max results
Response:
{
"memories": [
{
"id": "mem_123abc",
"content": "User prefers Docker...",
"created_at": "2025-12-10T14:30:00Z"
}
],
"total": 147,
"offset": 0,
"limit": 50
}
Deployment & Operations
Docker Compose Deployment
File: /docker-compose.yml
version: '3.8'
services:
# === ACTIVE SERVICES ===
relay:
build: ./core/relay
ports:
- "7078:7078"
environment:
- CORTEX_URL=http://cortex:7081
- NEOMEM_URL=http://neomem:7077
depends_on:
- cortex
networks:
- lyra_net
cortex:
build: ./cortex
ports:
- "7081:7081"
environment:
- NEOMEM_URL=http://neomem:7077
- PRIMARY_URL=${PRIMARY_URL}
- OPENAI_API_KEY=${OPENAI_API_KEY}
command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1
depends_on:
- neomem
networks:
- lyra_net
neomem:
build: ./neomem
ports:
- "7077:7077"
environment:
- POSTGRES_HOST=neomem-postgres
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- NEO4J_URI=${NEO4J_URI}
depends_on:
- neomem-postgres
- neomem-neo4j
networks:
- lyra_net
ui:
image: nginx:alpine
ports:
- "8081:80"
volumes:
- ./core/ui:/usr/share/nginx/html:ro
networks:
- lyra_net
# === DATABASES ===
neomem-postgres:
image: ankane/pgvector:v0.5.1
environment:
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=${POSTGRES_DB}
volumes:
- ./volumes/postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
networks:
- lyra_net
neomem-neo4j:
image: neo4j:5
environment:
- NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD}
volumes:
- ./volumes/neo4j_data:/data
ports:
- "7474:7474" # Browser UI
- "7687:7687" # Bolt
networks:
- lyra_net
networks:
lyra_net:
driver: bridge
Starting the System
# 1. Clone repository
git clone https://github.com/yourusername/project-lyra.git
cd project-lyra
# 2. Configure environment
cp .env.example .env
# Edit .env with your LLM backend URLs and API keys
# 3. Start all services
docker-compose up -d
# 4. Check health
curl http://localhost:7078/_health
curl http://localhost:7081/health
curl http://localhost:7077/health
# 5. Open UI
open http://localhost:8081
Monitoring & Logs
# View all logs
docker-compose logs -f
# View specific service
docker-compose logs -f cortex
# Check resource usage
docker stats
# Inspect Cortex sessions
curl http://localhost:7081/debug/sessions
# Check NeoMem memories
curl http://localhost:7077/memories?limit=10
Scaling Considerations
Current Constraints:
- Single Cortex worker required (in-memory SESSIONS dict)
  - Solution: Migrate SESSIONS to Redis or PostgreSQL
- In-memory session storage in Relay
  - Solution: Use Redis for session persistence
- No load balancing (single instance of each service)
  - Solution: Add nginx reverse proxy + multiple Cortex instances
Horizontal Scaling Plan:
# Future: Redis-backed session storage
cortex:
build: ./cortex
command: uvicorn main:app --workers 4 # Multi-worker
environment:
- REDIS_URL=redis://redis:6379
depends_on:
- redis
redis:
image: redis:alpine
ports:
- "6379:6379"
Backup Strategy
# Backup PostgreSQL (NeoMem vectors)
docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql
# Backup Neo4j (NeoMem graph)
docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump
# Backup Intake sessions (manual export)
curl http://localhost:7081/debug/sessions > backup_sessions.json
Known Issues & Constraints
Critical Constraints
1. Single-Worker Requirement (Cortex)
Issue: Cortex must run with --workers 1 to maintain SESSIONS state
Impact: Limited horizontal scalability
Workaround: None currently
Fix: Migrate SESSIONS to Redis or PostgreSQL
Priority: High (blocking scalability)
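For this constraint, a sketch of what a Redis-backed buffer could look like, assuming redis-py; the key layout and function names are illustrative, not existing code:
```python
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
BUFFER_MAX = 200  # same limit as the in-memory deque

def add_exchange(session_id: str, user_msg: str, assistant_msg: str) -> None:
    key = f"lyra:session:{session_id}:buffer"
    pipe = r.pipeline()
    pipe.rpush(key, json.dumps({"role": "user", "content": user_msg}))
    pipe.rpush(key, json.dumps({"role": "assistant", "content": assistant_msg}))
    pipe.ltrim(key, -BUFFER_MAX, -1)  # keep only the newest 200 entries
    pipe.execute()

def get_messages(session_id: str) -> list[dict]:
    key = f"lyra:session:{session_id}:buffer"
    return [json.loads(m) for m in r.lrange(key, 0, -1)]
```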
2. In-Memory Session Storage (Relay)
Issue: Sessions stored in Node.js process memory
Impact: Lost on restart, no persistence
Workaround: None currently
Fix: Use Redis or database
Priority: Medium (acceptable for demo)
Non-Critical Issues
3. RAG Service Disabled
Status: Built but commented out in docker-compose.yml
Impact: No RAG-based long-term knowledge retrieval
Workaround: NeoMem provides semantic search
Fix: Re-enable and integrate RAG service
Priority: Low (NeoMem sufficient for now)
4. Partial NeoMem Integration
Status: Search implemented, async ingestion planned
Impact: Memories not automatically saved
Workaround: Manual POST to /memories
Fix: Complete async ingestion in Relay
Priority: Medium (planned feature)
5. Inner Monologue Observer-Only
Status: Stage 0.6 runs but output not used
Impact: No adaptive response based on monologue
Workaround: None (future feature)
Fix: Integrate monologue output into pipeline
Priority: Low (experimental feature)
Fixed Issues (v0.5.2)
✅ LLM Router Blocking - Migrated from requests to httpx for async
✅ Session ID Case Mismatch - Standardized to session_id
✅ Missing Backend Parameter - Added to intake summarization
Deprecated Components
Location: /DEPRECATED_FILES.md
- Standalone Intake Service - Now embedded in Cortex
- Old Relay Backup - Replaced by current Relay
- Persona Sidecar - Built but unused (dynamic persona loading)
Advanced Topics
Custom Prompt Engineering
Each stage uses carefully crafted prompts:
Reflection Prompt Example:
REFLECTION_PROMPT = """
You are Lyra's reflective awareness layer.
Your job is to analyze the user's message and conversation context
to understand their true intent and needs.
User message: {user_message}
Recent context:
{intake_L10_summary}
Long-term context:
{neomem_top_3_memories}
Provide concise meta-awareness notes:
- What is the user's underlying intent?
- What topics/themes are emerging?
- What depth of response is appropriate?
- Are there any implicit questions or concerns?
Keep notes brief (3-5 sentences). Focus on insight, not description.
"""
Extending the Pipeline
Adding Stage 5 (Fact-Checking):
# /cortex/reasoning/factcheck.py
async def factcheck_answer(answer: str, context: dict) -> dict:
"""
Stage 5: Verify factual claims in answer.
Returns:
{
"verified": bool,
"flagged_claims": list,
"corrected_answer": str
}
"""
prompt = f"""
Review this answer for factual accuracy:
{answer}
Flag any claims that seem dubious or need verification.
Provide corrected version if needed.
"""
result = await call_llm(prompt, backend="CLOUD", temperature=0.1)
return parse_factcheck_result(result)
# Update router.py to include Stage 5
async def reason_endpoint(request):
# ... existing stages ...
# Stage 5: Fact-checking
factcheck_result = await factcheck_answer(final_answer, context)
if not factcheck_result["verified"]:
final_answer = factcheck_result["corrected_answer"]
return {"answer": final_answer}
Custom LLM Backend Integration
Adding Anthropic Claude:
# /cortex/llm/llm_router.py
BACKEND_CONFIGS = {
# ... existing backends ...
"CLAUDE": {
"url": "https://api.anthropic.com/v1",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"api_key": os.getenv("ANTHROPIC_API_KEY")
}
}
# Add provider-specific logic
elif backend_config["provider"] == "anthropic":
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01"
}
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": temperature
}
response = await httpx_client.post(
f"{url}/messages",
json=payload,
headers=headers,
timeout=120
)
return response.json()["content"][0]["text"]
Performance Optimization
Caching Strategies:
# /cortex/utils/cache.py
import hashlib

# Simple in-process cache; only deterministic calls (temperature=0) are cached.
_LLM_CACHE: dict[tuple[str, str], str] = {}

def _cache_key(prompt: str, backend: str) -> tuple[str, str]:
    return (hashlib.md5(prompt.encode()).hexdigest(), backend)

def get_cached_response(prompt: str, backend: str) -> str | None:
    """Return a cached LLM response for an identical prompt, if any."""
    return _LLM_CACHE.get(_cache_key(prompt, backend))

def store_cached_response(prompt: str, backend: str, response: str) -> None:
    _LLM_CACHE[_cache_key(prompt, backend)] = response

# Usage in llm_router.py
async def call_llm(prompt, backend, temperature=0.7, max_tokens=512):
    if temperature == 0:
        cached = get_cached_response(prompt, backend)
        if cached is not None:
            return cached
    # ... normal LLM call, then store_cached_response(prompt, backend, response) when temperature == 0 ...
Database Query Optimization:
# /neomem/neomem/database.py
# BAD: Load all memories, then filter
def search_memories(query):
all_memories = db.execute("SELECT * FROM memories")
# Expensive in-memory filtering
return [m for m in all_memories if similarity(m, query) > 0.8]
# GOOD: Use database indexes and LIMIT
def search_memories(query, limit=5):
query_embedding = embed(query)
return db.execute("""
SELECT * FROM memories
WHERE embedding <-> %s < 0.2 -- pgvector L2 distance (use <=> for cosine distance)
ORDER BY embedding <-> %s
LIMIT %s
""", (query_embedding, query_embedding, limit))
Conclusion
Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through:
- Dual-memory architecture (short-term Intake + long-term NeoMem)
- Multi-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Flexible multi-backend LLM support (cloud + local with fallback)
- Microservices design for scalability and maintainability
- Modern web UI with session management
The system runs end to end with error handling, logging, and health monitoring; the constraints listed under Known Issues (single-worker Cortex, in-memory Relay sessions) are the main gaps remaining before production-scale deployment.
Quick Reference
Service Ports
- UI: 8081 (Browser interface)
- Relay: 7078 (Main orchestrator)
- Cortex: 7081 (Reasoning engine)
- NeoMem: 7077 (Long-term memory)
- PostgreSQL: 5432 (Vector storage)
- Neo4j: 7474 (Browser), 7687 (Bolt)
Key Files
- Main Entry: /core/relay/server.js
- Reasoning Pipeline: /cortex/router.py
- LLM Router: /cortex/llm/llm_router.py
- Short-term Memory: /cortex/intake/intake.py
- Long-term Memory: /neomem/neomem/
- Personality: /cortex/persona/identity.py
Important Commands
# Start system
docker-compose up -d
# View logs
docker-compose logs -f cortex
# Debug sessions
curl http://localhost:7081/debug/sessions
# Health check
curl http://localhost:7078/_health
# Search memories
curl "http://localhost:7077/search?query=deployment&limit=5"
Document Version: 1.0 Last Updated: 2025-12-13 Maintained By: Project Lyra Team