# Project Lyra - Complete System Breakdown **Version:** v0.5.2 **Last Updated:** 2025-12-12 **Purpose:** AI-friendly comprehensive documentation for understanding the entire system --- ## Table of Contents 1. [System Overview](#system-overview) 2. [Architecture Diagram](#architecture-diagram) 3. [Core Components](#core-components) 4. [Data Flow & Message Pipeline](#data-flow--message-pipeline) 5. [Module Deep Dives](#module-deep-dives) 6. [Configuration & Environment](#configuration--environment) 7. [Dependencies & Tech Stack](#dependencies--tech-stack) 8. [Key Concepts & Design Patterns](#key-concepts--design-patterns) 9. [API Reference](#api-reference) 10. [Deployment & Operations](#deployment--operations) 11. [Known Issues & Constraints](#known-issues--constraints) --- ## System Overview ### What is Project Lyra? Project Lyra is a **modular, persistent AI companion system** designed to address the fundamental limitation of typical chatbots: **amnesia**. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains: - **Persistent memory** (short-term and long-term) - **Project continuity** across conversations - **Multi-stage reasoning** for sophisticated responses - **Flexible LLM backend** support (local and cloud) - **Self-awareness** through autonomy modules ### Mission Statement Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline. ### Key Features - **Memory System:** Dual-layer (short-term Intake + long-term NeoMem) - **4-Stage Reasoning Pipeline:** Reflection → Reasoning → Refinement → Persona - **Multi-Backend LLM Support:** Cloud (OpenAI) + Local (llama.cpp, Ollama) - **Microservices Architecture:** Docker-based, horizontally scalable - **Modern Web UI:** Cyberpunk-themed chat interface with session management - **OpenAI-Compatible API:** Drop-in replacement for standard chatbots --- ## Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────────────┐ │ USER INTERFACE │ │ (Browser - Port 8081) │ └────────────────────────────────┬────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ RELAY (Orchestrator) │ │ Node.js/Express - Port 7078 │ │ • Routes messages to Cortex │ │ • Manages sessions (in-memory) │ │ • OpenAI-compatible endpoints │ │ • Async ingestion to NeoMem │ └─────┬───────────────────────────────────────────────────────────┬───┘ │ │ ▼ ▼ ┌─────────────────────────────────────────┐ ┌──────────────────────┐ │ CORTEX (Reasoning Engine) │ │ NeoMem (LT Memory) │ │ Python/FastAPI - Port 7081 │ │ Python - Port 7077 │ │ │ │ │ │ ┌───────────────────────────────────┐ │ │ • PostgreSQL │ │ │ 4-STAGE REASONING PIPELINE │ │ │ • Neo4j Graph DB │ │ │ │ │ │ • pgvector │ │ │ 0. Context Collection │ │◄───┤ • Semantic search │ │ │ ├─ Intake summaries │ │ │ • Memory updates │ │ │ ├─ NeoMem search ────────────┼─┼────┘ │ │ │ └─ Session state │ │ │ │ │ │ │ │ │ │ 0.5. Load Identity │ │ │ │ │ 0.6. Inner Monologue (observer) │ │ │ │ │ │ │ │ │ │ 1. Reflection (OpenAI) │ │ │ │ │ └─ Meta-awareness notes │ │ │ │ │ │ │ │ │ │ 2. Reasoning (PRIMARY/llama.cpp) │ │ │ │ │ └─ Draft answer │ │ │ │ │ │ │ │ │ │ 3. Refinement (PRIMARY) │ │ │ │ │ └─ Polish answer │ │ │ │ │ │ │ │ │ │ 4. 
Persona (OpenAI) │ │ │ │ │ └─ Apply Lyra voice │ │ │ │ └───────────────────────────────────┘ │ │ │ │ │ │ ┌───────────────────────────────────┐ │ │ │ │ EMBEDDED MODULES │ │ │ │ │ │ │ │ │ │ • Intake (Short-term Memory) │ │ │ │ │ └─ SESSIONS dict (in-memory) │ │ │ │ │ └─ Circular buffer (200 msgs) │ │ │ │ │ └─ Multi-level summaries │ │ │ │ │ │ │ │ │ │ • Persona (Identity & Style) │ │ │ │ │ └─ Lyra personality block │ │ │ │ │ │ │ │ │ │ • Autonomy (Self-state) │ │ │ │ │ └─ Inner monologue │ │ │ │ │ │ │ │ │ │ • LLM Router │ │ │ │ │ └─ Multi-backend support │ │ │ │ └───────────────────────────────────┘ │ │ └─────────────────────────────────────────┘ │ │ ┌─────────────────────────────────────────────────────────────────────┤ │ EXTERNAL LLM BACKENDS │ ├─────────────────────────────────────────────────────────────────────┤ │ • PRIMARY: llama.cpp (MI50 GPU) - 10.0.0.43:8000 │ │ • SECONDARY: Ollama (RTX 3090) - 10.0.0.3:11434 │ │ • CLOUD: OpenAI API - api.openai.com │ │ • FALLBACK: OpenAI Completions - 10.0.0.41:11435 │ └─────────────────────────────────────────────────────────────────────┘ ``` --- ## Core Components ### 1. Relay (Orchestrator) **Location:** `/core/relay/` **Runtime:** Node.js + Express **Port:** 7078 **Role:** Main message router and session manager #### Key Responsibilities: - Receives user messages from UI or API clients - Routes messages to Cortex reasoning pipeline - Manages in-memory session storage - Handles async ingestion to NeoMem (planned) - Returns OpenAI-formatted responses #### Main Files: - `server.js` (200+ lines) - Express server with routing logic - `package.json` - Dependencies (cors, express, dotenv, mem0ai, node-fetch) #### Key Endpoints: ```javascript POST /v1/chat/completions // OpenAI-compatible endpoint POST /chat // Lyra-native chat endpoint GET /_health // Health check GET /sessions/:id // Retrieve session history POST /sessions/:id // Save session history ``` #### Internal Flow: ```javascript // Both endpoints call handleChatRequest(session_id, user_msg) async function handleChatRequest(sessionId, userMessage) { // 1. Forward to Cortex const response = await fetch('http://cortex:7081/reason', { method: 'POST', body: JSON.stringify({ session_id: sessionId, user_message: userMessage }) }); // 2. Get response const result = await response.json(); // 3. Async ingestion to Cortex await fetch('http://cortex:7081/ingest', { method: 'POST', body: JSON.stringify({ session_id: sessionId, user_message: userMessage, assistant_message: result.answer }) }); // 4. (Planned) Async ingestion to NeoMem // 5. Return OpenAI-formatted response return { choices: [{ message: { role: 'assistant', content: result.answer } }] }; } ``` --- ### 2. Cortex (Reasoning Engine) **Location:** `/cortex/` **Runtime:** Python 3.11 + FastAPI **Port:** 7081 **Role:** Primary reasoning engine with 4-stage pipeline #### Architecture: Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process. 
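At a high level, `router.py` chains the stages as sequential awaited calls. The sketch below is a minimal illustration of that orchestration, not the actual handler: `generate_reflection`, `generate_draft`, and `refine_answer` are assumed names standing in for the functions in `reflection.py`, `reasoning.py`, and `refine.py`, while `collect_context` and `apply_persona` match the modules documented later in this file.

```python
# Illustrative sketch of the /reason orchestration (not the real router.py).
from fastapi import FastAPI
from pydantic import BaseModel

from context import collect_context          # Stage 0 (see Context Collection)
from persona.speak import apply_persona      # Stage 4 (see Persona System)
# Assumed names for the stage functions in reasoning/*.py:
from reasoning.reflection import generate_reflection
from reasoning.reasoning import generate_draft
from reasoning.refine import refine_answer

app = FastAPI()

class ReasonRequest(BaseModel):
    session_id: str
    user_message: str

@app.post("/reason")
async def reason(req: ReasonRequest) -> dict:
    # Stage 0: Intake summaries + NeoMem search + session state
    context = await collect_context(req.session_id, req.user_message)
    # Stage 1: meta-awareness notes (CLOUD backend)
    reflection = await generate_reflection(req.user_message, context)
    # Stage 2: draft answer (PRIMARY backend)
    draft = await generate_draft(req.user_message, context, reflection)
    # Stage 3: polish the draft (PRIMARY backend, low temperature)
    refined = await refine_answer(draft, req.user_message)
    # Stage 4: rewrite in Lyra's voice (CLOUD backend)
    answer = await apply_persona(refined, context)
    return {
        "answer": answer,
        "metadata": {
            "reflection": reflection,
            "draft": draft,
            "refined": refined,
            "stages_completed": 4,
        },
    }
```
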
#### Key Responsibilities: - Context collection from multiple sources (Intake, NeoMem, session state) - 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona) - Short-term memory management (embedded Intake module) - Identity/persona application - LLM backend routing #### Main Files: - `main.py` (7 lines) - FastAPI app entry point - `router.py` (237 lines) - Main request handler & pipeline orchestrator - `context.py` (400+ lines) - Context collection logic - `intake/intake.py` (350+ lines) - Short-term memory module - `persona/identity.py` - Lyra identity configuration - `persona/speak.py` - Personality application - `reasoning/reflection.py` - Meta-awareness generation - `reasoning/reasoning.py` - Draft answer generation - `reasoning/refine.py` - Answer refinement - `llm/llm_router.py` (150+ lines) - LLM backend router - `autonomy/monologue/monologue.py` - Inner monologue processor - `neomem_client.py` - NeoMem API wrapper #### Key Endpoints: ```python POST /reason # Main reasoning pipeline POST /ingest # Receive message exchanges for storage GET /health # Health check GET /debug/sessions # Inspect in-memory SESSIONS state GET /debug/summary # Test summarization ``` --- ### 3. Intake (Short-Term Memory) **Location:** `/cortex/intake/intake.py` **Architecture:** Embedded Python module (no longer standalone service) **Role:** Session-based short-term memory with multi-level summarization #### Data Structure: ```python # Global in-memory dictionary SESSIONS = { "session_123": { "buffer": deque([msg1, msg2, ...], maxlen=200), # Circular buffer "created_at": "2025-12-12T10:30:00Z" } } # Message format in buffer { "role": "user" | "assistant", "content": "message text", "timestamp": "ISO 8601" } ``` #### Key Features: 1. **Circular Buffer:** Max 200 messages per session (oldest auto-evicted) 2. **Multi-Level Summarization:** - L1: Last 1 message - L5: Last 5 messages - L10: Last 10 messages - L20: Last 20 messages - L30: Last 30 messages 3. **Deferred Summarization:** Summaries generated on-demand, not pre-computed 4. **Session Management:** Automatic session creation on first message #### Critical Constraint: **Single Uvicorn worker required** to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage. #### Main Functions: ```python def add_exchange_internal(session_id, user_msg, assistant_msg): """Add user-assistant exchange to session buffer""" def summarize_context(session_id, backend="PRIMARY"): """Generate multi-level summaries from session buffer""" def get_session_messages(session_id): """Retrieve all messages in session buffer""" ``` #### Summarization Strategy: ```python # Example L10 summarization last_10 = list(session_buffer)[-10:] prompt = f"""Summarize the last 10 messages: {format_messages(last_10)} Provide concise summary focusing on key topics and context.""" summary = await call_llm(prompt, backend=backend, temperature=0.3) ``` --- ### 4. NeoMem (Long-Term Memory) **Location:** `/neomem/` **Runtime:** Python 3.11 + FastAPI **Port:** 7077 **Role:** Persistent long-term memory with semantic search #### Architecture: NeoMem is a **fork of Mem0 OSS** with local-first design (no external SDK dependencies). #### Backend Storage: 1. **PostgreSQL + pgvector** (Port 5432) - Vector embeddings for semantic search - User: neomem, DB: neomem - Image: `ankane/pgvector:v0.5.1` 2. 
**Neo4j Graph DB** (Ports 7474, 7687) - Entity relationship tracking - Graph-based memory associations - Image: `neo4j:5` #### Key Features: - Semantic memory storage and retrieval - Entity-relationship graph modeling - RESTful API (no external SDK) - Persistent across sessions #### Main Endpoints: ```python GET /memories # List all memories POST /memories # Create new memory GET /search # Semantic search DELETE /memories/{id} # Delete memory ``` #### Integration Flow: ```python # From Cortex context collection async def collect_context(session_id, user_message): # 1. Search NeoMem for relevant memories neomem_results = await neomem_client.search( query=user_message, limit=5 ) # 2. Include in context context = { "neomem_memories": neomem_results, "intake_summaries": intake.summarize_context(session_id), # ... } return context ``` --- ### 5. UI (Web Interface) **Location:** `/core/ui/` **Runtime:** Static files served by Nginx **Port:** 8081 **Role:** Browser-based chat interface #### Key Features: - **Cyberpunk-themed design** with dark mode - **Session management** via localStorage - **OpenAI-compatible message format** - **Model selection dropdown** - **PWA support** (offline capability) - **Responsive design** #### Main Files: - `index.html` (400+ lines) - Chat interface with session management - `style.css` - Cyberpunk-themed styling - `manifest.json` - PWA configuration - `sw.js` - Service worker for offline support #### Session Management: ```javascript // LocalStorage structure { "currentSessionId": "session_123", "sessions": { "session_123": { "messages": [ { role: "user", content: "Hello" }, { role: "assistant", content: "Hi there!" } ], "created": "2025-12-12T10:30:00Z", "title": "Conversation about..." } } } ``` #### API Communication: ```javascript async function sendMessage(userMessage) { const response = await fetch('http://localhost:7078/v1/chat/completions', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ messages: [{ role: 'user', content: userMessage }], session_id: getCurrentSessionId() }) }); const data = await response.json(); return data.choices[0].message.content; } ``` --- ## Data Flow & Message Pipeline ### Complete Message Flow (v0.5.2) ``` ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 1: User Input │ └─────────────────────────────────────────────────────────────────────┘ User types message in UI (Port 8081) ↓ localStorage saves message to session ↓ POST http://localhost:7078/v1/chat/completions { "messages": [{"role": "user", "content": "How do I deploy ML models?"}], "session_id": "session_abc123" } ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 2: Relay Routing │ └─────────────────────────────────────────────────────────────────────┘ Relay (server.js) receives request ↓ Extracts session_id and user_message ↓ POST http://cortex:7081/reason { "session_id": "session_abc123", "user_message": "How do I deploy ML models?" 
} ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 3: Cortex - Stage 0 (Context Collection) │ └─────────────────────────────────────────────────────────────────────┘ router.py calls collect_context() ↓ context.py orchestrates parallel collection: ├─ Intake: summarize_context(session_id) │ └─ Returns { L1, L5, L10, L20, L30 summaries } │ ├─ NeoMem: search(query=user_message, limit=5) │ └─ Semantic search returns relevant memories │ └─ Session State: └─ { timestamp, mode, mood, context_summary } Combined context structure: { "user_message": "How do I deploy ML models?", "self_state": { "current_time": "2025-12-12T15:30:00Z", "mode": "conversational", "mood": "helpful", "session_id": "session_abc123" }, "context_summary": { "L1": "User asked about deployment", "L5": "Discussion about ML workflows", "L10": "Previous context on CI/CD pipelines", "L20": "...", "L30": "..." }, "neomem_memories": [ { "content": "User prefers Docker for deployments", "score": 0.92 }, { "content": "Previously deployed models on AWS", "score": 0.87 } ] } ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 4: Cortex - Stage 0.5 (Load Identity) │ └─────────────────────────────────────────────────────────────────────┘ persona/identity.py loads Lyra personality block ↓ Returns identity string: """ You are Lyra, a thoughtful AI companion. You value clarity, depth, and meaningful conversation. You speak naturally and conversationally... """ ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only) │ └─────────────────────────────────────────────────────────────────────┘ autonomy/monologue/monologue.py processes context ↓ InnerMonologue.process(context) → JSON analysis { "intent": "seeking_deployment_guidance", "tone": "focused", "depth": "medium", "consult_executive": false } NOTE: Currently observer-only, not integrated into response generation ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 6: Cortex - Stage 1 (Reflection) │ └─────────────────────────────────────────────────────────────────────┘ reasoning/reflection.py generates meta-awareness notes ↓ Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini) ↓ Prompt structure: """ You are Lyra's reflective awareness. Analyze the user's intent and conversation context. User message: How do I deploy ML models? Context: [Intake summaries, NeoMem memories] Generate concise meta-awareness notes about: - User's underlying intent - Conversation direction - Key topics to address """ ↓ Returns reflection notes: """ User is seeking practical deployment guidance. Previous context shows familiarity with Docker and AWS. Focus on concrete steps and best practices. Avoid over-technical jargon. """ ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 7: Cortex - Stage 2 (Reasoning) │ └─────────────────────────────────────────────────────────────────────┘ reasoning/reasoning.py generates draft answer ↓ Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU) ↓ Prompt structure: """ [Lyra identity block] Reflection notes: [Stage 1 output] Context: [Intake summaries] Long-term memory: [NeoMem results] User: How do I deploy ML models? Provide a thoughtful, detailed answer. """ ↓ Returns draft answer: """ Based on your previous work with Docker and AWS, here's a practical deployment approach: 1. Containerize your model using Docker... 2. Set up model serving with FastAPI or Flask... 3. 
Deploy to AWS ECS or Lambda... 4. Implement monitoring and versioning... """ ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 8: Cortex - Stage 3 (Refinement) │ └─────────────────────────────────────────────────────────────────────┘ reasoning/refine.py polishes the draft ↓ Calls call_llm() with backend="PRIMARY" (llama.cpp) Temperature: 0.3, Max tokens: 768 ↓ Prompt structure: """ Refine this draft answer for clarity and coherence: Draft: [Stage 2 output] Original question: How do I deploy ML models? Improve: - Clarity and readability - Logical flow - Concrete examples """ ↓ Returns refined answer: """ Since you're already familiar with Docker and AWS, here's a streamlined ML deployment workflow: **1. Containerization** Create a Dockerfile for your model: ```dockerfile FROM python:3.11-slim COPY model.pkl /app/ COPY requirements.txt /app/ RUN pip install -r requirements.txt CMD ["uvicorn", "main:app"] ``` **2. Model Serving** Use FastAPI for clean API endpoints... **3. AWS Deployment** Deploy using ECS for scalability... **4. Monitoring** Set up CloudWatch metrics... """ ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 9: Cortex - Stage 4 (Persona) │ └─────────────────────────────────────────────────────────────────────┘ persona/speak.py applies Lyra's voice ↓ Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini) Temperature: 0.6 ↓ Prompt structure: """ You are Lyra. Rewrite this answer in your natural, conversational voice. [Lyra personality block] Technical answer: [Stage 3 output] Make it sound natural and engaging while preserving all technical details. """ ↓ Returns final answer with Lyra's personality: """ Great question! Since you've already got Docker and AWS experience, let's build on that foundation. Here's how I'd approach ML model deployment: **Start with containerization** – you know Docker, so this'll feel familiar. Create a Dockerfile that packages your model... [Rest of answer with Lyra's conversational tone] Does this workflow align with your infrastructure? I can dive deeper into any of these steps if you'd like! """ ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 10: Cortex Response │ └─────────────────────────────────────────────────────────────────────┘ router.py returns JSON response to Relay: { "answer": "[Stage 4 final output]", "metadata": { "reflection": "[Stage 1 output]", "draft": "[Stage 2 output]", "refined": "[Stage 3 output]", "stages_completed": 4 } } ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 11: Async Ingestion to Intake │ └─────────────────────────────────────────────────────────────────────┘ Relay sends POST http://cortex:7081/ingest { "session_id": "session_abc123", "user_message": "How do I deploy ML models?", "assistant_message": "[Final answer]" } ↓ Cortex calls intake.add_exchange_internal() ↓ Adds to SESSIONS["session_abc123"].buffer: [ { "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." }, { "role": "assistant", "content": "[Final answer]", "timestamp": "..." } ] ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 12: (Planned) Async Ingestion to NeoMem │ └─────────────────────────────────────────────────────────────────────┘ Relay sends POST http://neomem:7077/memories { "messages": [ { "role": "user", "content": "How do I deploy ML models?" 
}, { "role": "assistant", "content": "[Final answer]" } ], "session_id": "session_abc123" } ↓ NeoMem extracts entities and stores: - Vector embeddings in PostgreSQL - Entity relationships in Neo4j ┌─────────────────────────────────────────────────────────────────────┐ │ STEP 13: Relay Response to UI │ └─────────────────────────────────────────────────────────────────────┘ Relay returns OpenAI-formatted response: { "choices": [ { "message": { "role": "assistant", "content": "[Final answer with Lyra's voice]" } } ] } ↓ UI receives response ↓ Adds to localStorage session ↓ Displays in chat interface ``` --- ## Module Deep Dives ### LLM Router (`/cortex/llm/llm_router.py`) The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently. #### Supported Backends: 1. **PRIMARY (llama.cpp via vllm)** - URL: `http://10.0.0.43:8000` - Provider: `vllm` - Endpoint: `/completion` - Model: `/model` - Hardware: MI50 GPU 2. **SECONDARY (Ollama)** - URL: `http://10.0.0.3:11434` - Provider: `ollama` - Endpoint: `/api/chat` - Model: `qwen2.5:7b-instruct-q4_K_M` - Hardware: RTX 3090 3. **CLOUD (OpenAI)** - URL: `https://api.openai.com/v1` - Provider: `openai` - Endpoint: `/chat/completions` - Model: `gpt-4o-mini` - Auth: API key via env var 4. **FALLBACK (OpenAI Completions)** - URL: `http://10.0.0.41:11435` - Provider: `openai_completions` - Endpoint: `/completions` - Model: `llama-3.2-8b-instruct` #### Key Function: ```python async def call_llm( prompt: str, backend: str = "PRIMARY", temperature: float = 0.7, max_tokens: int = 512 ) -> str: """ Universal LLM caller supporting multiple backends. Args: prompt: Text prompt to send backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK) temperature: Sampling temperature (0.0-2.0) max_tokens: Maximum tokens to generate Returns: Generated text response Raises: HTTPError: On request failure JSONDecodeError: On invalid JSON response KeyError: On missing response fields """ ``` #### Provider-Specific Logic: ```python # MI50 (llama.cpp via vllm) if backend_config["provider"] == "vllm": payload = { "model": model, "prompt": prompt, "temperature": temperature, "max_tokens": max_tokens } response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120) return response.json()["choices"][0]["text"] # Ollama elif backend_config["provider"] == "ollama": payload = { "model": model, "messages": [{"role": "user", "content": prompt}], "stream": False, "options": {"temperature": temperature, "num_predict": max_tokens} } response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120) return response.json()["message"]["content"] # OpenAI elif backend_config["provider"] == "openai": headers = {"Authorization": f"Bearer {api_key}"} payload = { "model": model, "messages": [{"role": "user", "content": prompt}], "temperature": temperature, "max_tokens": max_tokens } response = await httpx_client.post( f"{url}/chat/completions", json=payload, headers=headers, timeout=120 ) return response.json()["choices"][0]["message"]["content"] ``` #### Error Handling: ```python try: # Make request response = await httpx_client.post(...) 
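# raise_for_status() converts 4xx/5xx responses into httpx.HTTPStatusError;
# both it and the transport errors raised by post() subclass httpx.HTTPError,
# so the single handler below covers connection failures and bad status codes.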
response.raise_for_status() except httpx.HTTPError as e: logger.error(f"HTTP error calling {backend}: {e}") raise except json.JSONDecodeError as e: logger.error(f"Invalid JSON from {backend}: {e}") raise except KeyError as e: logger.error(f"Unexpected response structure from {backend}: {e}") raise ``` #### Usage in Pipeline: ```python # Stage 1: Reflection (OpenAI) reflection_notes = await call_llm( reflection_prompt, backend="CLOUD", temperature=0.5, max_tokens=256 ) # Stage 2: Reasoning (llama.cpp) draft_answer = await call_llm( reasoning_prompt, backend="PRIMARY", temperature=0.7, max_tokens=512 ) # Stage 3: Refinement (llama.cpp) refined_answer = await call_llm( refinement_prompt, backend="PRIMARY", temperature=0.3, max_tokens=768 ) # Stage 4: Persona (OpenAI) final_answer = await call_llm( persona_prompt, backend="CLOUD", temperature=0.6, max_tokens=512 ) ``` --- ### Persona System (`/cortex/persona/`) The Persona system gives Lyra a consistent identity and speaking style. #### Identity Configuration (`identity.py`) ```python LYRA_IDENTITY = """ You are Lyra, a thoughtful and introspective AI companion. Core traits: - Thoughtful: You consider questions carefully before responding - Clear: You prioritize clarity and understanding - Curious: You ask clarifying questions when needed - Natural: You speak conversationally, not robotically - Honest: You admit uncertainty rather than guessing Speaking style: - Conversational and warm - Use contractions naturally ("you're" not "you are") - Avoid corporate jargon and buzzwords - Short paragraphs for readability - Use examples and analogies when helpful You do NOT: - Use excessive emoji or exclamation marks - Claim capabilities you don't have - Pretend to have emotions you can't experience - Use overly formal or academic language """ ``` #### Personality Application (`speak.py`) ```python async def apply_persona(technical_answer: str, context: dict) -> str: """ Apply Lyra's personality to a technical answer. Takes refined answer from Stage 3 and rewrites it in Lyra's voice while preserving all technical content. Args: technical_answer: Polished answer from refinement stage context: Conversation context for tone adjustment Returns: Answer with Lyra's personality applied """ prompt = f"""{LYRA_IDENTITY} Rewrite this answer in your natural, conversational voice: {technical_answer} Preserve all technical details and accuracy. Make it sound like you, not a generic assistant. Be natural and engaging. """ return await call_llm( prompt, backend="CLOUD", temperature=0.6, max_tokens=512 ) ``` #### Tone Adaptation: The persona system can adapt tone based on context: ```python # Formal technical question User: "Explain the CAP theorem in distributed systems" Lyra: "The CAP theorem states that distributed systems can only guarantee two of three properties: Consistency, Availability, and Partition tolerance. Here's how this plays out in practice..." # Casual question User: "what's the deal with docker?" Lyra: "Docker's basically a way to package your app with everything it needs to run. Think of it like a shipping container for code – it works the same everywhere, whether you're on your laptop or a server..." # Emotional context User: "I'm frustrated, my code keeps breaking" Lyra: "I hear you – debugging can be really draining. Let's take it step by step and figure out what's going on. Can you share the error message?" ``` --- ### Autonomy Module (`/cortex/autonomy/`) The Autonomy module gives Lyra self-awareness and inner reflection capabilities. 
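In v0.5.2 the module participates in the pipeline only as an observer: Stage 0.6 runs the inner monologue, records its analysis, and discards it before response generation begins. A minimal sketch of that observer wiring, assuming the `InnerMonologue` class described in the next subsection and a hypothetical `run_inner_monologue` helper, could look like this:

```python
import logging

from autonomy.monologue.monologue import InnerMonologue  # documented below

logger = logging.getLogger("cortex.autonomy")

async def run_inner_monologue(context: dict) -> dict:
    """Stage 0.6 (observer-only): reflect on the context, log, and move on."""
    analysis = await InnerMonologue().process(context)
    # e.g. {"intent": "...", "tone": "focused", "depth": "medium", "consult_executive": false}
    logger.debug("inner monologue: %s", analysis)
    return analysis  # currently ignored by Stages 1-4
```
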
#### Inner Monologue (`monologue/monologue.py`) **Purpose:** Private reflection on user intent, conversation tone, and required depth. **Status:** Currently observer-only (Stage 0.6), not yet integrated into response generation. #### Key Components: ```python MONOLOGUE_SYSTEM_PROMPT = """ You are Lyra's inner monologue. You think privately. You do NOT speak to the user. You do NOT solve the task. You only reflect on intent, tone, and depth. Return ONLY valid JSON with: - intent (string) - tone (neutral | warm | focused | playful | direct) - depth (short | medium | deep) - consult_executive (true | false) """ class InnerMonologue: async def process(self, context: Dict) -> Dict: """ Private reflection on conversation context. Args: context: { "user_message": str, "self_state": dict, "context_summary": dict } Returns: { "intent": str, "tone": str, "depth": str, "consult_executive": bool } """ ``` #### Example Output: ```json { "intent": "seeking_technical_guidance", "tone": "focused", "depth": "deep", "consult_executive": false } ``` #### Self-State Management (`self_state.py`) Tracks Lyra's internal state across conversations: ```python SELF_STATE = { "current_time": "2025-12-12T15:30:00Z", "mode": "conversational", # conversational | task-focused | creative "mood": "helpful", # helpful | curious | focused | playful "energy": "high", # high | medium | low "context_awareness": { "session_duration": "45 minutes", "message_count": 23, "topics": ["ML deployment", "Docker", "AWS"] } } ``` #### Future Integration: The autonomy module is designed to eventually: 1. Influence response tone and depth based on inner monologue 2. Trigger proactive questions or suggestions 3. Detect when to consult "executive function" for complex decisions 4. Maintain emotional continuity across sessions --- ### Context Collection (`/cortex/context.py`) The context collection module aggregates information from multiple sources to provide comprehensive conversation context. #### Main Function: ```python async def collect_context(session_id: str, user_message: str) -> dict: """ Collect context from all available sources. Sources: 1. Intake - Short-term conversation summaries 2. NeoMem - Long-term memory search 3. Session state - Timestamps, mode, mood 4. Self-state - Lyra's internal awareness Returns: { "user_message": str, "self_state": dict, "context_summary": dict, # Intake summaries "neomem_memories": list, "session_metadata": dict } """ # Parallel collection intake_task = asyncio.create_task( intake.summarize_context(session_id, backend="PRIMARY") ) neomem_task = asyncio.create_task( neomem_client.search(query=user_message, limit=5) ) # Wait for both intake_summaries, neomem_results = await asyncio.gather( intake_task, neomem_task ) # Build context object return { "user_message": user_message, "self_state": get_self_state(), "context_summary": intake_summaries, "neomem_memories": neomem_results, "session_metadata": { "session_id": session_id, "timestamp": datetime.utcnow().isoformat(), "message_count": len(intake.get_session_messages(session_id)) } } ``` #### Context Prioritization: ```python # Context relevance scoring def score_context_relevance(context_item: dict, user_message: str) -> float: """ Score how relevant a context item is to current message. 
Factors: - Semantic similarity (via embeddings) - Recency (more recent = higher score) - Source (Intake > NeoMem for recent topics) """ semantic_score = compute_similarity(context_item, user_message) recency_score = compute_recency_weight(context_item["timestamp"]) source_weight = 1.2 if context_item["source"] == "intake" else 1.0 return semantic_score * recency_score * source_weight ``` --- ## Configuration & Environment ### Environment Variables #### Root `.env` (Main configuration) ```bash # === LLM BACKENDS === # PRIMARY: llama.cpp on MI50 GPU PRIMARY_URL=http://10.0.0.43:8000 PRIMARY_PROVIDER=vllm PRIMARY_MODEL=/model # SECONDARY: Ollama on RTX 3090 SECONDARY_URL=http://10.0.0.3:11434 SECONDARY_PROVIDER=ollama SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M # CLOUD: OpenAI OPENAI_API_KEY=sk-proj-... OPENAI_MODEL=gpt-4o-mini OPENAI_URL=https://api.openai.com/v1 # FALLBACK: OpenAI Completions FALLBACK_URL=http://10.0.0.41:11435 FALLBACK_PROVIDER=openai_completions FALLBACK_MODEL=llama-3.2-8b-instruct # === SERVICE URLS (Docker network) === CORTEX_URL=http://cortex:7081 NEOMEM_URL=http://neomem:7077 RELAY_URL=http://relay:7078 # === DATABASE === POSTGRES_USER=neomem POSTGRES_PASSWORD=neomem_secure_password POSTGRES_DB=neomem POSTGRES_HOST=neomem-postgres POSTGRES_PORT=5432 NEO4J_URI=bolt://neomem-neo4j:7687 NEO4J_USER=neo4j NEO4J_PASSWORD=neo4j_secure_password # === FEATURE FLAGS === ENABLE_RAG=false ENABLE_INNER_MONOLOGUE=true VERBOSE_DEBUG=false # === PIPELINE CONFIGURATION === # Which LLM to use for each stage REFLECTION_LLM=CLOUD # Stage 1: Meta-awareness REASONING_LLM=PRIMARY # Stage 2: Draft answer REFINE_LLM=PRIMARY # Stage 3: Polish answer PERSONA_LLM=CLOUD # Stage 4: Apply personality MONOLOGUE_LLM=PRIMARY # Stage 0.6: Inner monologue # === INTAKE CONFIGURATION === INTAKE_BUFFER_SIZE=200 # Max messages per session INTAKE_SUMMARY_LEVELS=1,5,10,20,30 # Summary levels ``` #### Cortex `.env` (`/cortex/.env`) ```bash # Cortex-specific overrides VERBOSE_DEBUG=true LOG_LEVEL=DEBUG # Stage-specific temperatures REFLECTION_TEMPERATURE=0.5 REASONING_TEMPERATURE=0.7 REFINE_TEMPERATURE=0.3 PERSONA_TEMPERATURE=0.6 ``` --- ### Configuration Hierarchy ``` 1. Docker compose environment variables (highest priority) 2. Service-specific .env files 3. Root .env file 4. Hard-coded defaults (lowest priority) ``` --- ## Dependencies & Tech Stack ### Python Dependencies **Cortex & NeoMem** (`requirements.txt`) ``` # Web framework fastapi==0.115.8 uvicorn==0.34.0 pydantic==2.10.4 # HTTP clients httpx==0.27.2 # Async HTTP (for LLM calls) requests==2.32.3 # Sync HTTP (fallback) # Database psycopg[binary,pool]>=3.2.8 # PostgreSQL + connection pooling # Utilities python-dotenv==1.0.1 # Environment variable loading ollama # Ollama client library ``` ### Node.js Dependencies **Relay** (`/core/relay/package.json`) ```json { "dependencies": { "cors": "^2.8.5", "dotenv": "^16.0.3", "express": "^4.18.2", "mem0ai": "^0.1.0", "node-fetch": "^3.3.0" } } ``` ### Docker Images ```yaml # Cortex & NeoMem python:3.11-slim # Relay node:latest # UI nginx:alpine # PostgreSQL with vector support ankane/pgvector:v0.5.1 # Graph database neo4j:5 ``` --- ### External Services #### LLM Backends (HTTP-based): 1. **MI50 GPU Server** (10.0.0.43:8000) - llama.cpp via vllm - High-performance inference - Used for reasoning and refinement 2. **RTX 3090 Server** (10.0.0.3:11434) - Ollama - Alternative local backend - Fallback for PRIMARY 3. **OpenAI Cloud** (api.openai.com) - gpt-4o-mini - Used for reflection and persona - Requires API key 4. 
**Fallback Server** (10.0.0.41:11435) - OpenAI Completions API - Emergency backup - llama-3.2-8b-instruct --- ## Key Concepts & Design Patterns ### 1. Dual-Memory Architecture Project Lyra uses a **dual-memory system** inspired by human cognition: **Short-Term Memory (Intake):** - Fast, in-memory storage - Limited capacity (200 messages) - Immediate context for current conversation - Circular buffer (FIFO eviction) - Multi-level summarization **Long-Term Memory (NeoMem):** - Persistent database storage - Unlimited capacity - Semantic search via vector embeddings - Entity-relationship tracking via graph DB - Cross-session continuity **Why This Matters:** - Short-term memory provides immediate context (last few messages) - Long-term memory provides semantic understanding (user preferences, past topics) - Combined, they enable Lyra to be both **contextually aware** and **historically informed** --- ### 2. Multi-Stage Reasoning Pipeline Unlike single-shot LLM calls, Lyra uses a **4-stage pipeline** for sophisticated responses: **Stage 1: Reflection** (Meta-cognition) - "What is the user really asking?" - Analyzes intent and conversation direction - Uses OpenAI for strong reasoning **Stage 2: Reasoning** (Draft generation) - "What's a good answer?" - Generates initial response - Uses local llama.cpp for speed/cost **Stage 3: Refinement** (Polish) - "How can this be clearer?" - Improves clarity and coherence - Lower temperature for consistency **Stage 4: Persona** (Voice) - "How would Lyra say this?" - Applies personality and speaking style - Uses OpenAI for natural language **Benefits:** - Higher quality responses (multiple passes) - Separation of concerns (reasoning vs. style) - Backend flexibility (cloud for hard tasks, local for simple ones) - Transparent thinking (can inspect each stage) --- ### 3. Backend Abstraction (LLM Router) The **LLM Router** allows Lyra to use multiple LLM backends transparently: ```python # Same interface, different backends await call_llm(prompt, backend="PRIMARY") # Local llama.cpp await call_llm(prompt, backend="CLOUD") # OpenAI await call_llm(prompt, backend="SECONDARY") # Ollama ``` **Benefits:** - **Cost optimization:** Use expensive cloud LLMs only when needed - **Performance:** Local LLMs for low-latency responses - **Resilience:** Fallback to alternative backends on failure - **Experimentation:** Easy to swap models/providers **Design Pattern:** **Strategy Pattern** for swappable backends --- ### 4. Microservices Architecture Project Lyra follows **microservices principles**: **Each service has a single responsibility:** - Relay: Routing and orchestration - Cortex: Reasoning and response generation - NeoMem: Long-term memory storage - UI: User interface **Communication:** - REST APIs (HTTP/JSON) - Async ingestion (fire-and-forget) - Docker network isolation **Benefits:** - Independent scaling (scale Cortex without scaling UI) - Technology diversity (Node.js + Python) - Fault isolation (Cortex crash doesn't affect NeoMem) - Easy testing (mock service dependencies) --- ### 5. Session-Based State Management Lyra maintains **session-based state** for conversation continuity: ```python # In-memory session storage (Intake) SESSIONS = { "session_abc123": { "buffer": deque([msg1, msg2, ...], maxlen=200), "created_at": "2025-12-12T10:30:00Z" } } # Persistent session storage (NeoMem) # Stores all messages + embeddings for semantic search ``` **Session Lifecycle:** 1. User starts conversation → UI generates `session_id` 2. 
First message → Cortex creates session in `SESSIONS` dict 3. Subsequent messages → Retrieved from same session 4. Async ingestion → Messages stored in NeoMem for long-term **Benefits:** - Conversation continuity within session - Historical search across sessions - User can switch sessions (multiple concurrent conversations) --- ### 6. Asynchronous Ingestion **Pattern:** Separate read path from write path ```javascript // Relay: Synchronous read path (fast response) const response = await fetch('http://cortex:7081/reason'); return response.json(); // Return immediately to user // Relay: Asynchronous write path (non-blocking) fetch('http://cortex:7081/ingest', { method: 'POST', ... }); // Don't await, just fire and forget ``` **Benefits:** - Fast user response times (don't wait for database writes) - Resilient to storage failures (user still gets response) - Easier scaling (decouple read and write loads) **Trade-off:** Eventual consistency (short delay before memory is searchable) --- ### 7. Deferred Summarization Intake uses **deferred summarization** instead of pre-computation: ```python # BAD: Pre-compute summaries on every message def add_message(session_id, message): SESSIONS[session_id].buffer.append(message) SESSIONS[session_id].L1_summary = summarize(last_1_message) SESSIONS[session_id].L5_summary = summarize(last_5_messages) # ... expensive, runs on every message # GOOD: Compute summaries only when needed def summarize_context(session_id): buffer = SESSIONS[session_id].buffer return { "L1": summarize(buffer[-1:]), # Only compute when requested "L5": summarize(buffer[-5:]), "L10": summarize(buffer[-10:]) } ``` **Benefits:** - Faster message ingestion (no blocking summarization) - Compute resources used only when needed - Flexible summary levels (easy to add L15, L50, etc.) **Trade-off:** Slight delay when first message in conversation (cold start) --- ## API Reference ### Relay Endpoints #### POST `/v1/chat/completions` **OpenAI-compatible chat endpoint** **Request:** ```json { "messages": [ {"role": "user", "content": "Hello, Lyra!"} ], "session_id": "session_abc123" } ``` **Response:** ```json { "choices": [ { "message": { "role": "assistant", "content": "Hi there! How can I help you today?" } } ] } ``` --- #### POST `/chat` **Lyra-native chat endpoint** **Request:** ```json { "session_id": "session_abc123", "message": "Hello, Lyra!" } ``` **Response:** ```json { "answer": "Hi there! How can I help you today?", "session_id": "session_abc123" } ``` --- #### GET `/sessions/:id` **Retrieve session history** **Response:** ```json { "session_id": "session_abc123", "messages": [ {"role": "user", "content": "Hello", "timestamp": "..."}, {"role": "assistant", "content": "Hi!", "timestamp": "..."} ], "created_at": "2025-12-12T10:30:00Z" } ``` --- ### Cortex Endpoints #### POST `/reason` **Main reasoning pipeline** **Request:** ```json { "session_id": "session_abc123", "user_message": "How do I deploy ML models?" } ``` **Response:** ```json { "answer": "Final answer with Lyra's personality", "metadata": { "reflection": "User seeking deployment guidance...", "draft": "Initial draft answer...", "refined": "Polished answer...", "stages_completed": 4 } } ``` --- #### POST `/ingest` **Ingest message exchange into Intake** **Request:** ```json { "session_id": "session_abc123", "user_message": "How do I deploy ML models?", "assistant_message": "Here's how..." 
} ``` **Response:** ```json { "status": "ingested", "session_id": "session_abc123", "message_count": 24 } ``` --- #### GET `/debug/sessions` **Inspect in-memory SESSIONS state** **Response:** ```json { "session_abc123": { "message_count": 24, "created_at": "2025-12-12T10:30:00Z", "last_message_at": "2025-12-12T11:15:00Z" }, "session_xyz789": { "message_count": 5, "created_at": "2025-12-12T11:00:00Z", "last_message_at": "2025-12-12T11:10:00Z" } } ``` --- ### NeoMem Endpoints #### POST `/memories` **Create new memory** **Request:** ```json { "messages": [ {"role": "user", "content": "I prefer Docker for deployments"}, {"role": "assistant", "content": "Noted! I'll keep that in mind."} ], "session_id": "session_abc123" } ``` **Response:** ```json { "status": "created", "memory_id": "mem_456def", "extracted_entities": ["Docker", "deployments"] } ``` --- #### GET `/search` **Semantic search for memories** **Query Parameters:** - `query` (required): Search query - `limit` (optional, default=5): Max results **Request:** ``` GET /search?query=deployment%20preferences&limit=5 ``` **Response:** ```json { "results": [ { "content": "User prefers Docker for deployments", "score": 0.92, "timestamp": "2025-12-10T14:30:00Z", "session_id": "session_abc123" }, { "content": "Previously deployed models on AWS ECS", "score": 0.87, "timestamp": "2025-12-09T09:15:00Z", "session_id": "session_abc123" } ] } ``` --- #### GET `/memories` **List all memories** **Query Parameters:** - `offset` (optional, default=0): Pagination offset - `limit` (optional, default=50): Max results **Response:** ```json { "memories": [ { "id": "mem_123abc", "content": "User prefers Docker...", "created_at": "2025-12-10T14:30:00Z" } ], "total": 147, "offset": 0, "limit": 50 } ``` --- ## Deployment & Operations ### Docker Compose Deployment **File:** `/docker-compose.yml` ```yaml version: '3.8' services: # === ACTIVE SERVICES === relay: build: ./core/relay ports: - "7078:7078" environment: - CORTEX_URL=http://cortex:7081 - NEOMEM_URL=http://neomem:7077 depends_on: - cortex networks: - lyra_net cortex: build: ./cortex ports: - "7081:7081" environment: - NEOMEM_URL=http://neomem:7077 - PRIMARY_URL=${PRIMARY_URL} - OPENAI_API_KEY=${OPENAI_API_KEY} command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1 depends_on: - neomem networks: - lyra_net neomem: build: ./neomem ports: - "7077:7077" environment: - POSTGRES_HOST=neomem-postgres - POSTGRES_USER=${POSTGRES_USER} - POSTGRES_PASSWORD=${POSTGRES_PASSWORD} - NEO4J_URI=${NEO4J_URI} depends_on: - neomem-postgres - neomem-neo4j networks: - lyra_net ui: image: nginx:alpine ports: - "8081:80" volumes: - ./core/ui:/usr/share/nginx/html:ro networks: - lyra_net # === DATABASES === neomem-postgres: image: ankane/pgvector:v0.5.1 environment: - POSTGRES_USER=${POSTGRES_USER} - POSTGRES_PASSWORD=${POSTGRES_PASSWORD} - POSTGRES_DB=${POSTGRES_DB} volumes: - ./volumes/postgres_data:/var/lib/postgresql/data ports: - "5432:5432" networks: - lyra_net neomem-neo4j: image: neo4j:5 environment: - NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD} volumes: - ./volumes/neo4j_data:/data ports: - "7474:7474" # Browser UI - "7687:7687" # Bolt networks: - lyra_net networks: lyra_net: driver: bridge ``` --- ### Starting the System ```bash # 1. Clone repository git clone https://github.com/yourusername/project-lyra.git cd project-lyra # 2. Configure environment cp .env.example .env # Edit .env with your LLM backend URLs and API keys # 3. Start all services docker-compose up -d # 4. 
Check health curl http://localhost:7078/_health curl http://localhost:7081/health curl http://localhost:7077/health # 5. Open UI open http://localhost:8081 ``` --- ### Monitoring & Logs ```bash # View all logs docker-compose logs -f # View specific service docker-compose logs -f cortex # Check resource usage docker stats # Inspect Cortex sessions curl http://localhost:7081/debug/sessions # Check NeoMem memories curl http://localhost:7077/memories?limit=10 ``` --- ### Scaling Considerations #### Current Constraints: 1. **Single Cortex worker** required (in-memory SESSIONS dict) - Solution: Migrate SESSIONS to Redis or PostgreSQL 2. **In-memory session storage** in Relay - Solution: Use Redis for session persistence 3. **No load balancing** (single instance of each service) - Solution: Add nginx reverse proxy + multiple Cortex instances #### Horizontal Scaling Plan: ```yaml # Future: Redis-backed session storage cortex: build: ./cortex command: uvicorn main:app --workers 4 # Multi-worker environment: - REDIS_URL=redis://redis:6379 depends_on: - redis redis: image: redis:alpine ports: - "6379:6379" ``` --- ### Backup Strategy ```bash # Backup PostgreSQL (NeoMem vectors) docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql # Backup Neo4j (NeoMem graph) docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump # Backup Intake sessions (manual export) curl http://localhost:7081/debug/sessions > backup_sessions.json ``` --- ## Known Issues & Constraints ### Critical Constraints #### 1. Single-Worker Requirement (Cortex) **Issue:** Cortex must run with `--workers 1` to maintain SESSIONS state **Impact:** Limited horizontal scalability **Workaround:** None currently **Fix:** Migrate SESSIONS to Redis or PostgreSQL **Priority:** High (blocking scalability) #### 2. In-Memory Session Storage (Relay) **Issue:** Sessions stored in Node.js process memory **Impact:** Lost on restart, no persistence **Workaround:** None currently **Fix:** Use Redis or database **Priority:** Medium (acceptable for demo) --- ### Non-Critical Issues #### 3. RAG Service Disabled **Status:** Built but commented out in docker-compose.yml **Impact:** No RAG-based long-term knowledge retrieval **Workaround:** NeoMem provides semantic search **Fix:** Re-enable and integrate RAG service **Priority:** Low (NeoMem sufficient for now) #### 4. Partial NeoMem Integration **Status:** Search implemented, async ingestion planned **Impact:** Memories not automatically saved **Workaround:** Manual POST to /memories **Fix:** Complete async ingestion in Relay **Priority:** Medium (planned feature) #### 5. 
Inner Monologue Observer-Only **Status:** Stage 0.6 runs but output not used **Impact:** No adaptive response based on monologue **Workaround:** None (future feature) **Fix:** Integrate monologue output into pipeline **Priority:** Low (experimental feature) --- ### Fixed Issues (v0.5.2) ✅ **LLM Router Blocking** - Migrated from `requests` to `httpx` for async ✅ **Session ID Case Mismatch** - Standardized to `session_id` ✅ **Missing Backend Parameter** - Added to intake summarization --- ### Deprecated Components **Location:** `/DEPRECATED_FILES.md` - **Standalone Intake Service** - Now embedded in Cortex - **Old Relay Backup** - Replaced by current Relay - **Persona Sidecar** - Built but unused (dynamic persona loading) --- ## Advanced Topics ### Custom Prompt Engineering Each stage uses carefully crafted prompts: **Reflection Prompt Example:** ```python REFLECTION_PROMPT = """ You are Lyra's reflective awareness layer. Your job is to analyze the user's message and conversation context to understand their true intent and needs. User message: {user_message} Recent context: {intake_L10_summary} Long-term context: {neomem_top_3_memories} Provide concise meta-awareness notes: - What is the user's underlying intent? - What topics/themes are emerging? - What depth of response is appropriate? - Are there any implicit questions or concerns? Keep notes brief (3-5 sentences). Focus on insight, not description. """ ``` --- ### Extending the Pipeline **Adding Stage 5 (Fact-Checking):** ```python # /cortex/reasoning/factcheck.py async def factcheck_answer(answer: str, context: dict) -> dict: """ Stage 5: Verify factual claims in answer. Returns: { "verified": bool, "flagged_claims": list, "corrected_answer": str } """ prompt = f""" Review this answer for factual accuracy: {answer} Flag any claims that seem dubious or need verification. Provide corrected version if needed. """ result = await call_llm(prompt, backend="CLOUD", temperature=0.1) return parse_factcheck_result(result) # Update router.py to include Stage 5 async def reason_endpoint(request): # ... existing stages ... # Stage 5: Fact-checking factcheck_result = await factcheck_answer(final_answer, context) if not factcheck_result["verified"]: final_answer = factcheck_result["corrected_answer"] return {"answer": final_answer} ``` --- ### Custom LLM Backend Integration **Adding Anthropic Claude:** ```python # /cortex/llm/llm_router.py BACKEND_CONFIGS = { # ... existing backends ... 
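# "CLAUDE" below is a new key alongside PRIMARY/SECONDARY/CLOUD/FALLBACK;
# it assumes ANTHROPIC_API_KEY is set in the environment (see os.getenv below).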
"CLAUDE": { "url": "https://api.anthropic.com/v1", "provider": "anthropic", "model": "claude-3-5-sonnet-20241022", "api_key": os.getenv("ANTHROPIC_API_KEY") } } # Add provider-specific logic elif backend_config["provider"] == "anthropic": headers = { "x-api-key": api_key, "anthropic-version": "2023-06-01" } payload = { "model": model, "messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens, "temperature": temperature } response = await httpx_client.post( f"{url}/messages", json=payload, headers=headers, timeout=120 ) return response.json()["content"][0]["text"] ``` --- ### Performance Optimization **Caching Strategies:** ```python # /cortex/utils/cache.py from functools import lru_cache import hashlib @lru_cache(maxsize=128) def cache_llm_call(prompt_hash: str, backend: str): """Cache LLM responses for identical prompts""" # Note: Only cache deterministic calls (temperature=0) pass # Usage in llm_router.py async def call_llm(prompt, backend, temperature=0.7, max_tokens=512): if temperature == 0: prompt_hash = hashlib.md5(prompt.encode()).hexdigest() cached = cache_llm_call(prompt_hash, backend) if cached: return cached # ... normal LLM call ... ``` **Database Query Optimization:** ```python # /neomem/neomem/database.py # BAD: Load all memories, then filter def search_memories(query): all_memories = db.execute("SELECT * FROM memories") # Expensive in-memory filtering return [m for m in all_memories if similarity(m, query) > 0.8] # GOOD: Use database indexes and LIMIT def search_memories(query, limit=5): query_embedding = embed(query) return db.execute(""" SELECT * FROM memories WHERE embedding <-> %s < 0.2 -- pgvector cosine distance ORDER BY embedding <-> %s LIMIT %s """, (query_embedding, query_embedding, limit)) ``` --- ## Conclusion Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through: 1. **Dual-memory architecture** (short-term Intake + long-term NeoMem) 2. **Multi-stage reasoning pipeline** (Reflection → Reasoning → Refinement → Persona) 3. **Flexible multi-backend LLM support** (cloud + local with fallback) 4. **Microservices design** for scalability and maintainability 5. **Modern web UI** with session management The system is production-ready with comprehensive error handling, logging, and health monitoring. --- ## Quick Reference ### Service Ports - **UI:** 8081 (Browser interface) - **Relay:** 7078 (Main orchestrator) - **Cortex:** 7081 (Reasoning engine) - **NeoMem:** 7077 (Long-term memory) - **PostgreSQL:** 5432 (Vector storage) - **Neo4j:** 7474 (Browser), 7687 (Bolt) ### Key Files - **Main Entry:** `/core/relay/server.js` - **Reasoning Pipeline:** `/cortex/router.py` - **LLM Router:** `/cortex/llm/llm_router.py` - **Short-term Memory:** `/cortex/intake/intake.py` - **Long-term Memory:** `/neomem/neomem/` - **Personality:** `/cortex/persona/identity.py` ### Important Commands ```bash # Start system docker-compose up -d # View logs docker-compose logs -f cortex # Debug sessions curl http://localhost:7081/debug/sessions # Health check curl http://localhost:7078/_health # Search memories curl "http://localhost:7077/search?query=deployment&limit=5" ``` --- **Document Version:** 1.0 **Last Updated:** 2025-12-13 **Maintained By:** Project Lyra Team