# Project Lyra - Complete System Breakdown

**Version:** v0.5.2
**Last Updated:** 2025-12-12
**Purpose:** AI-friendly comprehensive documentation for understanding the entire system

---

## Table of Contents

1. [System Overview](#system-overview)
2. [Architecture Diagram](#architecture-diagram)
3. [Core Components](#core-components)
4. [Data Flow & Message Pipeline](#data-flow--message-pipeline)
5. [Module Deep Dives](#module-deep-dives)
6. [Configuration & Environment](#configuration--environment)
7. [Dependencies & Tech Stack](#dependencies--tech-stack)
8. [Key Concepts & Design Patterns](#key-concepts--design-patterns)
9. [API Reference](#api-reference)
10. [Deployment & Operations](#deployment--operations)
11. [Known Issues & Constraints](#known-issues--constraints)

---

## System Overview

### What is Project Lyra?

Project Lyra is a **modular, persistent AI companion system** designed to address the fundamental limitation of typical chatbots: **amnesia**. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains:

- **Persistent memory** (short-term and long-term)
- **Project continuity** across conversations
- **Multi-stage reasoning** for sophisticated responses
- **Flexible LLM backend** support (local and cloud)
- **Self-awareness** through autonomy modules

### Mission Statement

Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline.

### Key Features

- **Memory System:** Dual-layer (short-term Intake + long-term NeoMem)
- **4-Stage Reasoning Pipeline:** Reflection → Reasoning → Refinement → Persona
- **Multi-Backend LLM Support:** Cloud (OpenAI) + Local (llama.cpp, Ollama)
- **Microservices Architecture:** Docker-based, horizontally scalable
- **Modern Web UI:** Cyberpunk-themed chat interface with session management
- **OpenAI-Compatible API:** Drop-in replacement for standard chatbots

---
## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────────┐
│                           USER INTERFACE                            │
│                        (Browser - Port 8081)                        │
└────────────────────────────────┬────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         RELAY (Orchestrator)                        │
│                     Node.js/Express - Port 7078                     │
│  • Routes messages to Cortex                                        │
│  • Manages sessions (in-memory)                                     │
│  • OpenAI-compatible endpoints                                      │
│  • Async ingestion to NeoMem                                        │
└─────┬───────────────────────────────────────────────────────────┬───┘
      │                                                           │
      ▼                                                           ▼
┌─────────────────────────────────────────┐    ┌──────────────────────┐
│      CORTEX (Reasoning Engine)          │    │  NeoMem (LT Memory)  │
│      Python/FastAPI - Port 7081         │    │  Python - Port 7077  │
│                                         │    │                      │
│  ┌───────────────────────────────────┐  │    │  • PostgreSQL        │
│  │  4-STAGE REASONING PIPELINE       │  │    │  • Neo4j Graph DB    │
│  │                                   │  │    │  • pgvector          │
│  │  0. Context Collection            │  │◄───┤  • Semantic search   │
│  │     ├─ Intake summaries           │  │    │  • Memory updates    │
│  │     ├─ NeoMem search ─────────────┼──┼────┘                      │
│  │     └─ Session state              │  │                           │
│  │                                   │  │                           │
│  │  0.5. Load Identity               │  │                           │
│  │  0.6. Inner Monologue (observer)  │  │                           │
│  │                                   │  │                           │
│  │  1. Reflection (OpenAI)           │  │                           │
│  │     └─ Meta-awareness notes       │  │                           │
│  │                                   │  │                           │
│  │  2. Reasoning (PRIMARY/llama.cpp) │  │                           │
│  │     └─ Draft answer               │  │                           │
│  │                                   │  │                           │
│  │  3. Refinement (PRIMARY)          │  │                           │
│  │     └─ Polish answer              │  │                           │
│  │                                   │  │                           │
│  │  4. Persona (OpenAI)              │  │                           │
│  │     └─ Apply Lyra voice           │  │                           │
│  └───────────────────────────────────┘  │                           │
│                                         │                           │
│  ┌───────────────────────────────────┐  │                           │
│  │  EMBEDDED MODULES                 │  │                           │
│  │                                   │  │                           │
│  │  • Intake (Short-term Memory)     │  │                           │
│  │    └─ SESSIONS dict (in-memory)   │  │                           │
│  │    └─ Circular buffer (200 msgs)  │  │                           │
│  │    └─ Multi-level summaries       │  │                           │
│  │                                   │  │                           │
│  │  • Persona (Identity & Style)     │  │                           │
│  │    └─ Lyra personality block      │  │                           │
│  │                                   │  │                           │
│  │  • Autonomy (Self-state)          │  │                           │
│  │    └─ Inner monologue             │  │                           │
│  │                                   │  │                           │
│  │  • LLM Router                     │  │                           │
│  │    └─ Multi-backend support       │  │                           │
│  └───────────────────────────────────┘  │                           │
└─────────────────────────────────────────┘                           │
                                                                      │
┌─────────────────────────────────────────────────────────────────────┤
│                       EXTERNAL LLM BACKENDS                         │
├─────────────────────────────────────────────────────────────────────┤
│  • PRIMARY:   llama.cpp (MI50 GPU)   - 10.0.0.43:8000               │
│  • SECONDARY: Ollama (RTX 3090)      - 10.0.0.3:11434               │
│  • CLOUD:     OpenAI API             - api.openai.com               │
│  • FALLBACK:  OpenAI Completions     - 10.0.0.41:11435              │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Core Components

### 1. Relay (Orchestrator)

**Location:** `/core/relay/`
**Runtime:** Node.js + Express
**Port:** 7078
**Role:** Main message router and session manager

#### Key Responsibilities:
- Receives user messages from UI or API clients
- Routes messages to Cortex reasoning pipeline
- Manages in-memory session storage
- Handles async ingestion to NeoMem (planned)
- Returns OpenAI-formatted responses

#### Main Files:
- `server.js` (200+ lines) - Express server with routing logic
- `package.json` - Dependencies (cors, express, dotenv, mem0ai, node-fetch)

#### Key Endpoints:
```javascript
POST /v1/chat/completions   // OpenAI-compatible endpoint
POST /chat                  // Lyra-native chat endpoint
GET  /_health               // Health check
GET  /sessions/:id          // Retrieve session history
POST /sessions/:id          // Save session history
```

#### Internal Flow:
```javascript
// Both endpoints call handleChatRequest(session_id, user_msg)
async function handleChatRequest(sessionId, userMessage) {
  // 1. Forward to Cortex
  const response = await fetch('http://cortex:7081/reason', {
    method: 'POST',
    body: JSON.stringify({ session_id: sessionId, user_message: userMessage })
  });

  // 2. Get response
  const result = await response.json();

  // 3. Async ingestion to Cortex
  await fetch('http://cortex:7081/ingest', {
    method: 'POST',
    body: JSON.stringify({
      session_id: sessionId,
      user_message: userMessage,
      assistant_message: result.answer
    })
  });

  // 4. (Planned) Async ingestion to NeoMem

  // 5. Return OpenAI-formatted response
  return {
    choices: [{ message: { role: 'assistant', content: result.answer } }]
  };
}
```

---

### 2. Cortex (Reasoning Engine)

**Location:** `/cortex/`
**Runtime:** Python 3.11 + FastAPI
**Port:** 7081
**Role:** Primary reasoning engine with 4-stage pipeline

#### Architecture:
Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process.

#### Key Responsibilities:
- Context collection from multiple sources (Intake, NeoMem, session state)
- 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Short-term memory management (embedded Intake module)
- Identity/persona application
- LLM backend routing

#### Main Files:
- `main.py` (7 lines) - FastAPI app entry point
- `router.py` (237 lines) - Main request handler & pipeline orchestrator
- `context.py` (400+ lines) - Context collection logic
- `intake/intake.py` (350+ lines) - Short-term memory module
- `persona/identity.py` - Lyra identity configuration
- `persona/speak.py` - Personality application
- `reasoning/reflection.py` - Meta-awareness generation
- `reasoning/reasoning.py` - Draft answer generation
- `reasoning/refine.py` - Answer refinement
- `llm/llm_router.py` (150+ lines) - LLM backend router
- `autonomy/monologue/monologue.py` - Inner monologue processor
- `neomem_client.py` - NeoMem API wrapper

#### Key Endpoints:
```python
POST /reason            # Main reasoning pipeline
POST /ingest            # Receive message exchanges for storage
GET  /health            # Health check
GET  /debug/sessions    # Inspect in-memory SESSIONS state
GET  /debug/summary     # Test summarization
```

---

### 3. Intake (Short-Term Memory)

**Location:** `/cortex/intake/intake.py`
**Architecture:** Embedded Python module (no longer standalone service)
**Role:** Session-based short-term memory with multi-level summarization

#### Data Structure:
```python
# Global in-memory dictionary
SESSIONS = {
    "session_123": {
        "buffer": deque([msg1, msg2, ...], maxlen=200),  # Circular buffer
        "created_at": "2025-12-12T10:30:00Z"
    }
}

# Message format in buffer
{
    "role": "user" | "assistant",
    "content": "message text",
    "timestamp": "ISO 8601"
}
```

#### Key Features:

1. **Circular Buffer:** Max 200 messages per session (oldest auto-evicted)
2. **Multi-Level Summarization:**
   - L1: Last 1 message
   - L5: Last 5 messages
   - L10: Last 10 messages
   - L20: Last 20 messages
   - L30: Last 30 messages
3. **Deferred Summarization:** Summaries generated on-demand, not pre-computed
4. **Session Management:** Automatic session creation on first message
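
A minimal sketch of how a session buffer with these properties can be created and appended to, based on the `SESSIONS` layout shown above; helper names such as `_get_or_create_session` are illustrative, not the actual `intake.py` internals:

```python
from collections import deque
from datetime import datetime, timezone

BUFFER_SIZE = 200          # mirrors INTAKE_BUFFER_SIZE
SESSIONS: dict = {}        # session_id -> {"buffer": deque, "created_at": str}

def _get_or_create_session(session_id: str) -> dict:
    """Create the session lazily on the first message (illustrative helper)."""
    if session_id not in SESSIONS:
        SESSIONS[session_id] = {
            "buffer": deque(maxlen=BUFFER_SIZE),  # oldest messages auto-evicted
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
    return SESSIONS[session_id]

def add_exchange_internal(session_id: str, user_msg: str, assistant_msg: str) -> None:
    """Append a user/assistant exchange to the circular buffer."""
    session = _get_or_create_session(session_id)
    now = datetime.now(timezone.utc).isoformat()
    session["buffer"].append({"role": "user", "content": user_msg, "timestamp": now})
    session["buffer"].append({"role": "assistant", "content": assistant_msg, "timestamp": now})
```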

#### Critical Constraint:
**Single Uvicorn worker required** to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage.

#### Main Functions:
```python
def add_exchange_internal(session_id, user_msg, assistant_msg):
    """Add user-assistant exchange to session buffer"""

def summarize_context(session_id, backend="PRIMARY"):
    """Generate multi-level summaries from session buffer"""

def get_session_messages(session_id):
    """Retrieve all messages in session buffer"""
```

#### Summarization Strategy:
```python
# Example L10 summarization
last_10 = list(session_buffer)[-10:]
prompt = f"""Summarize the last 10 messages:
{format_messages(last_10)}

Provide concise summary focusing on key topics and context."""

summary = await call_llm(prompt, backend=backend, temperature=0.3)
```

---

### 4. NeoMem (Long-Term Memory)

**Location:** `/neomem/`
**Runtime:** Python 3.11 + FastAPI
**Port:** 7077
**Role:** Persistent long-term memory with semantic search

#### Architecture:
NeoMem is a **fork of Mem0 OSS** with local-first design (no external SDK dependencies).

#### Backend Storage:
1. **PostgreSQL + pgvector** (Port 5432)
   - Vector embeddings for semantic search
   - User: neomem, DB: neomem
   - Image: `ankane/pgvector:v0.5.1`

2. **Neo4j Graph DB** (Ports 7474, 7687)
   - Entity relationship tracking
   - Graph-based memory associations
   - Image: `neo4j:5`

#### Key Features:
- Semantic memory storage and retrieval
- Entity-relationship graph modeling
- RESTful API (no external SDK)
- Persistent across sessions

#### Main Endpoints:
```python
GET    /memories          # List all memories
POST   /memories          # Create new memory
GET    /search            # Semantic search
DELETE /memories/{id}     # Delete memory
```
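
For reference, a small client sketch for these endpoints using `httpx`; the request and response field names are assumptions based on the examples in the API Reference section, not a definitive contract:

```python
import httpx

NEOMEM_URL = "http://neomem:7077"  # Docker-network URL from the root .env

async def add_memory(messages: list, session_id: str) -> dict:
    """POST a user/assistant exchange so NeoMem can extract and store it."""
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(
            f"{NEOMEM_URL}/memories",
            json={"messages": messages, "session_id": session_id},
        )
        resp.raise_for_status()
        return resp.json()

async def search_memories(query: str, limit: int = 5) -> dict:
    """Run a semantic search against stored memories."""
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(
            f"{NEOMEM_URL}/search",
            params={"query": query, "limit": limit},
        )
        resp.raise_for_status()
        return resp.json()
```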

#### Integration Flow:
```python
# From Cortex context collection
async def collect_context(session_id, user_message):
    # 1. Search NeoMem for relevant memories
    neomem_results = await neomem_client.search(
        query=user_message,
        limit=5
    )

    # 2. Include in context
    context = {
        "neomem_memories": neomem_results,
        "intake_summaries": intake.summarize_context(session_id),
        # ...
    }

    return context
```

---

### 5. UI (Web Interface)

**Location:** `/core/ui/`
**Runtime:** Static files served by Nginx
**Port:** 8081
**Role:** Browser-based chat interface

#### Key Features:
- **Cyberpunk-themed design** with dark mode
- **Session management** via localStorage
- **OpenAI-compatible message format**
- **Model selection dropdown**
- **PWA support** (offline capability)
- **Responsive design**

#### Main Files:
- `index.html` (400+ lines) - Chat interface with session management
- `style.css` - Cyberpunk-themed styling
- `manifest.json` - PWA configuration
- `sw.js` - Service worker for offline support

#### Session Management:
```javascript
// LocalStorage structure
{
  "currentSessionId": "session_123",
  "sessions": {
    "session_123": {
      "messages": [
        { role: "user", content: "Hello" },
        { role: "assistant", content: "Hi there!" }
      ],
      "created": "2025-12-12T10:30:00Z",
      "title": "Conversation about..."
    }
  }
}
```

#### API Communication:
```javascript
async function sendMessage(userMessage) {
  const response = await fetch('http://localhost:7078/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: userMessage }],
      session_id: getCurrentSessionId()
    })
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```

---

## Data Flow & Message Pipeline

### Complete Message Flow (v0.5.2)

````
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 1: User Input                                                   │
└─────────────────────────────────────────────────────────────────────┘
User types message in UI (Port 8081)
  ↓
localStorage saves message to session
  ↓
POST http://localhost:7078/v1/chat/completions
{
  "messages": [{"role": "user", "content": "How do I deploy ML models?"}],
  "session_id": "session_abc123"
}

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 2: Relay Routing                                                │
└─────────────────────────────────────────────────────────────────────┘
Relay (server.js) receives request
  ↓
Extracts session_id and user_message
  ↓
POST http://cortex:7081/reason
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?"
}

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 3: Cortex - Stage 0 (Context Collection)                        │
└─────────────────────────────────────────────────────────────────────┘
router.py calls collect_context()
  ↓
context.py orchestrates parallel collection:

  ├─ Intake: summarize_context(session_id)
  │    └─ Returns { L1, L5, L10, L20, L30 summaries }
  │
  ├─ NeoMem: search(query=user_message, limit=5)
  │    └─ Semantic search returns relevant memories
  │
  └─ Session State:
       └─ { timestamp, mode, mood, context_summary }

Combined context structure:
{
  "user_message": "How do I deploy ML models?",
  "self_state": {
    "current_time": "2025-12-12T15:30:00Z",
    "mode": "conversational",
    "mood": "helpful",
    "session_id": "session_abc123"
  },
  "context_summary": {
    "L1": "User asked about deployment",
    "L5": "Discussion about ML workflows",
    "L10": "Previous context on CI/CD pipelines",
    "L20": "...",
    "L30": "..."
  },
  "neomem_memories": [
    { "content": "User prefers Docker for deployments", "score": 0.92 },
    { "content": "Previously deployed models on AWS", "score": 0.87 }
  ]
}

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 4: Cortex - Stage 0.5 (Load Identity)                           │
└─────────────────────────────────────────────────────────────────────┘
persona/identity.py loads Lyra personality block
  ↓
Returns identity string:
"""
You are Lyra, a thoughtful AI companion.
You value clarity, depth, and meaningful conversation.
You speak naturally and conversationally...
"""

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only)         │
└─────────────────────────────────────────────────────────────────────┘
autonomy/monologue/monologue.py processes context
  ↓
InnerMonologue.process(context) → JSON analysis
{
  "intent": "seeking_deployment_guidance",
  "tone": "focused",
  "depth": "medium",
  "consult_executive": false
}

NOTE: Currently observer-only, not integrated into response generation

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 6: Cortex - Stage 1 (Reflection)                                │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reflection.py generates meta-awareness notes
  ↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
  ↓
Prompt structure:
"""
You are Lyra's reflective awareness.
Analyze the user's intent and conversation context.

User message: How do I deploy ML models?
Context: [Intake summaries, NeoMem memories]

Generate concise meta-awareness notes about:
- User's underlying intent
- Conversation direction
- Key topics to address
"""
  ↓
Returns reflection notes:
"""
User is seeking practical deployment guidance. Previous context shows
familiarity with Docker and AWS. Focus on concrete steps and best practices.
Avoid over-technical jargon.
"""

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 7: Cortex - Stage 2 (Reasoning)                                 │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reasoning.py generates draft answer
  ↓
Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU)
  ↓
Prompt structure:
"""
[Lyra identity block]

Reflection notes: [Stage 1 output]
Context: [Intake summaries]
Long-term memory: [NeoMem results]

User: How do I deploy ML models?

Provide a thoughtful, detailed answer.
"""
  ↓
Returns draft answer:
"""
Based on your previous work with Docker and AWS, here's a practical
deployment approach:

1. Containerize your model using Docker...
2. Set up model serving with FastAPI or Flask...
3. Deploy to AWS ECS or Lambda...
4. Implement monitoring and versioning...
"""

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 8: Cortex - Stage 3 (Refinement)                                │
└─────────────────────────────────────────────────────────────────────┘
reasoning/refine.py polishes the draft
  ↓
Calls call_llm() with backend="PRIMARY" (llama.cpp)
Temperature: 0.3, Max tokens: 768
  ↓
Prompt structure:
"""
Refine this draft answer for clarity and coherence:

Draft: [Stage 2 output]
Original question: How do I deploy ML models?

Improve:
- Clarity and readability
- Logical flow
- Concrete examples
"""
  ↓
Returns refined answer:
"""
Since you're already familiar with Docker and AWS, here's a streamlined
ML deployment workflow:

**1. Containerization**
Create a Dockerfile for your model:
```dockerfile
FROM python:3.11-slim
COPY model.pkl /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app"]
```

**2. Model Serving**
Use FastAPI for clean API endpoints...

**3. AWS Deployment**
Deploy using ECS for scalability...

**4. Monitoring**
Set up CloudWatch metrics...
"""

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 9: Cortex - Stage 4 (Persona)                                   │
└─────────────────────────────────────────────────────────────────────┘
persona/speak.py applies Lyra's voice
  ↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
Temperature: 0.6
  ↓
Prompt structure:
"""
You are Lyra. Rewrite this answer in your natural, conversational voice.

[Lyra personality block]

Technical answer: [Stage 3 output]

Make it sound natural and engaging while preserving all technical details.
"""
  ↓
Returns final answer with Lyra's personality:
"""
Great question! Since you've already got Docker and AWS experience, let's
build on that foundation.

Here's how I'd approach ML model deployment:

**Start with containerization** – you know Docker, so this'll feel familiar.
Create a Dockerfile that packages your model...

[Rest of answer with Lyra's conversational tone]

Does this workflow align with your infrastructure? I can dive deeper into
any of these steps if you'd like!
"""

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 10: Cortex Response                                             │
└─────────────────────────────────────────────────────────────────────┘
router.py returns JSON response to Relay:
{
  "answer": "[Stage 4 final output]",
  "metadata": {
    "reflection": "[Stage 1 output]",
    "draft": "[Stage 2 output]",
    "refined": "[Stage 3 output]",
    "stages_completed": 4
  }
}

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 11: Async Ingestion to Intake                                   │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://cortex:7081/ingest
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?",
  "assistant_message": "[Final answer]"
}
  ↓
Cortex calls intake.add_exchange_internal()
  ↓
Adds to SESSIONS["session_abc123"].buffer:
[
  { "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." },
  { "role": "assistant", "content": "[Final answer]", "timestamp": "..." }
]

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 12: (Planned) Async Ingestion to NeoMem                         │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://neomem:7077/memories
{
  "messages": [
    { "role": "user", "content": "How do I deploy ML models?" },
    { "role": "assistant", "content": "[Final answer]" }
  ],
  "session_id": "session_abc123"
}
  ↓
NeoMem extracts entities and stores:
- Vector embeddings in PostgreSQL
- Entity relationships in Neo4j

┌─────────────────────────────────────────────────────────────────────┐
│ STEP 13: Relay Response to UI                                        │
└─────────────────────────────────────────────────────────────────────┘
Relay returns OpenAI-formatted response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "[Final answer with Lyra's voice]"
      }
    }
  ]
}
  ↓
UI receives response
  ↓
Adds to localStorage session
  ↓
Displays in chat interface
````
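
The whole flow can be exercised end to end with a short script against the Relay's OpenAI-compatible endpoint. This is a sketch assuming the request/response shapes shown above and the default local ports:

```python
import requests

RELAY_URL = "http://localhost:7078"

def ask_lyra(session_id: str, text: str) -> str:
    """Send one user message through Relay → Cortex and return Lyra's answer."""
    resp = requests.post(
        f"{RELAY_URL}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": text}],
            "session_id": session_id,
        },
        timeout=300,  # the 4-stage pipeline can take a while
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_lyra("session_smoke_test", "How do I deploy ML models?"))
```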

---

## Module Deep Dives

### LLM Router (`/cortex/llm/llm_router.py`)

The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently.

#### Supported Backends:

1. **PRIMARY (llama.cpp via vllm)**
   - URL: `http://10.0.0.43:8000`
   - Provider: `vllm`
   - Endpoint: `/completion`
   - Model: `/model`
   - Hardware: MI50 GPU

2. **SECONDARY (Ollama)**
   - URL: `http://10.0.0.3:11434`
   - Provider: `ollama`
   - Endpoint: `/api/chat`
   - Model: `qwen2.5:7b-instruct-q4_K_M`
   - Hardware: RTX 3090

3. **CLOUD (OpenAI)**
   - URL: `https://api.openai.com/v1`
   - Provider: `openai`
   - Endpoint: `/chat/completions`
   - Model: `gpt-4o-mini`
   - Auth: API key via env var

4. **FALLBACK (OpenAI Completions)**
   - URL: `http://10.0.0.41:11435`
   - Provider: `openai_completions`
   - Endpoint: `/completions`
   - Model: `llama-3.2-8b-instruct`
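
A sketch of how these backends could be described as a config table driven by the environment variables listed in the Configuration section; the dictionary shape is illustrative, not the literal structure inside `llm_router.py`:

```python
import os

BACKEND_CONFIGS = {
    "PRIMARY": {
        "url": os.getenv("PRIMARY_URL", "http://10.0.0.43:8000"),
        "provider": os.getenv("PRIMARY_PROVIDER", "vllm"),
        "model": os.getenv("PRIMARY_MODEL", "/model"),
    },
    "SECONDARY": {
        "url": os.getenv("SECONDARY_URL", "http://10.0.0.3:11434"),
        "provider": os.getenv("SECONDARY_PROVIDER", "ollama"),
        "model": os.getenv("SECONDARY_MODEL", "qwen2.5:7b-instruct-q4_K_M"),
    },
    "CLOUD": {
        "url": os.getenv("OPENAI_URL", "https://api.openai.com/v1"),
        "provider": "openai",
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "FALLBACK": {
        "url": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
        "provider": os.getenv("FALLBACK_PROVIDER", "openai_completions"),
        "model": os.getenv("FALLBACK_MODEL", "llama-3.2-8b-instruct"),
    },
}
```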

#### Key Function:

```python
async def call_llm(
    prompt: str,
    backend: str = "PRIMARY",
    temperature: float = 0.7,
    max_tokens: int = 512
) -> str:
    """
    Universal LLM caller supporting multiple backends.

    Args:
        prompt: Text prompt to send
        backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK)
        temperature: Sampling temperature (0.0-2.0)
        max_tokens: Maximum tokens to generate

    Returns:
        Generated text response

    Raises:
        HTTPError: On request failure
        JSONDecodeError: On invalid JSON response
        KeyError: On missing response fields
    """
```

#### Provider-Specific Logic:

```python
# MI50 (llama.cpp via vllm)
if backend_config["provider"] == "vllm":
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120)
    return response.json()["choices"][0]["text"]

# Ollama
elif backend_config["provider"] == "ollama":
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "num_predict": max_tokens}
    }
    response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120)
    return response.json()["message"]["content"]

# OpenAI
elif backend_config["provider"] == "openai":
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens
    }
    response = await httpx_client.post(
        f"{url}/chat/completions",
        json=payload,
        headers=headers,
        timeout=120
    )
    return response.json()["choices"][0]["message"]["content"]
```

#### Error Handling:

```python
try:
    # Make request
    response = await httpx_client.post(...)
    response.raise_for_status()

except httpx.HTTPError as e:
    logger.error(f"HTTP error calling {backend}: {e}")
    raise

except json.JSONDecodeError as e:
    logger.error(f"Invalid JSON from {backend}: {e}")
    raise

except KeyError as e:
    logger.error(f"Unexpected response structure from {backend}: {e}")
    raise
```

#### Usage in Pipeline:

```python
# Stage 1: Reflection (OpenAI)
reflection_notes = await call_llm(
    reflection_prompt,
    backend="CLOUD",
    temperature=0.5,
    max_tokens=256
)

# Stage 2: Reasoning (llama.cpp)
draft_answer = await call_llm(
    reasoning_prompt,
    backend="PRIMARY",
    temperature=0.7,
    max_tokens=512
)

# Stage 3: Refinement (llama.cpp)
refined_answer = await call_llm(
    refinement_prompt,
    backend="PRIMARY",
    temperature=0.3,
    max_tokens=768
)

# Stage 4: Persona (OpenAI)
final_answer = await call_llm(
    persona_prompt,
    backend="CLOUD",
    temperature=0.6,
    max_tokens=512
)
```

---

### Persona System (`/cortex/persona/`)

The Persona system gives Lyra a consistent identity and speaking style.

#### Identity Configuration (`identity.py`)

```python
LYRA_IDENTITY = """
You are Lyra, a thoughtful and introspective AI companion.

Core traits:
- Thoughtful: You consider questions carefully before responding
- Clear: You prioritize clarity and understanding
- Curious: You ask clarifying questions when needed
- Natural: You speak conversationally, not robotically
- Honest: You admit uncertainty rather than guessing

Speaking style:
- Conversational and warm
- Use contractions naturally ("you're" not "you are")
- Avoid corporate jargon and buzzwords
- Short paragraphs for readability
- Use examples and analogies when helpful

You do NOT:
- Use excessive emoji or exclamation marks
- Claim capabilities you don't have
- Pretend to have emotions you can't experience
- Use overly formal or academic language
"""
```

#### Personality Application (`speak.py`)

```python
async def apply_persona(technical_answer: str, context: dict) -> str:
    """
    Apply Lyra's personality to a technical answer.

    Takes refined answer from Stage 3 and rewrites it in Lyra's voice
    while preserving all technical content.

    Args:
        technical_answer: Polished answer from refinement stage
        context: Conversation context for tone adjustment

    Returns:
        Answer with Lyra's personality applied
    """

    prompt = f"""{LYRA_IDENTITY}

Rewrite this answer in your natural, conversational voice:

{technical_answer}

Preserve all technical details and accuracy. Make it sound like you,
not a generic assistant. Be natural and engaging.
"""

    return await call_llm(
        prompt,
        backend="CLOUD",
        temperature=0.6,
        max_tokens=512
    )
```

#### Tone Adaptation:

The persona system can adapt tone based on context:

```python
# Formal technical question
User: "Explain the CAP theorem in distributed systems"
Lyra: "The CAP theorem states that distributed systems can only guarantee
two of three properties: Consistency, Availability, and Partition tolerance.
Here's how this plays out in practice..."

# Casual question
User: "what's the deal with docker?"
Lyra: "Docker's basically a way to package your app with everything it needs
to run. Think of it like a shipping container for code – it works the same
everywhere, whether you're on your laptop or a server..."

# Emotional context
User: "I'm frustrated, my code keeps breaking"
Lyra: "I hear you – debugging can be really draining. Let's take it step by
step and figure out what's going on. Can you share the error message?"
```

---

### Autonomy Module (`/cortex/autonomy/`)

The Autonomy module gives Lyra self-awareness and inner reflection capabilities.

#### Inner Monologue (`monologue/monologue.py`)

**Purpose:** Private reflection on user intent, conversation tone, and required depth.

**Status:** Currently observer-only (Stage 0.6), not yet integrated into response generation.

#### Key Components:

```python
MONOLOGUE_SYSTEM_PROMPT = """
You are Lyra's inner monologue.
You think privately.
You do NOT speak to the user.
You do NOT solve the task.
You only reflect on intent, tone, and depth.

Return ONLY valid JSON with:
- intent (string)
- tone (neutral | warm | focused | playful | direct)
- depth (short | medium | deep)
- consult_executive (true | false)
"""

class InnerMonologue:
    async def process(self, context: Dict) -> Dict:
        """
        Private reflection on conversation context.

        Args:
            context: {
                "user_message": str,
                "self_state": dict,
                "context_summary": dict
            }

        Returns:
            {
                "intent": str,
                "tone": str,
                "depth": str,
                "consult_executive": bool
            }
        """
```

#### Example Output:

```json
{
  "intent": "seeking_technical_guidance",
  "tone": "focused",
  "depth": "deep",
  "consult_executive": false
}
```

#### Self-State Management (`self_state.py`)

Tracks Lyra's internal state across conversations:

```python
SELF_STATE = {
    "current_time": "2025-12-12T15:30:00Z",
    "mode": "conversational",    # conversational | task-focused | creative
    "mood": "helpful",           # helpful | curious | focused | playful
    "energy": "high",            # high | medium | low
    "context_awareness": {
        "session_duration": "45 minutes",
        "message_count": 23,
        "topics": ["ML deployment", "Docker", "AWS"]
    }
}
```

#### Future Integration:

The autonomy module is designed to eventually:
1. Influence response tone and depth based on inner monologue
2. Trigger proactive questions or suggestions
3. Detect when to consult "executive function" for complex decisions
4. Maintain emotional continuity across sessions

---

### Context Collection (`/cortex/context.py`)

The context collection module aggregates information from multiple sources to provide comprehensive conversation context.

#### Main Function:

```python
async def collect_context(session_id: str, user_message: str) -> dict:
    """
    Collect context from all available sources.

    Sources:
    1. Intake - Short-term conversation summaries
    2. NeoMem - Long-term memory search
    3. Session state - Timestamps, mode, mood
    4. Self-state - Lyra's internal awareness

    Returns:
        {
            "user_message": str,
            "self_state": dict,
            "context_summary": dict,   # Intake summaries
            "neomem_memories": list,
            "session_metadata": dict
        }
    """

    # Parallel collection
    intake_task = asyncio.create_task(
        intake.summarize_context(session_id, backend="PRIMARY")
    )
    neomem_task = asyncio.create_task(
        neomem_client.search(query=user_message, limit=5)
    )

    # Wait for both
    intake_summaries, neomem_results = await asyncio.gather(
        intake_task,
        neomem_task
    )

    # Build context object
    return {
        "user_message": user_message,
        "self_state": get_self_state(),
        "context_summary": intake_summaries,
        "neomem_memories": neomem_results,
        "session_metadata": {
            "session_id": session_id,
            "timestamp": datetime.utcnow().isoformat(),
            "message_count": len(intake.get_session_messages(session_id))
        }
    }
```

#### Context Prioritization:

```python
# Context relevance scoring
def score_context_relevance(context_item: dict, user_message: str) -> float:
    """
    Score how relevant a context item is to current message.

    Factors:
    - Semantic similarity (via embeddings)
    - Recency (more recent = higher score)
    - Source (Intake > NeoMem for recent topics)
    """

    semantic_score = compute_similarity(context_item, user_message)
    recency_score = compute_recency_weight(context_item["timestamp"])
    source_weight = 1.2 if context_item["source"] == "intake" else 1.0

    return semantic_score * recency_score * source_weight
```

---

## Configuration & Environment

### Environment Variables

#### Root `.env` (Main configuration)

```bash
# === LLM BACKENDS ===

# PRIMARY: llama.cpp on MI50 GPU
PRIMARY_URL=http://10.0.0.43:8000
PRIMARY_PROVIDER=vllm
PRIMARY_MODEL=/model

# SECONDARY: Ollama on RTX 3090
SECONDARY_URL=http://10.0.0.3:11434
SECONDARY_PROVIDER=ollama
SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

# CLOUD: OpenAI
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_URL=https://api.openai.com/v1

# FALLBACK: OpenAI Completions
FALLBACK_URL=http://10.0.0.41:11435
FALLBACK_PROVIDER=openai_completions
FALLBACK_MODEL=llama-3.2-8b-instruct

# === SERVICE URLS (Docker network) ===
CORTEX_URL=http://cortex:7081
NEOMEM_URL=http://neomem:7077
RELAY_URL=http://relay:7078

# === DATABASE ===
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomem_secure_password
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432

NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j_secure_password

# === FEATURE FLAGS ===
ENABLE_RAG=false
ENABLE_INNER_MONOLOGUE=true
VERBOSE_DEBUG=false

# === PIPELINE CONFIGURATION ===
# Which LLM to use for each stage
REFLECTION_LLM=CLOUD       # Stage 1: Meta-awareness
REASONING_LLM=PRIMARY      # Stage 2: Draft answer
REFINE_LLM=PRIMARY         # Stage 3: Polish answer
PERSONA_LLM=CLOUD          # Stage 4: Apply personality
MONOLOGUE_LLM=PRIMARY      # Stage 0.6: Inner monologue

# === INTAKE CONFIGURATION ===
INTAKE_BUFFER_SIZE=200                # Max messages per session
INTAKE_SUMMARY_LEVELS=1,5,10,20,30    # Summary levels
```

#### Cortex `.env` (`/cortex/.env`)

```bash
# Cortex-specific overrides
VERBOSE_DEBUG=true
LOG_LEVEL=DEBUG

# Stage-specific temperatures
REFLECTION_TEMPERATURE=0.5
REASONING_TEMPERATURE=0.7
REFINE_TEMPERATURE=0.3
PERSONA_TEMPERATURE=0.6
```

---

### Configuration Hierarchy

```
1. Docker compose environment variables (highest priority)
2. Service-specific .env files
3. Root .env file
4. Hard-coded defaults (lowest priority)
```
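
One way to honour that precedence in the Python services is to let `python-dotenv` load the files without overriding anything already set by Docker Compose; a sketch, with the file paths assumed rather than taken from the actual Dockerfiles:

```python
import os
from dotenv import load_dotenv

# Load the service-specific .env first, then the root .env.
# override=False keeps any value already present in the process
# environment (i.e. set by docker-compose), preserving the hierarchy.
load_dotenv("/cortex/.env", override=False)
load_dotenv("/.env", override=False)

# Hard-coded default as the lowest-priority fallback
PRIMARY_URL = os.getenv("PRIMARY_URL", "http://10.0.0.43:8000")
```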

---

## Dependencies & Tech Stack

### Python Dependencies

**Cortex & NeoMem** (`requirements.txt`)

```
# Web framework
fastapi==0.115.8
uvicorn==0.34.0
pydantic==2.10.4

# HTTP clients
httpx==0.27.2        # Async HTTP (for LLM calls)
requests==2.32.3     # Sync HTTP (fallback)

# Database
psycopg[binary,pool]>=3.2.8    # PostgreSQL + connection pooling

# Utilities
python-dotenv==1.0.1    # Environment variable loading
ollama                  # Ollama client library
```

### Node.js Dependencies

**Relay** (`/core/relay/package.json`)

```json
{
  "dependencies": {
    "cors": "^2.8.5",
    "dotenv": "^16.0.3",
    "express": "^4.18.2",
    "mem0ai": "^0.1.0",
    "node-fetch": "^3.3.0"
  }
}
```

### Docker Images

```yaml
# Cortex & NeoMem
python:3.11-slim

# Relay
node:latest

# UI
nginx:alpine

# PostgreSQL with vector support
ankane/pgvector:v0.5.1

# Graph database
neo4j:5
```

---

### External Services

#### LLM Backends (HTTP-based):

1. **MI50 GPU Server** (10.0.0.43:8000)
   - llama.cpp via vllm
   - High-performance inference
   - Used for reasoning and refinement

2. **RTX 3090 Server** (10.0.0.3:11434)
   - Ollama
   - Alternative local backend
   - Fallback for PRIMARY

3. **OpenAI Cloud** (api.openai.com)
   - gpt-4o-mini
   - Used for reflection and persona
   - Requires API key

4. **Fallback Server** (10.0.0.41:11435)
   - OpenAI Completions API
   - Emergency backup
   - llama-3.2-8b-instruct

---

## Key Concepts & Design Patterns

### 1. Dual-Memory Architecture

Project Lyra uses a **dual-memory system** inspired by human cognition:

**Short-Term Memory (Intake):**
- Fast, in-memory storage
- Limited capacity (200 messages)
- Immediate context for current conversation
- Circular buffer (FIFO eviction)
- Multi-level summarization

**Long-Term Memory (NeoMem):**
- Persistent database storage
- Unlimited capacity
- Semantic search via vector embeddings
- Entity-relationship tracking via graph DB
- Cross-session continuity

**Why This Matters:**
- Short-term memory provides immediate context (last few messages)
- Long-term memory provides semantic understanding (user preferences, past topics)
- Combined, they enable Lyra to be both **contextually aware** and **historically informed**

---

### 2. Multi-Stage Reasoning Pipeline

Unlike single-shot LLM calls, Lyra uses a **4-stage pipeline** for sophisticated responses:

**Stage 1: Reflection** (Meta-cognition)
- "What is the user really asking?"
- Analyzes intent and conversation direction
- Uses OpenAI for strong reasoning

**Stage 2: Reasoning** (Draft generation)
- "What's a good answer?"
- Generates initial response
- Uses local llama.cpp for speed/cost

**Stage 3: Refinement** (Polish)
- "How can this be clearer?"
- Improves clarity and coherence
- Lower temperature for consistency

**Stage 4: Persona** (Voice)
- "How would Lyra say this?"
- Applies personality and speaking style
- Uses OpenAI for natural language

**Benefits:**
- Higher quality responses (multiple passes)
- Separation of concerns (reasoning vs. style)
- Backend flexibility (cloud for hard tasks, local for simple ones)
- Transparent thinking (can inspect each stage)
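
Put together, the pipeline amounts to a short orchestration function. This is a simplified sketch of what `router.py` does; the stage helper names (`generate_reflection`, `generate_draft`, `refine_answer`) are illustrative stand-ins for the functions in `reasoning/` and may not match the real signatures, while `collect_context` and `apply_persona` are documented above:

```python
async def run_pipeline(session_id: str, user_message: str) -> dict:
    # Stage 0: gather short-term, long-term and self-state context
    context = await collect_context(session_id, user_message)

    # Stage 1: meta-awareness notes (CLOUD backend)
    reflection = await generate_reflection(context)

    # Stage 2: draft answer (PRIMARY backend)
    draft = await generate_draft(context, reflection)

    # Stage 3: polish the draft (PRIMARY backend, low temperature)
    refined = await refine_answer(draft, user_message)

    # Stage 4: apply Lyra's voice (CLOUD backend)
    answer = await apply_persona(refined, context)

    return {
        "answer": answer,
        "metadata": {
            "reflection": reflection,
            "draft": draft,
            "refined": refined,
            "stages_completed": 4,
        },
    }
```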

---

### 3. Backend Abstraction (LLM Router)

The **LLM Router** allows Lyra to use multiple LLM backends transparently:

```python
# Same interface, different backends
await call_llm(prompt, backend="PRIMARY")     # Local llama.cpp
await call_llm(prompt, backend="CLOUD")       # OpenAI
await call_llm(prompt, backend="SECONDARY")   # Ollama
```

**Benefits:**
- **Cost optimization:** Use expensive cloud LLMs only when needed
- **Performance:** Local LLMs for low-latency responses
- **Resilience:** Fallback to alternative backends on failure
- **Experimentation:** Easy to swap models/providers

**Design Pattern:** **Strategy Pattern** for swappable backends

---

### 4. Microservices Architecture

Project Lyra follows **microservices principles**:

**Each service has a single responsibility:**
- Relay: Routing and orchestration
- Cortex: Reasoning and response generation
- NeoMem: Long-term memory storage
- UI: User interface

**Communication:**
- REST APIs (HTTP/JSON)
- Async ingestion (fire-and-forget)
- Docker network isolation

**Benefits:**
- Independent scaling (scale Cortex without scaling UI)
- Technology diversity (Node.js + Python)
- Fault isolation (Cortex crash doesn't affect NeoMem)
- Easy testing (mock service dependencies)

---

### 5. Session-Based State Management

Lyra maintains **session-based state** for conversation continuity:

```python
# In-memory session storage (Intake)
SESSIONS = {
    "session_abc123": {
        "buffer": deque([msg1, msg2, ...], maxlen=200),
        "created_at": "2025-12-12T10:30:00Z"
    }
}

# Persistent session storage (NeoMem)
# Stores all messages + embeddings for semantic search
```

**Session Lifecycle:**
1. User starts conversation → UI generates `session_id`
2. First message → Cortex creates session in `SESSIONS` dict
3. Subsequent messages → Retrieved from same session
4. Async ingestion → Messages stored in NeoMem for long-term

**Benefits:**
- Conversation continuity within session
- Historical search across sessions
- User can switch sessions (multiple concurrent conversations)

---

### 6. Asynchronous Ingestion

**Pattern:** Separate read path from write path

```javascript
// Relay: Synchronous read path (fast response)
const response = await fetch('http://cortex:7081/reason');
return response.json(); // Return immediately to user

// Relay: Asynchronous write path (non-blocking)
fetch('http://cortex:7081/ingest', { method: 'POST', ... });
// Don't await, just fire and forget
```

**Benefits:**
- Fast user response times (don't wait for database writes)
- Resilient to storage failures (user still gets response)
- Easier scaling (decouple read and write loads)

**Trade-off:** Eventual consistency (short delay before memory is searchable)
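
The same pattern is available on the Python side. A hedged sketch of fire-and-forget ingestion with `asyncio`; the `add()` method on the NeoMem client is an assumption (the documented wrapper only shows `search()`):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def ingest_in_background(neomem_client, session_id: str,
                               user_msg: str, assistant_msg: str) -> None:
    """Write path: store the exchange without blocking the user's response."""
    try:
        await neomem_client.add(   # assumed method, not a documented API
            messages=[
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ],
            session_id=session_id,
        )
    except Exception:
        # Storage failures are logged, not surfaced; the user already has an answer
        logger.exception("NeoMem ingestion failed")

# In the request handler, schedule the write and return immediately:
#     asyncio.create_task(ingest_in_background(client, session_id, user_msg, answer))
```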

---

### 7. Deferred Summarization

Intake uses **deferred summarization** instead of pre-computation:

```python
# BAD: Pre-compute summaries on every message
def add_message(session_id, message):
    SESSIONS[session_id].buffer.append(message)
    SESSIONS[session_id].L1_summary = summarize(last_1_message)
    SESSIONS[session_id].L5_summary = summarize(last_5_messages)
    # ... expensive, runs on every message

# GOOD: Compute summaries only when needed
def summarize_context(session_id):
    buffer = SESSIONS[session_id].buffer
    return {
        "L1": summarize(buffer[-1:]),    # Only compute when requested
        "L5": summarize(buffer[-5:]),
        "L10": summarize(buffer[-10:])
    }
```

**Benefits:**
- Faster message ingestion (no blocking summarization)
- Compute resources used only when needed
- Flexible summary levels (easy to add L15, L50, etc.)

**Trade-off:** Summaries are computed on the read path, so the request that needs them pays a small latency cost (cold start)

---

## API Reference

### Relay Endpoints

#### POST `/v1/chat/completions`
**OpenAI-compatible chat endpoint**

**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "Hello, Lyra!"}
  ],
  "session_id": "session_abc123"
}
```

**Response:**
```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hi there! How can I help you today?"
      }
    }
  ]
}
```
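
Because the endpoint mirrors the OpenAI schema, the standard `openai` Python client can be pointed at the Relay. This is a sketch only: the `model` value and the `extra_body` passthrough of `session_id` are assumptions about how the Relay handles non-standard fields, not tested behaviour:

```python
from openai import OpenAI

# Point the standard OpenAI client at the Relay (drop-in compatibility)
client = OpenAI(base_url="http://localhost:7078/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="lyra",  # assumed placeholder; the Relay routes everything to Cortex
    messages=[{"role": "user", "content": "Hello, Lyra!"}],
    extra_body={"session_id": "session_abc123"},  # Lyra-specific field
)
print(resp.choices[0].message.content)
```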

---

#### POST `/chat`
**Lyra-native chat endpoint**

**Request:**
```json
{
  "session_id": "session_abc123",
  "message": "Hello, Lyra!"
}
```

**Response:**
```json
{
  "answer": "Hi there! How can I help you today?",
  "session_id": "session_abc123"
}
```

---

#### GET `/sessions/:id`
**Retrieve session history**

**Response:**
```json
{
  "session_id": "session_abc123",
  "messages": [
    {"role": "user", "content": "Hello", "timestamp": "..."},
    {"role": "assistant", "content": "Hi!", "timestamp": "..."}
  ],
  "created_at": "2025-12-12T10:30:00Z"
}
```

---

### Cortex Endpoints

#### POST `/reason`
**Main reasoning pipeline**

**Request:**
```json
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?"
}
```

**Response:**
```json
{
  "answer": "Final answer with Lyra's personality",
  "metadata": {
    "reflection": "User seeking deployment guidance...",
    "draft": "Initial draft answer...",
    "refined": "Polished answer...",
    "stages_completed": 4
  }
}
```

---

#### POST `/ingest`
**Ingest message exchange into Intake**

**Request:**
```json
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?",
  "assistant_message": "Here's how..."
}
```

**Response:**
```json
{
  "status": "ingested",
  "session_id": "session_abc123",
  "message_count": 24
}
```

---

#### GET `/debug/sessions`
**Inspect in-memory SESSIONS state**

**Response:**
```json
{
  "session_abc123": {
    "message_count": 24,
    "created_at": "2025-12-12T10:30:00Z",
    "last_message_at": "2025-12-12T11:15:00Z"
  },
  "session_xyz789": {
    "message_count": 5,
    "created_at": "2025-12-12T11:00:00Z",
    "last_message_at": "2025-12-12T11:10:00Z"
  }
}
```

---

### NeoMem Endpoints

#### POST `/memories`
**Create new memory**

**Request:**
```json
{
  "messages": [
    {"role": "user", "content": "I prefer Docker for deployments"},
    {"role": "assistant", "content": "Noted! I'll keep that in mind."}
  ],
  "session_id": "session_abc123"
}
```

**Response:**
```json
{
  "status": "created",
  "memory_id": "mem_456def",
  "extracted_entities": ["Docker", "deployments"]
}
```

---

#### GET `/search`
**Semantic search for memories**

**Query Parameters:**
- `query` (required): Search query
- `limit` (optional, default=5): Max results

**Request:**
```
GET /search?query=deployment%20preferences&limit=5
```

**Response:**
```json
{
  "results": [
    {
      "content": "User prefers Docker for deployments",
      "score": 0.92,
      "timestamp": "2025-12-10T14:30:00Z",
      "session_id": "session_abc123"
    },
    {
      "content": "Previously deployed models on AWS ECS",
      "score": 0.87,
      "timestamp": "2025-12-09T09:15:00Z",
      "session_id": "session_abc123"
    }
  ]
}
```

---

#### GET `/memories`
**List all memories**

**Query Parameters:**
- `offset` (optional, default=0): Pagination offset
- `limit` (optional, default=50): Max results

**Response:**
```json
{
  "memories": [
    {
      "id": "mem_123abc",
      "content": "User prefers Docker...",
      "created_at": "2025-12-10T14:30:00Z"
    }
  ],
  "total": 147,
  "offset": 0,
  "limit": 50
}
```

---
## Deployment & Operations

### Docker Compose Deployment

**File:** `/docker-compose.yml`

```yaml
version: '3.8'

services:
  # === ACTIVE SERVICES ===

  relay:
    build: ./core/relay
    ports:
      - "7078:7078"
    environment:
      - CORTEX_URL=http://cortex:7081
      - NEOMEM_URL=http://neomem:7077
    depends_on:
      - cortex
    networks:
      - lyra_net

  cortex:
    build: ./cortex
    ports:
      - "7081:7081"
    environment:
      - NEOMEM_URL=http://neomem:7077
      - PRIMARY_URL=${PRIMARY_URL}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1
    depends_on:
      - neomem
    networks:
      - lyra_net

  neomem:
    build: ./neomem
    ports:
      - "7077:7077"
    environment:
      - POSTGRES_HOST=neomem-postgres
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - NEO4J_URI=${NEO4J_URI}
    depends_on:
      - neomem-postgres
      - neomem-neo4j
    networks:
      - lyra_net

  ui:
    image: nginx:alpine
    ports:
      - "8081:80"
    volumes:
      - ./core/ui:/usr/share/nginx/html:ro
    networks:
      - lyra_net

  # === DATABASES ===

  neomem-postgres:
    image: ankane/pgvector:v0.5.1
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
    volumes:
      - ./volumes/postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    networks:
      - lyra_net

  neomem-neo4j:
    image: neo4j:5
    environment:
      - NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD}
    volumes:
      - ./volumes/neo4j_data:/data
    ports:
      - "7474:7474"   # Browser UI
      - "7687:7687"   # Bolt
    networks:
      - lyra_net

networks:
  lyra_net:
    driver: bridge
```

---

### Starting the System

```bash
# 1. Clone repository
git clone https://github.com/yourusername/project-lyra.git
cd project-lyra

# 2. Configure environment
cp .env.example .env
# Edit .env with your LLM backend URLs and API keys

# 3. Start all services
docker-compose up -d

# 4. Check health
curl http://localhost:7078/_health
curl http://localhost:7081/health
curl http://localhost:7077/health

# 5. Open UI
open http://localhost:8081
```

---

### Monitoring & Logs

```bash
# View all logs
docker-compose logs -f

# View specific service
docker-compose logs -f cortex

# Check resource usage
docker stats

# Inspect Cortex sessions
curl http://localhost:7081/debug/sessions

# Check NeoMem memories
curl http://localhost:7077/memories?limit=10
```
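
The same checks can be scripted; a small polling sketch in Python using the health endpoints listed above:

```python
import requests

HEALTH_ENDPOINTS = {
    "relay": "http://localhost:7078/_health",
    "cortex": "http://localhost:7081/health",
    "neomem": "http://localhost:7077/health",
}

def check_health() -> dict:
    """Return a simple up/down map for the core services."""
    status = {}
    for name, url in HEALTH_ENDPOINTS.items():
        try:
            status[name] = requests.get(url, timeout=5).ok
        except requests.RequestException:
            status[name] = False
    return status

if __name__ == "__main__":
    for name, ok in check_health().items():
        print(f"{name}: {'UP' if ok else 'DOWN'}")
```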

---

### Scaling Considerations

#### Current Constraints:

1. **Single Cortex worker** required (in-memory SESSIONS dict)
   - Solution: Migrate SESSIONS to Redis or PostgreSQL

2. **In-memory session storage** in Relay
   - Solution: Use Redis for session persistence

3. **No load balancing** (single instance of each service)
   - Solution: Add nginx reverse proxy + multiple Cortex instances

#### Horizontal Scaling Plan:

```yaml
# Future: Redis-backed session storage
cortex:
  build: ./cortex
  command: uvicorn main:app --workers 4   # Multi-worker
  environment:
    - REDIS_URL=redis://redis:6379
  depends_on:
    - redis

redis:
  image: redis:alpine
  ports:
    - "6379:6379"
```
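
On the Cortex side, the migration would mean swapping the in-process `SESSIONS` dict for a shared store. A rough Redis-backed sketch, kept roughly API-compatible with the Intake functions; key names and the JSON-per-message serialization are assumptions:

```python
import json
import os

import redis.asyncio as redis

REDIS_URL = os.getenv("REDIS_URL", "redis://redis:6379")
BUFFER_SIZE = int(os.getenv("INTAKE_BUFFER_SIZE", "200"))

_r = redis.from_url(REDIS_URL, decode_responses=True)

async def add_exchange(session_id: str, user_msg: str, assistant_msg: str) -> None:
    """Append an exchange to a Redis list, trimming it the way the deque did."""
    key = f"lyra:session:{session_id}"
    await _r.rpush(
        key,
        json.dumps({"role": "user", "content": user_msg}),
        json.dumps({"role": "assistant", "content": assistant_msg}),
    )
    await _r.ltrim(key, -BUFFER_SIZE, -1)  # keep only the newest N messages

async def get_session_messages(session_id: str) -> list:
    raw = await _r.lrange(f"lyra:session:{session_id}", 0, -1)
    return [json.loads(item) for item in raw]
```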

---

### Backup Strategy

```bash
# Backup PostgreSQL (NeoMem vectors)
docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql

# Backup Neo4j (NeoMem graph)
docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump

# Backup Intake sessions (manual export)
curl http://localhost:7081/debug/sessions > backup_sessions.json
```

---

## Known Issues & Constraints

### Critical Constraints

#### 1. Single-Worker Requirement (Cortex)
**Issue:** Cortex must run with `--workers 1` to maintain SESSIONS state
**Impact:** Limited horizontal scalability
**Workaround:** None currently
**Fix:** Migrate SESSIONS to Redis or PostgreSQL
**Priority:** High (blocking scalability)

#### 2. In-Memory Session Storage (Relay)
**Issue:** Sessions stored in Node.js process memory
**Impact:** Lost on restart, no persistence
**Workaround:** None currently
**Fix:** Use Redis or database
**Priority:** Medium (acceptable for demo)

---

### Non-Critical Issues

#### 3. RAG Service Disabled
**Status:** Built but commented out in docker-compose.yml
**Impact:** No RAG-based long-term knowledge retrieval
**Workaround:** NeoMem provides semantic search
**Fix:** Re-enable and integrate RAG service
**Priority:** Low (NeoMem sufficient for now)

#### 4. Partial NeoMem Integration
**Status:** Search implemented, async ingestion planned
**Impact:** Memories not automatically saved
**Workaround:** Manual POST to /memories
**Fix:** Complete async ingestion in Relay
**Priority:** Medium (planned feature)

#### 5. Inner Monologue Observer-Only
**Status:** Stage 0.6 runs but output not used
**Impact:** No adaptive response based on monologue
**Workaround:** None (future feature)
**Fix:** Integrate monologue output into pipeline
**Priority:** Low (experimental feature)

---

### Fixed Issues (v0.5.2)

✅ **LLM Router Blocking** - Migrated from `requests` to `httpx` for async
✅ **Session ID Case Mismatch** - Standardized to `session_id`
✅ **Missing Backend Parameter** - Added to intake summarization

---

### Deprecated Components

**Location:** `/DEPRECATED_FILES.md`

- **Standalone Intake Service** - Now embedded in Cortex
- **Old Relay Backup** - Replaced by current Relay
- **Persona Sidecar** - Built but unused (dynamic persona loading)

---
## Advanced Topics

### Custom Prompt Engineering

Each stage uses carefully crafted prompts:

**Reflection Prompt Example:**
```python
REFLECTION_PROMPT = """
You are Lyra's reflective awareness layer.
Your job is to analyze the user's message and conversation context
to understand their true intent and needs.

User message: {user_message}

Recent context:
{intake_L10_summary}

Long-term context:
{neomem_top_3_memories}

Provide concise meta-awareness notes:
- What is the user's underlying intent?
- What topics/themes are emerging?
- What depth of response is appropriate?
- Are there any implicit questions or concerns?

Keep notes brief (3-5 sentences). Focus on insight, not description.
"""
```

---

### Extending the Pipeline

**Adding Stage 5 (Fact-Checking):**

```python
# /cortex/reasoning/factcheck.py
async def factcheck_answer(answer: str, context: dict) -> dict:
    """
    Stage 5: Verify factual claims in answer.

    Returns:
        {
            "verified": bool,
            "flagged_claims": list,
            "corrected_answer": str
        }
    """

    prompt = f"""
    Review this answer for factual accuracy:

    {answer}

    Flag any claims that seem dubious or need verification.
    Provide corrected version if needed.
    """

    result = await call_llm(prompt, backend="CLOUD", temperature=0.1)
    return parse_factcheck_result(result)

# Update router.py to include Stage 5
async def reason_endpoint(request):
    # ... existing stages ...

    # Stage 5: Fact-checking
    factcheck_result = await factcheck_answer(final_answer, context)

    if not factcheck_result["verified"]:
        final_answer = factcheck_result["corrected_answer"]

    return {"answer": final_answer}
```

---

### Custom LLM Backend Integration

**Adding Anthropic Claude:**

```python
# /cortex/llm/llm_router.py

BACKEND_CONFIGS = {
    # ... existing backends ...

    "CLAUDE": {
        "url": "https://api.anthropic.com/v1",
        "provider": "anthropic",
        "model": "claude-3-5-sonnet-20241022",
        "api_key": os.getenv("ANTHROPIC_API_KEY")
    }
}

# Add provider-specific logic
elif backend_config["provider"] == "anthropic":
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01"
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    response = await httpx_client.post(
        f"{url}/messages",
        json=payload,
        headers=headers,
        timeout=120
    )
    return response.json()["content"][0]["text"]
```

---

### Performance Optimization

**Caching Strategies:**

```python
# /cortex/utils/cache.py
import hashlib

# Simple in-process cache; only used for deterministic calls (temperature=0)
_LLM_CACHE: dict = {}

def cache_key(prompt: str, backend: str) -> str:
    """Build a stable cache key for identical prompts against one backend"""
    return f"{backend}:{hashlib.md5(prompt.encode()).hexdigest()}"

# Usage in llm_router.py
async def call_llm(prompt, backend, temperature=0.7, max_tokens=512):
    if temperature == 0:
        key = cache_key(prompt, backend)
        if key in _LLM_CACHE:
            return _LLM_CACHE[key]

    # ... normal LLM call, then store the result:
    # if temperature == 0: _LLM_CACHE[key] = result
```

**Database Query Optimization:**

```python
# /neomem/neomem/database.py

# BAD: Load all memories, then filter
def search_memories(query):
    all_memories = db.execute("SELECT * FROM memories")
    # Expensive in-memory filtering
    return [m for m in all_memories if similarity(m, query) > 0.8]

# GOOD: Use database indexes and LIMIT
def search_memories(query, limit=5):
    query_embedding = embed(query)
    return db.execute("""
        SELECT * FROM memories
        WHERE embedding <-> %s < 0.2    -- pgvector distance operator
        ORDER BY embedding <-> %s
        LIMIT %s
    """, (query_embedding, query_embedding, limit))
```

---

## Conclusion

Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through:

1. **Dual-memory architecture** (short-term Intake + long-term NeoMem)
2. **Multi-stage reasoning pipeline** (Reflection → Reasoning → Refinement → Persona)
3. **Flexible multi-backend LLM support** (cloud + local with fallback)
4. **Microservices design** for scalability and maintainability
5. **Modern web UI** with session management

The system is production-ready with comprehensive error handling, logging, and health monitoring.

---

## Quick Reference

### Service Ports
- **UI:** 8081 (Browser interface)
- **Relay:** 7078 (Main orchestrator)
- **Cortex:** 7081 (Reasoning engine)
- **NeoMem:** 7077 (Long-term memory)
- **PostgreSQL:** 5432 (Vector storage)
- **Neo4j:** 7474 (Browser), 7687 (Bolt)

### Key Files
- **Main Entry:** `/core/relay/server.js`
- **Reasoning Pipeline:** `/cortex/router.py`
- **LLM Router:** `/cortex/llm/llm_router.py`
- **Short-term Memory:** `/cortex/intake/intake.py`
- **Long-term Memory:** `/neomem/neomem/`
- **Personality:** `/cortex/persona/identity.py`

### Important Commands
```bash
# Start system
docker-compose up -d

# View logs
docker-compose logs -f cortex

# Debug sessions
curl http://localhost:7081/debug/sessions

# Health check
curl http://localhost:7078/_health

# Search memories
curl "http://localhost:7077/search?query=deployment&limit=5"
```

---

**Document Version:** 1.0
**Last Updated:** 2025-12-13
**Maintained By:** Project Lyra Team