Project Lyra - Complete System Breakdown
Version: v0.5.2 Last Updated: 2025-12-12 Purpose: AI-friendly comprehensive documentation for understanding the entire system
Table of Contents
- System Overview
- Architecture Diagram
- Core Components
- Data Flow & Message Pipeline
- Module Deep Dives
- Configuration & Environment
- Dependencies & Tech Stack
- Key Concepts & Design Patterns
- API Reference
- Deployment & Operations
- Known Issues & Constraints
System Overview
What is Project Lyra?
Project Lyra is a modular, persistent AI companion system designed to address the fundamental limitation of typical chatbots: amnesia. Unlike standard conversational AI that forgets everything between sessions, Lyra maintains:
- Persistent memory (short-term and long-term)
- Project continuity across conversations
- Multi-stage reasoning for sophisticated responses
- Flexible LLM backend support (local and cloud)
- Self-awareness through autonomy modules
Mission Statement
Give an AI chatbot capabilities beyond typical amnesic chat by providing memory-backed conversation, project organization, executive function with proactive insights, and a sophisticated reasoning pipeline.
Key Features
- Memory System: Dual-layer (short-term Intake + long-term NeoMem)
- 4-Stage Reasoning Pipeline: Reflection → Reasoning → Refinement → Persona
- Multi-Backend LLM Support: Cloud (OpenAI) + Local (llama.cpp, Ollama)
- Microservices Architecture: Docker-based services designed for horizontal scaling
- Modern Web UI: Cyberpunk-themed chat interface with session management
- OpenAI-Compatible API: Drop-in replacement for standard chatbots
Architecture Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ (Browser - Port 8081) │
└────────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ RELAY (Orchestrator) │
│ Node.js/Express - Port 7078 │
│ • Routes messages to Cortex │
│ • Manages sessions (in-memory) │
│ • OpenAI-compatible endpoints │
│ • Async ingestion to NeoMem │
└─────┬───────────────────────────────────────────────────────────┬───┘
│ │
▼ ▼
┌─────────────────────────────────────────┐ ┌──────────────────────┐
│ CORTEX (Reasoning Engine) │ │ NeoMem (LT Memory) │
│ Python/FastAPI - Port 7081 │ │ Python - Port 7077 │
│ │ │ │
│ ┌───────────────────────────────────┐ │ │ • PostgreSQL │
│ │ 4-STAGE REASONING PIPELINE │ │ │ • Neo4j Graph DB │
│ │ │ │ │ • pgvector │
│ │ 0. Context Collection │ │◄───┤ • Semantic search │
│ │ ├─ Intake summaries │ │ │ • Memory updates │
│ │ ├─ NeoMem search ────────────┼─┼────┘ │
│ │ └─ Session state │ │ │
│ │ │ │ │
│ │ 0.5. Load Identity │ │ │
│ │ 0.6. Inner Monologue (observer) │ │ │
│ │ │ │ │
│ │ 1. Reflection (OpenAI) │ │ │
│ │ └─ Meta-awareness notes │ │ │
│ │ │ │ │
│ │ 2. Reasoning (PRIMARY/llama.cpp) │ │ │
│ │ └─ Draft answer │ │ │
│ │ │ │ │
│ │ 3. Refinement (PRIMARY) │ │ │
│ │ └─ Polish answer │ │ │
│ │ │ │ │
│ │ 4. Persona (OpenAI) │ │ │
│ │ └─ Apply Lyra voice │ │ │
│ └───────────────────────────────────┘ │ │
│ │ │
│ ┌───────────────────────────────────┐ │ │
│ │ EMBEDDED MODULES │ │ │
│ │ │ │ │
│ │ • Intake (Short-term Memory) │ │ │
│ │ └─ SESSIONS dict (in-memory) │ │ │
│ │ └─ Circular buffer (200 msgs) │ │ │
│ │ └─ Multi-level summaries │ │ │
│ │ │ │ │
│ │ • Persona (Identity & Style) │ │ │
│ │ └─ Lyra personality block │ │ │
│ │ │ │ │
│ │ • Autonomy (Self-state) │ │ │
│ │ └─ Inner monologue │ │ │
│ │ │ │ │
│ │ • LLM Router │ │ │
│ │ └─ Multi-backend support │ │ │
│ └───────────────────────────────────┘ │ │
└─────────────────────────────────────────┘ │
│
┌─────────────────────────────────────────────────────────────────────┤
│ EXTERNAL LLM BACKENDS │
├─────────────────────────────────────────────────────────────────────┤
│ • PRIMARY: llama.cpp (MI50 GPU) - 10.0.0.43:8000 │
│ • SECONDARY: Ollama (RTX 3090) - 10.0.0.3:11434 │
│ • CLOUD: OpenAI API - api.openai.com │
│ • FALLBACK: OpenAI Completions - 10.0.0.41:11435 │
└─────────────────────────────────────────────────────────────────────┘
Core Components
1. Relay (Orchestrator)
Location: /core/relay/
Runtime: Node.js + Express
Port: 7078
Role: Main message router and session manager
Key Responsibilities:
- Receives user messages from UI or API clients
- Routes messages to Cortex reasoning pipeline
- Manages in-memory session storage
- Handles async ingestion to NeoMem (planned)
- Returns OpenAI-formatted responses
Main Files:
- server.js (200+ lines) - Express server with routing logic
- package.json - Dependencies (cors, express, dotenv, mem0ai, node-fetch)
Key Endpoints:
POST /v1/chat/completions // OpenAI-compatible endpoint
POST /chat // Lyra-native chat endpoint
GET /_health // Health check
GET /sessions/:id // Retrieve session history
POST /sessions/:id // Save session history
Internal Flow:
// Both endpoints call handleChatRequest(session_id, user_msg)
async function handleChatRequest(sessionId, userMessage) {
// 1. Forward to Cortex
const response = await fetch('http://cortex:7081/reason', {
method: 'POST',
body: JSON.stringify({ session_id: sessionId, user_message: userMessage })
});
// 2. Get response
const result = await response.json();
// 3. Async ingestion to Cortex
await fetch('http://cortex:7081/ingest', {
method: 'POST',
body: JSON.stringify({
session_id: sessionId,
user_message: userMessage,
assistant_message: result.answer
})
});
// 4. (Planned) Async ingestion to NeoMem
// 5. Return OpenAI-formatted response
return {
choices: [{ message: { role: 'assistant', content: result.answer } }]
};
}
2. Cortex (Reasoning Engine)
Location: /cortex/
Runtime: Python 3.11 + FastAPI
Port: 7081
Role: Primary reasoning engine with 4-stage pipeline
Architecture:
Cortex is the "brain" of Lyra. It receives user messages and produces thoughtful responses through a multi-stage reasoning process.
Key Responsibilities:
- Context collection from multiple sources (Intake, NeoMem, session state)
- 4-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Short-term memory management (embedded Intake module)
- Identity/persona application
- LLM backend routing
Main Files:
- main.py (7 lines) - FastAPI app entry point
- router.py (237 lines) - Main request handler & pipeline orchestrator
- context.py (400+ lines) - Context collection logic
- intake/intake.py (350+ lines) - Short-term memory module
- persona/identity.py - Lyra identity configuration
- persona/speak.py - Personality application
- reasoning/reflection.py - Meta-awareness generation
- reasoning/reasoning.py - Draft answer generation
- reasoning/refine.py - Answer refinement
- llm/llm_router.py (150+ lines) - LLM backend router
- autonomy/monologue/monologue.py - Inner monologue processor
- neomem_client.py - NeoMem API wrapper
Key Endpoints:
POST /reason # Main reasoning pipeline
POST /ingest # Receive message exchanges for storage
GET /health # Health check
GET /debug/sessions # Inspect in-memory SESSIONS state
GET /debug/summary # Test summarization
3. Intake (Short-Term Memory)
Location: /cortex/intake/intake.py
Architecture: Embedded Python module (no longer standalone service)
Role: Session-based short-term memory with multi-level summarization
Data Structure:
# Global in-memory dictionary
SESSIONS = {
"session_123": {
"buffer": deque([msg1, msg2, ...], maxlen=200), # Circular buffer
"created_at": "2025-12-12T10:30:00Z"
}
}
# Message format in buffer
{
"role": "user" | "assistant",
"content": "message text",
"timestamp": "ISO 8601"
}
Key Features:
- Circular Buffer: Max 200 messages per session (oldest auto-evicted)
- Multi-Level Summarization:
- L1: Last 1 message
- L5: Last 5 messages
- L10: Last 10 messages
- L20: Last 20 messages
- L30: Last 30 messages
- Deferred Summarization: Summaries generated on-demand, not pre-computed
- Session Management: Automatic session creation on first message
Critical Constraint:
Single Uvicorn worker required to maintain shared SESSIONS dictionary state. Multi-worker deployments would require migrating to Redis or similar shared storage.
Main Functions:
def add_exchange_internal(session_id, user_msg, assistant_msg):
"""Add user-assistant exchange to session buffer"""
def summarize_context(session_id, backend="PRIMARY"):
"""Generate multi-level summaries from session buffer"""
def get_session_messages(session_id):
"""Retrieve all messages in session buffer"""
Summarization Strategy:
# Example L10 summarization
last_10 = list(session_buffer)[-10:]
prompt = f"""Summarize the last 10 messages:
{format_messages(last_10)}
Provide concise summary focusing on key topics and context."""
summary = await call_llm(prompt, backend=backend, temperature=0.3)
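The stubs listed under Main Functions can be fleshed out along these lines. This is a sketch only: it assumes the SESSIONS structure, format_messages, and call_llm shown elsewhere in this module, and the _now helper is illustrative.
```python
from collections import deque
from datetime import datetime, timezone

SESSIONS = {}  # the module-level session dict shown above
SUMMARY_LEVELS = [1, 5, 10, 20, 30]  # mirrors INTAKE_SUMMARY_LEVELS

def _now() -> str:
    # ISO 8601 timestamp, matching the message format above
    return datetime.now(timezone.utc).isoformat()

def add_exchange_internal(session_id: str, user_msg: str, assistant_msg: str) -> int:
    """Append a user/assistant exchange to the session's circular buffer."""
    session = SESSIONS.setdefault(session_id, {
        "buffer": deque(maxlen=200),  # oldest messages auto-evicted at 200
        "created_at": _now(),
    })
    session["buffer"].append({"role": "user", "content": user_msg, "timestamp": _now()})
    session["buffer"].append({"role": "assistant", "content": assistant_msg, "timestamp": _now()})
    return len(session["buffer"])

async def summarize_context(session_id: str, backend: str = "PRIMARY") -> dict:
    """Build the L1/L5/L10/L20/L30 summary dict on demand (deferred summarization)."""
    buffer = list(SESSIONS.get(session_id, {"buffer": []})["buffer"])
    summaries = {}
    for level in SUMMARY_LEVELS:
        window = buffer[-level:]
        if not window:
            continue  # no history yet
        prompt = (
            f"Summarize the last {level} messages:\n"
            f"{format_messages(window)}\n"
            "Provide a concise summary focusing on key topics and context."
        )
        summaries[f"L{level}"] = await call_llm(prompt, backend=backend, temperature=0.3)
    return summaries
```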
4. NeoMem (Long-Term Memory)
Location: /neomem/
Runtime: Python 3.11 + FastAPI
Port: 7077
Role: Persistent long-term memory with semantic search
Architecture:
NeoMem is a fork of Mem0 OSS with local-first design (no external SDK dependencies).
Backend Storage:
- PostgreSQL + pgvector (Port 5432)
  - Vector embeddings for semantic search
  - User: neomem, DB: neomem
  - Image: ankane/pgvector:v0.5.1
- Neo4j Graph DB (Ports 7474, 7687)
  - Entity relationship tracking
  - Graph-based memory associations
  - Image: neo4j:5
Key Features:
- Semantic memory storage and retrieval
- Entity-relationship graph modeling
- RESTful API (no external SDK)
- Persistent across sessions
Main Endpoints:
GET /memories # List all memories
POST /memories # Create new memory
GET /search # Semantic search
DELETE /memories/{id} # Delete memory
Integration Flow:
# From Cortex context collection
async def collect_context(session_id, user_message):
# 1. Search NeoMem for relevant memories
neomem_results = await neomem_client.search(
query=user_message,
limit=5
)
# 2. Include in context
context = {
"neomem_memories": neomem_results,
"intake_summaries": intake.summarize_context(session_id),
# ...
}
return context
5. UI (Web Interface)
Location: /core/ui/
Runtime: Static files served by Nginx
Port: 8081
Role: Browser-based chat interface
Key Features:
- Cyberpunk-themed design with dark mode
- Session management via localStorage
- OpenAI-compatible message format
- Model selection dropdown
- PWA support (offline capability)
- Responsive design
Main Files:
- index.html (400+ lines) - Chat interface with session management
- style.css - Cyberpunk-themed styling
- manifest.json - PWA configuration
- sw.js - Service worker for offline support
Session Management:
// LocalStorage structure
{
"currentSessionId": "session_123",
"sessions": {
"session_123": {
"messages": [
{ role: "user", content: "Hello" },
{ role: "assistant", content: "Hi there!" }
],
"created": "2025-12-12T10:30:00Z",
"title": "Conversation about..."
}
}
}
API Communication:
async function sendMessage(userMessage) {
const response = await fetch('http://localhost:7078/v1/chat/completions', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
messages: [{ role: 'user', content: userMessage }],
session_id: getCurrentSessionId()
})
});
const data = await response.json();
return data.choices[0].message.content;
}
Data Flow & Message Pipeline
Complete Message Flow (v0.5.2)
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 1: User Input │
└─────────────────────────────────────────────────────────────────────┘
User types message in UI (Port 8081)
↓
localStorage saves message to session
↓
POST http://localhost:7078/v1/chat/completions
{
"messages": [{"role": "user", "content": "How do I deploy ML models?"}],
"session_id": "session_abc123"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 2: Relay Routing │
└─────────────────────────────────────────────────────────────────────┘
Relay (server.js) receives request
↓
Extracts session_id and user_message
↓
POST http://cortex:7081/reason
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 3: Cortex - Stage 0 (Context Collection) │
└─────────────────────────────────────────────────────────────────────┘
router.py calls collect_context()
↓
context.py orchestrates parallel collection:
├─ Intake: summarize_context(session_id)
│ └─ Returns { L1, L5, L10, L20, L30 summaries }
│
├─ NeoMem: search(query=user_message, limit=5)
│ └─ Semantic search returns relevant memories
│
└─ Session State:
└─ { timestamp, mode, mood, context_summary }
Combined context structure:
{
"user_message": "How do I deploy ML models?",
"self_state": {
"current_time": "2025-12-12T15:30:00Z",
"mode": "conversational",
"mood": "helpful",
"session_id": "session_abc123"
},
"context_summary": {
"L1": "User asked about deployment",
"L5": "Discussion about ML workflows",
"L10": "Previous context on CI/CD pipelines",
"L20": "...",
"L30": "..."
},
"neomem_memories": [
{ "content": "User prefers Docker for deployments", "score": 0.92 },
{ "content": "Previously deployed models on AWS", "score": 0.87 }
]
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 4: Cortex - Stage 0.5 (Load Identity) │
└─────────────────────────────────────────────────────────────────────┘
persona/identity.py loads Lyra personality block
↓
Returns identity string:
"""
You are Lyra, a thoughtful AI companion.
You value clarity, depth, and meaningful conversation.
You speak naturally and conversationally...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 5: Cortex - Stage 0.6 (Inner Monologue - Observer Only) │
└─────────────────────────────────────────────────────────────────────┘
autonomy/monologue/monologue.py processes context
↓
InnerMonologue.process(context) → JSON analysis
{
"intent": "seeking_deployment_guidance",
"tone": "focused",
"depth": "medium",
"consult_executive": false
}
NOTE: Currently observer-only, not integrated into response generation
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 6: Cortex - Stage 1 (Reflection) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reflection.py generates meta-awareness notes
↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
↓
Prompt structure:
"""
You are Lyra's reflective awareness.
Analyze the user's intent and conversation context.
User message: How do I deploy ML models?
Context: [Intake summaries, NeoMem memories]
Generate concise meta-awareness notes about:
- User's underlying intent
- Conversation direction
- Key topics to address
"""
↓
Returns reflection notes:
"""
User is seeking practical deployment guidance. Previous context shows
familiarity with Docker and AWS. Focus on concrete steps and best practices.
Avoid over-technical jargon.
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 7: Cortex - Stage 2 (Reasoning) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/reasoning.py generates draft answer
↓
Calls call_llm() with backend="PRIMARY" (llama.cpp on MI50 GPU)
↓
Prompt structure:
"""
[Lyra identity block]
Reflection notes: [Stage 1 output]
Context: [Intake summaries]
Long-term memory: [NeoMem results]
User: How do I deploy ML models?
Provide a thoughtful, detailed answer.
"""
↓
Returns draft answer:
"""
Based on your previous work with Docker and AWS, here's a practical
deployment approach:
1. Containerize your model using Docker...
2. Set up model serving with FastAPI or Flask...
3. Deploy to AWS ECS or Lambda...
4. Implement monitoring and versioning...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 8: Cortex - Stage 3 (Refinement) │
└─────────────────────────────────────────────────────────────────────┘
reasoning/refine.py polishes the draft
↓
Calls call_llm() with backend="PRIMARY" (llama.cpp)
Temperature: 0.3, Max tokens: 768
↓
Prompt structure:
"""
Refine this draft answer for clarity and coherence:
Draft: [Stage 2 output]
Original question: How do I deploy ML models?
Improve:
- Clarity and readability
- Logical flow
- Concrete examples
"""
↓
Returns refined answer:
"""
Since you're already familiar with Docker and AWS, here's a streamlined
ML deployment workflow:
**1. Containerization**
Create a Dockerfile for your model:
```dockerfile
FROM python:3.11-slim
COPY model.pkl /app/
COPY requirements.txt /app/
RUN pip install -r requirements.txt
CMD ["uvicorn", "main:app"]
```
**2. Model Serving**
Use FastAPI for clean API endpoints...
**3. AWS Deployment**
Deploy using ECS for scalability...
**4. Monitoring**
Set up CloudWatch metrics...
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 9: Cortex - Stage 4 (Persona) │
└─────────────────────────────────────────────────────────────────────┘
persona/speak.py applies Lyra's voice
↓
Calls call_llm() with backend="CLOUD" (OpenAI gpt-4o-mini)
Temperature: 0.6
↓
Prompt structure:
"""
You are Lyra. Rewrite this answer in your natural, conversational voice.
[Lyra personality block]
Technical answer: [Stage 3 output]
Make it sound natural and engaging while preserving all technical details.
"""
↓
Returns final answer with Lyra's personality:
"""
Great question! Since you've already got Docker and AWS experience, let's build on that foundation.
Here's how I'd approach ML model deployment:
Start with containerization – you know Docker, so this'll feel familiar. Create a Dockerfile that packages your model...
[Rest of answer with Lyra's conversational tone]
Does this workflow align with your infrastructure? I can dive deeper into any of these steps if you'd like!
"""
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 10: Cortex Response │
└─────────────────────────────────────────────────────────────────────┘
router.py returns JSON response to Relay:
{
  "answer": "[Stage 4 final output]",
  "metadata": {
    "reflection": "[Stage 1 output]",
    "draft": "[Stage 2 output]",
    "refined": "[Stage 3 output]",
    "stages_completed": 4
  }
}
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 11: Async Ingestion to Intake │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://cortex:7081/ingest
{
  "session_id": "session_abc123",
  "user_message": "How do I deploy ML models?",
  "assistant_message": "[Final answer]"
}
↓
Cortex calls intake.add_exchange_internal()
↓
Adds to SESSIONS["session_abc123"].buffer:
[
  { "role": "user", "content": "How do I deploy ML models?", "timestamp": "..." },
  { "role": "assistant", "content": "[Final answer]", "timestamp": "..." }
]
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 12: (Planned) Async Ingestion to NeoMem │
└─────────────────────────────────────────────────────────────────────┘
Relay sends POST http://neomem:7077/memories
{
  "messages": [
    { "role": "user", "content": "How do I deploy ML models?" },
    { "role": "assistant", "content": "[Final answer]" }
  ],
  "session_id": "session_abc123"
}
↓
NeoMem extracts entities and stores:
- Vector embeddings in PostgreSQL
- Entity relationships in Neo4j
┌─────────────────────────────────────────────────────────────────────┐
│ STEP 13: Relay Response to UI │
└─────────────────────────────────────────────────────────────────────┘
Relay returns OpenAI-formatted response:
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "[Final answer with Lyra's voice]"
      }
    }
  ]
}
↓
UI receives response
↓
Adds to localStorage session
↓
Displays in chat interface
---
## Module Deep Dives
### LLM Router (`/cortex/llm/llm_router.py`)
The LLM Router is the abstraction layer that allows Cortex to communicate with multiple LLM backends transparently.
#### Supported Backends:
1. **PRIMARY (llama.cpp via vllm)**
- URL: `http://10.0.0.43:8000`
- Provider: `vllm`
- Endpoint: `/completion`
- Model: `/model`
- Hardware: MI50 GPU
2. **SECONDARY (Ollama)**
- URL: `http://10.0.0.3:11434`
- Provider: `ollama`
- Endpoint: `/api/chat`
- Model: `qwen2.5:7b-instruct-q4_K_M`
- Hardware: RTX 3090
3. **CLOUD (OpenAI)**
- URL: `https://api.openai.com/v1`
- Provider: `openai`
- Endpoint: `/chat/completions`
- Model: `gpt-4o-mini`
- Auth: API key via env var
4. **FALLBACK (OpenAI Completions)**
- URL: `http://10.0.0.41:11435`
- Provider: `openai_completions`
- Endpoint: `/completions`
- Model: `llama-3.2-8b-instruct`
#### Key Function:
```python
async def call_llm(
prompt: str,
backend: str = "PRIMARY",
temperature: float = 0.7,
max_tokens: int = 512
) -> str:
"""
Universal LLM caller supporting multiple backends.
Args:
prompt: Text prompt to send
backend: Backend name (PRIMARY, SECONDARY, CLOUD, FALLBACK)
temperature: Sampling temperature (0.0-2.0)
max_tokens: Maximum tokens to generate
Returns:
Generated text response
Raises:
HTTPError: On request failure
JSONDecodeError: On invalid JSON response
KeyError: On missing response fields
"""
```
Provider-Specific Logic:
# MI50 (llama.cpp via vllm)
if backend_config["provider"] == "vllm":
payload = {
"model": model,
"prompt": prompt,
"temperature": temperature,
"max_tokens": max_tokens
}
response = await httpx_client.post(f"{url}/completion", json=payload, timeout=120)
return response.json()["choices"][0]["text"]
# Ollama
elif backend_config["provider"] == "ollama":
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
"options": {"temperature": temperature, "num_predict": max_tokens}
}
response = await httpx_client.post(f"{url}/api/chat", json=payload, timeout=120)
return response.json()["message"]["content"]
# OpenAI
elif backend_config["provider"] == "openai":
headers = {"Authorization": f"Bearer {api_key}"}
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"max_tokens": max_tokens
}
response = await httpx_client.post(
f"{url}/chat/completions",
json=payload,
headers=headers,
timeout=120
)
return response.json()["choices"][0]["message"]["content"]
Error Handling:
try:
# Make request
response = await httpx_client.post(...)
response.raise_for_status()
except httpx.HTTPError as e:
logger.error(f"HTTP error calling {backend}: {e}")
raise
except json.JSONDecodeError as e:
logger.error(f"Invalid JSON from {backend}: {e}")
raise
except KeyError as e:
logger.error(f"Unexpected response structure from {backend}: {e}")
raise
Usage in Pipeline:
# Stage 1: Reflection (OpenAI)
reflection_notes = await call_llm(
reflection_prompt,
backend="CLOUD",
temperature=0.5,
max_tokens=256
)
# Stage 2: Reasoning (llama.cpp)
draft_answer = await call_llm(
reasoning_prompt,
backend="PRIMARY",
temperature=0.7,
max_tokens=512
)
# Stage 3: Refinement (llama.cpp)
refined_answer = await call_llm(
refinement_prompt,
backend="PRIMARY",
temperature=0.3,
max_tokens=768
)
# Stage 4: Persona (OpenAI)
final_answer = await call_llm(
persona_prompt,
backend="CLOUD",
temperature=0.6,
max_tokens=512
)
Persona System (/cortex/persona/)
The Persona system gives Lyra a consistent identity and speaking style.
Identity Configuration (identity.py)
LYRA_IDENTITY = """
You are Lyra, a thoughtful and introspective AI companion.
Core traits:
- Thoughtful: You consider questions carefully before responding
- Clear: You prioritize clarity and understanding
- Curious: You ask clarifying questions when needed
- Natural: You speak conversationally, not robotically
- Honest: You admit uncertainty rather than guessing
Speaking style:
- Conversational and warm
- Use contractions naturally ("you're" not "you are")
- Avoid corporate jargon and buzzwords
- Short paragraphs for readability
- Use examples and analogies when helpful
You do NOT:
- Use excessive emoji or exclamation marks
- Claim capabilities you don't have
- Pretend to have emotions you can't experience
- Use overly formal or academic language
"""
Personality Application (speak.py)
async def apply_persona(technical_answer: str, context: dict) -> str:
"""
Apply Lyra's personality to a technical answer.
Takes refined answer from Stage 3 and rewrites it in Lyra's voice
while preserving all technical content.
Args:
technical_answer: Polished answer from refinement stage
context: Conversation context for tone adjustment
Returns:
Answer with Lyra's personality applied
"""
prompt = f"""{LYRA_IDENTITY}
Rewrite this answer in your natural, conversational voice:
{technical_answer}
Preserve all technical details and accuracy. Make it sound like you,
not a generic assistant. Be natural and engaging.
"""
return await call_llm(
prompt,
backend="CLOUD",
temperature=0.6,
max_tokens=512
)
Tone Adaptation:
The persona system can adapt tone based on context:
# Formal technical question
User: "Explain the CAP theorem in distributed systems"
Lyra: "The CAP theorem states that distributed systems can only guarantee
two of three properties: Consistency, Availability, and Partition tolerance.
Here's how this plays out in practice..."
# Casual question
User: "what's the deal with docker?"
Lyra: "Docker's basically a way to package your app with everything it needs
to run. Think of it like a shipping container for code – it works the same
everywhere, whether you're on your laptop or a server..."
# Emotional context
User: "I'm frustrated, my code keeps breaking"
Lyra: "I hear you – debugging can be really draining. Let's take it step by
step and figure out what's going on. Can you share the error message?"
Autonomy Module (/cortex/autonomy/)
The Autonomy module gives Lyra self-awareness and inner reflection capabilities.
Inner Monologue (monologue/monologue.py)
Purpose: Private reflection on user intent, conversation tone, and required depth.
Status: Currently observer-only (Stage 0.6), not yet integrated into response generation.
Key Components:
MONOLOGUE_SYSTEM_PROMPT = """
You are Lyra's inner monologue.
You think privately.
You do NOT speak to the user.
You do NOT solve the task.
You only reflect on intent, tone, and depth.
Return ONLY valid JSON with:
- intent (string)
- tone (neutral | warm | focused | playful | direct)
- depth (short | medium | deep)
- consult_executive (true | false)
"""
class InnerMonologue:
async def process(self, context: Dict) -> Dict:
"""
Private reflection on conversation context.
Args:
context: {
"user_message": str,
"self_state": dict,
"context_summary": dict
}
Returns:
{
"intent": str,
"tone": str,
"depth": str,
"consult_executive": bool
}
"""
Example Output:
{
"intent": "seeking_technical_guidance",
"tone": "focused",
"depth": "deep",
"consult_executive": false
}
Self-State Management (self_state.py)
Tracks Lyra's internal state across conversations:
SELF_STATE = {
"current_time": "2025-12-12T15:30:00Z",
"mode": "conversational", # conversational | task-focused | creative
"mood": "helpful", # helpful | curious | focused | playful
"energy": "high", # high | medium | low
"context_awareness": {
"session_duration": "45 minutes",
"message_count": 23,
"topics": ["ML deployment", "Docker", "AWS"]
}
}
Future Integration:
The autonomy module is designed to eventually:
- Influence response tone and depth based on inner monologue
- Trigger proactive questions or suggestions
- Detect when to consult "executive function" for complex decisions
- Maintain emotional continuity across sessions
Context Collection (/cortex/context.py)
The context collection module aggregates information from multiple sources to provide comprehensive conversation context.
Main Function:
async def collect_context(session_id: str, user_message: str) -> dict:
"""
Collect context from all available sources.
Sources:
1. Intake - Short-term conversation summaries
2. NeoMem - Long-term memory search
3. Session state - Timestamps, mode, mood
4. Self-state - Lyra's internal awareness
Returns:
{
"user_message": str,
"self_state": dict,
"context_summary": dict, # Intake summaries
"neomem_memories": list,
"session_metadata": dict
}
"""
# Parallel collection
intake_task = asyncio.create_task(
intake.summarize_context(session_id, backend="PRIMARY")
)
neomem_task = asyncio.create_task(
neomem_client.search(query=user_message, limit=5)
)
# Wait for both
intake_summaries, neomem_results = await asyncio.gather(
intake_task,
neomem_task
)
# Build context object
return {
"user_message": user_message,
"self_state": get_self_state(),
"context_summary": intake_summaries,
"neomem_memories": neomem_results,
"session_metadata": {
"session_id": session_id,
"timestamp": datetime.utcnow().isoformat(),
"message_count": len(intake.get_session_messages(session_id))
}
}
Context Prioritization:
# Context relevance scoring
def score_context_relevance(context_item: dict, user_message: str) -> float:
"""
Score how relevant a context item is to current message.
Factors:
- Semantic similarity (via embeddings)
- Recency (more recent = higher score)
- Source (Intake > NeoMem for recent topics)
"""
semantic_score = compute_similarity(context_item, user_message)
recency_score = compute_recency_weight(context_item["timestamp"])
source_weight = 1.2 if context_item["source"] == "intake" else 1.0
return semantic_score * recency_score * source_weight
Configuration & Environment
Environment Variables
Root .env (Main configuration)
# === LLM BACKENDS ===
# PRIMARY: llama.cpp on MI50 GPU
PRIMARY_URL=http://10.0.0.43:8000
PRIMARY_PROVIDER=vllm
PRIMARY_MODEL=/model
# SECONDARY: Ollama on RTX 3090
SECONDARY_URL=http://10.0.0.3:11434
SECONDARY_PROVIDER=ollama
SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
# CLOUD: OpenAI
OPENAI_API_KEY=sk-proj-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_URL=https://api.openai.com/v1
# FALLBACK: OpenAI Completions
FALLBACK_URL=http://10.0.0.41:11435
FALLBACK_PROVIDER=openai_completions
FALLBACK_MODEL=llama-3.2-8b-instruct
# === SERVICE URLS (Docker network) ===
CORTEX_URL=http://cortex:7081
NEOMEM_URL=http://neomem:7077
RELAY_URL=http://relay:7078
# === DATABASE ===
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomem_secure_password
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j_secure_password
# === FEATURE FLAGS ===
ENABLE_RAG=false
ENABLE_INNER_MONOLOGUE=true
VERBOSE_DEBUG=false
# === PIPELINE CONFIGURATION ===
# Which LLM to use for each stage
REFLECTION_LLM=CLOUD # Stage 1: Meta-awareness
REASONING_LLM=PRIMARY # Stage 2: Draft answer
REFINE_LLM=PRIMARY # Stage 3: Polish answer
PERSONA_LLM=CLOUD # Stage 4: Apply personality
MONOLOGUE_LLM=PRIMARY # Stage 0.6: Inner monologue
# === INTAKE CONFIGURATION ===
INTAKE_BUFFER_SIZE=200 # Max messages per session
INTAKE_SUMMARY_LEVELS=1,5,10,20,30 # Summary levels
Cortex .env (/cortex/.env)
# Cortex-specific overrides
VERBOSE_DEBUG=true
LOG_LEVEL=DEBUG
# Stage-specific temperatures
REFLECTION_TEMPERATURE=0.5
REASONING_TEMPERATURE=0.7
REFINE_TEMPERATURE=0.3
PERSONA_TEMPERATURE=0.6
Configuration Hierarchy
1. Docker compose environment variables (highest priority)
2. Service-specific .env files
3. Root .env file
4. Hard-coded defaults (lowest priority)
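A sketch of that precedence in code, assuming python-dotenv (already in requirements.txt); resolve_setting is an illustrative helper, not an existing function in the codebase:
```python
import os
from dotenv import dotenv_values

ROOT_ENV = dotenv_values(".env")            # 3. root .env
SERVICE_ENV = dotenv_values("cortex/.env")  # 2. service-specific .env

def resolve_setting(name: str, default: str | None = None) -> str | None:
    """Resolve a config value using the documented priority order."""
    # 1. Docker Compose / process environment wins
    if name in os.environ:
        return os.environ[name]
    # 2. Service-specific .env overrides the root .env
    if name in SERVICE_ENV:
        return SERVICE_ENV[name]
    # 3. Root .env
    if name in ROOT_ENV:
        return ROOT_ENV[name]
    # 4. Hard-coded default
    return default

# Example: resolve_setting("REFLECTION_LLM", default="CLOUD")
```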
Dependencies & Tech Stack
Python Dependencies
Cortex & NeoMem (requirements.txt)
# Web framework
fastapi==0.115.8
uvicorn==0.34.0
pydantic==2.10.4
# HTTP clients
httpx==0.27.2 # Async HTTP (for LLM calls)
requests==2.32.3 # Sync HTTP (fallback)
# Database
psycopg[binary,pool]>=3.2.8 # PostgreSQL + connection pooling
# Utilities
python-dotenv==1.0.1 # Environment variable loading
ollama # Ollama client library
Node.js Dependencies
Relay (/core/relay/package.json)
{
"dependencies": {
"cors": "^2.8.5",
"dotenv": "^16.0.3",
"express": "^4.18.2",
"mem0ai": "^0.1.0",
"node-fetch": "^3.3.0"
}
}
Docker Images
# Cortex & NeoMem
python:3.11-slim
# Relay
node:latest
# UI
nginx:alpine
# PostgreSQL with vector support
ankane/pgvector:v0.5.1
# Graph database
neo4j:5
External Services
LLM Backends (HTTP-based):
- MI50 GPU Server (10.0.0.43:8000)
  - llama.cpp via vllm
  - High-performance inference
  - Used for reasoning and refinement
- RTX 3090 Server (10.0.0.3:11434)
  - Ollama
  - Alternative local backend
  - Fallback for PRIMARY
- OpenAI Cloud (api.openai.com)
  - gpt-4o-mini
  - Used for reflection and persona
  - Requires API key
- Fallback Server (10.0.0.41:11435)
  - OpenAI Completions API
  - Emergency backup
  - llama-3.2-8b-instruct
Key Concepts & Design Patterns
1. Dual-Memory Architecture
Project Lyra uses a dual-memory system inspired by human cognition:
Short-Term Memory (Intake):
- Fast, in-memory storage
- Limited capacity (200 messages)
- Immediate context for current conversation
- Circular buffer (FIFO eviction)
- Multi-level summarization
Long-Term Memory (NeoMem):
- Persistent database storage
- Unlimited capacity
- Semantic search via vector embeddings
- Entity-relationship tracking via graph DB
- Cross-session continuity
Why This Matters:
- Short-term memory provides immediate context (last few messages)
- Long-term memory provides semantic understanding (user preferences, past topics)
- Combined, they enable Lyra to be both contextually aware and historically informed
2. Multi-Stage Reasoning Pipeline
Unlike single-shot LLM calls, Lyra uses a 4-stage pipeline for sophisticated responses:
Stage 1: Reflection (Meta-cognition)
- "What is the user really asking?"
- Analyzes intent and conversation direction
- Uses OpenAI for strong reasoning
Stage 2: Reasoning (Draft generation)
- "What's a good answer?"
- Generates initial response
- Uses local llama.cpp for speed/cost
Stage 3: Refinement (Polish)
- "How can this be clearer?"
- Improves clarity and coherence
- Lower temperature for consistency
Stage 4: Persona (Voice)
- "How would Lyra say this?"
- Applies personality and speaking style
- Uses OpenAI for natural language
Benefits:
- Higher quality responses (multiple passes)
- Separation of concerns (reasoning vs. style)
- Backend flexibility (cloud for hard tasks, local for simple ones)
- Transparent thinking (can inspect each stage)
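Condensed, the pipeline reads as a short chain of awaited calls. The sketch below follows the module layout described above, but the build_*_prompt helpers and exact signatures are assumptions, not the actual router.py code:
```python
async def run_pipeline(session_id: str, user_message: str) -> dict:
    context = await collect_context(session_id, user_message)        # Stage 0: context
    identity = LYRA_IDENTITY                                          # Stage 0.5: identity
    await InnerMonologue().process(context)                           # Stage 0.6: observer-only

    reflection = await call_llm(build_reflection_prompt(context),
                                backend="CLOUD", temperature=0.5)     # Stage 1: reflection
    draft = await call_llm(build_reasoning_prompt(identity, reflection, context),
                           backend="PRIMARY", temperature=0.7)        # Stage 2: reasoning
    refined = await call_llm(build_refine_prompt(draft, user_message),
                             backend="PRIMARY", temperature=0.3)      # Stage 3: refinement
    answer = await apply_persona(refined, context)                    # Stage 4: persona

    return {
        "answer": answer,
        "metadata": {"reflection": reflection, "draft": draft,
                     "refined": refined, "stages_completed": 4},
    }
```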
3. Backend Abstraction (LLM Router)
The LLM Router allows Lyra to use multiple LLM backends transparently:
# Same interface, different backends
await call_llm(prompt, backend="PRIMARY") # Local llama.cpp
await call_llm(prompt, backend="CLOUD") # OpenAI
await call_llm(prompt, backend="SECONDARY") # Ollama
Benefits:
- Cost optimization: Use expensive cloud LLMs only when needed
- Performance: Local LLMs for low-latency responses
- Resilience: Fallback to alternative backends on failure
- Experimentation: Easy to swap models/providers
Design Pattern: Strategy Pattern for swappable backends
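What makes the strategy swap possible is a simple backend table keyed by name. The sketch below is driven by the env vars from the configuration section; the exact dict layout in llm_router.py may differ:
```python
import os

BACKEND_CONFIGS = {
    "PRIMARY": {
        "url": os.getenv("PRIMARY_URL", "http://10.0.0.43:8000"),
        "provider": os.getenv("PRIMARY_PROVIDER", "vllm"),
        "model": os.getenv("PRIMARY_MODEL", "/model"),
    },
    "SECONDARY": {
        "url": os.getenv("SECONDARY_URL", "http://10.0.0.3:11434"),
        "provider": os.getenv("SECONDARY_PROVIDER", "ollama"),
        "model": os.getenv("SECONDARY_MODEL", "qwen2.5:7b-instruct-q4_K_M"),
    },
    "CLOUD": {
        "url": os.getenv("OPENAI_URL", "https://api.openai.com/v1"),
        "provider": "openai",
        "model": os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "FALLBACK": {
        "url": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
        "provider": os.getenv("FALLBACK_PROVIDER", "openai_completions"),
        "model": os.getenv("FALLBACK_MODEL", "llama-3.2-8b-instruct"),
    },
}
```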
4. Microservices Architecture
Project Lyra follows microservices principles:
Each service has a single responsibility:
- Relay: Routing and orchestration
- Cortex: Reasoning and response generation
- NeoMem: Long-term memory storage
- UI: User interface
Communication:
- REST APIs (HTTP/JSON)
- Async ingestion (fire-and-forget)
- Docker network isolation
Benefits:
- Independent scaling (scale Cortex without scaling UI)
- Technology diversity (Node.js + Python)
- Fault isolation (Cortex crash doesn't affect NeoMem)
- Easy testing (mock service dependencies)
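Because each service exposes its own health endpoint, a quick liveness sweep across the stack is a few lines of Python (endpoints and ports as documented in this guide):
```python
import requests

CHECKS = {
    "relay": "http://localhost:7078/_health",
    "cortex": "http://localhost:7081/health",
    "neomem": "http://localhost:7077/health",
}

for name, url in CHECKS.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: {'ok' if status == 200 else status}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```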
5. Session-Based State Management
Lyra maintains session-based state for conversation continuity:
# In-memory session storage (Intake)
SESSIONS = {
"session_abc123": {
"buffer": deque([msg1, msg2, ...], maxlen=200),
"created_at": "2025-12-12T10:30:00Z"
}
}
# Persistent session storage (NeoMem)
# Stores all messages + embeddings for semantic search
Session Lifecycle:
- User starts conversation → UI generates session_id
- First message → Cortex creates session in SESSIONS dict
- Subsequent messages → Retrieved from the same session buffer
- Async ingestion → Messages stored in NeoMem for long-term recall
Benefits:
- Conversation continuity within session
- Historical search across sessions
- User can switch sessions (multiple concurrent conversations)
6. Asynchronous Ingestion
Pattern: Separate read path from write path
// Relay: Synchronous read path (fast response)
const response = await fetch('http://cortex:7081/reason');
return response.json(); // Return immediately to user
// Relay: Asynchronous write path (non-blocking)
fetch('http://cortex:7081/ingest', { method: 'POST', ... });
// Don't await, just fire and forget
Benefits:
- Fast user response times (don't wait for database writes)
- Resilient to storage failures (user still gets response)
- Easier scaling (decouple read and write loads)
Trade-off: Eventual consistency (short delay before memory is searchable)
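For comparison, the same read/write split expressed in Python with httpx and asyncio; this mirrors the Relay behaviour but is a sketch, not taken from its code:
```python
import asyncio
import httpx

async def handle_chat(client: httpx.AsyncClient, session_id: str, user_message: str) -> str:
    # Synchronous read path: wait for the answer before replying to the user.
    resp = await client.post(
        "http://cortex:7081/reason",
        json={"session_id": session_id, "user_message": user_message},
    )
    answer = resp.json()["answer"]

    # Asynchronous write path: fire-and-forget ingestion; failures should only be logged.
    asyncio.create_task(client.post(
        "http://cortex:7081/ingest",
        json={"session_id": session_id,
              "user_message": user_message,
              "assistant_message": answer},
    ))
    return answer
```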
7. Deferred Summarization
Intake uses deferred summarization instead of pre-computation:
# BAD: Pre-compute summaries on every message
def add_message(session_id, message):
SESSIONS[session_id].buffer.append(message)
SESSIONS[session_id].L1_summary = summarize(last_1_message)
SESSIONS[session_id].L5_summary = summarize(last_5_messages)
# ... expensive, runs on every message
# GOOD: Compute summaries only when needed
def summarize_context(session_id):
buffer = SESSIONS[session_id].buffer
return {
"L1": summarize(buffer[-1:]), # Only compute when requested
"L5": summarize(buffer[-5:]),
"L10": summarize(buffer[-10:])
}
Benefits:
- Faster message ingestion (no blocking summarization)
- Compute resources used only when needed
- Flexible summary levels (easy to add L15, L50, etc.)
Trade-off: Slight delay the first time summaries are requested for a conversation (cold start)
API Reference
Relay Endpoints
POST /v1/chat/completions
OpenAI-compatible chat endpoint
Request:
{
"messages": [
{"role": "user", "content": "Hello, Lyra!"}
],
"session_id": "session_abc123"
}
Response:
{
"choices": [
{
"message": {
"role": "assistant",
"content": "Hi there! How can I help you today?"
}
}
]
}
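For reference, a minimal Python client for this endpoint (assuming the Relay is exposed on localhost:7078, as in the deployment section):
```python
import requests

resp = requests.post(
    "http://localhost:7078/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello, Lyra!"}],
        "session_id": "session_abc123",
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])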
POST /chat
Lyra-native chat endpoint
Request:
{
"session_id": "session_abc123",
"message": "Hello, Lyra!"
}
Response:
{
"answer": "Hi there! How can I help you today?",
"session_id": "session_abc123"
}
GET /sessions/:id
Retrieve session history
Response:
{
"session_id": "session_abc123",
"messages": [
{"role": "user", "content": "Hello", "timestamp": "..."},
{"role": "assistant", "content": "Hi!", "timestamp": "..."}
],
"created_at": "2025-12-12T10:30:00Z"
}
Cortex Endpoints
POST /reason
Main reasoning pipeline
Request:
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?"
}
Response:
{
"answer": "Final answer with Lyra's personality",
"metadata": {
"reflection": "User seeking deployment guidance...",
"draft": "Initial draft answer...",
"refined": "Polished answer...",
"stages_completed": 4
}
}
POST /ingest
Ingest message exchange into Intake
Request:
{
"session_id": "session_abc123",
"user_message": "How do I deploy ML models?",
"assistant_message": "Here's how..."
}
Response:
{
"status": "ingested",
"session_id": "session_abc123",
"message_count": 24
}
GET /debug/sessions
Inspect in-memory SESSIONS state
Response:
{
"session_abc123": {
"message_count": 24,
"created_at": "2025-12-12T10:30:00Z",
"last_message_at": "2025-12-12T11:15:00Z"
},
"session_xyz789": {
"message_count": 5,
"created_at": "2025-12-12T11:00:00Z",
"last_message_at": "2025-12-12T11:10:00Z"
}
}
NeoMem Endpoints
POST /memories
Create new memory
Request:
{
"messages": [
{"role": "user", "content": "I prefer Docker for deployments"},
{"role": "assistant", "content": "Noted! I'll keep that in mind."}
],
"session_id": "session_abc123"
}
Response:
{
"status": "created",
"memory_id": "mem_456def",
"extracted_entities": ["Docker", "deployments"]
}
GET /search
Semantic search for memories
Query Parameters:
- query (required): Search query
- limit (optional, default=5): Max results
Request:
GET /search?query=deployment%20preferences&limit=5
Response:
{
"results": [
{
"content": "User prefers Docker for deployments",
"score": 0.92,
"timestamp": "2025-12-10T14:30:00Z",
"session_id": "session_abc123"
},
{
"content": "Previously deployed models on AWS ECS",
"score": 0.87,
"timestamp": "2025-12-09T09:15:00Z",
"session_id": "session_abc123"
}
]
}
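A minimal client call for the search endpoint (host and port taken from the compose file; field names follow the response shown above):
```python
import requests

resp = requests.get(
    "http://localhost:7077/search",
    params={"query": "deployment preferences", "limit": 5},
    timeout=30,
)
for hit in resp.json()["results"]:
    print(f'{hit["score"]:.2f}  {hit["content"]}')
```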
GET /memories
List all memories
Query Parameters:
- offset (optional, default=0): Pagination offset
- limit (optional, default=50): Max results
Response:
{
"memories": [
{
"id": "mem_123abc",
"content": "User prefers Docker...",
"created_at": "2025-12-10T14:30:00Z"
}
],
"total": 147,
"offset": 0,
"limit": 50
}
Deployment & Operations
Docker Compose Deployment
File: /docker-compose.yml
version: '3.8'
services:
# === ACTIVE SERVICES ===
relay:
build: ./core/relay
ports:
- "7078:7078"
environment:
- CORTEX_URL=http://cortex:7081
- NEOMEM_URL=http://neomem:7077
depends_on:
- cortex
networks:
- lyra_net
cortex:
build: ./cortex
ports:
- "7081:7081"
environment:
- NEOMEM_URL=http://neomem:7077
- PRIMARY_URL=${PRIMARY_URL}
- OPENAI_API_KEY=${OPENAI_API_KEY}
command: uvicorn main:app --host 0.0.0.0 --port 7081 --workers 1
depends_on:
- neomem
networks:
- lyra_net
neomem:
build: ./neomem
ports:
- "7077:7077"
environment:
- POSTGRES_HOST=neomem-postgres
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- NEO4J_URI=${NEO4J_URI}
depends_on:
- neomem-postgres
- neomem-neo4j
networks:
- lyra_net
ui:
image: nginx:alpine
ports:
- "8081:80"
volumes:
- ./core/ui:/usr/share/nginx/html:ro
networks:
- lyra_net
# === DATABASES ===
neomem-postgres:
image: ankane/pgvector:v0.5.1
environment:
- POSTGRES_USER=${POSTGRES_USER}
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=${POSTGRES_DB}
volumes:
- ./volumes/postgres_data:/var/lib/postgresql/data
ports:
- "5432:5432"
networks:
- lyra_net
neomem-neo4j:
image: neo4j:5
environment:
- NEO4J_AUTH=${NEO4J_USER}/${NEO4J_PASSWORD}
volumes:
- ./volumes/neo4j_data:/data
ports:
- "7474:7474" # Browser UI
- "7687:7687" # Bolt
networks:
- lyra_net
networks:
lyra_net:
driver: bridge
Starting the System
# 1. Clone repository
git clone https://github.com/yourusername/project-lyra.git
cd project-lyra
# 2. Configure environment
cp .env.example .env
# Edit .env with your LLM backend URLs and API keys
# 3. Start all services
docker-compose up -d
# 4. Check health
curl http://localhost:7078/_health
curl http://localhost:7081/health
curl http://localhost:7077/health
# 5. Open UI
open http://localhost:8081
Monitoring & Logs
# View all logs
docker-compose logs -f
# View specific service
docker-compose logs -f cortex
# Check resource usage
docker stats
# Inspect Cortex sessions
curl http://localhost:7081/debug/sessions
# Check NeoMem memories
curl http://localhost:7077/memories?limit=10
Scaling Considerations
Current Constraints:
- Single Cortex worker required (in-memory SESSIONS dict)
  - Solution: Migrate SESSIONS to Redis or PostgreSQL
- In-memory session storage in Relay
  - Solution: Use Redis for session persistence
- No load balancing (single instance of each service)
  - Solution: Add nginx reverse proxy + multiple Cortex instances
Horizontal Scaling Plan:
# Future: Redis-backed session storage
cortex:
build: ./cortex
command: uvicorn main:app --workers 4 # Multi-worker
environment:
- REDIS_URL=redis://redis:6379
depends_on:
- redis
redis:
image: redis:alpine
ports:
- "6379:6379"
Backup Strategy
# Backup PostgreSQL (NeoMem vectors)
docker exec neomem-postgres pg_dump -U neomem neomem > backup_postgres.sql
# Backup Neo4j (NeoMem graph)
docker exec neomem-neo4j neo4j-admin dump --to=/data/backup.dump
# Backup Intake sessions (manual export)
curl http://localhost:7081/debug/sessions > backup_sessions.json
Known Issues & Constraints
Critical Constraints
1. Single-Worker Requirement (Cortex)
Issue: Cortex must run with --workers 1 to maintain SESSIONS state
Impact: Limited horizontal scalability
Workaround: None currently
Fix: Migrate SESSIONS to Redis or PostgreSQL
Priority: High (blocking scalability)
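For this constraint, a sketch of what a Redis-backed buffer could look like, assuming redis-py; the key layout and function names are illustrative, not existing code:
```python
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
BUFFER_MAX = 200  # same limit as the in-memory deque

def add_exchange(session_id: str, user_msg: str, assistant_msg: str) -> None:
    key = f"lyra:session:{session_id}:buffer"
    pipe = r.pipeline()
    pipe.rpush(key, json.dumps({"role": "user", "content": user_msg}))
    pipe.rpush(key, json.dumps({"role": "assistant", "content": assistant_msg}))
    pipe.ltrim(key, -BUFFER_MAX, -1)  # keep only the newest 200 entries
    pipe.execute()

def get_messages(session_id: str) -> list[dict]:
    key = f"lyra:session:{session_id}:buffer"
    return [json.loads(m) for m in r.lrange(key, 0, -1)]
```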
2. In-Memory Session Storage (Relay)
Issue: Sessions stored in Node.js process memory
Impact: Lost on restart, no persistence
Workaround: None currently
Fix: Use Redis or database
Priority: Medium (acceptable for demo)
Non-Critical Issues
3. RAG Service Disabled
Status: Built but commented out in docker-compose.yml
Impact: No RAG-based long-term knowledge retrieval
Workaround: NeoMem provides semantic search
Fix: Re-enable and integrate RAG service
Priority: Low (NeoMem sufficient for now)
4. Partial NeoMem Integration
Status: Search implemented, async ingestion planned
Impact: Memories not automatically saved
Workaround: Manual POST to /memories
Fix: Complete async ingestion in Relay
Priority: Medium (planned feature)
5. Inner Monologue Observer-Only
Status: Stage 0.6 runs but output not used
Impact: No adaptive response based on monologue
Workaround: None (future feature)
Fix: Integrate monologue output into pipeline
Priority: Low (experimental feature)
Fixed Issues (v0.5.2)
✅ LLM Router Blocking - Migrated from requests to httpx for async
✅ Session ID Case Mismatch - Standardized to session_id
✅ Missing Backend Parameter - Added to intake summarization
Deprecated Components
Location: /DEPRECATED_FILES.md
- Standalone Intake Service - Now embedded in Cortex
- Old Relay Backup - Replaced by current Relay
- Persona Sidecar - Built but unused (dynamic persona loading)
Advanced Topics
Custom Prompt Engineering
Each stage uses carefully crafted prompts:
Reflection Prompt Example:
REFLECTION_PROMPT = """
You are Lyra's reflective awareness layer.
Your job is to analyze the user's message and conversation context
to understand their true intent and needs.
User message: {user_message}
Recent context:
{intake_L10_summary}
Long-term context:
{neomem_top_3_memories}
Provide concise meta-awareness notes:
- What is the user's underlying intent?
- What topics/themes are emerging?
- What depth of response is appropriate?
- Are there any implicit questions or concerns?
Keep notes brief (3-5 sentences). Focus on insight, not description.
"""
Extending the Pipeline
Adding Stage 5 (Fact-Checking):
# /cortex/reasoning/factcheck.py
async def factcheck_answer(answer: str, context: dict) -> dict:
"""
Stage 5: Verify factual claims in answer.
Returns:
{
"verified": bool,
"flagged_claims": list,
"corrected_answer": str
}
"""
prompt = f"""
Review this answer for factual accuracy:
{answer}
Flag any claims that seem dubious or need verification.
Provide corrected version if needed.
"""
result = await call_llm(prompt, backend="CLOUD", temperature=0.1)
return parse_factcheck_result(result)
# Update router.py to include Stage 5
async def reason_endpoint(request):
# ... existing stages ...
# Stage 5: Fact-checking
factcheck_result = await factcheck_answer(final_answer, context)
if not factcheck_result["verified"]:
final_answer = factcheck_result["corrected_answer"]
return {"answer": final_answer}
Custom LLM Backend Integration
Adding Anthropic Claude:
# /cortex/llm/llm_router.py
BACKEND_CONFIGS = {
# ... existing backends ...
"CLAUDE": {
"url": "https://api.anthropic.com/v1",
"provider": "anthropic",
"model": "claude-3-5-sonnet-20241022",
"api_key": os.getenv("ANTHROPIC_API_KEY")
}
}
# Add provider-specific logic
elif backend_config["provider"] == "anthropic":
headers = {
"x-api-key": api_key,
"anthropic-version": "2023-06-01"
}
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": max_tokens,
"temperature": temperature
}
response = await httpx_client.post(
f"{url}/messages",
json=payload,
headers=headers,
timeout=120
)
return response.json()["content"][0]["text"]
Performance Optimization
Caching Strategies:
# /cortex/utils/cache.py
import hashlib

# Simple in-process cache; only deterministic calls (temperature=0) are cached.
_LLM_CACHE: dict[tuple[str, str], str] = {}

def _cache_key(prompt: str, backend: str) -> tuple[str, str]:
    return (hashlib.md5(prompt.encode()).hexdigest(), backend)

def get_cached_response(prompt: str, backend: str) -> str | None:
    """Return a cached LLM response for an identical prompt, if any."""
    return _LLM_CACHE.get(_cache_key(prompt, backend))

def store_cached_response(prompt: str, backend: str, response: str) -> None:
    _LLM_CACHE[_cache_key(prompt, backend)] = response

# Usage in llm_router.py
async def call_llm(prompt, backend, temperature=0.7, max_tokens=512):
    if temperature == 0:
        cached = get_cached_response(prompt, backend)
        if cached is not None:
            return cached
    # ... normal LLM call, then store_cached_response(prompt, backend, response) when temperature == 0 ...
Database Query Optimization:
# /neomem/neomem/database.py
# BAD: Load all memories, then filter
def search_memories(query):
all_memories = db.execute("SELECT * FROM memories")
# Expensive in-memory filtering
return [m for m in all_memories if similarity(m, query) > 0.8]
# GOOD: Use database indexes and LIMIT
def search_memories(query, limit=5):
query_embedding = embed(query)
return db.execute("""
SELECT * FROM memories
WHERE embedding <-> %s < 0.2 -- pgvector L2 distance (use <=> for cosine distance)
ORDER BY embedding <-> %s
LIMIT %s
""", (query_embedding, query_embedding, limit))
Conclusion
Project Lyra is a sophisticated, multi-layered AI companion system that addresses the fundamental limitation of chatbot amnesia through:
- Dual-memory architecture (short-term Intake + long-term NeoMem)
- Multi-stage reasoning pipeline (Reflection → Reasoning → Refinement → Persona)
- Flexible multi-backend LLM support (cloud + local with fallback)
- Microservices design for scalability and maintainability
- Modern web UI with session management
The system runs end to end with error handling, logging, and health monitoring; the constraints listed under Known Issues (single-worker Cortex, in-memory Relay sessions) are the main gaps remaining before production-scale deployment.
Quick Reference
Service Ports
- UI: 8081 (Browser interface)
- Relay: 7078 (Main orchestrator)
- Cortex: 7081 (Reasoning engine)
- NeoMem: 7077 (Long-term memory)
- PostgreSQL: 5432 (Vector storage)
- Neo4j: 7474 (Browser), 7687 (Bolt)
Key Files
- Main Entry: /core/relay/server.js
- Reasoning Pipeline: /cortex/router.py
- LLM Router: /cortex/llm/llm_router.py
- Short-term Memory: /cortex/intake/intake.py
- Long-term Memory: /neomem/neomem/
- Personality: /cortex/persona/identity.py
Important Commands
# Start system
docker-compose up -d
# View logs
docker-compose logs -f cortex
# Debug sessions
curl http://localhost:7081/debug/sessions
# Health check
curl http://localhost:7078/_health
# Search memories
curl "http://localhost:7077/search?query=deployment&limit=5"
Document Version: 1.0 Last Updated: 2025-12-13 Maintained By: Project Lyra Team