Project Lyra - README v0.5.1
Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using NeoMem + Relay + Cortex, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.
Current Version: v0.5.1 (2025-12-11)
Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesiac and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator, all with her own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.
Architecture Overview
Project Lyra runs as a single docker-compose deployment with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:
Core Services
1. Relay (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: POST /v1/chat/completions
- Internal endpoint: POST /chat
- Routes messages through the Cortex reasoning pipeline
- Manages async calls to NeoMem and Cortex ingest
2. UI (Static HTML)
- Browser-based chat interface with cyberpunk theme
- Connects to Relay
- Saves and loads sessions
- OpenAI-compatible message format
3. NeoMem (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
- RESTful API: /memories, /search - semantic memory updates and retrieval
- No external SDK dependencies - fully local
Reasoning Layer
4. Cortex (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- Includes embedded Intake module (no separate service as of v0.5.1)
- 4-Stage Processing:
- Reflection - Generates meta-awareness notes about conversation
- Reasoning - Creates initial draft answer using context
- Refinement - Polishes and improves the draft
- Persona - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context via internal Python imports
- Flexible LLM router supporting multiple backends via HTTP
- Endpoints:
  - POST /reason - main reasoning pipeline
  - POST /ingest - receives conversation exchanges from Relay
  - GET /health - service health check
  - GET /debug/sessions - inspect in-memory SESSIONS state
  - GET /debug/summary - test summarization for a session
5. Intake (Python Module) - Embedded in Cortex
- No longer a standalone service - runs as Python module inside Cortex container
- Short-term memory management with session-based circular buffer
- In-memory SESSIONS dictionary: session_id → {buffer: deque(maxlen=200), created_at: timestamp}
- Multi-level summarization (L1/L5/L10/L20/L30) produced by summarize_context()
- Deferred summarization - actual summary generation happens during the /reason call
- Internal Python API:
  - add_exchange_internal(exchange) - direct function call from Cortex
  - summarize_context(session_id, exchanges) - async LLM-based summarization
  - SESSIONS - module-level global state (requires a single Uvicorn worker)
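For orientation, here is a minimal sketch of the buffer handling described above. The SESSIONS layout and public names come from this README; the _get_session() helper and the exact exchange fields are illustrative assumptions, not the actual cortex/intake/intake.py.

# Sketch of the embedded Intake state - layout and public names per this README;
# _get_session() and the exchange fields are assumptions.
import time
from collections import deque

# session_id → {"buffer": deque(maxlen=200), "created_at": timestamp}
# Module-level global: only valid with a single Uvicorn worker.
SESSIONS: dict[str, dict] = {}

def _get_session(session_id: str) -> dict:
    """Create the session entry on first use."""
    if session_id not in SESSIONS:
        SESSIONS[session_id] = {"buffer": deque(maxlen=200), "created_at": time.time()}
    return SESSIONS[session_id]

def add_exchange_internal(exchange: dict) -> None:
    """Called directly by Cortex after /ingest - no HTTP hop involved."""
    session = _get_session(exchange["session_id"])
    session["buffer"].append(exchange)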
LLM Backends (HTTP-based)
All LLM communication is done via HTTP APIs:
- PRIMARY: llama.cpp server (http://10.0.0.44:8080) - AMD MI50 GPU backend
- SECONDARY: Ollama server (http://10.0.0.3:11434) - RTX 3090 backend, model qwen2.5:7b-instruct-q4_K_M
- CLOUD: OpenAI API (https://api.openai.com/v1) - cloud-based models, model gpt-4o-mini
- FALLBACK: Local backup (http://10.0.0.41:11435) - emergency fallback, model llama-3.2-8b-instruct
Each module can be configured to use a different backend via environment variables.
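For illustration, a single chat call to one of these backends might look like the sketch below. It assumes the backend exposes an OpenAI-style /v1/chat/completions route (OpenAI does; recent llama.cpp server and Ollama builds offer compatible endpoints) and that the base URL does not already include the /v1 path; chat_completion() is not a function from the Lyra codebase - the real routing lives in Cortex's llm_router.py.

# Illustrative HTTP call only - not the project's llm_router implementation.
import os
import asyncio
import httpx

async def chat_completion(base_url: str, model: str, prompt: str, api_key: str | None = None) -> str:
    """POST an OpenAI-style chat completion to an HTTP backend (assumed /v1 path)."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(f"{base_url}/v1/chat/completions", json=payload, headers=headers)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

# Example against the PRIMARY backend defined in .env:
# asyncio.run(chat_completion(os.environ["LLM_PRIMARY_URL"], os.environ["LLM_PRIMARY_MODEL"], "Hello Lyra"))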
Data Flow Architecture (v0.5.1)
Normal Message Flow:
User (UI) → POST /v1/chat/completions
↓
Relay (7078)
↓ POST /reason
Cortex (7081)
↓ (internal Python call)
Intake module → summarize_context()
↓
Cortex processes (4 stages):
1. reflection.py → meta-awareness notes (CLOUD backend)
2. reasoning.py → draft answer (PRIMARY backend)
3. refine.py → refined answer (PRIMARY backend)
4. persona/speak.py → Lyra personality (CLOUD backend)
↓
Returns persona answer to Relay
↓
Relay → POST /ingest (async)
↓
Cortex → add_exchange_internal() → SESSIONS buffer
↓
Relay → NeoMem /memories (async, planned)
↓
Relay → UI (returns final response)
Cortex 4-Stage Reasoning Pipeline:
1. Reflection (reflection.py) - Cloud LLM (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"
2. Reasoning (reasoning.py) - Primary LLM (llama.cpp)
   - Retrieves short-term context from the Intake module
   - Creates the initial draft answer
   - Integrates context, reflection notes, and the user prompt
3. Refinement (refine.py) - Primary LLM (llama.cpp)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency
4. Persona (speak.py) - Cloud LLM (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to the user
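Put together, the stage ordering and backend split can be condensed into a short sketch; llm() is a stand-in for the router-backed call in llm_router.py, and the prompt wording is purely illustrative.

# Stage order and backend assignment are from this README; llm() is a placeholder
# for the HTTP call performed by the LLM router, and prompts are illustrative.
async def llm(backend: str, prompt: str) -> str:
    """Stand-in for the router-backed HTTP call (PRIMARY / SECONDARY / CLOUD / FALLBACK)."""
    ...

async def reason_pipeline(user_prompt: str, short_term_context: str) -> str:
    # 1. Reflection (CLOUD): meta-awareness notes - "what is the user really asking?"
    notes = await llm("CLOUD", f"Reflect on this request:\n{user_prompt}")
    # 2. Reasoning (PRIMARY): draft answer from Intake context + reflection notes
    draft = await llm("PRIMARY", f"Context:\n{short_term_context}\n\nNotes:\n{notes}\n\nAnswer:\n{user_prompt}")
    # 3. Refinement (PRIMARY): polish for clarity, coherence, and factual consistency
    refined = await llm("PRIMARY", f"Improve this draft without changing its meaning:\n{draft}")
    # 4. Persona (CLOUD): apply Lyra's voice before returning the answer to Relay
    return await llm("CLOUD", f"Rewrite in Lyra's voice:\n{refined}")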
Features
Core Services
Relay:
- Main orchestrator and message router
- OpenAI-compatible endpoint: POST /v1/chat/completions
- Internal endpoint: POST /chat
- Health check: GET /_health
- Async non-blocking calls to Cortex
- Shared request handler for code reuse
- Comprehensive error handling
NeoMem (Memory Engine):
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (/memories, /search)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: neomem-api (port 7077)
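Because the API is Mem0-compatible, a store-and-search round trip can be sketched as below; the payload fields (messages, user_id, query) follow Mem0 OSS conventions and are assumptions here, not a NeoMem spec.

# Illustrative NeoMem client - field names are assumptions based on Mem0 OSS.
import requests

NEOMEM = "http://localhost:7077"

# Store an exchange as long-term memory
requests.post(f"{NEOMEM}/memories", json={
    "messages": [
        {"role": "user", "content": "The MI50 box at 10.0.0.44 runs llama.cpp"},
        {"role": "assistant", "content": "Got it - that's the PRIMARY backend."},
    ],
    "user_id": "demo-user",
}).raise_for_status()

# Retrieve semantically related memories later
hits = requests.post(f"{NEOMEM}/search", json={
    "query": "Which machine runs llama.cpp?",
    "user_id": "demo-user",
}).json()
print(hits)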
UI:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support
Reasoning Layer
Cortex (v0.5.1):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- Embedded Intake module for short-term context
- Endpoints: /reason, /ingest, /health, /debug/sessions, /debug/summary
- Lenient error handling - never fails the chat pipeline
Intake (Embedded Module):
- Architectural change: Now runs as Python module inside Cortex container
- In-memory SESSIONS management (session_id → buffer)
- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
- Deferred summarization strategy - summaries generated during the /reason call
- bg_summarize() is a logging stub - actual work is deferred
- Single-worker constraint: SESSIONS requires a single Uvicorn worker or Redis/shared storage
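The deferred, multi-level behavior can be sketched roughly as follows. The L1-L30 labels come from the list above; how each level maps to a prompt, and the llm_summarize() helper, are assumptions for illustration.

# Deferred multi-level summarization sketch - level labels per this README;
# the level-to-prompt mapping and llm_summarize() are assumptions.
LEVELS = {
    "L1": "one sentence (ultra-short)",
    "L5": "a few sentences (short)",
    "L10": "one paragraph (medium)",
    "L20": "several paragraphs (detailed)",
    "L30": "a near-complete recap (full)",
}

def bg_summarize(session_id: str) -> None:
    """Logging stub only - real summarization is deferred to the /reason call."""
    print(f"[intake] summarization for {session_id} deferred until /reason")

async def llm_summarize(text: str, detail: str) -> str:
    """Placeholder for the router-backed LLM call (INTAKE_LLM backend)."""
    ...

async def summarize_context(session_id: str, exchanges: list[dict]) -> dict[str, str]:
    """Build one summary per level from the buffered exchanges, only when asked."""
    transcript = "\n".join(f"U: {e['user_msg']}\nA: {e['assistant_msg']}" for e in exchanges)
    return {level: await llm_summarize(transcript, detail) for level, detail in LEVELS.items()}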
LLM Router:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for llama.cpp, Ollama, OpenAI, custom endpoints
- Per-module backend preferences:
  - CORTEX_LLM=SECONDARY (Ollama for reasoning)
  - INTAKE_LLM=PRIMARY (llama.cpp for summarization)
  - SPEAK_LLM=OPENAI (cloud for persona)
  - NEOMEM_LLM=PRIMARY (llama.cpp for memory operations)
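A minimal sketch of how a per-module preference such as SPEAK_LLM=OPENAI could be resolved to a URL/model pair; resolve_backend() is an illustrative helper, not the actual llm_router.py API.

# Backend selection sketch mirroring the env convention above.
import os

def resolve_backend(module: str) -> tuple[str, str]:
    """Map e.g. CORTEX_LLM=SECONDARY to (LLM_SECONDARY_URL, LLM_SECONDARY_MODEL)."""
    choice = os.environ.get(f"{module}_LLM", "PRIMARY").upper()
    return os.environ[f"LLM_{choice}_URL"], os.environ[f"LLM_{choice}_MODEL"]

# Usage: url, model = resolve_backend("SPEAK")   # → OpenAI settings when SPEAK_LLM=OPENAI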
Beta Lyrae (RAG Memory DB) - Currently Disabled
- RAG Knowledge DB - Beta Lyrae (sheliak)
- This module implements the Retrieval-Augmented Generation (RAG) layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
- Status: Disabled in docker-compose.yml (v0.5.1)
The system uses:
- ChromaDB for persistent vector storage
- OpenAI Embeddings (text-embedding-3-small) for semantic similarity
- FastAPI (port 7090) for the /rag/search REST endpoint
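A rough sketch of what the disabled /rag/search service looks like with those pieces; the collection name, request fields, and response shape are assumptions - only the ChromaDB + OpenAI-embedding + FastAPI combination is taken from this README.

# /rag/search sketch - not the actual rag_api.py; names and shapes are assumptions.
import os
import chromadb
from chromadb.utils import embedding_functions
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
)
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection("chatlogs", embedding_function=ef)

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None   # e.g. {"category": "lyra"}
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    hits = collection.query(query_texts=[req.query], n_results=req.n_results, where=req.where)
    return {"documents": hits["documents"], "metadatas": hits["metadatas"]}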
Directory Layout:
rag/
├── rag_chat_import.py # imports JSON chat logs
├── rag_docs_import.py # (planned) PDF/EPUB/manual importer
├── rag_build.py # legacy single-folder builder
├── rag_query.py # command-line query helper
├── rag_api.py # FastAPI service providing /rag/search
├── chromadb/ # persistent vector store
├── chatlogs/ # organized source data
│ ├── poker/
│ ├── work/
│ ├── lyra/
│ ├── personal/
│ └── ...
└── import.log # progress log for batch runs
OpenAI chatlog importer features:
- Recursive folder indexing with category detection from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in the background with nohup … &
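The chunking and deduplication rules above amount to roughly the following; the function names are illustrative, not taken from rag_chat_import.py.

# Chunk + dedup sketch - 5,000-char slices and SHA-1 of file + chunk, per the list above.
import hashlib

CHUNK_SIZE = 5_000

def chunk_message(text: str) -> list[str]:
    """Split a long message into 5,000-character slices."""
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

def chunk_id(file_path: str, chunk: str) -> str:
    """Stable SHA-1 id so re-importing the same file skips existing chunks."""
    return hashlib.sha1(f"{file_path}::{chunk}".encode("utf-8")).hexdigest()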
Docker Deployment
All services run in a single docker-compose stack with the following containers:
Active Services:
- neomem-postgres - PostgreSQL with pgvector extension (port 5432)
- neomem-neo4j - Neo4j graph database (ports 7474, 7687)
- neomem-api - NeoMem memory service (port 7077)
- relay - Main orchestrator (port 7078)
- cortex - Reasoning engine with embedded Intake (port 7081)
Disabled Services:
- intake - No longer needed (embedded in Cortex as of v0.5.1)
- rag - Beta Lyrae RAG service (port 7090) - currently disabled
All containers communicate via the lyra_net Docker bridge network.
External LLM Services
The following LLM backends are accessed via HTTP (not part of docker-compose):
- llama.cpp Server (http://10.0.0.44:8080)
  - AMD MI50 GPU-accelerated inference
  - Primary backend for reasoning and refinement stages
  - Model path: /model
- Ollama Server (http://10.0.0.3:11434)
  - RTX 3090 GPU-accelerated inference
  - Secondary/configurable backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- OpenAI API (https://api.openai.com/v1)
  - Cloud-based inference
  - Used for reflection and persona stages
  - Model: gpt-4o-mini
- Fallback Server (http://10.0.0.41:11435)
  - Emergency backup endpoint
  - Local llama-3.2-8b-instruct model
Version History
v0.5.1 (2025-12-11) - Current Release
Critical Intake Integration Fixes:
- ✅ Fixed bg_summarize() NameError preventing SESSIONS persistence
- ✅ Fixed /ingest endpoint unreachable code
- ✅ Added cortex/intake/__init__.py for proper package structure
- ✅ Added diagnostic logging to verify SESSIONS singleton behavior
- ✅ Added /debug/sessions and /debug/summary endpoints
- ✅ Documented single-worker constraint in Dockerfile
- ✅ Implemented lenient error handling (never fails chat pipeline)
- ✅ Intake now embedded in Cortex - no longer standalone service
Architecture Changes:
- Intake module runs inside Cortex container as pure Python import
- No HTTP calls between Cortex and Intake (internal function calls)
- SESSIONS persist correctly in Uvicorn worker
- Deferred summarization strategy (summaries generated during /reason)
v0.5.0 (2025-11-28)
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (/v1/chat/completions)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package __init__.py files
- ✅ End-to-end message flow verified and working
Infrastructure v1.0.0 (2025-11-26)
- Consolidated 9 scattered .env files into a single source of truth
- Multi-backend LLM strategy implemented
- Docker Compose consolidation
- Created .env.example security templates
v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- LLM router with multi-backend support
- Major architectural restructuring
v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop
Known Issues (v0.5.1)
Critical (Fixed in v0.5.1)
- Intake SESSIONS not persisting - ✅ FIXED
- bg_summarize() NameError - ✅ FIXED
- /ingest endpoint unreachable code - ✅ FIXED
Non-Critical
- Session management endpoints not fully implemented in Relay
- RAG service currently disabled in docker-compose.yml
- NeoMem integration in Relay not yet active (planned for v0.5.2)
Operational Notes
- Single-worker constraint: Cortex must run with single Uvicorn worker to maintain SESSIONS state
- Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
- Diagnostic endpoints (/debug/sessions, /debug/summary) available for troubleshooting
Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs for tracing
- Comprehensive health checks across all services
- NeoMem integration in Relay
Quick Start
Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)
Setup
1. Copy .env.example to .env and configure your LLM backend URLs and API keys:

   # Required: Configure at least one LLM backend
   LLM_PRIMARY_URL=http://10.0.0.44:8080      # llama.cpp
   LLM_SECONDARY_URL=http://10.0.0.3:11434    # Ollama
   OPENAI_API_KEY=sk-...                      # OpenAI

2. Start all services with docker-compose:

   docker-compose up -d

3. Check service health:

   # Relay health
   curl http://localhost:7078/_health

   # Cortex health
   curl http://localhost:7081/health

   # NeoMem health
   curl http://localhost:7077/health

4. Access the UI at http://localhost:7078
Test
Test Relay → Cortex pipeline:
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello Lyra!"}],
"session_id": "test"
}'
Test Cortex /ingest endpoint:
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
Inspect SESSIONS state:
curl http://localhost:7081/debug/sessions
Get summary for a session:
curl "http://localhost:7081/debug/summary?session_id=test"
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
Environment Variables
LLM Backend Configuration
Backend URLs (Full API endpoints):
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_PRIMARY_MODEL=/model
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
Module-specific backend selection:
CORTEX_LLM=SECONDARY # Use Ollama for reasoning
INTAKE_LLM=PRIMARY # Use llama.cpp for summarization
SPEAK_LLM=OPENAI # Use OpenAI for persona
NEOMEM_LLM=PRIMARY # Use llama.cpp for memory
UI_LLM=OPENAI # Use OpenAI for UI
RELAY_LLM=PRIMARY # Use llama.cpp for relay
Database Configuration
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
Service URLs (Internal Docker Network)
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
Feature Flags
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
For complete environment variable reference, see ENVIRONMENT_VARIABLES.md.
Documentation
- CHANGELOG.md - Detailed version history
- PROJECT_SUMMARY.md - Comprehensive project overview for AI context
- ENVIRONMENT_VARIABLES.md - Environment variable reference
- DEPRECATED_FILES.md - Deprecated files and migration guide
Troubleshooting
SESSIONS not persisting
Symptom: Intake buffer always shows 0 exchanges, summaries always empty.
Solution (Fixed in v0.5.1):
- Ensure cortex/intake/__init__.py exists
- Check Cortex logs for the [Intake Module Init] message showing the SESSIONS object ID
- Verify single-worker mode (Dockerfile: uvicorn main:app --workers 1)
- Use the /debug/sessions endpoint to inspect current state
Cortex connection errors
Symptom: Relay can't reach Cortex, 502 errors.
Solution:
- Verify the Cortex container is running: docker ps | grep cortex
- Check Cortex health: curl http://localhost:7081/health
- Verify environment variables: CORTEX_REASON_URL=http://cortex:7081/reason
- Check the docker network: docker network inspect lyra_net
LLM backend timeouts
Symptom: Reasoning stage hangs or times out.
Solution:
- Verify LLM backend is running and accessible
- Check LLM backend health: curl http://10.0.0.44:8080/health
- Increase the timeout in llm_router.py if using slow models
- Check logs for specific backend errors
License
NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). © 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
Built with Claude Code
Integration Notes
- NeoMem API is compatible with Mem0 OSS endpoints (/memories, /search)
- All services communicate via Docker internal networking on the lyra_net bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in .env
- Intake module is imported internally by Cortex (no HTTP communication)
- SESSIONS state is maintained in-memory within Cortex container
Beta Lyrae - RAG Memory System (Currently Disabled)
Note: The RAG service is currently disabled in docker-compose.yml
Requirements
- Python 3.10+
- Dependencies: chromadb, openai, tqdm, python-dotenv, fastapi, uvicorn
- Persistent storage: ./chromadb or /mnt/data/lyra_rag_db
Setup
1. Import chat logs (must be in OpenAI message format):

   python3 rag/rag_chat_import.py

2. Build and start the RAG API server:

   cd rag
   python3 rag_build.py
   uvicorn rag_api:app --host 0.0.0.0 --port 7090

3. Query the RAG system:

   curl -X POST http://127.0.0.1:7090/rag/search \
     -H "Content-Type: application/json" \
     -d '{
       "query": "What is the current state of Cortex?",
       "where": {"category": "lyra"}
     }'
Development Notes
Cortex Architecture (v0.5.1)
- Cortex contains the embedded Intake module at cortex/intake/
- Intake is imported as: from intake.intake import add_exchange_internal, SESSIONS
- SESSIONS is a module-level global dictionary (singleton pattern)
- Single-worker constraint required to maintain SESSIONS state
- Diagnostic endpoints available for debugging: /debug/sessions, /debug/summary
Adding New LLM Backends
1. Add the backend URL to .env:

   LLM_CUSTOM_URL=http://your-backend:port
   LLM_CUSTOM_MODEL=model-name

2. Configure the module to use the new backend:

   CORTEX_LLM=CUSTOM

3. Restart the Cortex container:

   docker-compose restart cortex
Debugging Tips
- Enable verbose logging: VERBOSE_DEBUG=true in .env
- Check Cortex logs: docker logs cortex -f
- Inspect SESSIONS: curl http://localhost:7081/debug/sessions
- Test summarization: curl "http://localhost:7081/debug/summary?session_id=test"
- Check Relay logs: docker logs relay -f
- Monitor the Docker network: docker network inspect lyra_net