Project Lyra - README v0.5.0
Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat through NeoMem + Relay + Cortex, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.
Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with its own executive function. Say something in passing, and Lyra remembers it, then reminds you of it later.
Architecture Overview
Project Lyra runs as a single docker-compose deployment with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:
Core Services
1. Relay (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat` - routes messages through the Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem
2. UI (Static HTML)
- Browser-based chat interface with cyberpunk theme
- Connects to Relay
- Saves and loads sessions
- OpenAI-compatible message format
3. NeoMem (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
- RESTful API: `/memories`, `/search` - semantic memory updates and retrieval
- No external SDK dependencies - fully local
Reasoning Layer
4. Cortex (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- 4-Stage Processing:
- Reflection - Generates meta-awareness notes about conversation
- Reasoning - Creates initial draft answer using context
- Refinement - Polishes and improves the draft
- Persona - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends via HTTP
5. Intake v0.2 (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level simple summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- API Endpoints (see the client sketch below):
  - `POST /add_exchange` - add a conversation exchange
  - `GET /summaries?session_id={id}` - retrieve the session summary
  - `POST /close_session/{id}` - close and clean up the session
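A minimal Python client for these endpoints might look like the sketch below; the JSON body for `/add_exchange` (the `session_id`, `user`, and `assistant` fields) is an assumption, so check the Intake source for the authoritative schema.

```python
# Hypothetical Intake client sketch -- the /add_exchange payload fields are
# assumptions, not taken from the Intake source.
import requests

INTAKE_URL = "http://localhost:7080"

def add_exchange(session_id: str, user_msg: str, assistant_msg: str) -> None:
    # Record one user/assistant exchange in the session buffer.
    requests.post(
        f"{INTAKE_URL}/add_exchange",
        json={"session_id": session_id, "user": user_msg, "assistant": assistant_msg},
        timeout=10,
    ).raise_for_status()

def get_summary(session_id: str) -> dict:
    # Retrieve the current rolling summary for a session.
    resp = requests.get(
        f"{INTAKE_URL}/summaries", params={"session_id": session_id}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()
```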
LLM Backends (HTTP-based)
All LLM communication is done via HTTP APIs:
- PRIMARY: vLLM server (http://10.0.0.43:8000) - AMD MI50 GPU backend
- SECONDARY: Ollama server (http://10.0.0.3:11434) - RTX 3090 backend
- CLOUD: OpenAI API (https://api.openai.com/v1) - cloud-based models
- FALLBACK: Local backup (http://10.0.0.41:11435) - emergency fallback
Each module can be configured to use a different backend via environment variables.
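A minimal sketch of what that per-module selection could look like in Python is shown below; the environment variable names and default URLs are illustrative assumptions, not Lyra's actual configuration keys.

```python
# Illustrative only: env var names and defaults are assumptions,
# not the actual keys used by Lyra's LLM router.
import os

BACKENDS = {
    "vllm": os.getenv("VLLM_URL", "http://10.0.0.43:8000"),
    "ollama": os.getenv("OLLAMA_URL", "http://10.0.0.3:11434"),
    "openai": os.getenv("OPENAI_URL", "https://api.openai.com/v1"),
    "fallback": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
}

def backend_for(stage: str) -> str:
    """Resolve the base URL for a pipeline stage, e.g. 'reasoning' or 'persona'."""
    # e.g. LLM_BACKEND_REASONING=vllm, LLM_BACKEND_PERSONA=openai
    name = os.getenv(f"LLM_BACKEND_{stage.upper()}", "vllm")
    return BACKENDS.get(name, BACKENDS["fallback"])
```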
Data Flow Architecture (v0.5.0)
Normal Message Flow:
User (UI) → POST /v1/chat/completions
↓
Relay (7078)
↓ POST /reason
Cortex (7081)
↓ GET /summaries?session_id=xxx
Intake (7080) [RETURNS SUMMARY]
↓
Cortex processes (4 stages):
1. reflection.py → meta-awareness notes
2. reasoning.py → draft answer (uses LLM)
3. refine.py → refined answer (uses LLM)
4. persona/speak.py → Lyra personality (uses LLM)
↓
Returns persona answer to Relay
↓
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
↓
Intake → Background summarize → NeoMem
↓
Relay → UI (returns final response)
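The real Relay is Node.js/Express, but the orchestration above can be summarized in a short Python sketch; the endpoint paths follow the diagram, while the payload field names are assumptions.

```python
# Sketch of the Relay orchestration flow (the real Relay is Node.js/Express).
# Payload field names are assumptions based on the flow diagram above.
import httpx

CORTEX = "http://cortex:7081"
INTAKE = "http://intake:7080"

async def handle_chat(session_id: str, user_message: str) -> str:
    async with httpx.AsyncClient(timeout=120) as client:
        # 1. Ask Cortex to run the 4-stage reasoning pipeline.
        reply = await client.post(
            f"{CORTEX}/reason",
            json={"session_id": session_id, "message": user_message},
        )
        reply.raise_for_status()
        answer = reply.json().get("answer", "")

        # 2. Record the exchange in Intake so it lands in short-term memory
        #    (the real Relay does this asynchronously, without blocking the reply).
        await client.post(
            f"{INTAKE}/add_exchange",
            json={"session_id": session_id, "user": user_message, "assistant": answer},
        )
    return answer
```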
Cortex 4-Stage Reasoning Pipeline:
1. Reflection (`reflection.py`) - configurable LLM via HTTP
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"
2. Reasoning (`reasoning.py`) - configurable LLM via HTTP
   - Retrieves short-term context from Intake
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt
3. Refinement (`refine.py`) - configurable LLM via HTTP
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency
4. Persona (`speak.py`) - configurable LLM via HTTP
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to the user
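Put together, the four stages amount to four chained LLM calls over HTTP. The sketch below is illustrative only: the prompts, function signature, and the `llm(stage, prompt)` helper are assumptions, not the actual code in `reflection.py`, `reasoning.py`, `refine.py`, or `speak.py`.

```python
# Illustrative sketch of the Cortex 4-stage pipeline. Prompts, signatures,
# and the llm(stage, prompt) helper are assumptions, not the real modules.
async def run_pipeline(user_msg: str, summary: str, llm) -> str:
    # Stage 1: Reflection - meta-awareness notes, never shown to the user.
    notes = await llm("reflection", f"What is the user really asking?\n{user_msg}")

    # Stage 2: Reasoning - first draft, grounded in Intake's short-term summary.
    draft = await llm("reasoning", f"Context: {summary}\nNotes: {notes}\nUser: {user_msg}")

    # Stage 3: Refinement - tighten wording and check consistency.
    refined = await llm("refine", f"Improve this draft without changing its facts:\n{draft}")

    # Stage 4: Persona - rewrite in Lyra's voice; this is the final answer.
    return await llm("persona", f"Answer as Lyra, in her natural conversational style:\n{refined}")
```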
Features
Core Services
Relay:
- Main orchestrator and message router
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex and Intake
- Shared request handler for code reuse
- Comprehensive error handling
NeoMem (Memory Engine):
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (`/memories`, `/search`)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: `neomem-api` (port 7077)
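Because the API is drop-in compatible with Mem0 OSS, a memory write and a semantic search can be exercised directly over HTTP. The request bodies below follow Mem0 conventions and are assumptions; NeoMem's exact fields may differ slightly.

```python
# Sketch of NeoMem usage over HTTP. Request bodies follow Mem0 OSS
# conventions and are assumptions; consult the NeoMem API for exact fields.
import requests

NEOMEM = "http://localhost:7077"

# Store a memory derived from a conversation turn.
requests.post(
    f"{NEOMEM}/memories",
    json={
        "messages": [{"role": "user", "content": "I switched Cortex to the vLLM backend."}],
        "user_id": "demo-user",
    },
    timeout=30,
).raise_for_status()

# Later: retrieve semantically related memories.
hits = requests.post(
    f"{NEOMEM}/search",
    json={"query": "Which backend does Cortex use?", "user_id": "demo-user"},
    timeout=30,
)
print(hits.json())
```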
UI:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support
Reasoning Layer
Cortex (v0.5):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), and `/health` endpoints
Intake (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- Breaking change from v0.1: Removed cascading summaries (L1, L2, L5, L10, L20, L30)
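A stripped-down sketch of this design (per-session `deque` plus a FastAPI background task) is shown below; the endpoint payload and helper names are illustrative, not the actual Intake code.

```python
# Minimal sketch of the Intake v0.2 design: per-session circular buffer plus
# background summarization. Names and payloads are illustrative only.
from collections import defaultdict, deque
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
buffers: dict[str, deque] = defaultdict(lambda: deque(maxlen=200))  # 200 exchanges max

def summarize_and_push(session_id: str) -> None:
    # Placeholder: summarize the buffer with an LLM, then POST the result to NeoMem.
    exchanges = list(buffers[session_id])
    ...

@app.post("/add_exchange")
async def add_exchange(payload: dict, background_tasks: BackgroundTasks):
    session_id = payload["session_id"]
    buffers[session_id].append(payload)
    # Summarization runs after the response is sent, keeping the endpoint fast.
    background_tasks.add_task(summarize_and_push, session_id)
    return {"status": "queued", "buffered": len(buffers[session_id])}
```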
LLM Router:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, custom endpoints
- Per-module backend preferences
Beta Lyrae (RAG Memory DB) - added 11-3-25
- RAG Knowledge DB - Beta Lyrae (sheliak)
- This module implements the Retrieval-Augmented Generation (RAG) layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation. The system uses:
- ChromaDB for persistent vector storage
- OpenAI Embeddings (`text-embedding-3-small`) for semantic similarity
- FastAPI (port 7090) for the `/rag/search` REST endpoint
- Directory layout:

      rag/
      ├── rag_chat_import.py   # imports JSON chat logs
      ├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
      ├── rag_build.py         # legacy single-folder builder
      ├── rag_query.py         # command-line query helper
      ├── rag_api.py           # FastAPI service providing /rag/search
      ├── chromadb/            # persistent vector store
      ├── chatlogs/            # organized source data
      │   ├── poker/
      │   ├── work/
      │   ├── lyra/
      │   ├── personal/
      │   └── ...
      └── import.log           # progress log for batch runs
- **OpenAI chatlog importer**
- Takes JSON-formatted chat logs and imports them into the RAG store.
- Features include:
- Recursive folder indexing with category detection from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with nohup … &
- Metadata per chunk:

```json
{
  "chat_id": "<sha1 of filename>",
  "chunk_index": 0,
  "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
  "title": "cortex LLMs 11-1-25",
  "role": "assistant",
  "category": "lyra",
  "type": "chat",
  "file_modified": "2025-11-06T23:41:02",
  "imported_at": "2025-11-07T03:55:00Z"
}
```
Docker Deployment
All services run in a single docker-compose stack with the following containers:
- neomem-postgres - PostgreSQL with pgvector extension (port 5432)
- neomem-neo4j - Neo4j graph database (ports 7474, 7687)
- neomem-api - NeoMem memory service (port 7077)
- relay - Main orchestrator (port 7078)
- cortex - Reasoning engine (port 7081)
- intake - Short-term memory summarization (port 7080) - currently disabled
- rag - RAG search service (port 7090) - currently disabled
All containers communicate via the lyra_net Docker bridge network.
External LLM Services
The following LLM backends are accessed via HTTP (not part of docker-compose):
- vLLM Server (http://10.0.0.43:8000)
  - AMD MI50 GPU-accelerated inference
  - Custom ROCm-enabled vLLM build
  - Primary backend for reasoning and refinement stages
- Ollama Server (http://10.0.0.3:11434)
  - RTX 3090 GPU-accelerated inference
  - Secondary/configurable backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- OpenAI API (https://api.openai.com/v1)
  - Cloud-based inference
  - Used for reflection and persona stages
  - Model: gpt-4o-mini
- Fallback Server (http://10.0.0.41:11435)
  - Emergency backup endpoint
  - Local llama-3.2-8b-instruct model
Version History
v0.5.0 (2025-11-28) - Current Release
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working
v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring
v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop
Known Issues (v0.5.0)
Non-Critical
- Session management endpoints not fully implemented in Relay
- Intake service currently disabled in docker-compose.yml
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub
Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Add request correlation IDs for tracing
- Comprehensive health checks
Quick Start
Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key)
Setup
- Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys
- Start all services with docker-compose: `docker-compose up -d`
- Check service health: `curl http://localhost:7078/_health`
- Access the UI at http://localhost:7078
Test
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello Lyra!"}],
"session_id": "test"
}'
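Because the endpoint is OpenAI-compatible, the official `openai` Python client (v1.x) can also talk to Relay. The `model` value below is a placeholder and passing `session_id` through `extra_body` is an assumption; verify both against the Relay source.

```python
# Calling Relay through the openai Python SDK (v1.x). The model name is a
# placeholder and session_id handling is an assumption; verify against Relay.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7078/v1", api_key="unused")

resp = client.chat.completions.create(
    model="lyra",  # placeholder; Relay routes to its configured backends
    messages=[{"role": "user", "content": "Hello Lyra!"}],
    extra_body={"session_id": "test"},
)
print(resp.choices[0].message.content)
```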
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
Documentation
- See CHANGELOG.md for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for the environment variable reference
- Additional information is available in the Trilium docs
License
NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). © 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
Built with Claude Code
Integration Notes
- NeoMem API is compatible with Mem0 OSS endpoints (`/memories`, `/search`)
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`
Beta Lyrae - RAG Memory System (Currently Disabled)
Note: The RAG service is currently disabled in docker-compose.yml
Requirements
- Python 3.10+
- Dependencies: `chromadb openai tqdm python-dotenv fastapi uvicorn`
- Persistent storage: `./chromadb` or `/mnt/data/lyra_rag_db`
Setup
1. Import chat logs (must be in OpenAI message format):
   python3 rag/rag_chat_import.py
2. Build and start the RAG API server:
   cd rag
   python3 rag_build.py
   uvicorn rag_api:app --host 0.0.0.0 --port 7090
3. Query the RAG system:
   curl -X POST http://127.0.0.1:7090/rag/search \
     -H "Content-Type: application/json" \
     -d '{ "query": "What is the current state of Cortex?", "where": {"category": "lyra"} }'
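For orientation, the core of a `/rag/search` handler built on ChromaDB plus OpenAI embeddings could look roughly like the sketch below; the collection name, request model, and response shape are assumptions (see `rag/rag_api.py` for the real implementation).

```python
# Rough sketch of a /rag/search handler. Collection name, request model, and
# response shape are assumptions; see rag/rag_api.py for the real implementation.
import os

import chromadb
from chromadb.utils import embedding_functions
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
embedder = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
)
collection = chromadb.PersistentClient(path="./chromadb").get_or_create_collection(
    name="lyra_chatlogs", embedding_function=embedder  # collection name assumed
)

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    # Semantic nearest-neighbor search, optionally filtered by metadata (e.g. category).
    results = collection.query(
        query_texts=[req.query], n_results=req.n_results, where=req.where
    )
    return {"documents": results["documents"], "metadatas": results["metadatas"]}
```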