# Project Lyra - README v0.5.0

Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.

## Mission Statement

The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesiac and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her as a notepad/schedule/database/co-creator/collaborator, all with its own executive function. Say something in passing and Lyra remembers it, then reminds you of it later.

---

## Architecture Overview

Project Lyra operates as a **single docker-compose deployment** with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:

### Core Services

**1. Relay** (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through the Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem

**2. UI** (Static HTML)
- Browser-based chat interface with a cyberpunk theme
- Connects to Relay
- Saves and loads sessions
- OpenAI-compatible message format

**3. NeoMem** (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + graph storage (Neo4j)
- RESTful API: `/memories`, `/search`
- Semantic memory updates and retrieval
- No external SDK dependencies - fully local

### Reasoning Layer

**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with a multi-stage pipeline
- **4-Stage Processing:**
  1. **Reflection** - Generates meta-awareness notes about the conversation
  2. **Reasoning** - Creates an initial draft answer using context
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends via HTTP

**5. Intake v0.2** (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- **API Endpoints:**
  - `POST /add_exchange` - Add a conversation exchange
  - `GET /summaries?session_id={id}` - Retrieve the session summary
  - `POST /close_session/{id}` - Close and clean up a session
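
The Intake endpoints listed above can be exercised directly over HTTP. Below is a minimal client sketch using `requests`; the `/add_exchange` body fields (`session_id`, `user`, `assistant`) are assumptions for illustration, since the request schema is not documented here.

```python
# Minimal Intake client sketch (illustrative; /add_exchange field names are assumptions).
import requests

INTAKE_URL = "http://localhost:7080"
SESSION_ID = "demo-session"

# Add one user/assistant exchange to the session buffer.
requests.post(
    f"{INTAKE_URL}/add_exchange",
    json={
        "session_id": SESSION_ID,                               # assumed field name
        "user": "Remind me to benchmark the MI50 tomorrow.",    # assumed field name
        "assistant": "Noted - I'll bring it up tomorrow.",      # assumed field name
    },
    timeout=10,
).raise_for_status()

# Retrieve the rolling summary that Intake keeps for this session.
summary = requests.get(
    f"{INTAKE_URL}/summaries",
    params={"session_id": SESSION_ID},
    timeout=10,
)
summary.raise_for_status()
print(summary.json())

# Close the session and drop its buffer once the conversation is over.
requests.post(f"{INTAKE_URL}/close_session/{SESSION_ID}", timeout=10).raise_for_status()
```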

### LLM Backends (HTTP-based)

**All LLM communication is done via HTTP APIs:**

- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend
- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback

Each module can be configured to use a different backend via environment variables.

---

## Data Flow Architecture (v0.5.0)

### Normal Message Flow:

```
User (UI) → POST /v1/chat/completions
        ↓
Relay (7078)
        ↓ POST /reason
Cortex (7081)
        ↓ GET /summaries?session_id=xxx
Intake (7080) [RETURNS SUMMARY]
        ↓
Cortex processes (4 stages):
    1. reflection.py      → meta-awareness notes
    2. reasoning.py       → draft answer (uses LLM)
    3. refine.py          → refined answer (uses LLM)
    4. persona/speak.py   → Lyra personality (uses LLM)
        ↓
Returns persona answer to Relay
        ↓
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
        ↓
Intake → Background summarize → NeoMem
        ↓
Relay → UI (returns final response)
```

### Cortex 4-Stage Reasoning Pipeline:

1. **Reflection** (`reflection.py`)
   - Configurable LLM via HTTP
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"

2. **Reasoning** (`reasoning.py`)
   - Configurable LLM via HTTP
   - Retrieves short-term context from Intake
   - Creates the initial draft answer
   - Integrates context, reflection notes, and the user prompt

3. **Refinement** (`refine.py`)
   - Configurable LLM via HTTP
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency

4. **Persona** (`speak.py`)
   - Configurable LLM via HTTP
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to the user

---

## Features

### Core Services

**Relay**:
- Main orchestrator and message router
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex and Intake
- Shared request handler for code reuse
- Comprehensive error handling

**NeoMem (Memory Engine)**:
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (`/memories`, `/search`)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: `neomem-api` (port 7077)

**UI**:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support

### Reasoning Layer

**Cortex** (v0.5):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), `/health` endpoints

**Intake** (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30)

**LLM Router**:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, and custom endpoints
- Per-module backend preferences
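
To make the router's environment-driven selection concrete, here is a simplified sketch of how a per-stage backend could be resolved and called over HTTP. The environment variable names (e.g. `CORTEX_REFINE_BACKEND`, `VLLM_URL`) and the assumption that every backend exposes an OpenAI-style `/v1/chat/completions` route are illustrative only; the actual router in Cortex may differ.

```python
# Simplified LLM-router sketch: resolve a backend per pipeline stage from
# environment variables and call it over HTTP. Variable names, defaults, and
# the OpenAI-compatible route are assumptions for illustration.
import os
import requests

BACKENDS = {
    "vllm":     os.getenv("VLLM_URL",     "http://10.0.0.43:8000"),
    "ollama":   os.getenv("OLLAMA_URL",   "http://10.0.0.3:11434"),
    "openai":   os.getenv("OPENAI_URL",   "https://api.openai.com/v1"),
    "fallback": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
}

def chat(stage: str, messages: list[dict], model: str) -> str:
    """Pick the backend configured for a stage and send a chat completion request."""
    backend = os.getenv(f"CORTEX_{stage.upper()}_BACKEND", "vllm")  # e.g. CORTEX_REFINE_BACKEND
    base = BACKENDS[backend].rstrip("/")
    # Assumes an OpenAI-compatible chat route on every backend; vLLM and OpenAI
    # expose it natively, Ollama via its OpenAI-compatibility layer.
    url = base + ("/chat/completions" if base.endswith("/v1") else "/v1/chat/completions")
    headers = {}
    if backend == "openai":
        headers["Authorization"] = f"Bearer {os.environ['OPENAI_API_KEY']}"
    resp = requests.post(url, json={"model": model, "messages": messages},
                         headers=headers, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Example: route the refine stage to whichever backend the environment selects.
print(chat("refine", [{"role": "user", "content": "Tighten this draft: ..."}],
           model="qwen2.5:7b-instruct-q4_K_M"))
```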

---

## Beta Lyrae (RAG Memory DB) - added 11-3-25

**RAG Knowledge DB - Beta Lyrae (sheliak)**

This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.

The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint

**Directory Layout:**

```
rag/
├── rag_chat_import.py   # imports JSON chat logs
├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
├── rag_build.py         # legacy single-folder builder
├── rag_query.py         # command-line query helper
├── rag_api.py           # FastAPI service providing /rag/search
├── chromadb/            # persistent vector store
├── chatlogs/            # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log           # progress log for batch runs
```

**OpenAI chatlog importer** (`rag_chat_import.py`):
- Takes JSON-formatted chat logs and imports them into the RAG store
- **Features include:**
  - Recursive folder indexing with **category detection** from the directory name
  - Smart chunking for long messages (5,000 chars per slice)
  - Automatic deduplication using a SHA-1 hash of file + chunk
  - Timestamps for both file modification and import time
  - Full progress logging via tqdm
  - Safe to run in the background with `nohup … &`
- Metadata per chunk:

```json
{
  "chat_id": "",
  "chunk_index": 0,
  "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
  "title": "cortex LLMs 11-1-25",
  "role": "assistant",
  "category": "lyra",
  "type": "chat",
  "file_modified": "2025-11-06T23:41:02",
  "imported_at": "2025-11-07T03:55:00Z"
}
```
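
The sketch below illustrates the chunking, SHA-1 deduplication, and per-chunk metadata described above. The real logic lives in `rag_chat_import.py`; the exact hash inputs and the function signature shown here are assumptions.

```python
# Illustrative sketch of the importer's chunk / dedupe / metadata step.
# Hash inputs and field handling are assumptions based on the description above.
import hashlib
from datetime import datetime, timezone
from pathlib import Path

CHUNK_SIZE = 5000  # characters per slice

def chunk_message(source: Path, text: str, role: str, category: str,
                  title: str, chat_id: str = ""):
    """Yield (id, document, metadata) triples for one chat message."""
    file_modified = datetime.fromtimestamp(source.stat().st_mtime).isoformat(timespec="seconds")
    imported_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    for index, start in enumerate(range(0, len(text), CHUNK_SIZE)):
        chunk = text[start:start + CHUNK_SIZE]
        # Stable ID derived from file + chunk, so re-running the importer
        # skips chunks that were already stored (exact inputs may differ).
        chunk_id = hashlib.sha1(f"{source}:{index}:{chunk}".encode("utf-8")).hexdigest()
        yield chunk_id, chunk, {
            "chat_id": chat_id,
            "chunk_index": index,
            "source": str(source),
            "title": title,
            "role": role,
            "category": category,   # detected from the parent directory name
            "type": "chat",
            "file_modified": file_modified,
            "imported_at": imported_at,
        }

# Each triple is then embedded with text-embedding-3-small and added to the
# ChromaDB collection, skipping IDs that already exist.
```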

---

## Docker Deployment

All services run in a single docker-compose stack with the following containers:

- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
- **neomem-api** - NeoMem memory service (port 7077)
- **relay** - Main orchestrator (port 7078)
- **cortex** - Reasoning engine (port 7081)
- **intake** - Short-term memory summarization (port 7080) - currently disabled
- **rag** - RAG search service (port 7090) - currently disabled

All containers communicate via the `lyra_net` Docker bridge network.

## External LLM Services

The following LLM backends are accessed via HTTP (not part of docker-compose):

- **vLLM Server** (`http://10.0.0.43:8000`)
  - AMD MI50 GPU-accelerated inference
  - Custom ROCm-enabled vLLM build
  - Primary backend for the reasoning and refinement stages
- **Ollama Server** (`http://10.0.0.3:11434`)
  - RTX 3090 GPU-accelerated inference
  - Secondary/configurable backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- **OpenAI API** (`https://api.openai.com/v1`)
  - Cloud-based inference
  - Used for the reflection and persona stages
  - Model: gpt-4o-mini
- **Fallback Server** (`http://10.0.0.41:11435`)
  - Emergency backup endpoint
  - Local llama-3.2-8b-instruct model

---

## Version History

### v0.5.0 (2025-11-28) - Current Release
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working

### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring

### v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop

---

## Known Issues (v0.5.0)

### Non-Critical
- Session management endpoints not fully implemented in Relay
- Intake service currently disabled in docker-compose.yml
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub

### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Add request correlation IDs for tracing
- Comprehensive health checks

---

## Quick Start

### Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or an OpenAI API key)

### Setup
1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys.
2. Start all services with docker-compose:
   ```bash
   docker-compose up -d
   ```
3. Check service health:
   ```bash
   curl http://localhost:7078/_health
   ```
4. Access the UI at `http://localhost:7078`

### Test
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Lyra!"}],
    "session_id": "test"
  }'
```

All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.

---

## Documentation

- See [CHANGELOG.md](CHANGELOG.md) for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for the environment variable reference
- Additional information is available in the Trilium docs

---

## License

NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

**Built with Claude Code**

---

## Integration Notes

- NeoMem API is compatible with Mem0 OSS endpoints (`/memories`, `/search`)
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`

---

## Beta Lyrae - RAG Memory System (Currently Disabled)

**Note:** The RAG service is currently disabled in docker-compose.yml.

### Requirements
- Python 3.10+
- Dependencies: `chromadb openai tqdm python-dotenv fastapi uvicorn`
- Persistent storage: `./chromadb` or `/mnt/data/lyra_rag_db`

### Setup
1. Import chat logs (must be in OpenAI message format):
   ```bash
   python3 rag/rag_chat_import.py
   ```
2. Build and start the RAG API server:
   ```bash
   cd rag
   python3 rag_build.py
   uvicorn rag_api:app --host 0.0.0.0 --port 7090
   ```
3. Query the RAG system:
   ```bash
   curl -X POST http://127.0.0.1:7090/rag/search \
     -H "Content-Type: application/json" \
     -d '{
       "query": "What is the current state of Cortex?",
       "where": {"category": "lyra"}
     }'
   ```
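
The same query can also be issued from Python. A minimal sketch using `requests` is shown below; the response shape is not documented here, so the sketch simply pretty-prints whatever `rag_api.py` returns.

```python
# Query the RAG service from Python (equivalent to the curl call above).
import json
import requests

resp = requests.post(
    "http://127.0.0.1:7090/rag/search",
    json={
        "query": "What is the current state of Cortex?",
        "where": {"category": "lyra"},   # metadata filter, see the chunk schema above
    },
    timeout=30,
)
resp.raise_for_status()

# Response fields depend on rag_api.py; inspect the raw JSON and adapt as needed.
print(json.dumps(resp.json(), indent=2))
```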