# Project Lyra - README v0.5.0

Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**, with a multi-stage reasoning pipeline powered by distributed LLM backends.

## Mission Statement

The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesiac and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator, all with its own executive function. Say something in passing and Lyra remembers it, then reminds you of it later.

---

## Architecture Overview

Project Lyra runs as a set of Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:

### A. VM 100 - lyra-core (Core Services)

**1. Relay** (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through the Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem

**2. UI** (Static HTML)
- Browser-based chat interface with cyberpunk theme
- Connects to Relay at `http://10.0.0.40:7078`
- Saves and loads sessions
- OpenAI-compatible message format

**3. NeoMem** (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + graph storage (Neo4j)
- RESTful API: `/memories`, `/search`
- Semantic memory updates and retrieval
- No external SDK dependencies - fully local

### B. VM 101 - lyra-cortex (Reasoning Layer)

**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with a multi-stage pipeline
- **4-Stage Processing:**
  1. **Reflection** - generates meta-awareness notes about the conversation
  2. **Reasoning** - creates an initial draft answer using context
  3. **Refinement** - polishes and improves the draft
  4. **Persona** - applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends

**5. Intake v0.2** (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- **API Endpoints:**
  - `POST /add_exchange` - add a conversation exchange
  - `GET /summaries?session_id={id}` - retrieve the session summary
  - `POST /close_session/{id}` - close and clean up a session

### C. LLM Backends (Remote/Local APIs)

**Multi-Backend Strategy:**
- **PRIMARY**: vLLM on AMD MI50 GPU (`http://10.0.0.43:8000`) - Cortex reasoning, Intake
- **SECONDARY**: Ollama on RTX 3090 (`http://10.0.0.3:11434`) - configurable per module
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cortex persona layer
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - emergency fallback

---

## Data Flow Architecture (v0.5.0)

### Normal Message Flow

```
User (UI) → POST /v1/chat/completions
    ↓
Relay (7078)
    ↓ POST /reason
Cortex (7081)
    ↓ GET /summaries?session_id=xxx
Intake (7080) [returns summary]
    ↓
Cortex processes (4 stages):
  1. reflection.py    → meta-awareness notes
  2. reasoning.py     → draft answer (uses LLM)
  3. refine.py        → refined answer (uses LLM)
  4. persona/speak.py → Lyra personality (uses LLM)
    ↓
Returns persona answer to Relay
    ↓
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
    ↓
Intake → background summarize → NeoMem
    ↓
Relay → UI (returns final response)
```
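The hop sequence above condenses into a short orchestration sketch. This is illustrative only: the real Relay is Node.js/Express, the host names and JSON field names below are assumptions, and only the endpoint paths come from the diagram.

```python
# Hedged sketch of Relay's orchestration, in Python for brevity (the real
# Relay is Node.js/Express). Host names and field names are assumed; the
# endpoint paths are the ones shown in the diagram above.
import httpx

CORTEX_URL = "http://cortex:7081"  # assumed Docker service name
INTAKE_URL = "http://intake:7080"  # assumed Docker service name

async def handle_chat(session_id: str, user_message: str) -> str:
    async with httpx.AsyncClient(timeout=120) as client:
        # Synchronous hop: Cortex runs the 4-stage pipeline and returns
        # the persona-voiced answer.
        r = await client.post(
            f"{CORTEX_URL}/reason",
            json={"session_id": session_id, "prompt": user_message},
        )
        answer = r.json()["answer"]  # response field name is an assumption

        # The real Relay fires this off asynchronously; awaited here to keep
        # the sketch simple. Intake summarizes in the background and pushes
        # the summary to NeoMem.
        await client.post(
            f"{INTAKE_URL}/add_exchange",
            json={"session_id": session_id, "user": user_message, "assistant": answer},
        )
    return answer
```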
### Cortex 4-Stage Reasoning Pipeline

1. **Reflection** (`reflection.py`) - cloud backend (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"

2. **Reasoning** (`reasoning.py`) - primary backend (vLLM)
   - Retrieves short-term context from Intake
   - Creates the initial draft answer
   - Integrates context, reflection notes, and the user prompt

3. **Refinement** (`refine.py`) - primary backend (vLLM)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency

4. **Persona** (`speak.py`) - cloud backend (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to the user

---

## Features

### Lyra-Core (VM 100)

**Relay**:
- Main orchestrator and message router
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async, non-blocking calls to Cortex and Intake
- Shared request handler for code reuse
- Comprehensive error handling

**NeoMem (Memory Engine)**:
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (`/memories`, `/search`)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: `neomem-api` (port 7077)

**UI**:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support

### Cortex (VM 101)

**Cortex** (v0.5):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing with per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), `/health` endpoints

**Intake** (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- **Breaking change from v0.1**: removed cascading summaries (L1, L2, L5, L10, L20, L30)

**LLM Router**:
- Dynamic backend selection
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, and custom endpoints
- Per-module backend preferences

## Beta Lyrae (RAG Memory DB) - added 11-3-25

**RAG Knowledge DB - Beta Lyrae (Sheliak)**

This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.

The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint

**Directory layout:**

```
rag/
├── rag_chat_import.py   # imports JSON chat logs
├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
├── rag_build.py         # legacy single-folder builder
├── rag_query.py         # command-line query helper
├── rag_api.py           # FastAPI service providing /rag/search
├── chromadb/            # persistent vector store
├── chatlogs/            # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log           # progress log for batch runs
```
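For orientation, here is a minimal sketch of what the `/rag/search` service could look like given the layout above. The request shape matches the Query example later in this README; the collection name, defaults, and response shape are assumptions, not the actual `rag_api.py`.

```python
# Minimal /rag/search sketch (NOT the actual rag_api.py): embeds the query
# with the same model used at import time and runs a filtered ChromaDB search.
import chromadb
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
chroma = chromadb.PersistentClient(path="./chromadb")
collection = chroma.get_or_create_collection("chatlogs")  # name is assumed
oai = OpenAI()  # reads OPENAI_API_KEY from the environment

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None  # optional metadata filter, e.g. {"category": "lyra"}
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    emb = oai.embeddings.create(model="text-embedding-3-small", input=req.query)
    results = collection.query(
        query_embeddings=[emb.data[0].embedding],
        n_results=req.n_results,
        where=req.where,
    )
    # Response shape is illustrative; the real service may differ.
    return {"documents": results["documents"], "metadatas": results["metadatas"]}
```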
**OpenAI chatlog importer (`rag_chat_import.py`):**
- Takes JSON-formatted chat logs and imports them into the RAG store.
- **Features include:**
  - Recursive folder indexing with **category detection** from the directory name
  - Smart chunking for long messages (5,000 chars per slice)
  - Automatic deduplication using a SHA-1 hash of file + chunk
  - Timestamps for both file modification and import time
  - Full progress logging via tqdm
  - Safe to run in the background with `nohup … &`
- Metadata per chunk:

```json
{
  "chat_id": "",
  "chunk_index": 0,
  "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
  "title": "cortex LLMs 11-1-25",
  "role": "assistant",
  "category": "lyra",
  "type": "chat",
  "file_modified": "2025-11-06T23:41:02",
  "imported_at": "2025-11-07T03:55:00Z"
}
```

## Cortex VM (VM 101, CT201)

**CT201 - main reasoning orchestrator.**
- This is the internal brain of Lyra, running in a privileged LXC.
- Serves a local LLM on a Radeon Instinct MI50 through a customized build of vLLM that supports ROCm.
- Accessible via `http://10.0.0.43:8000/v1/completions`.

**Intake v0.1.1** (historical; superseded by Intake v0.2, described above)
- Receives messages from Relay and summarizes them in a cascading format.
- Summarizes small batches of exchanges while also generating large-scale conversational summaries (L20).
- Intake then sends output to Cortex for self-reflection and to NeoMem for memory consolidation.

**Reflect** - TBD

## Self-hosted vLLM server

**Stack flow:**

```
[Proxmox Host]
├── loads AMDGPU driver
└── boots CT201 (order=2)

[CT201 GPU Container]
├── lyra-start-vllm.sh → starts vLLM ROCm model server
├── lyra-vllm.service  → runs the above automatically
├── lyra-core.service  → launches Cortex + Intake Docker stack
└── Docker Compose     → runs Cortex + Intake containers

[Cortex Container]
├── Listens on port 7081
├── Talks to NVGRAM (mem API) + Intake
└── Main relay between Lyra UI ↔ memory ↔ model

[Intake Container]
├── Listens on port 7080
├── Summarizes every few exchanges
├── Writes summaries to /app/logs/summaries.log
└── Future: sends summaries → Cortex for reflection
```

---

## Version History

### v0.5.0 (2025-11-28) - Current Release
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working

### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring

### v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop

---

## Known Issues (v0.5.0)

### Non-Critical
- Session management endpoints not fully implemented in Relay
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub

### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Add request correlation IDs for tracing
- Comprehensive health checks

---

## Quick Start

### Prerequisites
- Docker + Docker Compose
- PostgreSQL 13+, Neo4j 4.4+ (for NeoMem)
- At least one LLM API endpoint (vLLM, Ollama, or OpenAI)

### Setup
1. Configure environment variables in `.env` files (see the sketch after this list)
2. Start services: `docker-compose up -d`
3. Check health: `curl http://localhost:7078/_health`
4. Access UI: `http://localhost:7078`
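A hypothetical `.env` sketch tying together the addresses used throughout this README. The variable names here are illustrative only; `ENVIRONMENT_VARIABLES.md` is the authoritative reference.

```env
# Illustrative names only - see ENVIRONMENT_VARIABLES.md for the real ones.
LLM_PRIMARY_URL=http://10.0.0.43:8000     # vLLM on the AMD MI50 (PRIMARY)
LLM_SECONDARY_URL=http://10.0.0.3:11434   # Ollama on the RTX 3090 (SECONDARY)
LLM_FALLBACK_URL=http://10.0.0.41:11435   # emergency local backup (FALLBACK)
OPENAI_API_KEY=sk-...                     # CLOUD backend (reflection/persona)
RAG_API_URL=http://localhost:7090         # Beta Lyrae /rag/search service
```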
### Test

```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Lyra!"}],
    "session_id": "test"
  }'
```

---

## Documentation

- See [CHANGELOG.md](CHANGELOG.md) for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for the environment variable reference
- Additional information is available in the Trilium docs

---

## 📖 License

NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0). This fork retains the original Apache 2.0 license and adds local modifications.

© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

**Built with Claude Code**

---

## 📦 Requirements

- Docker + Docker Compose
- Postgres + Neo4j (for NeoMem)
- Access to an OpenAI- or Ollama-style API
- OpenAI API key (for Relay fallback LLMs)

**Dependencies:**
- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama

---

## 🔌 Integration Notes

- Lyra-Core connects to `neomem-api:8000` inside Docker, or `localhost:7077` locally.
- API endpoints remain identical to Mem0 (`/memories`, `/search`).
- History and entity graphs are managed internally via Postgres + Neo4j.

---

## 🧱 Architecture Snapshot

```
User → Relay → Cortex
         ↓
    [RAG Search]
         ↓
  [Reflection Loop]
         ↓
Intake (async summaries)
         ↓
NeoMem (persistent memory)
```

**Cortex v0.4.1 introduced the first fully integrated reasoning loop.**

Data flow:
1. A user message enters Cortex via `/reason`.
2. Cortex assembles context:
   - Intake summaries (short-term memory)
   - RAG contextual data (knowledge base)
3. The LLM generates an initial draft (`call_llm`).
4. The reflection loop critiques and refines the answer.
5. Intake asynchronously summarizes and sends snapshots to NeoMem.

RAG API configuration: set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).

---

## Setup and Operation

### Beta Lyrae - RAG memory system

**Requirements:**
- Environment: Python 3.10+
- Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
- Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)

**Import chats:**
- Chats need to be formatted as a JSON list of role/content messages:

```json
{
  "messages": [
    { "role": "user", "content": "Message here" },
    { "role": "assistant", "content": "Message here" }
  ]
}
```

- Organize the chats into categorical folders. This step is optional, but it helps keep things straight.
- Run `python3 rag_chat_import.py`; chats are then imported automatically. For reference, importing 68 chat logs (approx. 10.3 MB) took 32 minutes.

**Build the API server:**
- Run `rag_build.py`; this builds the ChromaDB from data saved in the `chatlogs/` folder. (A docs folder is planned.)
- Run `rag_api.py`, or: `uvicorn rag_api:app --host 0.0.0.0 --port 7090`

**Query:**
- Run `python3 rag_query.py "Question here?"`
- For testing, a curl command can reach it too (a Python equivalent follows below):

```bash
curl -X POST http://127.0.0.1:7090/rag/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the current state of Cortex and Project Lyra?",
    "where": {"category": "lyra"}
  }'
```
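For programmatic access (e.g., from Relay or Cortex), the same endpoint can be hit from Python. A hedged sketch: the response handling assumes a `documents` field, matching the sketch earlier in this README, and may need adjusting to the actual service.

```python
# Hedged Python client for /rag/search; the response shape is an assumption.
import requests

RAG_API_URL = "http://127.0.0.1:7090"  # or read RAG_API_URL from .env

resp = requests.post(
    f"{RAG_API_URL}/rag/search",
    json={
        "query": "What is the current state of Cortex and Project Lyra?",
        "where": {"category": "lyra"},
    },
    timeout=30,
)
resp.raise_for_status()
for doc in resp.json().get("documents", []):
    print(doc)
```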