diff --git a/README.md b/README.md index 74bf62d..072f3e0 100644 --- a/README.md +++ b/README.md @@ -2,19 +2,19 @@ Lyra is a modular persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**, -with multi-stage reasoning pipeline powered by distributed LLM backends. +with multi-stage reasoning pipeline powered by HTTP-based LLM backends. ## Mission Statement The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later. - + --- ## Architecture Overview -Project Lyra operates as a series of Docker containers networked together in a microservices architecture. Like how the brain has regions, Lyra has modules: +Project Lyra operates as a **single docker-compose deployment** with multiple Docker containers networked together in a microservices architecture. Like how the brain has regions, Lyra has modules: -### A. VM 100 - lyra-core (Core Services) +### Core Services **1. Relay** (Node.js/Express) - Port 7078 - Main orchestrator and message router @@ -26,7 +26,7 @@ Project Lyra operates as a series of Docker containers networked together in a m **2. UI** (Static HTML) - Browser-based chat interface with cyberpunk theme -- Connects to Relay at `http://10.0.0.40:7078` +- Connects to Relay - Saves and loads sessions - OpenAI-compatible message format @@ -37,7 +37,7 @@ Project Lyra operates as a series of Docker containers networked together in a m - Semantic memory updates and retrieval - No external SDK dependencies - fully local -### B. VM 101 - lyra-cortex (Reasoning Layer) +### Reasoning Layer **4. Cortex** (Python/FastAPI) - Port 7081 - Primary reasoning engine with multi-stage pipeline @@ -47,7 +47,7 @@ Project Lyra operates as a series of Docker containers networked together in a m 3. **Refinement** - Polishes and improves the draft 4. **Persona** - Applies Lyra's personality and speaking style - Integrates with Intake for short-term context -- Flexible LLM router supporting multiple backends +- Flexible LLM router supporting multiple backends via HTTP **5. Intake v0.2** (Python/FastAPI) - Port 7080 - Simplified short-term memory summarization @@ -60,13 +60,15 @@ Project Lyra operates as a series of Docker containers networked together in a m - `GET /summaries?session_id={id}` - Retrieve session summary - `POST /close_session/{id}` - Close and cleanup session -### C. 
LLM Backends (Remote/Local APIs) +### LLM Backends (HTTP-based) -**Multi-Backend Strategy:** -- **PRIMARY**: vLLM on AMD MI50 GPU (`http://10.0.0.43:8000`) - Cortex reasoning, Intake -- **SECONDARY**: Ollama on RTX 3090 (`http://10.0.0.3:11434`) - Configurable per-module -- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cortex persona layer -- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback +**All LLM communication is done via HTTP APIs:** +- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend +- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend +- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models +- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback + +Each module can be configured to use a different backend via environment variables. --- @@ -101,22 +103,22 @@ Relay → UI (returns final response) ### Cortex 4-Stage Reasoning Pipeline: -1. **Reflection** (`reflection.py`) - Cloud backend (OpenAI) +1. **Reflection** (`reflection.py`) - Configurable LLM via HTTP - Analyzes user intent and conversation context - Generates meta-awareness notes - "What is the user really asking?" -2. **Reasoning** (`reasoning.py`) - Primary backend (vLLM) +2. **Reasoning** (`reasoning.py`) - Configurable LLM via HTTP - Retrieves short-term context from Intake - Creates initial draft answer - Integrates context, reflection notes, and user prompt -3. **Refinement** (`refine.py`) - Primary backend (vLLM) +3. **Refinement** (`refine.py`) - Configurable LLM via HTTP - Polishes the draft answer - Improves clarity and coherence - Ensures factual consistency -4. **Persona** (`speak.py`) - Cloud backend (OpenAI) +4. **Persona** (`speak.py`) - Configurable LLM via HTTP - Applies Lyra's personality and speaking style - Natural, conversational output - Final answer returned to user @@ -125,7 +127,7 @@ Relay → UI (returns final response) ## Features -### Lyra-Core (VM 100) +### Core Services **Relay**: - Main orchestrator and message router @@ -150,11 +152,11 @@ Relay → UI (returns final response) - Session save/load functionality - OpenAI message format support -### Cortex (VM 101) +### Reasoning Layer **Cortex** (v0.5): - Multi-stage reasoning pipeline (reflection → reasoning → refine → persona) -- Flexible LLM backend routing +- Flexible LLM backend routing via HTTP - Per-stage backend selection - Async processing throughout - IntakeClient integration for short-term context @@ -169,7 +171,7 @@ Relay → UI (returns final response) - **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30) **LLM Router**: -- Dynamic backend selection +- Dynamic backend selection via HTTP - Environment-driven configuration - Support for vLLM, Ollama, OpenAI, custom endpoints - Per-module backend preferences @@ -220,49 +222,44 @@ Relay → UI (returns final response) "imported_at": "2025-11-07T03:55:00Z" }``` -# Cortex VM (VM101, CT201) - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. +--- - - **Intake v0.1.1 ** - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. 
(L20) - - Intake then sends to cortex for self reflection, neomem for memory consolidation. - - - **Reflect ** - -TBD +## Docker Deployment -# Self hosted vLLM server # - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. - - **Stack Flow** - - [Proxmox Host] - └── loads AMDGPU driver - └── boots CT201 (order=2) +All services run in a single docker-compose stack with the following containers: - [CT201 GPU Container] - ├── lyra-start-vllm.sh → starts vLLM ROCm model server - ├── lyra-vllm.service → runs the above automatically - ├── lyra-core.service → launches Cortex + Intake Docker stack - └── Docker Compose → runs Cortex + Intake containers +- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432) +- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687) +- **neomem-api** - NeoMem memory service (port 7077) +- **relay** - Main orchestrator (port 7078) +- **cortex** - Reasoning engine (port 7081) +- **intake** - Short-term memory summarization (port 7080) - currently disabled +- **rag** - RAG search service (port 7090) - currently disabled - [Cortex Container] - ├── Listens on port 7081 - ├── Talks to NVGRAM (mem API) + Intake - └── Main relay between Lyra UI ↔ memory ↔ model +All containers communicate via the `lyra_net` Docker bridge network. - [Intake Container] - ├── Listens on port 7080 - ├── Summarizes every few exchanges - ├── Writes summaries to /app/logs/summaries.log - └── Future: sends summaries → Cortex for reflection +## External LLM Services +The following LLM backends are accessed via HTTP (not part of docker-compose): + +- **vLLM Server** (`http://10.0.0.43:8000`) + - AMD MI50 GPU-accelerated inference + - Custom ROCm-enabled vLLM build + - Primary backend for reasoning and refinement stages + +- **Ollama Server** (`http://10.0.0.3:11434`) + - RTX 3090 GPU-accelerated inference + - Secondary/configurable backend + - Model: qwen2.5:7b-instruct-q4_K_M + +- **OpenAI API** (`https://api.openai.com/v1`) + - Cloud-based inference + - Used for reflection and persona stages + - Model: gpt-4o-mini + +- **Fallback Server** (`http://10.0.0.41:11435`) + - Emergency backup endpoint + - Local llama-3.2-8b-instruct model --- @@ -292,6 +289,7 @@ Relay → UI (returns final response) ### Non-Critical - Session management endpoints not fully implemented in Relay +- Intake service currently disabled in docker-compose.yml - RAG service currently disabled in docker-compose.yml - Cortex `/ingest` endpoint is a stub @@ -307,14 +305,19 @@ Relay → UI (returns final response) ### Prerequisites - Docker + Docker Compose -- PostgreSQL 13+, Neo4j 4.4+ (for NeoMem) -- At least one LLM API endpoint (vLLM, Ollama, or OpenAI) +- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key) ### Setup -1. Configure environment variables in `.env` files -2. Start services: `docker-compose up -d` -3. Check health: `curl http://localhost:7078/_health` -4. Access UI: `http://localhost:7078` +1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys +2. Start all services with docker-compose: + ```bash + docker-compose up -d + ``` +3. Check service health: + ```bash + curl http://localhost:7078/_health + ``` +4. 
Access the UI at `http://localhost:7078` ### Test ```bash @@ -326,6 +329,8 @@ curl -X POST http://localhost:7078/v1/chat/completions \ }' ``` +All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack. + --- ## Documentation @@ -345,104 +350,44 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). --- -## 📦 Requirements +## Integration Notes -- Docker + Docker Compose -- Postgres + Neo4j (for NeoMem) -- Access to an open AI or ollama style API. -- OpenAI API key (for Relay fallback LLMs) - -**Dependencies:** - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama +- NeoMem API is compatible with Mem0 OSS endpoints (`/memories`, `/search`) +- All services communicate via Docker internal networking on the `lyra_net` bridge +- History and entity graphs are managed via PostgreSQL + Neo4j +- LLM backends are accessed via HTTP and configured in `.env` --- -🔌 Integration Notes +## Beta Lyrae - RAG Memory System (Currently Disabled) -Lyra-Core connects to neomem-api:8000 inside Docker or localhost:7077 locally. +**Note:** The RAG service is currently disabled in docker-compose.yml -API endpoints remain identical to Mem0 (/memories, /search). +### Requirements +- Python 3.10+ +- Dependencies: `chromadb openai tqdm python-dotenv fastapi uvicorn` +- Persistent storage: `./chromadb` or `/mnt/data/lyra_rag_db` -History and entity graphs managed internally via Postgres + Neo4j. +### Setup +1. Import chat logs (must be in OpenAI message format): + ```bash + python3 rag/rag_chat_import.py + ``` ---- +2. Build and start the RAG API server: + ```bash + cd rag + python3 rag_build.py + uvicorn rag_api:app --host 0.0.0.0 --port 7090 + ``` -🧱 Architecture Snapshot - - User → Relay → Cortex - ↓ - [RAG Search] - ↓ - [Reflection Loop] - ↓ - Intake (async summaries) - ↓ - NeoMem (persistent memory) - -**Cortex v0.4.1 introduces the first fully integrated reasoning loop.** -- Data Flow: - - User message enters Cortex via /reason. - - Cortex assembles context: - - Intake summaries (short-term memory) - - RAG contextual data (knowledge base) - - LLM generates initial draft (call_llm). - - Reflection loop critiques and refines the answer. - - Intake asynchronously summarizes and sends snapshots to NeoMem. - -RAG API Configuration: -Set RAG_API_URL in .env (default: http://localhost:7090). - ---- - -## Setup and Operation ## - -## Beta Lyrae - RAG memory system ## -**Requirements** - -Env= python 3.10+ - -Dependences: pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq - -Persistent storage path: ./chromadb (can be moved to /mnt/data/lyra_rag_db) - -**Import Chats** - - Chats need to be formatted into the correct format of - ``` - "messages": [ - { - "role:" "user", - "content": "Message here" - }, - "messages": [ - { - "role:" "assistant", - "content": "Message here" - },``` - - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight. - - run "python3 rag_chat_import.py", chats will then be imported automatically. For reference, it took 32 Minutes to import 68 Chat logs (aprox 10.3MB). - -**Build API Server** - - Run: rag_build.py, this automatically builds the chromaDB using data saved in the /chatlogs/ folder. (docs folder to be added in future.) - - Run: rag_api.py or ```uvicorn rag_api:app --host 0.0.0.0 --port 7090``` - -**Query** - - Run: python3 rag_query.py "Question here?" 
- - For testing a curl command can reach it too - ``` - curl -X POST http://127.0.0.1:7090/rag/search \ - -H "Content-Type: application/json" \ - -d '{ - "query": "What is the current state of Cortex and Project Lyra?", - "where": {"category": "lyra"} - }' - ``` - -# Beta Lyrae - RAG System - -## 📖 License -NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0). -This fork retains the original Apache 2.0 license and adds local modifications. -© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0. +3. Query the RAG system: + ```bash + curl -X POST http://127.0.0.1:7090/rag/search \ + -H "Content-Type: application/json" \ + -d '{ + "query": "What is the current state of Cortex?", + "where": {"category": "lyra"} + }' + ``` diff --git a/core/relay/server.js b/core/relay/server.js index c9e2192..db706d8 100644 --- a/core/relay/server.js +++ b/core/relay/server.js @@ -1,3 +1,6 @@ +// relay v0.3.0 +// Core relay server for Lyra project +// Handles incoming chat requests and forwards them to Cortex services import express from "express"; import dotenv from "dotenv"; import cors from "cors"; @@ -10,9 +13,8 @@ app.use(express.json()); const PORT = Number(process.env.PORT || 7078); -// core endpoints +// Cortex endpoints (only these are used now) const CORTEX_REASON = process.env.CORTEX_REASON_URL || "http://cortex:7081/reason"; -const CORTEX_INGEST = process.env.CORTEX_INGEST_URL || "http://cortex:7081/ingest"; // ----------------------------------------------------- // Helper request wrapper @@ -27,7 +29,6 @@ async function postJSON(url, data) { const raw = await resp.text(); let json; - // Try to parse JSON safely try { json = raw ? JSON.parse(raw) : null; } catch (e) { @@ -42,11 +43,12 @@ async function postJSON(url, data) { } // ----------------------------------------------------- -// Shared chat handler logic +// The unified chat handler // ----------------------------------------------------- async function handleChatRequest(session_id, user_msg) { - // 1. → Cortex.reason: the main pipeline let reason; + + // 1. → Cortex.reason (main pipeline) try { reason = await postJSON(CORTEX_REASON, { session_id, @@ -57,19 +59,13 @@ async function handleChatRequest(session_id, user_msg) { throw new Error(`cortex_reason_failed: ${e.message}`); } - const persona = reason.final_output || reason.persona || "(no persona text)"; + // Correct persona field + const persona = + reason.persona || + reason.final_output || + "(no persona text)"; - // 2. → Cortex.ingest (async, non-blocking) - // Cortex might still want this for separate ingestion pipeline. - postJSON(CORTEX_INGEST, { - session_id, - user_msg, - assistant_msg: persona - }).catch(e => - console.warn("Relay → Cortex.ingest failed:", e.message) - ); - - // 3. 
Return corrected result + // Return final answer return { session_id, reply: persona @@ -84,7 +80,7 @@ app.get("/_health", (_, res) => { }); // ----------------------------------------------------- -// OPENAI-COMPATIBLE ENDPOINT (for UI & clients) +// OPENAI-COMPATIBLE ENDPOINT // ----------------------------------------------------- app.post("/v1/chat/completions", async (req, res) => { try { @@ -101,7 +97,7 @@ app.post("/v1/chat/completions", async (req, res) => { const result = await handleChatRequest(session_id, user_msg); - return res.json({ + res.json({ id: `chatcmpl-${Date.now()}`, object: "chat.completion", created: Math.floor(Date.now() / 1000), @@ -134,7 +130,7 @@ app.post("/v1/chat/completions", async (req, res) => { }); // ----------------------------------------------------- -// MAIN ENDPOINT (canonical Lyra UI entrance) +// MAIN ENDPOINT (Lyra-native UI) // ----------------------------------------------------- app.post("/chat", async (req, res) => { try { @@ -144,7 +140,7 @@ app.post("/chat", async (req, res) => { console.log(`Relay → received: "${user_msg}"`); const result = await handleChatRequest(session_id, user_msg); - return res.json(result); + res.json(result); } catch (err) { console.error("Relay fatal:", err); diff --git a/cortex/intake/intake.py b/cortex/intake/intake.py index 050f8d7..897acf8 100644 --- a/cortex/intake/intake.py +++ b/cortex/intake/intake.py @@ -1,6 +1,8 @@ import os from datetime import datetime from typing import List, Dict, Any, TYPE_CHECKING +from collections import deque + if TYPE_CHECKING: from collections import deque as _deque diff --git a/cortex/router.py b/cortex/router.py index 906d3d8..0beb457 100644 --- a/cortex/router.py +++ b/cortex/router.py @@ -10,7 +10,6 @@ from reasoning.reflection import reflect_notes from reasoning.refine import refine_answer from persona.speak import speak from persona.identity import load_identity -from ingest.intake_client import IntakeClient from context import collect_context, update_last_assistant_message from intake.intake import add_exchange_internal @@ -50,9 +49,6 @@ if VERBOSE_DEBUG: # ----------------------------- cortex_router = APIRouter() -# Initialize Intake client once -intake_client = IntakeClient() - # ----------------------------- # Pydantic models @@ -202,11 +198,10 @@ class IngestPayload(BaseModel): assistant_msg: str @cortex_router.post("/ingest") -async def ingest(payload: IngestPayload): - """ - Relay calls this after /reason. - We update Cortex state AND feed Intake's internal buffer. - """ +async def ingest_stub(): + # Intake is internal now — this endpoint is only for compatibility. + return {"status": "ok", "note": "intake is internal now"} + # 1. Update Cortex session state update_last_assistant_message(payload.session_id, payload.assistant_msg)
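For reference, here is a minimal sketch of how the new `/ingest` compatibility stub in `cortex/router.py` could be kept self-contained while still tolerating the payload that older Relay builds send. The `cortex_router`, `IngestPayload`, and `add_exchange_internal` names come from the diff above; accepting the body as optional and simply ignoring it is an illustrative assumption, not something this patch itself does.

```python
# Hedged sketch, not part of the patch: a standalone /ingest compatibility stub.
from typing import Optional

from fastapi import APIRouter
from pydantic import BaseModel

cortex_router = APIRouter()

class IngestPayload(BaseModel):
    # Field names mirror the payload Relay used to POST to /ingest.
    session_id: str
    user_msg: str
    assistant_msg: str

@cortex_router.post("/ingest")
async def ingest_stub(payload: Optional[IngestPayload] = None):
    # Intake is now fed internally (see the add_exchange_internal import in
    # router.py), so any legacy payload is accepted and ignored; the endpoint
    # exists only so callers that still POST /ingest keep receiving a 200.
    return {"status": "ok", "note": "intake is internal now"}
```

Older clients that still call `POST /ingest` get the same `{"status": "ok"}` response whether or not they include a body, while new Relay builds, which no longer call the endpoint at all, are unaffected.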