intake/relay rewire

2025-12-06 04:32:42 -05:00
parent fc85557f76
commit 4acaddfd12
4 changed files with 123 additions and 185 deletions
@@ -2,19 +2,19 @@

 Lyra is a modular persistent AI companion system with advanced reasoning capabilities.
 It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
-with multi-stage reasoning pipeline powered by distributed LLM backends.
+with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.

 ## Mission Statement

 The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later.
-	
+
 ---

 ## Architecture Overview

-Project Lyra operates as a series of Docker containers networked together in a microservices architecture. Like how the brain has regions, Lyra has modules:
+Project Lyra operates as a **single docker-compose deployment** with multiple Docker containers networked together in a microservices architecture. Like how the brain has regions, Lyra has modules:

-### A. VM 100 - lyra-core (Core Services)
+### Core Services

 **1. Relay** (Node.js/Express) - Port 7078
 - Main orchestrator and message router
@@ -26,7 +26,7 @@ Project Lyra operates as a series of Docker containers networked together in a m

 **2. UI** (Static HTML)
 - Browser-based chat interface with cyberpunk theme
- Connects to Relay at `http://10.0.0.40:7078`
+- Connects to Relay
 - Saves and loads sessions
 - OpenAI-compatible message format

@@ -37,7 +37,7 @@ Project Lyra operates as a series of Docker containers networked together in a m
 - Semantic memory updates and retrieval
 - No external SDK dependencies - fully local

-### B. VM 101 - lyra-cortex (Reasoning Layer)
+### Reasoning Layer

 **4. Cortex** (Python/FastAPI) - Port 7081
 - Primary reasoning engine with multi-stage pipeline
@@ -47,7 +47,7 @@ Project Lyra operates as a series of Docker containers networked together in a m
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
 - Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends
+- Flexible LLM router supporting multiple backends via HTTP

 **5. Intake v0.2** (Python/FastAPI) - Port 7080
 - Simplified short-term memory summarization
@@ -60,13 +60,15 @@ Project Lyra operates as a series of Docker containers networked together in a m
  - `GET /summaries?session_id={id}` - Retrieve session summary
  - `POST /close_session/{id}` - Close and cleanup session

-### C. LLM Backends (Remote/Local APIs)
+### LLM Backends (HTTP-based)

-**Multi-Backend Strategy:**
- **PRIMARY**: vLLM on AMD MI50 GPU (`http://10.0.0.43:8000`) - Cortex reasoning, Intake
- **SECONDARY**: Ollama on RTX 3090 (`http://10.0.0.3:11434`) - Configurable per-module
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cortex persona layer
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback 
+**All LLM communication is done via HTTP APIs:**
+- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend
+- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
+- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
+- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
+
+Each module can be configured to use a different backend via environment variables. 
 			
 ---
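
Reviewer note: the hunk above says each module can pick its backend through environment variables but does not show the convention. Below is a minimal sketch of one way that could work. The variable names (`CORTEX_LLM`, `INTAKE_LLM`, `LLM_<NAME>_URL`) and the `resolve_backend` helper are hypothetical and do not come from this commit; only the four default URLs are taken from the README.

```python
import os

# Default endpoints copied from the README hunk above.
DEFAULT_BACKENDS = {
    "PRIMARY": "http://10.0.0.43:8000",
    "SECONDARY": "http://10.0.0.3:11434",
    "CLOUD": "https://api.openai.com/v1",
    "FALLBACK": "http://10.0.0.41:11435",
}


def resolve_backend(module: str, env=None) -> str:
    """Return the base URL a module should call.

    Hypothetical convention: <MODULE>_LLM names a backend (e.g. CORTEX_LLM=CLOUD),
    and LLM_<NAME>_URL optionally overrides that backend's default URL.
    """
    env = os.environ if env is None else env
    # Modules without an explicit setting use the PRIMARY backend.
    name = env.get(f"{module.upper()}_LLM", "PRIMARY").upper()
    if name not in DEFAULT_BACKENDS:
        raise ValueError(f"unknown backend {name!r} for module {module!r}")
    return env.get(f"LLM_{name}_URL", DEFAULT_BACKENDS[name])
```

With no variables set, `resolve_backend("cortex")` returns the vLLM URL. Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.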

@@ -101,22 +103,22 @@ Relay → UI (returns final response)

 ### Cortex 4-Stage Reasoning Pipeline:

-1. **Reflection** (`reflection.py`) - Cloud backend (OpenAI)
+1. **Reflection** (`reflection.py`) - Configurable LLM via HTTP
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"

-2. **Reasoning** (`reasoning.py`) - Primary backend (vLLM)
+2. **Reasoning** (`reasoning.py`) - Configurable LLM via HTTP
   - Retrieves short-term context from Intake
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt

-3. **Refinement** (`refine.py`) - Primary backend (vLLM)
+3. **Refinement** (`refine.py`) - Configurable LLM via HTTP
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency

-4. **Persona** (`speak.py`) - Cloud backend (OpenAI)
+4. **Persona** (`speak.py`) - Configurable LLM via HTTP
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to user
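
Reviewer note: now that every stage is described as "Configurable LLM via HTTP", the flow can be sketched as four sequential async calls, each handed its own backend. This is an illustration under assumptions, not the code in `reflection.py`, `reasoning.py`, `refine.py`, or `speak.py`. The `run_pipeline` and `make_stub` names, the prompt wording, and the `backends` mapping are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# An LLM call is modelled as an async function: prompt in, text out.
LLMCall = Callable[[str], Awaitable[str]]


async def run_pipeline(user_prompt: str, context: str, backends: Dict[str, LLMCall]) -> str:
    """Run reflection -> reasoning -> refinement -> persona in order."""
    notes = await backends["reflection"](f"What is the user really asking?\n{user_prompt}")
    draft = await backends["reasoning"](
        f"Context:\n{context}\nNotes:\n{notes}\nUser:\n{user_prompt}"
    )
    refined = await backends["refine"](f"Polish this draft:\n{draft}")
    return await backends["persona"](f"Rewrite in Lyra's voice:\n{refined}")


def make_stub(tag: str) -> LLMCall:
    """Stand-in backend that only tags the last prompt line, so the flow is visible offline."""
    async def call(prompt: str) -> str:
        return f"[{tag}] {prompt.splitlines()[-1]}"
    return call
```

Swapping a stub for a real HTTP client per stage is what "per-stage backend selection" would amount to in this sketch. The stages stay sequential because each one consumes the previous stage's output.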
@@ -125,7 +127,7 @@ Relay → UI (returns final response)

 ## Features

-### Lyra-Core (VM 100)
+### Core Services

 **Relay**:
 - Main orchestrator and message router
@@ -150,11 +152,11 @@ Relay → UI (returns final response)
 - Session save/load functionality
 - OpenAI message format support

-### Cortex (VM 101)
+### Reasoning Layer

 **Cortex** (v0.5):
 - Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing
+- Flexible LLM backend routing via HTTP
 - Per-stage backend selection
 - Async processing throughout
 - IntakeClient integration for short-term context
@@ -169,7 +171,7 @@ Relay → UI (returns final response)
 - **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30)

 **LLM Router**:
- Dynamic backend selection
+- Dynamic backend selection via HTTP
 - Environment-driven configuration
 - Support for vLLM, Ollama, OpenAI, custom endpoints
 - Per-module backend preferences
@@ -220,49 +222,44 @@ Relay → UI (returns final response)
 			"imported_at": "2025-11-07T03:55:00Z"
 		  }```

-# Cortex VM (VM101, CT201)
-  - **CT201 main reasoning orchestrator.**
-    - This is the internal brain of Lyra.
-	- Running in a privellaged LXC.	
-	- Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm.
-	- Accessible via 10.0.0.43:8000/v1/completions.
+---

-  - **Intake v0.1.1 **
-    - Recieves messages from relay and summarizes them in a cascading format.
-	- Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20)
-	- Intake then sends to cortex for self reflection, neomem for memory consolidation.
-	
-  - **Reflect **
-    -TBD
+## Docker Deployment

-# Self hosted vLLM server #
-  - **CT201 main reasoning orchestrator.**
-    - This is the internal brain of Lyra.
-	- Running in a privellaged LXC.	
-	- Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm.
-	- Accessible via 10.0.0.43:8000/v1/completions.
-  - **Stack Flow**
-    -	[Proxmox Host]
-			 └── loads AMDGPU driver
-			 └── boots CT201 (order=2)
+All services run in a single docker-compose stack with the following containers:

-		[CT201 GPU Container]
-			 ├── lyra-start-vllm.sh → starts vLLM ROCm model server
-			 ├── lyra-vllm.service   → runs the above automatically
-			 ├── lyra-core.service   → launches Cortex + Intake Docker stack
-			 └── Docker Compose      → runs Cortex + Intake containers
+- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
+- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
+- **neomem-api** - NeoMem memory service (port 7077)
+- **relay** - Main orchestrator (port 7078)
+- **cortex** - Reasoning engine (port 7081)
+- **intake** - Short-term memory summarization (port 7080) - currently disabled
+- **rag** - RAG search service (port 7090) - currently disabled

-		[Cortex Container]
-			 ├── Listens on port 7081
-			 ├── Talks to NVGRAM (mem API) + Intake
-			 └── Main relay between Lyra UI ↔ memory ↔ model
+All containers communicate via the `lyra_net` Docker bridge network.

-		[Intake Container]
-			├── Listens on port 7080
-			├── Summarizes every few exchanges
-			├── Writes summaries to /app/logs/summaries.log
-			└── Future: sends summaries → Cortex for reflection
+## External LLM Services

+The following LLM backends are accessed via HTTP (not part of docker-compose):
+
+- **vLLM Server** (`http://10.0.0.43:8000`)
+  - AMD MI50 GPU-accelerated inference
+  - Custom ROCm-enabled vLLM build
+  - Primary backend for reasoning and refinement stages
+
+- **Ollama Server** (`http://10.0.0.3:11434`)
+  - RTX 3090 GPU-accelerated inference
+  - Secondary/configurable backend
+  - Model: qwen2.5:7b-instruct-q4_K_M
+
+- **OpenAI API** (`https://api.openai.com/v1`)
+  - Cloud-based inference
+  - Used for reflection and persona stages
+  - Model: gpt-4o-mini
+
+- **Fallback Server** (`http://10.0.0.41:11435`)
+  - Emergency backup endpoint
+  - Local llama-3.2-8b-instruct model

 ---

@@ -292,6 +289,7 @@ Relay → UI (returns final response)

 ### Non-Critical
 - Session management endpoints not fully implemented in Relay
+- Intake service currently disabled in docker-compose.yml
 - RAG service currently disabled in docker-compose.yml
 - Cortex `/ingest` endpoint is a stub

@@ -307,14 +305,19 @@ Relay → UI (returns final response)

 ### Prerequisites
 - Docker + Docker Compose
- PostgreSQL 13+, Neo4j 4.4+ (for NeoMem)
- At least one LLM API endpoint (vLLM, Ollama, or OpenAI)
+- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key)

 ### Setup
-1. Configure environment variables in `.env` files
-2. Start services: `docker-compose up -d`
-3. Check health: `curl http://localhost:7078/_health`
-4. Access UI: `http://localhost:7078`
+1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys
+2. Start all services with docker-compose:
+   ```bash
+   docker-compose up -d
+   ```
+3. Check service health:
+   ```bash
+   curl http://localhost:7078/_health
+   ```
+4. Access the UI at `http://localhost:7078`

 ### Test
 ```bash
@@ -326,6 +329,8 @@ curl -X POST http://localhost:7078/v1/chat/completions \
  }'
 ```

+All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
+
 ---

 ## Documentation
@@ -345,104 +350,44 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).

 ---

-## 📦 Requirements
+## Integration Notes

- Docker + Docker Compose  
- Postgres + Neo4j (for NeoMem)
- Access to an open AI or ollama style API.
- OpenAI API key (for Relay fallback LLMs)
-
-**Dependencies:**
-	- fastapi==0.115.8
-	- uvicorn==0.34.0
-	- pydantic==2.10.4
-	- python-dotenv==1.0.1
-	- psycopg>=3.2.8
-	- ollama
+- NeoMem API is compatible with Mem0 OSS endpoints (`/memories`, `/search`)
+- All services communicate via Docker internal networking on the `lyra_net` bridge
+- History and entity graphs are managed via PostgreSQL + Neo4j
+- LLM backends are accessed via HTTP and configured in `.env`

 ---

-🔌 Integration Notes
+## Beta Lyrae - RAG Memory System (Currently Disabled)

-Lyra-Core connects to neomem-api:8000 inside Docker or localhost:7077 locally.
+**Note:** The RAG service is currently disabled in docker-compose.yml

-API endpoints remain identical to Mem0 (/memories, /search).
+### Requirements
+- Python 3.10+
+- Dependencies: `chromadb openai tqdm python-dotenv fastapi uvicorn`
+- Persistent storage: `./chromadb` or `/mnt/data/lyra_rag_db`

-History and entity graphs managed internally via Postgres + Neo4j.
+### Setup
+1. Import chat logs (must be in OpenAI message format):
+   ```bash
+   python3 rag/rag_chat_import.py
+   ```

---
+2. Build and start the RAG API server:
+   ```bash
+   cd rag
+   python3 rag_build.py
+   uvicorn rag_api:app --host 0.0.0.0 --port 7090
+   ```

-🧱 Architecture Snapshot
-
-	User → Relay → Cortex
-			 ↓
-		 [RAG Search]
-			 ↓
-		 [Reflection Loop]
-			 ↓
-		 Intake (async summaries)
-			 ↓
-		 NeoMem (persistent memory)
-
-**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
- Data Flow:
-  - User message enters Cortex via /reason.
-  - Cortex assembles context:
-	- Intake summaries (short-term memory)
-	- RAG contextual data (knowledge base)
-  - LLM generates initial draft (call_llm).
-  - Reflection loop critiques and refines the answer.
-  - Intake asynchronously summarizes and sends snapshots to NeoMem.
-
-RAG API Configuration:
-Set RAG_API_URL in .env (default: http://localhost:7090).
-
---
-
-## Setup and Operation ##
-
-## Beta Lyrae - RAG memory system ##
-**Requirements**
-  -Env= python 3.10+
-  -Dependences: pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq
-  -Persistent storage path: ./chromadb (can be moved to /mnt/data/lyra_rag_db)
-
-**Import Chats**
-  - Chats need to be formatted into the correct format of
-	```
-	  "messages": [
-	    {
-		  "role:" "user",
-		  "content": "Message here"
-		},
-		"messages": [
-	    {
-		  "role:" "assistant",
-		  "content": "Message here"
-		},```
-  - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight.
-  - run "python3 rag_chat_import.py", chats will then be imported automatically. For reference, it took 32 Minutes to import 68 Chat logs (aprox 10.3MB).
-
-**Build API Server**
-  - Run: rag_build.py, this automatically builds the chromaDB using data saved in the /chatlogs/ folder. (docs folder to be added in future.)
-  - Run: rag_api.py or ```uvicorn rag_api:app --host 0.0.0.0 --port 7090```
-
-**Query**
-  - Run: python3 rag_query.py "Question here?"
-  - For testing a curl command can reach it too
-    ```
-	curl -X POST http://127.0.0.1:7090/rag/search \
-	  -H "Content-Type: application/json" \
-	  -d '{
-			"query": "What is the current state of Cortex and Project Lyra?",
-			"where": {"category": "lyra"}
-		  }'
-	```
-	
-# Beta Lyrae - RAG System
-
-## 📖 License
-NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).  
-This fork retains the original Apache 2.0 license and adds local modifications.  
-© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
+3. Query the RAG system:
+   ```bash
+   curl -X POST http://127.0.0.1:7090/rag/search \
+     -H "Content-Type: application/json" \
+     -d '{
+       "query": "What is the current state of Cortex?",
+       "where": {"category": "lyra"}
+     }'
+   ```
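
Reviewer note: for callers that prefer Python over curl, the same query could be issued with the standard library, as sketched below. The URL, path, and body fields come from the curl example above. The helper names `build_rag_request` and `rag_search` are hypothetical, and the response is assumed to be JSON, which this commit does not confirm. The RAG service is disabled in docker-compose.yml, so it has to be started manually on port 7090 before `rag_search` can succeed.

```python
import json
import urllib.request

RAG_SEARCH_URL = "http://127.0.0.1:7090/rag/search"  # endpoint from the curl example


def build_rag_request(query: str, category: str, url: str = RAG_SEARCH_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request used to query the RAG API."""
    body = json.dumps({"query": query, "where": {"category": category}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rag_search(query: str, category: str, timeout: float = 30.0):
    """Send the query; the response body is assumed (not confirmed) to be JSON."""
    with urllib.request.urlopen(build_rag_request(query, category), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Building the request separately from sending it means the payload can be inspected or logged without a running server.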

@@ -1,3 +1,6 @@
+// relay v0.3.0
+// Core relay server for Lyra project
+// Handles incoming chat requests and forwards them to Cortex services
 import express from "express";
 import dotenv from "dotenv";
 import cors from "cors";
@@ -10,9 +13,8 @@ app.use(express.json());

 const PORT = Number(process.env.PORT || 7078);

-// core endpoints
+// Cortex endpoint (only /reason is used now)
 const CORTEX_REASON = process.env.CORTEX_REASON_URL || "http://cortex:7081/reason";
-const CORTEX_INGEST = process.env.CORTEX_INGEST_URL || "http://cortex:7081/ingest";

 // -----------------------------------------------------
 // Helper request wrapper
@@ -27,7 +29,6 @@ async function postJSON(url, data) {
  const raw = await resp.text();
  let json;

-  // Try to parse JSON safely
  try {
    json = raw ? JSON.parse(raw) : null;
  } catch (e) {
@@ -42,11 +43,12 @@ async function postJSON(url, data) {
 }

 // -----------------------------------------------------
-// Shared chat handler logic
+// The unified chat handler
 // -----------------------------------------------------
 async function handleChatRequest(session_id, user_msg) {
-  // 1. → Cortex.reason: the main pipeline
  let reason;
+
+  // 1. → Cortex.reason (main pipeline)
  try {
    reason = await postJSON(CORTEX_REASON, {
      session_id,
@@ -57,19 +59,13 @@ async function handleChatRequest(session_id, user_msg) {
    throw new Error(`cortex_reason_failed: ${e.message}`);
  }

-  const persona = reason.final_output || reason.persona || "(no persona text)";
+  // Prefer the persona field; fall back to final_output
+  const persona =
+    reason.persona ||
+    reason.final_output ||
+    "(no persona text)";

-  // 2. → Cortex.ingest (async, non-blocking)
-  // Cortex might still want this for separate ingestion pipeline.
-  postJSON(CORTEX_INGEST, {
-    session_id,
-    user_msg,
-    assistant_msg: persona
-  }).catch(e =>
-    console.warn("Relay → Cortex.ingest failed:", e.message)
-  );
-
-  // 3. Return corrected result
+  // Return final answer
  return {
    session_id,
    reply: persona
@@ -84,7 +80,7 @@ app.get("/_health", (_, res) => {
 });

 // -----------------------------------------------------
-// OPENAI-COMPATIBLE ENDPOINT (for UI & clients)
+// OPENAI-COMPATIBLE ENDPOINT
 // -----------------------------------------------------
 app.post("/v1/chat/completions", async (req, res) => {
  try {
@@ -101,7 +97,7 @@ app.post("/v1/chat/completions", async (req, res) => {

    const result = await handleChatRequest(session_id, user_msg);

-    return res.json({
+    res.json({
      id: `chatcmpl-${Date.now()}`,
      object: "chat.completion",
      created: Math.floor(Date.now() / 1000),
@@ -134,7 +130,7 @@ app.post("/v1/chat/completions", async (req, res) => {
 });

 // -----------------------------------------------------
-// MAIN ENDPOINT (canonical Lyra UI entrance)
+// MAIN ENDPOINT (Lyra-native UI)
 // -----------------------------------------------------
 app.post("/chat", async (req, res) => {
  try {
@@ -144,7 +140,7 @@ app.post("/chat", async (req, res) => {
    console.log(`Relay → received: "${user_msg}"`);

    const result = await handleChatRequest(session_id, user_msg);
-    return res.json(result);
+    res.json(result);

  } catch (err) {
    console.error("Relay fatal:", err);
@@ -1,6 +1,8 @@
 import os
 from datetime import datetime
 from typing import List, Dict, Any, TYPE_CHECKING
+from collections import deque
+

 if TYPE_CHECKING:
    from collections import deque as _deque
@@ -10,7 +10,6 @@ from reasoning.reflection import reflect_notes
 from reasoning.refine import refine_answer
 from persona.speak import speak
 from persona.identity import load_identity
-from ingest.intake_client import IntakeClient
 from context import collect_context, update_last_assistant_message
 from intake.intake import add_exchange_internal

@@ -50,9 +49,6 @@ if VERBOSE_DEBUG:
 # -----------------------------
 cortex_router = APIRouter()

-# Initialize Intake client once
-intake_client = IntakeClient()
-

 # -----------------------------
 # Pydantic models
@@ -202,11 +198,10 @@ class IngestPayload(BaseModel):
    assistant_msg: str

@cortex_router.post("/ingest")
-async def ingest(payload: IngestPayload):
-    """
-    Relay calls this after /reason.
-    We update Cortex state AND feed Intake's internal buffer.
-    """
+async def ingest_stub():
+    # Intake is internal now — this endpoint is only for compatibility.
+    return {"status": "ok", "note": "intake is internal now"}
+

    # 1. Update Cortex session state
    update_last_assistant_message(payload.session_id, payload.assistant_msg)