project-lyra/README.md

# Project Lyra - README v0.5.0

Lyra is a modular, persistent AI companion system with advanced reasoning capabilities.
It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
with a multi-stage reasoning pipeline powered by distributed LLM backends.

## Mission Statement

Project Lyra aims to give an AI chatbot more abilities than a typical chatbot has. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator in one, with her own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.

---

## Architecture Overview

Project Lyra operates as a series of Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:

### A. VM 100 - lyra-core (Core Services)

**1. Relay** (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem

**2. UI** (Static HTML)
- Browser-based chat interface with cyberpunk theme
- Connects to Relay at `http://10.0.0.40:7078`
- Saves and loads sessions
- OpenAI-compatible message format

**3. NeoMem** (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
- RESTful API: `/memories`, `/search`
- Semantic memory updates and retrieval
- No external SDK dependencies - fully local

### B. VM 101 - lyra-cortex (Reasoning Layer)

**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- **4-Stage Processing:**
  1. **Reflection** - Generates meta-awareness notes about the conversation
  2. **Reasoning** - Creates initial draft answer using context
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends

**5. Intake v0.2** (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level simple summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- **API Endpoints:**
  - `POST /add_exchange` - Add conversation exchange
  - `GET /summaries?session_id={id}` - Retrieve session summary
  - `POST /close_session/{id}` - Close and cleanup session

### C. LLM Backends (Remote/Local APIs)

**Multi-Backend Strategy:**
- **PRIMARY**: vLLM on AMD MI50 GPU (`http://10.0.0.43:8000`) - Cortex reasoning, Intake
- **SECONDARY**: Ollama on RTX 3090 (`http://10.0.0.3:11434`) - Configurable per-module
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cortex persona layer
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
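
One way to picture per-module backend selection is a small lookup table. The sketch below is an assumption-heavy illustration: the URLs come from the list above, but the `<MODULE>_LLM` variable naming and the rule "unknown names resolve to FALLBACK" are hypothetical and may not match the real router.

```python
import os
from typing import Mapping

# Backend URLs copied from the list above.
BACKENDS = {
    "PRIMARY": "http://10.0.0.43:8000",
    "SECONDARY": "http://10.0.0.3:11434",
    "CLOUD": "https://api.openai.com/v1",
    "FALLBACK": "http://10.0.0.41:11435",
}


def resolve_backend(module: str, env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (backend_name, base_url) for a module.

    Reads a hypothetical <MODULE>_LLM variable (e.g. CORTEX_LLM=cloud),
    defaults to PRIMARY, and treats unknown names as FALLBACK.
    """
    name = env.get(f"{module.upper()}_LLM", "PRIMARY").upper()
    if name not in BACKENDS:
        name = "FALLBACK"
    return name, BACKENDS[name]
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.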

---

## Data Flow Architecture (v0.5.0)

### Normal Message Flow:

```
User (UI) → POST /v1/chat/completions
  ↓
Relay (7078)
  ↓ POST /reason
Cortex (7081)
  ↓ GET /summaries?session_id=xxx
Intake (7080) [RETURNS SUMMARY]
  ↓
Cortex processes (4 stages):
  1. reflection.py → meta-awareness notes
  2. reasoning.py → draft answer (uses LLM)
  3. refine.py → refined answer (uses LLM)
  4. persona/speak.py → Lyra personality (uses LLM)
  ↓
Returns persona answer to Relay
  ↓
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
  ↓
Intake → Background summarize → NeoMem
  ↓
Relay → UI (returns final response)
```

### Cortex 4-Stage Reasoning Pipeline:

1. **Reflection** (`reflection.py`) - Cloud backend (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"

2. **Reasoning** (`reasoning.py`) - Primary backend (vLLM)
   - Retrieves short-term context from Intake
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt

3. **Refinement** (`refine.py`) - Primary backend (vLLM)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency

4. **Persona** (`speak.py`) - Cloud backend (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to user
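
The four stages above can be sketched as a chain of async calls, each feeding the next. This is an illustration under stated assumptions, not the code in `reflection.py`, `reasoning.py`, `refine.py`, or `speak.py`: the prompt wording, the `run_pipeline` name, and the `(backend, prompt)` signature of `call_llm` are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# An LLM call takes (backend_name, prompt) and returns text.
# This signature is an assumption made for the sketch.
LLMCall = Callable[[str, str], Awaitable[str]]


async def run_pipeline(user_prompt: str, context: str, call_llm: LLMCall) -> str:
    # 1. Reflection (cloud backend): meta-awareness notes about intent.
    notes = await call_llm("CLOUD", f"Reflect on the user's intent:\n{user_prompt}")
    # 2. Reasoning (primary backend): draft from context, notes, and prompt.
    draft = await call_llm(
        "PRIMARY",
        f"Context:\n{context}\nNotes:\n{notes}\nUser:\n{user_prompt}",
    )
    # 3. Refinement (primary backend): polish the draft.
    refined = await call_llm("PRIMARY", f"Refine this draft:\n{draft}")
    # 4. Persona (cloud backend): apply Lyra's voice to the refined answer.
    return await call_llm("CLOUD", f"Rewrite in Lyra's voice:\n{refined}")


async def echo_llm(backend: str, prompt: str) -> str:
    """Stand-in for a real backend so the sketch runs offline."""
    return f"[{backend}] {prompt.splitlines()[0]}"


if __name__ == "__main__":
    print(asyncio.run(run_pipeline("Hello Lyra!", "no prior context", echo_llm)))
```

Injecting `call_llm` as a parameter lets each stage be pointed at a different backend, or at a stub during testing.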

---

## Features

### Lyra-Core (VM 100)

**Relay**:
- Main orchestrator and message router
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex and Intake
- Shared request handler for code reuse
- Comprehensive error handling

**NeoMem (Memory Engine)**:
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (`/memories`, `/search`)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: `neomem-api` (port 7077)

**UI**:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support

### Cortex (VM 101)

**Cortex** (v0.5):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing
- Per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), `/health` endpoints

**Intake** (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30)

**LLM Router**:
- Dynamic backend selection
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, custom endpoints
- Per-module backend preferences

### Beta Lyrae (RAG Memory DB) - added 11-3-25
- **RAG Knowledge DB - Beta Lyrae (sheliak)**
  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
  - The system uses:
    - **ChromaDB** for persistent vector storage
    - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
    - **FastAPI** (port 7090) for the `/rag/search` REST endpoint
  - Directory layout:

        rag/
        ├── rag_chat_import.py   # imports JSON chat logs
        ├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
        ├── rag_build.py         # legacy single-folder builder
        ├── rag_query.py         # command-line query helper
        ├── rag_api.py           # FastAPI service providing /rag/search
        ├── chromadb/            # persistent vector store
        ├── chatlogs/            # organized source data
        │   ├── poker/
        │   ├── work/
        │   ├── lyra/
        │   ├── personal/
        │   └── ...
        └── import.log           # progress log for batch runs

  - **OpenAI chatlog importer**
    - Takes JSON-formatted chat logs and imports them into the RAG store.
    - **Features include:**
      - Recursive folder indexing with **category detection** from directory name
      - Smart chunking for long messages (5,000 characters per slice)
      - Automatic deduplication using SHA-1 hash of file + chunk
      - Timestamps for both file modification and import time
      - Full progress logging via tqdm
      - Safe to run in the background with `nohup … &`
      - Metadata per chunk:
        ```json
        {
          "chat_id": "<sha1 of filename>",
          "chunk_index": 0,
          "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
          "title": "cortex LLMs 11-1-25",
          "role": "assistant",
          "category": "lyra",
          "type": "chat",
          "file_modified": "2025-11-06T23:41:02",
          "imported_at": "2025-11-07T03:55:00Z"
        }
        ```
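
The chunking and deduplication features above can be sketched as follows. This is a hedged illustration, not the code in `rag_chat_import.py`: the function names are hypothetical, and hashing the source path, chunk index, and chunk text together is only one reading of "SHA-1 hash of file + chunk".

```python
import hashlib

CHUNK_SIZE = 5000  # characters per slice, as listed above


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_id(source: str, index: int, chunk: str) -> str:
    """Stable SHA-1 ID; the exact fields hashed are an assumption."""
    return hashlib.sha1(f"{source}:{index}:{chunk}".encode("utf-8")).hexdigest()


def new_chunks(source: str, text: str, seen: set[str]) -> list[tuple[str, int, str]]:
    """Return (id, chunk_index, chunk) for chunks whose ID is not in `seen`."""
    fresh = []
    for index, chunk in enumerate(chunk_text(text)):
        cid = chunk_id(source, index, chunk)
        if cid not in seen:
            seen.add(cid)  # remember the ID so a re-import skips this chunk
            fresh.append((cid, index, chunk))
    return fresh
```

Because the IDs are deterministic, re-running an import over the same files yields no new chunks, which is what makes repeated batch runs safe.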

### Cortex VM (VM101, CT201)
  - **CT201 main reasoning orchestrator**
    - This is the internal brain of Lyra.
    - Runs in a privileged LXC.
    - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
    - Accessible via `10.0.0.43:8000/v1/completions`.

  - **Intake v0.1.1** (legacy; see Intake v0.2 above)
    - Receives messages from Relay and summarizes them in a cascading format.
    - Continues to summarize smaller batches of exchanges while also generating large-scale conversational summaries (L20).
    - Intake then sends summaries to Cortex for self-reflection and to NeoMem for memory consolidation.

  - **Reflect**
    - TBD

### Self-hosted vLLM server
  - **Stack Flow**

        [Proxmox Host]
         ├── loads AMDGPU driver
         └── boots CT201 (order=2)

        [CT201 GPU Container]
         ├── lyra-start-vllm.sh → starts vLLM ROCm model server
         ├── lyra-vllm.service  → runs the above automatically
         ├── lyra-core.service  → launches Cortex + Intake Docker stack
         └── Docker Compose     → runs Cortex + Intake containers

        [Cortex Container]
         ├── Listens on port 7081
         ├── Talks to NVGRAM (mem API) + Intake
         └── Main relay between Lyra UI ↔ memory ↔ model

        [Intake Container]
         ├── Listens on port 7080
         ├── Summarizes every few exchanges
         ├── Writes summaries to /app/logs/summaries.log
         └── Future: sends summaries → Cortex for reflection

---

## Version History

### v0.5.0 (2025-11-28) - Current Release
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working

### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring

### v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop

---

## Known Issues (v0.5.0)

### Non-Critical
- Session management endpoints not fully implemented in Relay
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub

### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Add request correlation IDs for tracing
- Comprehensive health checks

---

## Quick Start

### Prerequisites
- Docker + Docker Compose
- PostgreSQL 13+, Neo4j 4.4+ (for NeoMem)
- At least one LLM API endpoint (vLLM, Ollama, or OpenAI)

### Setup
1. Configure environment variables in `.env` files
2. Start services: `docker-compose up -d`
3. Check health: `curl http://localhost:7078/_health`
4. Access UI: `http://localhost:7078`

### Test
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Lyra!"}],
    "session_id": "test"
  }'
```

---

## Documentation

- See [CHANGELOG.md](CHANGELOG.md) for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for environment variable reference
- Additional information available in the Trilium docs

---

## License

NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

**Built with Claude Code**

---

## 📦 Requirements

- Docker + Docker Compose
- Postgres + Neo4j (for NeoMem)
- Access to an OpenAI- or Ollama-style API
- OpenAI API key (for Relay fallback LLMs)

**Dependencies:**
- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama

---

## 🔌 Integration Notes

Lyra-Core connects to `neomem-api:8000` inside Docker or `localhost:7077` locally.

API endpoints remain identical to Mem0 (`/memories`, `/search`).

History and entity graphs are managed internally via Postgres + Neo4j.

---

## 🧱 Architecture Snapshot

    User → Relay → Cortex
                     ↓
                [RAG Search]
                     ↓
             [Reflection Loop]
                     ↓
          Intake (async summaries)
                     ↓
         NeoMem (persistent memory)

**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
- Data flow:
  - A user message enters Cortex via `/reason`.
  - Cortex assembles context:
    - Intake summaries (short-term memory)
    - RAG contextual data (knowledge base)
  - The LLM generates an initial draft (`call_llm`).
  - The reflection loop critiques and refines the answer.
  - Intake asynchronously summarizes and sends snapshots to NeoMem.

RAG API configuration: set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).
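
As a rough sketch of how a caller might honor this setting, the snippet below resolves `RAG_API_URL` with the documented default and builds a `/rag/search` request. The helper name `rag_search_request` is hypothetical, and the `query`/`where` body fields are taken from the curl example in the Query section of this README rather than from the service code. It reads the process environment only, so a `.env` file would need to be loaded first (for example with python-dotenv's `load_dotenv()`).

```python
import json
import os
import urllib.request
from typing import Mapping, Optional

DEFAULT_RAG_API_URL = "http://localhost:7090"  # default stated above


def rag_search_request(
    query: str,
    where: Optional[dict] = None,
    env: Mapping[str, str] = os.environ,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /rag/search endpoint."""
    base = env.get("RAG_API_URL", DEFAULT_RAG_API_URL).rstrip("/")
    body = {"query": query}
    if where:
        body["where"] = where  # optional metadata filter, e.g. {"category": "lyra"}
    return urllib.request.Request(
        f"{base}/rag/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` once the RAG service is running; the response format is not documented here, so the sketch stops at building the request.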

---

## Setup and Operation ##

## Beta Lyrae - RAG memory system ##
**Requirements**
  - Environment: Python 3.10+
  - Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
  - Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)

**Import Chats**
  - Chat logs must be JSON in the following format:
    ```json
    {
      "messages": [
        {
          "role": "user",
          "content": "Message here"
        },
        {
          "role": "assistant",
          "content": "Message here"
        }
      ]
    }
    ```
  - Organize the chats into category folders. This step is optional, but it helped me keep things straight.
  - Run `python3 rag_chat_import.py` to import the chats automatically. For reference, importing 68 chat logs (approx. 10.3 MB) took 32 minutes.
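
Before a long import run, it can help to check that each log matches the expected shape. The validator below is a hypothetical helper, not part of `rag_chat_import.py`; it assumes a top-level object with a `messages` list and accepts only the two roles shown in the format above, which the real importer may not enforce.

```python
import json

# The two roles shown in the format above; whether the importer
# accepts others (such as "system") is not stated.
VALID_ROLES = {"user", "assistant"}


def validate_chat(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the log looks importable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = data.get("messages") if isinstance(data, dict) else None
    if not isinstance(messages, list):
        return ['top level must be an object with a "messages" list']
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content must be a string")
    return problems
```

Running this over a folder first surfaces malformed files in seconds, before committing to a long import.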

**Build API Server**
  - Run `rag_build.py` to build the ChromaDB store automatically from the data saved in the `/chatlogs/` folder. (A docs folder will be added in the future.)
  - Run `rag_api.py` or `uvicorn rag_api:app --host 0.0.0.0 --port 7090` to start the API server.

**Query**
  - Run: `python3 rag_query.py "Question here?"`
  - For testing, a curl command can reach it too:
    ```bash
    curl -X POST http://127.0.0.1:7090/rag/search \
      -H "Content-Type: application/json" \
      -d '{
        "query": "What is the current state of Cortex and Project Lyra?",
        "where": {"category": "lyra"}
      }'
    ```

## 📖 License
NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
This fork retains the original Apache 2.0 license and adds local modifications.
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.