Project Lyra - README v0.5.0
Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat through NeoMem + Relay + Cortex, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.
Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with its own executive function. Say something in passing, and Lyra remembers it, then reminds you of it later.
Architecture Overview
Project Lyra runs as a single docker-compose deployment with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:
Core Services
1. Relay (Node.js/Express) - Port 7078
- Main orchestrator and message router
- Coordinates all module interactions
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat` - routes messages through the Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem
2. UI (Static HTML)
- Browser-based chat interface with cyberpunk theme
- Connects to Relay
- Saves and loads sessions
- OpenAI-compatible message format
3. NeoMem (Python/FastAPI) - Port 7077
- Long-term memory database (fork of Mem0 OSS)
- Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
- RESTful API: `/memories`, `/search` - semantic memory updates and retrieval
- No external SDK dependencies - fully local
Reasoning Layer
4. Cortex (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- 4-Stage Processing:
- Reflection - Generates meta-awareness notes about conversation
- Reasoning - Creates initial draft answer using context
- Refinement - Polishes and improves the draft
- Persona - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Flexible LLM router supporting multiple backends via HTTP
5. Intake v0.2 (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level simple summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- API Endpoints (see the client sketch below):
  - `POST /add_exchange` - add a conversation exchange
  - `GET /summaries?session_id={id}` - retrieve the session summary
  - `POST /close_session/{id}` - close and clean up the session
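A minimal Python client for these endpoints might look like the sketch below; the JSON body for `/add_exchange` (the `session_id`, `user`, and `assistant` fields) is an assumption, so check the Intake source for the authoritative schema.

```python
# Hypothetical Intake client sketch -- the /add_exchange payload fields are
# assumptions, not taken from the Intake source.
import requests

INTAKE_URL = "http://localhost:7080"

def add_exchange(session_id: str, user_msg: str, assistant_msg: str) -> None:
    # Record one user/assistant exchange in the session buffer.
    requests.post(
        f"{INTAKE_URL}/add_exchange",
        json={"session_id": session_id, "user": user_msg, "assistant": assistant_msg},
        timeout=10,
    ).raise_for_status()

def get_summary(session_id: str) -> dict:
    # Retrieve the current rolling summary for a session.
    resp = requests.get(
        f"{INTAKE_URL}/summaries", params={"session_id": session_id}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()
```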
LLM Backends (HTTP-based)
All LLM communication is done via HTTP APIs:
- PRIMARY: vLLM server (http://10.0.0.43:8000) - AMD MI50 GPU backend
- SECONDARY: Ollama server (http://10.0.0.3:11434) - RTX 3090 backend
- CLOUD: OpenAI API (https://api.openai.com/v1) - cloud-based models
- FALLBACK: Local backup (http://10.0.0.41:11435) - emergency fallback
Each module can be configured to use a different backend via environment variables.
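A minimal sketch of what that per-module selection could look like in Python is shown below; the environment variable names and default URLs are illustrative assumptions, not Lyra's actual configuration keys.

```python
# Illustrative only: env var names and defaults are assumptions,
# not the actual keys used by Lyra's LLM router.
import os

BACKENDS = {
    "vllm": os.getenv("VLLM_URL", "http://10.0.0.43:8000"),
    "ollama": os.getenv("OLLAMA_URL", "http://10.0.0.3:11434"),
    "openai": os.getenv("OPENAI_URL", "https://api.openai.com/v1"),
    "fallback": os.getenv("FALLBACK_URL", "http://10.0.0.41:11435"),
}

def backend_for(stage: str) -> str:
    """Resolve the base URL for a pipeline stage, e.g. 'reasoning' or 'persona'."""
    # e.g. LLM_BACKEND_REASONING=vllm, LLM_BACKEND_PERSONA=openai
    name = os.getenv(f"LLM_BACKEND_{stage.upper()}", "vllm")
    return BACKENDS.get(name, BACKENDS["fallback"])
```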
Data Flow Architecture (v0.5.0)
Normal Message Flow:
User (UI) → POST /v1/chat/completions
↓
Relay (7078)
↓ POST /reason
Cortex (7081)
↓ GET /summaries?session_id=xxx
Intake (7080) [RETURNS SUMMARY]
↓
Cortex processes (4 stages):
1. reflection.py → meta-awareness notes
2. reasoning.py → draft answer (uses LLM)
3. refine.py → refined answer (uses LLM)
4. persona/speak.py → Lyra personality (uses LLM)
↓
Returns persona answer to Relay
↓
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
↓
Intake → Background summarize → NeoMem
↓
Relay → UI (returns final response)
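The real Relay is Node.js/Express, but the orchestration above can be summarized in a short Python sketch; the endpoint paths follow the diagram, while the payload field names are assumptions.

```python
# Sketch of the Relay orchestration flow (the real Relay is Node.js/Express).
# Payload field names are assumptions based on the flow diagram above.
import httpx

CORTEX = "http://cortex:7081"
INTAKE = "http://intake:7080"

async def handle_chat(session_id: str, user_message: str) -> str:
    async with httpx.AsyncClient(timeout=120) as client:
        # 1. Ask Cortex to run the 4-stage reasoning pipeline.
        reply = await client.post(
            f"{CORTEX}/reason",
            json={"session_id": session_id, "message": user_message},
        )
        reply.raise_for_status()
        answer = reply.json().get("answer", "")

        # 2. Record the exchange in Intake so it lands in short-term memory
        #    (the real Relay does this asynchronously, without blocking the reply).
        await client.post(
            f"{INTAKE}/add_exchange",
            json={"session_id": session_id, "user": user_message, "assistant": answer},
        )
    return answer
```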
Cortex 4-Stage Reasoning Pipeline:
1. Reflection (`reflection.py`) - configurable LLM via HTTP
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"
2. Reasoning (`reasoning.py`) - configurable LLM via HTTP
   - Retrieves short-term context from Intake
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt
3. Refinement (`refine.py`) - configurable LLM via HTTP
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency
4. Persona (`speak.py`) - configurable LLM via HTTP
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to the user
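Put together, the four stages amount to four chained LLM calls over HTTP. The sketch below is illustrative only: the prompts, function signature, and the `llm(stage, prompt)` helper are assumptions, not the actual code in `reflection.py`, `reasoning.py`, `refine.py`, or `speak.py`.

```python
# Illustrative sketch of the Cortex 4-stage pipeline. Prompts, signatures,
# and the llm(stage, prompt) helper are assumptions, not the real modules.
async def run_pipeline(user_msg: str, summary: str, llm) -> str:
    # Stage 1: Reflection - meta-awareness notes, never shown to the user.
    notes = await llm("reflection", f"What is the user really asking?\n{user_msg}")

    # Stage 2: Reasoning - first draft, grounded in Intake's short-term summary.
    draft = await llm("reasoning", f"Context: {summary}\nNotes: {notes}\nUser: {user_msg}")

    # Stage 3: Refinement - tighten wording and check consistency.
    refined = await llm("refine", f"Improve this draft without changing its facts:\n{draft}")

    # Stage 4: Persona - rewrite in Lyra's voice; this is the final answer.
    return await llm("persona", f"Answer as Lyra, in her natural conversational style:\n{refined}")
```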
Features
Core Services
Relay:
- Main orchestrator and message router
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex and Intake
- Shared request handler for code reuse
- Comprehensive error handling
NeoMem (Memory Engine):
- Forked from Mem0 OSS - fully independent
- Drop-in compatible API (`/memories`, `/search`)
- Local-first: runs on FastAPI with Postgres + Neo4j
- No external SDK dependencies
- Semantic memory updates - compares embeddings and performs in-place updates
- Default service: `neomem-api` (port 7077)
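Because the API is drop-in compatible with Mem0 OSS, a memory write and a semantic search can be exercised directly over HTTP. The request bodies below follow Mem0 conventions and are assumptions; NeoMem's exact fields may differ slightly.

```python
# Sketch of NeoMem usage over HTTP. Request bodies follow Mem0 OSS
# conventions and are assumptions; consult the NeoMem API for exact fields.
import requests

NEOMEM = "http://localhost:7077"

# Store a memory derived from a conversation turn.
requests.post(
    f"{NEOMEM}/memories",
    json={
        "messages": [{"role": "user", "content": "I switched Cortex to the vLLM backend."}],
        "user_id": "demo-user",
    },
    timeout=30,
).raise_for_status()

# Later: retrieve semantically related memories.
hits = requests.post(
    f"{NEOMEM}/search",
    json={"query": "Which backend does Cortex use?", "user_id": "demo-user"},
    timeout=30,
)
print(hits.json())
```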
UI:
- Lightweight static HTML chat interface
- Cyberpunk theme
- Session save/load functionality
- OpenAI message format support
Reasoning Layer
Cortex (v0.5):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), and `/health` endpoints
Intake (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- Breaking change from v0.1: Removed cascading summaries (L1, L2, L5, L10, L20, L30)
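A stripped-down sketch of this design (per-session `deque` plus a FastAPI background task) is shown below; the endpoint payload and helper names are illustrative, not the actual Intake code.

```python
# Minimal sketch of the Intake v0.2 design: per-session circular buffer plus
# background summarization. Names and payloads are illustrative only.
from collections import defaultdict, deque
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
buffers: dict[str, deque] = defaultdict(lambda: deque(maxlen=200))  # 200 exchanges max

def summarize_and_push(session_id: str) -> None:
    # Placeholder: summarize the buffer with an LLM, then POST the result to NeoMem.
    exchanges = list(buffers[session_id])
    ...

@app.post("/add_exchange")
async def add_exchange(payload: dict, background_tasks: BackgroundTasks):
    session_id = payload["session_id"]
    buffers[session_id].append(payload)
    # Summarization runs after the response is sent, keeping the endpoint fast.
    background_tasks.add_task(summarize_and_push, session_id)
    return {"status": "queued", "buffered": len(buffers[session_id])}
```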
LLM Router:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, custom endpoints
- Per-module backend preferences
Beta Lyrae (RAG Memory DB) - added 11-3-25
- RAG Knowledge DB - Beta Lyrae (sheliak)
- This module implements the Retrieval-Augmented Generation (RAG) layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation. The system uses:
- ChromaDB for persistent vector storage
- OpenAI Embeddings (`text-embedding-3-small`) for semantic similarity
- FastAPI (port 7090) for the `/rag/search` REST endpoint
- Directory layout:

      rag/
      ├── rag_chat_import.py   # imports JSON chat logs
      ├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
      ├── rag_build.py         # legacy single-folder builder
      ├── rag_query.py         # command-line query helper
      ├── rag_api.py           # FastAPI service providing /rag/search
      ├── chromadb/            # persistent vector store
      ├── chatlogs/            # organized source data
      │   ├── poker/
      │   ├── work/
      │   ├── lyra/
      │   ├── personal/
      │   └── ...
      └── import.log           # progress log for batch runs
- **OpenAI chatlog importer**
- Takes JSON-formatted chat logs and imports them into the RAG store.
- Features include:
- Recursive folder indexing with category detection from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with nohup … &
- Metadata per chunk:

```json
{
  "chat_id": "<sha1 of filename>",
  "chunk_index": 0,
  "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
  "title": "cortex LLMs 11-1-25",
  "role": "assistant",
  "category": "lyra",
  "type": "chat",
  "file_modified": "2025-11-06T23:41:02",
  "imported_at": "2025-11-07T03:55:00Z"
}
```
Docker Deployment
All services run in a single docker-compose stack with the following containers:
- neomem-postgres - PostgreSQL with pgvector extension (port 5432)
- neomem-neo4j - Neo4j graph database (ports 7474, 7687)
- neomem-api - NeoMem memory service (port 7077)
- relay - Main orchestrator (port 7078)
- cortex - Reasoning engine (port 7081)
- intake - Short-term memory summarization (port 7080) - currently disabled
- rag - RAG search service (port 7090) - currently disabled
All containers communicate via the lyra_net Docker bridge network.
External LLM Services
The following LLM backends are accessed via HTTP (not part of docker-compose):
- vLLM Server (http://10.0.0.43:8000)
  - AMD MI50 GPU-accelerated inference
  - Custom ROCm-enabled vLLM build
  - Primary backend for reasoning and refinement stages
- Ollama Server (http://10.0.0.3:11434)
  - RTX 3090 GPU-accelerated inference
  - Secondary/configurable backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- OpenAI API (https://api.openai.com/v1)
  - Cloud-based inference
  - Used for reflection and persona stages
  - Model: gpt-4o-mini
- Fallback Server (http://10.0.0.41:11435)
  - Emergency backup endpoint
  - Local llama-3.2-8b-instruct model
Version History
v0.5.0 (2025-11-28) - Current Release
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working
v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring
v0.3.x
- Beta Lyrae RAG system
- NeoMem integration
- Basic Cortex reasoning loop
Known Issues (v0.5.0)
Non-Critical
- Session management endpoints not fully implemented in Relay
- Intake service currently disabled in docker-compose.yml
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub
Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Add request correlation IDs for tracing
- Comprehensive health checks
Quick Start
Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key)
Setup
- Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys
- Start all services with docker-compose: `docker-compose up -d`
- Check service health: `curl http://localhost:7078/_health`
- Access the UI at http://localhost:7078
Test
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello Lyra!"}],
"session_id": "test"
}'
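Because the endpoint is OpenAI-compatible, the official `openai` Python client (v1.x) can also talk to Relay. The `model` value below is a placeholder and passing `session_id` through `extra_body` is an assumption; verify both against the Relay source.

```python
# Calling Relay through the openai Python SDK (v1.x). The model name is a
# placeholder and session_id handling is an assumption; verify against Relay.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7078/v1", api_key="unused")

resp = client.chat.completions.create(
    model="lyra",  # placeholder; Relay routes to its configured backends
    messages=[{"role": "user", "content": "Hello Lyra!"}],
    extra_body={"session_id": "test"},
)
print(resp.choices[0].message.content)
```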
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
Documentation
- See CHANGELOG.md for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for the environment variable reference
- Additional information is available in the Trilium docs
License
NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). © 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
Built with Claude Code
Integration Notes
- NeoMem API is compatible with Mem0 OSS endpoints (`/memories`, `/search`)
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`
Beta Lyrae - RAG Memory System (Currently Disabled)
Note: The RAG service is currently disabled in docker-compose.yml
Requirements
- Python 3.10+
- Dependencies: `chromadb openai tqdm python-dotenv fastapi uvicorn`
- Persistent storage: `./chromadb` or `/mnt/data/lyra_rag_db`
Setup
1. Import chat logs (must be in OpenAI message format):
   python3 rag/rag_chat_import.py
2. Build and start the RAG API server:
   cd rag
   python3 rag_build.py
   uvicorn rag_api:app --host 0.0.0.0 --port 7090
3. Query the RAG system:
   curl -X POST http://127.0.0.1:7090/rag/search \
     -H "Content-Type: application/json" \
     -d '{ "query": "What is the current state of Cortex?", "where": {"category": "lyra"} }'
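For orientation, the core of a `/rag/search` handler built on ChromaDB plus OpenAI embeddings could look roughly like the sketch below; the collection name, request model, and response shape are assumptions (see `rag/rag_api.py` for the real implementation).

```python
# Rough sketch of a /rag/search handler. Collection name, request model, and
# response shape are assumptions; see rag/rag_api.py for the real implementation.
import os

import chromadb
from chromadb.utils import embedding_functions
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
embedder = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small"
)
collection = chromadb.PersistentClient(path="./chromadb").get_or_create_collection(
    name="lyra_chatlogs", embedding_function=embedder  # collection name assumed
)

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    # Semantic nearest-neighbor search, optionally filtered by metadata (e.g. category).
    results = collection.query(
        query_texts=[req.query], n_results=req.n_results, where=req.where
    )
    return {"documents": results["documents"], "metadatas": results["metadatas"]}
```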