Project Lyra - README v0.5.1

Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using NeoMem + Relay + Cortex, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.

Current Version: v0.5.1 (2025-12-11)

Mission Statement

The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with her own executive function. Say something in passing and Lyra remembers it, then reminds you of it later.


Architecture Overview

Project Lyra runs as a single docker-compose deployment with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:

Core Services

1. Relay (Node.js/Express) - Port 7078

  • Main orchestrator and message router
  • Coordinates all module interactions
  • OpenAI-compatible endpoint: POST /v1/chat/completions
  • Internal endpoint: POST /chat
  • Routes messages through Cortex reasoning pipeline
  • Manages async calls to NeoMem and Cortex ingest

2. UI (Static HTML)

  • Browser-based chat interface with cyberpunk theme
  • Connects to Relay
  • Saves and loads sessions
  • OpenAI-compatible message format

3. NeoMem (Python/FastAPI) - Port 7077

  • Long-term memory database (fork of Mem0 OSS)
  • Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
  • RESTful API: /memories, /search
  • Semantic memory updates and retrieval
  • No external SDK dependencies - fully local

Reasoning Layer

4. Cortex (Python/FastAPI) - Port 7081

  • Primary reasoning engine with multi-stage pipeline
  • Includes embedded Intake module (no separate service as of v0.5.1)
  • 4-Stage Processing:
    1. Reflection - Generates meta-awareness notes about conversation
    2. Reasoning - Creates initial draft answer using context
    3. Refinement - Polishes and improves the draft
    4. Persona - Applies Lyra's personality and speaking style
  • Integrates with Intake for short-term context via internal Python imports
  • Flexible LLM router supporting multiple backends via HTTP
  • Endpoints:
    • POST /reason - Main reasoning pipeline
    • POST /ingest - Receives conversation exchanges from Relay
    • GET /health - Service health check
    • GET /debug/sessions - Inspect in-memory SESSIONS state
    • GET /debug/summary - Test summarization for a session

5. Intake (Python Module) - Embedded in Cortex

  • No longer a standalone service - runs as Python module inside Cortex container
  • Short-term memory management with session-based circular buffer
  • In-memory SESSIONS dictionary: session_id → {buffer: deque(maxlen=200), created_at: timestamp}
  • Multi-level summarization (L1/L5/L10/L20/L30) produced by summarize_context()
  • Deferred summarization - actual summary generation happens during /reason call
  • Internal Python API:
    • add_exchange_internal(exchange) - Direct function call from Cortex
    • summarize_context(session_id, exchanges) - Async LLM-based summarization
    • SESSIONS - Module-level global state (requires single Uvicorn worker)
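
A minimal sketch of the embedded Intake state and its internal API follows (field names mirror the description above; the real cortex/intake module may differ in detail):

# intake_sketch.py - illustrative sketch of the embedded Intake state;
# the real cortex/intake module may differ in detail.
import time
from collections import deque

# Module-level singleton: session_id -> {"buffer": deque(maxlen=200), "created_at": timestamp}
SESSIONS = {}

def add_exchange_internal(exchange: dict) -> None:
    """Append one user/assistant exchange to the session's circular buffer."""
    sid = exchange.get("session_id", "default")
    if sid not in SESSIONS:
        SESSIONS[sid] = {"buffer": deque(maxlen=200), "created_at": time.time()}
    SESSIONS[sid]["buffer"].append(exchange)

# Example: the kind of payload Cortex's /ingest handler passes through
add_exchange_internal({
    "session_id": "test",
    "user_msg": "Hello",
    "assistant_msg": "Hi there!",
})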

LLM Backends (HTTP-based)

All LLM communication is done via HTTP APIs:

  • PRIMARY: llama.cpp server (http://10.0.0.44:8080) - AMD MI50 GPU backend
  • SECONDARY: Ollama server (http://10.0.0.3:11434) - RTX 3090 backend
    • Model: qwen2.5:7b-instruct-q4_K_M
  • CLOUD: OpenAI API (https://api.openai.com/v1) - Cloud-based models
    • Model: gpt-4o-mini
  • FALLBACK: Local backup (http://10.0.0.41:11435) - Emergency fallback
    • Model: llama-3.2-8b-instruct

Each module can be configured to use a different backend via environment variables.
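A minimal sketch of how this env-driven selection can work (variable names follow the examples in this README; the real llm_router.py may be organized differently):

# backend_selection_sketch.py - illustrative only; the actual llm_router.py may differ.
import os

def resolve_backend(module: str) -> dict:
    """Return the backend URL and model selected for a module via environment variables."""
    choice = os.getenv(f"{module.upper()}_LLM", "PRIMARY")       # e.g. CORTEX_LLM=SECONDARY
    return {
        "name": choice,
        "url": os.getenv(f"LLM_{choice}_URL", ""),               # e.g. LLM_SECONDARY_URL
        "model": os.getenv(f"LLM_{choice}_MODEL", ""),           # e.g. LLM_SECONDARY_MODEL
    }

print(resolve_backend("cortex"))   # backend chosen by CORTEX_LLM
print(resolve_backend("speak"))    # backend chosen by SPEAK_LLM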


Data Flow Architecture (v0.5.1)

Normal Message Flow:

User (UI) → POST /v1/chat/completions
  ↓
Relay (7078)
  ↓ POST /reason
Cortex (7081)
  ↓ (internal Python call)
Intake module → summarize_context()
  ↓
Cortex processes (4 stages):
  1. reflection.py → meta-awareness notes (CLOUD backend)
  2. reasoning.py → draft answer (PRIMARY backend)
  3. refine.py → refined answer (PRIMARY backend)
  4. persona/speak.py → Lyra personality (CLOUD backend)
  ↓
Returns persona answer to Relay
  ↓
Relay → POST /ingest (async)
  ↓
Cortex → add_exchange_internal() → SESSIONS buffer
  ↓
Relay → NeoMem /memories (async, planned)
  ↓
Relay → UI (returns final response)

Cortex 4-Stage Reasoning Pipeline:

  1. Reflection (reflection.py) - Cloud LLM (OpenAI)

    • Analyzes user intent and conversation context
    • Generates meta-awareness notes
    • "What is the user really asking?"
  2. Reasoning (reasoning.py) - Primary LLM (llama.cpp)

    • Retrieves short-term context from Intake module
    • Creates initial draft answer
    • Integrates context, reflection notes, and user prompt
  3. Refinement (refine.py) - Primary LLM (llama.cpp)

    • Polishes the draft answer
    • Improves clarity and coherence
    • Ensures factual consistency
  4. Persona (speak.py) - Cloud LLM (OpenAI)

    • Applies Lyra's personality and speaking style
    • Natural, conversational output
    • Final answer returned to user
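
A conceptual sketch of how the four stages chain together inside /reason (the stage functions are stubbed here so the flow is runnable; the real implementations in reflection.py, reasoning.py, refine.py, and speak.py call the LLM backends over HTTP):

# reason_pipeline_sketch.py - conceptual sketch of the /reason flow, with stubbed stages.
import asyncio

async def reflect(prompt, context):        return f"[notes about: {prompt}]"         # Stage 1 (CLOUD)
async def reason(prompt, context, notes):  return f"[draft answer to: {prompt}]"     # Stage 2 (PRIMARY)
async def refine(draft, notes):            return draft.replace("draft", "refined")  # Stage 3 (PRIMARY)
async def speak(refined):                  return f"Lyra: {refined}"                 # Stage 4 (CLOUD)

async def run_pipeline(session_id: str, user_prompt: str, context: str) -> str:
    notes   = await reflect(user_prompt, context)        # meta-awareness notes
    draft   = await reason(user_prompt, context, notes)  # initial draft using short-term context
    refined = await refine(draft, notes)                 # polished draft
    return await speak(refined)                          # persona answer returned to Relay

print(asyncio.run(run_pipeline("test", "Hello Lyra!", context="(short-term summary from Intake)")))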

Features

Core Services

Relay:

  • Main orchestrator and message router
  • OpenAI-compatible endpoint: POST /v1/chat/completions
  • Internal endpoint: POST /chat
  • Health check: GET /_health
  • Async non-blocking calls to Cortex
  • Shared request handler for code reuse
  • Comprehensive error handling

NeoMem (Memory Engine):

  • Forked from Mem0 OSS - fully independent
  • Drop-in compatible API (/memories, /search)
  • Local-first: runs on FastAPI with Postgres + Neo4j
  • No external SDK dependencies
  • Semantic memory updates - compares embeddings and performs in-place updates
  • Default service: neomem-api (port 7077)
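
Since NeoMem keeps Mem0-style /memories and /search endpoints, a hedged usage example might look like the following (the exact payload fields are an assumption based on Mem0 OSS; check your deployment's API):

# neomem_example.py - hedged example; payload shapes assume Mem0-OSS-style endpoints.
import requests

BASE = "http://localhost:7077"

# Store a memory from a short exchange (field names may differ in your NeoMem build)
requests.post(f"{BASE}/memories", json={
    "messages": [{"role": "user", "content": "We moved the reasoning stage to the llama.cpp box."}],
    "user_id": "demo",
}).raise_for_status()

# Search memories semantically
hits = requests.post(f"{BASE}/search", json={"query": "where does reasoning run?", "user_id": "demo"})
print(hits.json())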

UI:

  • Lightweight static HTML chat interface
  • Cyberpunk theme
  • Session save/load functionality
  • OpenAI message format support

Reasoning Layer

Cortex (v0.5.1):

  • Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
  • Flexible LLM backend routing via HTTP
  • Per-stage backend selection
  • Async processing throughout
  • Embedded Intake module for short-term context
  • /reason, /ingest, /health, /debug/sessions, /debug/summary endpoints
  • Lenient error handling - never fails the chat pipeline

Intake (Embedded Module):

  • Architectural change: Now runs as Python module inside Cortex container
  • In-memory SESSIONS management (session_id → buffer)
  • Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
  • Deferred summarization strategy - summaries generated during /reason call
  • bg_summarize() is a logging stub - actual work deferred
  • Single-worker constraint: SESSIONS requires single Uvicorn worker or Redis/shared storage
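
A rough sketch of the deferred-summarization idea described above (the summary-level windows and the omitted LLM call are simplifications; the real bg_summarize() and summarize_context() may differ):

# deferred_summarization_sketch.py - illustrative only; the real module calls the INTAKE_LLM backend.
from collections import deque

LEVELS = {"L1": 1, "L5": 5, "L10": 10, "L20": 20, "L30": 30}   # exchanges per summary level (assumed)

def bg_summarize(session_id: str) -> None:
    # Logging stub only: real summarization is deferred until the next /reason call.
    print(f"[intake] summarization for {session_id} deferred to /reason")

async def summarize_context(session_id: str, exchanges: deque) -> dict:
    """Build multi-level summaries over the most recent exchanges (LLM call omitted)."""
    summaries = {}
    for level, n in LEVELS.items():
        window = list(exchanges)[-n:]
        # The real module sends this window to an LLM; here we just join it to show the shape.
        summaries[level] = " | ".join(e.get("user_msg", "") for e in window)
    return summaries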

LLM Router:

  • Dynamic backend selection via HTTP
  • Environment-driven configuration
  • Support for llama.cpp, Ollama, OpenAI, custom endpoints
  • Per-module backend preferences:
    • CORTEX_LLM=SECONDARY (Ollama for reasoning)
    • INTAKE_LLM=PRIMARY (llama.cpp for summarization)
    • SPEAK_LLM=OPENAI (Cloud for persona)
    • NEOMEM_LLM=PRIMARY (llama.cpp for memory operations)

Beta Lyrae (RAG Memory DB) - Currently Disabled

  • RAG Knowledge DB - Beta Lyrae (sheliak)
  • Implements the Retrieval-Augmented Generation (RAG) layer for Project Lyra
  • Serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation
  • Status: Disabled in docker-compose.yml (v0.5.1)

The system uses:

  • ChromaDB for persistent vector storage
  • OpenAI Embeddings (text-embedding-3-small) for semantic similarity
  • FastAPI (port 7090) for the /rag/search REST endpoint
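
A simplified sketch of what the /rag/search service can look like with these pieces (collection name, response shape, and the embedding setup below are assumptions; the real rag_api.py may differ):

# rag_api_sketch.py - simplified sketch of the /rag/search service; not the real rag_api.py.
import os

import chromadb
from chromadb.utils import embedding_functions
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
embed = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small")
client = chromadb.PersistentClient(path="./chromadb")                      # persistent vector store
collection = client.get_or_create_collection("chatlogs", embedding_function=embed)

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None      # optional metadata filter, e.g. {"category": "lyra"}
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    res = collection.query(query_texts=[req.query], n_results=req.n_results, where=req.where)
    return {"documents": res["documents"], "metadatas": res["metadatas"]}

Run it with: uvicorn rag_api_sketch:app --host 0.0.0.0 --port 7090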

Directory Layout:

rag/
├── rag_chat_import.py    # imports JSON chat logs
├── rag_docs_import.py    # (planned) PDF/EPUB/manual importer
├── rag_build.py          # legacy single-folder builder
├── rag_query.py          # command-line query helper
├── rag_api.py            # FastAPI service providing /rag/search
├── chromadb/             # persistent vector store
├── chatlogs/             # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log            # progress log for batch runs

OpenAI chatlog importer features:

  • Recursive folder indexing with category detection from directory name
  • Smart chunking for long messages (5,000 chars per slice)
  • Automatic deduplication using SHA-1 hash of file + chunk
  • Timestamps for both file modification and import time
  • Full progress logging via tqdm
  • Safe to run in background with nohup … &
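
The chunking and deduplication steps can be illustrated with a short sketch (the chunk-id scheme is an assumption; the real rag_chat_import.py may hash differently):

# import_dedup_sketch.py - illustrative sketch of 5,000-char chunking plus SHA-1 dedup ids.
import hashlib

CHUNK_CHARS = 5000   # long messages are sliced into ~5,000-character chunks

def chunk_message(text: str) -> list[str]:
    """Slice a long message into fixed-size chunks."""
    return [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)] or [""]

def chunk_id(file_path: str, chunk: str) -> str:
    """Stable SHA-1 of file + chunk, usable as a document id so re-imports deduplicate."""
    return hashlib.sha1(f"{file_path}:{chunk}".encode("utf-8")).hexdigest()

for i, c in enumerate(chunk_message("hello world " * 1000)):
    print(i, chunk_id("chatlogs/lyra/session1.json", c)[:12], len(c))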

Docker Deployment

All services run in a single docker-compose stack with the following containers:

Active Services:

  • neomem-postgres - PostgreSQL with pgvector extension (port 5432)
  • neomem-neo4j - Neo4j graph database (ports 7474, 7687)
  • neomem-api - NeoMem memory service (port 7077)
  • relay - Main orchestrator (port 7078)
  • cortex - Reasoning engine with embedded Intake (port 7081)

Disabled Services:

  • intake - No longer needed (embedded in Cortex as of v0.5.1)
  • rag - Beta Lyrae RAG service (port 7090) - currently disabled

All containers communicate via the lyra_net Docker bridge network.

External LLM Services

The following LLM backends are accessed via HTTP (not part of docker-compose):

  • llama.cpp Server (http://10.0.0.44:8080)

    • AMD MI50 GPU-accelerated inference
    • Primary backend for reasoning and refinement stages
    • Model path: /model
  • Ollama Server (http://10.0.0.3:11434)

    • RTX 3090 GPU-accelerated inference
    • Secondary/configurable backend
    • Model: qwen2.5:7b-instruct-q4_K_M
  • OpenAI API (https://api.openai.com/v1)

    • Cloud-based inference
    • Used for reflection and persona stages
    • Model: gpt-4o-mini
  • Fallback Server (http://10.0.0.41:11435)

    • Emergency backup endpoint
    • Local llama-3.2-8b-instruct model

Version History

v0.5.1 (2025-12-11) - Current Release

Critical Intake Integration Fixes:

  • Fixed bg_summarize() NameError preventing SESSIONS persistence
  • Fixed /ingest endpoint unreachable code
  • Added cortex/intake/__init__.py for proper package structure
  • Added diagnostic logging to verify SESSIONS singleton behavior
  • Added /debug/sessions and /debug/summary endpoints
  • Documented single-worker constraint in Dockerfile
  • Implemented lenient error handling (never fails chat pipeline)
  • Intake now embedded in Cortex - no longer standalone service

Architecture Changes:

  • Intake module runs inside Cortex container as pure Python import
  • No HTTP calls between Cortex and Intake (internal function calls)
  • SESSIONS persist correctly in Uvicorn worker
  • Deferred summarization strategy (summaries generated during /reason)

v0.5.0 (2025-11-28)

  • Fixed all critical API wiring issues
  • Added OpenAI-compatible endpoint to Relay (/v1/chat/completions)
  • Fixed Cortex → Intake integration
  • Added missing Python package __init__.py files
  • End-to-end message flow verified and working

Infrastructure v1.0.0 (2025-11-26)

  • Consolidated 9 scattered .env files into single source of truth
  • Multi-backend LLM strategy implemented
  • Docker Compose consolidation
  • Created .env.example security templates

v0.4.x (Major Rewire)

  • Cortex multi-stage reasoning pipeline
  • LLM router with multi-backend support
  • Major architectural restructuring

v0.3.x

  • Beta Lyrae RAG system
  • NeoMem integration
  • Basic Cortex reasoning loop

Known Issues (v0.5.1)

Critical (Fixed in v0.5.1)

  • Intake SESSIONS not persisting (FIXED)
  • bg_summarize() NameError (FIXED)
  • /ingest endpoint unreachable code (FIXED)

Non-Critical

  • Session management endpoints not fully implemented in Relay
  • RAG service currently disabled in docker-compose.yml
  • NeoMem integration in Relay not yet active (planned for v0.5.2)

Operational Notes

  • Single-worker constraint: Cortex must run with single Uvicorn worker to maintain SESSIONS state
    • Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
  • Diagnostic endpoints (/debug/sessions, /debug/summary) available for troubleshooting

Future Enhancements

  • Re-enable RAG service integration
  • Implement full session persistence
  • Migrate SESSIONS to Redis for multi-worker support
  • Add request correlation IDs for tracing
  • Comprehensive health checks across all services
  • NeoMem integration in Relay

Quick Start

Prerequisites

  • Docker + Docker Compose
  • At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)

Setup

  1. Copy .env.example to .env and configure your LLM backend URLs and API keys:

    # Required: Configure at least one LLM backend
    LLM_PRIMARY_URL=http://10.0.0.44:8080       # llama.cpp
    LLM_SECONDARY_URL=http://10.0.0.3:11434     # Ollama
    OPENAI_API_KEY=sk-...                        # OpenAI
    
  2. Start all services with docker-compose:

    docker-compose up -d
    
  3. Check service health:

    # Relay health
    curl http://localhost:7078/_health
    
    # Cortex health
    curl http://localhost:7081/health
    
    # NeoMem health
    curl http://localhost:7077/health
    
  4. Access the UI at http://localhost:7078

Test

Test Relay → Cortex pipeline:

curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Lyra!"}],
    "session_id": "test"
  }'

Test Cortex /ingest endpoint:

curl -X POST http://localhost:7081/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "test",
    "user_msg": "Hello",
    "assistant_msg": "Hi there!"
  }'

Inspect SESSIONS state:

curl http://localhost:7081/debug/sessions

Get summary for a session:

curl "http://localhost:7081/debug/summary?session_id=test"

All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.


Environment Variables

LLM Backend Configuration

Backend URLs (Full API endpoints):

LLM_PRIMARY_URL=http://10.0.0.44:8080           # llama.cpp
LLM_PRIMARY_MODEL=/model

LLM_SECONDARY_URL=http://10.0.0.3:11434         # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

Module-specific backend selection:

CORTEX_LLM=SECONDARY      # Use Ollama for reasoning
INTAKE_LLM=PRIMARY        # Use llama.cpp for summarization
SPEAK_LLM=OPENAI          # Use OpenAI for persona
NEOMEM_LLM=PRIMARY        # Use llama.cpp for memory
UI_LLM=OPENAI             # Use OpenAI for UI
RELAY_LLM=PRIMARY         # Use llama.cpp for relay

Database Configuration

POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432

NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph

Service URLs (Internal Docker Network)

NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078

Feature Flags

CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true

For complete environment variable reference, see ENVIRONMENT_VARIABLES.md.


Documentation


Troubleshooting

SESSIONS not persisting

Symptom: Intake buffer always shows 0 exchanges, summaries always empty.

Solution (Fixed in v0.5.1):

  • Ensure cortex/intake/__init__.py exists
  • Check Cortex logs for [Intake Module Init] message showing SESSIONS object ID
  • Verify single-worker mode (Dockerfile: uvicorn main:app --workers 1)
  • Use /debug/sessions endpoint to inspect current state

Cortex connection errors

Symptom: Relay can't reach Cortex, 502 errors.

Solution:

  • Verify Cortex container is running: docker ps | grep cortex
  • Check Cortex health: curl http://localhost:7081/health
  • Verify environment variables: CORTEX_REASON_URL=http://cortex:7081/reason
  • Check docker network: docker network inspect lyra_net

LLM backend timeouts

Symptom: Reasoning stage hangs or times out.

Solution:

  • Verify LLM backend is running and accessible
  • Check LLM backend health: curl http://10.0.0.44:8080/health
  • Increase timeout in llm_router.py if using slow models
  • Check logs for specific backend errors
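
If the router uses an async HTTP client such as httpx (an assumption; check llm_router.py), a longer read timeout for slow local models could look like this:

# timeout_sketch.py - assumes an httpx-based router and OpenAI-compatible chat endpoints.
import httpx

LLM_TIMEOUT = httpx.Timeout(connect=10.0, read=300.0, write=30.0, pool=10.0)  # generous read timeout

async def call_backend(url: str, payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=LLM_TIMEOUT) as client:
        resp = await client.post(f"{url}/v1/chat/completions", json=payload)
        resp.raise_for_status()
        return resp.json()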

License

NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). © 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

Built with Claude Code


Integration Notes

  • NeoMem API is compatible with Mem0 OSS endpoints (/memories, /search)
  • All services communicate via Docker internal networking on the lyra_net bridge
  • History and entity graphs are managed via PostgreSQL + Neo4j
  • LLM backends are accessed via HTTP and configured in .env
  • Intake module is imported internally by Cortex (no HTTP communication)
  • SESSIONS state is maintained in-memory within Cortex container

Beta Lyrae - RAG Memory System (Currently Disabled)

Note: The RAG service is currently disabled in docker-compose.yml

Requirements

  • Python 3.10+
  • Dependencies: chromadb openai tqdm python-dotenv fastapi uvicorn
  • Persistent storage: ./chromadb or /mnt/data/lyra_rag_db

Setup

  1. Import chat logs (must be in OpenAI message format):

    python3 rag/rag_chat_import.py
    
  2. Build and start the RAG API server:

    cd rag
    python3 rag_build.py
    uvicorn rag_api:app --host 0.0.0.0 --port 7090
    
  3. Query the RAG system:

    curl -X POST http://127.0.0.1:7090/rag/search \
      -H "Content-Type: application/json" \
      -d '{
        "query": "What is the current state of Cortex?",
        "where": {"category": "lyra"}
      }'
    

Development Notes

Cortex Architecture (v0.5.1)

  • Cortex contains embedded Intake module at cortex/intake/
  • Intake is imported as: from intake.intake import add_exchange_internal, SESSIONS
  • SESSIONS is a module-level global dictionary (singleton pattern)
  • Single-worker constraint required to maintain SESSIONS state
  • Diagnostic endpoints available for debugging: /debug/sessions, /debug/summary
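
For reference, a sketch of how the /ingest handler can call the embedded module directly, with no HTTP hop (the response shape here is an assumption; the actual Cortex handler may differ):

# ingest_sketch.py - sketch of Cortex's /ingest calling the embedded Intake module directly.
from fastapi import FastAPI
from pydantic import BaseModel

from intake.intake import add_exchange_internal, SESSIONS   # internal import, same container

app = FastAPI()

class Exchange(BaseModel):
    session_id: str
    user_msg: str
    assistant_msg: str

@app.post("/ingest")
def ingest(exchange: Exchange):
    add_exchange_internal(exchange.dict())    # append to the session's circular buffer
    buffered = len(SESSIONS.get(exchange.session_id, {}).get("buffer", []))
    return {"status": "ok", "buffered_exchanges": buffered}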

Adding New LLM Backends

  1. Add backend URL to .env:

    LLM_CUSTOM_URL=http://your-backend:port
    LLM_CUSTOM_MODEL=model-name
    
  2. Configure module to use new backend:

    CORTEX_LLM=CUSTOM
    
  3. Restart Cortex container:

    docker-compose restart cortex
    

Debugging Tips

  • Enable verbose logging: VERBOSE_DEBUG=true in .env
  • Check Cortex logs: docker logs cortex -f
  • Inspect SESSIONS: curl http://localhost:7081/debug/sessions
  • Test summarization: curl "http://localhost:7081/debug/summary?session_id=test"
  • Check Relay logs: docker logs relay -f
  • Monitor Docker network: docker network inspect lyra_net