Project Lyra - README v0.5.1

Lyra is a modular, persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using NeoMem + Relay + Cortex, with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.

Current Version: v0.5.1 (2025-12-11)

Mission Statement

The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with her own executive function. Say something in passing and Lyra remembers it, then reminds you of it later.


Architecture Overview

Project Lyra runs as a single docker-compose deployment with multiple Docker containers networked together in a microservices architecture. Just as the brain has regions, Lyra has modules:

Core Services

1. Relay (Node.js/Express) - Port 7078

  • Main orchestrator and message router
  • Coordinates all module interactions
  • OpenAI-compatible endpoint: POST /v1/chat/completions
  • Internal endpoint: POST /chat
  • Routes messages through Cortex reasoning pipeline
  • Manages async calls to NeoMem and Cortex ingest

2. UI (Static HTML)

  • Browser-based chat interface with cyberpunk theme
  • Connects to Relay
  • Saves and loads sessions
  • OpenAI-compatible message format

3. NeoMem (Python/FastAPI) - Port 7077

  • Long-term memory database (fork of Mem0 OSS)
  • Vector storage (PostgreSQL + pgvector) + Graph storage (Neo4j)
  • RESTful API: /memories, /search
  • Semantic memory updates and retrieval
  • No external SDK dependencies - fully local

Reasoning Layer

4. Cortex (Python/FastAPI) - Port 7081

  • Primary reasoning engine with multi-stage pipeline
  • Includes embedded Intake module (no separate service as of v0.5.1)
  • 4-Stage Processing:
    1. Reflection - Generates meta-awareness notes about conversation
    2. Reasoning - Creates initial draft answer using context
    3. Refinement - Polishes and improves the draft
    4. Persona - Applies Lyra's personality and speaking style
  • Integrates with Intake for short-term context via internal Python imports
  • Flexible LLM router supporting multiple backends via HTTP
  • Endpoints:
    • POST /reason - Main reasoning pipeline
    • POST /ingest - Receives conversation exchanges from Relay
    • GET /health - Service health check
    • GET /debug/sessions - Inspect in-memory SESSIONS state
    • GET /debug/summary - Test summarization for a session

5. Intake (Python Module) - Embedded in Cortex

  • No longer a standalone service - runs as Python module inside Cortex container
  • Short-term memory management with session-based circular buffer
  • In-memory SESSIONS dictionary: session_id → {buffer: deque(maxlen=200), created_at: timestamp}
  • Multi-level summarization (L1/L5/L10/L20/L30) produced by summarize_context()
  • Deferred summarization - actual summary generation happens during /reason call
  • Internal Python API:
    • add_exchange_internal(exchange) - Direct function call from Cortex
    • summarize_context(session_id, exchanges) - Async LLM-based summarization
    • SESSIONS - Module-level global state (requires single Uvicorn worker)
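
A minimal sketch of the embedded Intake state and its internal API follows (field names mirror the description above; the real cortex/intake module may differ in detail):

# intake_sketch.py - illustrative sketch of the embedded Intake state;
# the real cortex/intake module may differ in detail.
import time
from collections import deque

# Module-level singleton: session_id -> {"buffer": deque(maxlen=200), "created_at": timestamp}
SESSIONS = {}

def add_exchange_internal(exchange: dict) -> None:
    """Append one user/assistant exchange to the session's circular buffer."""
    sid = exchange.get("session_id", "default")
    if sid not in SESSIONS:
        SESSIONS[sid] = {"buffer": deque(maxlen=200), "created_at": time.time()}
    SESSIONS[sid]["buffer"].append(exchange)

# Example: the kind of payload Cortex's /ingest handler passes through
add_exchange_internal({
    "session_id": "test",
    "user_msg": "Hello",
    "assistant_msg": "Hi there!",
})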

LLM Backends (HTTP-based)

All LLM communication is done via HTTP APIs:

  • PRIMARY: llama.cpp server (http://10.0.0.44:8080) - AMD MI50 GPU backend
  • SECONDARY: Ollama server (http://10.0.0.3:11434) - RTX 3090 backend
    • Model: qwen2.5:7b-instruct-q4_K_M
  • CLOUD: OpenAI API (https://api.openai.com/v1) - Cloud-based models
    • Model: gpt-4o-mini
  • FALLBACK: Local backup (http://10.0.0.41:11435) - Emergency fallback
    • Model: llama-3.2-8b-instruct

Each module can be configured to use a different backend via environment variables.
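A minimal sketch of how this env-driven selection can work (variable names follow the examples in this README; the real llm_router.py may be organized differently):

# backend_selection_sketch.py - illustrative only; the actual llm_router.py may differ.
import os

def resolve_backend(module: str) -> dict:
    """Return the backend URL and model selected for a module via environment variables."""
    choice = os.getenv(f"{module.upper()}_LLM", "PRIMARY")       # e.g. CORTEX_LLM=SECONDARY
    return {
        "name": choice,
        "url": os.getenv(f"LLM_{choice}_URL", ""),               # e.g. LLM_SECONDARY_URL
        "model": os.getenv(f"LLM_{choice}_MODEL", ""),           # e.g. LLM_SECONDARY_MODEL
    }

print(resolve_backend("cortex"))   # backend chosen by CORTEX_LLM
print(resolve_backend("speak"))    # backend chosen by SPEAK_LLM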


Data Flow Architecture (v0.5.1)

Normal Message Flow:

User (UI) → POST /v1/chat/completions
  ↓
Relay (7078)
  ↓ POST /reason
Cortex (7081)
  ↓ (internal Python call)
Intake module → summarize_context()
  ↓
Cortex processes (4 stages):
  1. reflection.py → meta-awareness notes (CLOUD backend)
  2. reasoning.py → draft answer (PRIMARY backend)
  3. refine.py → refined answer (PRIMARY backend)
  4. persona/speak.py → Lyra personality (CLOUD backend)
  ↓
Returns persona answer to Relay
  ↓
Relay → POST /ingest (async)
  ↓
Cortex → add_exchange_internal() → SESSIONS buffer
  ↓
Relay → NeoMem /memories (async, planned)
  ↓
Relay → UI (returns final response)

Cortex 4-Stage Reasoning Pipeline:

  1. Reflection (reflection.py) - Cloud LLM (OpenAI)

    • Analyzes user intent and conversation context
    • Generates meta-awareness notes
    • "What is the user really asking?"
  2. Reasoning (reasoning.py) - Primary LLM (llama.cpp)

    • Retrieves short-term context from Intake module
    • Creates initial draft answer
    • Integrates context, reflection notes, and user prompt
  3. Refinement (refine.py) - Primary LLM (llama.cpp)

    • Polishes the draft answer
    • Improves clarity and coherence
    • Ensures factual consistency
  4. Persona (speak.py) - Cloud LLM (OpenAI)

    • Applies Lyra's personality and speaking style
    • Natural, conversational output
    • Final answer returned to user
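
A conceptual sketch of how the four stages chain together inside /reason (the stage functions are stubbed here so the flow is runnable; the real implementations in reflection.py, reasoning.py, refine.py, and speak.py call the LLM backends over HTTP):

# reason_pipeline_sketch.py - conceptual sketch of the /reason flow, with stubbed stages.
import asyncio

async def reflect(prompt, context):        return f"[notes about: {prompt}]"         # Stage 1 (CLOUD)
async def reason(prompt, context, notes):  return f"[draft answer to: {prompt}]"     # Stage 2 (PRIMARY)
async def refine(draft, notes):            return draft.replace("draft", "refined")  # Stage 3 (PRIMARY)
async def speak(refined):                  return f"Lyra: {refined}"                 # Stage 4 (CLOUD)

async def run_pipeline(session_id: str, user_prompt: str, context: str) -> str:
    notes   = await reflect(user_prompt, context)        # meta-awareness notes
    draft   = await reason(user_prompt, context, notes)  # initial draft using short-term context
    refined = await refine(draft, notes)                 # polished draft
    return await speak(refined)                          # persona answer returned to Relay

print(asyncio.run(run_pipeline("test", "Hello Lyra!", context="(short-term summary from Intake)")))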

Features

Core Services

Relay:

  • Main orchestrator and message router
  • OpenAI-compatible endpoint: POST /v1/chat/completions
  • Internal endpoint: POST /chat
  • Health check: GET /_health
  • Async non-blocking calls to Cortex
  • Shared request handler for code reuse
  • Comprehensive error handling

NeoMem (Memory Engine):

  • Forked from Mem0 OSS - fully independent
  • Drop-in compatible API (/memories, /search)
  • Local-first: runs on FastAPI with Postgres + Neo4j
  • No external SDK dependencies
  • Semantic memory updates - compares embeddings and performs in-place updates
  • Default service: neomem-api (port 7077)
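
Since NeoMem keeps Mem0-style /memories and /search endpoints, a hedged usage example might look like the following (the exact payload fields are an assumption based on Mem0 OSS; check your deployment's API):

# neomem_example.py - hedged example; payload shapes assume Mem0-OSS-style endpoints.
import requests

BASE = "http://localhost:7077"

# Store a memory from a short exchange (field names may differ in your NeoMem build)
requests.post(f"{BASE}/memories", json={
    "messages": [{"role": "user", "content": "We moved the reasoning stage to the llama.cpp box."}],
    "user_id": "demo",
}).raise_for_status()

# Search memories semantically
hits = requests.post(f"{BASE}/search", json={"query": "where does reasoning run?", "user_id": "demo"})
print(hits.json())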

UI:

  • Lightweight static HTML chat interface
  • Cyberpunk theme
  • Session save/load functionality
  • OpenAI message format support

Reasoning Layer

Cortex (v0.5.1):

  • Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
  • Flexible LLM backend routing via HTTP
  • Per-stage backend selection
  • Async processing throughout
  • Embedded Intake module for short-term context
  • /reason, /ingest, /health, /debug/sessions, /debug/summary endpoints
  • Lenient error handling - never fails the chat pipeline

Intake (Embedded Module):

  • Architectural change: Now runs as Python module inside Cortex container
  • In-memory SESSIONS management (session_id → buffer)
  • Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
  • Deferred summarization strategy - summaries generated during /reason call
  • bg_summarize() is a logging stub - actual work deferred
  • Single-worker constraint: SESSIONS requires single Uvicorn worker or Redis/shared storage
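
A rough sketch of the deferred-summarization idea described above (the summary-level windows and the omitted LLM call are simplifications; the real bg_summarize() and summarize_context() may differ):

# deferred_summarization_sketch.py - illustrative only; the real module calls the INTAKE_LLM backend.
from collections import deque

LEVELS = {"L1": 1, "L5": 5, "L10": 10, "L20": 20, "L30": 30}   # exchanges per summary level (assumed)

def bg_summarize(session_id: str) -> None:
    # Logging stub only: real summarization is deferred until the next /reason call.
    print(f"[intake] summarization for {session_id} deferred to /reason")

async def summarize_context(session_id: str, exchanges: deque) -> dict:
    """Build multi-level summaries over the most recent exchanges (LLM call omitted)."""
    summaries = {}
    for level, n in LEVELS.items():
        window = list(exchanges)[-n:]
        # The real module sends this window to an LLM; here we just join it to show the shape.
        summaries[level] = " | ".join(e.get("user_msg", "") for e in window)
    return summaries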

LLM Router:

  • Dynamic backend selection via HTTP
  • Environment-driven configuration
  • Support for llama.cpp, Ollama, OpenAI, custom endpoints
  • Per-module backend preferences:
    • CORTEX_LLM=SECONDARY (Ollama for reasoning)
    • INTAKE_LLM=PRIMARY (llama.cpp for summarization)
    • SPEAK_LLM=OPENAI (Cloud for persona)
    • NEOMEM_LLM=PRIMARY (llama.cpp for memory operations)

Beta Lyrae (RAG Memory DB) - Currently Disabled

  • RAG Knowledge DB - Beta Lyrae (sheliak)
  • Implements the Retrieval-Augmented Generation (RAG) layer for Project Lyra
  • Serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation
  • Status: Disabled in docker-compose.yml (v0.5.1)

The system uses:

  • ChromaDB for persistent vector storage
  • OpenAI Embeddings (text-embedding-3-small) for semantic similarity
  • FastAPI (port 7090) for the /rag/search REST endpoint
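
A simplified sketch of what the /rag/search service can look like with these pieces (collection name, response shape, and the embedding setup below are assumptions; the real rag_api.py may differ):

# rag_api_sketch.py - simplified sketch of the /rag/search service; not the real rag_api.py.
import os

import chromadb
from chromadb.utils import embedding_functions
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
embed = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-small")
client = chromadb.PersistentClient(path="./chromadb")                      # persistent vector store
collection = client.get_or_create_collection("chatlogs", embedding_function=embed)

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None      # optional metadata filter, e.g. {"category": "lyra"}
    n_results: int = 5

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    res = collection.query(query_texts=[req.query], n_results=req.n_results, where=req.where)
    return {"documents": res["documents"], "metadatas": res["metadatas"]}

Run it with: uvicorn rag_api_sketch:app --host 0.0.0.0 --port 7090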

Directory Layout:

rag/
├── rag_chat_import.py    # imports JSON chat logs
├── rag_docs_import.py    # (planned) PDF/EPUB/manual importer
├── rag_build.py          # legacy single-folder builder
├── rag_query.py          # command-line query helper
├── rag_api.py            # FastAPI service providing /rag/search
├── chromadb/             # persistent vector store
├── chatlogs/             # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log            # progress log for batch runs

OpenAI chatlog importer features:

  • Recursive folder indexing with category detection from directory name
  • Smart chunking for long messages (5,000 chars per slice)
  • Automatic deduplication using SHA-1 hash of file + chunk
  • Timestamps for both file modification and import time
  • Full progress logging via tqdm
  • Safe to run in background with nohup … &
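
The chunking and deduplication steps can be illustrated with a short sketch (the chunk-id scheme is an assumption; the real rag_chat_import.py may hash differently):

# import_dedup_sketch.py - illustrative sketch of 5,000-char chunking plus SHA-1 dedup ids.
import hashlib

CHUNK_CHARS = 5000   # long messages are sliced into ~5,000-character chunks

def chunk_message(text: str) -> list[str]:
    """Slice a long message into fixed-size chunks."""
    return [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)] or [""]

def chunk_id(file_path: str, chunk: str) -> str:
    """Stable SHA-1 of file + chunk, usable as a document id so re-imports deduplicate."""
    return hashlib.sha1(f"{file_path}:{chunk}".encode("utf-8")).hexdigest()

for i, c in enumerate(chunk_message("hello world " * 1000)):
    print(i, chunk_id("chatlogs/lyra/session1.json", c)[:12], len(c))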

Docker Deployment

All services run in a single docker-compose stack with the following containers:

Active Services:

  • neomem-postgres - PostgreSQL with pgvector extension (port 5432)
  • neomem-neo4j - Neo4j graph database (ports 7474, 7687)
  • neomem-api - NeoMem memory service (port 7077)
  • relay - Main orchestrator (port 7078)
  • cortex - Reasoning engine with embedded Intake (port 7081)

Disabled Services:

  • intake - No longer needed (embedded in Cortex as of v0.5.1)
  • rag - Beta Lyrae RAG service (port 7090) - currently disabled

All containers communicate via the lyra_net Docker bridge network.

External LLM Services

The following LLM backends are accessed via HTTP (not part of docker-compose):

  • llama.cpp Server (http://10.0.0.44:8080)

    • AMD MI50 GPU-accelerated inference
    • Primary backend for reasoning and refinement stages
    • Model path: /model
  • Ollama Server (http://10.0.0.3:11434)

    • RTX 3090 GPU-accelerated inference
    • Secondary/configurable backend
    • Model: qwen2.5:7b-instruct-q4_K_M
  • OpenAI API (https://api.openai.com/v1)

    • Cloud-based inference
    • Used for reflection and persona stages
    • Model: gpt-4o-mini
  • Fallback Server (http://10.0.0.41:11435)

    • Emergency backup endpoint
    • Local llama-3.2-8b-instruct model

Version History

v0.5.1 (2025-12-11) - Current Release

Critical Intake Integration Fixes:

  • Fixed bg_summarize() NameError preventing SESSIONS persistence
  • Fixed /ingest endpoint unreachable code
  • Added cortex/intake/__init__.py for proper package structure
  • Added diagnostic logging to verify SESSIONS singleton behavior
  • Added /debug/sessions and /debug/summary endpoints
  • Documented single-worker constraint in Dockerfile
  • Implemented lenient error handling (never fails chat pipeline)
  • Intake now embedded in Cortex - no longer standalone service

Architecture Changes:

  • Intake module runs inside Cortex container as pure Python import
  • No HTTP calls between Cortex and Intake (internal function calls)
  • SESSIONS persist correctly in Uvicorn worker
  • Deferred summarization strategy (summaries generated during /reason)

v0.5.0 (2025-11-28)

  • Fixed all critical API wiring issues
  • Added OpenAI-compatible endpoint to Relay (/v1/chat/completions)
  • Fixed Cortex → Intake integration
  • Added missing Python package __init__.py files
  • End-to-end message flow verified and working

Infrastructure v1.0.0 (2025-11-26)

  • Consolidated 9 scattered .env files into single source of truth
  • Multi-backend LLM strategy implemented
  • Docker Compose consolidation
  • Created .env.example security templates

v0.4.x (Major Rewire)

  • Cortex multi-stage reasoning pipeline
  • LLM router with multi-backend support
  • Major architectural restructuring

v0.3.x

  • Beta Lyrae RAG system
  • NeoMem integration
  • Basic Cortex reasoning loop

Known Issues (v0.5.1)

Critical (Fixed in v0.5.1)

  • Intake SESSIONS not persisting (FIXED)
  • bg_summarize() NameError (FIXED)
  • /ingest endpoint unreachable code (FIXED)

Non-Critical

  • Session management endpoints not fully implemented in Relay
  • RAG service currently disabled in docker-compose.yml
  • NeoMem integration in Relay not yet active (planned for v0.5.2)

Operational Notes

  • Single-worker constraint: Cortex must run with single Uvicorn worker to maintain SESSIONS state
    • Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
  • Diagnostic endpoints (/debug/sessions, /debug/summary) available for troubleshooting

Future Enhancements

  • Re-enable RAG service integration
  • Implement full session persistence
  • Migrate SESSIONS to Redis for multi-worker support
  • Add request correlation IDs for tracing
  • Comprehensive health checks across all services
  • NeoMem integration in Relay

Quick Start

Prerequisites

  • Docker + Docker Compose
  • At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)

Setup

  1. Copy .env.example to .env and configure your LLM backend URLs and API keys:

    # Required: Configure at least one LLM backend
    LLM_PRIMARY_URL=http://10.0.0.44:8080       # llama.cpp
    LLM_SECONDARY_URL=http://10.0.0.3:11434     # Ollama
    OPENAI_API_KEY=sk-...                        # OpenAI
    
  2. Start all services with docker-compose:

    docker-compose up -d
    
  3. Check service health:

    # Relay health
    curl http://localhost:7078/_health
    
    # Cortex health
    curl http://localhost:7081/health
    
    # NeoMem health
    curl http://localhost:7077/health
    
  4. Access the UI at http://localhost:7078

Test

Test Relay → Cortex pipeline:

curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello Lyra!"}],
    "session_id": "test"
  }'

Test Cortex /ingest endpoint:

curl -X POST http://localhost:7081/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "test",
    "user_msg": "Hello",
    "assistant_msg": "Hi there!"
  }'

Inspect SESSIONS state:

curl http://localhost:7081/debug/sessions

Get summary for a session:

curl "http://localhost:7081/debug/summary?session_id=test"

All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.


Environment Variables

LLM Backend Configuration

Backend URLs (Full API endpoints):

LLM_PRIMARY_URL=http://10.0.0.44:8080           # llama.cpp
LLM_PRIMARY_MODEL=/model

LLM_SECONDARY_URL=http://10.0.0.3:11434         # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

Module-specific backend selection:

CORTEX_LLM=SECONDARY      # Use Ollama for reasoning
INTAKE_LLM=PRIMARY        # Use llama.cpp for summarization
SPEAK_LLM=OPENAI          # Use OpenAI for persona
NEOMEM_LLM=PRIMARY        # Use llama.cpp for memory
UI_LLM=OPENAI             # Use OpenAI for UI
RELAY_LLM=PRIMARY         # Use llama.cpp for relay

Database Configuration

POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432

NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph

Service URLs (Internal Docker Network)

NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078

Feature Flags

CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true

For complete environment variable reference, see ENVIRONMENT_VARIABLES.md.


Documentation


Troubleshooting

SESSIONS not persisting

Symptom: Intake buffer always shows 0 exchanges, summaries always empty.

Solution (Fixed in v0.5.1):

  • Ensure cortex/intake/__init__.py exists
  • Check Cortex logs for [Intake Module Init] message showing SESSIONS object ID
  • Verify single-worker mode (Dockerfile: uvicorn main:app --workers 1)
  • Use /debug/sessions endpoint to inspect current state

Cortex connection errors

Symptom: Relay can't reach Cortex, 502 errors.

Solution:

  • Verify Cortex container is running: docker ps | grep cortex
  • Check Cortex health: curl http://localhost:7081/health
  • Verify environment variables: CORTEX_REASON_URL=http://cortex:7081/reason
  • Check docker network: docker network inspect lyra_net

LLM backend timeouts

Symptom: Reasoning stage hangs or times out.

Solution:

  • Verify LLM backend is running and accessible
  • Check LLM backend health: curl http://10.0.0.44:8080/health
  • Increase timeout in llm_router.py if using slow models
  • Check logs for specific backend errors
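
If the router uses an async HTTP client such as httpx (an assumption; check llm_router.py), a longer read timeout for slow local models could look like this:

# timeout_sketch.py - assumes an httpx-based router and OpenAI-compatible chat endpoints.
import httpx

LLM_TIMEOUT = httpx.Timeout(connect=10.0, read=300.0, write=30.0, pool=10.0)  # generous read timeout

async def call_backend(url: str, payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=LLM_TIMEOUT) as client:
        resp = await client.post(f"{url}/v1/chat/completions", json=payload)
        resp.raise_for_status()
        return resp.json()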

License

NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). © 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

Built with Claude Code


Integration Notes

  • NeoMem API is compatible with Mem0 OSS endpoints (/memories, /search)
  • All services communicate via Docker internal networking on the lyra_net bridge
  • History and entity graphs are managed via PostgreSQL + Neo4j
  • LLM backends are accessed via HTTP and configured in .env
  • Intake module is imported internally by Cortex (no HTTP communication)
  • SESSIONS state is maintained in-memory within Cortex container

Beta Lyrae - RAG Memory System (Currently Disabled)

Note: The RAG service is currently disabled in docker-compose.yml

Requirements

  • Python 3.10+
  • Dependencies: chromadb openai tqdm python-dotenv fastapi uvicorn
  • Persistent storage: ./chromadb or /mnt/data/lyra_rag_db

Setup

  1. Import chat logs (must be in OpenAI message format):

    python3 rag/rag_chat_import.py
    
  2. Build and start the RAG API server:

    cd rag
    python3 rag_build.py
    uvicorn rag_api:app --host 0.0.0.0 --port 7090
    
  3. Query the RAG system:

    curl -X POST http://127.0.0.1:7090/rag/search \
      -H "Content-Type: application/json" \
      -d '{
        "query": "What is the current state of Cortex?",
        "where": {"category": "lyra"}
      }'
    

Development Notes

Cortex Architecture (v0.5.1)

  • Cortex contains embedded Intake module at cortex/intake/
  • Intake is imported as: from intake.intake import add_exchange_internal, SESSIONS
  • SESSIONS is a module-level global dictionary (singleton pattern)
  • Single-worker constraint required to maintain SESSIONS state
  • Diagnostic endpoints available for debugging: /debug/sessions, /debug/summary
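
For reference, a sketch of how the /ingest handler can call the embedded module directly, with no HTTP hop (the response shape here is an assumption; the actual Cortex handler may differ):

# ingest_sketch.py - sketch of Cortex's /ingest calling the embedded Intake module directly.
from fastapi import FastAPI
from pydantic import BaseModel

from intake.intake import add_exchange_internal, SESSIONS   # internal import, same container

app = FastAPI()

class Exchange(BaseModel):
    session_id: str
    user_msg: str
    assistant_msg: str

@app.post("/ingest")
def ingest(exchange: Exchange):
    add_exchange_internal(exchange.dict())    # append to the session's circular buffer
    buffered = len(SESSIONS.get(exchange.session_id, {}).get("buffer", []))
    return {"status": "ok", "buffered_exchanges": buffered}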

Adding New LLM Backends

  1. Add backend URL to .env:

    LLM_CUSTOM_URL=http://your-backend:port
    LLM_CUSTOM_MODEL=model-name
    
  2. Configure module to use new backend:

    CORTEX_LLM=CUSTOM
    
  3. Restart Cortex container:

    docker-compose restart cortex
    

Debugging Tips

  • Enable verbose logging: VERBOSE_DEBUG=true in .env
  • Check Cortex logs: docker logs cortex -f
  • Inspect SESSIONS: curl http://localhost:7081/debug/sessions
  • Test summarization: curl "http://localhost:7081/debug/summary?session_id=test"
  • Check Relay logs: docker logs relay -f
  • Monitor Docker network: docker network inspect lyra_net