serversdown 5f53fb32a4 feat: Refactor LLM router and integrate health check endpoint

- Simplified LLM call logic in llm_router.py, removing tool adapter complexity and enhancing error handling.
- Added health check endpoint to main.py for system status verification.
- Cleaned up router.py by removing unused imports and commented-out code, streamlining the structure.
- Updated docker-compose.yml to unify services under a single Lyra container, enhancing deployment simplicity.
- Created Dockerfile for unified container setup, including both Relay and Cortex services.
- Added QUICKSTART.md for improved onboarding and usage instructions.
- Implemented start.sh script to manage service startup and health checks.

2026-05-29 18:20:56 -04:00

Project Lyra

A streamlined AI conversation system with intelligent summarization and memory

Lyra is a unified conversational AI system that processes your thoughts, summarizes conversations at multiple levels, and prepares them for semantic memory storage. Think of it as your personal thought processor: you dump ideas, and it makes sense of them and stores both the raw conversation and progressive summaries.

Current Version: v1.0.0 (2026-02-23)

Mission Statement

Project Lyra is designed to be your external brain. Unlike typical chatbots that forget everything, Lyra:

Captures everything you say in raw form
Summarizes conversations at multiple granularities (L1-L30)
Stores both raw and summarized data for future retrieval
Prepares everything for semantic search via vector embeddings (Nebula, coming soon)

You can vomit ideas at it, and Lyra will organize, summarize, and remember.

Architecture Overview

Lyra runs as a unified Docker container with a clean separation of concerns:

┌─────────────────────────────────────────────┐
│   Unified Container (lyra)                  │
│                                              │
│  ┌──────────────┐  ┌──────────────────────┐ │
│  │ Relay :7078  │  │   Cortex :7081       │ │
│  │  (Node.js)   │→ │   (Python FastAPI)   │ │
│  │              │  │                       │ │
│  │ - API Gateway│  │ - /reason (full)     │ │
│  │ - Sessions   │  │ - /simple (fast)     │ │
│  │ - OpenAI API │  │ - /ingest (intake)   │ │
│  └──────────────┘  └──────────────────────┘ │
│                            │                 │
│                            ↓                 │
│                    ┌──────────────┐          │
│                    │   Intake     │          │
│                    │  (embedded)  │          │
│                    │              │          │
│                    │ - L1-L30     │          │
│                    │ - Summary    │          │
│                    │ - Buffer     │          │
│                    └──────────────┘          │
│                            │                 │
└────────────────────────────┼─────────────────┘
                             ↓
                      ┌─────────────┐
                      │   Nebula    │  (coming soon)
                      │  (vector    │
                      │   storage)  │
                      └─────────────┘

Components

1. Relay (Node.js - Port 7078)

User-facing API gateway
OpenAI-compatible endpoint: POST /v1/chat/completions
Session management (save, load, rename, delete)
Proxies requests to Cortex

2. Cortex (Python - Port 7081)

Main reasoning and processing brain
Multi-stage reasoning pipeline
LLM routing to different backends
Embedded Intake module

3. Intake (Python Module - Embedded)

Short-term memory buffer (200 messages per session)
Multi-level summarization:
- L1 (5 messages): Ultra-short summary
- L5 (10 messages): Short overview
- L10 (10 messages): "Reality Check" - tone, intent, direction
- L20 (merged L10s): "Session Overview" - progress and themes
- L30 (merged L20s): "Continuity Report" - high-level reflection
Sends summaries to Nebula (HTTP POST with disk fallback)
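
A minimal sketch of how a bounded per-session buffer with summary triggers could work. The class name, the `summarize` callback, and the trigger table are assumptions made for illustration and are not Intake's actual code; only the 200-message cap and the level names come from the list above.

```python
from collections import deque
from typing import Callable


class SessionBuffer:
    """Bounded message buffer that fires summary callbacks at fixed intervals."""

    def __init__(self, summarize: Callable[[str, list[str]], str],
                 triggers: dict[str, int], max_messages: int = 200):
        # deque(maxlen=...) silently drops the oldest message once full
        self.messages = deque(maxlen=max_messages)
        self.summarize = summarize
        self.triggers = triggers  # hypothetical: level name -> message interval
        self.count = 0            # total messages seen, not capped by maxlen
        self.summaries: list[tuple[str, str]] = []

    def add(self, message: str) -> None:
        self.messages.append(message)
        self.count += 1
        for level, every in self.triggers.items():
            if self.count % every == 0:
                # summarize the most recent `every` messages still in the buffer
                window = list(self.messages)[-every:]
                self.summaries.append((level, self.summarize(level, window)))


def fake_summarize(level: str, window: list[str]) -> str:
    # stand-in for an LLM call
    return f"{level}: {len(window)} messages"


buf = SessionBuffer(fake_summarize, {"L1": 5, "L10": 10})
for i in range(10):
    buf.add(f"msg {i}")
print(buf.summaries)
```

With these illustrative intervals, ten messages produce two L1 summaries and one L10 summary.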

4. Nebula (Future - Port 7090)

Vector database for semantic memory
RAG (Retrieval-Augmented Generation)
Memory resurfacing based on similarity

What Makes Lyra Different?

Progressive Summarization

Most chatbots either keep raw history (expensive) or forget everything (useless). Lyra keeps the raw history and layers progressive summaries on top of it:

Raw storage: Every conversation turn saved
L1-L30 summaries: Multiple granularities for different use cases
- L1: "What just happened?" (immediate context)
- L10: "What's the vibe?" (tone and direction)
- L20: "What did we accomplish?" (session overview)
- L30: "What's the big picture?" (continuity across sessions)

Nebula-Ready Architecture

Summaries are sent via HTTP to Nebula (when available), with automatic disk fallback:

.nebula_fallback/
  └── {session_id}/
      ├── L10_20260223_203045.json
      ├── L20_20260223_204512.json
      └── L30_20260223_210030.json
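
The send-then-fall-back behavior could look roughly like the sketch below. The function name, the payload fields, and the injected `post` callable are hypothetical; only the directory layout and the `{level}_{YYYYMMDD}_{HHMMSS}.json` naming are taken from the tree above.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


def store_summary(session_id: str, level: str, summary: dict,
                  post: Callable[[dict], None],
                  fallback_root: Path = Path(".nebula_fallback"),
                  now: Optional[datetime] = None) -> Optional[Path]:
    """Try to deliver a summary; if delivery fails, write it to disk.

    Returns the fallback file path, or None when the POST succeeded.
    """
    payload = {"session_id": session_id, "level": level, **summary}
    try:
        post(payload)  # hypothetical sender, e.g. an HTTP POST to NEBULA_API
        return None
    except OSError:    # urllib.error.URLError and socket errors subclass OSError
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        target = fallback_root / session_id / f"{level}_{stamp}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return target


def unreachable(_payload: dict) -> None:
    # simulates Nebula not running yet
    raise ConnectionRefusedError("Nebula is not running")


root = Path(tempfile.mkdtemp())
path = store_summary("test", "L10", {"text": "example summary"},
                     unreachable, root, datetime(2026, 2, 23, 20, 30, 45))
print(path.name)  # L10_20260223_203045.json
```

Passing the sender in as a parameter keeps the fallback logic testable without a network connection.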

Dual Mode Operation

Simple Mode (/simple): Fast, direct LLM responses
Cortex Mode (/reason): Full 4-stage reasoning pipeline
1. Reflection (meta-awareness)
2. Reasoning (draft)
3. Refinement (polish)
4. Persona (Lyra's voice)
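
As a rough illustration, the four stages can be read as a chain in which each stage consumes the previous stage's output. The prompts, the `LLM` callable type, and the function name below are assumptions; the real stage logic lives in the reasoning and persona modules and is not shown in this document.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def run_pipeline(user_prompt: str, llm: LLM) -> dict:
    """Chain the four stages, feeding each output into the next stage."""
    reflection = llm(f"Reflect on what is being asked: {user_prompt}")
    draft = llm(f"Notes: {reflection}\nDraft an answer to: {user_prompt}")
    refined = llm(f"Polish this draft: {draft}")
    persona = llm(f"Rewrite in Lyra's voice: {refined}")
    return {
        "reflection": reflection,
        "draft": draft,
        "refined": refined,
        "persona": persona,  # the text that would be returned to the user
    }


calls: list[str] = []


def stub_llm(prompt: str) -> str:
    # records each prompt and returns a numbered placeholder
    calls.append(prompt)
    return f"out{len(calls)}"


result = run_pipeline("Explain quantum computing", stub_llm)
print(result["persona"])  # out4
```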

Quick Start

Prerequisites

Docker + Docker Compose
At least one LLM backend (llama.cpp, Ollama, OpenAI API)

Run It

# 1. Create .env file with your LLM backend
cp .env.example .env
# Edit .env with your LLM URLs and API keys

# 2. Build and start
docker-compose up -d --build

# 3. Check health
curl http://localhost:7078/_health  # Relay
curl http://localhost:7081/_health  # Cortex

# 4. Open UI (macOS; on Linux use xdg-open instead of open)
open http://localhost:8081
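
Because both services take a moment to come up, a single curl right after startup may fail. The sketch below polls a health endpoint with retries. The helper names and the retry defaults are assumptions, and this is not the logic of start.sh, whose contents are not shown here.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200; connection errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket errors all subclass OSError
        return False


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Against a running container this could be called as `wait_until_healthy(lambda: http_ok("http://localhost:7081/_health"))`.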

Test It

# Simple chat
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "standard",
    "messages": [{"role": "user", "content": "Hello!"}],
    "sessionId": "test"
  }'

# Full reasoning pipeline
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "cortex",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "sessionId": "test"
  }'
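
The same request can be built from Python with only the standard library. The function name and its defaults are illustrative. The response format is not documented here, so the sketch stops at constructing the request and leaves sending it as a comment.

```python
import json
import urllib.request


def build_chat_request(content: str, mode: str = "standard",
                       session_id: str = "test",
                       base_url: str = "http://localhost:7078") -> urllib.request.Request:
    """Build the POST request that the curl examples above send."""
    body = {
        "mode": mode,
        "messages": [{"role": "user", "content": content}],
        "sessionId": session_id,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Hello!")
print(req.get_method(), req.full_url)

# To send it against a running instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```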

Data Flow

Simple Mode (Fast Path)

User → Relay → Cortex (/simple) → Direct LLM → Response
                  ↓
              Intake (buffer + summarize on triggers)
                  ↓
              Nebula (summaries only)

Cortex Mode (Full Pipeline)

User → Relay → Cortex (/reason)
                  ↓
              1. Reflection (what's being asked?)
                  ↓
              2. Reasoning (draft answer)
                  ↓
              3. Refinement (polish)
                  ↓
              4. Persona (Lyra's voice)
                  ↓
              Intake (buffer + multi-level summaries)
                  ↓
              Nebula (raw + summaries)
                  ↓
              Response

Configuration

Environment Variables

LLM Backends:

# Primary backend (llama.cpp on AMD MI50)
LLM_PRIMARY_URL=http://10.0.0.44:8080
LLM_PRIMARY_MODEL=/model

# Secondary backend (Ollama on RTX 3090)
LLM_SECONDARY_URL=http://10.0.0.3:11434
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

# Cloud backend (OpenAI)
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...

Module-Specific Backend Selection:

CORTEX_LLM=PRIMARY       # Reasoning engine
INTAKE_LLM=PRIMARY       # Summarization
SPEAK_LLM=OPENAI         # Persona (final voice)
STANDARD_MODE_LLM=SECONDARY  # Simple mode default
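
The naming scheme suggests a two-step lookup: a module variable such as CORTEX_LLM names a backend, and that name selects the matching LLM_<NAME>_URL and LLM_<NAME>_MODEL pair. The sketch below assumes exactly that convention. The `Backend` type and `resolve_backend` function are hypothetical and are not taken from llm_router.py.

```python
import os
from typing import Mapping, NamedTuple


class Backend(NamedTuple):
    name: str
    url: str
    model: str


def resolve_backend(module: str, env: Mapping[str, str] = os.environ) -> Backend:
    """Map e.g. CORTEX_LLM=PRIMARY to LLM_PRIMARY_URL and LLM_PRIMARY_MODEL.

    Raises KeyError if any of the three variables is missing.
    """
    name = env[f"{module}_LLM"]
    return Backend(name, env[f"LLM_{name}_URL"], env[f"LLM_{name}_MODEL"])


example_env = {
    "LLM_PRIMARY_URL": "http://10.0.0.44:8080",
    "LLM_PRIMARY_MODEL": "/model",
    "CORTEX_LLM": "PRIMARY",
}
print(resolve_backend("CORTEX", example_env))
```

Under this convention, adding a backend needs no code change: defining a new URL and model pair and pointing a module variable at it is enough.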

Nebula Integration:

NEBULA_API=http://localhost:7090  # When Nebula is running
NEBULA_KEY=your-api-key           # Optional auth

Intake Settings:

INTAKE_LLM=PRIMARY
SUMMARY_MAX_TOKENS=200
SUMMARY_TEMPERATURE=0.3

API Reference

Relay Endpoints (Port 7078)

Chat (OpenAI-compatible):

POST /v1/chat/completions
{
  "mode": "standard" | "cortex",
  "messages": [{"role": "user", "content": "..."}],
  "sessionId": "session-123"
}

Sessions:

GET    /sessions           # List all sessions
GET    /sessions/:id       # Get session history
POST   /sessions/:id       # Save session
PATCH  /sessions/:id/metadata  # Rename session
DELETE /sessions/:id       # Delete session

Health:

GET /_health

Cortex Endpoints (Port 7081)

Reasoning:

POST /reason
{
  "session_id": "session-123",
  "user_prompt": "Your question here"
}

Simple Mode (the backend field is optional):

POST /simple
{
  "session_id": "session-123",
  "user_prompt": "Your question here",
  "backend": "SECONDARY"
}

Intake:

POST /ingest
{
  "session_id": "session-123",
  "user_msg": "User message",
  "assistant_msg": "Assistant response"
}

Health:

GET /_health
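
The Relay and Cortex request shapes differ (sessionId and messages versus session_id and user_prompt), so something has to translate between them. The sketch below shows one plausible mapping. It assumes that "standard" routes to /simple, that "cortex" routes to /reason, and that the last user message becomes user_prompt. None of this is confirmed by the Relay source, which is not shown here.

```python
def to_cortex_call(relay_body: dict) -> tuple[str, dict]:
    """Return (cortex_path, cortex_body) for a Relay chat request."""
    routes = {"standard": "/simple", "cortex": "/reason"}
    mode = relay_body.get("mode", "standard")  # assumption: standard is the default
    if mode not in routes:
        raise ValueError(f"unknown mode: {mode!r}")

    user_turns = [m["content"] for m in relay_body["messages"] if m["role"] == "user"]
    if not user_turns:
        raise ValueError("no user message in request")

    return routes[mode], {
        "session_id": relay_body["sessionId"],
        "user_prompt": user_turns[-1],  # assumption: last user turn is the prompt
    }


path, body = to_cortex_call({
    "mode": "cortex",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "sessionId": "session-123",
})
print(path, body)
```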

File Structure

project-lyra/
├── Dockerfile              # Unified container (Node + Python)
├── docker-compose.yml      # Single lyra service + UI
├── start.sh                # Startup script (Cortex → Relay)
├── .dockerignore
├── QUICKSTART.md           # Quick reference
│
├── core/
│   └── relay/              # Node.js API gateway
│       ├── server.js
│       ├── lib/
│       │   ├── cortex.js   # Cortex HTTP client
│       │   └── llm.js      # LLM routing
│       └── sessions/       # Session storage (volume)
│
├── cortex/                 # Python reasoning engine
│   ├── main.py             # FastAPI app
│   ├── router.py           # /reason, /simple, /ingest
│   ├── context.py          # Session context
│   ├── llm/
│   │   └── llm_router.py   # Multi-backend LLM routing
│   ├── intake/
│   │   └── intake.py       # Summarization module
│   ├── reasoning/
│   │   ├── reflection.py
│   │   ├── reasoning.py
│   │   └── refine.py
│   └── persona/
│       └── speak.py
│
└── .nebula_fallback/       # Disk storage until Nebula runs
    └── {session_id}/
        ├── L10_*.json
        ├── L20_*.json
        └── L30_*.json

Roadmap

✅ Phase 1 (Complete)

Unified container architecture
Multi-level summarization (L1-L30)
HTTP client for Nebula (with disk fallback)
Session management
Dual-mode operation

🚧 Phase 2 (In Progress)

Build Nebula vector database
RAG integration
Memory resurfacing based on semantic similarity

📋 Phase 3 (Planned)

Entity extraction from summaries
Topic clustering
Automatic knowledge graph generation
Temporal memory (what happened when)

Troubleshooting

Container won't start

# Check logs
docker-compose logs lyra

# Common issues:
# - Missing .env file
# - Invalid LLM backend URLs
# - Port conflicts (7078, 7081)

Summaries not appearing

# Check Nebula fallback directory
ls -la .nebula_fallback/

# Verify Cortex is processing
docker-compose logs lyra | grep "Nebula"
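
For a closer look than ls gives, the fallback files can be listed by session, level, and timestamp. The sketch below relies only on the `{level}_{YYYYMMDD}_{HHMMSS}.json` naming shown earlier in this README. The function name is hypothetical, and the sketch does not read the JSON contents, whose schema is not documented here.

```python
import re
from datetime import datetime
from pathlib import Path

# matches names such as L10_20260223_203045.json
NAME = re.compile(r"^(L\d+)_(\d{8}_\d{6})\.json$")


def list_fallback(root: Path = Path(".nebula_fallback")) -> list[tuple[str, str, datetime]]:
    """Return (session_id, level, timestamp) for each fallback file, oldest first."""
    found = []
    for path in root.glob("*/*.json"):
        match = NAME.match(path.name)
        if match:
            stamp = datetime.strptime(match.group(2), "%Y%m%d_%H%M%S")
            found.append((path.parent.name, match.group(1), stamp))
    return sorted(found, key=lambda item: item[2])


for session_id, level, stamp in list_fallback():
    print(session_id, level, stamp.isoformat())
```

An empty result while Nebula is down suggests that no summary trigger has fired yet for any session.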

Sessions not persisting

# Check volume mount
docker-compose exec lyra ls -la /app/relay/sessions/

# Verify session save calls
curl http://localhost:7078/sessions

Development

Making Changes

Code changes (restart only, no rebuild):

docker-compose restart lyra

Dependency changes (rebuild):

docker-compose up -d --build lyra

View logs:

docker-compose logs -f lyra

Adding a New LLM Backend

Add to .env:

LLM_CUSTOM_URL=http://your-backend:port
LLM_CUSTOM_MODEL=model-name

Configure module:

CORTEX_LLM=CUSTOM

Restart:

docker-compose restart lyra

Version History

v1.0.0 (2026-02-23) - The Great Simplification

Major Refactor:

✅ Unified Relay + Cortex into single container
✅ Removed NeoMem (replaced by upcoming Nebula)
✅ Removed old ingest_handler and RAG services
✅ Simplified to core flow: intake → summarize → store
✅ Added HTTP client for Nebula with disk fallback
✅ Cleaned docker-compose (2 services instead of 7)
✅ Updated documentation to reflect new architecture

Architecture Changes:

Intake now sends summaries to Nebula (HTTP POST)
Disk fallback writes JSON files to .nebula_fallback/
Relay and Cortex communicate via localhost (faster)
Single build, single deploy, single log stream

License

Built with Claude Code

Credits

Built by Brian with assistance from Claude (Anthropic).

Special thanks to the open source community:

FastAPI
Express.js
Docker
llama.cpp
Ollama
