# Project Lyra

**A streamlined AI conversation system with intelligent summarization and memory**

Lyra is a unified conversational AI system that processes your thoughts, summarizes conversations at multiple levels, and prepares them for semantic memory storage. Think of it as your personal thought processor: you dump ideas, and it makes sense of them and stores both the raw conversation and progressive summaries.

**Current Version:** v1.0.0 (2026-02-23)

---

## Mission Statement

Project Lyra is designed to be your **external brain**. Unlike typical chatbots that forget everything, Lyra:
- **Captures** everything you say in raw form
- **Summarizes** conversations at multiple granularities (L1-L30)
- **Stores** both raw and summarized data for future retrieval
- **Prepares** everything for semantic search via vector embeddings (Nebula, coming soon)

You can vomit ideas at it, and Lyra will organize, summarize, and remember.

---

## Architecture Overview

Lyra runs as a **unified Docker container** with a clean separation of concerns:

```
┌─────────────────────────────────────────────┐
│   Unified Container (lyra)                  │
│                                              │
│  ┌──────────────┐  ┌──────────────────────┐ │
│  │ Relay :7078  │  │   Cortex :7081       │ │
│  │  (Node.js)   │→ │   (Python FastAPI)   │ │
│  │              │  │                       │ │
│  │ - API Gateway│  │ - /reason (full)     │ │
│  │ - Sessions   │  │ - /simple (fast)     │ │
│  │ - OpenAI API │  │ - /ingest (intake)   │ │
│  └──────────────┘  └──────────────────────┘ │
│                            │                 │
│                            ↓                 │
│                    ┌──────────────┐          │
│                    │   Intake     │          │
│                    │  (embedded)  │          │
│                    │              │          │
│                    │ - L1-L30     │          │
│                    │ - Summary    │          │
│                    │ - Buffer     │          │
│                    └──────────────┘          │
│                            │                 │
└────────────────────────────┼─────────────────┘
                             ↓
                      ┌─────────────┐
                      │   Nebula    │  (coming soon)
                      │  (vector    │
                      │   storage)  │
                      └─────────────┘
```

### Components

**1. Relay (Node.js - Port 7078)**
- User-facing API gateway
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Session management (save, load, rename, delete)
- Proxies requests to Cortex

**2. Cortex (Python - Port 7081)**
- Main reasoning and processing brain
- Multi-stage reasoning pipeline
- LLM routing to different backends
- Embedded Intake module

**3. Intake (Python Module - Embedded)**
- Short-term memory buffer (200 messages per session)
- Multi-level summarization:
  - **L1** (5 messages): Ultra-short summary
  - **L5** (10 messages): Short overview
  - **L10** (10 messages): "Reality Check" - tone, intent, direction
  - **L20** (merged L10s): "Session Overview" - progress and themes
  - **L30** (merged L20s): "Continuity Report" - high-level reflection
- Sends summaries to Nebula (HTTP POST with disk fallback)
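
The short-term buffer described above can be pictured as one bounded queue per session. The following is a minimal sketch, not Lyra's actual code: the `SessionBuffers` class and its method names are hypothetical, and only the 200-message limit comes from this README.

```python
from collections import defaultdict, deque

BUFFER_SIZE = 200  # messages kept per session, per the description above


class SessionBuffers:
    """Hypothetical per-session short-term buffer (illustration only)."""

    def __init__(self, maxlen: int = BUFFER_SIZE) -> None:
        # Each session gets its own bounded deque; once it is full,
        # appending a new message silently drops the oldest one.
        self._buffers = defaultdict(lambda: deque(maxlen=maxlen))

    def add(self, session_id: str, role: str, content: str) -> None:
        """Append one message to the session's buffer."""
        self._buffers[session_id].append({"role": role, "content": content})

    def recent(self, session_id: str, n: int) -> list:
        """Return up to the last n messages, oldest first."""
        if n <= 0:
            return []
        return list(self._buffers[session_id])[-n:]
```

A summarizer for a given level would then call something like `recent(session_id, 10)` to get its input window.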

**4. Nebula (Future - Port 7090)**
- Vector database for semantic memory
- RAG (Retrieval-Augmented Generation)
- Memory resurfacing based on similarity

---

## What Makes Lyra Different?

### Progressive Summarization
Most chatbots either keep raw history (expensive) or forget everything (useless). Lyra keeps both the raw history and compact summaries of it:
- **Raw storage**: Every conversation turn saved
- **L1-L30 summaries**: Multiple granularities for different use cases
  - L1: "What just happened?" (immediate context)
  - L10: "What's the vibe?" (tone and direction)
  - L20: "What did we accomplish?" (session overview)
  - L30: "What's the big picture?" (continuity across sessions)

### Nebula-Ready Architecture
Summaries are sent via HTTP to Nebula (when available), with automatic disk fallback:
```
.nebula_fallback/
  └── {session_id}/
      ├── L10_20260223_203045.json
      ├── L20_20260223_204512.json
      └── L30_20260223_210030.json
```
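
The "POST, else write to disk" behavior can be sketched as follows. This is an illustration under assumptions, not the shipped implementation: the `/summaries` route, the payload shape, and the function names are hypothetical. Only the directory layout and the `L10_YYYYMMDD_HHMMSS.json` filename pattern are taken from the listing above.

```python
import json
import urllib.request
from datetime import datetime
from pathlib import Path
from typing import Optional


def fallback_path(root: Path, session_id: str, level: str, when: datetime) -> Path:
    """Build e.g. .nebula_fallback/<session_id>/L10_20260223_203045.json."""
    return root / session_id / f"{level}_{when:%Y%m%d_%H%M%S}.json"


def send_summary(summary: dict, session_id: str, level: str,
                 nebula_url: Optional[str],
                 root: Path = Path(".nebula_fallback"),
                 when: Optional[datetime] = None) -> Optional[Path]:
    """POST a summary to Nebula; if that is not possible, write it to disk.

    Returns None when the POST succeeded, otherwise the fallback file path.
    """
    # Payload shape is an assumption made for this sketch.
    body = json.dumps({"session_id": session_id, "level": level, **summary}).encode("utf-8")
    if nebula_url:
        req = urllib.request.Request(
            nebula_url.rstrip("/") + "/summaries",  # hypothetical route
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=5):
                return None
        except OSError:
            # URLError, HTTPError, and timeouts are all OSError subclasses.
            pass
    path = fallback_path(root, session_id, level, when or datetime.now())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return path
```

Passing `nebula_url=None` (for example when `NEBULA_API` is unset) goes straight to the disk fallback.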

### Dual Mode Operation
- **Simple Mode** (`/simple`): Fast, direct LLM responses
- **Cortex Mode** (`/reason`): Full 4-stage reasoning pipeline
  1. Reflection (meta-awareness)
  2. Reasoning (draft)
  3. Refinement (polish)
  4. Persona (Lyra's voice)
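
The four stages can be thought of as a simple chain in which each stage sees the user prompt and the previous stage's output. The sketch below uses stub functions in place of LLM calls; the `run_pipeline` helper and the stage signatures are assumptions for illustration, not Cortex's real interfaces.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes (user_prompt, previous_output) and returns text.
# In Lyra each stage would call an LLM; here they are plain callables.
Stage = Callable[[str, str], str]


def run_pipeline(prompt: str, stages: List[Tuple[str, Stage]]) -> Dict[str, str]:
    """Run the stages in order, feeding each one the previous stage's output."""
    outputs: Dict[str, str] = {}
    previous = ""
    for name, stage in stages:
        previous = stage(prompt, previous)
        outputs[name] = previous
    return outputs


# Stub stages standing in for LLM calls (hypothetical, illustration only).
stages = [
    ("reflection", lambda p, _prev: f"notes on: {p}"),
    ("reasoning", lambda p, prev: f"draft using [{prev}]"),
    ("refinement", lambda p, prev: prev.replace("draft", "polished draft")),
    ("persona", lambda p, prev: f"Lyra: {prev}"),
]

result = run_pipeline("Explain quantum computing", stages)
print(result["persona"])
```

Keeping every intermediate output in the returned dict makes it easy to inspect what each stage contributed.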

---

## Quick Start

### Prerequisites
- Docker + Docker Compose
- At least one LLM backend (llama.cpp, Ollama, OpenAI API)

### Run It

```bash
# 1. Create .env file with your LLM backend
cp .env.example .env
# Edit .env with your LLM URLs and API keys

# 2. Build and start
docker-compose up -d --build

# 3. Check health
curl http://localhost:7078/_health  # Relay
curl http://localhost:7081/_health  # Cortex

# 4. Open UI
open http://localhost:8081  # macOS; on Linux use xdg-open
```

### Test It

```bash
# Simple chat
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "standard",
    "messages": [{"role": "user", "content": "Hello!"}],
    "sessionId": "test"
  }'

# Full reasoning pipeline
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "cortex",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "sessionId": "test"
  }'
```
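
The same requests can be sent from Python using only the standard library. This is a minimal sketch: the helper names are made up for illustration, and since the response shape is not documented here, `chat` just returns the parsed JSON as-is.

```python
import json
import urllib.request

RELAY_URL = "http://localhost:7078"  # Relay port from this README


def build_chat_request(content: str, session_id: str,
                       mode: str = "standard") -> urllib.request.Request:
    """Build (but do not send) a POST to Relay's chat endpoint."""
    if mode not in ("standard", "cortex"):
        raise ValueError(f"unknown mode: {mode!r}")
    body = {
        "mode": mode,
        "messages": [{"role": "user", "content": content}],
        "sessionId": session_id,
    }
    return urllib.request.Request(
        f"{RELAY_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, session_id: str, mode: str = "standard") -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(content, session_id, mode)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

With the container running, `chat("Hello!", "test")` mirrors the first curl call and `chat("Explain quantum computing", "test", mode="cortex")` mirrors the second.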

---

## Data Flow

### Simple Mode (Fast Path)
```
User → Relay → Cortex (/simple) → Direct LLM → Response
                  ↓
              Intake (buffer + summarize on triggers)
                  ↓
              Nebula (summaries only)
```

### Cortex Mode (Full Pipeline)
```
User → Relay → Cortex (/reason)
                  ↓
              1. Reflection (what's being asked?)
                  ↓
              2. Reasoning (draft answer)
                  ↓
              3. Refinement (polish)
                  ↓
              4. Persona (Lyra's voice)
                  ↓
              Intake (buffer + multi-level summaries)
                  ↓
              Nebula (raw + summaries)
                  ↓
              Response
```

---

## Configuration

### Environment Variables

**LLM Backends:**
```bash
# Primary backend (llama.cpp on AMD MI50)
LLM_PRIMARY_URL=http://10.0.0.44:8080
LLM_PRIMARY_MODEL=/model

# Secondary backend (Ollama on RTX 3090)
LLM_SECONDARY_URL=http://10.0.0.3:11434
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

# Cloud backend (OpenAI)
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
```

**Module-Specific Backend Selection:**
```bash
CORTEX_LLM=PRIMARY       # Reasoning engine
INTAKE_LLM=PRIMARY       # Summarization
SPEAK_LLM=OPENAI         # Persona (final voice)
STANDARD_MODE_LLM=SECONDARY  # Simple mode default
```
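
The naming convention suggests that a module setting such as `CORTEX_LLM=PRIMARY` points at the matching `LLM_PRIMARY_URL` and `LLM_PRIMARY_MODEL` pair. A lookup along those lines might be sketched as below; the `resolve_backend` function, the `Backend` tuple, and the `PRIMARY` default are assumptions, not the actual logic in `llm_router.py`.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Backend(NamedTuple):
    name: str
    url: str
    model: str


def resolve_backend(module_var: str,
                    env: Optional[Mapping[str, str]] = None,
                    default: str = "PRIMARY") -> Backend:
    """Map e.g. CORTEX_LLM=PRIMARY to LLM_PRIMARY_URL / LLM_PRIMARY_MODEL."""
    env = os.environ if env is None else env
    # Falling back to PRIMARY when the module variable is unset is an assumption.
    name = env.get(module_var, default).upper()
    try:
        return Backend(name, env[f"LLM_{name}_URL"], env[f"LLM_{name}_MODEL"])
    except KeyError as missing:
        raise KeyError(
            f"{module_var}={name} but {missing.args[0]} is not set"
        ) from None
```

For example, `resolve_backend("SPEAK_LLM")` would read `LLM_OPENAI_URL` and `LLM_OPENAI_MODEL` under the settings shown above, and fail loudly if either is missing.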

**Nebula Integration:**
```bash
NEBULA_API=http://localhost:7090  # When Nebula is running
NEBULA_KEY=your-api-key           # Optional auth
```

**Intake Settings:**
```bash
INTAKE_LLM=PRIMARY         # Same variable as in the module-specific selection
SUMMARY_MAX_TOKENS=200
SUMMARY_TEMPERATURE=0.3
```

---

## API Reference

### Relay Endpoints (Port 7078)

**Chat (OpenAI-compatible):**
```bash
POST /v1/chat/completions
{
  "mode": "standard" | "cortex",
  "messages": [{"role": "user", "content": "..."}],
  "sessionId": "session-123"
}
```

**Sessions:**
```bash
GET    /sessions           # List all sessions
GET    /sessions/:id       # Get session history
POST   /sessions/:id       # Save session
PATCH  /sessions/:id/metadata  # Rename session
DELETE /sessions/:id       # Delete session
```

**Health:**
```bash
GET /_health
```

### Cortex Endpoints (Port 7081)

**Reasoning:**
```bash
POST /reason
{
  "session_id": "session-123",
  "user_prompt": "Your question here"
}
```

**Simple Mode** (`backend` is optional):
```bash
POST /simple
{
  "session_id": "session-123",
  "user_prompt": "Your question here",
  "backend": "SECONDARY"
}
```

**Intake:**
```bash
POST /ingest
{
  "session_id": "session-123",
  "user_msg": "User message",
  "assistant_msg": "Assistant response"
}
```

**Health:**
```bash
GET /_health
```
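
Relay and Cortex use different field names (`sessionId` and `messages` versus `session_id` and `user_prompt`), so some translation has to happen between them. The sketch below shows one plausible mapping. It is not the code in `cortex.js`: taking the last user message as the prompt, defaulting the mode to `standard`, and the `"default"` session id are all assumptions.

```python
from typing import Dict, Tuple

# Relay "mode" value -> Cortex route, as described in this README.
ROUTES: Dict[str, str] = {"standard": "/simple", "cortex": "/reason"}


def to_cortex_request(relay_body: dict) -> Tuple[str, dict]:
    """Translate a Relay chat body into (cortex_path, cortex_payload)."""
    mode = relay_body.get("mode", "standard")  # assumed default
    if mode not in ROUTES:
        raise ValueError(f"unknown mode: {mode!r}")

    # Assumption: the most recent user message becomes the prompt.
    user_messages = [
        m["content"]
        for m in relay_body.get("messages", [])
        if m.get("role") == "user"
    ]
    if not user_messages:
        raise ValueError("request contains no user message")

    payload = {
        "session_id": relay_body.get("sessionId", "default"),  # assumed default
        "user_prompt": user_messages[-1],
    }
    return ROUTES[mode], payload
```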

---

## File Structure

```
project-lyra/
├── Dockerfile              # Unified container (Node + Python)
├── docker-compose.yml      # Single lyra service + UI
├── start.sh                # Startup script (Cortex → Relay)
├── .dockerignore
├── QUICKSTART.md           # Quick reference
│
├── core/
│   └── relay/              # Node.js API gateway
│       ├── server.js
│       ├── lib/
│       │   ├── cortex.js   # Cortex HTTP client
│       │   └── llm.js      # LLM routing
│       └── sessions/       # Session storage (volume)
│
├── cortex/                 # Python reasoning engine
│   ├── main.py             # FastAPI app
│   ├── router.py           # /reason, /simple, /ingest
│   ├── context.py          # Session context
│   ├── llm/
│   │   └── llm_router.py   # Multi-backend LLM routing
│   ├── intake/
│   │   └── intake.py       # Summarization module
│   ├── reasoning/
│   │   ├── reflection.py
│   │   ├── reasoning.py
│   │   └── refine.py
│   └── persona/
│       └── speak.py
│
└── .nebula_fallback/       # Disk storage until Nebula runs
    └── {session_id}/
        ├── L10_*.json
        ├── L20_*.json
        └── L30_*.json
```

---

## Roadmap

### ✅ Phase 1 (Complete)
- Unified container architecture
- Multi-level summarization (L1-L30)
- HTTP client for Nebula (with disk fallback)
- Session management
- Dual-mode operation

### 🚧 Phase 2 (In Progress)
- Build Nebula vector database
- RAG integration
- Memory resurfacing based on semantic similarity

### 📋 Phase 3 (Planned)
- Entity extraction from summaries
- Topic clustering
- Automatic knowledge graph generation
- Temporal memory (what happened when)

---

## Troubleshooting

### Container won't start
```bash
# Check logs
docker-compose logs lyra

# Common issues:
# - Missing .env file
# - Invalid LLM backend URLs
# - Port conflicts (7078, 7081)
```

### Summaries not appearing
```bash
# Check Nebula fallback directory
ls -la .nebula_fallback/

# Verify Cortex is processing
docker-compose logs lyra | grep "Nebula"
```
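
For a quicker overview than `ls`, a small script can count fallback files per session and level. This is a hypothetical helper, not part of the repository; it assumes only the `.nebula_fallback/{session_id}/L10_YYYYMMDD_HHMMSS.json` layout shown earlier.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Dict

# Matches names such as L10_20260223_203045.json
NAME_RE = re.compile(r"^(L\d+)_(\d{8})_(\d{6})\.json$")


def count_fallback_summaries(root: Path = Path(".nebula_fallback")) -> Dict[str, Counter]:
    """Return {session_id: Counter({level: file_count})} for files on disk."""
    counts: Dict[str, Counter] = {}
    if not root.is_dir():
        return counts  # nothing written yet, or wrong working directory
    for session_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        levels: Counter = Counter()
        for file in session_dir.glob("*.json"):
            match = NAME_RE.match(file.name)
            if match:
                levels[match.group(1)] += 1
        counts[session_dir.name] = levels
    return counts


if __name__ == "__main__":
    for session, levels in count_fallback_summaries().items():
        print(session, dict(levels))
```

An empty result means no summaries have been written to disk, which points at either the summarization triggers or the working directory.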

### Sessions not persisting
```bash
# Check volume mount
docker-compose exec lyra ls -la /app/relay/sessions/

# Verify session save calls
curl http://localhost:7078/sessions
```

---

## Development

### Making Changes

**Code changes (restart, no rebuild):**
```bash
docker-compose restart lyra
```

**Dependency changes (rebuild):**
```bash
docker-compose up -d --build lyra
```

**View logs:**
```bash
docker-compose logs -f lyra
```

### Adding a New LLM Backend

1. Add to `.env`:
```bash
LLM_CUSTOM_URL=http://your-backend:port
LLM_CUSTOM_MODEL=model-name
```

2. Configure module:
```bash
CORTEX_LLM=CUSTOM
```

3. Restart:
```bash
docker-compose restart lyra
```

---

## Version History

### v1.0.0 (2026-02-23) - The Great Simplification
**Major Refactor:**
- ✅ Unified Relay + Cortex into single container
- ✅ Removed NeoMem (replaced by upcoming Nebula)
- ✅ Removed old ingest_handler and RAG services
- ✅ Simplified to core flow: intake → summarize → store
- ✅ Added HTTP client for Nebula with disk fallback
- ✅ Cleaned docker-compose (2 services instead of 7)
- ✅ Updated documentation to reflect new architecture

**Architecture Changes:**
- Intake now sends summaries to Nebula (HTTP POST)
- Disk fallback writes JSON files to `.nebula_fallback/`
- Relay and Cortex communicate via localhost (faster)
- Single build, single deploy, single log stream

---

## License

© 2026 Terra-Mechanics / ServersDown Labs. Apache 2.0.

**Built with Claude Code**

---

## Credits

Built by Brian with assistance from Claude (Anthropic).

Special thanks to the open source community:
- FastAPI
- Express.js
- Docker
- llama.cpp
- Ollama