
# Project Lyra — Comprehensive AI Context Summary

**Version:** v0.5.1 (2025-12-11)
**Status:** Production-ready modular AI companion system
**Purpose:** Memory-backed conversational AI with multi-stage reasoning, persistent context, and modular LLM backend architecture

---

## Executive Summary

Project Lyra is a **self-hosted AI companion system** designed to overcome the limitations of typical chatbots by providing:
- **Persistent long-term memory** (NeoMem: PostgreSQL + Neo4j graph storage)
- **Multi-stage reasoning pipeline** (Cortex: reflection → reasoning → refinement → persona)
- **Short-term context management** (Intake: session-based summarization embedded in Cortex)
- **Flexible LLM backend routing** (supports llama.cpp, Ollama, OpenAI, custom endpoints)
- **OpenAI-compatible API** (drop-in replacement for chat applications)

**Core Philosophy:** Just as a human brain has different regions for different functions, Lyra has specialized modules that work together. She's not just a chatbot; she's a notepad, schedule, database, co-creator, and collaborator with her own executive function.

---

## Quick Context for AI Assistants

If you're an AI being given this project to work on, here's what you need to know:

### What This Project Does
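Lyra is a conversational AI system designed to **remember everything** across sessions. When a user says something in passing, Lyra stores it, contextualizes it, and can recall it later. (As of v0.5.1 only the short-term SESSIONS buffer is wired in; pushing conversations to long-term NeoMem storage is planned for v0.5.2.) She can: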
- Track project progress over time
- Remember user preferences and past conversations
- Reason through complex questions using multiple LLM calls
- Apply a consistent personality across all interactions
- Integrate with multiple LLM backends (local and cloud)

### Current Architecture (v0.5.1)
```
User → Relay (Express/Node.js, port 7078)
  ↓
Cortex (FastAPI/Python, port 7081)
  ├─ Intake module (embedded, in-memory SESSIONS)
  ├─ 4-stage reasoning pipeline
  └─ Multi-backend LLM router
  ↓
NeoMem (FastAPI/Python, port 7077)
  ├─ PostgreSQL (vector storage)
  └─ Neo4j (graph relationships)
```

### Key Files You'll Work With

**Backend Services:**
- [cortex/router.py](cortex/router.py) - Main Cortex routing logic (306 lines, `/reason`, `/ingest` endpoints)
- [cortex/intake/intake.py](cortex/intake/intake.py) - Short-term memory module (367 lines, SESSIONS management)
- [cortex/reasoning/reasoning.py](cortex/reasoning/reasoning.py) - Draft answer generation
- [cortex/reasoning/refine.py](cortex/reasoning/refine.py) - Answer refinement
- [cortex/reasoning/reflection.py](cortex/reasoning/reflection.py) - Meta-awareness notes
- [cortex/persona/speak.py](cortex/persona/speak.py) - Personality layer
- [cortex/llm/llm_router.py](cortex/llm/llm_router.py) - LLM backend selector
- [core/relay/server.js](core/relay/server.js) - Main orchestrator (Node.js)
- [neomem/main.py](neomem/main.py) - Long-term memory API

**Configuration:**
- [.env](.env) - Root environment variables (LLM backends, databases, API keys)
- [cortex/.env](cortex/.env) - Cortex-specific overrides
- [docker-compose.yml](docker-compose.yml) - Service definitions (152 lines)

**Documentation:**
- [CHANGELOG.md](CHANGELOG.md) - Complete version history (836 lines, chronological format)
- [README.md](README.md) - User-facing documentation (610 lines)
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - This file

### Recent Critical Fixes (v0.5.1)
The most recent work fixed a critical bug where Intake's SESSIONS buffer wasn't persisting:
1. **Fixed**: `bg_summarize()` was only a TYPE_CHECKING stub → implemented as logging stub
2. **Fixed**: `/ingest` endpoint had unreachable code → removed early return, added lenient error handling
3. **Added**: `cortex/intake/__init__.py` → proper Python package structure
4. **Added**: Diagnostic endpoints `/debug/sessions` and `/debug/summary` for troubleshooting

**Key Insight**: Intake is no longer a standalone service; it is embedded in Cortex as a Python module. SESSIONS lives in the memory of a single Uvicorn worker process, so multi-worker deployments are unsupported until it moves to Redis or similar shared storage.

---

## Architecture Deep Dive

### Service Topology (Docker Compose)

**Active Containers:**
1. **relay** (Node.js/Express, port 7078)
   - Entry point for all user requests
   - OpenAI-compatible `/v1/chat/completions` endpoint
   - Routes to Cortex for reasoning
   - Async calls to Cortex `/ingest` after response

2. **cortex** (Python/FastAPI, port 7081)
   - Multi-stage reasoning pipeline
   - Embedded Intake module (no HTTP, direct Python imports)
   - Endpoints: `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary`

3. **neomem-api** (Python/FastAPI, port 7077)
   - Long-term memory storage
   - Fork of Mem0 OSS (fully local, no external SDK)
   - Endpoints: `/memories`, `/search`, `/health`

4. **neomem-postgres** (PostgreSQL + pgvector, port 5432)
   - Vector embeddings storage
   - Memory history records

5. **neomem-neo4j** (Neo4j, ports 7474/7687)
   - Graph relationships between memories
   - Entity extraction and linking

**Disabled Services:**
- `intake` - No longer needed (embedded in Cortex as of v0.5.1)
- `rag` - Beta Lyrae RAG service (planned re-enablement)

### External LLM Backends (HTTP APIs)

**PRIMARY Backend** - llama.cpp @ `http://10.0.0.44:8080`
- AMD MI50 GPU-accelerated inference
- Model: `/model` (path-based routing)
- Used for: Reasoning, refinement, summarization

**SECONDARY Backend** - Ollama @ `http://10.0.0.3:11434`
- RTX 3090 GPU-accelerated inference
- Model: `qwen2.5:7b-instruct-q4_K_M`
- Used for: Configurable per-module

**CLOUD Backend** - OpenAI @ `https://api.openai.com/v1`
- Cloud-based inference
- Model: `gpt-4o-mini`
- Used for: Reflection, persona layers

**FALLBACK Backend** - Local @ `http://10.0.0.41:11435`
- CPU-based inference
- Model: `llama-3.2-8b-instruct`
- Used for: Emergency fallback

### Data Flow (Request Lifecycle)

```
1. User sends message → Relay (/v1/chat/completions)
   ↓
2. Relay → Cortex (/reason)
   ↓
3. Cortex calls Intake module (internal Python)
   - Intake.summarize_context(session_id, exchanges)
   - Returns L1/L5/L10/L20/L30 summaries
   ↓
4. Cortex 4-stage pipeline:
   a. reflection.py → Meta-awareness notes (CLOUD backend)
      - "What is the user really asking?"
      - Returns JSON: {"notes": [...]}

   b. reasoning.py → Draft answer (PRIMARY backend)
      - Uses context from Intake
      - Integrates reflection notes
      - Returns draft text

   c. refine.py → Refined answer (PRIMARY backend)
      - Polishes draft for clarity
      - Ensures factual consistency
      - Returns refined text

   d. speak.py → Persona layer (CLOUD backend)
      - Applies Lyra's personality
      - Natural, conversational tone
      - Returns final answer
   ↓
5. Cortex → Relay (returns persona answer)
   ↓
6. Relay → Cortex (/ingest) [async, non-blocking]
   - Sends (session_id, user_msg, assistant_msg)
   - Cortex calls add_exchange_internal()
   - Appends to SESSIONS[session_id]["buffer"]
   ↓
7. Relay → User (returns final response)
   ↓
8. [Planned] Relay → NeoMem (/memories) [async]
   - Store conversation in long-term memory
```
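
The four stages in step 4 can be pictured as a chain of awaited calls. The sketch below is illustrative only: `run_pipeline`, `echo_llm`, the `(backend, prompt)` call signature, and the prompt wording are all assumptions made for this example, not the actual code in `cortex/reasoning/` or `cortex/persona/`. The backend labels mirror the diagram above.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature: (backend_label, prompt) -> completion text.
LLMCall = Callable[[str, str], Awaitable[str]]


async def run_pipeline(user_prompt: str, context: str, call_llm: LLMCall) -> dict:
    """Chain reflection -> reasoning -> refinement -> persona."""
    # a. Reflection: meta-awareness notes.
    notes = await call_llm("CLOUD", f"What is the user really asking?\n{user_prompt}")
    # b. Reasoning: draft answer from Intake context plus reflection notes.
    draft = await call_llm(
        "PRIMARY", f"Context:\n{context}\nNotes:\n{notes}\nQuestion:\n{user_prompt}"
    )
    # c. Refinement: polish the draft.
    refined = await call_llm("PRIMARY", f"Polish this draft for clarity:\n{draft}")
    # d. Persona: apply the personality layer last.
    persona = await call_llm("CLOUD", f"Rewrite in Lyra's voice:\n{refined}")
    return {"notes": notes, "draft": draft, "refined": refined, "persona": persona}


async def echo_llm(backend: str, prompt: str) -> str:
    """Stub backend for local experiments: echoes the first prompt line."""
    return f"[{backend}] {prompt.splitlines()[0]}"
```

Running `asyncio.run(run_pipeline("hi", "ctx", echo_llm))` exercises the stage ordering without touching a real LLM backend.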

### Intake Module Architecture (v0.5.1)

**Location:** `cortex/intake/`

**Key Change:** Intake is now **embedded in Cortex** as a Python module, not a standalone service.

**Import Pattern:**
```python
from intake.intake import add_exchange_internal, SESSIONS, summarize_context
```

**Core Data Structure:**
```python
SESSIONS: dict[str, dict] = {}

# Structure:
SESSIONS[session_id] = {
    "buffer": deque(maxlen=200),  # Circular buffer of exchanges
    "created_at": datetime
}

# Each exchange in buffer:
{
    "session_id": "...",
    "user_msg": "...",
    "assistant_msg": "...",
    "timestamp": "2025-12-11T..."
}
```
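
To make the buffer semantics concrete, here is a minimal, self-contained sketch of a session store with this shape. `add_exchange_sketch` is a hypothetical stand-in written for this document, not the real `add_exchange_internal()`; it only demonstrates how `deque(maxlen=...)` evicts the oldest exchange once the cap is reached.

```python
from collections import deque
from datetime import datetime, timezone

SESSIONS: dict[str, dict] = {}


def add_exchange_sketch(exchange: dict, maxlen: int = 200) -> dict:
    """Append an exchange to its session buffer, creating the session if needed."""
    session_id = exchange["session_id"]
    if session_id not in SESSIONS:
        SESSIONS[session_id] = {
            "buffer": deque(maxlen=maxlen),
            "created_at": datetime.now(timezone.utc),
        }
    # A bounded deque silently drops its oldest item when full.
    SESSIONS[session_id]["buffer"].append(exchange)
    return {"ok": True, "session_id": session_id}
```

With `maxlen=3`, adding five exchanges leaves only the last three in the buffer, which is the same behavior the real buffer shows at 200.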

**Functions:**
1. **`add_exchange_internal(exchange: dict)`**
   - Adds exchange to SESSIONS buffer
   - Creates new session if needed
   - Calls `bg_summarize()` stub
   - Returns `{"ok": True, "session_id": "..."}`

2. **`summarize_context(session_id: str, exchanges: list[dict])`** [async]
   - Generates L1/L5/L10/L20/L30 summaries via LLM
   - Called during `/reason` endpoint
   - Returns multi-level summary dict

3. **`bg_summarize(session_id: str)`**
   - **Stub function** - logs only, no actual work
   - Defers summarization to `/reason` call
   - Exists to prevent NameError
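
The L1/L5/L10/L20/L30 levels are easiest to reason about as sliding windows over the buffer. The sketch below assumes that "L5" means "the 5 most recent exchanges" (and likewise for the other levels); that reading is an assumption, and `summary_windows` is a hypothetical helper that only selects the exchanges, leaving the LLM summarization step out.

```python
from typing import Iterable

# Assumption: "L5" means "the 5 most recent exchanges", and so on.
SUMMARY_LEVELS = (1, 5, 10, 20, 30)


def summary_windows(
    buffer: Iterable[dict], levels: tuple[int, ...] = SUMMARY_LEVELS
) -> dict[str, list[dict]]:
    """Map each level name (e.g. "L5") to the exchanges it would cover."""
    exchanges = list(buffer)
    # exchanges[-n:] returns the whole list when fewer than n items exist.
    return {f"L{n}": exchanges[-n:] for n in levels if n > 0}
```

Under this reading, a session with only 7 exchanges yields identical L10, L20, and L30 windows, so the higher levels only become distinct as the buffer fills.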

**Critical Constraint:** SESSIONS is a module-level global dict. This requires **single-worker Uvicorn** mode. Multi-worker deployments need Redis or shared storage.

**Diagnostic Endpoints:**
- `GET /debug/sessions` - Inspect all SESSIONS (object ID, buffer sizes, recent exchanges)
- `GET /debug/summary?session_id=X` - Test summarization for a session
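
As a rough picture of what a `/debug/sessions`-style snapshot involves, the sketch below builds a report from a sessions dict. `sessions_snapshot` is a hypothetical helper; the `sessions_object_id` and `buffer_size` field names follow the ones used elsewhere in this document, while `recent_exchanges` and the `recent` parameter are assumptions.

```python
def sessions_snapshot(sessions: dict, recent: int = 2) -> dict:
    """Summarize a sessions dict for debugging without mutating it."""
    return {
        # id() is stable for the lifetime of the object within one process,
        # so a changing value between calls hints at a reload or a second worker.
        "sessions_object_id": id(sessions),
        "sessions": {
            sid: {
                "buffer_size": len(data["buffer"]),
                "recent_exchanges": list(data["buffer"])[-recent:] if recent > 0 else [],
            }
            for sid, data in sessions.items()
        },
    }
```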

---

## Environment Configuration

### LLM Backend Registry (Multi-Backend Strategy)

**Root `.env` defines all backend OPTIONS:**
```bash
# PRIMARY Backend (llama.cpp)
LLM_PRIMARY_PROVIDER=llama.cpp
LLM_PRIMARY_URL=http://10.0.0.44:8080
LLM_PRIMARY_MODEL=/model

# SECONDARY Backend (Ollama)
LLM_SECONDARY_PROVIDER=ollama
LLM_SECONDARY_URL=http://10.0.0.3:11434
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M

# CLOUD Backend (OpenAI)
LLM_OPENAI_PROVIDER=openai
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-proj-...

# FALLBACK Backend
LLM_FALLBACK_PROVIDER=openai_completions
LLM_FALLBACK_URL=http://10.0.0.41:11435
LLM_FALLBACK_MODEL=llama-3.2-8b-instruct
```

**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY      # Cortex uses Ollama
INTAKE_LLM=PRIMARY        # Intake uses llama.cpp
SPEAK_LLM=OPENAI          # Persona uses OpenAI
NEOMEM_LLM=PRIMARY        # NeoMem uses llama.cpp
UI_LLM=OPENAI             # UI uses OpenAI
RELAY_LLM=PRIMARY         # Relay uses llama.cpp
```

**Philosophy:** Root `.env` provides all backend OPTIONS. Each service chooses which backend to USE via `{MODULE}_LLM` variable. This eliminates URL duplication while preserving flexibility.
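
The lookup this implies is a two-step indirection: read `{MODULE}_LLM` to get a backend name, then read the `LLM_{NAME}_*` variables for that backend. The sketch below shows one way that could work. `resolve_backend` is a hypothetical function, and the `PRIMARY` default for unset modules is an assumption, not documented behavior of `llm_router.py`.

```python
import os
from typing import Mapping, Optional


def resolve_backend(module: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve a module name (e.g. "cortex") to its backend settings."""
    env = os.environ if env is None else env
    # Step 1: which backend did this module pick? (default is an assumption)
    choice = env.get(f"{module.upper()}_LLM", "PRIMARY").upper()
    # Step 2: read that backend's entry from the registry variables.
    prefix = f"LLM_{choice}_"
    try:
        return {
            "name": choice,
            "provider": env[prefix + "PROVIDER"],
            "url": env[prefix + "URL"],
            "model": env[prefix + "MODEL"],
        }
    except KeyError as exc:
        raise KeyError(f"{module}: backend {choice!r} is missing {exc.args[0]}") from exc
```

With the values above, `resolve_backend("speak")` would yield the OpenAI entry, while a backend name with no matching `LLM_{NAME}_*` variables fails loudly instead of falling through to a hardcoded URL.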

### Database Configuration
```bash
# PostgreSQL (vector storage)
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432

# Neo4j (graph storage)
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
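
If a script needs a single connection string rather than separate variables, the PostgreSQL settings above can be combined as sketched here. `postgres_dsn` is a hypothetical helper (NeoMem may assemble its connection differently), and the `localhost`/`5432` defaults are assumptions.

```python
from typing import Mapping
from urllib.parse import quote


def postgres_dsn(env: Mapping[str, str]) -> str:
    """Build a postgresql:// URL from POSTGRES_* variables."""
    # Percent-encode credentials so characters like "@" or "/" stay unambiguous.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    return f"postgresql://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"
```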

### Service URLs (Docker Internal Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```

### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
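
Environment variables are always strings, so flags like these need explicit parsing: in Python, `bool("false")` is `True`. The sketch below shows one tolerant parser. `env_flag` and its accepted spellings are assumptions for illustration, not the project's actual flag handling.

```python
from typing import Mapping


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret an environment value such as "true"/"false" as a bool."""
    raw = env.get(name)
    if raw is None:
        return default
    # Compare case-insensitively and ignore stray whitespace.
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```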

---

## Code Structure Overview

### Cortex Service (`cortex/`)

**Main Files:**
- `main.py` - FastAPI app initialization
- `router.py` - Route definitions (`/reason`, `/ingest`, `/health`, `/debug/*`)
- `context.py` - Context aggregation (Intake summaries, session state)

**Reasoning Pipeline (`reasoning/`):**
- `reflection.py` - Meta-awareness notes (Cloud LLM)
- `reasoning.py` - Draft answer generation (Primary LLM)
- `refine.py` - Answer refinement (Primary LLM)

**Persona Layer (`persona/`):**
- `speak.py` - Personality application (Cloud LLM)
- `identity.py` - Persona loader

**Intake Module (`intake/`):**
- `__init__.py` - Package exports (SESSIONS, add_exchange_internal, summarize_context)
- `intake.py` - Core logic (367 lines)
  - SESSIONS dictionary
  - add_exchange_internal()
  - summarize_context()
  - bg_summarize() stub

**LLM Integration (`llm/`):**
- `llm_router.py` - Backend selector and HTTP client
  - call_llm() function
  - Environment-based routing
  - Payload formatting per backend type
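
"Payload formatting per backend type" matters because each server expects a different request body. The sketch below illustrates the idea using the commonly documented HTTP endpoints of Ollama (`/api/generate`), the llama.cpp server (`/completion`), and OpenAI-style APIs. `build_request` is hypothetical, and the path chosen for `openai_completions` is an assumption; the real `call_llm()` may format requests differently.

```python
def build_request(
    provider: str, base_url: str, model: str, prompt: str, temperature: float = 0.7
) -> tuple[str, dict]:
    """Return (url, json_body) for a single-prompt completion request."""
    base = base_url.rstrip("/")
    if provider == "ollama":
        body = {"model": model, "prompt": prompt, "stream": False,
                "options": {"temperature": temperature}}
        return f"{base}/api/generate", body
    if provider == "llama.cpp":
        # The llama.cpp server serves whichever model it was started with.
        return f"{base}/completion", {"prompt": prompt, "temperature": temperature}
    if provider == "openai":
        # Base URL is expected to already end in /v1, as in the registry above.
        body = {"model": model, "temperature": temperature,
                "messages": [{"role": "user", "content": prompt}]}
        return f"{base}/chat/completions", body
    if provider == "openai_completions":
        # Assumed path for an OpenAI-compatible text-completion server.
        body = {"model": model, "prompt": prompt, "temperature": temperature}
        return f"{base}/v1/completions", body
    raise ValueError(f"unknown provider: {provider!r}")
```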

**Utilities (`utils/`):**
- Helper functions for common operations

**Configuration:**
- `Dockerfile` - Single-worker constraint documented
- `requirements.txt` - Python dependencies
- `.env` - Service-specific overrides

### Relay Service (`core/relay/`)

**Main Files:**
- `server.js` - Express.js server (Node.js)
  - `/v1/chat/completions` - OpenAI-compatible endpoint
  - `/chat` - Internal endpoint
  - `/_health` - Health check
- `package.json` - Node.js dependencies

**Key Logic:**
- Receives user messages
- Routes to Cortex `/reason`
- Async calls to Cortex `/ingest` after response
- Returns final answer to user

### NeoMem Service (`neomem/`)

**Main Files:**
- `main.py` - FastAPI app (memory API)
- `memory.py` - Memory management logic
- `embedder.py` - Embedding generation
- `graph.py` - Neo4j graph operations
- `Dockerfile` - Container definition
- `requirements.txt` - Python dependencies

**API Endpoints:**
- `POST /memories` - Add new memory
- `POST /search` - Semantic search
- `GET /health` - Service health

---

## Common Development Tasks

### Adding a New Endpoint to Cortex

**Example: Add `/debug/buffer` endpoint**

1. **Edit `cortex/router.py`:**
```python
@cortex_router.get("/debug/buffer")
async def debug_buffer(session_id: str, limit: int = 10):
    """Return last N exchanges from a session buffer."""
    from intake.intake import SESSIONS

    session = SESSIONS.get(session_id)
    if not session:
        return {"error": "session not found", "session_id": session_id}

    buffer = session["buffer"]
    # Guard against limit <= 0: a slice of [-0:] would return the whole buffer.
    recent = list(buffer)[-limit:] if limit > 0 else []

    return {
        "session_id": session_id,
        "total_exchanges": len(buffer),
        "recent_exchanges": recent
    }
```

2. **Restart Cortex:**
```bash
docker-compose restart cortex
```

3. **Test:**
```bash
curl "http://localhost:7081/debug/buffer?session_id=test&limit=5"
```

### Modifying LLM Backend for a Module

**Example: Switch Cortex to use PRIMARY backend**

1. **Edit `.env`:**
```bash
CORTEX_LLM=PRIMARY  # Change from SECONDARY to PRIMARY
```

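2. **Recreate Cortex so the new value is loaded:**
```bash
# `restart` reuses the existing container's environment; recreating picks up the change.
docker-compose up -d --force-recreate cortex
```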

3. **Verify in logs:**
```bash
docker logs cortex 2>&1 | grep "Backend"
```

### Adding Diagnostic Logging

**Example: Log every exchange addition**

1. **Edit `cortex/intake/intake.py`:**
```python
def add_exchange_internal(exchange: dict):
    session_id = exchange.get("session_id")

    # Add detailed logging
    print(f"[DEBUG] Adding exchange to {session_id}")
    print(f"[DEBUG] User msg: {exchange.get('user_msg', '')[:100]}")
    print(f"[DEBUG] Assistant msg: {exchange.get('assistant_msg', '')[:100]}")

    # ... rest of function
```

2. **View logs:**
```bash
docker logs cortex -f | grep DEBUG
```

---

## Debugging Guide

### Problem: SESSIONS Not Persisting

**Symptoms:**
- `/debug/sessions` shows empty or only 1 exchange
- Summaries always return empty
- Buffer size doesn't increase

**Diagnosis Steps:**
1. Check Cortex logs for SESSIONS object ID:
   ```bash
   docker logs cortex 2>&1 | grep "SESSIONS object id"
   ```
   - Should show same ID across all calls
   - If IDs differ → the module is being reloaded, or more than one worker process is running

2. Verify single-worker mode:
   ```bash
   docker exec cortex cat Dockerfile | grep uvicorn
   ```
   - Should either have no `--workers` flag at all, or `--workers 1`

3. Check `/debug/sessions` endpoint:
   ```bash
   curl http://localhost:7081/debug/sessions | jq
   ```
   - Should show sessions_object_id and current sessions

4. Verify that `__init__.py` exists:
   ```bash
   docker exec cortex ls -la intake/__init__.py
   ```

**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists with proper exports
- Verify `bg_summarize()` is implemented (not just TYPE_CHECKING stub)
- Check `/ingest` endpoint doesn't have early return
- Rebuild and recreate the Cortex container: `docker-compose build cortex && docker-compose up -d cortex` (`restart` alone keeps running the old image)

### Problem: LLM Backend Timeout

**Symptoms:**
- Cortex `/reason` hangs
- 504 Gateway Timeout errors
- Logs show "waiting for LLM response"

**Diagnosis Steps:**
1. Test backend directly:
   ```bash
   # llama.cpp
   curl http://10.0.0.44:8080/health

   # Ollama
   curl http://10.0.0.3:11434/api/tags

   # OpenAI
   curl https://api.openai.com/v1/models \
     -H "Authorization: Bearer $OPENAI_API_KEY"
   ```

2. Check network connectivity:
   ```bash
   docker exec cortex ping -c 3 10.0.0.44
   ```

3. Review Cortex logs:
   ```bash
   docker logs cortex -f 2>&1 | grep "LLM"
   ```
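
The same reachability checks can be scripted with an explicit timeout, so a hung backend shows up as a quick failure rather than a stalled terminal. This is a minimal sketch using only the standard library. `probe` and its injectable `opener` parameter are hypothetical, and the URLs are the health endpoints from the curl commands above.

```python
import urllib.error
import urllib.request

HEALTH_URLS = {
    "PRIMARY": "http://10.0.0.44:8080/health",
    "SECONDARY": "http://10.0.0.3:11434/api/tags",
}


def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> tuple[bool, str]:
    """Return (reachable, detail) for one backend URL."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers connection refused, DNS failure, and socket timeouts.
        return False, f"unreachable: {exc}"
```

Calling `probe(url)` for each entry in `HEALTH_URLS` gives a quick pass/fail table; the `opener` argument exists only so the function can be exercised without a network.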

**Solutions:**
- Verify backend URL in `.env` is correct and accessible
- Check firewall rules for backend ports
- Increase timeout in `cortex/llm/llm_router.py`
- Switch to a different backend temporarily, using one of the names defined in the root `.env` registry: `CORTEX_LLM=OPENAI`

### Problem: Docker Compose Won't Start

**Symptoms:**
- `docker-compose up -d` fails
- Container exits immediately
- "port already in use" errors

**Diagnosis Steps:**
1. Check port conflicts:
   ```bash
   netstat -tulpn | grep -E '7078|7081|7077|5432'
   ```

2. Check container logs:
   ```bash
   docker-compose logs --tail=50
   ```

3. Verify environment file:
   ```bash
   cat .env | grep -v "^#" | grep -v "^$"
   ```
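
On hosts where `netstat` is not installed, the port-conflict check can be approximated from Python. The sketch below simply tries to connect to each port on localhost. `port_in_use`, `busy_ports`, and the `LYRA_PORTS` mapping are hypothetical names, with port numbers taken from the service list in this document.

```python
import socket

LYRA_PORTS = {"relay": 7078, "cortex": 7081, "neomem-api": 7077, "postgres": 5432}


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: dict = LYRA_PORTS) -> dict:
    """Return the subset of ports that already have a listener."""
    return {name: port for name, port in ports.items() if port_in_use(port)}
```

A non-empty result before `docker-compose up -d` suggests another process, or a leftover container, already holds that port.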

**Solutions:**
- Stop conflicting services: `docker-compose down`
- Check `.env` syntax (no quotes unless necessary)
- Rebuild containers: `docker-compose build --no-cache`
- Check Docker daemon: `systemctl status docker`

---

## Testing Checklist

### After Making Changes to Cortex

**1. Build and recreate:**
```bash
docker-compose build cortex
docker-compose up -d cortex  # `restart` alone would keep running the old image
```

**2. Verify service health:**
```bash
curl http://localhost:7081/health
```

**3. Test /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "test",
    "user_msg": "Hello",
    "assistant_msg": "Hi there!"
  }'
```

**4. Verify SESSIONS updated:**
```bash
curl http://localhost:7081/debug/sessions | jq '.sessions.test.buffer_size'
```
- Should show 1 (or increment if already populated)

**5. Test summarization:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test" | jq '.summary'
```
- Should return L1/L5/L10/L20/L30 summaries

**6. Test full pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Test message"}],
    "session_id": "test"
  }' | jq '.choices[0].message.content'
```

**7. Check logs for errors:**
```bash
docker logs cortex --tail=50
```

---

## Project History & Context

### Evolution Timeline

**v0.1.x (2025-09-23 to 2025-09-25)**
- Initial MVP: Relay + Mem0 + Ollama
- Basic memory storage and retrieval
- Simple UI with session support

**v0.2.x (2025-09-24 to 2025-09-30)**
- Migrated to mem0ai SDK
- Added sessionId support
- Created standalone Lyra-Mem0 stack

**v0.3.x (2025-09-26 to 2025-10-28)**
- Forked Mem0 → NVGRAM → NeoMem
- Added salience filtering
- Integrated Cortex reasoning VM
- Built RAG system (Beta Lyrae)
- Established multi-backend LLM support

**v0.4.x (2025-11-05 to 2025-11-13)**
- Major architectural rewire
- Implemented 4-stage reasoning pipeline
- Added reflection, refinement stages
- RAG integration
- LLM router with per-stage backend selection

**Infrastructure v1.0.0 (2025-11-26)**
- Consolidated 9 `.env` files into single source of truth
- Multi-backend LLM strategy
- Docker Compose consolidation
- Created security templates

**v0.5.0 (2025-11-28)**
- Fixed all critical API wiring issues
- Added OpenAI-compatible Relay endpoint
- Fixed Cortex → Intake integration
- End-to-end flow verification

**v0.5.1 (2025-12-11) - CURRENT**
- **Critical fix**: SESSIONS persistence bug
- Implemented `bg_summarize()` stub
- Fixed `/ingest` unreachable code
- Added `cortex/intake/__init__.py`
- Embedded Intake in Cortex (no longer standalone)
- Added diagnostic endpoints
- Lenient error handling
- Documented single-worker constraint

### Architectural Philosophy

**Modular Design:**
- Each service has a single, clear responsibility
- Services communicate via well-defined HTTP APIs
- Configuration is centralized but allows per-service overrides

**Local-First:**
- No reliance on external services (except optional OpenAI)
- All data stored locally (PostgreSQL + Neo4j)
- Can run entirely air-gapped with local LLMs

**Flexible LLM Backend:**
- Not tied to any single LLM provider
- Can mix local and cloud models
- Per-stage backend selection for optimal performance/cost

**Error Handling:**
- Lenient mode: Never fail the chat pipeline
- Log errors but continue processing
- Graceful degradation
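
One common way to express "log errors but continue" is a small decorator that catches exceptions, logs the traceback, and returns a caller-chosen fallback. The sketch below is an assumption about how such a policy could be written. `lenient`, `risky_ingest`, and the fallback payload are hypothetical and not taken from the Cortex codebase, and an `async def` endpoint would need an async variant of the wrapper.

```python
import functools
import logging

logger = logging.getLogger("lyra.lenient")


def lenient(fallback):
    """Decorator: on any exception, log it and return `fallback` instead."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception records the full traceback, so nothing is silent.
                logger.exception("%s failed; returning fallback", func.__name__)
                return fallback
        return wrapper
    return decorator


@lenient(fallback={"ok": False, "error": "ingest failed"})
def risky_ingest(exchange: dict) -> dict:
    # Raises KeyError when session_id is missing; the decorator absorbs it.
    return {"ok": True, "session_id": exchange["session_id"]}
```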

**Observability:**
- Diagnostic endpoints for debugging
- Verbose logging mode
- Object ID tracking for singleton verification

---

## Known Issues & Limitations

### Fixed in v0.5.1
- ✅ Intake SESSIONS not persisting → **FIXED**
- ✅ `bg_summarize()` NameError → **FIXED**
- ✅ `/ingest` endpoint unreachable code → **FIXED**

### Current Limitations

**1. Single-Worker Constraint**
- Cortex must run with single Uvicorn worker
- SESSIONS is in-memory module-level global
- Multi-worker support requires Redis or shared storage
- Documented in `cortex/Dockerfile` lines 7-8

**2. NeoMem Integration Incomplete**
- Relay doesn't yet push to NeoMem after responses
- Memory storage planned for v0.5.2
- Currently all memory is short-term (SESSIONS only)

**3. RAG Service Disabled**
- Beta Lyrae (RAG) commented out in docker-compose.yml
- Awaiting re-enablement after Intake stabilization
- Code exists but not currently integrated

**4. Session Management**
- No session cleanup/expiration
- SESSIONS grows without bound: each buffer is capped at `maxlen=200`, but the number of sessions is not
- No session list endpoint in Relay
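
For orientation, an age-based cleanup could look roughly like the sketch below. It is not implemented in v0.5.1. `expire_sessions` is hypothetical, it assumes `created_at` holds timezone-aware datetimes, and it deletes entries in place so the sessions dict keeps its identity as a singleton.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def expire_sessions(
    sessions: dict, max_age: timedelta, now: Optional[datetime] = None
) -> list[str]:
    """Delete sessions older than max_age in place; return the removed ids."""
    now = now or datetime.now(timezone.utc)
    # Collect first, then delete, to avoid mutating the dict while iterating.
    stale = [sid for sid, data in sessions.items() if now - data["created_at"] > max_age]
    for sid in stale:
        del sessions[sid]
    return stale
```

Expiring on `created_at` drops long-running sessions too; tracking a last-activity timestamp would be gentler, but that field does not exist in the current structure.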

**5. Persona Integration**
- `PERSONA_ENABLED=false` in `.env`
- Persona Sidecar not fully wired
- Identity loaded but not consistently applied

### Future Enhancements

**Short-term (v0.5.2):**
- Enable NeoMem integration in Relay
- Add session cleanup/expiration
- Session list endpoint
- NeoMem health monitoring

**Medium-term (v0.6.x):**
- Re-enable RAG service
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs
- Comprehensive health checks

**Long-term (v0.7.x+):**
- Persona Sidecar full integration
- Autonomous "dream" cycles (self-reflection)
- Verifier module for factual grounding
- Advanced RAG with hybrid search
- Memory consolidation strategies

---

## Troubleshooting Quick Reference

| Problem | Quick Check | Solution |
|---------|-------------|----------|
| SESSIONS empty | `curl localhost:7081/debug/sessions` | Rebuild Cortex, verify `__init__.py` exists |
| LLM timeout | `curl http://10.0.0.44:8080/health` | Check backend connectivity, increase timeout |
| Port conflict | `netstat -tulpn \| grep 7078` | Stop conflicting service or change port |
| Container crash | `docker logs cortex` | Check logs for Python errors, verify .env syntax |
| Missing package | `docker exec cortex pip list` | Rebuild container, check requirements.txt |
| 502 from Relay | `curl localhost:7081/health` | Verify Cortex is running, check docker network |

---

## API Reference (Quick)

### Relay (Port 7078)

**POST /v1/chat/completions** - OpenAI-compatible chat
```json
{
  "messages": [{"role": "user", "content": "..."}],
  "session_id": "..."
}
```
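
A minimal Python client for this endpoint can be built with the standard library alone. In the sketch below, `build_chat_request` and `extract_reply` are hypothetical helpers; the default base URL assumes Relay is reachable on localhost, and the response shape follows the `.choices[0].message.content` path used in the testing checklist.

```python
import json
import urllib.request


def build_chat_request(
    content: str, session_id: str, base_url: str = "http://localhost:7078"
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "session_id": session_id,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send the request, pass it to `urllib.request.urlopen(req, timeout=...)` and feed the bytes from `.read()` to `extract_reply`.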

**GET /_health** - Service health

### Cortex (Port 7081)

**POST /reason** - Main reasoning pipeline (`temperature` is optional)
```json
{
  "session_id": "...",
  "user_prompt": "...",
  "temperature": 0.7
}
```

**POST /ingest** - Add exchange to SESSIONS
```json
{
  "session_id": "...",
  "user_msg": "...",
  "assistant_msg": "..."
}
```

**GET /debug/sessions** - Inspect SESSIONS state

**GET /debug/summary?session_id=X** - Test summarization

**GET /health** - Service health

### NeoMem (Port 7077)

**POST /memories** - Add memory
```json
{
  "messages": [{"role": "...", "content": "..."}],
  "user_id": "...",
  "metadata": {}
}
```

**POST /search** - Semantic search
```json
{
  "query": "...",
  "user_id": "...",
  "limit": 10
}
```

**GET /health** - Service health

---

## File Manifest (Key Files Only)

```
project-lyra/
├── .env                           # Root environment variables
├── docker-compose.yml             # Service definitions (152 lines)
├── CHANGELOG.md                   # Version history (836 lines)
├── README.md                      # User documentation (610 lines)
├── PROJECT_SUMMARY.md             # This file (AI context)
│
├── cortex/                        # Reasoning engine
│   ├── Dockerfile                 # Single-worker constraint documented
│   ├── requirements.txt
│   ├── .env                       # Cortex overrides
│   ├── main.py                    # FastAPI initialization
│   ├── router.py                  # Routes (306 lines)
│   ├── context.py                 # Context aggregation
│   │
│   ├── intake/                    # Short-term memory (embedded)
│   │   ├── __init__.py           # Package exports
│   │   └── intake.py             # Core logic (367 lines)
│   │
│   ├── reasoning/                 # Reasoning pipeline
│   │   ├── reflection.py         # Meta-awareness
│   │   ├── reasoning.py          # Draft generation
│   │   └── refine.py             # Refinement
│   │
│   ├── persona/                   # Personality layer
│   │   ├── speak.py              # Persona application
│   │   └── identity.py           # Persona loader
│   │
│   └── llm/                       # LLM integration
│       └── llm_router.py         # Backend selector
│
├── core/relay/                    # Orchestrator
│   ├── server.js                 # Express server (Node.js)
│   └── package.json
│
├── neomem/                        # Long-term memory
│   ├── Dockerfile
│   ├── requirements.txt
│   ├── .env                       # NeoMem overrides
│   └── main.py                   # Memory API
│
└── rag/                           # RAG system (disabled)
    ├── rag_api.py
    ├── rag_chat_import.py
    └── chromadb/
```

---

## Final Notes for AI Assistants

### What You Should Know Before Making Changes

1. **SESSIONS is sacred** - It's a module-level global in `cortex/intake/intake.py`. Don't move it, don't duplicate it, don't make it a class attribute. It must remain a singleton.

2. **Single-worker is mandatory** - Until SESSIONS is migrated to Redis, Cortex MUST run with a single Uvicorn worker. Multi-worker will cause SESSIONS to be inconsistent.

3. **Lenient error handling** - The `/ingest` endpoint and other parts of the pipeline use lenient error handling: log errors but always return success. Never fail the chat pipeline.

4. **Backend routing is environment-driven** - Don't hardcode LLM URLs. Use the `{MODULE}_LLM` environment variables and the llm_router.py system.

5. **Intake is embedded** - Don't try to make HTTP calls to Intake. Use direct Python imports: `from intake.intake import ...`

6. **Test with diagnostic endpoints** - Always use `/debug/sessions` and `/debug/summary` to verify SESSIONS behavior after changes.

7. **Follow the changelog format** - When documenting changes, use the chronological format established in CHANGELOG.md v0.5.1. Group by version, then by change type (Fixed, Added, Changed, etc.).

### When You Need Help

- **SESSIONS issues**: Check `cortex/intake/intake.py` lines 11-14 for initialization, lines 325-366 for `add_exchange_internal()`
- **Routing issues**: Check `cortex/router.py` lines 65-189 for `/reason`, lines 201-233 for `/ingest`
- **LLM backend issues**: Check `cortex/llm/llm_router.py` for backend selection logic
- **Environment variables**: Check `.env` lines 13-40 for LLM backends, lines 28-34 for module selection

### Most Important Thing

**This project values reliability over features.** It's better to have a simple, working system than a complex, broken one. When in doubt, keep it simple, log everything, and never fail silently.

---

**End of AI Context Summary**

*This document is maintained to provide complete context for AI assistants working on Project Lyra. Last updated: v0.5.1 (2025-12-11)*