docs updated for v0.5.1

This commit is contained in:
serversdwn
2025-12-11 03:49:23 -05:00
parent e45cdbe54e
commit d5d7ea3469
2 changed files with 1227 additions and 157 deletions

PROJECT_SUMMARY.md

@@ -1,71 +1,925 @@
# Project Lyra — Comprehensive AI Context Summary
**Version:** v0.5.1 (2025-12-11)
**Status:** Production-ready modular AI companion system
**Purpose:** Memory-backed conversational AI with multi-stage reasoning, persistent context, and modular LLM backend architecture
---
## Executive Summary
Project Lyra is a **self-hosted AI companion system** designed to overcome the limitations of typical chatbots by providing:
- **Persistent long-term memory** (NeoMem: PostgreSQL + Neo4j graph storage)
- **Multi-stage reasoning pipeline** (Cortex: reflection → reasoning → refinement → persona)
- **Short-term context management** (Intake: session-based summarization embedded in Cortex)
- **Flexible LLM backend routing** (supports llama.cpp, Ollama, OpenAI, custom endpoints)
- **OpenAI-compatible API** (drop-in replacement for chat applications)
**Core Philosophy:** Like a human brain has different regions for different functions, Lyra has specialized modules that work together. She's not just a chatbot—she's a notepad, schedule, database, co-creator, and collaborator with her own executive function.
---
## Quick Context for AI Assistants
If you're an AI being given this project to work on, here's what you need to know:
### What This Project Does
Lyra is a conversational AI system that **remembers everything** across sessions. When a user says something in passing, Lyra stores it, contextualizes it, and can recall it later. She can:
- Track project progress over time
- Remember user preferences and past conversations
- Reason through complex questions using multiple LLM calls
- Apply a consistent personality across all interactions
- Integrate with multiple LLM backends (local and cloud)
### Current Architecture (v0.5.1)
```
User → Relay (Express/Node.js, port 7078)
         ↓
       Cortex (FastAPI/Python, port 7081)
         ├─ Intake module (embedded, in-memory SESSIONS)
         ├─ 4-stage reasoning pipeline
         └─ Multi-backend LLM router
         ↓
       NeoMem (FastAPI/Python, port 7077)
         ├─ PostgreSQL (vector storage)
         └─ Neo4j (graph relationships)
```
### Key Files You'll Work With
**Backend Services:**
- [cortex/router.py](cortex/router.py) - Main Cortex routing logic (306 lines, `/reason`, `/ingest` endpoints)
- [cortex/intake/intake.py](cortex/intake/intake.py) - Short-term memory module (367 lines, SESSIONS management)
- [cortex/reasoning/reasoning.py](cortex/reasoning/reasoning.py) - Draft answer generation
- [cortex/reasoning/refine.py](cortex/reasoning/refine.py) - Answer refinement
- [cortex/reasoning/reflection.py](cortex/reasoning/reflection.py) - Meta-awareness notes
- [cortex/persona/speak.py](cortex/persona/speak.py) - Personality layer
- [cortex/llm/llm_router.py](cortex/llm/llm_router.py) - LLM backend selector
- [core/relay/server.js](core/relay/server.js) - Main orchestrator (Node.js)
- [neomem/main.py](neomem/main.py) - Long-term memory API
**Configuration:**
- [.env](.env) - Root environment variables (LLM backends, databases, API keys)
- [cortex/.env](cortex/.env) - Cortex-specific overrides
- [docker-compose.yml](docker-compose.yml) - Service definitions (152 lines)
**Documentation:**
- [CHANGELOG.md](CHANGELOG.md) - Complete version history (836 lines, chronological format)
- [README.md](README.md) - User-facing documentation (610 lines)
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - This file
### Recent Critical Fixes (v0.5.1)
The most recent work fixed a critical bug where Intake's SESSIONS buffer wasn't persisting:
1. **Fixed**: `bg_summarize()` was only a TYPE_CHECKING stub → implemented as logging stub
2. **Fixed**: `/ingest` endpoint had unreachable code → removed early return, added lenient error handling
3. **Added**: `cortex/intake/__init__.py` → proper Python package structure
4. **Added**: Diagnostic endpoints `/debug/sessions` and `/debug/summary` for troubleshooting
**Key Insight**: Intake is no longer a standalone service—it's embedded in Cortex as a Python module. SESSIONS must persist in a single Uvicorn worker (no multi-worker support without Redis).
---
## Architecture Deep Dive
### Service Topology (Docker Compose)
**Active Containers:**
1. **relay** (Node.js/Express, port 7078)
- Entry point for all user requests
- OpenAI-compatible `/v1/chat/completions` endpoint
- Routes to Cortex for reasoning
- Async calls to Cortex `/ingest` after response
2. **cortex** (Python/FastAPI, port 7081)
- Multi-stage reasoning pipeline
- Embedded Intake module (no HTTP, direct Python imports)
- Endpoints: `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary`
3. **neomem-api** (Python/FastAPI, port 7077)
- Long-term memory storage
- Fork of Mem0 OSS (fully local, no external SDK)
- Endpoints: `/memories`, `/search`, `/health`
4. **neomem-postgres** (PostgreSQL + pgvector, port 5432)
- Vector embeddings storage
- Memory history records
5. **neomem-neo4j** (Neo4j, ports 7474/7687)
- Graph relationships between memories
- Entity extraction and linking
**Disabled Services:**
- `intake` - No longer needed (embedded in Cortex as of v0.5.1)
- `rag` - Beta Lyrae RAG service (planned re-enablement)
### External LLM Backends (HTTP APIs)
**PRIMARY Backend** - llama.cpp @ `http://10.0.0.44:8080`
- AMD MI50 GPU-accelerated inference
- Model: `/model` (path-based routing)
- Used for: Reasoning, refinement, summarization
**SECONDARY Backend** - Ollama @ `http://10.0.0.3:11434`
- RTX 3090 GPU-accelerated inference
- Model: `qwen2.5:7b-instruct-q4_K_M`
- Used for: Configurable per-module
**CLOUD Backend** - OpenAI @ `https://api.openai.com/v1`
- Cloud-based inference
- Model: `gpt-4o-mini`
- Used for: Reflection, persona layers
**FALLBACK Backend** - Local @ `http://10.0.0.41:11435`
- CPU-based inference
- Model: `llama-3.2-8b-instruct`
- Used for: Emergency fallback
### Data Flow (Request Lifecycle)
```
1. User sends message → Relay (/v1/chat/completions)
2. Relay → Cortex (/reason)
3. Cortex calls Intake module (internal Python)
- Intake.summarize_context(session_id, exchanges)
- Returns L1/L5/L10/L20/L30 summaries
4. Cortex 4-stage pipeline:
a. reflection.py → Meta-awareness notes (CLOUD backend)
- "What is the user really asking?"
- Returns JSON: {"notes": [...]}
b. reasoning.py → Draft answer (PRIMARY backend)
- Uses context from Intake
- Integrates reflection notes
- Returns draft text
c. refine.py → Refined answer (PRIMARY backend)
- Polishes draft for clarity
- Ensures factual consistency
- Returns refined text
d. speak.py → Persona layer (CLOUD backend)
- Applies Lyra's personality
- Natural, conversational tone
- Returns final answer
5. Cortex → Relay (returns persona answer)
6. Relay → Cortex (/ingest) [async, non-blocking]
- Sends (session_id, user_msg, assistant_msg)
- Cortex calls add_exchange_internal()
- Appends to SESSIONS[session_id]["buffer"]
7. Relay → User (returns final response)
8. [Planned] Relay → NeoMem (/memories) [async]
- Store conversation in long-term memory
```
### Intake Module Architecture (v0.5.1)
**Location:** `cortex/intake/`
**Key Change:** Intake is now **embedded in Cortex** as a Python module, not a standalone service.
**Import Pattern:**
```python
from intake.intake import add_exchange_internal, SESSIONS, summarize_context
```
**Core Data Structure:**
```python
SESSIONS: dict[str, dict] = {}
# Structure:
SESSIONS[session_id] = {
    "buffer": deque(maxlen=200),   # Circular buffer of exchanges
    "created_at": datetime
}
# Each exchange in buffer:
{
    "session_id": "...",
    "user_msg": "...",
    "assistant_msg": "...",
    "timestamp": "2025-12-11T..."
```
**Functions:**
1. **`add_exchange_internal(exchange: dict)`**
- Adds exchange to SESSIONS buffer
- Creates new session if needed
- Calls `bg_summarize()` stub
- Returns `{"ok": True, "session_id": "..."}`
2. **`summarize_context(session_id: str, exchanges: list[dict])`** [async]
- Generates L1/L5/L10/L20/L30 summaries via LLM
- Called during `/reason` endpoint
- Returns multi-level summary dict
3. **`bg_summarize(session_id: str)`**
- **Stub function** - logs only, no actual work
- Defers summarization to `/reason` call
- Exists to prevent NameError
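The sketch below shows how these three pieces fit together. It is illustrative only — the real `cortex/intake/intake.py` is ~367 lines and backs `summarize_context()` with LLM calls — but the names and the SESSIONS shape follow the description above.
```python
# Minimal sketch of the Intake module surface described above (not the actual intake.py).
from collections import deque
from datetime import datetime, timezone

SESSIONS: dict[str, dict] = {}  # module-level singleton; requires a single Uvicorn worker

def add_exchange_internal(exchange: dict) -> dict:
    """Append one user/assistant exchange to the session's circular buffer."""
    session_id = exchange.get("session_id", "default")
    session = SESSIONS.setdefault(
        session_id,
        {"buffer": deque(maxlen=200), "created_at": datetime.now(timezone.utc)},
    )
    session["buffer"].append(
        {
            "session_id": session_id,
            "user_msg": exchange.get("user_msg", ""),
            "assistant_msg": exchange.get("assistant_msg", ""),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    )
    bg_summarize(session_id)  # logging stub; real summarization happens during /reason
    return {"ok": True, "session_id": session_id}

def bg_summarize(session_id: str) -> None:
    """Stub: log only. Summarization is deferred to summarize_context() during /reason."""
    print(f"[intake] bg_summarize deferred for session={session_id} "
          f"(buffer={len(SESSIONS[session_id]['buffer'])}, SESSIONS id={id(SESSIONS)})")
```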
**Critical Constraint:** SESSIONS is a module-level global dict. This requires **single-worker Uvicorn** mode. Multi-worker deployments need Redis or shared storage.
**Diagnostic Endpoints:**
- `GET /debug/sessions` - Inspect all SESSIONS (object ID, buffer sizes, recent exchanges)
- `GET /debug/summary?session_id=X` - Test summarization for a session
---
## Environment Configuration
### LLM Backend Registry (Multi-Backend Strategy)
**Root `.env` defines all backend OPTIONS:**
```bash
# PRIMARY Backend (llama.cpp)
LLM_PRIMARY_PROVIDER=llama.cpp
LLM_PRIMARY_URL=http://10.0.0.44:8080
LLM_PRIMARY_MODEL=/model
# SECONDARY Backend (Ollama)
LLM_SECONDARY_PROVIDER=ollama
LLM_SECONDARY_URL=http://10.0.0.3:11434
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
# CLOUD Backend (OpenAI)
LLM_OPENAI_PROVIDER=openai
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-proj-...
# FALLBACK Backend
LLM_FALLBACK_PROVIDER=openai_completions
LLM_FALLBACK_URL=http://10.0.0.41:11435
LLM_FALLBACK_MODEL=llama-3.2-8b-instruct
```
**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY # Cortex uses Ollama
INTAKE_LLM=PRIMARY # Intake uses llama.cpp
SPEAK_LLM=OPENAI # Persona uses OpenAI
NEOMEM_LLM=PRIMARY # NeoMem uses llama.cpp
UI_LLM=OPENAI # UI uses OpenAI
RELAY_LLM=PRIMARY # Relay uses llama.cpp
```
**Philosophy:** Root `.env` provides all backend OPTIONS. Each service chooses which backend to USE via `{MODULE}_LLM` variable. This eliminates URL duplication while preserving flexibility.
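As a rough illustration of that philosophy, a module could resolve its backend from the `{MODULE}_LLM` convention as in the sketch below; the actual selection logic lives in `cortex/llm/llm_router.py` and may differ in detail.
```python
# Sketch of env-driven backend selection for the {MODULE}_LLM convention described above.
import os

def resolve_backend(module: str) -> dict:
    """Map e.g. CORTEX_LLM=SECONDARY onto the LLM_SECONDARY_* variables."""
    choice = os.getenv(f"{module.upper()}_LLM", "PRIMARY").upper()  # PRIMARY | SECONDARY | OPENAI | FALLBACK
    prefix = f"LLM_{choice}"
    return {
        "name": choice,
        "provider": os.getenv(f"{prefix}_PROVIDER", ""),
        "url": os.getenv(f"{prefix}_URL", ""),
        "model": os.getenv(f"{prefix}_MODEL", ""),
    }

# Example: with CORTEX_LLM=SECONDARY set, this returns the Ollama URL and model.
print(resolve_backend("cortex"))
```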
### Database Configuration
```bash
# PostgreSQL (vector storage)
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
# Neo4j (graph storage)
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
### Service URLs (Docker Internal Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```
### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
---
## Code Structure Overview
### Cortex Service (`cortex/`)
**Main Files:**
- `main.py` - FastAPI app initialization
- `router.py` - Route definitions (`/reason`, `/ingest`, `/health`, `/debug/*`)
- `context.py` - Context aggregation (Intake summaries, session state)
**Reasoning Pipeline (`reasoning/`):**
- `reflection.py` - Meta-awareness notes (Cloud LLM)
- `reasoning.py` - Draft answer generation (Primary LLM)
- `refine.py` - Answer refinement (Primary LLM)
**Persona Layer (`persona/`):**
- `speak.py` - Personality application (Cloud LLM)
- `identity.py` - Persona loader
**Intake Module (`intake/`):**
- `__init__.py` - Package exports (SESSIONS, add_exchange_internal, summarize_context)
- `intake.py` - Core logic (367 lines)
- SESSIONS dictionary
- add_exchange_internal()
- summarize_context()
- bg_summarize() stub
**LLM Integration (`llm/`):**
- `llm_router.py` - Backend selector and HTTP client
- call_llm() function
- Environment-based routing
- Payload formatting per backend type
**Utilities (`utils/`):**
- Helper functions for common operations
**Configuration:**
- `Dockerfile` - Single-worker constraint documented
- `requirements.txt` - Python dependencies
- `.env` - Service-specific overrides
### Relay Service (`core/relay/`)
**Main Files:**
- `server.js` - Express.js server (Node.js)
- `/v1/chat/completions` - OpenAI-compatible endpoint
- `/chat` - Internal endpoint
- `/_health` - Health check
- `package.json` - Node.js dependencies
**Key Logic:**
- Receives user messages
- Routes to Cortex `/reason`
- Async calls to Cortex `/ingest` after response
- Returns final answer to user
### NeoMem Service (`neomem/`)
**Main Files:**
- `main.py` - FastAPI app (memory API)
- `memory.py` - Memory management logic
- `embedder.py` - Embedding generation
- `graph.py` - Neo4j graph operations
- `Dockerfile` - Container definition
- `requirements.txt` - Python dependencies
**API Endpoints:**
- `POST /memories` - Add new memory
- `POST /search` - Semantic search
- `GET /health` - Service health
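For reference, a minimal Python client for these endpoints might look like the sketch below; it assumes the `httpx` library and the request shapes listed in the API reference later in this document.
```python
# Sketch of a NeoMem client using the request shapes from the API reference below.
import httpx

NEOMEM_API = "http://neomem-api:7077"  # or http://localhost:7077 from the host

def add_memory(user_id: str, user_msg: str, assistant_msg: str) -> dict:
    payload = {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": assistant_msg},
        ],
        "user_id": user_id,
        "metadata": {},
    }
    return httpx.post(f"{NEOMEM_API}/memories", json=payload, timeout=30).json()

def search_memories(user_id: str, query: str, limit: int = 10) -> dict:
    payload = {"query": query, "user_id": user_id, "limit": limit}
    return httpx.post(f"{NEOMEM_API}/search", json=payload, timeout=30).json()
```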
---
## Common Development Tasks
### Adding a New Endpoint to Cortex
**Example: Add `/debug/buffer` endpoint**
1. **Edit `cortex/router.py`:**
```python
@cortex_router.get("/debug/buffer")
async def debug_buffer(session_id: str, limit: int = 10):
    """Return last N exchanges from a session buffer."""
    from intake.intake import SESSIONS
    session = SESSIONS.get(session_id)
    if not session:
        return {"error": "session not found", "session_id": session_id}
    buffer = session["buffer"]
    recent = list(buffer)[-limit:]
    return {
        "session_id": session_id,
        "total_exchanges": len(buffer),
        "recent_exchanges": recent,
    }
```
2. **Restart Cortex:**
```bash
docker-compose restart cortex
```
3. **Test:**
```bash
curl "http://localhost:7081/debug/buffer?session_id=test&limit=5"
```
### Modifying LLM Backend for a Module
**Example: Switch Cortex to use PRIMARY backend**
1. **Edit `.env`:**
```bash
CORTEX_LLM=PRIMARY # Change from SECONDARY to PRIMARY
```
2. **Restart Cortex:**
```bash
docker-compose restart cortex
```
3. **Verify in logs:**
```bash
docker logs cortex | grep "Backend"
```
### Adding Diagnostic Logging
**Example: Log every exchange addition**
1. **Edit `cortex/intake/intake.py`:**
```python
def add_exchange_internal(exchange: dict):
    session_id = exchange.get("session_id")
    # Add detailed logging
    print(f"[DEBUG] Adding exchange to {session_id}")
    print(f"[DEBUG] User msg: {exchange.get('user_msg', '')[:100]}")
    print(f"[DEBUG] Assistant msg: {exchange.get('assistant_msg', '')[:100]}")
    # ... rest of function
```
2. **View logs:**
```bash
docker logs cortex -f | grep DEBUG
```
---
## Debugging Guide
### Problem: SESSIONS Not Persisting
**Symptoms:**
- `/debug/sessions` shows empty or only 1 exchange
- Summaries always return empty
- Buffer size doesn't increase
**Diagnosis Steps:**
1. Check Cortex logs for SESSIONS object ID:
```bash
docker logs cortex | grep "SESSIONS object id"
```
- Should show same ID across all calls
- If IDs differ → module reloading issue
2. Verify single-worker mode:
```bash
docker exec cortex cat Dockerfile | grep uvicorn
```
- Should either omit the `--workers` flag entirely or use `--workers 1` (never more than one worker)
3. Check `/debug/sessions` endpoint:
```bash
curl http://localhost:7081/debug/sessions | jq
```
- Should show sessions_object_id and current sessions
4. Inspect `__init__.py` exists:
```bash
docker exec cortex ls -la intake/__init__.py
```
**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists with proper exports
- Verify `bg_summarize()` is implemented (not just TYPE_CHECKING stub)
- Check `/ingest` endpoint doesn't have early return
- Rebuild Cortex container: `docker-compose build cortex && docker-compose restart cortex`
### Problem: LLM Backend Timeout
**Symptoms:**
- Cortex `/reason` hangs
- 504 Gateway Timeout errors
- Logs show "waiting for LLM response"
**Diagnosis Steps:**
1. Test backend directly:
```bash
# llama.cpp
curl http://10.0.0.44:8080/health
# Ollama
curl http://10.0.0.3:11434/api/tags
# OpenAI
curl https://api.openai.com/v1/models \
-H "Authorization: Bearer $OPENAI_API_KEY"
```
2. Check network connectivity:
```bash
docker exec cortex ping -c 3 10.0.0.44
```
3. Review Cortex logs:
```bash
docker logs cortex -f | grep "LLM"
```
**Solutions:**
- Verify backend URL in `.env` is correct and accessible
- Check firewall rules for backend ports
- Increase timeout in `cortex/llm/llm_router.py`
- Switch to different backend temporarily: `CORTEX_LLM=CLOUD`
### Problem: Docker Compose Won't Start
**Symptoms:**
- `docker-compose up -d` fails
- Container exits immediately
- "port already in use" errors
**Diagnosis Steps:**
1. Check port conflicts:
```bash
netstat -tulpn | grep -E '7078|7081|7077|5432'
```
2. Check container logs:
```bash
docker-compose logs --tail=50
```
3. Verify environment file:
```bash
cat .env | grep -v "^#" | grep -v "^$"
```
**Solutions:**
- Stop conflicting services: `docker-compose down`
- Check `.env` syntax (no quotes unless necessary)
- Rebuild containers: `docker-compose build --no-cache`
- Check Docker daemon: `systemctl status docker`
---
## Testing Checklist
### After Making Changes to Cortex
**1. Build and restart:**
```bash
docker-compose build cortex
docker-compose restart cortex
```
**2. Verify service health:**
```bash
curl http://localhost:7081/health
```
**3. Test /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
```
**4. Verify SESSIONS updated:**
```bash
curl http://localhost:7081/debug/sessions | jq '.sessions.test.buffer_size'
```
- Should show 1 (or increment if already populated)
**5. Test summarization:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test" | jq '.summary'
```
- Should return L1/L5/L10/L20/L30 summaries
**6. Test full pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Test message"}],
"session_id": "test"
}' | jq '.choices[0].message.content'
```
**7. Check logs for errors:**
```bash
docker logs cortex --tail=50
```
---
## Project History & Context
### Evolution Timeline
**v0.1.x (2025-09-23 to 2025-09-25)**
- Initial MVP: Relay + Mem0 + Ollama
- Basic memory storage and retrieval
- Simple UI with session support
**v0.2.x (2025-09-24 to 2025-09-30)**
- Migrated to mem0ai SDK
- Added sessionId support
- Created standalone Lyra-Mem0 stack
**v0.3.x (2025-09-26 to 2025-10-28)**
- Forked Mem0 → NVGRAM → NeoMem
- Added salience filtering
- Integrated Cortex reasoning VM
- Built RAG system (Beta Lyrae)
- Established multi-backend LLM support
**v0.4.x (2025-11-05 to 2025-11-13)**
- Major architectural rewire
- Implemented 4-stage reasoning pipeline
- Added reflection, refinement stages
- RAG integration
- LLM router with per-stage backend selection
**Infrastructure v1.0.0 (2025-11-26)**
- Consolidated 9 `.env` files into single source of truth
- Multi-backend LLM strategy
- Docker Compose consolidation
- Created security templates
**v0.5.0 (2025-11-28)**
- Fixed all critical API wiring issues
- Added OpenAI-compatible Relay endpoint
- Fixed Cortex → Intake integration
- End-to-end flow verification
**v0.5.1 (2025-12-11) - CURRENT**
- **Critical fix**: SESSIONS persistence bug
- Implemented `bg_summarize()` stub
- Fixed `/ingest` unreachable code
- Added `cortex/intake/__init__.py`
- Embedded Intake in Cortex (no longer standalone)
- Added diagnostic endpoints
- Lenient error handling
- Documented single-worker constraint
### Architectural Philosophy
**Modular Design:**
- Each service has a single, clear responsibility
- Services communicate via well-defined HTTP APIs
- Configuration is centralized but allows per-service overrides
**Local-First:**
- No reliance on external services (except optional OpenAI)
- All data stored locally (PostgreSQL + Neo4j)
- Can run entirely air-gapped with local LLMs
**Flexible LLM Backend:**
- Not tied to any single LLM provider
- Can mix local and cloud models
- Per-stage backend selection for optimal performance/cost
**Error Handling:**
- Lenient mode: Never fail the chat pipeline
- Log errors but continue processing
- Graceful degradation
**Observability:**
- Diagnostic endpoints for debugging
- Verbose logging mode
- Object ID tracking for singleton verification
---
## Known Issues & Limitations
### Fixed in v0.5.1
- ✅ Intake SESSIONS not persisting → **FIXED**
- ✅ `bg_summarize()` NameError → **FIXED**
- ✅ `/ingest` endpoint unreachable code → **FIXED**
### Current Limitations
**1. Single-Worker Constraint**
- Cortex must run with single Uvicorn worker
- SESSIONS is in-memory module-level global
- Multi-worker support requires Redis or shared storage
- Documented in `cortex/Dockerfile` lines 7-8
**2. NeoMem Integration Incomplete**
- Relay doesn't yet push to NeoMem after responses
- Memory storage planned for v0.5.2
- Currently all memory is short-term (SESSIONS only)
**3. RAG Service Disabled**
- Beta Lyrae (RAG) commented out in docker-compose.yml
- Awaiting re-enablement after Intake stabilization
- Code exists but not currently integrated
**4. Session Management**
- No session cleanup/expiration (one possible approach is sketched after this list)
- SESSIONS grows without bound (each buffer is capped at maxlen=200, but the number of sessions is unlimited)
- No session list endpoint in Relay
**5. Persona Integration**
- `PERSONA_ENABLED=false` in `.env`
- Persona Sidecar not fully wired
- Identity loaded but not consistently applied
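One possible direction for the session-management gap noted above is a simple TTL sweep over SESSIONS. This is an assumption, not current behavior — nothing like it exists as of v0.5.1, and the field names follow the SESSIONS layout documented earlier.
```python
# Hypothetical cleanup helper (not implemented): drop sessions older than a TTL.
from datetime import datetime, timedelta, timezone

def prune_sessions(sessions: dict, max_age_hours: int = 24) -> int:
    """Remove sessions whose created_at is older than max_age_hours; return count removed."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    stale = [sid for sid, s in sessions.items() if s.get("created_at", cutoff) < cutoff]
    for sid in stale:
        del sessions[sid]
    return len(stale)
```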
### Future Enhancements
**Short-term (v0.5.2):**
- Enable NeoMem integration in Relay
- Add session cleanup/expiration
- Session list endpoint
- NeoMem health monitoring
**Medium-term (v0.6.x):**
- Re-enable RAG service
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs
- Comprehensive health checks
**Long-term (v0.7.x+):**
- Persona Sidecar full integration
- Autonomous "dream" cycles (self-reflection)
- Verifier module for factual grounding
- Advanced RAG with hybrid search
- Memory consolidation strategies
---
## Troubleshooting Quick Reference
| Problem | Quick Check | Solution |
|---------|-------------|----------|
| SESSIONS empty | `curl localhost:7081/debug/sessions` | Rebuild Cortex, verify `__init__.py` exists |
| LLM timeout | `curl http://10.0.0.44:8080/health` | Check backend connectivity, increase timeout |
| Port conflict | `netstat -tulpn \| grep 7078` | Stop conflicting service or change port |
| Container crash | `docker logs cortex` | Check logs for Python errors, verify .env syntax |
| Missing package | `docker exec cortex pip list` | Rebuild container, check requirements.txt |
| 502 from Relay | `curl localhost:7081/health` | Verify Cortex is running, check docker network |
---
## API Reference (Quick)
### Relay (Port 7078)
**POST /v1/chat/completions** - OpenAI-compatible chat
```json
{
"messages": [{"role": "user", "content": "..."}],
"session_id": "..."
}
```
**GET /_health** - Service health
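Equivalent Python client call (assumes the default localhost port mapping and the response shape used in the testing checklist above):
```python
# Calling Relay's OpenAI-compatible endpoint with the request shape shown above.
import httpx

resp = httpx.post(
    "http://localhost:7078/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What did we decide about the RAG service?"}],
        "session_id": "demo",
    },
    timeout=120,  # multi-stage reasoning can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```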
### Cortex (Port 7081)
**POST /reason** - Main reasoning pipeline
```json
{
"session_id": "...",
"user_prompt": "...",
"temperature": 0.7 // optional
}
```
**POST /ingest** - Add exchange to SESSIONS
```json
{
"session_id": "...",
"user_msg": "...",
"assistant_msg": "..."
}
```
**GET /debug/sessions** - Inspect SESSIONS state
**GET /debug/summary?session_id=X** - Test summarization
**GET /health** - Service health
### NeoMem (Port 7077)
**POST /memories** - Add memory
```json
{
"messages": [{"role": "...", "content": "..."}],
"user_id": "...",
"metadata": {}
}
```
**POST /search** - Semantic search
```json
{
"query": "...",
"user_id": "...",
"limit": 10
}
```
**GET /health** - Service health
---
## File Manifest (Key Files Only)
```
project-lyra/
├── .env # Root environment variables
├── docker-compose.yml # Service definitions (152 lines)
├── CHANGELOG.md # Version history (836 lines)
├── README.md # User documentation (610 lines)
├── PROJECT_SUMMARY.md # This file (AI context)
├── cortex/ # Reasoning engine
│ ├── Dockerfile # Single-worker constraint documented
│ ├── requirements.txt
│ ├── .env # Cortex overrides
│ ├── main.py # FastAPI initialization
│ ├── router.py # Routes (306 lines)
│ ├── context.py # Context aggregation
│ │
│ ├── intake/ # Short-term memory (embedded)
│ │ ├── __init__.py # Package exports
│ │ └── intake.py # Core logic (367 lines)
│ │
│ ├── reasoning/ # Reasoning pipeline
│ │ ├── reflection.py # Meta-awareness
│ │ ├── reasoning.py # Draft generation
│ │ └── refine.py # Refinement
│ │
│ ├── persona/ # Personality layer
│ │ ├── speak.py # Persona application
│ │ └── identity.py # Persona loader
│ │
│ └── llm/ # LLM integration
│ └── llm_router.py # Backend selector
├── core/relay/ # Orchestrator
│ ├── server.js # Express server (Node.js)
│ └── package.json
├── neomem/ # Long-term memory
│ ├── Dockerfile
│ ├── requirements.txt
│ ├── .env # NeoMem overrides
│ └── main.py # Memory API
└── rag/ # RAG system (disabled)
├── rag_api.py
├── rag_chat_import.py
└── chromadb/
```
---
## Final Notes for AI Assistants
### What You Should Know Before Making Changes
1. **SESSIONS is sacred** - It's a module-level global in `cortex/intake/intake.py`. Don't move it, don't duplicate it, don't make it a class attribute. It must remain a singleton.
2. **Single-worker is mandatory** - Until SESSIONS is migrated to Redis, Cortex MUST run with a single Uvicorn worker. Multi-worker will cause SESSIONS to be inconsistent.
3. **Lenient error handling** - The `/ingest` endpoint and other parts of the pipeline use lenient error handling: log errors but always return success. Never fail the chat pipeline (see the sketch after this list).
4. **Backend routing is environment-driven** - Don't hardcode LLM URLs. Use the `{MODULE}_LLM` environment variables and the llm_router.py system.
5. **Intake is embedded** - Don't try to make HTTP calls to Intake. Use direct Python imports: `from intake.intake import ...`
6. **Test with diagnostic endpoints** - Always use `/debug/sessions` and `/debug/summary` to verify SESSIONS behavior after changes.
7. **Follow the changelog format** - When documenting changes, use the chronological format established in CHANGELOG.md v0.5.1. Group by version, then by change type (Fixed, Added, Changed, etc.).
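A minimal sketch of the lenient pattern from point 3 (illustrative only; the real handler lives in `cortex/router.py` and differs in detail):
```python
# Lenient /ingest sketch: log failures, never break the chat pipeline.
from fastapi import APIRouter, Request

cortex_router = APIRouter()

@cortex_router.post("/ingest")
async def ingest(request: Request) -> dict:
    try:
        exchange = await request.json()
        from intake.intake import add_exchange_internal
        return add_exchange_internal(exchange)  # {"ok": True, "session_id": ...}
    except Exception as exc:
        # Lenient mode: log and still report success so Relay's flow is never interrupted.
        print(f"[ingest] non-fatal error: {exc}")
        return {"ok": True, "warning": str(exc)}
```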
### When You Need Help
- **SESSIONS issues**: Check `cortex/intake/intake.py` lines 11-14 for initialization, lines 325-366 for `add_exchange_internal()`
- **Routing issues**: Check `cortex/router.py` lines 65-189 for `/reason`, lines 201-233 for `/ingest`
- **LLM backend issues**: Check `cortex/llm/llm_router.py` for backend selection logic
- **Environment variables**: Check `.env` lines 13-40 for LLM backends, lines 28-34 for module selection
### Most Important Thing
**This project values reliability over features.** It's better to have a simple, working system than a complex, broken one. When in doubt, keep it simple, log everything, and never fail silently.
---
**End of AI Context Summary**
*This document is maintained to provide complete context for AI assistants working on Project Lyra. Last updated: v0.5.1 (2025-12-11)*

README.md

@@ -1,9 +1,11 @@
# Project Lyra - README v0.5.1
Lyra is a modular persistent AI companion system with advanced reasoning capabilities.
It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
with a multi-stage reasoning pipeline powered by HTTP-based LLM backends.
**Current Version:** v0.5.1 (2025-12-11)
## Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator, all with its own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.
@@ -22,7 +24,7 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through Cortex reasoning pipeline
- Manages async calls to NeoMem and Cortex ingest
**2. UI** (Static HTML)
- Browser-based chat interface with cyberpunk theme
@@ -41,38 +43,48 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- **Includes embedded Intake module** (no separate service as of v0.5.1)
- **4-Stage Processing:**
  1. **Reflection** - Generates meta-awareness notes about conversation
  2. **Reasoning** - Creates initial draft answer using context
  3. **Refinement** - Polishes and improves the draft
  4. **Persona** - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context via internal Python imports
- Flexible LLM router supporting multiple backends via HTTP
- **Endpoints:**
- `POST /reason` - Main reasoning pipeline
- `POST /ingest` - Receives conversation exchanges from Relay
- `GET /health` - Service health check
- `GET /debug/sessions` - Inspect in-memory SESSIONS state
- `GET /debug/summary` - Test summarization for a session
**5. Intake** (Python Module) - **Embedded in Cortex**
- **No longer a standalone service** - runs as Python module inside Cortex container
- Short-term memory management with session-based circular buffer
- In-memory SESSIONS dictionary: `session_id → {buffer: deque(maxlen=200), created_at: timestamp}`
- Multi-level summarization (L1/L5/L10/L20/L30) produced by `summarize_context()`
- Deferred summarization - actual summary generation happens during `/reason` call
- Internal Python API:
  - `add_exchange_internal(exchange)` - Direct function call from Cortex
  - `summarize_context(session_id, exchanges)` - Async LLM-based summarization
  - `SESSIONS` - Module-level global state (requires single Uvicorn worker)
### LLM Backends (HTTP-based)
**All LLM communication is done via HTTP APIs:**
- **PRIMARY**: llama.cpp server (`http://10.0.0.44:8080`) - AMD MI50 GPU backend
- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
  - Model: qwen2.5:7b-instruct-q4_K_M
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
  - Model: gpt-4o-mini
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
  - Model: llama-3.2-8b-instruct
Each module can be configured to use a different backend via environment variables.
---
## Data Flow Architecture (v0.5.1)
### Normal Message Flow:
@@ -82,43 +94,44 @@ User (UI) → POST /v1/chat/completions
Relay (7078)
   ↓ POST /reason
Cortex (7081)
   ↓ (internal Python call)
Intake module → summarize_context()
   ↓
Cortex processes (4 stages):
   1. reflection.py → meta-awareness notes (CLOUD backend)
   2. reasoning.py → draft answer (PRIMARY backend)
   3. refine.py → refined answer (PRIMARY backend)
   4. persona/speak.py → Lyra personality (CLOUD backend)
   ↓
Returns persona answer to Relay
   ↓
Relay → POST /ingest (async)
   ↓
Cortex → add_exchange_internal() → SESSIONS buffer
   ↓
Relay → NeoMem /memories (async, planned)
   ↓
Relay → UI (returns final response)
```
### Cortex 4-Stage Reasoning Pipeline:
1. **Reflection** (`reflection.py`) - Cloud LLM (OpenAI)
   - Analyzes user intent and conversation context
   - Generates meta-awareness notes
   - "What is the user really asking?"
2. **Reasoning** (`reasoning.py`) - Primary LLM (llama.cpp)
   - Retrieves short-term context from Intake module
   - Creates initial draft answer
   - Integrates context, reflection notes, and user prompt
3. **Refinement** (`refine.py`) - Primary LLM (llama.cpp)
   - Polishes the draft answer
   - Improves clarity and coherence
   - Ensures factual consistency
4. **Persona** (`speak.py`) - Cloud LLM (OpenAI)
   - Applies Lyra's personality and speaking style
   - Natural, conversational output
   - Final answer returned to user
@@ -134,7 +147,7 @@ Relay → UI (returns final response)
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex
- Shared request handler for code reuse
- Comprehensive error handling
@@ -154,73 +167,70 @@ Relay → UI (returns final response)
### Reasoning Layer
**Cortex** (v0.5.1):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- Embedded Intake module for short-term context
- `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` endpoints
- Lenient error handling - never fails the chat pipeline
**Intake** (Embedded Module):
- **Architectural change**: Now runs as Python module inside Cortex container
- In-memory SESSIONS management (session_id → buffer)
- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
- Deferred summarization strategy - summaries generated during `/reason` call
- `bg_summarize()` is a logging stub - actual work deferred
- **Single-worker constraint**: SESSIONS requires single Uvicorn worker or Redis/shared storage
**LLM Router**:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for llama.cpp, Ollama, OpenAI, custom endpoints
- Per-module backend preferences:
- `CORTEX_LLM=SECONDARY` (Ollama for reasoning)
- `INTAKE_LLM=PRIMARY` (llama.cpp for summarization)
- `SPEAK_LLM=OPENAI` (Cloud for persona)
- `NEOMEM_LLM=PRIMARY` (llama.cpp for memory operations)
### Beta Lyrae (RAG Memory DB) - Currently Disabled
- **RAG Knowledge DB - Beta Lyrae (sheliak)**
- This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
- **Status**: Disabled in docker-compose.yml (v0.5.1)
The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
Directory Layout:
```
rag/
├── rag_chat_import.py   # imports JSON chat logs
├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
├── rag_build.py         # legacy single-folder builder
├── rag_query.py         # command-line query helper
├── rag_api.py           # FastAPI service providing /rag/search
├── chromadb/            # persistent vector store
├── chatlogs/            # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log           # progress log for batch runs
```
**OpenAI chatlog importer features:**
- Recursive folder indexing with **category detection** from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with `nohup … &`
---
@@ -228,13 +238,16 @@ Relay → UI (returns final response)
All services run in a single docker-compose stack with the following containers:
**Active Services:**
- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
- **neomem-api** - NeoMem memory service (port 7077)
- **relay** - Main orchestrator (port 7078)
- **cortex** - Reasoning engine with embedded Intake (port 7081)
**Disabled Services:**
- **intake** - No longer needed (embedded in Cortex as of v0.5.1)
- **rag** - Beta Lyrae RAG service (port 7090) - currently disabled
All containers communicate via the `lyra_net` Docker bridge network.
@@ -242,10 +255,10 @@ All containers communicate via the `lyra_net` Docker bridge network.
The following LLM backends are accessed via HTTP (not part of docker-compose):
- **llama.cpp Server** (`http://10.0.0.44:8080`)
  - AMD MI50 GPU-accelerated inference
  - Primary backend for reasoning and refinement stages
  - Model path: `/model`
- **Ollama Server** (`http://10.0.0.3:11434`)
  - RTX 3090 GPU-accelerated inference
@@ -265,16 +278,38 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
## Version History
### v0.5.1 (2025-12-11) - Current Release
**Critical Intake Integration Fixes:**
- ✅ Fixed `bg_summarize()` NameError preventing SESSIONS persistence
- ✅ Fixed `/ingest` endpoint unreachable code
- ✅ Added `cortex/intake/__init__.py` for proper package structure
- ✅ Added diagnostic logging to verify SESSIONS singleton behavior
- ✅ Added `/debug/sessions` and `/debug/summary` endpoints
- ✅ Documented single-worker constraint in Dockerfile
- ✅ Implemented lenient error handling (never fails chat pipeline)
- ✅ Intake now embedded in Cortex - no longer standalone service
**Architecture Changes:**
- Intake module runs inside Cortex container as pure Python import
- No HTTP calls between Cortex and Intake (internal function calls)
- SESSIONS persist correctly in Uvicorn worker
- Deferred summarization strategy (summaries generated during `/reason`)
### v0.5.0 (2025-11-28)
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working
### Infrastructure v1.0.0 (2025-11-26)
- Consolidated 9 scattered `.env` files into single source of truth
- Multi-backend LLM strategy implemented
- Docker Compose consolidation
- Created `.env.example` security templates
### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- LLM router with multi-backend support
- Major architectural restructuring
@@ -285,19 +320,30 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
---
## Known Issues (v0.5.1)
### Critical (Fixed in v0.5.1)
- ~~Intake SESSIONS not persisting~~ ✅ **FIXED**
- ~~`bg_summarize()` NameError~~ ✅ **FIXED**
- ~~`/ingest` endpoint unreachable code~~ ✅ **FIXED**
### Non-Critical
- Session management endpoints not fully implemented in Relay
- RAG service currently disabled in docker-compose.yml
- NeoMem integration in Relay not yet active (planned for v0.5.2)
### Operational Notes
- **Single-worker constraint**: Cortex must run with single Uvicorn worker to maintain SESSIONS state
- Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
- Diagnostic endpoints (`/debug/sessions`, `/debug/summary`) available for troubleshooting
### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs for tracing
- Comprehensive health checks across all services
- NeoMem integration in Relay
---
@@ -305,21 +351,39 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
### Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)
### Setup
1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys:
```bash
# Required: Configure at least one LLM backend
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
OPENAI_API_KEY=sk-... # OpenAI
```
2. Start all services with docker-compose:
```bash
docker-compose up -d
```
3. Check service health:
```bash
# Relay health
curl http://localhost:7078/_health
# Cortex health
curl http://localhost:7081/health
# NeoMem health
curl http://localhost:7077/health
```
4. Access the UI at `http://localhost:7078`
### Test
**Test Relay → Cortex pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
@@ -329,15 +393,130 @@ curl -X POST http://localhost:7078/v1/chat/completions \
  }'
```
**Test Cortex /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
```
**Inspect SESSIONS state:**
```bash
curl http://localhost:7081/debug/sessions
```
**Get summary for a session:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test"
```
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
---
## Environment Variables
### LLM Backend Configuration
**Backend URLs (Full API endpoints):**
```bash
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_PRIMARY_MODEL=/model
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
```
**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY # Use Ollama for reasoning
INTAKE_LLM=PRIMARY # Use llama.cpp for summarization
SPEAK_LLM=OPENAI # Use OpenAI for persona
NEOMEM_LLM=PRIMARY # Use llama.cpp for memory
UI_LLM=OPENAI # Use OpenAI for UI
RELAY_LLM=PRIMARY # Use llama.cpp for relay
```
### Database Configuration
```bash
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
### Service URLs (Internal Docker Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```
### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
For complete environment variable reference, see [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md).
---
## Documentation
- [CHANGELOG.md](CHANGELOG.md) - Detailed version history
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - Comprehensive project overview for AI context
- [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md) - Environment variable reference
- [DEPRECATED_FILES.md](DEPRECATED_FILES.md) - Deprecated files and migration guide
---
## Troubleshooting
### SESSIONS not persisting
**Symptom:** Intake buffer always shows 0 exchanges, summaries always empty.
**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists
- Check Cortex logs for `[Intake Module Init]` message showing SESSIONS object ID
- Verify single-worker mode (Dockerfile: `uvicorn main:app --workers 1`)
- Use `/debug/sessions` endpoint to inspect current state
### Cortex connection errors
**Symptom:** Relay can't reach Cortex, 502 errors.
**Solution:**
- Verify Cortex container is running: `docker ps | grep cortex`
- Check Cortex health: `curl http://localhost:7081/health`
- Verify environment variables: `CORTEX_REASON_URL=http://cortex:7081/reason`
- Check docker network: `docker network inspect lyra_net`
### LLM backend timeouts
**Symptom:** Reasoning stage hangs or times out.
**Solution:**
- Verify LLM backend is running and accessible
- Check LLM backend health: `curl http://10.0.0.44:8080/health`
- Increase timeout in llm_router.py if using slow models
- Check logs for specific backend errors
---
@@ -356,6 +535,8 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`
- Intake module is imported internally by Cortex (no HTTP communication)
- SESSIONS state is maintained in-memory within Cortex container
---
@@ -391,3 +572,38 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
  }'
```
---
## Development Notes
### Cortex Architecture (v0.5.1)
- Cortex contains embedded Intake module at `cortex/intake/`
- Intake is imported as: `from intake.intake import add_exchange_internal, SESSIONS`
- SESSIONS is a module-level global dictionary (singleton pattern)
- Single-worker constraint required to maintain SESSIONS state
- Diagnostic endpoints available for debugging: `/debug/sessions`, `/debug/summary`
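A quick way to sanity-check the singleton from inside the container (e.g. `docker exec -it cortex python`), using the import pattern above; names follow the documentation and may differ slightly from the actual module:
```python
# Verify SESSIONS behaves as a singleton inside the Cortex container.
from intake.intake import SESSIONS, add_exchange_internal

add_exchange_internal({"session_id": "dev", "user_msg": "ping", "assistant_msg": "pong"})
# The object id should match the one logged at module init and reported by /debug/sessions.
print(id(SESSIONS), len(SESSIONS["dev"]["buffer"]))
```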
### Adding New LLM Backends
1. Add backend URL to `.env`:
```bash
LLM_CUSTOM_URL=http://your-backend:port
LLM_CUSTOM_MODEL=model-name
```
2. Configure module to use new backend:
```bash
CORTEX_LLM=CUSTOM
```
3. Restart Cortex container:
```bash
docker-compose restart cortex
```
### Debugging Tips
- Enable verbose logging: `VERBOSE_DEBUG=true` in `.env`
- Check Cortex logs: `docker logs cortex -f`
- Inspect SESSIONS: `curl http://localhost:7081/debug/sessions`
- Test summarization: `curl "http://localhost:7081/debug/summary?session_id=test"`
- Check Relay logs: `docker logs relay -f`
- Monitor Docker network: `docker network inspect lyra_net`