diff --git a/PROJECT_SUMMARY.md b/PROJECT_SUMMARY.md index 551170e..7395e46 100644 --- a/PROJECT_SUMMARY.md +++ b/PROJECT_SUMMARY.md @@ -1,71 +1,925 @@ -# Lyra Core β€” Project Summary +# Project Lyra β€” Comprehensive AI Context Summary -## v0.4 (2025-10-03) +**Version:** v0.5.1 (2025-12-11) +**Status:** Production-ready modular AI companion system +**Purpose:** Memory-backed conversational AI with multi-stage reasoning, persistent context, and modular LLM backend architecture -### 🧠 High-Level Architecture -- **Lyra Core (v0.3.1)** β€” Orchestration layer. - - Accepts chat requests (`/v1/chat/completions`). - - Routes through Cortex for subconscious annotation. - - Stores everything in Mem0 (no discard). - - Fetches persona + relevant memories. - - Injects context back into LLM. +--- -- **Cortex (v0.3.0)** β€” Subconscious annotator. - - Runs locally via `llama.cpp` (Phi-3.5-mini Q4_K_M). - - Strict JSON schema: - ```json - { - "sentiment": "positive" | "neutral" | "negative", - "novelty": 0.0–1.0, - "tags": ["keyword", "keyword"], - "notes": "short string" +## Executive Summary + +Project Lyra is a **self-hosted AI companion system** designed to overcome the limitations of typical chatbots by providing: +- **Persistent long-term memory** (NeoMem: PostgreSQL + Neo4j graph storage) +- **Multi-stage reasoning pipeline** (Cortex: reflection β†’ reasoning β†’ refinement β†’ persona) +- **Short-term context management** (Intake: session-based summarization embedded in Cortex) +- **Flexible LLM backend routing** (supports llama.cpp, Ollama, OpenAI, custom endpoints) +- **OpenAI-compatible API** (drop-in replacement for chat applications) + +**Core Philosophy:** Like a human brain has different regions for different functions, Lyra has specialized modules that work together. She's not just a chatbotβ€”she's a notepad, schedule, database, co-creator, and collaborator with her own executive function. + +--- + +## Quick Context for AI Assistants + +If you're an AI being given this project to work on, here's what you need to know: + +### What This Project Does +Lyra is a conversational AI system that **remembers everything** across sessions. When a user says something in passing, Lyra stores it, contextualizes it, and can recall it later. 
She can: +- Track project progress over time +- Remember user preferences and past conversations +- Reason through complex questions using multiple LLM calls +- Apply a consistent personality across all interactions +- Integrate with multiple LLM backends (local and cloud) + +### Current Architecture (v0.5.1) +``` +User β†’ Relay (Express/Node.js, port 7078) + ↓ +Cortex (FastAPI/Python, port 7081) + β”œβ”€ Intake module (embedded, in-memory SESSIONS) + β”œβ”€ 4-stage reasoning pipeline + └─ Multi-backend LLM router + ↓ +NeoMem (FastAPI/Python, port 7077) + β”œβ”€ PostgreSQL (vector storage) + └─ Neo4j (graph relationships) +``` + +### Key Files You'll Work With + +**Backend Services:** +- [cortex/router.py](cortex/router.py) - Main Cortex routing logic (306 lines, `/reason`, `/ingest` endpoints) +- [cortex/intake/intake.py](cortex/intake/intake.py) - Short-term memory module (367 lines, SESSIONS management) +- [cortex/reasoning/reasoning.py](cortex/reasoning/reasoning.py) - Draft answer generation +- [cortex/reasoning/refine.py](cortex/reasoning/refine.py) - Answer refinement +- [cortex/reasoning/reflection.py](cortex/reasoning/reflection.py) - Meta-awareness notes +- [cortex/persona/speak.py](cortex/persona/speak.py) - Personality layer +- [cortex/llm/llm_router.py](cortex/llm/llm_router.py) - LLM backend selector +- [core/relay/server.js](core/relay/server.js) - Main orchestrator (Node.js) +- [neomem/main.py](neomem/main.py) - Long-term memory API + +**Configuration:** +- [.env](.env) - Root environment variables (LLM backends, databases, API keys) +- [cortex/.env](cortex/.env) - Cortex-specific overrides +- [docker-compose.yml](docker-compose.yml) - Service definitions (152 lines) + +**Documentation:** +- [CHANGELOG.md](CHANGELOG.md) - Complete version history (836 lines, chronological format) +- [README.md](README.md) - User-facing documentation (610 lines) +- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - This file + +### Recent Critical Fixes (v0.5.1) +The most recent work fixed a critical bug where Intake's SESSIONS buffer wasn't persisting: +1. **Fixed**: `bg_summarize()` was only a TYPE_CHECKING stub β†’ implemented as logging stub +2. **Fixed**: `/ingest` endpoint had unreachable code β†’ removed early return, added lenient error handling +3. **Added**: `cortex/intake/__init__.py` β†’ proper Python package structure +4. **Added**: Diagnostic endpoints `/debug/sessions` and `/debug/summary` for troubleshooting + +**Key Insight**: Intake is no longer a standalone serviceβ€”it's embedded in Cortex as a Python module. SESSIONS must persist in a single Uvicorn worker (no multi-worker support without Redis). + +--- + +## Architecture Deep Dive + +### Service Topology (Docker Compose) + +**Active Containers:** +1. **relay** (Node.js/Express, port 7078) + - Entry point for all user requests + - OpenAI-compatible `/v1/chat/completions` endpoint + - Routes to Cortex for reasoning + - Async calls to Cortex `/ingest` after response + +2. **cortex** (Python/FastAPI, port 7081) + - Multi-stage reasoning pipeline + - Embedded Intake module (no HTTP, direct Python imports) + - Endpoints: `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` + +3. **neomem-api** (Python/FastAPI, port 7077) + - Long-term memory storage + - Fork of Mem0 OSS (fully local, no external SDK) + - Endpoints: `/memories`, `/search`, `/health` + +4. **neomem-postgres** (PostgreSQL + pgvector, port 5432) + - Vector embeddings storage + - Memory history records + +5. 
**neomem-neo4j** (Neo4j, ports 7474/7687) + - Graph relationships between memories + - Entity extraction and linking + +**Disabled Services:** +- `intake` - No longer needed (embedded in Cortex as of v0.5.1) +- `rag` - Beta Lyrae RAG service (planned re-enablement) + +### External LLM Backends (HTTP APIs) + +**PRIMARY Backend** - llama.cpp @ `http://10.0.0.44:8080` +- AMD MI50 GPU-accelerated inference +- Model: `/model` (path-based routing) +- Used for: Reasoning, refinement, summarization + +**SECONDARY Backend** - Ollama @ `http://10.0.0.3:11434` +- RTX 3090 GPU-accelerated inference +- Model: `qwen2.5:7b-instruct-q4_K_M` +- Used for: Configurable per-module + +**CLOUD Backend** - OpenAI @ `https://api.openai.com/v1` +- Cloud-based inference +- Model: `gpt-4o-mini` +- Used for: Reflection, persona layers + +**FALLBACK Backend** - Local @ `http://10.0.0.41:11435` +- CPU-based inference +- Model: `llama-3.2-8b-instruct` +- Used for: Emergency fallback + +### Data Flow (Request Lifecycle) + +``` +1. User sends message β†’ Relay (/v1/chat/completions) + ↓ +2. Relay β†’ Cortex (/reason) + ↓ +3. Cortex calls Intake module (internal Python) + - Intake.summarize_context(session_id, exchanges) + - Returns L1/L5/L10/L20/L30 summaries + ↓ +4. Cortex 4-stage pipeline: + a. reflection.py β†’ Meta-awareness notes (CLOUD backend) + - "What is the user really asking?" + - Returns JSON: {"notes": [...]} + + b. reasoning.py β†’ Draft answer (PRIMARY backend) + - Uses context from Intake + - Integrates reflection notes + - Returns draft text + + c. refine.py β†’ Refined answer (PRIMARY backend) + - Polishes draft for clarity + - Ensures factual consistency + - Returns refined text + + d. speak.py β†’ Persona layer (CLOUD backend) + - Applies Lyra's personality + - Natural, conversational tone + - Returns final answer + ↓ +5. Cortex β†’ Relay (returns persona answer) + ↓ +6. Relay β†’ Cortex (/ingest) [async, non-blocking] + - Sends (session_id, user_msg, assistant_msg) + - Cortex calls add_exchange_internal() + - Appends to SESSIONS[session_id]["buffer"] + ↓ +7. Relay β†’ User (returns final response) + ↓ +8. [Planned] Relay β†’ NeoMem (/memories) [async] + - Store conversation in long-term memory +``` + +### Intake Module Architecture (v0.5.1) + +**Location:** `cortex/intake/` + +**Key Change:** Intake is now **embedded in Cortex** as a Python module, not a standalone service. + +**Import Pattern:** +```python +from intake.intake import add_exchange_internal, SESSIONS, summarize_context +``` + +**Core Data Structure:** +```python +SESSIONS: dict[str, dict] = {} + +# Structure: +SESSIONS[session_id] = { + "buffer": deque(maxlen=200), # Circular buffer of exchanges + "created_at": datetime +} + +# Each exchange in buffer: +{ + "session_id": "...", + "user_msg": "...", + "assistant_msg": "...", + "timestamp": "2025-12-11T..." +} +``` + +**Functions:** +1. **`add_exchange_internal(exchange: dict)`** + - Adds exchange to SESSIONS buffer + - Creates new session if needed + - Calls `bg_summarize()` stub + - Returns `{"ok": True, "session_id": "..."}` + +2. **`summarize_context(session_id: str, exchanges: list[dict])`** [async] + - Generates L1/L5/L10/L20/L30 summaries via LLM + - Called during `/reason` endpoint + - Returns multi-level summary dict + +3. **`bg_summarize(session_id: str)`** + - **Stub function** - logs only, no actual work + - Defers summarization to `/reason` call + - Exists to prevent NameError + +**Critical Constraint:** SESSIONS is a module-level global dict. 
This requires **single-worker Uvicorn** mode. Multi-worker deployments need Redis or shared storage. + +**Diagnostic Endpoints:** +- `GET /debug/sessions` - Inspect all SESSIONS (object ID, buffer sizes, recent exchanges) +- `GET /debug/summary?session_id=X` - Test summarization for a session + +--- + +## Environment Configuration + +### LLM Backend Registry (Multi-Backend Strategy) + +**Root `.env` defines all backend OPTIONS:** +```bash +# PRIMARY Backend (llama.cpp) +LLM_PRIMARY_PROVIDER=llama.cpp +LLM_PRIMARY_URL=http://10.0.0.44:8080 +LLM_PRIMARY_MODEL=/model + +# SECONDARY Backend (Ollama) +LLM_SECONDARY_PROVIDER=ollama +LLM_SECONDARY_URL=http://10.0.0.3:11434 +LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M + +# CLOUD Backend (OpenAI) +LLM_OPENAI_PROVIDER=openai +LLM_OPENAI_URL=https://api.openai.com/v1 +LLM_OPENAI_MODEL=gpt-4o-mini +OPENAI_API_KEY=sk-proj-... + +# FALLBACK Backend +LLM_FALLBACK_PROVIDER=openai_completions +LLM_FALLBACK_URL=http://10.0.0.41:11435 +LLM_FALLBACK_MODEL=llama-3.2-8b-instruct +``` + +**Module-specific backend selection:** +```bash +CORTEX_LLM=SECONDARY # Cortex uses Ollama +INTAKE_LLM=PRIMARY # Intake uses llama.cpp +SPEAK_LLM=OPENAI # Persona uses OpenAI +NEOMEM_LLM=PRIMARY # NeoMem uses llama.cpp +UI_LLM=OPENAI # UI uses OpenAI +RELAY_LLM=PRIMARY # Relay uses llama.cpp +``` + +**Philosophy:** Root `.env` provides all backend OPTIONS. Each service chooses which backend to USE via `{MODULE}_LLM` variable. This eliminates URL duplication while preserving flexibility. + +### Database Configuration +```bash +# PostgreSQL (vector storage) +POSTGRES_USER=neomem +POSTGRES_PASSWORD=neomempass +POSTGRES_DB=neomem +POSTGRES_HOST=neomem-postgres +POSTGRES_PORT=5432 + +# Neo4j (graph storage) +NEO4J_URI=bolt://neomem-neo4j:7687 +NEO4J_USERNAME=neo4j +NEO4J_PASSWORD=neomemgraph +``` + +### Service URLs (Docker Internal Network) +```bash +NEOMEM_API=http://neomem-api:7077 +CORTEX_API=http://cortex:7081 +CORTEX_REASON_URL=http://cortex:7081/reason +CORTEX_INGEST_URL=http://cortex:7081/ingest +RELAY_URL=http://relay:7078 +``` + +### Feature Flags +```bash +CORTEX_ENABLED=true +MEMORY_ENABLED=true +PERSONA_ENABLED=false +DEBUG_PROMPT=true +VERBOSE_DEBUG=true +``` + +--- + +## Code Structure Overview + +### Cortex Service (`cortex/`) + +**Main Files:** +- `main.py` - FastAPI app initialization +- `router.py` - Route definitions (`/reason`, `/ingest`, `/health`, `/debug/*`) +- `context.py` - Context aggregation (Intake summaries, session state) + +**Reasoning Pipeline (`reasoning/`):** +- `reflection.py` - Meta-awareness notes (Cloud LLM) +- `reasoning.py` - Draft answer generation (Primary LLM) +- `refine.py` - Answer refinement (Primary LLM) + +**Persona Layer (`persona/`):** +- `speak.py` - Personality application (Cloud LLM) +- `identity.py` - Persona loader + +**Intake Module (`intake/`):** +- `__init__.py` - Package exports (SESSIONS, add_exchange_internal, summarize_context) +- `intake.py` - Core logic (367 lines) + - SESSIONS dictionary + - add_exchange_internal() + - summarize_context() + - bg_summarize() stub + +**LLM Integration (`llm/`):** +- `llm_router.py` - Backend selector and HTTP client + - call_llm() function + - Environment-based routing + - Payload formatting per backend type + +**Utilities (`utils/`):** +- Helper functions for common operations + +**Configuration:** +- `Dockerfile` - Single-worker constraint documented +- `requirements.txt` - Python dependencies +- `.env` - Service-specific overrides + +### Relay Service (`core/relay/`) + +**Main 
Files:** +- `server.js` - Express.js server (Node.js) + - `/v1/chat/completions` - OpenAI-compatible endpoint + - `/chat` - Internal endpoint + - `/_health` - Health check +- `package.json` - Node.js dependencies + +**Key Logic:** +- Receives user messages +- Routes to Cortex `/reason` +- Async calls to Cortex `/ingest` after response +- Returns final answer to user + +### NeoMem Service (`neomem/`) + +**Main Files:** +- `main.py` - FastAPI app (memory API) +- `memory.py` - Memory management logic +- `embedder.py` - Embedding generation +- `graph.py` - Neo4j graph operations +- `Dockerfile` - Container definition +- `requirements.txt` - Python dependencies + +**API Endpoints:** +- `POST /memories` - Add new memory +- `POST /search` - Semantic search +- `GET /health` - Service health + +--- + +## Common Development Tasks + +### Adding a New Endpoint to Cortex + +**Example: Add `/debug/buffer` endpoint** + +1. **Edit `cortex/router.py`:** +```python +@cortex_router.get("/debug/buffer") +async def debug_buffer(session_id: str, limit: int = 10): + """Return last N exchanges from a session buffer.""" + from intake.intake import SESSIONS + + session = SESSIONS.get(session_id) + if not session: + return {"error": "session not found", "session_id": session_id} + + buffer = session["buffer"] + recent = list(buffer)[-limit:] + + return { + "session_id": session_id, + "total_exchanges": len(buffer), + "recent_exchanges": recent } - ``` - - Normalizes keys (lowercase). - - Strips Markdown fences before parsing. - - Configurable via `.env` (`CORTEX_ENABLED=true|false`). - - Currently generates annotations, but not yet persisted into Mem0 payloads (stored as empty `{cortex:{}}`). +``` -- **Mem0 (v0.4.0)** β€” Persistent memory layer. - - Handles embeddings, graph storage, and retrieval. - - Dual embedder support: - - **OpenAI Cloud** (`text-embedding-3-small`, 1536-dim). - - **HuggingFace TEI** (gte-Qwen2-1.5B-instruct, 1536-dim, hosted on 3090). - - Environment toggle for provider (`.env.openai` vs `.env.3090`). - - Memory persistence in Postgres (`payload` JSON). - - CSV export pipeline confirmed (id, user_id, data, created_at). +2. **Restart Cortex:** +```bash +docker-compose restart cortex +``` -- **Persona Sidecar** - - Provides personality, style, and protocol instructions. - - Injected at runtime into Core prompt building. +3. **Test:** +```bash +curl "http://localhost:7081/debug/buffer?session_id=test&limit=5" +``` + +### Modifying LLM Backend for a Module + +**Example: Switch Cortex to use PRIMARY backend** + +1. **Edit `.env`:** +```bash +CORTEX_LLM=PRIMARY # Change from SECONDARY to PRIMARY +``` + +2. **Restart Cortex:** +```bash +docker-compose restart cortex +``` + +3. **Verify in logs:** +```bash +docker logs cortex | grep "Backend" +``` + +### Adding Diagnostic Logging + +**Example: Log every exchange addition** + +1. **Edit `cortex/intake/intake.py`:** +```python +def add_exchange_internal(exchange: dict): + session_id = exchange.get("session_id") + + # Add detailed logging + print(f"[DEBUG] Adding exchange to {session_id}") + print(f"[DEBUG] User msg: {exchange.get('user_msg', '')[:100]}") + print(f"[DEBUG] Assistant msg: {exchange.get('assistant_msg', '')[:100]}") + + # ... rest of function +``` + +2. **View logs:** +```bash +docker logs cortex -f | grep DEBUG +``` --- -### πŸš€ Recent Changes -- **Mem0** - - Added HuggingFace TEI integration (local 3090 embedder). - - Enabled dual-mode environment switch (OpenAI cloud ↔ local TEI). - - Fixed `.env` line ending mismatch (CRLF vs LF). 
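+
+### Calling the Intake Module from New Cortex Code
+
+Because Intake is embedded, new Cortex code calls it through direct imports rather than HTTP. The snippet below is a minimal sketch that uses only the Intake API documented above (`add_exchange_internal`, `SESSIONS`) and the same field names as the `/ingest` payload; it is illustrative, not copied from the codebase.
+
+```python
+from intake.intake import SESSIONS, add_exchange_internal
+
+# Append one exchange to the session buffer (the session is created if needed).
+result = add_exchange_internal({
+    "session_id": "demo",
+    "user_msg": "What did we decide about the RAG service?",
+    "assistant_msg": "Beta Lyrae stays disabled until Intake is stable.",
+})
+
+print(result)                           # documented shape: {"ok": True, "session_id": "demo"}
+print(len(SESSIONS["demo"]["buffer"]))  # 1 (buffer is a deque with maxlen=200)
+```
+
+---
+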
- - Added memory dump/export commands for Postgres. +## Debugging Guide -- **Core/Relay** - - No major changes since v0.3.1 (still routing input β†’ Cortex β†’ Mem0). +### Problem: SESSIONS Not Persisting -- **Cortex** - - Still outputs annotations, but not yet persisted into Mem0 payloads. +**Symptoms:** +- `/debug/sessions` shows empty or only 1 exchange +- Summaries always return empty +- Buffer size doesn't increase + +**Diagnosis Steps:** +1. Check Cortex logs for SESSIONS object ID: + ```bash + docker logs cortex | grep "SESSIONS object id" + ``` + - Should show same ID across all calls + - If IDs differ β†’ module reloading issue + +2. Verify single-worker mode: + ```bash + docker exec cortex cat Dockerfile | grep uvicorn + ``` + - Should NOT have `--workers` flag or `--workers 1` + +3. Check `/debug/sessions` endpoint: + ```bash + curl http://localhost:7081/debug/sessions | jq + ``` + - Should show sessions_object_id and current sessions + +4. Inspect `__init__.py` exists: + ```bash + docker exec cortex ls -la intake/__init__.py + ``` + +**Solution (Fixed in v0.5.1):** +- Ensure `cortex/intake/__init__.py` exists with proper exports +- Verify `bg_summarize()` is implemented (not just TYPE_CHECKING stub) +- Check `/ingest` endpoint doesn't have early return +- Rebuild Cortex container: `docker-compose build cortex && docker-compose restart cortex` + +### Problem: LLM Backend Timeout + +**Symptoms:** +- Cortex `/reason` hangs +- 504 Gateway Timeout errors +- Logs show "waiting for LLM response" + +**Diagnosis Steps:** +1. Test backend directly: + ```bash + # llama.cpp + curl http://10.0.0.44:8080/health + + # Ollama + curl http://10.0.0.3:11434/api/tags + + # OpenAI + curl https://api.openai.com/v1/models \ + -H "Authorization: Bearer $OPENAI_API_KEY" + ``` + +2. Check network connectivity: + ```bash + docker exec cortex ping -c 3 10.0.0.44 + ``` + +3. Review Cortex logs: + ```bash + docker logs cortex -f | grep "LLM" + ``` + +**Solutions:** +- Verify backend URL in `.env` is correct and accessible +- Check firewall rules for backend ports +- Increase timeout in `cortex/llm/llm_router.py` +- Switch to different backend temporarily: `CORTEX_LLM=CLOUD` + +### Problem: Docker Compose Won't Start + +**Symptoms:** +- `docker-compose up -d` fails +- Container exits immediately +- "port already in use" errors + +**Diagnosis Steps:** +1. Check port conflicts: + ```bash + netstat -tulpn | grep -E '7078|7081|7077|5432' + ``` + +2. Check container logs: + ```bash + docker-compose logs --tail=50 + ``` + +3. Verify environment file: + ```bash + cat .env | grep -v "^#" | grep -v "^$" + ``` + +**Solutions:** +- Stop conflicting services: `docker-compose down` +- Check `.env` syntax (no quotes unless necessary) +- Rebuild containers: `docker-compose build --no-cache` +- Check Docker daemon: `systemctl status docker` --- -### πŸ“ˆ Versioning -- **Lyra Core** β†’ v0.3.1 -- **Cortex** β†’ v0.3.0 -- **Mem0** β†’ v0.4.0 +## Testing Checklist + +### After Making Changes to Cortex + +**1. Build and restart:** +```bash +docker-compose build cortex +docker-compose restart cortex +``` + +**2. Verify service health:** +```bash +curl http://localhost:7081/health +``` + +**3. Test /ingest endpoint:** +```bash +curl -X POST http://localhost:7081/ingest \ + -H "Content-Type: application/json" \ + -d '{ + "session_id": "test", + "user_msg": "Hello", + "assistant_msg": "Hi there!" + }' +``` + +**4. 
Verify SESSIONS updated:** +```bash +curl http://localhost:7081/debug/sessions | jq '.sessions.test.buffer_size' +``` +- Should show 1 (or increment if already populated) + +**5. Test summarization:** +```bash +curl "http://localhost:7081/debug/summary?session_id=test" | jq '.summary' +``` +- Should return L1/L5/L10/L20/L30 summaries + +**6. Test full pipeline:** +```bash +curl -X POST http://localhost:7078/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [{"role": "user", "content": "Test message"}], + "session_id": "test" + }' | jq '.choices[0].message.content' +``` + +**7. Check logs for errors:** +```bash +docker logs cortex --tail=50 +``` --- -### πŸ“‹ Next Steps -- [ ] Wire Cortex annotations into Mem0 payloads (`cortex` object). -- [ ] Add β€œexport all memories” script to standard workflow. -- [ ] Consider async embedding for faster `mem.add`. -- [ ] Build visual diagram of data flow (Core ↔ Cortex ↔ Mem0 ↔ Persona). -- [ ] Explore larger LLMs for Cortex (Qwen2-7B, etc.) for richer subconscious annotation. +## Project History & Context + +### Evolution Timeline + +**v0.1.x (2025-09-23 to 2025-09-25)** +- Initial MVP: Relay + Mem0 + Ollama +- Basic memory storage and retrieval +- Simple UI with session support + +**v0.2.x (2025-09-24 to 2025-09-30)** +- Migrated to mem0ai SDK +- Added sessionId support +- Created standalone Lyra-Mem0 stack + +**v0.3.x (2025-09-26 to 2025-10-28)** +- Forked Mem0 β†’ NVGRAM β†’ NeoMem +- Added salience filtering +- Integrated Cortex reasoning VM +- Built RAG system (Beta Lyrae) +- Established multi-backend LLM support + +**v0.4.x (2025-11-05 to 2025-11-13)** +- Major architectural rewire +- Implemented 4-stage reasoning pipeline +- Added reflection, refinement stages +- RAG integration +- LLM router with per-stage backend selection + +**Infrastructure v1.0.0 (2025-11-26)** +- Consolidated 9 `.env` files into single source of truth +- Multi-backend LLM strategy +- Docker Compose consolidation +- Created security templates + +**v0.5.0 (2025-11-28)** +- Fixed all critical API wiring issues +- Added OpenAI-compatible Relay endpoint +- Fixed Cortex β†’ Intake integration +- End-to-end flow verification + +**v0.5.1 (2025-12-11) - CURRENT** +- **Critical fix**: SESSIONS persistence bug +- Implemented `bg_summarize()` stub +- Fixed `/ingest` unreachable code +- Added `cortex/intake/__init__.py` +- Embedded Intake in Cortex (no longer standalone) +- Added diagnostic endpoints +- Lenient error handling +- Documented single-worker constraint + +### Architectural Philosophy + +**Modular Design:** +- Each service has a single, clear responsibility +- Services communicate via well-defined HTTP APIs +- Configuration is centralized but allows per-service overrides + +**Local-First:** +- No reliance on external services (except optional OpenAI) +- All data stored locally (PostgreSQL + Neo4j) +- Can run entirely air-gapped with local LLMs + +**Flexible LLM Backend:** +- Not tied to any single LLM provider +- Can mix local and cloud models +- Per-stage backend selection for optimal performance/cost + +**Error Handling:** +- Lenient mode: Never fail the chat pipeline +- Log errors but continue processing +- Graceful degradation + +**Observability:** +- Diagnostic endpoints for debugging +- Verbose logging mode +- Object ID tracking for singleton verification + +--- + +## Known Issues & Limitations + +### Fixed in v0.5.1 +- βœ… Intake SESSIONS not persisting β†’ **FIXED** +- βœ… `bg_summarize()` NameError β†’ **FIXED** +- βœ… `/ingest` 
endpoint unreachable code β†’ **FIXED** + +### Current Limitations + +**1. Single-Worker Constraint** +- Cortex must run with single Uvicorn worker +- SESSIONS is in-memory module-level global +- Multi-worker support requires Redis or shared storage +- Documented in `cortex/Dockerfile` lines 7-8 + +**2. NeoMem Integration Incomplete** +- Relay doesn't yet push to NeoMem after responses +- Memory storage planned for v0.5.2 +- Currently all memory is short-term (SESSIONS only) + +**3. RAG Service Disabled** +- Beta Lyrae (RAG) commented out in docker-compose.yml +- Awaiting re-enablement after Intake stabilization +- Code exists but not currently integrated + +**4. Session Management** +- No session cleanup/expiration +- SESSIONS grows unbounded (maxlen=200 per session, but infinite sessions) +- No session list endpoint in Relay + +**5. Persona Integration** +- `PERSONA_ENABLED=false` in `.env` +- Persona Sidecar not fully wired +- Identity loaded but not consistently applied + +### Future Enhancements + +**Short-term (v0.5.2):** +- Enable NeoMem integration in Relay +- Add session cleanup/expiration +- Session list endpoint +- NeoMem health monitoring + +**Medium-term (v0.6.x):** +- Re-enable RAG service +- Migrate SESSIONS to Redis for multi-worker support +- Add request correlation IDs +- Comprehensive health checks + +**Long-term (v0.7.x+):** +- Persona Sidecar full integration +- Autonomous "dream" cycles (self-reflection) +- Verifier module for factual grounding +- Advanced RAG with hybrid search +- Memory consolidation strategies + +--- + +## Troubleshooting Quick Reference + +| Problem | Quick Check | Solution | +|---------|-------------|----------| +| SESSIONS empty | `curl localhost:7081/debug/sessions` | Rebuild Cortex, verify `__init__.py` exists | +| LLM timeout | `curl http://10.0.0.44:8080/health` | Check backend connectivity, increase timeout | +| Port conflict | `netstat -tulpn \| grep 7078` | Stop conflicting service or change port | +| Container crash | `docker logs cortex` | Check logs for Python errors, verify .env syntax | +| Missing package | `docker exec cortex pip list` | Rebuild container, check requirements.txt | +| 502 from Relay | `curl localhost:7081/health` | Verify Cortex is running, check docker network | + +--- + +## API Reference (Quick) + +### Relay (Port 7078) + +**POST /v1/chat/completions** - OpenAI-compatible chat +```json +{ + "messages": [{"role": "user", "content": "..."}], + "session_id": "..." +} +``` + +**GET /_health** - Service health + +### Cortex (Port 7081) + +**POST /reason** - Main reasoning pipeline +```json +{ + "session_id": "...", + "user_prompt": "...", + "temperature": 0.7 // optional +} +``` + +**POST /ingest** - Add exchange to SESSIONS +```json +{ + "session_id": "...", + "user_msg": "...", + "assistant_msg": "..." 
+} +``` + +**GET /debug/sessions** - Inspect SESSIONS state + +**GET /debug/summary?session_id=X** - Test summarization + +**GET /health** - Service health + +### NeoMem (Port 7077) + +**POST /memories** - Add memory +```json +{ + "messages": [{"role": "...", "content": "..."}], + "user_id": "...", + "metadata": {} +} +``` + +**POST /search** - Semantic search +```json +{ + "query": "...", + "user_id": "...", + "limit": 10 +} +``` + +**GET /health** - Service health + +--- + +## File Manifest (Key Files Only) + +``` +project-lyra/ +β”œβ”€β”€ .env # Root environment variables +β”œβ”€β”€ docker-compose.yml # Service definitions (152 lines) +β”œβ”€β”€ CHANGELOG.md # Version history (836 lines) +β”œβ”€β”€ README.md # User documentation (610 lines) +β”œβ”€β”€ PROJECT_SUMMARY.md # This file (AI context) +β”‚ +β”œβ”€β”€ cortex/ # Reasoning engine +β”‚ β”œβ”€β”€ Dockerfile # Single-worker constraint documented +β”‚ β”œβ”€β”€ requirements.txt +β”‚ β”œβ”€β”€ .env # Cortex overrides +β”‚ β”œβ”€β”€ main.py # FastAPI initialization +β”‚ β”œβ”€β”€ router.py # Routes (306 lines) +β”‚ β”œβ”€β”€ context.py # Context aggregation +β”‚ β”‚ +β”‚ β”œβ”€β”€ intake/ # Short-term memory (embedded) +β”‚ β”‚ β”œβ”€β”€ __init__.py # Package exports +β”‚ β”‚ └── intake.py # Core logic (367 lines) +β”‚ β”‚ +β”‚ β”œβ”€β”€ reasoning/ # Reasoning pipeline +β”‚ β”‚ β”œβ”€β”€ reflection.py # Meta-awareness +β”‚ β”‚ β”œβ”€β”€ reasoning.py # Draft generation +β”‚ β”‚ └── refine.py # Refinement +β”‚ β”‚ +β”‚ β”œβ”€β”€ persona/ # Personality layer +β”‚ β”‚ β”œβ”€β”€ speak.py # Persona application +β”‚ β”‚ └── identity.py # Persona loader +β”‚ β”‚ +β”‚ └── llm/ # LLM integration +β”‚ └── llm_router.py # Backend selector +β”‚ +β”œβ”€β”€ core/relay/ # Orchestrator +β”‚ β”œβ”€β”€ server.js # Express server (Node.js) +β”‚ └── package.json +β”‚ +β”œβ”€β”€ neomem/ # Long-term memory +β”‚ β”œβ”€β”€ Dockerfile +β”‚ β”œβ”€β”€ requirements.txt +β”‚ β”œβ”€β”€ .env # NeoMem overrides +β”‚ └── main.py # Memory API +β”‚ +└── rag/ # RAG system (disabled) + β”œβ”€β”€ rag_api.py + β”œβ”€β”€ rag_chat_import.py + └── chromadb/ +``` + +--- + +## Final Notes for AI Assistants + +### What You Should Know Before Making Changes + +1. **SESSIONS is sacred** - It's a module-level global in `cortex/intake/intake.py`. Don't move it, don't duplicate it, don't make it a class attribute. It must remain a singleton. + +2. **Single-worker is mandatory** - Until SESSIONS is migrated to Redis, Cortex MUST run with a single Uvicorn worker. Multi-worker will cause SESSIONS to be inconsistent. + +3. **Lenient error handling** - The `/ingest` endpoint and other parts of the pipeline use lenient error handling: log errors but always return success. Never fail the chat pipeline. + +4. **Backend routing is environment-driven** - Don't hardcode LLM URLs. Use the `{MODULE}_LLM` environment variables and the llm_router.py system. + +5. **Intake is embedded** - Don't try to make HTTP calls to Intake. Use direct Python imports: `from intake.intake import ...` + +6. **Test with diagnostic endpoints** - Always use `/debug/sessions` and `/debug/summary` to verify SESSIONS behavior after changes. + +7. **Follow the changelog format** - When documenting changes, use the chronological format established in CHANGELOG.md v0.5.1. Group by version, then by change type (Fixed, Added, Changed, etc.). 
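+
+To make point 4 concrete, here is a minimal sketch of environment-driven backend resolution. It assumes only the variable names documented in the Environment Configuration section (`{MODULE}_LLM`, `LLM_{NAME}_PROVIDER`, `LLM_{NAME}_URL`, `LLM_{NAME}_MODEL`); the function name and the PRIMARY default are illustrative, and the real selection and payload formatting live in `cortex/llm/llm_router.py`.
+
+```python
+import os
+
+def resolve_backend(module: str) -> dict:
+    """Pick a backend for a module from {MODULE}_LLM (illustrative sketch only)."""
+    # e.g. CORTEX_LLM=SECONDARY selects the LLM_SECONDARY_* entries
+    name = os.getenv(f"{module.upper()}_LLM", "PRIMARY").upper()
+    return {
+        "backend": name,
+        "provider": os.getenv(f"LLM_{name}_PROVIDER"),
+        "url": os.getenv(f"LLM_{name}_URL"),
+        "model": os.getenv(f"LLM_{name}_MODEL"),
+    }
+
+# With CORTEX_LLM=SECONDARY this resolves to the Ollama URL and qwen2.5 model.
+print(resolve_backend("cortex"))
+```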
+ +### When You Need Help + +- **SESSIONS issues**: Check `cortex/intake/intake.py` lines 11-14 for initialization, lines 325-366 for `add_exchange_internal()` +- **Routing issues**: Check `cortex/router.py` lines 65-189 for `/reason`, lines 201-233 for `/ingest` +- **LLM backend issues**: Check `cortex/llm/llm_router.py` for backend selection logic +- **Environment variables**: Check `.env` lines 13-40 for LLM backends, lines 28-34 for module selection + +### Most Important Thing + +**This project values reliability over features.** It's better to have a simple, working system than a complex, broken one. When in doubt, keep it simple, log everything, and never fail silently. + +--- + +**End of AI Context Summary** + +*This document is maintained to provide complete context for AI assistants working on Project Lyra. Last updated: v0.5.1 (2025-12-11)* diff --git a/README.md b/README.md index 072f3e0..312e289 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,11 @@ -# Project Lyra - README v0.5.0 +# Project Lyra - README v0.5.1 Lyra is a modular persistent AI companion system with advanced reasoning capabilities. It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**, with multi-stage reasoning pipeline powered by HTTP-based LLM backends. +**Current Version:** v0.5.1 (2025-12-11) + ## Mission Statement The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later. @@ -22,7 +24,7 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do - OpenAI-compatible endpoint: `POST /v1/chat/completions` - Internal endpoint: `POST /chat` - Routes messages through Cortex reasoning pipeline -- Manages async calls to Intake and NeoMem +- Manages async calls to NeoMem and Cortex ingest **2. UI** (Static HTML) - Browser-based chat interface with cyberpunk theme @@ -41,38 +43,48 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do **4. Cortex** (Python/FastAPI) - Port 7081 - Primary reasoning engine with multi-stage pipeline +- **Includes embedded Intake module** (no separate service as of v0.5.1) - **4-Stage Processing:** 1. **Reflection** - Generates meta-awareness notes about conversation 2. **Reasoning** - Creates initial draft answer using context 3. **Refinement** - Polishes and improves the draft 4. **Persona** - Applies Lyra's personality and speaking style -- Integrates with Intake for short-term context +- Integrates with Intake for short-term context via internal Python imports - Flexible LLM router supporting multiple backends via HTTP +- **Endpoints:** + - `POST /reason` - Main reasoning pipeline + - `POST /ingest` - Receives conversation exchanges from Relay + - `GET /health` - Service health check + - `GET /debug/sessions` - Inspect in-memory SESSIONS state + - `GET /debug/summary` - Test summarization for a session -**5. 
Intake v0.2** (Python/FastAPI) - Port 7080 -- Simplified short-term memory summarization -- Session-based circular buffer (deque, maxlen=200) -- Single-level simple summarization (no cascading) -- Background async processing with FastAPI BackgroundTasks -- Pushes summaries to NeoMem automatically -- **API Endpoints:** - - `POST /add_exchange` - Add conversation exchange - - `GET /summaries?session_id={id}` - Retrieve session summary - - `POST /close_session/{id}` - Close and cleanup session +**5. Intake** (Python Module) - **Embedded in Cortex** +- **No longer a standalone service** - runs as Python module inside Cortex container +- Short-term memory management with session-based circular buffer +- In-memory SESSIONS dictionary: `session_id β†’ {buffer: deque(maxlen=200), created_at: timestamp}` +- Multi-level summarization (L1/L5/L10/L20/L30) produced by `summarize_context()` +- Deferred summarization - actual summary generation happens during `/reason` call +- Internal Python API: + - `add_exchange_internal(exchange)` - Direct function call from Cortex + - `summarize_context(session_id, exchanges)` - Async LLM-based summarization + - `SESSIONS` - Module-level global state (requires single Uvicorn worker) ### LLM Backends (HTTP-based) **All LLM communication is done via HTTP APIs:** -- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend +- **PRIMARY**: llama.cpp server (`http://10.0.0.44:8080`) - AMD MI50 GPU backend - **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend + - Model: qwen2.5:7b-instruct-q4_K_M - **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models + - Model: gpt-4o-mini - **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback + - Model: llama-3.2-8b-instruct + +Each module can be configured to use a different backend via environment variables. -Each module can be configured to use a different backend via environment variables. - --- -## Data Flow Architecture (v0.5.0) +## Data Flow Architecture (v0.5.1) ### Normal Message Flow: @@ -82,43 +94,44 @@ User (UI) β†’ POST /v1/chat/completions Relay (7078) ↓ POST /reason Cortex (7081) - ↓ GET /summaries?session_id=xxx -Intake (7080) [RETURNS SUMMARY] + ↓ (internal Python call) +Intake module β†’ summarize_context() ↓ Cortex processes (4 stages): - 1. reflection.py β†’ meta-awareness notes - 2. reasoning.py β†’ draft answer (uses LLM) - 3. refine.py β†’ refined answer (uses LLM) - 4. persona/speak.py β†’ Lyra personality (uses LLM) + 1. reflection.py β†’ meta-awareness notes (CLOUD backend) + 2. reasoning.py β†’ draft answer (PRIMARY backend) + 3. refine.py β†’ refined answer (PRIMARY backend) + 4. persona/speak.py β†’ Lyra personality (CLOUD backend) ↓ Returns persona answer to Relay ↓ -Relay β†’ Cortex /ingest (async, stub) -Relay β†’ Intake /add_exchange (async) +Relay β†’ POST /ingest (async) ↓ -Intake β†’ Background summarize β†’ NeoMem +Cortex β†’ add_exchange_internal() β†’ SESSIONS buffer + ↓ +Relay β†’ NeoMem /memories (async, planned) ↓ Relay β†’ UI (returns final response) ``` ### Cortex 4-Stage Reasoning Pipeline: -1. **Reflection** (`reflection.py`) - Configurable LLM via HTTP +1. **Reflection** (`reflection.py`) - Cloud LLM (OpenAI) - Analyzes user intent and conversation context - Generates meta-awareness notes - "What is the user really asking?" -2. **Reasoning** (`reasoning.py`) - Configurable LLM via HTTP - - Retrieves short-term context from Intake +2. 
**Reasoning** (`reasoning.py`) - Primary LLM (llama.cpp) + - Retrieves short-term context from Intake module - Creates initial draft answer - Integrates context, reflection notes, and user prompt -3. **Refinement** (`refine.py`) - Configurable LLM via HTTP +3. **Refinement** (`refine.py`) - Primary LLM (llama.cpp) - Polishes the draft answer - Improves clarity and coherence - Ensures factual consistency -4. **Persona** (`speak.py`) - Configurable LLM via HTTP +4. **Persona** (`speak.py`) - Cloud LLM (OpenAI) - Applies Lyra's personality and speaking style - Natural, conversational output - Final answer returned to user @@ -134,7 +147,7 @@ Relay β†’ UI (returns final response) - OpenAI-compatible endpoint: `POST /v1/chat/completions` - Internal endpoint: `POST /chat` - Health check: `GET /_health` -- Async non-blocking calls to Cortex and Intake +- Async non-blocking calls to Cortex - Shared request handler for code reuse - Comprehensive error handling @@ -154,73 +167,70 @@ Relay β†’ UI (returns final response) ### Reasoning Layer -**Cortex** (v0.5): +**Cortex** (v0.5.1): - Multi-stage reasoning pipeline (reflection β†’ reasoning β†’ refine β†’ persona) - Flexible LLM backend routing via HTTP - Per-stage backend selection - Async processing throughout -- IntakeClient integration for short-term context -- `/reason`, `/ingest` (stub), `/health` endpoints +- Embedded Intake module for short-term context +- `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` endpoints +- Lenient error handling - never fails the chat pipeline -**Intake** (v0.2): -- Simplified single-level summarization -- Session-based circular buffer (200 exchanges max) -- Background async summarization -- Automatic NeoMem push -- No persistent log files (memory-only) -- **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30) +**Intake** (Embedded Module): +- **Architectural change**: Now runs as Python module inside Cortex container +- In-memory SESSIONS management (session_id β†’ buffer) +- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full) +- Deferred summarization strategy - summaries generated during `/reason` call +- `bg_summarize()` is a logging stub - actual work deferred +- **Single-worker constraint**: SESSIONS requires single Uvicorn worker or Redis/shared storage **LLM Router**: - Dynamic backend selection via HTTP - Environment-driven configuration -- Support for vLLM, Ollama, OpenAI, custom endpoints -- Per-module backend preferences +- Support for llama.cpp, Ollama, OpenAI, custom endpoints +- Per-module backend preferences: + - `CORTEX_LLM=SECONDARY` (Ollama for reasoning) + - `INTAKE_LLM=PRIMARY` (llama.cpp for summarization) + - `SPEAK_LLM=OPENAI` (Cloud for persona) + - `NEOMEM_LLM=PRIMARY` (llama.cpp for memory operations) + +### Beta Lyrae (RAG Memory DB) - Currently Disabled -# Beta Lyrae (RAG Memory DB) - added 11-3-25 - **RAG Knowledge DB - Beta Lyrae (sheliak)** - - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. + - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation. 
- The system uses: - - **ChromaDB** for persistent vector storage - - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity - - **FastAPI** (port 7090) for the `/rag/search` REST endpoint - - Directory Layout - rag/ - β”œβ”€β”€ rag_chat_import.py # imports JSON chat logs - β”œβ”€β”€ rag_docs_import.py # (planned) PDF/EPUB/manual importer - β”œβ”€β”€ rag_build.py # legacy single-folder builder - β”œβ”€β”€ rag_query.py # command-line query helper - β”œβ”€β”€ rag_api.py # FastAPI service providing /rag/search - β”œβ”€β”€ chromadb/ # persistent vector store - β”œβ”€β”€ chatlogs/ # organized source data - β”‚ β”œβ”€β”€ poker/ - β”‚ β”œβ”€β”€ work/ - β”‚ β”œβ”€β”€ lyra/ - β”‚ β”œβ”€β”€ personal/ - β”‚ └── ... - └── import.log # progress log for batch runs - - **OpenAI chatlog importer. - - Takes JSON formatted chat logs and imports it to the RAG. - - **fetures include:** - - Recursive folder indexing with **category detection** from directory name - - Smart chunking for long messages (5 000 chars per slice) - - Automatic deduplication using SHA-1 hash of file + chunk - - Timestamps for both file modification and import time - - Full progress logging via tqdm - - Safe to run in background with nohup … & - - Metadata per chunk: - ```json - { - "chat_id": "", - "chunk_index": 0, - "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json", - "title": "cortex LLMs 11-1-25", - "role": "assistant", - "category": "lyra", - "type": "chat", - "file_modified": "2025-11-06T23:41:02", - "imported_at": "2025-11-07T03:55:00Z" - }``` + - **Status**: Disabled in docker-compose.yml (v0.5.1) + +The system uses: +- **ChromaDB** for persistent vector storage +- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity +- **FastAPI** (port 7090) for the `/rag/search` REST endpoint + +Directory Layout: +``` +rag/ +β”œβ”€β”€ rag_chat_import.py # imports JSON chat logs +β”œβ”€β”€ rag_docs_import.py # (planned) PDF/EPUB/manual importer +β”œβ”€β”€ rag_build.py # legacy single-folder builder +β”œβ”€β”€ rag_query.py # command-line query helper +β”œβ”€β”€ rag_api.py # FastAPI service providing /rag/search +β”œβ”€β”€ chromadb/ # persistent vector store +β”œβ”€β”€ chatlogs/ # organized source data +β”‚ β”œβ”€β”€ poker/ +β”‚ β”œβ”€β”€ work/ +β”‚ β”œβ”€β”€ lyra/ +β”‚ β”œβ”€β”€ personal/ +β”‚ └── ... 
+└── import.log # progress log for batch runs +``` + +**OpenAI chatlog importer features:** +- Recursive folder indexing with **category detection** from directory name +- Smart chunking for long messages (5,000 chars per slice) +- Automatic deduplication using SHA-1 hash of file + chunk +- Timestamps for both file modification and import time +- Full progress logging via tqdm +- Safe to run in background with `nohup … &` --- @@ -228,13 +238,16 @@ Relay β†’ UI (returns final response) All services run in a single docker-compose stack with the following containers: +**Active Services:** - **neomem-postgres** - PostgreSQL with pgvector extension (port 5432) - **neomem-neo4j** - Neo4j graph database (ports 7474, 7687) - **neomem-api** - NeoMem memory service (port 7077) - **relay** - Main orchestrator (port 7078) -- **cortex** - Reasoning engine (port 7081) -- **intake** - Short-term memory summarization (port 7080) - currently disabled -- **rag** - RAG search service (port 7090) - currently disabled +- **cortex** - Reasoning engine with embedded Intake (port 7081) + +**Disabled Services:** +- **intake** - No longer needed (embedded in Cortex as of v0.5.1) +- **rag** - Beta Lyrae RAG service (port 7090) - currently disabled All containers communicate via the `lyra_net` Docker bridge network. @@ -242,10 +255,10 @@ All containers communicate via the `lyra_net` Docker bridge network. The following LLM backends are accessed via HTTP (not part of docker-compose): -- **vLLM Server** (`http://10.0.0.43:8000`) +- **llama.cpp Server** (`http://10.0.0.44:8080`) - AMD MI50 GPU-accelerated inference - - Custom ROCm-enabled vLLM build - Primary backend for reasoning and refinement stages + - Model path: `/model` - **Ollama Server** (`http://10.0.0.3:11434`) - RTX 3090 GPU-accelerated inference @@ -265,16 +278,38 @@ The following LLM backends are accessed via HTTP (not part of docker-compose): ## Version History -### v0.5.0 (2025-11-28) - Current Release +### v0.5.1 (2025-12-11) - Current Release +**Critical Intake Integration Fixes:** +- βœ… Fixed `bg_summarize()` NameError preventing SESSIONS persistence +- βœ… Fixed `/ingest` endpoint unreachable code +- βœ… Added `cortex/intake/__init__.py` for proper package structure +- βœ… Added diagnostic logging to verify SESSIONS singleton behavior +- βœ… Added `/debug/sessions` and `/debug/summary` endpoints +- βœ… Documented single-worker constraint in Dockerfile +- βœ… Implemented lenient error handling (never fails chat pipeline) +- βœ… Intake now embedded in Cortex - no longer standalone service + +**Architecture Changes:** +- Intake module runs inside Cortex container as pure Python import +- No HTTP calls between Cortex and Intake (internal function calls) +- SESSIONS persist correctly in Uvicorn worker +- Deferred summarization strategy (summaries generated during `/reason`) + +### v0.5.0 (2025-11-28) - βœ… Fixed all critical API wiring issues - βœ… Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`) - βœ… Fixed Cortex β†’ Intake integration - βœ… Added missing Python package `__init__.py` files - βœ… End-to-end message flow verified and working +### Infrastructure v1.0.0 (2025-11-26) +- Consolidated 9 scattered `.env` files into single source of truth +- Multi-backend LLM strategy implemented +- Docker Compose consolidation +- Created `.env.example` security templates + ### v0.4.x (Major Rewire) - Cortex multi-stage reasoning pipeline -- Intake v0.2 simplification - LLM router with multi-backend support - Major architectural restructuring 
@@ -285,19 +320,30 @@ The following LLM backends are accessed via HTTP (not part of docker-compose): --- -## Known Issues (v0.5.0) +## Known Issues (v0.5.1) + +### Critical (Fixed in v0.5.1) +- ~~Intake SESSIONS not persisting~~ βœ… **FIXED** +- ~~`bg_summarize()` NameError~~ βœ… **FIXED** +- ~~`/ingest` endpoint unreachable code~~ βœ… **FIXED** ### Non-Critical - Session management endpoints not fully implemented in Relay -- Intake service currently disabled in docker-compose.yml - RAG service currently disabled in docker-compose.yml -- Cortex `/ingest` endpoint is a stub +- NeoMem integration in Relay not yet active (planned for v0.5.2) + +### Operational Notes +- **Single-worker constraint**: Cortex must run with single Uvicorn worker to maintain SESSIONS state + - Multi-worker scaling requires migrating SESSIONS to Redis or shared storage +- Diagnostic endpoints (`/debug/sessions`, `/debug/summary`) available for troubleshooting ### Future Enhancements - Re-enable RAG service integration - Implement full session persistence +- Migrate SESSIONS to Redis for multi-worker support - Add request correlation IDs for tracing -- Comprehensive health checks +- Comprehensive health checks across all services +- NeoMem integration in Relay --- @@ -305,21 +351,39 @@ The following LLM backends are accessed via HTTP (not part of docker-compose): ### Prerequisites - Docker + Docker Compose -- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key) +- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key) ### Setup -1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys +1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys: + ```bash + # Required: Configure at least one LLM backend + LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp + LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama + OPENAI_API_KEY=sk-... # OpenAI + ``` + 2. Start all services with docker-compose: ```bash docker-compose up -d ``` + 3. Check service health: ```bash + # Relay health curl http://localhost:7078/_health + + # Cortex health + curl http://localhost:7081/health + + # NeoMem health + curl http://localhost:7077/health ``` + 4. Access the UI at `http://localhost:7078` ### Test + +**Test Relay β†’ Cortex pipeline:** ```bash curl -X POST http://localhost:7078/v1/chat/completions \ -H "Content-Type: application/json" \ @@ -329,15 +393,130 @@ curl -X POST http://localhost:7078/v1/chat/completions \ }' ``` +**Test Cortex /ingest endpoint:** +```bash +curl -X POST http://localhost:7081/ingest \ + -H "Content-Type: application/json" \ + -d '{ + "session_id": "test", + "user_msg": "Hello", + "assistant_msg": "Hi there!" + }' +``` + +**Inspect SESSIONS state:** +```bash +curl http://localhost:7081/debug/sessions +``` + +**Get summary for a session:** +```bash +curl "http://localhost:7081/debug/summary?session_id=test" +``` + All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack. --- +## Environment Variables + +### LLM Backend Configuration + +**Backend URLs (Full API endpoints):** +```bash +LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp +LLM_PRIMARY_MODEL=/model + +LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama +LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M + +LLM_OPENAI_URL=https://api.openai.com/v1 +LLM_OPENAI_MODEL=gpt-4o-mini +OPENAI_API_KEY=sk-... 
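+
+# FALLBACK backend (emergency local backup; values from the root .env registry)
+LLM_FALLBACK_URL=http://10.0.0.41:11435
+LLM_FALLBACK_MODEL=llama-3.2-8b-instruct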
+``` + +**Module-specific backend selection:** +```bash +CORTEX_LLM=SECONDARY # Use Ollama for reasoning +INTAKE_LLM=PRIMARY # Use llama.cpp for summarization +SPEAK_LLM=OPENAI # Use OpenAI for persona +NEOMEM_LLM=PRIMARY # Use llama.cpp for memory +UI_LLM=OPENAI # Use OpenAI for UI +RELAY_LLM=PRIMARY # Use llama.cpp for relay +``` + +### Database Configuration +```bash +POSTGRES_USER=neomem +POSTGRES_PASSWORD=neomempass +POSTGRES_DB=neomem +POSTGRES_HOST=neomem-postgres +POSTGRES_PORT=5432 + +NEO4J_URI=bolt://neomem-neo4j:7687 +NEO4J_USERNAME=neo4j +NEO4J_PASSWORD=neomemgraph +``` + +### Service URLs (Internal Docker Network) +```bash +NEOMEM_API=http://neomem-api:7077 +CORTEX_API=http://cortex:7081 +CORTEX_REASON_URL=http://cortex:7081/reason +CORTEX_INGEST_URL=http://cortex:7081/ingest +RELAY_URL=http://relay:7078 +``` + +### Feature Flags +```bash +CORTEX_ENABLED=true +MEMORY_ENABLED=true +PERSONA_ENABLED=false +DEBUG_PROMPT=true +VERBOSE_DEBUG=true +``` + +For complete environment variable reference, see [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md). + +--- + ## Documentation -- See [CHANGELOG.md](CHANGELOG.md) for detailed version history -- See `ENVIRONMENT_VARIABLES.md` for environment variable reference -- Additional information available in the Trilium docs +- [CHANGELOG.md](CHANGELOG.md) - Detailed version history +- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - Comprehensive project overview for AI context +- [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md) - Environment variable reference +- [DEPRECATED_FILES.md](DEPRECATED_FILES.md) - Deprecated files and migration guide + +--- + +## Troubleshooting + +### SESSIONS not persisting +**Symptom:** Intake buffer always shows 0 exchanges, summaries always empty. + +**Solution (Fixed in v0.5.1):** +- Ensure `cortex/intake/__init__.py` exists +- Check Cortex logs for `[Intake Module Init]` message showing SESSIONS object ID +- Verify single-worker mode (Dockerfile: `uvicorn main:app --workers 1`) +- Use `/debug/sessions` endpoint to inspect current state + +### Cortex connection errors +**Symptom:** Relay can't reach Cortex, 502 errors. + +**Solution:** +- Verify Cortex container is running: `docker ps | grep cortex` +- Check Cortex health: `curl http://localhost:7081/health` +- Verify environment variables: `CORTEX_REASON_URL=http://cortex:7081/reason` +- Check docker network: `docker network inspect lyra_net` + +### LLM backend timeouts +**Symptom:** Reasoning stage hangs or times out. + +**Solution:** +- Verify LLM backend is running and accessible +- Check LLM backend health: `curl http://10.0.0.44:8080/health` +- Increase timeout in llm_router.py if using slow models +- Check logs for specific backend errors --- @@ -356,6 +535,8 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). - All services communicate via Docker internal networking on the `lyra_net` bridge - History and entity graphs are managed via PostgreSQL + Neo4j - LLM backends are accessed via HTTP and configured in `.env` +- Intake module is imported internally by Cortex (no HTTP communication) +- SESSIONS state is maintained in-memory within Cortex container --- @@ -391,3 +572,38 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0). 
}' ``` +--- + +## Development Notes + +### Cortex Architecture (v0.5.1) +- Cortex contains embedded Intake module at `cortex/intake/` +- Intake is imported as: `from intake.intake import add_exchange_internal, SESSIONS` +- SESSIONS is a module-level global dictionary (singleton pattern) +- Single-worker constraint required to maintain SESSIONS state +- Diagnostic endpoints available for debugging: `/debug/sessions`, `/debug/summary` + +### Adding New LLM Backends +1. Add backend URL to `.env`: + ```bash + LLM_CUSTOM_URL=http://your-backend:port + LLM_CUSTOM_MODEL=model-name + ``` + +2. Configure module to use new backend: + ```bash + CORTEX_LLM=CUSTOM + ``` + +3. Restart Cortex container: + ```bash + docker-compose restart cortex + ``` + +### Debugging Tips +- Enable verbose logging: `VERBOSE_DEBUG=true` in `.env` +- Check Cortex logs: `docker logs cortex -f` +- Inspect SESSIONS: `curl http://localhost:7081/debug/sessions` +- Test summarization: `curl "http://localhost:7081/debug/summary?session_id=test"` +- Check Relay logs: `docker logs relay -f` +- Monitor Docker network: `docker network inspect lyra_net`
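+
+### Example: Lenient /ingest Wiring (Illustrative)
+
+The sketch below illustrates the wiring described in this README: Relay posts an exchange to Cortex `/ingest`, Cortex hands it to the embedded Intake module via a direct function call, and errors are logged but never allowed to break the chat pipeline. It is a simplified illustration of the documented behavior, not the actual code in `cortex/router.py`; the `"default"` session fallback and the log wording are assumptions.
+
+```python
+from fastapi import APIRouter, Request
+
+from intake.intake import add_exchange_internal  # embedded module, no HTTP hop
+
+cortex_router = APIRouter()
+
+@cortex_router.post("/ingest")
+async def ingest(request: Request):
+    """Buffer one Relay exchange in SESSIONS (lenient mode: never raises)."""
+    try:
+        body = await request.json()
+        exchange = {
+            "session_id": body.get("session_id", "default"),
+            "user_msg": body.get("user_msg", ""),
+            "assistant_msg": body.get("assistant_msg", ""),
+        }
+        return add_exchange_internal(exchange)
+    except Exception as exc:  # lenient: log and keep the chat pipeline alive
+        print(f"[ingest] ignored error: {exc}")
+        return {"ok": False, "error": str(exc)}
+```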