docs updated for v0.5.1

This commit is contained in:
serversdwn
2025-12-11 03:49:23 -05:00
parent e45cdbe54e
commit d5d7ea3469
2 changed files with 1227 additions and 157 deletions

422
README.md
View File

@@ -1,9 +1,11 @@
# Project Lyra - README v0.5.0
# Project Lyra - README v0.5.1
Lyra is a modular persistent AI companion system with advanced reasoning capabilities.
It provides memory-backed chat using **NeoMem** + **Relay** + **Cortex**,
with multi-stage reasoning pipeline powered by HTTP-based LLM backends.
**Current Version:** v0.5.1 (2025-12-11)
## Mission Statement
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later.
@@ -22,7 +24,7 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Routes messages through Cortex reasoning pipeline
- Manages async calls to Intake and NeoMem
- Manages async calls to NeoMem and Cortex ingest
**2. UI** (Static HTML)
- Browser-based chat interface with cyberpunk theme
@@ -41,38 +43,48 @@ Project Lyra operates as a **single docker-compose deployment** with multiple Do
**4. Cortex** (Python/FastAPI) - Port 7081
- Primary reasoning engine with multi-stage pipeline
- **Includes embedded Intake module** (no separate service as of v0.5.1)
- **4-Stage Processing:**
1. **Reflection** - Generates meta-awareness notes about conversation
2. **Reasoning** - Creates initial draft answer using context
3. **Refinement** - Polishes and improves the draft
4. **Persona** - Applies Lyra's personality and speaking style
- Integrates with Intake for short-term context
- Integrates with Intake for short-term context via internal Python imports
- Flexible LLM router supporting multiple backends via HTTP
- **Endpoints:**
- `POST /reason` - Main reasoning pipeline
- `POST /ingest` - Receives conversation exchanges from Relay
- `GET /health` - Service health check
- `GET /debug/sessions` - Inspect in-memory SESSIONS state
- `GET /debug/summary` - Test summarization for a session
**5. Intake v0.2** (Python/FastAPI) - Port 7080
- Simplified short-term memory summarization
- Session-based circular buffer (deque, maxlen=200)
- Single-level simple summarization (no cascading)
- Background async processing with FastAPI BackgroundTasks
- Pushes summaries to NeoMem automatically
- **API Endpoints:**
- `POST /add_exchange` - Add conversation exchange
- `GET /summaries?session_id={id}` - Retrieve session summary
- `POST /close_session/{id}` - Close and cleanup session
**5. Intake** (Python Module) - **Embedded in Cortex**
- **No longer a standalone service** - runs as Python module inside Cortex container
- Short-term memory management with session-based circular buffer
- In-memory SESSIONS dictionary: `session_id → {buffer: deque(maxlen=200), created_at: timestamp}`
- Multi-level summarization (L1/L5/L10/L20/L30) produced by `summarize_context()`
- Deferred summarization - actual summary generation happens during `/reason` call
- Internal Python API:
- `add_exchange_internal(exchange)` - Direct function call from Cortex
- `summarize_context(session_id, exchanges)` - Async LLM-based summarization
- `SESSIONS` - Module-level global state (requires single Uvicorn worker)
### LLM Backends (HTTP-based)
**All LLM communication is done via HTTP APIs:**
- **PRIMARY**: vLLM server (`http://10.0.0.43:8000`) - AMD MI50 GPU backend
- **PRIMARY**: llama.cpp server (`http://10.0.0.44:8080`) - AMD MI50 GPU backend
- **SECONDARY**: Ollama server (`http://10.0.0.3:11434`) - RTX 3090 backend
- Model: qwen2.5:7b-instruct-q4_K_M
- **CLOUD**: OpenAI API (`https://api.openai.com/v1`) - Cloud-based models
- Model: gpt-4o-mini
- **FALLBACK**: Local backup (`http://10.0.0.41:11435`) - Emergency fallback
- Model: llama-3.2-8b-instruct
Each module can be configured to use a different backend via environment variables.
Each module can be configured to use a different backend via environment variables.
---
## Data Flow Architecture (v0.5.0)
## Data Flow Architecture (v0.5.1)
### Normal Message Flow:
@@ -82,43 +94,44 @@ User (UI) → POST /v1/chat/completions
Relay (7078)
↓ POST /reason
Cortex (7081)
GET /summaries?session_id=xxx
Intake (7080) [RETURNS SUMMARY]
(internal Python call)
Intake module → summarize_context()
Cortex processes (4 stages):
1. reflection.py → meta-awareness notes
2. reasoning.py → draft answer (uses LLM)
3. refine.py → refined answer (uses LLM)
4. persona/speak.py → Lyra personality (uses LLM)
1. reflection.py → meta-awareness notes (CLOUD backend)
2. reasoning.py → draft answer (PRIMARY backend)
3. refine.py → refined answer (PRIMARY backend)
4. persona/speak.py → Lyra personality (CLOUD backend)
Returns persona answer to Relay
Relay → Cortex /ingest (async, stub)
Relay → Intake /add_exchange (async)
Relay → POST /ingest (async)
Intake → Background summarize → NeoMem
Cortex → add_exchange_internal() → SESSIONS buffer
Relay → NeoMem /memories (async, planned)
Relay → UI (returns final response)
```
### Cortex 4-Stage Reasoning Pipeline:
1. **Reflection** (`reflection.py`) - Configurable LLM via HTTP
1. **Reflection** (`reflection.py`) - Cloud LLM (OpenAI)
- Analyzes user intent and conversation context
- Generates meta-awareness notes
- "What is the user really asking?"
2. **Reasoning** (`reasoning.py`) - Configurable LLM via HTTP
- Retrieves short-term context from Intake
2. **Reasoning** (`reasoning.py`) - Primary LLM (llama.cpp)
- Retrieves short-term context from Intake module
- Creates initial draft answer
- Integrates context, reflection notes, and user prompt
3. **Refinement** (`refine.py`) - Configurable LLM via HTTP
3. **Refinement** (`refine.py`) - Primary LLM (llama.cpp)
- Polishes the draft answer
- Improves clarity and coherence
- Ensures factual consistency
4. **Persona** (`speak.py`) - Configurable LLM via HTTP
4. **Persona** (`speak.py`) - Cloud LLM (OpenAI)
- Applies Lyra's personality and speaking style
- Natural, conversational output
- Final answer returned to user
@@ -134,7 +147,7 @@ Relay → UI (returns final response)
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Internal endpoint: `POST /chat`
- Health check: `GET /_health`
- Async non-blocking calls to Cortex and Intake
- Async non-blocking calls to Cortex
- Shared request handler for code reuse
- Comprehensive error handling
@@ -154,73 +167,70 @@ Relay → UI (returns final response)
### Reasoning Layer
**Cortex** (v0.5):
**Cortex** (v0.5.1):
- Multi-stage reasoning pipeline (reflection → reasoning → refine → persona)
- Flexible LLM backend routing via HTTP
- Per-stage backend selection
- Async processing throughout
- IntakeClient integration for short-term context
- `/reason`, `/ingest` (stub), `/health` endpoints
- Embedded Intake module for short-term context
- `/reason`, `/ingest`, `/health`, `/debug/sessions`, `/debug/summary` endpoints
- Lenient error handling - never fails the chat pipeline
**Intake** (v0.2):
- Simplified single-level summarization
- Session-based circular buffer (200 exchanges max)
- Background async summarization
- Automatic NeoMem push
- No persistent log files (memory-only)
- **Breaking change from v0.1**: Removed cascading summaries (L1, L2, L5, L10, L20, L30)
**Intake** (Embedded Module):
- **Architectural change**: Now runs as Python module inside Cortex container
- In-memory SESSIONS management (session_id → buffer)
- Multi-level summarization: L1 (ultra-short), L5 (short), L10 (medium), L20 (detailed), L30 (full)
- Deferred summarization strategy - summaries generated during `/reason` call
- `bg_summarize()` is a logging stub - actual work deferred
- **Single-worker constraint**: SESSIONS requires single Uvicorn worker or Redis/shared storage
**LLM Router**:
- Dynamic backend selection via HTTP
- Environment-driven configuration
- Support for vLLM, Ollama, OpenAI, custom endpoints
- Per-module backend preferences
- Support for llama.cpp, Ollama, OpenAI, custom endpoints
- Per-module backend preferences:
- `CORTEX_LLM=SECONDARY` (Ollama for reasoning)
- `INTAKE_LLM=PRIMARY` (llama.cpp for summarization)
- `SPEAK_LLM=OPENAI` (Cloud for persona)
- `NEOMEM_LLM=PRIMARY` (llama.cpp for memory operations)
### Beta Lyrae (RAG Memory DB) - Currently Disabled
# Beta Lyrae (RAG Memory DB) - added 11-3-25
- **RAG Knowledge DB - Beta Lyrae (sheliak)**
- This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
- This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
- Directory Layout
rag/
├── rag_chat_import.py # imports JSON chat logs
├── rag_docs_import.py # (planned) PDF/EPUB/manual importer
├── rag_build.py # legacy single-folder builder
├── rag_query.py # command-line query helper
├── rag_api.py # FastAPI service providing /rag/search
├── chromadb/ # persistent vector store
├── chatlogs/ # organized source data
│ ├── poker/
│ ├── work/
│ ├── lyra/
│ ├── personal/
│ └── ...
└── import.log # progress log for batch runs
- **OpenAI chatlog importer.
- Takes JSON formatted chat logs and imports it to the RAG.
- **fetures include:**
- Recursive folder indexing with **category detection** from directory name
- Smart chunking for long messages (5 000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with nohup … &
- Metadata per chunk:
```json
{
"chat_id": "<sha1 of filename>",
"chunk_index": 0,
"source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
"title": "cortex LLMs 11-1-25",
"role": "assistant",
"category": "lyra",
"type": "chat",
"file_modified": "2025-11-06T23:41:02",
"imported_at": "2025-11-07T03:55:00Z"
}```
- **Status**: Disabled in docker-compose.yml (v0.5.1)
The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
Directory Layout:
```
rag/
├── rag_chat_import.py # imports JSON chat logs
├── rag_docs_import.py # (planned) PDF/EPUB/manual importer
├── rag_build.py # legacy single-folder builder
├── rag_query.py # command-line query helper
├── rag_api.py # FastAPI service providing /rag/search
├── chromadb/ # persistent vector store
├── chatlogs/ # organized source data
│ ├── poker/
│ ├── work/
├── lyra/
│ ├── personal/
└── ...
└── import.log # progress log for batch runs
```
**OpenAI chatlog importer features:**
- Recursive folder indexing with **category detection** from directory name
- Smart chunking for long messages (5,000 chars per slice)
- Automatic deduplication using SHA-1 hash of file + chunk
- Timestamps for both file modification and import time
- Full progress logging via tqdm
- Safe to run in background with `nohup … &`
---
@@ -228,13 +238,16 @@ Relay → UI (returns final response)
All services run in a single docker-compose stack with the following containers:
**Active Services:**
- **neomem-postgres** - PostgreSQL with pgvector extension (port 5432)
- **neomem-neo4j** - Neo4j graph database (ports 7474, 7687)
- **neomem-api** - NeoMem memory service (port 7077)
- **relay** - Main orchestrator (port 7078)
- **cortex** - Reasoning engine (port 7081)
- **intake** - Short-term memory summarization (port 7080) - currently disabled
- **rag** - RAG search service (port 7090) - currently disabled
- **cortex** - Reasoning engine with embedded Intake (port 7081)
**Disabled Services:**
- **intake** - No longer needed (embedded in Cortex as of v0.5.1)
- **rag** - Beta Lyrae RAG service (port 7090) - currently disabled
All containers communicate via the `lyra_net` Docker bridge network.
@@ -242,10 +255,10 @@ All containers communicate via the `lyra_net` Docker bridge network.
The following LLM backends are accessed via HTTP (not part of docker-compose):
- **vLLM Server** (`http://10.0.0.43:8000`)
- **llama.cpp Server** (`http://10.0.0.44:8080`)
- AMD MI50 GPU-accelerated inference
- Custom ROCm-enabled vLLM build
- Primary backend for reasoning and refinement stages
- Model path: `/model`
- **Ollama Server** (`http://10.0.0.3:11434`)
- RTX 3090 GPU-accelerated inference
@@ -265,16 +278,38 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
## Version History
### v0.5.0 (2025-11-28) - Current Release
### v0.5.1 (2025-12-11) - Current Release
**Critical Intake Integration Fixes:**
- ✅ Fixed `bg_summarize()` NameError preventing SESSIONS persistence
- ✅ Fixed `/ingest` endpoint unreachable code
- ✅ Added `cortex/intake/__init__.py` for proper package structure
- ✅ Added diagnostic logging to verify SESSIONS singleton behavior
- ✅ Added `/debug/sessions` and `/debug/summary` endpoints
- ✅ Documented single-worker constraint in Dockerfile
- ✅ Implemented lenient error handling (never fails chat pipeline)
- ✅ Intake now embedded in Cortex - no longer standalone service
**Architecture Changes:**
- Intake module runs inside Cortex container as pure Python import
- No HTTP calls between Cortex and Intake (internal function calls)
- SESSIONS persist correctly in Uvicorn worker
- Deferred summarization strategy (summaries generated during `/reason`)
### v0.5.0 (2025-11-28)
- ✅ Fixed all critical API wiring issues
- ✅ Added OpenAI-compatible endpoint to Relay (`/v1/chat/completions`)
- ✅ Fixed Cortex → Intake integration
- ✅ Added missing Python package `__init__.py` files
- ✅ End-to-end message flow verified and working
### Infrastructure v1.0.0 (2025-11-26)
- Consolidated 9 scattered `.env` files into single source of truth
- Multi-backend LLM strategy implemented
- Docker Compose consolidation
- Created `.env.example` security templates
### v0.4.x (Major Rewire)
- Cortex multi-stage reasoning pipeline
- Intake v0.2 simplification
- LLM router with multi-backend support
- Major architectural restructuring
@@ -285,19 +320,30 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
---
## Known Issues (v0.5.0)
## Known Issues (v0.5.1)
### Critical (Fixed in v0.5.1)
- ~~Intake SESSIONS not persisting~~ ✅ **FIXED**
- ~~`bg_summarize()` NameError~~ ✅ **FIXED**
- ~~`/ingest` endpoint unreachable code~~ ✅ **FIXED**
### Non-Critical
- Session management endpoints not fully implemented in Relay
- Intake service currently disabled in docker-compose.yml
- RAG service currently disabled in docker-compose.yml
- Cortex `/ingest` endpoint is a stub
- NeoMem integration in Relay not yet active (planned for v0.5.2)
### Operational Notes
- **Single-worker constraint**: Cortex must run with single Uvicorn worker to maintain SESSIONS state
- Multi-worker scaling requires migrating SESSIONS to Redis or shared storage
- Diagnostic endpoints (`/debug/sessions`, `/debug/summary`) available for troubleshooting
### Future Enhancements
- Re-enable RAG service integration
- Implement full session persistence
- Migrate SESSIONS to Redis for multi-worker support
- Add request correlation IDs for tracing
- Comprehensive health checks
- Comprehensive health checks across all services
- NeoMem integration in Relay
---
@@ -305,21 +351,39 @@ The following LLM backends are accessed via HTTP (not part of docker-compose):
### Prerequisites
- Docker + Docker Compose
- At least one HTTP-accessible LLM endpoint (vLLM, Ollama, or OpenAI API key)
- At least one HTTP-accessible LLM endpoint (llama.cpp, Ollama, or OpenAI API key)
### Setup
1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys
1. Copy `.env.example` to `.env` and configure your LLM backend URLs and API keys:
```bash
# Required: Configure at least one LLM backend
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
OPENAI_API_KEY=sk-... # OpenAI
```
2. Start all services with docker-compose:
```bash
docker-compose up -d
```
3. Check service health:
```bash
# Relay health
curl http://localhost:7078/_health
# Cortex health
curl http://localhost:7081/health
# NeoMem health
curl http://localhost:7077/health
```
4. Access the UI at `http://localhost:7078`
### Test
**Test Relay → Cortex pipeline:**
```bash
curl -X POST http://localhost:7078/v1/chat/completions \
-H "Content-Type: application/json" \
@@ -329,15 +393,130 @@ curl -X POST http://localhost:7078/v1/chat/completions \
}'
```
**Test Cortex /ingest endpoint:**
```bash
curl -X POST http://localhost:7081/ingest \
-H "Content-Type: application/json" \
-d '{
"session_id": "test",
"user_msg": "Hello",
"assistant_msg": "Hi there!"
}'
```
**Inspect SESSIONS state:**
```bash
curl http://localhost:7081/debug/sessions
```
**Get summary for a session:**
```bash
curl "http://localhost:7081/debug/summary?session_id=test"
```
All backend databases (PostgreSQL and Neo4j) are automatically started as part of the docker-compose stack.
---
## Environment Variables
### LLM Backend Configuration
**Backend URLs (Full API endpoints):**
```bash
LLM_PRIMARY_URL=http://10.0.0.44:8080 # llama.cpp
LLM_PRIMARY_MODEL=/model
LLM_SECONDARY_URL=http://10.0.0.3:11434 # Ollama
LLM_SECONDARY_MODEL=qwen2.5:7b-instruct-q4_K_M
LLM_OPENAI_URL=https://api.openai.com/v1
LLM_OPENAI_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
```
**Module-specific backend selection:**
```bash
CORTEX_LLM=SECONDARY # Use Ollama for reasoning
INTAKE_LLM=PRIMARY # Use llama.cpp for summarization
SPEAK_LLM=OPENAI # Use OpenAI for persona
NEOMEM_LLM=PRIMARY # Use llama.cpp for memory
UI_LLM=OPENAI # Use OpenAI for UI
RELAY_LLM=PRIMARY # Use llama.cpp for relay
```
### Database Configuration
```bash
POSTGRES_USER=neomem
POSTGRES_PASSWORD=neomempass
POSTGRES_DB=neomem
POSTGRES_HOST=neomem-postgres
POSTGRES_PORT=5432
NEO4J_URI=bolt://neomem-neo4j:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neomemgraph
```
### Service URLs (Internal Docker Network)
```bash
NEOMEM_API=http://neomem-api:7077
CORTEX_API=http://cortex:7081
CORTEX_REASON_URL=http://cortex:7081/reason
CORTEX_INGEST_URL=http://cortex:7081/ingest
RELAY_URL=http://relay:7078
```
### Feature Flags
```bash
CORTEX_ENABLED=true
MEMORY_ENABLED=true
PERSONA_ENABLED=false
DEBUG_PROMPT=true
VERBOSE_DEBUG=true
```
For complete environment variable reference, see [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md).
---
## Documentation
- See [CHANGELOG.md](CHANGELOG.md) for detailed version history
- See `ENVIRONMENT_VARIABLES.md` for environment variable reference
- Additional information available in the Trilium docs
- [CHANGELOG.md](CHANGELOG.md) - Detailed version history
- [PROJECT_SUMMARY.md](PROJECT_SUMMARY.md) - Comprehensive project overview for AI context
- [ENVIRONMENT_VARIABLES.md](ENVIRONMENT_VARIABLES.md) - Environment variable reference
- [DEPRECATED_FILES.md](DEPRECATED_FILES.md) - Deprecated files and migration guide
---
## Troubleshooting
### SESSIONS not persisting
**Symptom:** Intake buffer always shows 0 exchanges, summaries always empty.
**Solution (Fixed in v0.5.1):**
- Ensure `cortex/intake/__init__.py` exists
- Check Cortex logs for `[Intake Module Init]` message showing SESSIONS object ID
- Verify single-worker mode (Dockerfile: `uvicorn main:app --workers 1`)
- Use `/debug/sessions` endpoint to inspect current state
### Cortex connection errors
**Symptom:** Relay can't reach Cortex, 502 errors.
**Solution:**
- Verify Cortex container is running: `docker ps | grep cortex`
- Check Cortex health: `curl http://localhost:7081/health`
- Verify environment variables: `CORTEX_REASON_URL=http://cortex:7081/reason`
- Check docker network: `docker network inspect lyra_net`
### LLM backend timeouts
**Symptom:** Reasoning stage hangs or times out.
**Solution:**
- Verify LLM backend is running and accessible
- Check LLM backend health: `curl http://10.0.0.44:8080/health`
- Increase timeout in llm_router.py if using slow models
- Check logs for specific backend errors
---
@@ -356,6 +535,8 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
- All services communicate via Docker internal networking on the `lyra_net` bridge
- History and entity graphs are managed via PostgreSQL + Neo4j
- LLM backends are accessed via HTTP and configured in `.env`
- Intake module is imported internally by Cortex (no HTTP communication)
- SESSIONS state is maintained in-memory within Cortex container
---
@@ -391,3 +572,38 @@ NeoMem is a derivative work based on Mem0 OSS (Apache 2.0).
}'
```
---
## Development Notes
### Cortex Architecture (v0.5.1)
- Cortex contains embedded Intake module at `cortex/intake/`
- Intake is imported as: `from intake.intake import add_exchange_internal, SESSIONS`
- SESSIONS is a module-level global dictionary (singleton pattern)
- Single-worker constraint required to maintain SESSIONS state
- Diagnostic endpoints available for debugging: `/debug/sessions`, `/debug/summary`
### Adding New LLM Backends
1. Add backend URL to `.env`:
```bash
LLM_CUSTOM_URL=http://your-backend:port
LLM_CUSTOM_MODEL=model-name
```
2. Configure module to use new backend:
```bash
CORTEX_LLM=CUSTOM
```
3. Restart Cortex container:
```bash
docker-compose restart cortex
```
### Debugging Tips
- Enable verbose logging: `VERBOSE_DEBUG=true` in `.env`
- Check Cortex logs: `docker logs cortex -f`
- Inspect SESSIONS: `curl http://localhost:7081/debug/sessions`
- Test summarization: `curl "http://localhost:7081/debug/summary?session_id=test"`
- Check Relay logs: `docker logs relay -f`
- Monitor Docker network: `docker network inspect lyra_net`