From 5ed3fd0982a5f56e3386218f76d5f3cdd7f05605 Mon Sep 17 00:00:00 2001 From: serversdwn Date: Thu, 11 Dec 2025 02:50:23 -0500 Subject: [PATCH] cortex rework continued. --- CHANGELOG.md | 1304 +++++++++++++++++++------------------ cortex/Dockerfile | 2 + cortex/context.py | 39 +- cortex/intake/__init__.py | 18 + cortex/intake/intake.py | 138 ++-- cortex/router.py | 106 ++- vllm-mi50.md | 416 ------------ 7 files changed, 910 insertions(+), 1113 deletions(-) create mode 100644 cortex/intake/__init__.py delete mode 100644 vllm-mi50.md diff --git a/CHANGELOG.md b/CHANGELOG.md index b634cc9..ab30ad6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,18 +1,72 @@ -# Project Lyra β€” Modular Changelog -All notable changes to Project Lyra are organized by component. -The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) -and adheres to [Semantic Versioning](https://semver.org/). -# Last Updated: 11-28-25 +# Project Lyra Changelog + +All notable changes to Project Lyra. +Format based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) and [Semantic Versioning](https://semver.org/). + --- -## 🧠 Lyra-Core ############################################################################## +## [Unreleased] -## [Project Lyra v0.5.0] - 2025-11-28 +--- + +## [0.5.1] - 2025-12-11 + +### Fixed - Intake Integration +- **Critical**: Fixed `bg_summarize()` function not defined error + - Was only a `TYPE_CHECKING` stub, now implemented as logging stub + - Eliminated `NameError` preventing SESSIONS from persisting correctly + - Function now logs exchange additions and defers summarization to `/reason` endpoint +- **Critical**: Fixed `/ingest` endpoint unreachable code in [router.py:201-233](cortex/router.py#L201-L233) + - Removed early return that prevented `update_last_assistant_message()` from executing + - Removed duplicate `add_exchange_internal()` call + - Implemented lenient error handling (each operation wrapped in try/except) +- **Intake**: Added missing `__init__.py` to make intake a proper Python package [cortex/intake/__init__.py](cortex/intake/__init__.py) + - Prevents namespace package issues + - Enables proper module imports + - Exports `SESSIONS`, `add_exchange_internal`, `summarize_context` + +### Added - Diagnostics & Debugging +- Added diagnostic logging to verify SESSIONS singleton behavior + - Module initialization logs SESSIONS object ID [intake.py:14](cortex/intake/intake.py#L14) + - Each `add_exchange_internal()` call logs object ID and buffer state [intake.py:343-358](cortex/intake/intake.py#L343-L358) +- Added `/debug/sessions` HTTP endpoint [router.py:276-305](cortex/router.py#L276-L305) + - Inspect SESSIONS from within running Uvicorn worker + - Shows total sessions, session count, buffer sizes, recent exchanges + - Returns SESSIONS object ID for verification +- Added `/debug/summary` HTTP endpoint [router.py:238-271](cortex/router.py#L238-L271) + - Test `summarize_context()` for any session + - Returns L1/L5/L10/L20/L30 summaries + - Includes buffer size and exchange preview + +### Changed - Intake Architecture +- **Intake no longer standalone service** - runs inside Cortex container as pure Python module + - Imported as `from intake.intake import add_exchange_internal, SESSIONS` + - No HTTP calls between Cortex and Intake + - Eliminates network latency and dependency on Intake service being up +- **Deferred summarization**: `bg_summarize()` is now a no-op stub [intake.py:318-325](cortex/intake/intake.py#L318-L325) + - Actual summarization happens during 
`/reason` call via `summarize_context()` + - Simplifies async/sync complexity + - Prevents NameError when called from `add_exchange_internal()` +- **Lenient error handling**: `/ingest` endpoint always returns success [router.py:201-233](cortex/router.py#L201-L233) + - Each operation wrapped in try/except + - Logs errors but never fails to avoid breaking chat pipeline + - User requirement: never fail chat pipeline + +### Documentation +- Added single-worker constraint note in [cortex/Dockerfile:7-8](cortex/Dockerfile#L7-L8) + - Documents that SESSIONS requires single Uvicorn worker + - Notes that multi-worker scaling requires Redis or shared storage +- Updated plan documentation with root cause analysis + +--- + +## [0.5.0] - 2025-11-28 + +### Fixed - Critical API Wiring & Integration -### πŸ”§ Fixed - Critical API Wiring & Integration After the major architectural rewire (v0.4.x), this release fixes all critical endpoint mismatches and ensures end-to-end system connectivity. -#### Cortex β†’ Intake Integration βœ… +#### Cortex β†’ Intake Integration - **Fixed** `IntakeClient` to use correct Intake v0.2 API endpoints - Changed `GET /context/{session_id}` β†’ `GET /summaries?session_id={session_id}` - Updated JSON response parsing to extract `summary_text` field @@ -20,7 +74,7 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - Corrected default port: `7083` β†’ `7080` - Added deprecation warning to `summarize_turn()` method (endpoint removed in Intake v0.2) -#### Relay β†’ UI Compatibility βœ… +#### Relay β†’ UI Compatibility - **Added** OpenAI-compatible endpoint `POST /v1/chat/completions` - Accepts standard OpenAI format with `messages[]` array - Returns OpenAI-compatible response structure with `choices[]` @@ -31,13 +85,13 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - Eliminates code duplication - Consistent error handling across endpoints -#### Relay β†’ Intake Connection βœ… +#### Relay β†’ Intake Connection - **Fixed** Intake URL fallback in Relay server configuration - Corrected port: `7082` β†’ `7080` - Updated endpoint: `/summary` β†’ `/add_exchange` - Now properly sends exchanges to Intake for summarization -#### Code Quality & Python Package Structure βœ… +#### Code Quality & Python Package Structure - **Added** missing `__init__.py` files to all Cortex subdirectories - `cortex/llm/__init__.py` - `cortex/reasoning/__init__.py` @@ -48,7 +102,8 @@ After the major architectural rewire (v0.4.x), this release fixes all critical e - **Removed** unused import in `cortex/router.py`: `from unittest import result` - **Deleted** empty file `cortex/llm/resolve_llm_url.py` (was 0 bytes, never implemented) -### βœ… Verified Working +### Verified Working + Complete end-to-end message flow now operational: ``` UI β†’ Relay (/v1/chat/completions) @@ -72,26 +127,26 @@ Intake β†’ NeoMem (background memory storage) Relay β†’ UI (final response) ``` -### πŸ“ Documentation -- **Added** this CHANGELOG entry with comprehensive v0.5.0 notes +### Documentation +- **Added** comprehensive v0.5.0 changelog entry - **Updated** README.md to reflect v0.5.0 architecture - Documented new endpoints - Updated data flow diagrams - Clarified Intake v0.2 changes - Corrected service descriptions -### πŸ› Issues Resolved +### Issues Resolved - ❌ Cortex could not retrieve context from Intake (wrong endpoint) - ❌ UI could not send messages to Relay (endpoint mismatch) - ❌ Relay could not send summaries to Intake (wrong port/endpoint) - ❌ Python 
package imports were implicit (missing __init__.py) -### ⚠️ Known Issues (Non-Critical) +### Known Issues (Non-Critical) - Session management endpoints not implemented in Relay (`GET/POST /sessions/:id`) - RAG service currently disabled in docker-compose.yml - Cortex `/ingest` endpoint is a stub returning `{"status": "ok"}` -### 🎯 Migration Notes +### Migration Notes If upgrading from v0.4.x: 1. Pull latest changes from git 2. Verify environment variables in `.env` files: @@ -104,45 +159,48 @@ If upgrading from v0.4.x: ## [Infrastructure v1.0.0] - 2025-11-26 -### Changed -- **Environment Variable Consolidation** - Major reorganization to eliminate duplication and improve maintainability - - Consolidated 9 scattered `.env` files into single source of truth architecture - - Root `.env` now contains all shared infrastructure (LLM backends, databases, API keys, service URLs) - - Service-specific `.env` files minimized to only essential overrides: - - `cortex/.env`: Reduced from 42 to 22 lines (operational parameters only) - - `neomem/.env`: Reduced from 26 to 14 lines (LLM naming conventions only) - - `intake/.env`: Kept at 8 lines (already minimal) - - **Result**: ~24% reduction in total configuration lines (197 β†’ ~150) +### Changed - Environment Variable Consolidation -- **Docker Compose Consolidation** - - All services now defined in single root `docker-compose.yml` - - Relay service updated with complete configuration (env_file, volumes) - - Removed redundant `core/docker-compose.yml` (marked as DEPRECATED) - - Standardized network communication to use Docker container names +**Major reorganization to eliminate duplication and improve maintainability** -- **Service URL Standardization** - - Internal services use container names: `http://neomem-api:7077`, `http://cortex:7081` - - External services use IP addresses: `http://10.0.0.43:8000` (vLLM), `http://10.0.0.3:11434` (Ollama) - - Removed IP/container name inconsistencies across files +- Consolidated 9 scattered `.env` files into single source of truth architecture +- Root `.env` now contains all shared infrastructure (LLM backends, databases, API keys, service URLs) +- Service-specific `.env` files minimized to only essential overrides: + - `cortex/.env`: Reduced from 42 to 22 lines (operational parameters only) + - `neomem/.env`: Reduced from 26 to 14 lines (LLM naming conventions only) + - `intake/.env`: Kept at 8 lines (already minimal) +- **Result**: ~24% reduction in total configuration lines (197 β†’ ~150) -### Added -- **Security Templates** - Created `.env.example` files for all services - - Root `.env.example` with sanitized credentials - - Service-specific templates: `cortex/.env.example`, `neomem/.env.example`, `intake/.env.example`, `rag/.env.example` - - All `.env.example` files safe to commit to version control +**Docker Compose Consolidation** +- All services now defined in single root `docker-compose.yml` +- Relay service updated with complete configuration (env_file, volumes) +- Removed redundant `core/docker-compose.yml` (marked as DEPRECATED) +- Standardized network communication to use Docker container names -- **Documentation** - - `ENVIRONMENT_VARIABLES.md`: Comprehensive reference for all environment variables - - Variable descriptions, defaults, and usage examples - - Multi-backend LLM strategy documentation - - Troubleshooting guide - - Security best practices - - `DEPRECATED_FILES.md`: Deletion guide for deprecated files with verification steps +**Service URL Standardization** +- Internal services use 
container names: `http://neomem-api:7077`, `http://cortex:7081` +- External services use IP addresses: `http://10.0.0.43:8000` (vLLM), `http://10.0.0.3:11434` (Ollama) +- Removed IP/container name inconsistencies across files -- **Enhanced .gitignore** - - Ignores all `.env` files (including subdirectories) - - Tracks `.env.example` templates for documentation - - Ignores `.env-backups/` directory +### Added - Security & Documentation + +**Security Templates** - Created `.env.example` files for all services +- Root `.env.example` with sanitized credentials +- Service-specific templates: `cortex/.env.example`, `neomem/.env.example`, `intake/.env.example`, `rag/.env.example` +- All `.env.example` files safe to commit to version control + +**Documentation** +- `ENVIRONMENT_VARIABLES.md`: Comprehensive reference for all environment variables + - Variable descriptions, defaults, and usage examples + - Multi-backend LLM strategy documentation + - Troubleshooting guide + - Security best practices +- `DEPRECATED_FILES.md`: Deletion guide for deprecated files with verification steps + +**Enhanced .gitignore** +- Ignores all `.env` files (including subdirectories) +- Tracks `.env.example` templates for documentation +- Ignores `.env-backups/` directory ### Removed - `core/.env` - Redundant with root `.env`, now deleted @@ -154,13 +212,15 @@ If upgrading from v0.4.x: - Eliminated duplicate database credentials across 3+ files - Resolved Cortex `environment:` section override in docker-compose (now uses env_file) -### Architecture -- **Multi-Backend LLM Strategy**: Root `.env` provides all backend OPTIONS (PRIMARY, SECONDARY, CLOUD, FALLBACK), services choose which to USE - - Cortex β†’ vLLM (PRIMARY) for autonomous reasoning - - NeoMem β†’ Ollama (SECONDARY) + OpenAI embeddings - - Intake β†’ vLLM (PRIMARY) for summarization - - Relay β†’ Fallback chain with user preference -- Preserves per-service flexibility while eliminating URL duplication +### Architecture - Multi-Backend LLM Strategy + +Root `.env` provides all backend OPTIONS (PRIMARY, SECONDARY, CLOUD, FALLBACK), services choose which to USE: +- **Cortex** β†’ vLLM (PRIMARY) for autonomous reasoning +- **NeoMem** β†’ Ollama (SECONDARY) + OpenAI embeddings +- **Intake** β†’ vLLM (PRIMARY) for summarization +- **Relay** β†’ Fallback chain with user preference + +Preserves per-service flexibility while eliminating URL duplication. ### Migration - All original `.env` files backed up to `.env-backups/` with timestamp `20251126_025334` @@ -169,637 +229,607 @@ If upgrading from v0.4.x: --- -## [Lyra_RAG v0.1.0] 2025-11-07 -### Added -- Initial standalone RAG module for Project Lyra. -- Persistent ChromaDB vector store (`./chromadb`). -- Importer `rag_chat_import.py` with: - - Recursive folder scanning and category tagging. - - Smart chunking (~5 k chars). - - SHA-1 deduplication and chat-ID metadata. - - Timestamp fields (`file_modified`, `imported_at`). - - Background-safe operation (`nohup`/`tmux`). -- 68 Lyra-category chats imported: - - **6 556 new chunks added** - - **1 493 duplicates skipped** - - **7 997 total vectors** now stored. +## [0.4.x] - 2025-11-13 -### API -- `/rag/search` FastAPI endpoint implemented (port 7090). -- Supports natural-language queries and returns top related excerpts. -- Added answer synthesis step using `gpt-4o-mini`. +### Added - Multi-Stage Reasoning Pipeline -### Verified -- Successful recall of Lyra-Core development history (v0.3.0 snapshot). -- Correct metadata and category tagging for all new imports. 
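The multi-backend strategy in the Infrastructure entry above (the root `.env` exposes PRIMARY, SECONDARY, CLOUD, and FALLBACK options and each service picks the ones it actually uses) can be illustrated with a small sketch. This is not code from the repo; the variable names mirror ones referenced elsewhere in this changelog, and the helper itself is hypothetical.

```python
# Hypothetical sketch of the "options vs. use" pattern described above:
# the root .env defines every backend, each service picks one by role name.
import os

BACKEND_ENV_KEYS = {
    "primary": "LLM_PRIMARY_URL",      # vLLM (MI50)
    "secondary": "LLM_SECONDARY_URL",  # Ollama (3090)
    "cloud": "LLM_CLOUD_URL",          # OpenAI
    "fallback": "LLM_FALLBACK_URL",    # llama.cpp CPU (assumed variable name)
}

def pick_backend(role: str) -> str:
    """Return the full endpoint URL for the backend role a service chooses."""
    key = BACKEND_ENV_KEYS[role]
    url = os.getenv(key)
    if not url:
        raise RuntimeError(f"{key} is not set in the shared root .env")
    return url

# Example: Cortex and Intake would pick "primary" (vLLM), NeoMem "secondary" (Ollama).
if __name__ == "__main__":
    os.environ.setdefault("LLM_PRIMARY_URL", "http://10.0.0.43:8000/v1/completions")
    print(pick_backend("primary"))
```
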
+**Cortex v0.5 - Complete architectural overhaul** -### Next Planned -- Optional `where` filter parameter for category/date queries. -- Graceful β€œno results” handler for empty retrievals. -- `rag_docs_import.py` for PDFs and other document types. +- **New `reasoning.py` module** + - Async reasoning engine + - Accepts user prompt, identity, RAG block, and reflection notes + - Produces draft internal answers + - Uses primary backend (vLLM) -## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 +- **New `reflection.py` module** + - Fully async meta-awareness layer + - Produces actionable JSON "internal notes" + - Enforces strict JSON schema and fallback parsing + - Forces cloud backend (`backend_override="cloud"`) -### Added -- ** New UI ** - - Cleaned up UI look and feel. - -- ** Added "sessions" ** - - Now sessions persist over time. - - Ability to create new sessions or load sessions from a previous instance. - - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). - - Relay is correctly wired in. +- **Integrated `refine.py` into pipeline** + - New stage between reflection and persona + - Runs exclusively on primary vLLM backend (MI50) + - Produces final, internally consistent output for downstream persona layer -## [Lyra-Core 0.3.1] - 2025-10-09 +- **Backend override system** + - Each LLM call can now select its own backend + - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary -### Added -- **NVGRAM Integration (Full Pipeline Reconnected)** - - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). - - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. - - Added `.env` variable: - ``` - NVGRAM_API=http://nvgram-api:7077 - ``` - - Verified end-to-end Lyra conversation persistence: - - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` - - βœ… Memories stored, retrieved, and re-injected successfully. +- **Identity loader** + - Added `identity.py` with `load_identity()` for consistent persona retrieval -### Changed -- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. -- Updated Docker Compose service dependency order: - - `relay` now depends on `nvgram-api` healthcheck. - - Removed `mem0` references and volumes. -- Minor cleanup to Persona fetch block (null-checks and safer default persona string). +- **Ingest handler** + - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline + +**Cortex v0.4.1 - RAG Integration** + +- **RAG integration** + - Added `rag.py` with `query_rag()` and `format_rag_block()` + - Cortex now queries local RAG API (`http://10.0.0.41:7090/rag/search`) + - Synthesized answers and top excerpts injected into reasoning prompt + +### Changed - Unified LLM Architecture + +**Cortex v0.5** + +- **Unified LLM backend URL handling across Cortex** + - ENV variables must now contain FULL API endpoints + - Removed all internal path-appending (e.g. 
`.../v1/completions`) + - `llm_router.py` rewritten to use env-provided URLs as-is + - Ensures consistent behavior between draft, reflection, refine, and persona + +- **Rebuilt `main.py`** + - Removed old annotation/analysis logic + - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes + - Routes now clean and minimal (`/reason`, `/ingest`, `/health`) + - Async path throughout Cortex + +- **Refactored `llm_router.py`** + - Removed old fallback logic during overrides + - OpenAI requests now use `/v1/chat/completions` + - Added proper OpenAI Authorization headers + - Distinct payload format for vLLM vs OpenAI + - Unified, correct parsing across models + +- **Simplified Cortex architecture** + - Removed deprecated "context.py" and old reasoning code + - Relay completely decoupled from smart behavior + +- **Updated environment specification** + - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions` + - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama) + - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions` + +**Cortex v0.4.1** + +- **Revised `/reason` endpoint** + - Now builds unified context blocks: [Intake] β†’ recent summaries, [RAG] β†’ contextual knowledge, [User Message] β†’ current input + - Calls `call_llm()` for first pass, then `reflection_loop()` for meta-evaluation + - Returns `cortex_prompt`, `draft_output`, `final_output`, and normalized reflection + +- **Reflection Pipeline Stability** + - Cleaned parsing to normalize JSON vs. text reflections + - Added fallback handling for malformed or non-JSON outputs + - Log system improved to show raw JSON, extracted fields, and normalized summary + +- **Async Summarization (Intake v0.2.1)** + - Intake summaries now run in background threads to avoid blocking Cortex + - Summaries (L1–L∞) logged asynchronously with [BG] tags + +- **Environment & Networking Fixes** + - Verified `.env` variables propagate correctly inside Cortex container + - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG + - Adjusted localhost calls to service-IP mapping + +- **Behavioral Updates** + - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers) + - RAG context successfully grounds reasoning outputs + - Intake and NeoMem confirmed receiving summaries via `/add_exchange` + - Log clarity pass: all reflective and contextual blocks clearly labeled ### Fixed -- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. -- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. -- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). -### Goals / Next Steps -- Add salience visualization (e.g., memory weights displayed in injected system message). -- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. -- Add relay auto-retry for transient 500 responses from NVGRAM. 
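A minimal sketch of the full-URL dispatch behaviour described in the Changed entries above: environment variables carry complete endpoints, a vLLM-style `/v1/completions` URL gets a prompt payload, and an OpenAI-style `/v1/chat/completions` URL gets a `messages[]` payload plus an Authorization header. This illustrates the pattern only; it is not the actual `llm_router.py`, and the function signature is assumed.

```python
# Hypothetical sketch of full-URL backend dispatch (not the real llm_router.py).
import os
import httpx

async def call_llm(prompt: str, backend_override: str = "primary") -> str:
    # Env vars are used as-is: they must already contain the full endpoint path.
    url = os.environ[f"LLM_{backend_override.upper()}_URL"]
    model = os.environ.get(f"LLM_{backend_override.upper()}_MODEL", "")
    headers = {}

    if "chat/completions" in url:
        # OpenAI-style chat endpoint: messages[] payload + Authorization header.
        headers["Authorization"] = f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"
        payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    else:
        # vLLM-style completions endpoint: plain prompt payload.
        payload = {"model": model, "prompt": prompt, "max_tokens": 512}

    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(url, json=payload, headers=headers)
        resp.raise_for_status()
        data = resp.json()

    choice = data["choices"][0]
    # Chat endpoints return message.content, completion endpoints return text.
    return choice.get("message", {}).get("content") or choice.get("text", "")
```
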
+**Cortex v0.5** + +- Resolved endpoint conflict where router expected base URLs and refine expected full URLs + - Fixed by standardizing full-URL behavior across entire system +- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax) +- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints +- No more double-routing through vLLM during reflection +- Corrected async/sync mismatch in multiple locations +- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic + +### Removed + +**Cortex v0.5** + +- Legacy `annotate`, `reason_check` glue logic from old architecture +- Old backend probing junk code +- Stale imports and unused modules leftover from previous prototype + +### Verified + +**Cortex v0.5** + +- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly +- Refine shows `used_primary_backend: true` and no fallback +- Manual curl test confirms endpoint accuracy + +### Known Issues + +**Cortex v0.5** + +- Refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this +- Hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned) + +**Cortex v0.4.1** + +- NeoMem tuning needed - improve retrieval latency and relevance +- Need dedicated `/reflections/recent` endpoint for Cortex +- Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem) +- Add persistent reflection recall (use prior reflections as meta-context) +- Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields) +- Tighten temperature and prompt control for factual consistency +- RAG optimization: add source ranking, filtering, multi-vector hybrid search +- Cache RAG responses per session to reduce duplicate calls + +### Notes + +**Cortex v0.5** + +This is the largest structural change to Cortex so far. It establishes: +- Multi-model cognition +- Clean layering +- Identity + reflection separation +- Correct async code +- Deterministic backend routing +- Predictable JSON reflection + +The system is now ready for: +- Refinement loops +- Persona-speaking layer +- Containerized RAG +- Long-term memory integration +- True emergent-behavior experiments --- -## [Lyra-Core] v0.3.1 - 2025-09-27 -### Changed -- Removed salience filter logic; Cortex is now the default annotator. -- All user messages stored in Mem0; no discard tier applied. +## [0.3.x] - 2025-10-28 to 2025-09-26 ### Added -- Cortex annotations (`metadata.cortex`) now attached to memories. 
-- Debug logging improvements: + +**[Lyra Core v0.3.2 + Web UI v0.2.0] - 2025-10-28** + +- **New UI** + - Cleaned up UI look and feel + +- **Sessions** + - Sessions now persist over time + - Ability to create new sessions or load sessions from previous instance + - When changing session, updates what the prompt sends to relay (doesn't prompt with messages from other sessions) + - Relay correctly wired in + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- **NVGRAM Integration (Full Pipeline Reconnected)** + - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077) + - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search` + - Added `.env` variable: `NVGRAM_API=http://nvgram-api:7077` + - Verified end-to-end Lyra conversation persistence: `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` + - βœ… Memories stored, retrieved, and re-injected successfully + +**[Lyra-Core v0.3.0] - 2025-09-26** + +- **Salience filtering** in Relay + - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL` + - Supports `heuristic` and `llm` classification modes + - LLM-based salience filter integrated with Cortex VM running `llama-server` +- Logging improvements + - Added debug logs for salience mode, raw LLM output, and unexpected outputs + - Fail-closed behavior for unexpected LLM responses +- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers +- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply + +**[Cortex v0.3.0] - 2025-10-31** + +- **Cortex Service (FastAPI)** + - New standalone reasoning engine (`cortex/main.py`) with endpoints: + - `GET /health` – reports active backend + NeoMem status + - `POST /reason` – evaluates `{prompt, response}` pairs + - `POST /annotate` – experimental text analysis + - Background NeoMem health monitor (5-minute interval) + +- **Multi-Backend Reasoning Support** + - Environment-driven backend selection via `LLM_FORCE_BACKEND` + - Supports: Primary (vLLM MI50), Secondary (Ollama 3090), Cloud (OpenAI), Fallback (llama.cpp CPU) + - Per-backend model variables: `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL` + +- **Response Normalization Layer** + - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON + - Handles Ollama's multi-line streaming and Mythomax's missing punctuation issues + - Prints concise debug previews of merged content + +- **Environment Simplification** + - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file + - Removed reliance on shared/global env file to prevent cross-contamination + - Verified Docker Compose networking across containers + +**[NeoMem 0.1.2] - 2025-10-27** (formerly NVGRAM) + +- **Renamed NVGRAM to NeoMem** + - All future updates under name NeoMem + - Features unchanged + +**[NVGRAM 0.1.1] - 2025-10-08** + +- **Async Memory Rewrite (Stability + Safety Patch)** + - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes + - Added input sanitation to prevent embedding errors (`'list' object has no attribute 'replace'`) + - Implemented `flatten_messages()` helper in API layer to clean malformed payloads + - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware) + - Health endpoint (`/health`) returns structured JSON `{status, version, service}` + - Startup logs include sanitized 
embedder config with masked API keys + +**[NVGRAM 0.1.0] - 2025-10-07** + +- **Initial fork of Mem0 β†’ NVGRAM** + - Created fully independent local-first memory engine based on Mem0 OSS + - Renamed all internal modules, Docker services, environment variables from `mem0` β†’ `nvgram` + - New service name: `nvgram-api`, default port 7077 + - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility + - Uses FastAPI, Postgres, and Neo4j as persistent backends + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- **Ollama LLM reasoning** alongside OpenAI embeddings + - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090` + - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M` + - Split processing: Embeddings β†’ OpenAI `text-embedding-3-small`, LLM β†’ Local Ollama +- Added `.env.3090` template for self-hosted inference nodes +- Integrated runtime diagnostics and seeder progress tracking + - File-level + message-level progress bars + - Retry/back-off logic for timeouts (3 attempts) + - Event logging (`ADD / UPDATE / NONE`) for every memory record +- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers +- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090) + +**[Lyra-Mem0 0.3.1] - 2025-10-03** + +- HuggingFace TEI integration (local 3090 embedder) +- Dual-mode environment switch between OpenAI cloud and local +- CSV export of memories from Postgres (`payload->>'data'`) + +**[Lyra-Mem0 0.3.0]** + +- **Ollama embeddings** in Mem0 OSS container + - Configure `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, `OLLAMA_HOST` via `.env` + - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG` + - Installed `ollama` Python client into custom API container image +- `.env.3090` file for external embedding mode (3090 machine) +- Workflow for multiple embedding modes: LAN-based 3090/Ollama, Local-only CPU, OpenAI fallback + +**[Lyra-Mem0 v0.2.1]** + +- **Seeding pipeline** + - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0 + - Implemented incremental seeding option (skip existing memories, only add new ones) + - Verified insert process with Postgres-backed history DB + +**[Intake v0.1.0] - 2025-10-27** + +- Receives messages from relay and summarizes them in cascading format +- Continues to summarize smaller amounts of exchanges while generating large-scale conversational summaries (L20) +- Currently logs summaries to .log file in `/project-lyra/intake-logs/` + +**[Lyra-Cortex v0.2.0] - 2025-09-26** + +- Integrated **llama-server** on dedicated Cortex VM (Proxmox) +- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs +- Benchmarked Phi-3.5-mini performance: ~18 tokens/sec CPU-only on Ryzen 7 7800X +- Salience classification functional but sometimes inconsistent +- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier + - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval) + - More responsive but over-classifies messages as "salient" +- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models + +### Changed + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs +- Updated Docker Compose service dependency order + - `relay` now depends on `nvgram-api` healthcheck + - Removed `mem0` references and volumes +- Minor cleanup to Persona fetch block (null-checks and safer default persona string) + 
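The NVGRAM 0.1.1 entry earlier in this section describes a `flatten_messages()` helper that sanitizes payloads before embedding (the fix for `'list' object has no attribute 'replace'`). A rough sketch of that kind of sanitation, with the caveat that the real helper may differ:

```python
# Rough sketch of input sanitation along the lines of flatten_messages()
# (illustrative only; the actual NVGRAM helper may differ).
from typing import Any, Dict, List

def flatten_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Coerce every message 'content' field into a plain string so the
    embedder never receives a list or dict where it expects text."""
    flattened = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # e.g. content split into parts: join their text pieces
            content = " ".join(
                part.get("text", "") if isinstance(part, dict) else str(part)
                for part in content
            )
        elif not isinstance(content, str):
            content = str(content)
        flattened.append({"role": str(msg.get("role", "user")), "content": content.strip()})
    return flattened

# Example: a list-valued content no longer crashes the embedder call.
print(flatten_messages([{"role": "user", "content": [{"text": "hello"}, "world"]}]))
```
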
+**[Lyra-Core v0.3.1] - 2025-09-27** + +- Removed salience filter logic; Cortex is now default annotator +- All user messages stored in Mem0; no discard tier applied +- Cortex annotations (`metadata.cortex`) now attached to memories +- Debug logging improvements - Pretty-print Cortex annotations - Injected prompt preview - Memory search hit list with scores -- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. +- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed + +**[Lyra-Core v0.3.0] - 2025-09-26** + +- Refactored `server.js` to gate `mem.add()` calls behind salience filter +- Updated `.env` to support `SALIENCE_MODEL` + +**[Cortex v0.3.0] - 2025-10-31** + +- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend +- Enhanced startup logs to announce active backend, model, URL, and mode +- Improved error handling with clearer "Reasoning error" messages + +**[NVGRAM 0.1.1] - 2025-10-08** + +- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes +- Normalized indentation and cleaned duplicate `main.py` references +- Removed redundant `FastAPI()` app reinitialization +- Updated internal logging to INFO-level timing format +- Deprecated `@app.on_event("startup")` β†’ will migrate to `lifespan` handler in v0.1.2 + +**[NVGRAM 0.1.0] - 2025-10-07** + +- Removed dependency on external `mem0ai` SDK β€” all logic now local +- Re-pinned requirements: fastapi==0.115.8, uvicorn==0.34.0, pydantic==2.10.4, python-dotenv==1.0.1, psycopg>=3.2.8, ollama +- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- Updated `main.py` configuration block to load `LLM_PROVIDER`, `LLM_MODEL`, `OLLAMA_BASE_URL` + - Fallback to OpenAI if Ollama unavailable +- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py` +- Normalized `.env` loading so `mem0-api` and host environment share identical values +- Improved seeder logging and progress telemetry +- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` + +**[Lyra-Mem0 0.3.0]** + +- `docker-compose.yml` updated to mount local `main.py` and `.env.3090` +- Built custom Dockerfile (`mem0-api-server:latest`) extending base image with `pip install ollama` +- Updated `requirements.txt` to include `ollama` package +- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` +- Tested new embeddings path with curl `/memories` API call + +**[Lyra-Mem0 v0.2.1]** + +- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends +- Mounted host `main.py` into container so local edits persist across rebuilds +- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles +- Built custom Dockerfile (`mem0-api-server:latest`) including `pip install ollama` +- Updated `requirements.txt` with `ollama` dependency +- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP) +- Added logging to confirm model pulls and embedding requests ### Fixed -- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. -- Relay no longer β€œhangs” on malformed Cortex outputs. ---- +**[Lyra-Core 0.3.1] - 2025-10-09** -### [Lyra-Core] v0.3.0 β€” 2025-09-26 -#### Added -- Implemented **salience filtering** in Relay: - - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. 
- - Supports `heuristic` and `llm` classification modes. - - LLM-based salience filter integrated with Cortex VM running `llama-server`. -- Logging improvements: - - Added debug logs for salience mode, raw LLM output, and unexpected outputs. - - Fail-closed behavior for unexpected LLM responses. -- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. -- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. +- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling +- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500` +- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON) -#### Changed -- Refactored `server.js` to gate `mem.add()` calls behind salience filter. -- Updated `.env` to support `SALIENCE_MODEL`. +**[Lyra-Core v0.3.1] - 2025-09-27** -#### Known Issues -- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". -- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). -- CPU-only inference is functional but limited; larger models recommended once GPU is available. +- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner +- Relay no longer "hangs" on malformed Cortex outputs ---- +**[Cortex v0.3.0] - 2025-10-31** -### [Lyra-Core] v0.2.0 β€” 2025-09-24 -#### Added -- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. -- Implemented `sessionId` support (client-supplied, fallback to `default`). -- Added debug logs for memory add/search. -- Cleaned up Relay structure for clarity. +- Corrected broken vLLM endpoint routing (`/v1/completions`) +- Stabilized cross-container health reporting for NeoMem +- Resolved JSON parse failures caused by streaming chunk delimiters ---- +**[NVGRAM 0.1.1] - 2025-10-08** -### [Lyra-Core] v0.1.0 β€” 2025-09-23 -#### Added -- First working MVP of **Lyra Core Relay**. -- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). -- Memory integration with Mem0: - - `POST /memories` on each user message. - - `POST /search` before LLM call. -- Persona Sidecar integration (`GET /current`). -- OpenAI GPT + Ollama (Mythomax) support in Relay. -- Simple browser-based chat UI (talks to Relay at `http://:7078`). -- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. -- Working Neo4j + Postgres backing stores for Mem0. -- Initial MVP relay service with raw fetch calls to Mem0. -- Dockerized with basic healthcheck. +- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content +- Masked API key leaks from boot logs +- Ensured Neo4j reconnects gracefully on first retry -#### Fixed -- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). -- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. +**[Lyra-Mem0 0.3.2] - 2025-10-05** -#### Known Issues -- No feedback loop (thumbs up/down) yet. -- Forget/delete flow is manual (via memory IDs). -- Memory latency ~1–4s depending on embedding model. 
+- Resolved crash during startup: `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'` +- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors +- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests +- "Unknown event" warnings now safely ignored (no longer break seeding loop) +- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`) ---- +**[Lyra-Mem0 0.3.1] - 2025-10-03** -## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## +- `.env` CRLF vs LF line ending issues +- Local seeding now possible via HuggingFace server -## [NeoMem 0.1.2] - 2025-10-27 -### Changed -- **Renamed NVGRAM to neomem** - - All future updates will be under the name NeoMem. - - Features have not changed. +**[Lyra-Mem0 0.3.0]** -## [NVGRAM 0.1.1] - 2025-10-08 -### Added -- **Async Memory Rewrite (Stability + Safety Patch)** - - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. - - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). - - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. - - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). - - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. - - Startup logs now include **sanitized embedder config** with API keys masked for safety: - ``` - >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} - βœ… Connected to Neo4j on attempt 1 - 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized - ``` +- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`) +- Fixed config overwrite issue where rebuilding container restored stock `main.py` +- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes -### Changed -- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. -- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. -- Removed redundant `FastAPI()` app reinitialization. -- Updated internal logging to INFO-level timing format: - 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) -- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. +**[Lyra-Mem0 v0.2.1]** -### Fixed -- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. -- Masked API key leaks from boot logs. -- Ensured Neo4j reconnects gracefully on first retry. - -### Goals / Next Steps -- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. -- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. -- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. - ---- - -## [NVGRAM 0.1.0] - 2025-10-07 -### Added -- **Initial fork of Mem0 β†’ NVGRAM**: - - Created a fully independent local-first memory engine based on Mem0 OSS. - - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. - - New service name: **`nvgram-api`**, default port **7077**. 
- - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. - - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. - - Verified clean startup: - ``` - βœ… Connected to Neo4j on attempt 1 - INFO: Uvicorn running on http://0.0.0.0:7077 - ``` - - `/docs` and `/openapi.json` confirmed reachable and functional. - -### Changed -- Removed dependency on the external `mem0ai` SDK β€” all logic now local. -- Re-pinned requirements: - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama -- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. - -### Goals / Next Steps -- Integrate NVGRAM as the new default backend in Lyra Relay. -- Deprecate remaining Mem0 references and archive old configs. -- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). - ---- - -## [Lyra-Mem0 0.3.2] - 2025-10-05 -### Added -- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: - - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. - - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. - - Split processing pipeline: - - Embeddings β†’ OpenAI `text-embedding-3-small` - - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). -- Added `.env.3090` template for self-hosted inference nodes. -- Integrated runtime diagnostics and seeder progress tracking: - - File-level + message-level progress bars. - - Retry/back-off logic for timeouts (3 attempts). - - Event logging (`ADD / UPDATE / NONE`) for every memory record. -- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. -- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). - -### Changed -- Updated `main.py` configuration block to load: - - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. - - Fallback to OpenAI if Ollama unavailable. -- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. -- Normalized `.env` loading so `mem0-api` and host environment share identical values. -- Improved seeder logging and progress telemetry for clearer diagnostics. -- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. - -### Fixed -- Resolved crash during startup: - `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. -- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. -- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. -- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). -- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). - -### Observations -- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. -- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). - ---- - -## [Lyra-Mem0 0.3.1] - 2025-10-03 -### Added -- HuggingFace TEI integration (local 3090 embedder). -- Dual-mode environment switch between OpenAI cloud and local. -- CSV export of memories from Postgres (`payload->>'data'`). - -### Fixed -- `.env` CRLF vs LF line ending issues. 
-- Local seeding now possible via huggingface server running - ---- - -## [Lyra-mem0 0.3.0] -### Added -- Support for **Ollama embeddings** in Mem0 OSS container: - - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. - - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. - - Installed `ollama` Python client into custom API container image. -- `.env.3090` file created for external embedding mode (3090 machine): - - EMBEDDER_PROVIDER=ollama - - EMBEDDER_MODEL=mxbai-embed-large - - OLLAMA_HOST=http://10.0.0.3:11434 -- Workflow to support **multiple embedding modes**: - 1. Fast LAN-based 3090/Ollama embeddings - 2. Local-only CPU embeddings (Lyra Cortex VM) - 3. OpenAI fallback embeddings - -### Changed -- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. -- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. -- Updated `requirements.txt` to include `ollama` package. -- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). -- Tested new embeddings path with curl `/memories` API call. - -### Fixed -- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). -- Fixed config overwrite issue where rebuilding container restored stock `main.py`. -- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. - --- - -## [Lyra-mem0 v0.2.1] - -### Added -- **Seeding pipeline**: - - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. - - Implemented incremental seeding option (skip existing memories, only add new ones). - - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. -- **Ollama embedding support** in Mem0 OSS container: - - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. - - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. - - Set up three embedding modes: - 1. Fast LAN-based 3090/Ollama - 2. Local-only CPU model (Lyra Cortex VM) - 3. OpenAI fallback - -### Changed -- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. -- Mounted host `main.py` into container so local edits persist across rebuilds. -- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. -- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. -- Updated `requirements.txt` with `ollama` dependency. -- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). -- Added logging to confirm model pulls and embedding requests. - -### Fixed -- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. -- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. -- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. -- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. - -### Notes -- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). 
-- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. -- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. - ---- - -## [Lyra-Mem0 v0.2.0] - 2025-09-30 -### Added -- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` - - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. - - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. -- Verified REST API functionality: - - `POST /memories` works for adding memories. - - `POST /search` works for semantic search. -- Successful end-to-end test with persisted memory: - *"Likes coffee in the morning"* β†’ retrievable via search. βœ… - -### Changed -- Split architecture into **modular stacks**: - - `~/lyra-core` (Relay, Persona-Sidecar, etc.) - - `~/lyra-mem0` (Mem0 OSS memory stack) -- Removed old embedded mem0 containers from the Lyra-Core compose file. -- Added Lyra-Mem0 section in README.md. - -### Next Steps -- Wire **Relay β†’ Mem0 API** (integration not yet complete). -- Add integration tests to verify persistence and retrieval from within Lyra-Core. - ---- - -## 🧠 Lyra-Cortex ############################################################################## - -## [ Cortex - v0.5] -2025-11-13 - -### Added -- **New `reasoning.py` module** - - Async reasoning engine. - - Accepts user prompt, identity, RAG block, and reflection notes. - - Produces draft internal answers. - - Uses primary backend (vLLM). -- **New `reflection.py` module** - - Fully async. - - Produces actionable JSON β€œinternal notes.” - - Enforces strict JSON schema and fallback parsing. - - Forces cloud backend (`backend_override="cloud"`). -- Integrated `refine.py` into Cortex reasoning pipeline: - - New stage between reflection and persona. - - Runs exclusively on primary vLLM backend (MI50). - - Produces final, internally consistent output for downstream persona layer. -- **Backend override system** - - Each LLM call can now select its own backend. - - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. - -- **identity loader** - - Added `identity.py` with `load_identity()` for consistent persona retrieval. - -- **ingest_handler** - - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. - -### Changed -- Unified LLM backend URL handling across Cortex: - - ENV variables must now contain FULL API endpoints. - - Removed all internal path-appending (e.g. `.../v1/completions`). - - `llm_router.py` rewritten to use env-provided URLs as-is. - - Ensures consistent behavior between draft, reflection, refine, and persona. -- **Rebuilt `main.py`** - - Removed old annotation/analysis logic. - - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. - - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). - - Async path throughout Cortex. - -- **Refactored `llm_router.py`** - - Removed old fallback logic during overrides. - - OpenAI requests now use `/v1/chat/completions`. - - Added proper OpenAI Authorization headers. - - Distinct payload format for vLLM vs OpenAI. - - Unified, correct parsing across models. - -- **Simplified Cortex architecture** - - Removed deprecated β€œcontext.py” and old reasoning code. - - Relay completely decoupled from smart behavior. - -- Updated environment specification: - - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. 
- - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). - - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. - -### Fixed -- Resolved endpoint conflict where: - - Router expected base URLs. - - Refine expected full URLs. - - Refine always fell back due to hitting incorrect endpoint. - - Fixed by standardizing full-URL behavior across entire system. -- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). -- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. -- No more double-routing through vLLM during reflection. -- Corrected async/sync mismatch in multiple locations. -- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. - -### Removed -- Legacy `annotate`, `reason_check` glue logic from old architecture. -- Old backend probing junk code. -- Stale imports and unused modules leftover from previous prototype. - -### Verified -- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. -- refine shows `used_primary_backend: true` and no fallback. -- Manual curl test confirms endpoint accuracy. +- Seeder process originally failed on old memories β€” now skips duplicates and continues batch +- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image +- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild +- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch ### Known Issues -- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. -- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). -### Pending / Known Issues -- **RAG service does not exist** β€” requires containerized FastAPI service. -- Reasoning layer lacks self-revision loop (deliberate thought cycle). -- No speak/persona generation layer yet (`speak.py` planned). -- Intake summaries not yet routing into RAG or reflection layer. -- No refinement engine between reasoning and speak. +**[Lyra-Core v0.3.0] - 2025-09-26** -### Notes -This is the largest structural change to Cortex so far. -It establishes: -- multi-model cognition -- clean layering -- identity + reflection separation -- correct async code -- deterministic backend routing -- predictable JSON reflection +- Small models (e.g. 
Qwen2-0.5B) tend to over-classify as "salient" +- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi") +- CPU-only inference is functional but limited; larger models recommended once GPU available -The system is now ready for: -- refinement loops -- persona-speaking layer -- containerized RAG -- long-term memory integration -- true emergent-behavior experiments +**[Lyra-Cortex v0.2.0] - 2025-09-26** +- Small models tend to drift or over-classify +- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models +- Need to set up `systemd` service for `llama-server` to auto-start on VM reboot +### Observations + +**[Lyra-Mem0 0.3.2] - 2025-10-05** + +- Stable GPU utilization: ~8 GB VRAM @ 92% load, β‰ˆ 67Β°C under sustained seeding +- Next revision will re-format seed JSON to preserve `role` context (user vs assistant) + +**[Lyra-Mem0 v0.2.1]** + +- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI's schema) +- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors +- Seeder workflow validated but should be wrapped in repeatable weekly run for full Cloudβ†’Local sync + +### Next Steps + +**[Lyra-Core 0.3.1] - 2025-10-09** + +- Add salience visualization (e.g., memory weights displayed in injected system message) +- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring +- Add relay auto-retry for transient 500 responses from NVGRAM + +**[NVGRAM 0.1.1] - 2025-10-08** + +- Integrate salience scoring and embedding confidence weight fields in Postgres schema +- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall +- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2 + +**[NVGRAM 0.1.0] - 2025-10-07** + +- Integrate NVGRAM as new default backend in Lyra Relay +- Deprecate remaining Mem0 references and archive old configs +- Begin versioning as standalone project (`nvgram-core`, `nvgram-api`, etc.) + +**[Intake v0.1.0] - 2025-10-27** + +- Feed intake into NeoMem +- Generate daily/hourly overall summary (IE: Today Brian and Lyra worked on x, y, and z) +- Generate session-aware summaries with own intake hopper + +--- + +## [0.2.x] - 2025-09-30 to 2025-09-24 -## [ Cortex - v0.4.1] - 2025-11-5 ### Added -- **RAG intergration** - - Added rag.py with query_rag() and format_rag_block(). - - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. - - Synthesized answers and top excerpts are injected into the reasoning prompt. -### Changed ### -- **Revised /reason endpoint.** - - Now builds unified context blocks: - - [Intake] β†’ recent summaries - - [RAG] β†’ contextual knowledge - - [User Message] β†’ current input - - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. - - Returns cortex_prompt, draft_output, final_output, and normalized reflection. -- **Reflection Pipeline Stability** - - Cleaned parsing to normalize JSON vs. text reflections. - - Added fallback handling for malformed or non-JSON outputs. - - Log system improved to show raw JSON, extracted fields, and normalized summary. -- **Async Summarization (Intake v0.2.1)** - - Intake summaries now run in background threads to avoid blocking Cortex. - - Summaries (L1–L∞) logged asynchronously with [BG] tags. -- **Environment & Networking Fixes** - - Verified .env variables propagate correctly inside the Cortex container. 
- - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). - - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). - -- **Behavioral Updates** - - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). - - RAG context successfully grounds reasoning outputs. - - Intake and NeoMem confirmed receiving summaries via /add_exchange. - - Log clarity pass: all reflective and contextual blocks clearly labeled. -- **Known Gaps / Next Steps** - - NeoMem Tuning - - Improve retrieval latency and relevance. - - Implement a dedicated /reflections/recent endpoint for Cortex. - - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). -- **Cortex Enhancements** - - Add persistent reflection recall (use prior reflections as meta-context). - - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). - - Tighten temperature and prompt control for factual consistency. -- **RAG Optimization** - -Add source ranking, filtering, and multi-vector hybrid search. - -Cache RAG responses per session to reduce duplicate calls. -- **Documentation / Monitoring** - -Add health route for RAG and Intake summaries. - -Include internal latency metrics in /health endpoint. +**[Lyra-Mem0 v0.2.0] - 2025-09-30** -Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. +- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` + - Includes Postgres (pgvector), Qdrant, Neo4j, and SQLite for history tracking + - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building Mem0 API server +- Verified REST API functionality + - `POST /memories` works for adding memories + - `POST /search` works for semantic search +- Successful end-to-end test with persisted memory: *"Likes coffee in the morning"* β†’ retrievable via search βœ… -## [Cortex - v0.3.0] – 2025-10-31 -### Added -- **Cortex Service (FastAPI)** - - New standalone reasoning engine (`cortex/main.py`) with endpoints: - - `GET /health` – reports active backend + NeoMem status. - - `POST /reason` – evaluates `{prompt, response}` pairs. - - `POST /annotate` – experimental text analysis. - - Background NeoMem health monitor (5-minute interval). +**[Lyra-Core v0.2.0] - 2025-09-24** -- **Multi-Backend Reasoning Support** - - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. - - Supports: - - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) - - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) - - **Cloud** β†’ OpenAI API - - **Fallback** β†’ llama.cpp (CPU) - - Introduced per-backend model variables: - `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. - -- **Response Normalization Layer** - - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. - - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. - - Prints concise debug previews of merged content. - -- **Environment Simplification** - - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. - - Removed reliance on shared/global env file to prevent cross-contamination. - - Verified Docker Compose networking across containers. 
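The Response Normalization Layer above (`normalize_llm_response()`) merges Ollama's line-delimited streaming output into a single string. A simplified sketch of that merging step, assuming each streamed line is a small JSON object with a `response` field; the real function also repairs malformed JSON and handles other backends, which is omitted here.

```python
# Simplified sketch of merging Ollama's newline-delimited streaming output
# into one response string (the real normalize_llm_response() does more).
import json

def normalize_llm_response(raw: str) -> str:
    merged = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            # Keep unparseable fragments rather than dropping content.
            merged.append(line)
            continue
        merged.append(chunk.get("response", ""))
    return "".join(merged)

# Example with two streamed chunks followed by the terminal "done" record.
raw = '{"response": "Hello"}\n{"response": " world"}\n{"done": true}'
print(normalize_llm_response(raw))  # -> "Hello world"
```
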
+- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls +- Implemented `sessionId` support (client-supplied, fallback to `default`) +- Added debug logs for memory add/search +- Cleaned up Relay structure for clarity ### Changed -- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. -- Enhanced startup logs to announce active backend, model, URL, and mode. -- Improved error handling with clearer β€œReasoning error” messages. + +**[Lyra-Mem0 v0.2.0] - 2025-09-30** + +- Split architecture into modular stacks: + - `~/lyra-core` (Relay, Persona-Sidecar, etc.) + - `~/lyra-mem0` (Mem0 OSS memory stack) +- Removed old embedded mem0 containers from Lyra-Core compose file +- Added Lyra-Mem0 section in README.md + +### Next Steps + +**[Lyra-Mem0 v0.2.0] - 2025-09-30** + +- Wire **Relay β†’ Mem0 API** (integration not yet complete) +- Add integration tests to verify persistence and retrieval from within Lyra-Core + +--- + +## [0.1.x] - 2025-09-25 to 2025-09-23 + +### Added + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Initial standalone RAG module for Project Lyra +- Persistent ChromaDB vector store (`./chromadb`) +- Importer `rag_chat_import.py` with: + - Recursive folder scanning and category tagging + - Smart chunking (~5k chars) + - SHA-1 deduplication and chat-ID metadata + - Timestamp fields (`file_modified`, `imported_at`) + - Background-safe operation (`nohup`/`tmux`) +- 68 Lyra-category chats imported: + - 6,556 new chunks added + - 1,493 duplicates skipped + - 7,997 total vectors stored + +**[Lyra_RAG v0.1.0 API] - 2025-11-07** + +- `/rag/search` FastAPI endpoint implemented (port 7090) +- Supports natural-language queries and returns top related excerpts +- Added answer synthesis step using `gpt-4o-mini` + +**[Lyra-Core v0.1.0] - 2025-09-23** + +- First working MVP of **Lyra Core Relay** +- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible) +- Memory integration with Mem0: + - `POST /memories` on each user message + - `POST /search` before LLM call +- Persona Sidecar integration (`GET /current`) +- OpenAI GPT + Ollama (Mythomax) support in Relay +- Simple browser-based chat UI (talks to Relay at `http://:7078`) +- `.env` standardization for Relay + Mem0 + Postgres + Neo4j +- Working Neo4j + Postgres backing stores for Mem0 +- Initial MVP relay service with raw fetch calls to Mem0 +- Dockerized with basic healthcheck + +**[Lyra-Cortex v0.1.0] - 2025-09-25** + +- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD) +- Built **llama.cpp** with `llama-server` target via CMake +- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model +- Verified API compatibility at `/v1/chat/completions` +- Local test successful via `curl` β†’ ~523 token response generated +- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X) +- Confirmed usable for salience scoring, summarization, and lightweight reasoning ### Fixed -- Corrected broken vLLM endpoint routing (`/v1/completions`). -- Stabilized cross-container health reporting for NeoMem. -- Resolved JSON parse failures caused by streaming chunk delimiters. 
+ +**[Lyra-Core v0.1.0] - 2025-09-23** + +- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only) +- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env` + +### Verified + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Successful recall of Lyra-Core development history (v0.3.0 snapshot) +- Correct metadata and category tagging for all new imports + +### Known Issues + +**[Lyra-Core v0.1.0] - 2025-09-23** + +- No feedback loop (thumbs up/down) yet +- Forget/delete flow is manual (via memory IDs) +- Memory latency ~1–4s depending on embedding model + +### Next Planned + +**[Lyra_RAG v0.1.0] - 2025-11-07** + +- Optional `where` filter parameter for category/date queries +- Graceful "no results" handler for empty retrievals +- `rag_docs_import.py` for PDFs and other document types --- - -## Next Planned – [v0.4.0] -### Planned Additions -- **Reflection Mode** - - Introduce `REASONING_MODE=factcheck|reflection`. - - Output schema: - ```json - { "insight": "...", "evaluation": "...", "next_action": "..." } - ``` - -- **Cortex-First Pipeline** - - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. - - Allows Lyra to β€œthink before speaking.” - -- **Verifier Stub** - - New `/verify` endpoint for search-based factual grounding. - - Asynchronous external truth checking. - -- **Memory Integration** - - Feed reflective outputs into NeoMem. - - Enable β€œdream” cycles for autonomous self-review. - ---- - -**Status:** 🟒 Stable Core – Multi-backend reasoning operational. -**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.* - ---- - -### [Intake] v0.1.0 - 2025-10-27 - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Currently logs summaries to .log file in /project-lyra/intake-logs/ - ** Next Steps ** - - Feed intake into neomem. - - Generate a daily/hourly/etc overall summary, (IE: Today Brian and Lyra worked on x, y, and z) - - Generate session aware summaries, with its own intake hopper. - - -### [Lyra-Cortex] v0.2.0 β€” 2025-09-26 -**Added -- Integrated **llama-server** on dedicated Cortex VM (Proxmox). -- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs. -- Benchmarked Phi-3.5-mini performance: - - ~18 tokens/sec CPU-only on Ryzen 7 7800X. - - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming"). -- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier: - - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval). - - More responsive but over-classifies messages as β€œsalient.” -- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models. - -** Known Issues -- Small models tend to drift or over-classify. -- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models. -- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot. - ---- - -### [Lyra-Cortex] v0.1.0 β€” 2025-09-25 -#### Added -- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD). -- Built **llama.cpp** with `llama-server` target via CMake. -- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model. -- Verified **API compatibility** at `/v1/chat/completions`. -- Local test successful via `curl` β†’ ~523 token response generated. -- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X). 
-- Confirmed usable for salience scoring, summarization, and lightweight reasoning. diff --git a/cortex/Dockerfile b/cortex/Dockerfile index 784f720..77cd233 100644 --- a/cortex/Dockerfile +++ b/cortex/Dockerfile @@ -4,4 +4,6 @@ COPY requirements.txt . RUN pip install -r requirements.txt COPY . . EXPOSE 7081 +# NOTE: Running with single worker to maintain SESSIONS global state in Intake. +# If scaling to multiple workers, migrate SESSIONS to Redis or shared storage. CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7081"] diff --git a/cortex/context.py b/cortex/context.py index aff3327..341946d 100644 --- a/cortex/context.py +++ b/cortex/context.py @@ -84,6 +84,7 @@ def _init_session(session_id: str) -> Dict[str, Any]: "mood": "neutral", # Future: mood tracking "active_project": None, # Future: project context "message_count": 0, + "message_history": [], } @@ -275,6 +276,13 @@ async def collect_context(session_id: str, user_prompt: str) -> Dict[str, Any]: state["last_user_message"] = user_prompt state["last_timestamp"] = now state["message_count"] += 1 + # Save user turn to history + state["message_history"].append({ + "user": user_prompt, + "assistant": "" # assistant reply filled later by update_last_assistant_message() + }) + + # F. Assemble unified context context_state = { @@ -311,20 +319,27 @@ async def collect_context(session_id: str, user_prompt: str) -> Dict[str, Any]: # ----------------------------- def update_last_assistant_message(session_id: str, message: str) -> None: """ - Update session state with assistant's response. - - Called by router.py after persona layer completes. - - Args: - session_id: Session identifier - message: Assistant's final response text + Update session state with assistant's response and complete + the last turn inside message_history. """ - if session_id in SESSION_STATE: - SESSION_STATE[session_id]["last_assistant_message"] = message - SESSION_STATE[session_id]["last_timestamp"] = datetime.now() - logger.debug(f"Updated assistant message for session {session_id}") - else: + session = SESSION_STATE.get(session_id) + if not session: logger.warning(f"Attempted to update non-existent session: {session_id}") + return + + # Update last assistant message + timestamp + session["last_assistant_message"] = message + session["last_timestamp"] = datetime.now() + + # Fill in assistant reply for the most recent turn + history = session.get("message_history", []) + if history: + # history entry already contains {"user": "...", "assistant": "...?"} + history[-1]["assistant"] = message + + if VERBOSE_DEBUG: + logger.debug(f"Updated assistant message for session {session_id}") + def get_session_state(session_id: str) -> Optional[Dict[str, Any]]: diff --git a/cortex/intake/__init__.py b/cortex/intake/__init__.py new file mode 100644 index 0000000..c967d4a --- /dev/null +++ b/cortex/intake/__init__.py @@ -0,0 +1,18 @@ +""" +Intake module - short-term memory summarization. + +Runs inside the Cortex container as a pure Python module. +No standalone API server - called internally by Cortex. 
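+
+Illustrative usage (a minimal sketch; "dev" and the message strings are
+placeholder values, but the exchange keys mirror what router.py's /ingest
+endpoint passes in):
+
+    from intake import SESSIONS, add_exchange_internal, summarize_context
+
+    add_exchange_internal({
+        "session_id": "dev",
+        "user_msg": "hello",
+        "assistant_msg": "hi there",
+    })
+    len(SESSIONS["dev"]["buffer"])  # -> 1 in a fresh session
+
+    # Later, during /reason, the buffered exchanges can be summarized:
+    #   summaries = await summarize_context("dev", list(SESSIONS["dev"]["buffer"]))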
+""" + +from .intake import ( + SESSIONS, + add_exchange_internal, + summarize_context, +) + +__all__ = [ + "SESSIONS", + "add_exchange_internal", + "summarize_context", +] diff --git a/cortex/intake/intake.py b/cortex/intake/intake.py index 897acf8..50b192d 100644 --- a/cortex/intake/intake.py +++ b/cortex/intake/intake.py @@ -1,18 +1,29 @@ import os +import json from datetime import datetime from typing import List, Dict, Any, TYPE_CHECKING from collections import deque +from llm.llm_router import call_llm +# ------------------------------------------------------------------- +# Global Short-Term Memory (new Intake) +# ------------------------------------------------------------------- +SESSIONS: dict[str, dict] = {} # session_id β†’ { buffer: deque, created_at: timestamp } + +# Diagnostic: Verify module loads only once +print(f"[Intake Module Init] SESSIONS object id: {id(SESSIONS)}, module: {__name__}") + +# L10 / L20 history lives here too +L10_HISTORY: Dict[str, list[str]] = {} +L20_HISTORY: Dict[str, list[str]] = {} + +from llm.llm_router import call_llm # Use Cortex's shared LLM router if TYPE_CHECKING: + # Only for type hints β€” do NOT redefine SESSIONS here from collections import deque as _deque - SESSIONS: dict - L10_HISTORY: dict - L20_HISTORY: dict def bg_summarize(session_id: str) -> None: ... -from llm.llm_router import call_llm # use Cortex's shared router - # ───────────────────────────── # Config # ───────────────────────────── @@ -220,20 +231,24 @@ def push_to_neomem(summary: str, session_id: str, level: str) -> None: # ───────────────────────────── # Main entrypoint for Cortex # ───────────────────────────── - -async def summarize_context( - session_id: str, - exchanges: List[Dict[str, Any]], -) -> Dict[str, Any]: +async def summarize_context(session_id: str, exchanges: list[dict]): """ - Main API used by Cortex: + Internal summarizer that uses Cortex's LLM router. + Produces L1 / L5 / L10 / L20 / L30 summaries. - summaries = await summarize_context(session_id, exchanges) - - `exchanges` should be the recent conversation buffer for that session. + Args: + session_id: The conversation/session ID + exchanges: A list of {"user_msg": ..., "assistant_msg": ..., "timestamp": ...} """ - buf = list(exchanges) - if not buf: + + # Build raw conversation text + convo_lines = [] + for ex in exchanges: + convo_lines.append(f"User: {ex.get('user_msg','')}") + convo_lines.append(f"Assistant: {ex.get('assistant_msg','')}") + convo_text = "\n".join(convo_lines) + + if not convo_text.strip(): return { "session_id": session_id, "exchange_count": 0, @@ -242,31 +257,72 @@ async def summarize_context( "L10": "", "L20": "", "L30": "", - "last_updated": None, + "last_updated": datetime.now().isoformat() } - # Base levels - L1 = await summarize_L1(buf) - L5 = await summarize_L5(buf) - L10 = await summarize_L10(session_id, buf) - L20 = await summarize_L20(session_id) - L30 = await summarize_L30(session_id) + # Prompt the LLM (internal β€” no HTTP) + prompt = f""" +Summarize the conversation below into multiple compression levels. 
- # Push the "interesting" tiers into NeoMem - push_to_neomem(L10, session_id, "L10") - push_to_neomem(L20, session_id, "L20") - push_to_neomem(L30, session_id, "L30") +Conversation: +---------------- +{convo_text} +---------------- - return { - "session_id": session_id, - "exchange_count": len(buf), - "L1": L1, - "L5": L5, - "L10": L10, - "L20": L20, - "L30": L30, - "last_updated": datetime.now().isoformat(), - } +Output strictly in JSON with keys: +L1 β†’ ultra short summary (1–2 sentences max) +L5 β†’ short summary +L10 β†’ medium summary +L20 β†’ detailed overview +L30 β†’ full detailed summary + +JSON only. No text outside JSON. +""" + + try: + llm_response = await call_llm( + prompt, + temperature=0.2 + ) + + + # LLM should return JSON, parse it + summary = json.loads(llm_response) + + return { + "session_id": session_id, + "exchange_count": len(exchanges), + "L1": summary.get("L1", ""), + "L5": summary.get("L5", ""), + "L10": summary.get("L10", ""), + "L20": summary.get("L20", ""), + "L30": summary.get("L30", ""), + "last_updated": datetime.now().isoformat() + } + + except Exception as e: + return { + "session_id": session_id, + "exchange_count": len(exchanges), + "L1": f"[Error summarizing: {str(e)}]", + "L5": "", + "L10": "", + "L20": "", + "L30": "", + "last_updated": datetime.now().isoformat() + } + +# ───────────────────────────────── +# Background summarization stub +# ───────────────────────────────── +def bg_summarize(session_id: str): + """ + Placeholder for background summarization. + Actual summarization happens during /reason via summarize_context(). + + This function exists to prevent NameError when called from add_exchange_internal(). + """ + print(f"[Intake] Exchange added for {session_id}. Will summarize on next /reason call.") # ───────────────────────────── # Internal entrypoint for Cortex @@ -283,15 +339,23 @@ def add_exchange_internal(exchange: dict): exchange["timestamp"] = datetime.now().isoformat() + # DEBUG: Verify we're using the module-level SESSIONS + print(f"[add_exchange_internal] SESSIONS object id: {id(SESSIONS)}, current sessions: {list(SESSIONS.keys())}") + # Ensure session exists if session_id not in SESSIONS: SESSIONS[session_id] = { "buffer": deque(maxlen=200), "created_at": datetime.now() } + print(f"[add_exchange_internal] Created new session: {session_id}") + else: + print(f"[add_exchange_internal] Using existing session: {session_id}") # Append exchange into the rolling buffer SESSIONS[session_id]["buffer"].append(exchange) + buffer_len = len(SESSIONS[session_id]["buffer"]) + print(f"[add_exchange_internal] Added exchange to {session_id}, buffer now has {buffer_len} items") # Trigger summarization immediately try: diff --git a/cortex/router.py b/cortex/router.py index 0beb457..e6ba161 100644 --- a/cortex/router.py +++ b/cortex/router.py @@ -197,26 +197,110 @@ class IngestPayload(BaseModel): user_msg: str assistant_msg: str + @cortex_router.post("/ingest") -async def ingest_stub(): - # Intake is internal now β€” this endpoint is only for compatibility. - return {"status": "ok", "note": "intake is internal now"} +async def ingest(payload: IngestPayload): + """ + Receives (session_id, user_msg, assistant_msg) from Relay + and pushes directly into Intake's in-memory buffer. - - # 1. Update Cortex session state - update_last_assistant_message(payload.session_id, payload.assistant_msg) - - # 2. Feed Intake internally (no HTTP) + Uses lenient error handling - always returns success to avoid + breaking the chat pipeline. + """ try: + # 1. 
Update Cortex session state + update_last_assistant_message(payload.session_id, payload.assistant_msg) + except Exception as e: + logger.warning(f"[INGEST] Failed to update session state: {e}") + # Continue anyway (lenient mode) + + try: + # 2. Feed Intake internally (no HTTP) add_exchange_internal({ "session_id": payload.session_id, "user_msg": payload.user_msg, "assistant_msg": payload.assistant_msg, }) - logger.debug(f"[INGEST] Added exchange to Intake for {payload.session_id}") except Exception as e: - logger.warning(f"[INGEST] Failed to add exchange to Intake: {e}") + logger.warning(f"[INGEST] Failed to add to Intake: {e}") + # Continue anyway (lenient mode) - return {"ok": True, "session_id": payload.session_id} + # Always return success (user requirement: never fail chat pipeline) + return { + "status": "ok", + "session_id": payload.session_id + } + +# ----------------------------- +# Debug endpoint: summarized context +# ----------------------------- +@cortex_router.get("/debug/summary") +async def debug_summary(session_id: str): + """ + Diagnostic endpoint that runs Intake's summarize_context() for a session. + + Shows exactly what L1/L5/L10/L20/L30 summaries would look like + inside the actual Uvicorn worker, using the real SESSIONS buffer. + """ + from intake.intake import SESSIONS, summarize_context + + # Validate session + session = SESSIONS.get(session_id) + if not session: + return {"error": "session not found", "session_id": session_id} + + # Convert deque into the structure summarize_context expects + buffer = session["buffer"] + exchanges = [ + { + "user_msg": ex.get("user_msg", ""), + "assistant_msg": ex.get("assistant_msg", ""), + } + for ex in buffer + ] + + # πŸ”₯ CRITICAL FIX β€” summarize_context is async + summary = await summarize_context(session_id, exchanges) + + return { + "session_id": session_id, + "buffer_size": len(buffer), + "exchanges_preview": exchanges[-5:], # last 5 items + "summary": summary + } + +# ----------------------------- +# Debug endpoint for SESSIONS +# ----------------------------- +@cortex_router.get("/debug/sessions") +async def debug_sessions(): + """ + Diagnostic endpoint to inspect SESSIONS from within the running Uvicorn worker. + This shows the actual state of the in-memory SESSIONS dict. + """ + from intake.intake import SESSIONS + + sessions_data = {} + for session_id, session_info in SESSIONS.items(): + buffer = session_info["buffer"] + sessions_data[session_id] = { + "created_at": session_info["created_at"].isoformat(), + "buffer_size": len(buffer), + "buffer_maxlen": buffer.maxlen, + "recent_exchanges": [ + { + "user_msg": ex.get("user_msg", "")[:100], + "assistant_msg": ex.get("assistant_msg", "")[:100], + "timestamp": ex.get("timestamp", "") + } + for ex in list(buffer)[-5:] # Last 5 exchanges + ] + } + + return { + "sessions_object_id": id(SESSIONS), + "total_sessions": len(SESSIONS), + "sessions": sessions_data + } diff --git a/vllm-mi50.md b/vllm-mi50.md deleted file mode 100644 index c8f6fd4..0000000 --- a/vllm-mi50.md +++ /dev/null @@ -1,416 +0,0 @@ -Here you go β€” a **clean, polished, ready-to-drop-into-Trilium or GitHub** Markdown file. - -If you want, I can also auto-generate a matching `/docs/vllm-mi50/` folder structure and a mini-ToC. 
- ---- - -# **MI50 + vLLM + Proxmox LXC Setup Guide** - -### *End-to-End Field Manual for gfx906 LLM Serving* - -**Version:** 1.0 -**Last updated:** 2025-11-17 - ---- - -## **πŸ“Œ Overview** - -This guide documents how to run a **vLLM OpenAI-compatible server** on an -**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN, -and wire it into **Project Lyra's Cortex reasoning layer**. - -This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again. - ---- - -## **1. What This Stack Looks Like** - -``` -Proxmox Host - β”œβ”€ AMD Instinct MI50 (gfx906) - β”œβ”€ AMDGPU + ROCm stack - └─ LXC Container (CT 201: cortex-gpu) - β”œβ”€ Ubuntu 24.04 - β”œβ”€ Docker + docker compose - β”œβ”€ vLLM inside Docker (nalanzeyu/vllm-gfx906) - β”œβ”€ GPU passthrough via /dev/kfd + /dev/dri + PCI bind - └─ vLLM API exposed on :8000 -Lyra Cortex (VM/Server) - └─ LLM_PRIMARY_URL=http://10.0.0.43:8000 -``` - ---- - -## **2. Proxmox Host β€” GPU Setup** - -### **2.1 Confirm MI50 exists** - -```bash -lspci -nn | grep -i 'vega\|instinct\|radeon' -``` - -You should see something like: - -``` -0a:00.0 Display controller: AMD Instinct MI50 (gfx906) -``` - -### **2.2 Load AMDGPU driver** - -The main pitfall after **any host reboot**. - -```bash -modprobe amdgpu -``` - -If you skip this, the LXC container won't see the GPU. - ---- - -## **3. LXC Container Configuration (CT 201)** - -The container ID is **201**. -Config file is at: - -``` -/etc/pve/lxc/201.conf -``` - -### **3.1 Working 201.conf** - -Paste this *exact* version: - -```ini -arch: amd64 -cores: 4 -hostname: cortex-gpu -memory: 16384 -swap: 512 -ostype: ubuntu -onboot: 1 -startup: order=2,up=10,down=10 -net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth -rootfs: local-lvm:vm-201-disk-0,size=200G -unprivileged: 0 - -# Docker in LXC requires this -features: keyctl=1,nesting=1 -lxc.apparmor.profile: unconfined -lxc.cap.drop: - -# --- GPU passthrough for ROCm (MI50) --- -lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666 -lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir -lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir -lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir - -# Bind the MI50 PCI device -lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file - -# Allow GPU-related character devices -lxc.cgroup2.devices.allow: c 226:* rwm -lxc.cgroup2.devices.allow: c 29:* rwm -lxc.cgroup2.devices.allow: c 189:* rwm -lxc.cgroup2.devices.allow: c 238:* rwm -lxc.cgroup2.devices.allow: c 241:* rwm -lxc.cgroup2.devices.allow: c 242:* rwm -lxc.cgroup2.devices.allow: c 243:* rwm -lxc.cgroup2.devices.allow: c 244:* rwm -lxc.cgroup2.devices.allow: c 245:* rwm -lxc.cgroup2.devices.allow: c 246:* rwm -lxc.cgroup2.devices.allow: c 247:* rwm -lxc.cgroup2.devices.allow: c 248:* rwm -lxc.cgroup2.devices.allow: c 249:* rwm -lxc.cgroup2.devices.allow: c 250:* rwm -lxc.cgroup2.devices.allow: c 510:0 rwm -``` - -### **3.2 Restart sequence** - -```bash -pct stop 201 -modprobe amdgpu -pct start 201 -pct enter 201 -``` - ---- - -## **4. Inside CT 201 β€” Verifying ROCm + GPU Visibility** - -### **4.1 Check device nodes** - -```bash -ls -l /dev/kfd -ls -l /dev/dri -ls -l /opt/rocm -``` - -All must exist. 
- -### **4.2 Validate GPU via rocminfo** - -```bash -/opt/rocm/bin/rocminfo | grep -i gfx -``` - -You need to see: - -``` -gfx906 -``` - -If you see **nothing**, the GPU isn’t passed through β€” restart and re-check the host steps. - ---- - -## **5. Install Docker in the LXC (Ubuntu 24.04)** - -This container runs Docker inside LXC (nesting enabled). - -```bash -apt update -apt install -y ca-certificates curl gnupg - -install -m 0755 -d /etc/apt/keyrings -curl -fsSL https://download.docker.com/linux/ubuntu/gpg \ - | gpg --dearmor -o /etc/apt/keyrings/docker.gpg -chmod a+r /etc/apt/keyrings/docker.gpg - -echo \ - "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \ - https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \ - > /etc/apt/sources.list.d/docker.list - -apt update -apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -``` - -Check: - -```bash -docker --version -docker compose version -``` - ---- - -## **6. Running vLLM Inside CT 201 via Docker** - -### **6.1 Create directory** - -```bash -mkdir -p /root/vllm -cd /root/vllm -``` - -### **6.2 docker-compose.yml** - -Save this exact file as `/root/vllm/docker-compose.yml`: - -```yaml -version: "3.9" - -services: - vllm-mi50: - image: nalanzeyu/vllm-gfx906:latest - container_name: vllm-mi50 - restart: unless-stopped - ports: - - "8000:8000" - environment: - VLLM_ROLE: "APIServer" - VLLM_MODEL: "/model" - VLLM_LOGGING_LEVEL: "INFO" - command: > - vllm serve /model - --host 0.0.0.0 - --port 8000 - --dtype float16 - --max-model-len 4096 - --api-type openai - devices: - - "/dev/kfd:/dev/kfd" - - "/dev/dri:/dev/dri" - volumes: - - /opt/rocm:/opt/rocm:ro -``` - -### **6.3 Start vLLM** - -```bash -docker compose up -d -docker compose logs -f -``` - -When healthy, you’ll see: - -``` -(APIServer) Application startup complete. -``` - -and periodic throughput logs. - ---- - -## **7. Test vLLM API** - -### **7.1 From Proxmox host** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping","max_tokens":5}' -``` - -Should respond like: - -```json -{"choices":[{"text":"-pong"}]} -``` - -### **7.2 From Cortex machine** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}' -``` - ---- - -## **8. Wiring into Lyra Cortex** - -In `cortex` container’s `docker-compose.yml`: - -```yaml -environment: - LLM_PRIMARY_URL: http://10.0.0.43:8000 -``` - -Not `/v1/completions` because the router appends that automatically. - -In `cortex/.env`: - -```env -LLM_FORCE_BACKEND=primary -LLM_MODEL=/model -``` - -Test: - -```bash -curl -X POST http://10.0.0.41:7081/reason \ - -H "Content-Type: application/json" \ - -d '{"prompt":"test vllm","session_id":"dev"}' -``` - -If you get a meaningful response: **Cortex β†’ vLLM is online**. - ---- - -## **9. Common Failure Modes (And Fixes)** - -### **9.1 β€œFailed to infer device type”** - -vLLM cannot see any ROCm devices. - -Fix: - -```bash -# On host -modprobe amdgpu -pct stop 201 -pct start 201 -# In container -/opt/rocm/bin/rocminfo | grep -i gfx -docker compose up -d -``` - -### **9.2 GPU disappears after reboot** - -Same fix: - -```bash -modprobe amdgpu -pct stop 201 -pct start 201 -``` - -### **9.3 Invalid image name** - -If you see pull errors: - -``` -pull access denied for nalanzeuy... 
-``` - -Use: - -``` -image: nalanzeyu/vllm-gfx906 -``` - -### **9.4 Double `/v1` in URL** - -Ensure: - -``` -LLM_PRIMARY_URL=http://10.0.0.43:8000 -``` - -Router appends `/v1/completions`. - ---- - -## **10. Daily / Reboot Ritual** - -### **On Proxmox host** - -```bash -modprobe amdgpu -pct stop 201 -pct start 201 -``` - -### **Inside CT 201** - -```bash -/opt/rocm/bin/rocminfo | grep -i gfx -cd /root/vllm -docker compose up -d -docker compose logs -f -``` - -### **Test API** - -```bash -curl -X POST http://10.0.0.43:8000/v1/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"/model","prompt":"ping","max_tokens":5}' -``` - ---- - -## **11. Summary** - -You now have: - -* **MI50 (gfx906)** correctly passed into LXC -* **ROCm** inside the container via bind mounts -* **vLLM** running inside Docker in the LXC -* **OpenAI-compatible API** on port 8000 -* **Lyra Cortex** using it automatically as primary backend - -This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime. - ---- - -If you want, I can generate: - -* A `/docs/vllm-mi50/README.md` -* A "vLLM Gotchas" document -* A quick-reference cheat sheet -* A troubleshooting decision tree - -Just say the word.