From e5e32f268340dfe89e7ada57b8112cf7d2158b67 Mon Sep 17 00:00:00 2001 From: serversdwn Date: Mon, 17 Nov 2025 03:34:23 -0500 Subject: [PATCH] Add MI50 + vLLM full setup guide --- core/CHANGELOG.md => CHANGELOG.md | 1286 ++++++++++++++--------------- core/README.md => README.md | 530 ++++++------ vllm-mi50.md | 416 ++++++++++ 3 files changed, 1324 insertions(+), 908 deletions(-) rename core/CHANGELOG.md => CHANGELOG.md (97%) rename core/README.md => README.md (97%) create mode 100644 vllm-mi50.md diff --git a/core/CHANGELOG.md b/CHANGELOG.md similarity index 97% rename from core/CHANGELOG.md rename to CHANGELOG.md index 77aff74..ce887d0 100644 --- a/core/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,643 +1,643 @@ -# Project Lyra β€” Modular Changelog -All notable changes to Project Lyra are organized by component. -The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) -and adheres to [Semantic Versioning](https://semver.org/). -# Last Updated: 11-13-25 ---- - -## 🧠 Lyra-Core ############################################################################## - -## [Lyra_RAG v0.1.0] 2025-11-07 -### Added -- Initial standalone RAG module for Project Lyra. -- Persistent ChromaDB vector store (`./chromadb`). -- Importer `rag_chat_import.py` with: - - Recursive folder scanning and category tagging. - - Smart chunking (~5 k chars). - - SHA-1 deduplication and chat-ID metadata. - - Timestamp fields (`file_modified`, `imported_at`). - - Background-safe operation (`nohup`/`tmux`). -- 68 Lyra-category chats imported: - - **6 556 new chunks added** - - **1 493 duplicates skipped** - - **7 997 total vectors** now stored. - -### API -- `/rag/search` FastAPI endpoint implemented (port 7090). -- Supports natural-language queries and returns top related excerpts. -- Added answer synthesis step using `gpt-4o-mini`. - -### Verified -- Successful recall of Lyra-Core development history (v0.3.0 snapshot). -- Correct metadata and category tagging for all new imports. - -### Next Planned -- Optional `where` filter parameter for category/date queries. -- Graceful β€œno results” handler for empty retrievals. -- `rag_docs_import.py` for PDFs and other document types. - -## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 - -### Added -- ** New UI ** - - Cleaned up UI look and feel. - -- ** Added "sessions" ** - - Now sessions persist over time. - - Ability to create new sessions or load sessions from a previous instance. - - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). - - Relay is correctly wired in. - -## [Lyra-Core 0.3.1] - 2025-10-09 - -### Added -- **NVGRAM Integration (Full Pipeline Reconnected)** - - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). - - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. - - Added `.env` variable: - ``` - NVGRAM_API=http://nvgram-api:7077 - ``` - - Verified end-to-end Lyra conversation persistence: - - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` - - βœ… Memories stored, retrieved, and re-injected successfully. - -### Changed -- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. -- Updated Docker Compose service dependency order: - - `relay` now depends on `nvgram-api` healthcheck. - - Removed `mem0` references and volumes. -- Minor cleanup to Persona fetch block (null-checks and safer default persona string). 
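The Fixed notes below describe Relay treating NVGRAM as optional at runtime: memory writes go to `${NVGRAM_API}` and a failure is logged rather than fatal. A minimal sketch of that call pattern, shown in Python for brevity (Relay itself is `server.js`, and the payload field names here are assumptions, not the exact Relay code):

```python
import os
import requests

NVGRAM_API = os.getenv("NVGRAM_API", "http://nvgram-api:7077")

def mem_add(user_id: str, text: str) -> None:
    """Store one user message; a failed POST is logged, never fatal to Relay."""
    try:
        resp = requests.post(
            f"{NVGRAM_API}/memories",
            json={"user_id": user_id, "messages": [{"role": "user", "content": text}]},
            timeout=10,
        )
        resp.raise_for_status()
    except requests.RequestException as err:
        # Mirrors the "relay error Error: memAdd failed: 500" log line noted below.
        print(f"relay error Error: memAdd failed: {err}")
```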
- -### Fixed -- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. -- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. -- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). - -### Goals / Next Steps -- Add salience visualization (e.g., memory weights displayed in injected system message). -- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. -- Add relay auto-retry for transient 500 responses from NVGRAM. - ---- - -## [Lyra-Core] v0.3.1 - 2025-09-27 -### Changed -- Removed salience filter logic; Cortex is now the default annotator. -- All user messages stored in Mem0; no discard tier applied. - -### Added -- Cortex annotations (`metadata.cortex`) now attached to memories. -- Debug logging improvements: - - Pretty-print Cortex annotations - - Injected prompt preview - - Memory search hit list with scores -- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. - -### Fixed -- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. -- Relay no longer β€œhangs” on malformed Cortex outputs. - ---- - -### [Lyra-Core] v0.3.0 β€” 2025-09-26 -#### Added -- Implemented **salience filtering** in Relay: - - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. - - Supports `heuristic` and `llm` classification modes. - - LLM-based salience filter integrated with Cortex VM running `llama-server`. -- Logging improvements: - - Added debug logs for salience mode, raw LLM output, and unexpected outputs. - - Fail-closed behavior for unexpected LLM responses. -- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. -- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. - -#### Changed -- Refactored `server.js` to gate `mem.add()` calls behind salience filter. -- Updated `.env` to support `SALIENCE_MODEL`. - -#### Known Issues -- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". -- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). -- CPU-only inference is functional but limited; larger models recommended once GPU is available. - ---- - -### [Lyra-Core] v0.2.0 β€” 2025-09-24 -#### Added -- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. -- Implemented `sessionId` support (client-supplied, fallback to `default`). -- Added debug logs for memory add/search. -- Cleaned up Relay structure for clarity. - ---- - -### [Lyra-Core] v0.1.0 β€” 2025-09-23 -#### Added -- First working MVP of **Lyra Core Relay**. -- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). -- Memory integration with Mem0: - - `POST /memories` on each user message. - - `POST /search` before LLM call. -- Persona Sidecar integration (`GET /current`). -- OpenAI GPT + Ollama (Mythomax) support in Relay. -- Simple browser-based chat UI (talks to Relay at `http://:7078`). -- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. -- Working Neo4j + Postgres backing stores for Mem0. -- Initial MVP relay service with raw fetch calls to Mem0. -- Dockerized with basic healthcheck. - -#### Fixed -- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). -- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. - -#### Known Issues -- No feedback loop (thumbs up/down) yet. -- Forget/delete flow is manual (via memory IDs). 
-- Memory latency ~1–4s depending on embedding model. - ---- - -## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## - -## [NeoMem 0.1.2] - 2025-10-27 -### Changed -- **Renamed NVGRAM to neomem** - - All future updates will be under the name NeoMem. - - Features have not changed. - -## [NVGRAM 0.1.1] - 2025-10-08 -### Added -- **Async Memory Rewrite (Stability + Safety Patch)** - - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. - - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). - - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. - - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). - - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. - - Startup logs now include **sanitized embedder config** with API keys masked for safety: - ``` - >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} - βœ… Connected to Neo4j on attempt 1 - 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized - ``` - -### Changed -- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. -- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. -- Removed redundant `FastAPI()` app reinitialization. -- Updated internal logging to INFO-level timing format: - 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) -- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. - -### Fixed -- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. -- Masked API key leaks from boot logs. -- Ensured Neo4j reconnects gracefully on first retry. - -### Goals / Next Steps -- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. -- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. -- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. - ---- - -## [NVGRAM 0.1.0] - 2025-10-07 -### Added -- **Initial fork of Mem0 β†’ NVGRAM**: - - Created a fully independent local-first memory engine based on Mem0 OSS. - - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. - - New service name: **`nvgram-api`**, default port **7077**. - - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. - - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. - - Verified clean startup: - ``` - βœ… Connected to Neo4j on attempt 1 - INFO: Uvicorn running on http://0.0.0.0:7077 - ``` - - `/docs` and `/openapi.json` confirmed reachable and functional. - -### Changed -- Removed dependency on the external `mem0ai` SDK β€” all logic now local. -- Re-pinned requirements: - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama -- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. - -### Goals / Next Steps -- Integrate NVGRAM as the new default backend in Lyra Relay. -- Deprecate remaining Mem0 references and archive old configs. -- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). 
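For a quick post-startup check that the documented endpoints are reachable (as noted above, `/docs` and `/openapi.json` should respond once `nvgram-api` is up), a small sketch like this works; it assumes the service is exposed on localhost at the default port 7077:

```python
import requests

BASE = "http://localhost:7077"  # nvgram-api default port from above

for path in ("/openapi.json", "/docs"):
    resp = requests.get(f"{BASE}{path}", timeout=5)
    print(f"GET {path} -> {resp.status_code}")  # expect 200 on a clean startup
```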
- ---- - -## [Lyra-Mem0 0.3.2] - 2025-10-05 -### Added -- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: - - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. - - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. - - Split processing pipeline: - - Embeddings β†’ OpenAI `text-embedding-3-small` - - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). -- Added `.env.3090` template for self-hosted inference nodes. -- Integrated runtime diagnostics and seeder progress tracking: - - File-level + message-level progress bars. - - Retry/back-off logic for timeouts (3 attempts). - - Event logging (`ADD / UPDATE / NONE`) for every memory record. -- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. -- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). - -### Changed -- Updated `main.py` configuration block to load: - - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. - - Fallback to OpenAI if Ollama unavailable. -- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. -- Normalized `.env` loading so `mem0-api` and host environment share identical values. -- Improved seeder logging and progress telemetry for clearer diagnostics. -- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. - -### Fixed -- Resolved crash during startup: - `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. -- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. -- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. -- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). -- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). - -### Observations -- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. -- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). - ---- - -## [Lyra-Mem0 0.3.1] - 2025-10-03 -### Added -- HuggingFace TEI integration (local 3090 embedder). -- Dual-mode environment switch between OpenAI cloud and local. -- CSV export of memories from Postgres (`payload->>'data'`). - -### Fixed -- `.env` CRLF vs LF line ending issues. -- Local seeding now possible via huggingface server running - ---- - -## [Lyra-mem0 0.3.0] -### Added -- Support for **Ollama embeddings** in Mem0 OSS container: - - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. - - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. - - Installed `ollama` Python client into custom API container image. -- `.env.3090` file created for external embedding mode (3090 machine): - - EMBEDDER_PROVIDER=ollama - - EMBEDDER_MODEL=mxbai-embed-large - - OLLAMA_HOST=http://10.0.0.3:11434 -- Workflow to support **multiple embedding modes**: - 1. Fast LAN-based 3090/Ollama embeddings - 2. Local-only CPU embeddings (Lyra Cortex VM) - 3. OpenAI fallback embeddings - -### Changed -- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. -- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. -- Updated `requirements.txt` to include `ollama` package. -- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). 
-- Tested new embeddings path with curl `/memories` API call. - -### Fixed -- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). -- Fixed config overwrite issue where rebuilding container restored stock `main.py`. -- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. - --- - -## [Lyra-mem0 v0.2.1] - -### Added -- **Seeding pipeline**: - - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. - - Implemented incremental seeding option (skip existing memories, only add new ones). - - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. -- **Ollama embedding support** in Mem0 OSS container: - - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. - - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. - - Set up three embedding modes: - 1. Fast LAN-based 3090/Ollama - 2. Local-only CPU model (Lyra Cortex VM) - 3. OpenAI fallback - -### Changed -- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. -- Mounted host `main.py` into container so local edits persist across rebuilds. -- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. -- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. -- Updated `requirements.txt` with `ollama` dependency. -- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). -- Added logging to confirm model pulls and embedding requests. - -### Fixed -- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. -- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. -- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. -- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. - -### Notes -- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). -- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. -- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. - ---- - -## [Lyra-Mem0 v0.2.0] - 2025-09-30 -### Added -- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` - - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. - - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. -- Verified REST API functionality: - - `POST /memories` works for adding memories. - - `POST /search` works for semantic search. -- Successful end-to-end test with persisted memory: - *"Likes coffee in the morning"* β†’ retrievable via search. βœ… - -### Changed -- Split architecture into **modular stacks**: - - `~/lyra-core` (Relay, Persona-Sidecar, etc.) - - `~/lyra-mem0` (Mem0 OSS memory stack) -- Removed old embedded mem0 containers from the Lyra-Core compose file. -- Added Lyra-Mem0 section in README.md. - -### Next Steps -- Wire **Relay β†’ Mem0 API** (integration not yet complete). -- Add integration tests to verify persistence and retrieval from within Lyra-Core. 
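A sketch of the integration test suggested in the Next Steps above: add one memory, then confirm semantic search returns it, mirroring the "Likes coffee in the morning" check. The `/memories` and `/search` paths match this entry; the port and JSON field names are assumptions about the local compose setup.

```python
import os
import requests

MEM0_URL = os.getenv("MEM0_URL", "http://localhost:8000")  # adjust to your mem0-api port

def test_persist_and_recall():
    # Seed one memory for a throwaway test user.
    requests.post(
        f"{MEM0_URL}/memories",
        json={"user_id": "lyra-test", "messages": [
            {"role": "user", "content": "Likes coffee in the morning"}]},
        timeout=30,
    ).raise_for_status()

    # Then confirm semantic search can retrieve it.
    hits = requests.post(
        f"{MEM0_URL}/search",
        json={"user_id": "lyra-test", "query": "What does the user drink in the morning?"},
        timeout=30,
    ).json()
    assert "coffee" in str(hits).lower()
```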
- ---- - -## 🧠 Lyra-Cortex ############################################################################## - -## [ Cortex - v0.5] -2025-11-13 - -### Added -- **New `reasoning.py` module** - - Async reasoning engine. - - Accepts user prompt, identity, RAG block, and reflection notes. - - Produces draft internal answers. - - Uses primary backend (vLLM). -- **New `reflection.py` module** - - Fully async. - - Produces actionable JSON β€œinternal notes.” - - Enforces strict JSON schema and fallback parsing. - - Forces cloud backend (`backend_override="cloud"`). -- Integrated `refine.py` into Cortex reasoning pipeline: - - New stage between reflection and persona. - - Runs exclusively on primary vLLM backend (MI50). - - Produces final, internally consistent output for downstream persona layer. -- **Backend override system** - - Each LLM call can now select its own backend. - - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. - -- **identity loader** - - Added `identity.py` with `load_identity()` for consistent persona retrieval. - -- **ingest_handler** - - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. - -### Changed -- Unified LLM backend URL handling across Cortex: - - ENV variables must now contain FULL API endpoints. - - Removed all internal path-appending (e.g. `.../v1/completions`). - - `llm_router.py` rewritten to use env-provided URLs as-is. - - Ensures consistent behavior between draft, reflection, refine, and persona. -- **Rebuilt `main.py`** - - Removed old annotation/analysis logic. - - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. - - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). - - Async path throughout Cortex. - -- **Refactored `llm_router.py`** - - Removed old fallback logic during overrides. - - OpenAI requests now use `/v1/chat/completions`. - - Added proper OpenAI Authorization headers. - - Distinct payload format for vLLM vs OpenAI. - - Unified, correct parsing across models. - -- **Simplified Cortex architecture** - - Removed deprecated β€œcontext.py” and old reasoning code. - - Relay completely decoupled from smart behavior. - -- Updated environment specification: - - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. - - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). - - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. - -### Fixed -- Resolved endpoint conflict where: - - Router expected base URLs. - - Refine expected full URLs. - - Refine always fell back due to hitting incorrect endpoint. - - Fixed by standardizing full-URL behavior across entire system. -- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). -- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. -- No more double-routing through vLLM during reflection. -- Corrected async/sync mismatch in multiple locations. -- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. - -### Removed -- Legacy `annotate`, `reason_check` glue logic from old architecture. -- Old backend probing junk code. -- Stale imports and unused modules leftover from previous prototype. - -### Verified -- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. -- refine shows `used_primary_backend: true` and no fallback. -- Manual curl test confirms endpoint accuracy. 
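The backend-override behavior verified above can be pictured with a small routing sketch. This is illustrative only: the real `llm_router.py` signatures may differ, and the Ollama secondary path is omitted for brevity. The key points it demonstrates are that env values are full endpoints used as-is, and that the cloud branch uses the OpenAI chat payload plus an Authorization header while the primary branch uses a vLLM-style completions payload.

```python
import os
import requests

# Env values are FULL endpoints, used as-is (no path appending).
BACKENDS = {
    "primary": (os.getenv("LLM_PRIMARY_URL"), os.getenv("LLM_PRIMARY_MODEL")),
    "cloud":   (os.getenv("LLM_CLOUD_URL"),   os.getenv("LLM_CLOUD_MODEL")),
}

def call_llm(prompt: str, backend_override: str = "primary") -> str:
    url, model = BACKENDS[backend_override]
    if backend_override == "cloud":
        # OpenAI chat-completions payload + Authorization header (reflection runs here).
        headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', '')}"}
        body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
        data = requests.post(url, json=body, headers=headers, timeout=120).json()
        return data["choices"][0]["message"]["content"]
    # vLLM /v1/completions payload (reasoning and refine run here).
    body = {"model": model, "prompt": prompt, "max_tokens": 512}
    data = requests.post(url, json=body, timeout=120).json()
    return data["choices"][0]["text"]
```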
- -### Known Issues -- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. -- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). - -### Pending / Known Issues -- **RAG service does not exist** β€” requires containerized FastAPI service. -- Reasoning layer lacks self-revision loop (deliberate thought cycle). -- No speak/persona generation layer yet (`speak.py` planned). -- Intake summaries not yet routing into RAG or reflection layer. -- No refinement engine between reasoning and speak. - -### Notes -This is the largest structural change to Cortex so far. -It establishes: -- multi-model cognition -- clean layering -- identity + reflection separation -- correct async code -- deterministic backend routing -- predictable JSON reflection - -The system is now ready for: -- refinement loops -- persona-speaking layer -- containerized RAG -- long-term memory integration -- true emergent-behavior experiments - - - -## [ Cortex - v0.4.1] - 2025-11-5 -### Added -- **RAG intergration** - - Added rag.py with query_rag() and format_rag_block(). - - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. - - Synthesized answers and top excerpts are injected into the reasoning prompt. - -### Changed ### -- **Revised /reason endpoint.** - - Now builds unified context blocks: - - [Intake] β†’ recent summaries - - [RAG] β†’ contextual knowledge - - [User Message] β†’ current input - - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. - - Returns cortex_prompt, draft_output, final_output, and normalized reflection. -- **Reflection Pipeline Stability** - - Cleaned parsing to normalize JSON vs. text reflections. - - Added fallback handling for malformed or non-JSON outputs. - - Log system improved to show raw JSON, extracted fields, and normalized summary. -- **Async Summarization (Intake v0.2.1)** - - Intake summaries now run in background threads to avoid blocking Cortex. - - Summaries (L1–L∞) logged asynchronously with [BG] tags. -- **Environment & Networking Fixes** - - Verified .env variables propagate correctly inside the Cortex container. - - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). - - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). - -- **Behavioral Updates** - - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). - - RAG context successfully grounds reasoning outputs. - - Intake and NeoMem confirmed receiving summaries via /add_exchange. - - Log clarity pass: all reflective and contextual blocks clearly labeled. -- **Known Gaps / Next Steps** - - NeoMem Tuning - - Improve retrieval latency and relevance. - - Implement a dedicated /reflections/recent endpoint for Cortex. - - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). -- **Cortex Enhancements** - - Add persistent reflection recall (use prior reflections as meta-context). - - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). - - Tighten temperature and prompt control for factual consistency. -- **RAG Optimization** - -Add source ranking, filtering, and multi-vector hybrid search. - -Cache RAG responses per session to reduce duplicate calls. -- **Documentation / Monitoring** - -Add health route for RAG and Intake summaries. - -Include internal latency metrics in /health endpoint. 
- -Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. - -## [Cortex - v0.3.0] – 2025-10-31 -### Added -- **Cortex Service (FastAPI)** - - New standalone reasoning engine (`cortex/main.py`) with endpoints: - - `GET /health` – reports active backend + NeoMem status. - - `POST /reason` – evaluates `{prompt, response}` pairs. - - `POST /annotate` – experimental text analysis. - - Background NeoMem health monitor (5-minute interval). - -- **Multi-Backend Reasoning Support** - - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. - - Supports: - - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) - - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) - - **Cloud** β†’ OpenAI API - - **Fallback** β†’ llama.cpp (CPU) - - Introduced per-backend model variables: - `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. - -- **Response Normalization Layer** - - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. - - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. - - Prints concise debug previews of merged content. - -- **Environment Simplification** - - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. - - Removed reliance on shared/global env file to prevent cross-contamination. - - Verified Docker Compose networking across containers. - -### Changed -- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. -- Enhanced startup logs to announce active backend, model, URL, and mode. -- Improved error handling with clearer β€œReasoning error” messages. - -### Fixed -- Corrected broken vLLM endpoint routing (`/v1/completions`). -- Stabilized cross-container health reporting for NeoMem. -- Resolved JSON parse failures caused by streaming chunk delimiters. - ---- - -## Next Planned – [v0.4.0] -### Planned Additions -- **Reflection Mode** - - Introduce `REASONING_MODE=factcheck|reflection`. - - Output schema: - ```json - { "insight": "...", "evaluation": "...", "next_action": "..." } - ``` - -- **Cortex-First Pipeline** - - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. - - Allows Lyra to β€œthink before speaking.” - -- **Verifier Stub** - - New `/verify` endpoint for search-based factual grounding. - - Asynchronous external truth checking. - -- **Memory Integration** - - Feed reflective outputs into NeoMem. - - Enable β€œdream” cycles for autonomous self-review. - ---- - -**Status:** 🟒 Stable Core – Multi-backend reasoning operational. -**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.* - ---- - -### [Intake] v0.1.0 - 2025-10-27 - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Currently logs summaries to .log file in /project-lyra/intake-logs/ - ** Next Steps ** - - Feed intake into neomem. - - Generate a daily/hourly/etc overall summary, (IE: Today Brian and Lyra worked on x, y, and z) - - Generate session aware summaries, with its own intake hopper. - - -### [Lyra-Cortex] v0.2.0 β€” 2025-09-26 -**Added -- Integrated **llama-server** on dedicated Cortex VM (Proxmox). -- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs. -- Benchmarked Phi-3.5-mini performance: - - ~18 tokens/sec CPU-only on Ryzen 7 7800X. 
- - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming"). -- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier: - - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval). - - More responsive but over-classifies messages as β€œsalient.” -- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models. - -** Known Issues -- Small models tend to drift or over-classify. -- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models. -- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot. - ---- - -### [Lyra-Cortex] v0.1.0 β€” 2025-09-25 -#### Added -- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD). -- Built **llama.cpp** with `llama-server` target via CMake. -- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model. -- Verified **API compatibility** at `/v1/chat/completions`. -- Local test successful via `curl` β†’ ~523 token response generated. -- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X). -- Confirmed usable for salience scoring, summarization, and lightweight reasoning. +# Project Lyra β€” Modular Changelog +All notable changes to Project Lyra are organized by component. +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) +and adheres to [Semantic Versioning](https://semver.org/). +# Last Updated: 11-13-25 +--- + +## 🧠 Lyra-Core ############################################################################## + +## [Lyra_RAG v0.1.0] 2025-11-07 +### Added +- Initial standalone RAG module for Project Lyra. +- Persistent ChromaDB vector store (`./chromadb`). +- Importer `rag_chat_import.py` with: + - Recursive folder scanning and category tagging. + - Smart chunking (~5 k chars). + - SHA-1 deduplication and chat-ID metadata. + - Timestamp fields (`file_modified`, `imported_at`). + - Background-safe operation (`nohup`/`tmux`). +- 68 Lyra-category chats imported: + - **6 556 new chunks added** + - **1 493 duplicates skipped** + - **7 997 total vectors** now stored. + +### API +- `/rag/search` FastAPI endpoint implemented (port 7090). +- Supports natural-language queries and returns top related excerpts. +- Added answer synthesis step using `gpt-4o-mini`. + +### Verified +- Successful recall of Lyra-Core development history (v0.3.0 snapshot). +- Correct metadata and category tagging for all new imports. + +### Next Planned +- Optional `where` filter parameter for category/date queries. +- Graceful β€œno results” handler for empty retrievals. +- `rag_docs_import.py` for PDFs and other document types. + +## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 + +### Added +- ** New UI ** + - Cleaned up UI look and feel. + +- ** Added "sessions" ** + - Now sessions persist over time. + - Ability to create new sessions or load sessions from a previous instance. + - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). + - Relay is correctly wired in. + +## [Lyra-Core 0.3.1] - 2025-10-09 + +### Added +- **NVGRAM Integration (Full Pipeline Reconnected)** + - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). + - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. 
+ - Added `.env` variable: + ``` + NVGRAM_API=http://nvgram-api:7077 + ``` + - Verified end-to-end Lyra conversation persistence: + - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` + - βœ… Memories stored, retrieved, and re-injected successfully. + +### Changed +- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. +- Updated Docker Compose service dependency order: + - `relay` now depends on `nvgram-api` healthcheck. + - Removed `mem0` references and volumes. +- Minor cleanup to Persona fetch block (null-checks and safer default persona string). + +### Fixed +- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. +- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. +- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). + +### Goals / Next Steps +- Add salience visualization (e.g., memory weights displayed in injected system message). +- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. +- Add relay auto-retry for transient 500 responses from NVGRAM. + +--- + +## [Lyra-Core] v0.3.1 - 2025-09-27 +### Changed +- Removed salience filter logic; Cortex is now the default annotator. +- All user messages stored in Mem0; no discard tier applied. + +### Added +- Cortex annotations (`metadata.cortex`) now attached to memories. +- Debug logging improvements: + - Pretty-print Cortex annotations + - Injected prompt preview + - Memory search hit list with scores +- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. + +### Fixed +- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. +- Relay no longer β€œhangs” on malformed Cortex outputs. + +--- + +### [Lyra-Core] v0.3.0 β€” 2025-09-26 +#### Added +- Implemented **salience filtering** in Relay: + - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. + - Supports `heuristic` and `llm` classification modes. + - LLM-based salience filter integrated with Cortex VM running `llama-server`. +- Logging improvements: + - Added debug logs for salience mode, raw LLM output, and unexpected outputs. + - Fail-closed behavior for unexpected LLM responses. +- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. +- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. + +#### Changed +- Refactored `server.js` to gate `mem.add()` calls behind salience filter. +- Updated `.env` to support `SALIENCE_MODEL`. + +#### Known Issues +- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". +- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). +- CPU-only inference is functional but limited; larger models recommended once GPU is available. + +--- + +### [Lyra-Core] v0.2.0 β€” 2025-09-24 +#### Added +- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. +- Implemented `sessionId` support (client-supplied, fallback to `default`). +- Added debug logs for memory add/search. +- Cleaned up Relay structure for clarity. + +--- + +### [Lyra-Core] v0.1.0 β€” 2025-09-23 +#### Added +- First working MVP of **Lyra Core Relay**. +- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). +- Memory integration with Mem0: + - `POST /memories` on each user message. + - `POST /search` before LLM call. +- Persona Sidecar integration (`GET /current`). 
+- OpenAI GPT + Ollama (Mythomax) support in Relay. +- Simple browser-based chat UI (talks to Relay at `http://:7078`). +- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. +- Working Neo4j + Postgres backing stores for Mem0. +- Initial MVP relay service with raw fetch calls to Mem0. +- Dockerized with basic healthcheck. + +#### Fixed +- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). +- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. + +#### Known Issues +- No feedback loop (thumbs up/down) yet. +- Forget/delete flow is manual (via memory IDs). +- Memory latency ~1–4s depending on embedding model. + +--- + +## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## + +## [NeoMem 0.1.2] - 2025-10-27 +### Changed +- **Renamed NVGRAM to neomem** + - All future updates will be under the name NeoMem. + - Features have not changed. + +## [NVGRAM 0.1.1] - 2025-10-08 +### Added +- **Async Memory Rewrite (Stability + Safety Patch)** + - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. + - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). + - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. + - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). + - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. + - Startup logs now include **sanitized embedder config** with API keys masked for safety: + ``` + >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} + βœ… Connected to Neo4j on attempt 1 + 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized + ``` + +### Changed +- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. +- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. +- Removed redundant `FastAPI()` app reinitialization. +- Updated internal logging to INFO-level timing format: + 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) +- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. + +### Fixed +- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. +- Masked API key leaks from boot logs. +- Ensured Neo4j reconnects gracefully on first retry. + +### Goals / Next Steps +- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. +- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. +- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. + +--- + +## [NVGRAM 0.1.0] - 2025-10-07 +### Added +- **Initial fork of Mem0 β†’ NVGRAM**: + - Created a fully independent local-first memory engine based on Mem0 OSS. + - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. + - New service name: **`nvgram-api`**, default port **7077**. + - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. + - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. 
+ - Verified clean startup: + ``` + βœ… Connected to Neo4j on attempt 1 + INFO: Uvicorn running on http://0.0.0.0:7077 + ``` + - `/docs` and `/openapi.json` confirmed reachable and functional. + +### Changed +- Removed dependency on the external `mem0ai` SDK β€” all logic now local. +- Re-pinned requirements: + - fastapi==0.115.8 + - uvicorn==0.34.0 + - pydantic==2.10.4 + - python-dotenv==1.0.1 + - psycopg>=3.2.8 + - ollama +- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. + +### Goals / Next Steps +- Integrate NVGRAM as the new default backend in Lyra Relay. +- Deprecate remaining Mem0 references and archive old configs. +- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). + +--- + +## [Lyra-Mem0 0.3.2] - 2025-10-05 +### Added +- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: + - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. + - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. + - Split processing pipeline: + - Embeddings β†’ OpenAI `text-embedding-3-small` + - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). +- Added `.env.3090` template for self-hosted inference nodes. +- Integrated runtime diagnostics and seeder progress tracking: + - File-level + message-level progress bars. + - Retry/back-off logic for timeouts (3 attempts). + - Event logging (`ADD / UPDATE / NONE`) for every memory record. +- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. +- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). + +### Changed +- Updated `main.py` configuration block to load: + - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. + - Fallback to OpenAI if Ollama unavailable. +- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. +- Normalized `.env` loading so `mem0-api` and host environment share identical values. +- Improved seeder logging and progress telemetry for clearer diagnostics. +- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. + +### Fixed +- Resolved crash during startup: + `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. +- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. +- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. +- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). +- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). + +### Observations +- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. +- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). + +--- + +## [Lyra-Mem0 0.3.1] - 2025-10-03 +### Added +- HuggingFace TEI integration (local 3090 embedder). +- Dual-mode environment switch between OpenAI cloud and local. +- CSV export of memories from Postgres (`payload->>'data'`). + +### Fixed +- `.env` CRLF vs LF line ending issues. +- Local seeding now possible via huggingface server running + +--- + +## [Lyra-mem0 0.3.0] +### Added +- Support for **Ollama embeddings** in Mem0 OSS container: + - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. + - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. 
+ - Installed `ollama` Python client into custom API container image. +- `.env.3090` file created for external embedding mode (3090 machine): + - EMBEDDER_PROVIDER=ollama + - EMBEDDER_MODEL=mxbai-embed-large + - OLLAMA_HOST=http://10.0.0.3:11434 +- Workflow to support **multiple embedding modes**: + 1. Fast LAN-based 3090/Ollama embeddings + 2. Local-only CPU embeddings (Lyra Cortex VM) + 3. OpenAI fallback embeddings + +### Changed +- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. +- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. +- Updated `requirements.txt` to include `ollama` package. +- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). +- Tested new embeddings path with curl `/memories` API call. + +### Fixed +- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). +- Fixed config overwrite issue where rebuilding container restored stock `main.py`. +- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. + +-- + +## [Lyra-mem0 v0.2.1] + +### Added +- **Seeding pipeline**: + - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. + - Implemented incremental seeding option (skip existing memories, only add new ones). + - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. +- **Ollama embedding support** in Mem0 OSS container: + - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. + - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. + - Set up three embedding modes: + 1. Fast LAN-based 3090/Ollama + 2. Local-only CPU model (Lyra Cortex VM) + 3. OpenAI fallback + +### Changed +- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. +- Mounted host `main.py` into container so local edits persist across rebuilds. +- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. +- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. +- Updated `requirements.txt` with `ollama` dependency. +- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). +- Added logging to confirm model pulls and embedding requests. + +### Fixed +- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. +- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. +- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. +- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. + +### Notes +- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). +- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. +- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. + +--- + +## [Lyra-Mem0 v0.2.0] - 2025-09-30 +### Added +- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` + - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. 
+ - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. +- Verified REST API functionality: + - `POST /memories` works for adding memories. + - `POST /search` works for semantic search. +- Successful end-to-end test with persisted memory: + *"Likes coffee in the morning"* β†’ retrievable via search. βœ… + +### Changed +- Split architecture into **modular stacks**: + - `~/lyra-core` (Relay, Persona-Sidecar, etc.) + - `~/lyra-mem0` (Mem0 OSS memory stack) +- Removed old embedded mem0 containers from the Lyra-Core compose file. +- Added Lyra-Mem0 section in README.md. + +### Next Steps +- Wire **Relay β†’ Mem0 API** (integration not yet complete). +- Add integration tests to verify persistence and retrieval from within Lyra-Core. + +--- + +## 🧠 Lyra-Cortex ############################################################################## + +## [ Cortex - v0.5] -2025-11-13 + +### Added +- **New `reasoning.py` module** + - Async reasoning engine. + - Accepts user prompt, identity, RAG block, and reflection notes. + - Produces draft internal answers. + - Uses primary backend (vLLM). +- **New `reflection.py` module** + - Fully async. + - Produces actionable JSON β€œinternal notes.” + - Enforces strict JSON schema and fallback parsing. + - Forces cloud backend (`backend_override="cloud"`). +- Integrated `refine.py` into Cortex reasoning pipeline: + - New stage between reflection and persona. + - Runs exclusively on primary vLLM backend (MI50). + - Produces final, internally consistent output for downstream persona layer. +- **Backend override system** + - Each LLM call can now select its own backend. + - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. + +- **identity loader** + - Added `identity.py` with `load_identity()` for consistent persona retrieval. + +- **ingest_handler** + - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. + +### Changed +- Unified LLM backend URL handling across Cortex: + - ENV variables must now contain FULL API endpoints. + - Removed all internal path-appending (e.g. `.../v1/completions`). + - `llm_router.py` rewritten to use env-provided URLs as-is. + - Ensures consistent behavior between draft, reflection, refine, and persona. +- **Rebuilt `main.py`** + - Removed old annotation/analysis logic. + - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. + - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). + - Async path throughout Cortex. + +- **Refactored `llm_router.py`** + - Removed old fallback logic during overrides. + - OpenAI requests now use `/v1/chat/completions`. + - Added proper OpenAI Authorization headers. + - Distinct payload format for vLLM vs OpenAI. + - Unified, correct parsing across models. + +- **Simplified Cortex architecture** + - Removed deprecated β€œcontext.py” and old reasoning code. + - Relay completely decoupled from smart behavior. + +- Updated environment specification: + - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. + - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). + - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. + +### Fixed +- Resolved endpoint conflict where: + - Router expected base URLs. + - Refine expected full URLs. + - Refine always fell back due to hitting incorrect endpoint. + - Fixed by standardizing full-URL behavior across entire system. 
+- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). +- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. +- No more double-routing through vLLM during reflection. +- Corrected async/sync mismatch in multiple locations. +- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. + +### Removed +- Legacy `annotate`, `reason_check` glue logic from old architecture. +- Old backend probing junk code. +- Stale imports and unused modules leftover from previous prototype. + +### Verified +- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. +- refine shows `used_primary_backend: true` and no fallback. +- Manual curl test confirms endpoint accuracy. + +### Known Issues +- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. +- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). + +### Pending / Known Issues +- **RAG service does not exist** β€” requires containerized FastAPI service. +- Reasoning layer lacks self-revision loop (deliberate thought cycle). +- No speak/persona generation layer yet (`speak.py` planned). +- Intake summaries not yet routing into RAG or reflection layer. +- No refinement engine between reasoning and speak. + +### Notes +This is the largest structural change to Cortex so far. +It establishes: +- multi-model cognition +- clean layering +- identity + reflection separation +- correct async code +- deterministic backend routing +- predictable JSON reflection + +The system is now ready for: +- refinement loops +- persona-speaking layer +- containerized RAG +- long-term memory integration +- true emergent-behavior experiments + + + +## [ Cortex - v0.4.1] - 2025-11-5 +### Added +- **RAG intergration** + - Added rag.py with query_rag() and format_rag_block(). + - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. + - Synthesized answers and top excerpts are injected into the reasoning prompt. + +### Changed ### +- **Revised /reason endpoint.** + - Now builds unified context blocks: + - [Intake] β†’ recent summaries + - [RAG] β†’ contextual knowledge + - [User Message] β†’ current input + - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. + - Returns cortex_prompt, draft_output, final_output, and normalized reflection. +- **Reflection Pipeline Stability** + - Cleaned parsing to normalize JSON vs. text reflections. + - Added fallback handling for malformed or non-JSON outputs. + - Log system improved to show raw JSON, extracted fields, and normalized summary. +- **Async Summarization (Intake v0.2.1)** + - Intake summaries now run in background threads to avoid blocking Cortex. + - Summaries (L1–L∞) logged asynchronously with [BG] tags. +- **Environment & Networking Fixes** + - Verified .env variables propagate correctly inside the Cortex container. + - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). + - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). + +- **Behavioral Updates** + - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). + - RAG context successfully grounds reasoning outputs. + - Intake and NeoMem confirmed receiving summaries via /add_exchange. + - Log clarity pass: all reflective and contextual blocks clearly labeled. 
+- **Known Gaps / Next Steps** + - NeoMem Tuning + - Improve retrieval latency and relevance. + - Implement a dedicated /reflections/recent endpoint for Cortex. + - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). +- **Cortex Enhancements** + - Add persistent reflection recall (use prior reflections as meta-context). + - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). + - Tighten temperature and prompt control for factual consistency. +- **RAG Optimization** + -Add source ranking, filtering, and multi-vector hybrid search. + -Cache RAG responses per session to reduce duplicate calls. +- **Documentation / Monitoring** + -Add health route for RAG and Intake summaries. + -Include internal latency metrics in /health endpoint. + +Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. + +## [Cortex - v0.3.0] – 2025-10-31 +### Added +- **Cortex Service (FastAPI)** + - New standalone reasoning engine (`cortex/main.py`) with endpoints: + - `GET /health` – reports active backend + NeoMem status. + - `POST /reason` – evaluates `{prompt, response}` pairs. + - `POST /annotate` – experimental text analysis. + - Background NeoMem health monitor (5-minute interval). + +- **Multi-Backend Reasoning Support** + - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. + - Supports: + - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) + - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) + - **Cloud** β†’ OpenAI API + - **Fallback** β†’ llama.cpp (CPU) + - Introduced per-backend model variables: + `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. + +- **Response Normalization Layer** + - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. + - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. + - Prints concise debug previews of merged content. + +- **Environment Simplification** + - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. + - Removed reliance on shared/global env file to prevent cross-contamination. + - Verified Docker Compose networking across containers. + +### Changed +- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. +- Enhanced startup logs to announce active backend, model, URL, and mode. +- Improved error handling with clearer β€œReasoning error” messages. + +### Fixed +- Corrected broken vLLM endpoint routing (`/v1/completions`). +- Stabilized cross-container health reporting for NeoMem. +- Resolved JSON parse failures caused by streaming chunk delimiters. + +--- + +## Next Planned – [v0.4.0] +### Planned Additions +- **Reflection Mode** + - Introduce `REASONING_MODE=factcheck|reflection`. + - Output schema: + ```json + { "insight": "...", "evaluation": "...", "next_action": "..." } + ``` + +- **Cortex-First Pipeline** + - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. + - Allows Lyra to β€œthink before speaking.” + +- **Verifier Stub** + - New `/verify` endpoint for search-based factual grounding. + - Asynchronous external truth checking. + +- **Memory Integration** + - Feed reflective outputs into NeoMem. + - Enable β€œdream” cycles for autonomous self-review. + +--- + +**Status:** 🟒 Stable Core – Multi-backend reasoning operational. 
**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.*

---

### [Intake] v0.1.0 - 2025-10-27
 - Receives messages from Relay and summarizes them in a cascading format.
 - Continues to summarize small batches of exchanges while also generating large-scale conversational summaries (L20).
 - Currently logs summaries to a .log file in /project-lyra/intake-logs/
 **Next Steps**
 - Feed Intake output into NeoMem.
 - Generate daily/hourly/etc. overall summaries (e.g., "Today Brian and Lyra worked on x, y, and z").
 - Generate session-aware summaries, each with its own intake hopper.


### [Lyra-Cortex] v0.2.0 β€” 2025-09-26
#### Added
- Integrated **llama-server** on dedicated Cortex VM (Proxmox).
- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs.
- Benchmarked Phi-3.5-mini performance:
  - ~18 tokens/sec CPU-only on Ryzen 7 7800X.
  - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming").
- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier:
  - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval).
  - More responsive but over-classifies messages as β€œsalient.”
- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models.

#### Known Issues
- Small models tend to drift or over-classify.
- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models.
- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot.

---

### [Lyra-Cortex] v0.1.0 β€” 2025-09-25
#### Added
- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD).
- Built **llama.cpp** with `llama-server` target via CMake.
- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model.
- Verified **API compatibility** at `/v1/chat/completions`.
- Local test successful via `curl` β†’ ~523 token response generated.
- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X).
- Confirmed usable for salience scoring, summarization, and lightweight reasoning.
diff --git a/core/README.md b/README.md
similarity index 97%
rename from core/README.md
rename to README.md
index cf265e4..f8a1eed 100644
--- a/core/README.md
+++ b/README.md
@@ -1,265 +1,265 @@
##### Project Lyra - README v0.3.0 - needs fixing #####

Lyra is a modular persistent AI companion system.
It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**,
with optional subconscious annotation powered by **Cortex VM** running local LLMs.

## Mission Statement ##
 The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with its own executive function: mention something in passing, and Lyra remembers it and reminds you of it later.

---

## Structure ##
 Project Lyra runs as a series of Docker containers that operate independently of each other but are all networked together. Just as the brain has regions, Lyra has modules:
 ## A. VM 100 - lyra-core:
 1. **Core v0.3.1 - Docker Stack**
 - Relay - (Docker container) - The main harness that connects the modules together and accepts input from the user.
 - UI - (HTML) - How the user communicates with Lyra.
ATM its a typical instant message interface, but plans are to make it much more than that. - - Persona - (docker container) - This is the personality of lyra, set how you want her to behave. Give specific instructions for output. Basically prompt injection. - - All of this is built and controlled by a single .env and docker-compose.lyra.yml. - 2. **NeoMem v0.1.0 - (docker stack) - - NeoMem is Lyra's main long term memory data base. It is a fork of mem0 oss. Uses vector databases and graph. - - NeoMem launches with a single separate docker-compose.neomem.yml. - - ## B. VM 101 - lyra - cortex - 3. ** Cortex - VM containing docker stack - - This is the working reasoning layer of Lyra. - - Built to be flexible in deployment. Run it locally or remotely (via wan/lan) - - Intake v0.1.0 - (docker Container) gives conversations context and purpose - - Intake takes the last N exchanges and summarizes them into coherrent short term memories. - - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20 etc. - - Keeps the bot aware of what is going on with out having to send it the whole chat every time. - - Cortex - Docker container containing: - - Reasoning Layer - - TBD - - Reflect - (docker continer) - Not yet implemented, road map. - - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherrent thoughts, reduces the noise. - - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams). - - This stage is not yet built, this is just an idea. - - ## C. Remote LLM APIs: - 3. **AI Backends - - Lyra doesnt run models her self, she calls up APIs. - - Endlessly customizable as long as it outputs to the same schema. - ---- - - -## πŸš€ Features ## - -# Lyra-Core VM (VM100) -- **Relay **: - - The main harness and orchestrator of Lyra. - - OpenAI-compatible endpoint: `POST /v1/chat/completions` - - Injects persona + relevant memories into every LLM call - - Routes all memory storage/retrieval through **NeoMem** - - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`) - -- **NeoMem (Memory Engine)**: - - Forked from Mem0 OSS and fully independent. - - Drop-in compatible API (`/memories`, `/search`). - - Local-first: runs on FastAPI with Postgres + Neo4j. - - No external SDK dependencies. - - Default service: `neomem-api` (port 7077). - - Capable of adding new memories and updating previous memories. Compares existing embeddings and performs in place updates when a memory is judged to be a semantic match. - -- **UI**: - - Lightweight static HTML chat page. - - Connects to Relay at `http://:7078`. - - Nice cyberpunk theme! - - Saves and loads sessions, which then in turn send to relay. - -# Beta Lyrae (RAG Memory DB) - added 11-3-25 -- **RAG Knowledge DB - Beta Lyrae (sheliak)** - - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. - - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation. 
- The system uses: - - **ChromaDB** for persistent vector storage - - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity - - **FastAPI** (port 7090) for the `/rag/search` REST endpoint - - Directory Layout - rag/ - β”œβ”€β”€ rag_chat_import.py # imports JSON chat logs - β”œβ”€β”€ rag_docs_import.py # (planned) PDF/EPUB/manual importer - β”œβ”€β”€ rag_build.py # legacy single-folder builder - β”œβ”€β”€ rag_query.py # command-line query helper - β”œβ”€β”€ rag_api.py # FastAPI service providing /rag/search - β”œβ”€β”€ chromadb/ # persistent vector store - β”œβ”€β”€ chatlogs/ # organized source data - β”‚ β”œβ”€β”€ poker/ - β”‚ β”œβ”€β”€ work/ - β”‚ β”œβ”€β”€ lyra/ - β”‚ β”œβ”€β”€ personal/ - β”‚ └── ... - └── import.log # progress log for batch runs - - **OpenAI chatlog importer. - - Takes JSON formatted chat logs and imports it to the RAG. - - **fetures include:** - - Recursive folder indexing with **category detection** from directory name - - Smart chunking for long messages (5 000 chars per slice) - - Automatic deduplication using SHA-1 hash of file + chunk - - Timestamps for both file modification and import time - - Full progress logging via tqdm - - Safe to run in background with nohup … & - - Metadata per chunk: - ```json - { - "chat_id": "", - "chunk_index": 0, - "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json", - "title": "cortex LLMs 11-1-25", - "role": "assistant", - "category": "lyra", - "type": "chat", - "file_modified": "2025-11-06T23:41:02", - "imported_at": "2025-11-07T03:55:00Z" - }``` - -# Cortex VM (VM101, CT201) - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. - - - **Intake v0.1.1 ** - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Intake then sends to cortex for self reflection, neomem for memory consolidation. - - - **Reflect ** - -TBD - -# Self hosted vLLM server # - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. - - **Stack Flow** - - [Proxmox Host] - └── loads AMDGPU driver - └── boots CT201 (order=2) - - [CT201 GPU Container] - β”œβ”€β”€ lyra-start-vllm.sh β†’ starts vLLM ROCm model server - β”œβ”€β”€ lyra-vllm.service β†’ runs the above automatically - β”œβ”€β”€ lyra-core.service β†’ launches Cortex + Intake Docker stack - └── Docker Compose β†’ runs Cortex + Intake containers - - [Cortex Container] - β”œβ”€β”€ Listens on port 7081 - β”œβ”€β”€ Talks to NVGRAM (mem API) + Intake - └── Main relay between Lyra UI ↔ memory ↔ model - - [Intake Container] - β”œβ”€β”€ Listens on port 7080 - β”œβ”€β”€ Summarizes every few exchanges - β”œβ”€β”€ Writes summaries to /app/logs/summaries.log - └── Future: sends summaries β†’ Cortex for reflection - - -# Additional information available in the trilium docs. # ---- - -## πŸ“¦ Requirements - -- Docker + Docker Compose -- Postgres + Neo4j (for NeoMem) -- Access to an open AI or ollama style API. 
-- OpenAI API key (for Relay fallback LLMs) - -**Dependencies:** - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama - ---- - -πŸ”Œ Integration Notes - -Lyra-Core connects to neomem-api:8000 inside Docker or localhost:7077 locally. - -API endpoints remain identical to Mem0 (/memories, /search). - -History and entity graphs managed internally via Postgres + Neo4j. - ---- - -🧱 Architecture Snapshot - - User β†’ Relay β†’ Cortex - ↓ - [RAG Search] - ↓ - [Reflection Loop] - ↓ - Intake (async summaries) - ↓ - NeoMem (persistent memory) - -**Cortex v0.4.1 introduces the first fully integrated reasoning loop.** -- Data Flow: - - User message enters Cortex via /reason. - - Cortex assembles context: - - Intake summaries (short-term memory) - - RAG contextual data (knowledge base) - - LLM generates initial draft (call_llm). - - Reflection loop critiques and refines the answer. - - Intake asynchronously summarizes and sends snapshots to NeoMem. - -RAG API Configuration: -Set RAG_API_URL in .env (default: http://localhost:7090). - ---- - -## Setup and Operation ## - -## Beta Lyrae - RAG memory system ## -**Requirements** - -Env= python 3.10+ - -Dependences: pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq - -Persistent storage path: ./chromadb (can be moved to /mnt/data/lyra_rag_db) - -**Import Chats** - - Chats need to be formatted into the correct format of - ``` - "messages": [ - { - "role:" "user", - "content": "Message here" - }, - "messages": [ - { - "role:" "assistant", - "content": "Message here" - },``` - - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight. - - run "python3 rag_chat_import.py", chats will then be imported automatically. For reference, it took 32 Minutes to import 68 Chat logs (aprox 10.3MB). - -**Build API Server** - - Run: rag_build.py, this automatically builds the chromaDB using data saved in the /chatlogs/ folder. (docs folder to be added in future.) - - Run: rag_api.py or ```uvicorn rag_api:app --host 0.0.0.0 --port 7090``` - -**Query** - - Run: python3 rag_query.py "Question here?" - - For testing a curl command can reach it too - ``` - curl -X POST http://127.0.0.1:7090/rag/search \ - -H "Content-Type: application/json" \ - -d '{ - "query": "What is the current state of Cortex and Project Lyra?", - "where": {"category": "lyra"} - }' - ``` - -# Beta Lyrae - RAG System - -## πŸ“– License -NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0). -This fork retains the original Apache 2.0 license and adds local modifications. -Β© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0. - +##### Project Lyra - README v0.3.0 - needs fixing ##### + +Lyra is a modular persistent AI companion system. +It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**, +with optional subconscious annotation powered by **Cortex VM** running local LLMs. + +## Mission Statement ## + The point of project lyra is to give an AI chatbot more abilities than a typical chatbot. typical chat bots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/data base/ co-creator/collaborattor all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later. 
+
+---
+
+## Structure ##
+ Project Lyra exists as a series of Docker containers that run independently of each other but are all networked together. Think of it as how the brain has regions: Lyra has modules.
+ ## A. VM 100 - lyra-core:
+ 1. **Core v0.3.1 - Docker Stack**
+  - Relay - (docker container) - The main harness that connects the modules together and accepts input from the user.
+  - UI - (HTML) - This is how the user communicates with Lyra. At the moment it's a typical instant-message interface, but plans are to make it much more than that.
+  - Persona - (docker container) - This is the personality of Lyra; set how you want her to behave and give specific instructions for output. Basically prompt injection.
+  - All of this is built and controlled by a single .env and docker-compose.lyra.yml.
+ 2. **NeoMem v0.1.0 - (docker stack)**
+  - NeoMem is Lyra's main long-term memory database. It is a fork of Mem0 OSS. Uses vector and graph databases.
+  - NeoMem launches with a single separate docker-compose.neomem.yml.
+
+ ## B. VM 101 - lyra-cortex
+ 3. **Cortex - VM containing docker stack**
+  - This is the working reasoning layer of Lyra.
+  - Built to be flexible in deployment. Run it locally or remotely (via WAN/LAN).
+  - Intake v0.1.0 - (docker container) gives conversations context and purpose.
+   - Intake takes the last N exchanges and summarizes them into coherent short-term memories.
+   - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20, etc.
+   - Keeps the bot aware of what is going on without having to send it the whole chat every time.
+  - Cortex - Docker container containing:
+   - Reasoning Layer
+   - TBD
+  - Reflect - (docker container) - Not yet implemented; roadmap.
+   - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherent thoughts and reduces the noise.
+   - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams).
+   - This stage is not yet built; this is just an idea.
+
+ ## C. Remote LLM APIs:
+ 4. **AI Backends**
+  - Lyra doesn't run models herself; she calls out to APIs.
+  - Endlessly customizable as long as the backend outputs to the same schema.
+
+---
+
+
+## 🚀 Features ##
+
+# Lyra-Core VM (VM100)
+- **Relay**:
+  - The main harness and orchestrator of Lyra.
+  - OpenAI-compatible endpoint: `POST /v1/chat/completions`
+  - Injects persona + relevant memories into every LLM call
+  - Routes all memory storage/retrieval through **NeoMem**
+  - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`)
+
+- **NeoMem (Memory Engine)**:
+  - Forked from Mem0 OSS and fully independent.
+  - Drop-in compatible API (`/memories`, `/search`).
+  - Local-first: runs on FastAPI with Postgres + Neo4j.
+  - No external SDK dependencies.
+  - Default service: `neomem-api` (port 7077).
+  - Capable of adding new memories and updating previous memories. Compares existing embeddings and performs in-place updates when a memory is judged to be a semantic match.
+
+- **UI**:
+  - Lightweight static HTML chat page.
+  - Connects to Relay at `http://:7078`.
+  - Nice cyberpunk theme!
+  - Saves and loads sessions, which are then sent to Relay.
+
+# Beta Lyrae (RAG Memory DB) - added 11-3-25
+- **RAG Knowledge DB - Beta Lyrae (sheliak)**
+  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
+  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
+  - The system uses:
+    - **ChromaDB** for persistent vector storage
+    - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
+    - **FastAPI** (port 7090) for the `/rag/search` REST endpoint
+  - Directory Layout
+      rag/
+      ├── rag_chat_import.py      # imports JSON chat logs
+      ├── rag_docs_import.py      # (planned) PDF/EPUB/manual importer
+      ├── rag_build.py            # legacy single-folder builder
+      ├── rag_query.py            # command-line query helper
+      ├── rag_api.py              # FastAPI service providing /rag/search
+      ├── chromadb/               # persistent vector store
+      ├── chatlogs/               # organized source data
+      │   ├── poker/
+      │   ├── work/
+      │   ├── lyra/
+      │   ├── personal/
+      │   └── ...
+      └── import.log              # progress log for batch runs
+  - **OpenAI chatlog importer**
+    - Takes JSON-formatted chat logs and imports them into the RAG store.
+    - **Features include:**
+      - Recursive folder indexing with **category detection** from directory name
+      - Smart chunking for long messages (5,000 chars per slice)
+      - Automatic deduplication using SHA-1 hash of file + chunk
+      - Timestamps for both file modification and import time
+      - Full progress logging via tqdm
+      - Safe to run in the background with `nohup … &`
+  - Metadata per chunk:
+    ```json
+    {
+      "chat_id": "",
+      "chunk_index": 0,
+      "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
+      "title": "cortex LLMs 11-1-25",
+      "role": "assistant",
+      "category": "lyra",
+      "type": "chat",
+      "file_modified": "2025-11-06T23:41:02",
+      "imported_at": "2025-11-07T03:55:00Z"
+    }
+    ```
+
+# Cortex VM (VM101, CT201)
+  - **CT201 main reasoning orchestrator.**
+  - This is the internal brain of Lyra.
+  - Running in a privileged LXC.
+  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
+  - Accessible via `10.0.0.43:8000/v1/completions`.
+
+  - **Intake v0.1.1**
+    - Receives messages from Relay and summarizes them in a cascading format.
+    - Continues to summarize small batches of exchanges while also generating large-scale conversational summaries. (L20)
+    - Intake then sends its output to Cortex for self-reflection and to NeoMem for memory consolidation.
+
+  - **Reflect**
+    - TBD
+
+# Self-hosted vLLM server #
+  - **CT201 main reasoning orchestrator.**
+  - This is the internal brain of Lyra.
+  - Running in a privileged LXC.
+  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
+  - Accessible via `10.0.0.43:8000/v1/completions`.
+  - **Stack Flow**
+
+   [Proxmox Host]
+   └── loads AMDGPU driver
+   └── boots CT201 (order=2)
+
+   [CT201 GPU Container]
+   ├── lyra-start-vllm.sh   → starts vLLM ROCm model server
+   ├── lyra-vllm.service    → runs the above automatically (unit sketch below)
+   ├── lyra-core.service    → launches Cortex + Intake Docker stack
+   └── Docker Compose       → runs Cortex + Intake containers
+
+   [Cortex Container]
+   ├── Listens on port 7081
+   ├── Talks to NVGRAM (mem API) + Intake
+   └── Main relay between Lyra UI ↔ memory ↔ model
+
+   [Intake Container]
+   ├── Listens on port 7080
+   ├── Summarizes every few exchanges
+   ├── Writes summaries to /app/logs/summaries.log
+   └── Future: sends summaries → Cortex for reflection
+
+
+# Additional information available in the trilium docs. #
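+
+The systemd units referenced in the stack flow are not reproduced in this repo snapshot. As a rough sketch only — assuming the `/root/vllm` docker-compose layout described in `vllm-mi50.md`, rather than the actual contents of `lyra-start-vllm.sh` — a unit like `lyra-vllm.service` could be created inside CT201 along these lines:
+
+```bash
+# Hypothetical sketch, not the shipped unit. Assumes the compose file from
+# vllm-mi50.md lives in /root/vllm inside CT201.
+cat > /etc/systemd/system/lyra-vllm.service <<'EOF'
+[Unit]
+Description=Lyra vLLM (MI50 / gfx906) model server
+After=docker.service
+Requires=docker.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=true
+WorkingDirectory=/root/vllm
+ExecStart=/usr/bin/docker compose up -d
+ExecStop=/usr/bin/docker compose down
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable --now lyra-vllm.service
+```
+
+`RemainAfterExit=true` keeps the unit marked active after `docker compose up -d` returns, so `systemctl status lyra-vllm` reflects whether bring-up succeeded.
+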
+---
+
+## 📦 Requirements
+
+- Docker + Docker Compose
+- Postgres + Neo4j (for NeoMem)
+- Access to an OpenAI- or Ollama-style API.
+- OpenAI API key (for Relay fallback LLMs)
+
+**Dependencies:**
+ - fastapi==0.115.8
+ - uvicorn==0.34.0
+ - pydantic==2.10.4
+ - python-dotenv==1.0.1
+ - psycopg>=3.2.8
+ - ollama
+
+---
+
+## 🔌 Integration Notes
+
+Lyra-Core connects to `neomem-api:8000` inside Docker or `localhost:7077` locally.
+
+API endpoints remain identical to Mem0 (`/memories`, `/search`).
+
+History and entity graphs are managed internally via Postgres + Neo4j.
+
+---
+
+## 🧱 Architecture Snapshot
+
+ User → Relay → Cortex
+        ↓
+   [RAG Search]
+        ↓
+  [Reflection Loop]
+        ↓
+ Intake (async summaries)
+        ↓
+ NeoMem (persistent memory)
+
+**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
+- Data Flow:
+  - User message enters Cortex via /reason.
+  - Cortex assembles context:
+    - Intake summaries (short-term memory)
+    - RAG contextual data (knowledge base)
+  - LLM generates initial draft (`call_llm`).
+  - Reflection loop critiques and refines the answer.
+  - Intake asynchronously summarizes and sends snapshots to NeoMem.
+
+**RAG API Configuration:**
+Set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).
+
+---
+
+## Setup and Operation ##
+
+## Beta Lyrae - RAG memory system ##
+**Requirements**
+ - Environment: Python 3.10+
+ - Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
+ - Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)
+
+**Import Chats**
+ - Chats need to be formatted as follows:
+ ```json
+ {
+   "messages": [
+     { "role": "user", "content": "Message here" },
+     { "role": "assistant", "content": "Message here" }
+   ]
+ }
+ ```
+ - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight.
+ - Run `python3 rag_chat_import.py`; chats will then be imported automatically. For reference, it took 32 minutes to import 68 chat logs (approx. 10.3 MB).
+
+**Build API Server**
+ - Run `rag_build.py`; this automatically builds the ChromaDB store from data saved in the `/chatlogs/` folder. (A docs folder is to be added in the future.)
+ - Run `rag_api.py` or `uvicorn rag_api:app --host 0.0.0.0 --port 7090`
+
+**Query**
+ - Run: `python3 rag_query.py "Question here?"`
+ - For testing, a curl command can reach it too:
+ ```
+ curl -X POST http://127.0.0.1:7090/rag/search \
+   -H "Content-Type: application/json" \
+   -d '{
+     "query": "What is the current state of Cortex and Project Lyra?",
+     "where": {"category": "lyra"}
+   }'
+ ```
+
+## 📖 License
+NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
+This fork retains the original Apache 2.0 license and adds local modifications.
+© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
+
diff --git a/vllm-mi50.md b/vllm-mi50.md
new file mode 100644
index 0000000..c8f6fd4
--- /dev/null
+++ b/vllm-mi50.md
@@ -0,0 +1,416 @@
+ +--- + +# **MI50 + vLLM + Proxmox LXC Setup Guide** + +### *End-to-End Field Manual for gfx906 LLM Serving* + +**Version:** 1.0 +**Last updated:** 2025-11-17 + +--- + +## **πŸ“Œ Overview** + +This guide documents how to run a **vLLM OpenAI-compatible server** on an +**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN, +and wire it into **Project Lyra's Cortex reasoning layer**. + +This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again. + +--- + +## **1. What This Stack Looks Like** + +``` +Proxmox Host + β”œβ”€ AMD Instinct MI50 (gfx906) + β”œβ”€ AMDGPU + ROCm stack + └─ LXC Container (CT 201: cortex-gpu) + β”œβ”€ Ubuntu 24.04 + β”œβ”€ Docker + docker compose + β”œβ”€ vLLM inside Docker (nalanzeyu/vllm-gfx906) + β”œβ”€ GPU passthrough via /dev/kfd + /dev/dri + PCI bind + └─ vLLM API exposed on :8000 +Lyra Cortex (VM/Server) + └─ LLM_PRIMARY_URL=http://10.0.0.43:8000 +``` + +--- + +## **2. Proxmox Host β€” GPU Setup** + +### **2.1 Confirm MI50 exists** + +```bash +lspci -nn | grep -i 'vega\|instinct\|radeon' +``` + +You should see something like: + +``` +0a:00.0 Display controller: AMD Instinct MI50 (gfx906) +``` + +### **2.2 Load AMDGPU driver** + +The main pitfall after **any host reboot**. + +```bash +modprobe amdgpu +``` + +If you skip this, the LXC container won't see the GPU. + +--- + +## **3. LXC Container Configuration (CT 201)** + +The container ID is **201**. +Config file is at: + +``` +/etc/pve/lxc/201.conf +``` + +### **3.1 Working 201.conf** + +Paste this *exact* version: + +```ini +arch: amd64 +cores: 4 +hostname: cortex-gpu +memory: 16384 +swap: 512 +ostype: ubuntu +onboot: 1 +startup: order=2,up=10,down=10 +net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth +rootfs: local-lvm:vm-201-disk-0,size=200G +unprivileged: 0 + +# Docker in LXC requires this +features: keyctl=1,nesting=1 +lxc.apparmor.profile: unconfined +lxc.cap.drop: + +# --- GPU passthrough for ROCm (MI50) --- +lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666 +lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir +lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir +lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir + +# Bind the MI50 PCI device +lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file + +# Allow GPU-related character devices +lxc.cgroup2.devices.allow: c 226:* rwm +lxc.cgroup2.devices.allow: c 29:* rwm +lxc.cgroup2.devices.allow: c 189:* rwm +lxc.cgroup2.devices.allow: c 238:* rwm +lxc.cgroup2.devices.allow: c 241:* rwm +lxc.cgroup2.devices.allow: c 242:* rwm +lxc.cgroup2.devices.allow: c 243:* rwm +lxc.cgroup2.devices.allow: c 244:* rwm +lxc.cgroup2.devices.allow: c 245:* rwm +lxc.cgroup2.devices.allow: c 246:* rwm +lxc.cgroup2.devices.allow: c 247:* rwm +lxc.cgroup2.devices.allow: c 248:* rwm +lxc.cgroup2.devices.allow: c 249:* rwm +lxc.cgroup2.devices.allow: c 250:* rwm +lxc.cgroup2.devices.allow: c 510:0 rwm +``` + +### **3.2 Restart sequence** + +```bash +pct stop 201 +modprobe amdgpu +pct start 201 +pct enter 201 +``` + +--- + +## **4. Inside CT 201 β€” Verifying ROCm + GPU Visibility** + +### **4.1 Check device nodes** + +```bash +ls -l /dev/kfd +ls -l /dev/dri +ls -l /opt/rocm +``` + +All must exist. 
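+
+If you want to script this check, here is a minimal pre-flight sketch. It assumes only the three paths above and is not part of the shipped tooling:
+
+```bash
+#!/usr/bin/env bash
+# Pre-flight check inside CT 201: confirm the ROCm device nodes and bind mount exist.
+set -euo pipefail
+
+for p in /dev/kfd /dev/dri /opt/rocm; do
+  if [ -e "$p" ]; then
+    echo "OK      $p"
+  else
+    echo "MISSING $p (re-check 'modprobe amdgpu' on the host and the 201.conf bind mounts)" >&2
+    exit 1
+  fi
+done
+echo "All ROCm paths present."
+```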
+ +### **4.2 Validate GPU via rocminfo** + +```bash +/opt/rocm/bin/rocminfo | grep -i gfx +``` + +You need to see: + +``` +gfx906 +``` + +If you see **nothing**, the GPU isn’t passed through β€” restart and re-check the host steps. + +--- + +## **5. Install Docker in the LXC (Ubuntu 24.04)** + +This container runs Docker inside LXC (nesting enabled). + +```bash +apt update +apt install -y ca-certificates curl gnupg + +install -m 0755 -d /etc/apt/keyrings +curl -fsSL https://download.docker.com/linux/ubuntu/gpg \ + | gpg --dearmor -o /etc/apt/keyrings/docker.gpg +chmod a+r /etc/apt/keyrings/docker.gpg + +echo \ + "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \ + https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \ + > /etc/apt/sources.list.d/docker.list + +apt update +apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin +``` + +Check: + +```bash +docker --version +docker compose version +``` + +--- + +## **6. Running vLLM Inside CT 201 via Docker** + +### **6.1 Create directory** + +```bash +mkdir -p /root/vllm +cd /root/vllm +``` + +### **6.2 docker-compose.yml** + +Save this exact file as `/root/vllm/docker-compose.yml`: + +```yaml +version: "3.9" + +services: + vllm-mi50: + image: nalanzeyu/vllm-gfx906:latest + container_name: vllm-mi50 + restart: unless-stopped + ports: + - "8000:8000" + environment: + VLLM_ROLE: "APIServer" + VLLM_MODEL: "/model" + VLLM_LOGGING_LEVEL: "INFO" + command: > + vllm serve /model + --host 0.0.0.0 + --port 8000 + --dtype float16 + --max-model-len 4096 + --api-type openai + devices: + - "/dev/kfd:/dev/kfd" + - "/dev/dri:/dev/dri" + volumes: + - /opt/rocm:/opt/rocm:ro +``` + +### **6.3 Start vLLM** + +```bash +docker compose up -d +docker compose logs -f +``` + +When healthy, you’ll see: + +``` +(APIServer) Application startup complete. +``` + +and periodic throughput logs. + +--- + +## **7. Test vLLM API** + +### **7.1 From Proxmox host** + +```bash +curl -X POST http://10.0.0.43:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"/model","prompt":"ping","max_tokens":5}' +``` + +Should respond like: + +```json +{"choices":[{"text":"-pong"}]} +``` + +### **7.2 From Cortex machine** + +```bash +curl -X POST http://10.0.0.43:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}' +``` + +--- + +## **8. Wiring into Lyra Cortex** + +In `cortex` container’s `docker-compose.yml`: + +```yaml +environment: + LLM_PRIMARY_URL: http://10.0.0.43:8000 +``` + +Not `/v1/completions` because the router appends that automatically. + +In `cortex/.env`: + +```env +LLM_FORCE_BACKEND=primary +LLM_MODEL=/model +``` + +Test: + +```bash +curl -X POST http://10.0.0.41:7081/reason \ + -H "Content-Type: application/json" \ + -d '{"prompt":"test vllm","session_id":"dev"}' +``` + +If you get a meaningful response: **Cortex β†’ vLLM is online**. + +--- + +## **9. Common Failure Modes (And Fixes)** + +### **9.1 β€œFailed to infer device type”** + +vLLM cannot see any ROCm devices. + +Fix: + +```bash +# On host +modprobe amdgpu +pct stop 201 +pct start 201 +# In container +/opt/rocm/bin/rocminfo | grep -i gfx +docker compose up -d +``` + +### **9.2 GPU disappears after reboot** + +Same fix: + +```bash +modprobe amdgpu +pct stop 201 +pct start 201 +``` + +### **9.3 Invalid image name** + +If you see pull errors: + +``` +pull access denied for nalanzeuy... 
+
+```
+
+Use:
+
+```
+image: nalanzeyu/vllm-gfx906
+```
+
+### **9.4 Double `/v1` in URL**
+
+Ensure:
+
+```
+LLM_PRIMARY_URL=http://10.0.0.43:8000
+```
+
+Router appends `/v1/completions`.
+
+---
+
+## **10. Daily / Reboot Ritual**
+
+### **On Proxmox host**
+
+```bash
+modprobe amdgpu
+pct stop 201
+pct start 201
+```
+
+### **Inside CT 201**
+
+```bash
+/opt/rocm/bin/rocminfo | grep -i gfx
+cd /root/vllm
+docker compose up -d
+docker compose logs -f
+```
+
+### **Test API**
+
+```bash
+curl -X POST http://10.0.0.43:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
+```
+
+---
+
+## **11. Summary**
+
+You now have:
+
+* **MI50 (gfx906)** correctly passed into LXC
+* **ROCm** inside the container via bind mounts
+* **vLLM** running inside Docker in the LXC
+* **OpenAI-compatible API** on port 8000
+* **Lyra Cortex** using it automatically as primary backend
+
+This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime.
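+
+---
+
+### **Appendix: One-Shot Bring-Up Sketch (Optional)**
+
+The reboot ritual in section 10 can be wrapped into a single host-side script. This is a sketch, not something shipped in the repo; it assumes CT 201, the `/root/vllm` compose directory, and the `10.0.0.43` address used throughout this guide.
+
+```bash
+#!/usr/bin/env bash
+# Sketch: one-shot bring-up after a Proxmox host reboot (run on the host).
+# Assumes CT 201 and the /root/vllm compose directory described above.
+set -euo pipefail
+
+modprobe amdgpu
+pct stop 201 || true        # ignore the error if the container is already stopped
+pct start 201
+
+# Give the container a moment, then verify the GPU and start vLLM inside it.
+sleep 10
+pct exec 201 -- bash -c '/opt/rocm/bin/rocminfo | grep -i gfx'
+pct exec 201 -- bash -c 'cd /root/vllm && docker compose up -d'
+
+# Smoke-test the OpenAI-compatible endpoint from the host.
+curl -s -X POST http://10.0.0.43:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
+```
+
+`pct exec` runs the container-side steps from the host, so the whole ritual can be driven from one shell.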