From e5e32f268340dfe89e7ada57b8112cf7d2158b67 Mon Sep 17 00:00:00 2001 From: serversdwn Date: Mon, 17 Nov 2025 03:34:23 -0500 Subject: [PATCH] Add MI50 + vLLM full setup guide --- core/CHANGELOG.md => CHANGELOG.md | 1286 ++++++++++++++--------------- core/README.md => README.md | 530 ++++++------ vllm-mi50.md | 416 ++++++++++ 3 files changed, 1324 insertions(+), 908 deletions(-) rename core/CHANGELOG.md => CHANGELOG.md (97%) rename core/README.md => README.md (97%) create mode 100644 vllm-mi50.md diff --git a/core/CHANGELOG.md b/CHANGELOG.md similarity index 97% rename from core/CHANGELOG.md rename to CHANGELOG.md index 77aff74..ce887d0 100644 --- a/core/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,643 +1,643 @@ -# Project Lyra β€” Modular Changelog -All notable changes to Project Lyra are organized by component. -The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) -and adheres to [Semantic Versioning](https://semver.org/). -# Last Updated: 11-13-25 ---- - -## 🧠 Lyra-Core ############################################################################## - -## [Lyra_RAG v0.1.0] 2025-11-07 -### Added -- Initial standalone RAG module for Project Lyra. -- Persistent ChromaDB vector store (`./chromadb`). -- Importer `rag_chat_import.py` with: - - Recursive folder scanning and category tagging. - - Smart chunking (~5 k chars). - - SHA-1 deduplication and chat-ID metadata. - - Timestamp fields (`file_modified`, `imported_at`). - - Background-safe operation (`nohup`/`tmux`). -- 68 Lyra-category chats imported: - - **6 556 new chunks added** - - **1 493 duplicates skipped** - - **7 997 total vectors** now stored. - -### API -- `/rag/search` FastAPI endpoint implemented (port 7090). -- Supports natural-language queries and returns top related excerpts. -- Added answer synthesis step using `gpt-4o-mini`. - -### Verified -- Successful recall of Lyra-Core development history (v0.3.0 snapshot). -- Correct metadata and category tagging for all new imports. - -### Next Planned -- Optional `where` filter parameter for category/date queries. -- Graceful β€œno results” handler for empty retrievals. -- `rag_docs_import.py` for PDFs and other document types. - -## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 - -### Added -- ** New UI ** - - Cleaned up UI look and feel. - -- ** Added "sessions" ** - - Now sessions persist over time. - - Ability to create new sessions or load sessions from a previous instance. - - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). - - Relay is correctly wired in. - -## [Lyra-Core 0.3.1] - 2025-10-09 - -### Added -- **NVGRAM Integration (Full Pipeline Reconnected)** - - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). - - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. - - Added `.env` variable: - ``` - NVGRAM_API=http://nvgram-api:7077 - ``` - - Verified end-to-end Lyra conversation persistence: - - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` - - βœ… Memories stored, retrieved, and re-injected successfully. - -### Changed -- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. -- Updated Docker Compose service dependency order: - - `relay` now depends on `nvgram-api` healthcheck. - - Removed `mem0` references and volumes. -- Minor cleanup to Persona fetch block (null-checks and safer default persona string). 
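The Fixed notes below describe Relay treating NVGRAM as optional at runtime: memory writes go to `${NVGRAM_API}` and a failure is logged rather than fatal. A minimal sketch of that call pattern, shown in Python for brevity (Relay itself is `server.js`, and the payload field names here are assumptions, not the exact Relay code):

```python
import os
import requests

NVGRAM_API = os.getenv("NVGRAM_API", "http://nvgram-api:7077")

def mem_add(user_id: str, text: str) -> None:
    """Store one user message; a failed POST is logged, never fatal to Relay."""
    try:
        resp = requests.post(
            f"{NVGRAM_API}/memories",
            json={"user_id": user_id, "messages": [{"role": "user", "content": text}]},
            timeout=10,
        )
        resp.raise_for_status()
    except requests.RequestException as err:
        # Mirrors the "relay error Error: memAdd failed: 500" log line noted below.
        print(f"relay error Error: memAdd failed: {err}")
```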
- -### Fixed -- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. -- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. -- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). - -### Goals / Next Steps -- Add salience visualization (e.g., memory weights displayed in injected system message). -- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. -- Add relay auto-retry for transient 500 responses from NVGRAM. - ---- - -## [Lyra-Core] v0.3.1 - 2025-09-27 -### Changed -- Removed salience filter logic; Cortex is now the default annotator. -- All user messages stored in Mem0; no discard tier applied. - -### Added -- Cortex annotations (`metadata.cortex`) now attached to memories. -- Debug logging improvements: - - Pretty-print Cortex annotations - - Injected prompt preview - - Memory search hit list with scores -- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. - -### Fixed -- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. -- Relay no longer β€œhangs” on malformed Cortex outputs. - ---- - -### [Lyra-Core] v0.3.0 β€” 2025-09-26 -#### Added -- Implemented **salience filtering** in Relay: - - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. - - Supports `heuristic` and `llm` classification modes. - - LLM-based salience filter integrated with Cortex VM running `llama-server`. -- Logging improvements: - - Added debug logs for salience mode, raw LLM output, and unexpected outputs. - - Fail-closed behavior for unexpected LLM responses. -- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. -- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. - -#### Changed -- Refactored `server.js` to gate `mem.add()` calls behind salience filter. -- Updated `.env` to support `SALIENCE_MODEL`. - -#### Known Issues -- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". -- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). -- CPU-only inference is functional but limited; larger models recommended once GPU is available. - ---- - -### [Lyra-Core] v0.2.0 β€” 2025-09-24 -#### Added -- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. -- Implemented `sessionId` support (client-supplied, fallback to `default`). -- Added debug logs for memory add/search. -- Cleaned up Relay structure for clarity. - ---- - -### [Lyra-Core] v0.1.0 β€” 2025-09-23 -#### Added -- First working MVP of **Lyra Core Relay**. -- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). -- Memory integration with Mem0: - - `POST /memories` on each user message. - - `POST /search` before LLM call. -- Persona Sidecar integration (`GET /current`). -- OpenAI GPT + Ollama (Mythomax) support in Relay. -- Simple browser-based chat UI (talks to Relay at `http://:7078`). -- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. -- Working Neo4j + Postgres backing stores for Mem0. -- Initial MVP relay service with raw fetch calls to Mem0. -- Dockerized with basic healthcheck. - -#### Fixed -- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). -- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. - -#### Known Issues -- No feedback loop (thumbs up/down) yet. -- Forget/delete flow is manual (via memory IDs). 
-- Memory latency ~1–4s depending on embedding model. - ---- - -## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## - -## [NeoMem 0.1.2] - 2025-10-27 -### Changed -- **Renamed NVGRAM to neomem** - - All future updates will be under the name NeoMem. - - Features have not changed. - -## [NVGRAM 0.1.1] - 2025-10-08 -### Added -- **Async Memory Rewrite (Stability + Safety Patch)** - - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. - - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). - - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. - - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). - - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. - - Startup logs now include **sanitized embedder config** with API keys masked for safety: - ``` - >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} - βœ… Connected to Neo4j on attempt 1 - 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized - ``` - -### Changed -- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. -- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. -- Removed redundant `FastAPI()` app reinitialization. -- Updated internal logging to INFO-level timing format: - 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) -- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. - -### Fixed -- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. -- Masked API key leaks from boot logs. -- Ensured Neo4j reconnects gracefully on first retry. - -### Goals / Next Steps -- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. -- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. -- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. - ---- - -## [NVGRAM 0.1.0] - 2025-10-07 -### Added -- **Initial fork of Mem0 β†’ NVGRAM**: - - Created a fully independent local-first memory engine based on Mem0 OSS. - - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. - - New service name: **`nvgram-api`**, default port **7077**. - - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. - - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. - - Verified clean startup: - ``` - βœ… Connected to Neo4j on attempt 1 - INFO: Uvicorn running on http://0.0.0.0:7077 - ``` - - `/docs` and `/openapi.json` confirmed reachable and functional. - -### Changed -- Removed dependency on the external `mem0ai` SDK β€” all logic now local. -- Re-pinned requirements: - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama -- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. - -### Goals / Next Steps -- Integrate NVGRAM as the new default backend in Lyra Relay. -- Deprecate remaining Mem0 references and archive old configs. -- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). 
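For a quick post-startup check that the documented endpoints are reachable (as noted above, `/docs` and `/openapi.json` should respond once `nvgram-api` is up), a small sketch like this works; it assumes the service is exposed on localhost at the default port 7077:

```python
import requests

BASE = "http://localhost:7077"  # nvgram-api default port from above

for path in ("/openapi.json", "/docs"):
    resp = requests.get(f"{BASE}{path}", timeout=5)
    print(f"GET {path} -> {resp.status_code}")  # expect 200 on a clean startup
```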
- ---- - -## [Lyra-Mem0 0.3.2] - 2025-10-05 -### Added -- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: - - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. - - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. - - Split processing pipeline: - - Embeddings β†’ OpenAI `text-embedding-3-small` - - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). -- Added `.env.3090` template for self-hosted inference nodes. -- Integrated runtime diagnostics and seeder progress tracking: - - File-level + message-level progress bars. - - Retry/back-off logic for timeouts (3 attempts). - - Event logging (`ADD / UPDATE / NONE`) for every memory record. -- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. -- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). - -### Changed -- Updated `main.py` configuration block to load: - - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. - - Fallback to OpenAI if Ollama unavailable. -- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. -- Normalized `.env` loading so `mem0-api` and host environment share identical values. -- Improved seeder logging and progress telemetry for clearer diagnostics. -- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. - -### Fixed -- Resolved crash during startup: - `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. -- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. -- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. -- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). -- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). - -### Observations -- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. -- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). - ---- - -## [Lyra-Mem0 0.3.1] - 2025-10-03 -### Added -- HuggingFace TEI integration (local 3090 embedder). -- Dual-mode environment switch between OpenAI cloud and local. -- CSV export of memories from Postgres (`payload->>'data'`). - -### Fixed -- `.env` CRLF vs LF line ending issues. -- Local seeding now possible via huggingface server running - ---- - -## [Lyra-mem0 0.3.0] -### Added -- Support for **Ollama embeddings** in Mem0 OSS container: - - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. - - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. - - Installed `ollama` Python client into custom API container image. -- `.env.3090` file created for external embedding mode (3090 machine): - - EMBEDDER_PROVIDER=ollama - - EMBEDDER_MODEL=mxbai-embed-large - - OLLAMA_HOST=http://10.0.0.3:11434 -- Workflow to support **multiple embedding modes**: - 1. Fast LAN-based 3090/Ollama embeddings - 2. Local-only CPU embeddings (Lyra Cortex VM) - 3. OpenAI fallback embeddings - -### Changed -- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. -- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. -- Updated `requirements.txt` to include `ollama` package. -- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). 
-- Tested new embeddings path with curl `/memories` API call. - -### Fixed -- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). -- Fixed config overwrite issue where rebuilding container restored stock `main.py`. -- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. - --- - -## [Lyra-mem0 v0.2.1] - -### Added -- **Seeding pipeline**: - - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. - - Implemented incremental seeding option (skip existing memories, only add new ones). - - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. -- **Ollama embedding support** in Mem0 OSS container: - - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. - - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. - - Set up three embedding modes: - 1. Fast LAN-based 3090/Ollama - 2. Local-only CPU model (Lyra Cortex VM) - 3. OpenAI fallback - -### Changed -- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. -- Mounted host `main.py` into container so local edits persist across rebuilds. -- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. -- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. -- Updated `requirements.txt` with `ollama` dependency. -- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). -- Added logging to confirm model pulls and embedding requests. - -### Fixed -- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. -- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. -- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. -- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. - -### Notes -- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). -- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. -- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. - ---- - -## [Lyra-Mem0 v0.2.0] - 2025-09-30 -### Added -- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` - - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. - - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. -- Verified REST API functionality: - - `POST /memories` works for adding memories. - - `POST /search` works for semantic search. -- Successful end-to-end test with persisted memory: - *"Likes coffee in the morning"* β†’ retrievable via search. βœ… - -### Changed -- Split architecture into **modular stacks**: - - `~/lyra-core` (Relay, Persona-Sidecar, etc.) - - `~/lyra-mem0` (Mem0 OSS memory stack) -- Removed old embedded mem0 containers from the Lyra-Core compose file. -- Added Lyra-Mem0 section in README.md. - -### Next Steps -- Wire **Relay β†’ Mem0 API** (integration not yet complete). -- Add integration tests to verify persistence and retrieval from within Lyra-Core. 
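A sketch of the integration test suggested in the Next Steps above: add one memory, then confirm semantic search returns it, mirroring the "Likes coffee in the morning" check. The `/memories` and `/search` paths match this entry; the port and JSON field names are assumptions about the local compose setup.

```python
import os
import requests

MEM0_URL = os.getenv("MEM0_URL", "http://localhost:8000")  # adjust to your mem0-api port

def test_persist_and_recall():
    # Seed one memory for a throwaway test user.
    requests.post(
        f"{MEM0_URL}/memories",
        json={"user_id": "lyra-test", "messages": [
            {"role": "user", "content": "Likes coffee in the morning"}]},
        timeout=30,
    ).raise_for_status()

    # Then confirm semantic search can retrieve it.
    hits = requests.post(
        f"{MEM0_URL}/search",
        json={"user_id": "lyra-test", "query": "What does the user drink in the morning?"},
        timeout=30,
    ).json()
    assert "coffee" in str(hits).lower()
```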
- ---- - -## 🧠 Lyra-Cortex ############################################################################## - -## [ Cortex - v0.5] -2025-11-13 - -### Added -- **New `reasoning.py` module** - - Async reasoning engine. - - Accepts user prompt, identity, RAG block, and reflection notes. - - Produces draft internal answers. - - Uses primary backend (vLLM). -- **New `reflection.py` module** - - Fully async. - - Produces actionable JSON β€œinternal notes.” - - Enforces strict JSON schema and fallback parsing. - - Forces cloud backend (`backend_override="cloud"`). -- Integrated `refine.py` into Cortex reasoning pipeline: - - New stage between reflection and persona. - - Runs exclusively on primary vLLM backend (MI50). - - Produces final, internally consistent output for downstream persona layer. -- **Backend override system** - - Each LLM call can now select its own backend. - - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. - -- **identity loader** - - Added `identity.py` with `load_identity()` for consistent persona retrieval. - -- **ingest_handler** - - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. - -### Changed -- Unified LLM backend URL handling across Cortex: - - ENV variables must now contain FULL API endpoints. - - Removed all internal path-appending (e.g. `.../v1/completions`). - - `llm_router.py` rewritten to use env-provided URLs as-is. - - Ensures consistent behavior between draft, reflection, refine, and persona. -- **Rebuilt `main.py`** - - Removed old annotation/analysis logic. - - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. - - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). - - Async path throughout Cortex. - -- **Refactored `llm_router.py`** - - Removed old fallback logic during overrides. - - OpenAI requests now use `/v1/chat/completions`. - - Added proper OpenAI Authorization headers. - - Distinct payload format for vLLM vs OpenAI. - - Unified, correct parsing across models. - -- **Simplified Cortex architecture** - - Removed deprecated β€œcontext.py” and old reasoning code. - - Relay completely decoupled from smart behavior. - -- Updated environment specification: - - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. - - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). - - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. - -### Fixed -- Resolved endpoint conflict where: - - Router expected base URLs. - - Refine expected full URLs. - - Refine always fell back due to hitting incorrect endpoint. - - Fixed by standardizing full-URL behavior across entire system. -- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). -- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. -- No more double-routing through vLLM during reflection. -- Corrected async/sync mismatch in multiple locations. -- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. - -### Removed -- Legacy `annotate`, `reason_check` glue logic from old architecture. -- Old backend probing junk code. -- Stale imports and unused modules leftover from previous prototype. - -### Verified -- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. -- refine shows `used_primary_backend: true` and no fallback. -- Manual curl test confirms endpoint accuracy. 
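The backend-override behavior verified above can be pictured with a small routing sketch. This is illustrative only: the real `llm_router.py` signatures may differ, and the Ollama secondary path is omitted for brevity. The key points it demonstrates are that env values are full endpoints used as-is, and that the cloud branch uses the OpenAI chat payload plus an Authorization header while the primary branch uses a vLLM-style completions payload.

```python
import os
import requests

# Env values are FULL endpoints, used as-is (no path appending).
BACKENDS = {
    "primary": (os.getenv("LLM_PRIMARY_URL"), os.getenv("LLM_PRIMARY_MODEL")),
    "cloud":   (os.getenv("LLM_CLOUD_URL"),   os.getenv("LLM_CLOUD_MODEL")),
}

def call_llm(prompt: str, backend_override: str = "primary") -> str:
    url, model = BACKENDS[backend_override]
    if backend_override == "cloud":
        # OpenAI chat-completions payload + Authorization header (reflection runs here).
        headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', '')}"}
        body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
        data = requests.post(url, json=body, headers=headers, timeout=120).json()
        return data["choices"][0]["message"]["content"]
    # vLLM /v1/completions payload (reasoning and refine run here).
    body = {"model": model, "prompt": prompt, "max_tokens": 512}
    data = requests.post(url, json=body, timeout=120).json()
    return data["choices"][0]["text"]
```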
- -### Known Issues -- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. -- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). - -### Pending / Known Issues -- **RAG service does not exist** β€” requires containerized FastAPI service. -- Reasoning layer lacks self-revision loop (deliberate thought cycle). -- No speak/persona generation layer yet (`speak.py` planned). -- Intake summaries not yet routing into RAG or reflection layer. -- No refinement engine between reasoning and speak. - -### Notes -This is the largest structural change to Cortex so far. -It establishes: -- multi-model cognition -- clean layering -- identity + reflection separation -- correct async code -- deterministic backend routing -- predictable JSON reflection - -The system is now ready for: -- refinement loops -- persona-speaking layer -- containerized RAG -- long-term memory integration -- true emergent-behavior experiments - - - -## [ Cortex - v0.4.1] - 2025-11-5 -### Added -- **RAG intergration** - - Added rag.py with query_rag() and format_rag_block(). - - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. - - Synthesized answers and top excerpts are injected into the reasoning prompt. - -### Changed ### -- **Revised /reason endpoint.** - - Now builds unified context blocks: - - [Intake] β†’ recent summaries - - [RAG] β†’ contextual knowledge - - [User Message] β†’ current input - - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. - - Returns cortex_prompt, draft_output, final_output, and normalized reflection. -- **Reflection Pipeline Stability** - - Cleaned parsing to normalize JSON vs. text reflections. - - Added fallback handling for malformed or non-JSON outputs. - - Log system improved to show raw JSON, extracted fields, and normalized summary. -- **Async Summarization (Intake v0.2.1)** - - Intake summaries now run in background threads to avoid blocking Cortex. - - Summaries (L1–L∞) logged asynchronously with [BG] tags. -- **Environment & Networking Fixes** - - Verified .env variables propagate correctly inside the Cortex container. - - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). - - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). - -- **Behavioral Updates** - - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). - - RAG context successfully grounds reasoning outputs. - - Intake and NeoMem confirmed receiving summaries via /add_exchange. - - Log clarity pass: all reflective and contextual blocks clearly labeled. -- **Known Gaps / Next Steps** - - NeoMem Tuning - - Improve retrieval latency and relevance. - - Implement a dedicated /reflections/recent endpoint for Cortex. - - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). -- **Cortex Enhancements** - - Add persistent reflection recall (use prior reflections as meta-context). - - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). - - Tighten temperature and prompt control for factual consistency. -- **RAG Optimization** - -Add source ranking, filtering, and multi-vector hybrid search. - -Cache RAG responses per session to reduce duplicate calls. -- **Documentation / Monitoring** - -Add health route for RAG and Intake summaries. - -Include internal latency metrics in /health endpoint. 
- -Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. - -## [Cortex - v0.3.0] – 2025-10-31 -### Added -- **Cortex Service (FastAPI)** - - New standalone reasoning engine (`cortex/main.py`) with endpoints: - - `GET /health` – reports active backend + NeoMem status. - - `POST /reason` – evaluates `{prompt, response}` pairs. - - `POST /annotate` – experimental text analysis. - - Background NeoMem health monitor (5-minute interval). - -- **Multi-Backend Reasoning Support** - - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. - - Supports: - - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) - - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) - - **Cloud** β†’ OpenAI API - - **Fallback** β†’ llama.cpp (CPU) - - Introduced per-backend model variables: - `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. - -- **Response Normalization Layer** - - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. - - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. - - Prints concise debug previews of merged content. - -- **Environment Simplification** - - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. - - Removed reliance on shared/global env file to prevent cross-contamination. - - Verified Docker Compose networking across containers. - -### Changed -- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. -- Enhanced startup logs to announce active backend, model, URL, and mode. -- Improved error handling with clearer β€œReasoning error” messages. - -### Fixed -- Corrected broken vLLM endpoint routing (`/v1/completions`). -- Stabilized cross-container health reporting for NeoMem. -- Resolved JSON parse failures caused by streaming chunk delimiters. - ---- - -## Next Planned – [v0.4.0] -### Planned Additions -- **Reflection Mode** - - Introduce `REASONING_MODE=factcheck|reflection`. - - Output schema: - ```json - { "insight": "...", "evaluation": "...", "next_action": "..." } - ``` - -- **Cortex-First Pipeline** - - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. - - Allows Lyra to β€œthink before speaking.” - -- **Verifier Stub** - - New `/verify` endpoint for search-based factual grounding. - - Asynchronous external truth checking. - -- **Memory Integration** - - Feed reflective outputs into NeoMem. - - Enable β€œdream” cycles for autonomous self-review. - ---- - -**Status:** 🟒 Stable Core – Multi-backend reasoning operational. -**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.* - ---- - -### [Intake] v0.1.0 - 2025-10-27 - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Currently logs summaries to .log file in /project-lyra/intake-logs/ - ** Next Steps ** - - Feed intake into neomem. - - Generate a daily/hourly/etc overall summary, (IE: Today Brian and Lyra worked on x, y, and z) - - Generate session aware summaries, with its own intake hopper. - - -### [Lyra-Cortex] v0.2.0 β€” 2025-09-26 -**Added -- Integrated **llama-server** on dedicated Cortex VM (Proxmox). -- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs. -- Benchmarked Phi-3.5-mini performance: - - ~18 tokens/sec CPU-only on Ryzen 7 7800X. 
- - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming"). -- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier: - - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval). - - More responsive but over-classifies messages as β€œsalient.” -- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models. - -** Known Issues -- Small models tend to drift or over-classify. -- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models. -- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot. - ---- - -### [Lyra-Cortex] v0.1.0 β€” 2025-09-25 -#### Added -- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD). -- Built **llama.cpp** with `llama-server` target via CMake. -- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model. -- Verified **API compatibility** at `/v1/chat/completions`. -- Local test successful via `curl` β†’ ~523 token response generated. -- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X). -- Confirmed usable for salience scoring, summarization, and lightweight reasoning. +# Project Lyra β€” Modular Changelog +All notable changes to Project Lyra are organized by component. +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/) +and adheres to [Semantic Versioning](https://semver.org/). +# Last Updated: 11-13-25 +--- + +## 🧠 Lyra-Core ############################################################################## + +## [Lyra_RAG v0.1.0] 2025-11-07 +### Added +- Initial standalone RAG module for Project Lyra. +- Persistent ChromaDB vector store (`./chromadb`). +- Importer `rag_chat_import.py` with: + - Recursive folder scanning and category tagging. + - Smart chunking (~5 k chars). + - SHA-1 deduplication and chat-ID metadata. + - Timestamp fields (`file_modified`, `imported_at`). + - Background-safe operation (`nohup`/`tmux`). +- 68 Lyra-category chats imported: + - **6 556 new chunks added** + - **1 493 duplicates skipped** + - **7 997 total vectors** now stored. + +### API +- `/rag/search` FastAPI endpoint implemented (port 7090). +- Supports natural-language queries and returns top related excerpts. +- Added answer synthesis step using `gpt-4o-mini`. + +### Verified +- Successful recall of Lyra-Core development history (v0.3.0 snapshot). +- Correct metadata and category tagging for all new imports. + +### Next Planned +- Optional `where` filter parameter for category/date queries. +- Graceful β€œno results” handler for empty retrievals. +- `rag_docs_import.py` for PDFs and other document types. + +## [Lyra Core v0.3.2 + Web Ui v0.2.0] - 2025-10-28 + +### Added +- ** New UI ** + - Cleaned up UI look and feel. + +- ** Added "sessions" ** + - Now sessions persist over time. + - Ability to create new sessions or load sessions from a previous instance. + - When changing the session, it updates what the prompt is sending relay (doesn't prompt with messages from other sessions). + - Relay is correctly wired in. + +## [Lyra-Core 0.3.1] - 2025-10-09 + +### Added +- **NVGRAM Integration (Full Pipeline Reconnected)** + - Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077). + - Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`. 
+ - Added `.env` variable: + ``` + NVGRAM_API=http://nvgram-api:7077 + ``` + - Verified end-to-end Lyra conversation persistence: + - `relay β†’ nvgram-api β†’ postgres/neo4j β†’ relay β†’ ollama β†’ ui` + - βœ… Memories stored, retrieved, and re-injected successfully. + +### Changed +- Renamed `MEM0_URL` β†’ `NVGRAM_API` across all relay environment configs. +- Updated Docker Compose service dependency order: + - `relay` now depends on `nvgram-api` healthcheck. + - Removed `mem0` references and volumes. +- Minor cleanup to Persona fetch block (null-checks and safer default persona string). + +### Fixed +- Relay startup no longer crashes when NVGRAM is unavailable β€” deferred connection handling. +- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`. +- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON). + +### Goals / Next Steps +- Add salience visualization (e.g., memory weights displayed in injected system message). +- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring. +- Add relay auto-retry for transient 500 responses from NVGRAM. + +--- + +## [Lyra-Core] v0.3.1 - 2025-09-27 +### Changed +- Removed salience filter logic; Cortex is now the default annotator. +- All user messages stored in Mem0; no discard tier applied. + +### Added +- Cortex annotations (`metadata.cortex`) now attached to memories. +- Debug logging improvements: + - Pretty-print Cortex annotations + - Injected prompt preview + - Memory search hit list with scores +- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed. + +### Fixed +- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner. +- Relay no longer β€œhangs” on malformed Cortex outputs. + +--- + +### [Lyra-Core] v0.3.0 β€” 2025-09-26 +#### Added +- Implemented **salience filtering** in Relay: + - `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`. + - Supports `heuristic` and `llm` classification modes. + - LLM-based salience filter integrated with Cortex VM running `llama-server`. +- Logging improvements: + - Added debug logs for salience mode, raw LLM output, and unexpected outputs. + - Fail-closed behavior for unexpected LLM responses. +- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers. +- Verified end-to-end flow: Relay β†’ salience filter β†’ Mem0 add/search β†’ Persona injection β†’ LLM reply. + +#### Changed +- Refactored `server.js` to gate `mem.add()` calls behind salience filter. +- Updated `.env` to support `SALIENCE_MODEL`. + +#### Known Issues +- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient". +- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi"). +- CPU-only inference is functional but limited; larger models recommended once GPU is available. + +--- + +### [Lyra-Core] v0.2.0 β€” 2025-09-24 +#### Added +- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls. +- Implemented `sessionId` support (client-supplied, fallback to `default`). +- Added debug logs for memory add/search. +- Cleaned up Relay structure for clarity. + +--- + +### [Lyra-Core] v0.1.0 β€” 2025-09-23 +#### Added +- First working MVP of **Lyra Core Relay**. +- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible). +- Memory integration with Mem0: + - `POST /memories` on each user message. + - `POST /search` before LLM call. +- Persona Sidecar integration (`GET /current`). 
+- OpenAI GPT + Ollama (Mythomax) support in Relay. +- Simple browser-based chat UI (talks to Relay at `http://:7078`). +- `.env` standardization for Relay + Mem0 + Postgres + Neo4j. +- Working Neo4j + Postgres backing stores for Mem0. +- Initial MVP relay service with raw fetch calls to Mem0. +- Dockerized with basic healthcheck. + +#### Fixed +- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only). +- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`. + +#### Known Issues +- No feedback loop (thumbs up/down) yet. +- Forget/delete flow is manual (via memory IDs). +- Memory latency ~1–4s depending on embedding model. + +--- + +## 🧩 lyra-neomem (used to be NVGRAM / Lyra-Mem0) ############################################################################## + +## [NeoMem 0.1.2] - 2025-10-27 +### Changed +- **Renamed NVGRAM to neomem** + - All future updates will be under the name NeoMem. + - Features have not changed. + +## [NVGRAM 0.1.1] - 2025-10-08 +### Added +- **Async Memory Rewrite (Stability + Safety Patch)** + - Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes. + - Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`). + - Implemented `flatten_messages()` helper in API layer to clean malformed payloads. + - Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware). + - Health endpoint (`/health`) now returns structured JSON `{status, version, service}`. + - Startup logs now include **sanitized embedder config** with API keys masked for safety: + ``` + >>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}} + βœ… Connected to Neo4j on attempt 1 + 🧠 NVGRAM v0.1.1 β€” Neural Vectorized Graph Recall and Memory initialized + ``` + +### Changed +- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes. +- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`. +- Removed redundant `FastAPI()` app reinitialization. +- Updated internal logging to INFO-level timing format: + 2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms) +- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) β†’ will migrate to `lifespan` handler in v0.1.2. + +### Fixed +- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content. +- Masked API key leaks from boot logs. +- Ensured Neo4j reconnects gracefully on first retry. + +### Goals / Next Steps +- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema. +- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall. +- Migrate from deprecated `on_event` β†’ `lifespan` pattern in 0.1.2. + +--- + +## [NVGRAM 0.1.0] - 2025-10-07 +### Added +- **Initial fork of Mem0 β†’ NVGRAM**: + - Created a fully independent local-first memory engine based on Mem0 OSS. + - Renamed all internal modules, Docker services, and environment variables from `mem0` β†’ `nvgram`. + - New service name: **`nvgram-api`**, default port **7077**. + - Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core. + - Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends. 
+ - Verified clean startup: + ``` + βœ… Connected to Neo4j on attempt 1 + INFO: Uvicorn running on http://0.0.0.0:7077 + ``` + - `/docs` and `/openapi.json` confirmed reachable and functional. + +### Changed +- Removed dependency on the external `mem0ai` SDK β€” all logic now local. +- Re-pinned requirements: + - fastapi==0.115.8 + - uvicorn==0.34.0 + - pydantic==2.10.4 + - python-dotenv==1.0.1 + - psycopg>=3.2.8 + - ollama +- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths. + +### Goals / Next Steps +- Integrate NVGRAM as the new default backend in Lyra Relay. +- Deprecate remaining Mem0 references and archive old configs. +- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.). + +--- + +## [Lyra-Mem0 0.3.2] - 2025-10-05 +### Added +- Support for **Ollama LLM reasoning** alongside OpenAI embeddings: + - Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`. + - Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`. + - Split processing pipeline: + - Embeddings β†’ OpenAI `text-embedding-3-small` + - LLM β†’ Local Ollama (`http://10.0.0.3:11434/api/chat`). +- Added `.env.3090` template for self-hosted inference nodes. +- Integrated runtime diagnostics and seeder progress tracking: + - File-level + message-level progress bars. + - Retry/back-off logic for timeouts (3 attempts). + - Event logging (`ADD / UPDATE / NONE`) for every memory record. +- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers. +- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090). + +### Changed +- Updated `main.py` configuration block to load: + - `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`. + - Fallback to OpenAI if Ollama unavailable. +- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`. +- Normalized `.env` loading so `mem0-api` and host environment share identical values. +- Improved seeder logging and progress telemetry for clearer diagnostics. +- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs. + +### Fixed +- Resolved crash during startup: + `TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`. +- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors. +- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests. +- β€œUnknown event” warnings now safely ignored (no longer break seeding loop). +- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`). + +### Observations +- Stable GPU utilization: ~8 GB VRAM @ 92 % load, β‰ˆ 67 Β°C under sustained seeding. +- Next revision will re-format seed JSON to preserve `role` context (user vs assistant). + +--- + +## [Lyra-Mem0 0.3.1] - 2025-10-03 +### Added +- HuggingFace TEI integration (local 3090 embedder). +- Dual-mode environment switch between OpenAI cloud and local. +- CSV export of memories from Postgres (`payload->>'data'`). + +### Fixed +- `.env` CRLF vs LF line ending issues. +- Local seeding now possible via huggingface server running + +--- + +## [Lyra-mem0 0.3.0] +### Added +- Support for **Ollama embeddings** in Mem0 OSS container: + - Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`. + - Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`. 
+ - Installed `ollama` Python client into custom API container image. +- `.env.3090` file created for external embedding mode (3090 machine): + - EMBEDDER_PROVIDER=ollama + - EMBEDDER_MODEL=mxbai-embed-large + - OLLAMA_HOST=http://10.0.0.3:11434 +- Workflow to support **multiple embedding modes**: + 1. Fast LAN-based 3090/Ollama embeddings + 2. Local-only CPU embeddings (Lyra Cortex VM) + 3. OpenAI fallback embeddings + +### Changed +- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`. +- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`. +- Updated `requirements.txt` to include `ollama` package. +- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`). +- Tested new embeddings path with curl `/memories` API call. + +### Fixed +- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`). +- Fixed config overwrite issue where rebuilding container restored stock `main.py`. +- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim. + +-- + +## [Lyra-mem0 v0.2.1] + +### Added +- **Seeding pipeline**: + - Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0. + - Implemented incremental seeding option (skip existing memories, only add new ones). + - Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check. +- **Ollama embedding support** in Mem0 OSS container: + - Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`. + - Created `.env.3090` profile for LAN-connected 3090 machine with Ollama. + - Set up three embedding modes: + 1. Fast LAN-based 3090/Ollama + 2. Local-only CPU model (Lyra Cortex VM) + 3. OpenAI fallback + +### Changed +- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends. +- Mounted host `main.py` into container so local edits persist across rebuilds. +- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles. +- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`. +- Updated `requirements.txt` with `ollama` dependency. +- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP). +- Added logging to confirm model pulls and embedding requests. + +### Fixed +- Seeder process originally failed on old memories β€” now skips duplicates and continues batch. +- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image. +- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild. +- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas. + +### Notes +- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI’s schema and avoid Neo4j errors). +- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors. +- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloudβ†’Local sync. + +--- + +## [Lyra-Mem0 v0.2.0] - 2025-09-30 +### Added +- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/` + - Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking. 
+ - Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server. +- Verified REST API functionality: + - `POST /memories` works for adding memories. + - `POST /search` works for semantic search. +- Successful end-to-end test with persisted memory: + *"Likes coffee in the morning"* β†’ retrievable via search. βœ… + +### Changed +- Split architecture into **modular stacks**: + - `~/lyra-core` (Relay, Persona-Sidecar, etc.) + - `~/lyra-mem0` (Mem0 OSS memory stack) +- Removed old embedded mem0 containers from the Lyra-Core compose file. +- Added Lyra-Mem0 section in README.md. + +### Next Steps +- Wire **Relay β†’ Mem0 API** (integration not yet complete). +- Add integration tests to verify persistence and retrieval from within Lyra-Core. + +--- + +## 🧠 Lyra-Cortex ############################################################################## + +## [ Cortex - v0.5] -2025-11-13 + +### Added +- **New `reasoning.py` module** + - Async reasoning engine. + - Accepts user prompt, identity, RAG block, and reflection notes. + - Produces draft internal answers. + - Uses primary backend (vLLM). +- **New `reflection.py` module** + - Fully async. + - Produces actionable JSON β€œinternal notes.” + - Enforces strict JSON schema and fallback parsing. + - Forces cloud backend (`backend_override="cloud"`). +- Integrated `refine.py` into Cortex reasoning pipeline: + - New stage between reflection and persona. + - Runs exclusively on primary vLLM backend (MI50). + - Produces final, internally consistent output for downstream persona layer. +- **Backend override system** + - Each LLM call can now select its own backend. + - Enables multi-LLM cognition: Reflection β†’ cloud, Reasoning β†’ primary. + +- **identity loader** + - Added `identity.py` with `load_identity()` for consistent persona retrieval. + +- **ingest_handler** + - Async stub created for future Intake β†’ NeoMem β†’ RAG pipeline. + +### Changed +- Unified LLM backend URL handling across Cortex: + - ENV variables must now contain FULL API endpoints. + - Removed all internal path-appending (e.g. `.../v1/completions`). + - `llm_router.py` rewritten to use env-provided URLs as-is. + - Ensures consistent behavior between draft, reflection, refine, and persona. +- **Rebuilt `main.py`** + - Removed old annotation/analysis logic. + - New structure: load identity β†’ get RAG β†’ reflect β†’ reason β†’ return draft+notes. + - Routes now clean and minimal (`/reason`, `/ingest`, `/health`). + - Async path throughout Cortex. + +- **Refactored `llm_router.py`** + - Removed old fallback logic during overrides. + - OpenAI requests now use `/v1/chat/completions`. + - Added proper OpenAI Authorization headers. + - Distinct payload format for vLLM vs OpenAI. + - Unified, correct parsing across models. + +- **Simplified Cortex architecture** + - Removed deprecated β€œcontext.py” and old reasoning code. + - Relay completely decoupled from smart behavior. + +- Updated environment specification: + - `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`. + - `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama). + - `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`. + +### Fixed +- Resolved endpoint conflict where: + - Router expected base URLs. + - Refine expected full URLs. + - Refine always fell back due to hitting incorrect endpoint. + - Fixed by standardizing full-URL behavior across entire system. 
+- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax). +- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints. +- No more double-routing through vLLM during reflection. +- Corrected async/sync mismatch in multiple locations. +- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic. + +### Removed +- Legacy `annotate`, `reason_check` glue logic from old architecture. +- Old backend probing junk code. +- Stale imports and unused modules leftover from previous prototype. + +### Verified +- Cortex β†’ vLLM (MI50) β†’ refine β†’ final_output now functioning correctly. +- refine shows `used_primary_backend: true` and no fallback. +- Manual curl test confirms endpoint accuracy. + +### Known Issues +- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this. +- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned). + +### Pending / Known Issues +- **RAG service does not exist** β€” requires containerized FastAPI service. +- Reasoning layer lacks self-revision loop (deliberate thought cycle). +- No speak/persona generation layer yet (`speak.py` planned). +- Intake summaries not yet routing into RAG or reflection layer. +- No refinement engine between reasoning and speak. + +### Notes +This is the largest structural change to Cortex so far. +It establishes: +- multi-model cognition +- clean layering +- identity + reflection separation +- correct async code +- deterministic backend routing +- predictable JSON reflection + +The system is now ready for: +- refinement loops +- persona-speaking layer +- containerized RAG +- long-term memory integration +- true emergent-behavior experiments + + + +## [ Cortex - v0.4.1] - 2025-11-5 +### Added +- **RAG intergration** + - Added rag.py with query_rag() and format_rag_block(). + - Cortex now queries the local RAG API (http://10.0.0.41:7090/rag/search) for contextual augmentation. + - Synthesized answers and top excerpts are injected into the reasoning prompt. + +### Changed ### +- **Revised /reason endpoint.** + - Now builds unified context blocks: + - [Intake] β†’ recent summaries + - [RAG] β†’ contextual knowledge + - [User Message] β†’ current input + - Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation. + - Returns cortex_prompt, draft_output, final_output, and normalized reflection. +- **Reflection Pipeline Stability** + - Cleaned parsing to normalize JSON vs. text reflections. + - Added fallback handling for malformed or non-JSON outputs. + - Log system improved to show raw JSON, extracted fields, and normalized summary. +- **Async Summarization (Intake v0.2.1)** + - Intake summaries now run in background threads to avoid blocking Cortex. + - Summaries (L1–L∞) logged asynchronously with [BG] tags. +- **Environment & Networking Fixes** + - Verified .env variables propagate correctly inside the Cortex container. + - Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net). + - Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host). + +- **Behavioral Updates** + - Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers). + - RAG context successfully grounds reasoning outputs. + - Intake and NeoMem confirmed receiving summaries via /add_exchange. + - Log clarity pass: all reflective and contextual blocks clearly labeled. 
+- **Known Gaps / Next Steps** + - NeoMem Tuning + - Improve retrieval latency and relevance. + - Implement a dedicated /reflections/recent endpoint for Cortex. + - Migrate to Cortex-first ingestion (Relay β†’ Cortex β†’ NeoMem). +- **Cortex Enhancements** + - Add persistent reflection recall (use prior reflections as meta-context). + - Improve reflection JSON structure ("insight", "evaluation", "next_action" β†’ guaranteed fields). + - Tighten temperature and prompt control for factual consistency. +- **RAG Optimization** + -Add source ranking, filtering, and multi-vector hybrid search. + -Cache RAG responses per session to reduce duplicate calls. +- **Documentation / Monitoring** + -Add health route for RAG and Intake summaries. + -Include internal latency metrics in /health endpoint. + +Consolidate logs into unified β€œLyra Cortex Console” for tracing all module calls. + +## [Cortex - v0.3.0] – 2025-10-31 +### Added +- **Cortex Service (FastAPI)** + - New standalone reasoning engine (`cortex/main.py`) with endpoints: + - `GET /health` – reports active backend + NeoMem status. + - `POST /reason` – evaluates `{prompt, response}` pairs. + - `POST /annotate` – experimental text analysis. + - Background NeoMem health monitor (5-minute interval). + +- **Multi-Backend Reasoning Support** + - Added environment-driven backend selection via `LLM_FORCE_BACKEND`. + - Supports: + - **Primary** β†’ vLLM (MI50 node @ 10.0.0.43) + - **Secondary** β†’ Ollama (3090 node @ 10.0.0.3) + - **Cloud** β†’ OpenAI API + - **Fallback** β†’ llama.cpp (CPU) + - Introduced per-backend model variables: + `LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`. + +- **Response Normalization Layer** + - Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON. + - Handles Ollama’s multi-line streaming and Mythomax’s missing punctuation issues. + - Prints concise debug previews of merged content. + +- **Environment Simplification** + - Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file. + - Removed reliance on shared/global env file to prevent cross-contamination. + - Verified Docker Compose networking across containers. + +### Changed +- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend. +- Enhanced startup logs to announce active backend, model, URL, and mode. +- Improved error handling with clearer β€œReasoning error” messages. + +### Fixed +- Corrected broken vLLM endpoint routing (`/v1/completions`). +- Stabilized cross-container health reporting for NeoMem. +- Resolved JSON parse failures caused by streaming chunk delimiters. + +--- + +## Next Planned – [v0.4.0] +### Planned Additions +- **Reflection Mode** + - Introduce `REASONING_MODE=factcheck|reflection`. + - Output schema: + ```json + { "insight": "...", "evaluation": "...", "next_action": "..." } + ``` + +- **Cortex-First Pipeline** + - UI β†’ Cortex β†’ [Reflection + Verifier + Memory] β†’ Speech LLM β†’ User. + - Allows Lyra to β€œthink before speaking.” + +- **Verifier Stub** + - New `/verify` endpoint for search-based factual grounding. + - Asynchronous external truth checking. + +- **Memory Integration** + - Feed reflective outputs into NeoMem. + - Enable β€œdream” cycles for autonomous self-review. + +--- + +**Status:** 🟒 Stable Core – Multi-backend reasoning operational. 
**Next milestone:** *v0.4.0 β€” Reflection Mode + Thought Pipeline orchestration.*

---

### [Intake] v0.1.0 - 2025-10-27
 - Receives messages from Relay and summarizes them in a cascading format.
 - Continues to summarize small batches of exchanges while also generating large-scale conversational summaries (L20).
 - Currently logs summaries to a .log file in /project-lyra/intake-logs/
 **Next Steps**
 - Feed Intake output into NeoMem.
 - Generate daily/hourly/etc. overall summaries (e.g., "Today Brian and Lyra worked on x, y, and z").
 - Generate session-aware summaries, each with its own intake hopper.


### [Lyra-Cortex] v0.2.0 β€” 2025-09-26
#### Added
- Integrated **llama-server** on dedicated Cortex VM (Proxmox).
- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs.
- Benchmarked Phi-3.5-mini performance:
  - ~18 tokens/sec CPU-only on Ryzen 7 7800X.
  - Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming").
- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier:
  - Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval).
  - More responsive but over-classifies messages as β€œsalient.”
- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models.

#### Known Issues
- Small models tend to drift or over-classify.
- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models.
- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot.

---

### [Lyra-Cortex] v0.1.0 β€” 2025-09-25
#### Added
- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD).
- Built **llama.cpp** with `llama-server` target via CMake.
- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model.
- Verified **API compatibility** at `/v1/chat/completions`.
- Local test successful via `curl` β†’ ~523 token response generated.
- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X).
- Confirmed usable for salience scoring, summarization, and lightweight reasoning.
diff --git a/core/README.md b/README.md
similarity index 97%
rename from core/README.md
rename to README.md
index cf265e4..f8a1eed 100644
--- a/core/README.md
+++ b/README.md
@@ -1,265 +1,265 @@
##### Project Lyra - README v0.3.0 - needs fixing #####

Lyra is a modular persistent AI companion system.
It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**,
with optional subconscious annotation powered by **Cortex VM** running local LLMs.

## Mission Statement ##
 The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with its own executive function: mention something in passing, and Lyra remembers it and reminds you of it later.

---

## Structure ##
 Project Lyra runs as a series of Docker containers that operate independently of each other but are all networked together. Just as the brain has regions, Lyra has modules:
 ## A. VM 100 - lyra-core:
 1. **Core v0.3.1 - Docker Stack**
 - Relay - (Docker container) - The main harness that connects the modules together and accepts input from the user.
 - UI - (HTML) - How the user communicates with Lyra.
ATM its a typical instant message interface, but plans are to make it much more than that. - - Persona - (docker container) - This is the personality of lyra, set how you want her to behave. Give specific instructions for output. Basically prompt injection. - - All of this is built and controlled by a single .env and docker-compose.lyra.yml. - 2. **NeoMem v0.1.0 - (docker stack) - - NeoMem is Lyra's main long term memory data base. It is a fork of mem0 oss. Uses vector databases and graph. - - NeoMem launches with a single separate docker-compose.neomem.yml. - - ## B. VM 101 - lyra - cortex - 3. ** Cortex - VM containing docker stack - - This is the working reasoning layer of Lyra. - - Built to be flexible in deployment. Run it locally or remotely (via wan/lan) - - Intake v0.1.0 - (docker Container) gives conversations context and purpose - - Intake takes the last N exchanges and summarizes them into coherrent short term memories. - - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20 etc. - - Keeps the bot aware of what is going on with out having to send it the whole chat every time. - - Cortex - Docker container containing: - - Reasoning Layer - - TBD - - Reflect - (docker continer) - Not yet implemented, road map. - - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherrent thoughts, reduces the noise. - - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams). - - This stage is not yet built, this is just an idea. - - ## C. Remote LLM APIs: - 3. **AI Backends - - Lyra doesnt run models her self, she calls up APIs. - - Endlessly customizable as long as it outputs to the same schema. - ---- - - -## πŸš€ Features ## - -# Lyra-Core VM (VM100) -- **Relay **: - - The main harness and orchestrator of Lyra. - - OpenAI-compatible endpoint: `POST /v1/chat/completions` - - Injects persona + relevant memories into every LLM call - - Routes all memory storage/retrieval through **NeoMem** - - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`) - -- **NeoMem (Memory Engine)**: - - Forked from Mem0 OSS and fully independent. - - Drop-in compatible API (`/memories`, `/search`). - - Local-first: runs on FastAPI with Postgres + Neo4j. - - No external SDK dependencies. - - Default service: `neomem-api` (port 7077). - - Capable of adding new memories and updating previous memories. Compares existing embeddings and performs in place updates when a memory is judged to be a semantic match. - -- **UI**: - - Lightweight static HTML chat page. - - Connects to Relay at `http://:7078`. - - Nice cyberpunk theme! - - Saves and loads sessions, which then in turn send to relay. - -# Beta Lyrae (RAG Memory DB) - added 11-3-25 -- **RAG Knowledge DB - Beta Lyrae (sheliak)** - - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra. - - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation. 
- The system uses: - - **ChromaDB** for persistent vector storage - - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity - - **FastAPI** (port 7090) for the `/rag/search` REST endpoint - - Directory Layout - rag/ - β”œβ”€β”€ rag_chat_import.py # imports JSON chat logs - β”œβ”€β”€ rag_docs_import.py # (planned) PDF/EPUB/manual importer - β”œβ”€β”€ rag_build.py # legacy single-folder builder - β”œβ”€β”€ rag_query.py # command-line query helper - β”œβ”€β”€ rag_api.py # FastAPI service providing /rag/search - β”œβ”€β”€ chromadb/ # persistent vector store - β”œβ”€β”€ chatlogs/ # organized source data - β”‚ β”œβ”€β”€ poker/ - β”‚ β”œβ”€β”€ work/ - β”‚ β”œβ”€β”€ lyra/ - β”‚ β”œβ”€β”€ personal/ - β”‚ └── ... - └── import.log # progress log for batch runs - - **OpenAI chatlog importer. - - Takes JSON formatted chat logs and imports it to the RAG. - - **fetures include:** - - Recursive folder indexing with **category detection** from directory name - - Smart chunking for long messages (5 000 chars per slice) - - Automatic deduplication using SHA-1 hash of file + chunk - - Timestamps for both file modification and import time - - Full progress logging via tqdm - - Safe to run in background with nohup … & - - Metadata per chunk: - ```json - { - "chat_id": "", - "chunk_index": 0, - "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json", - "title": "cortex LLMs 11-1-25", - "role": "assistant", - "category": "lyra", - "type": "chat", - "file_modified": "2025-11-06T23:41:02", - "imported_at": "2025-11-07T03:55:00Z" - }``` - -# Cortex VM (VM101, CT201) - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. - - - **Intake v0.1.1 ** - - Recieves messages from relay and summarizes them in a cascading format. - - Continues to summarize smaller amounts of exhanges while also generating large scale conversational summaries. (L20) - - Intake then sends to cortex for self reflection, neomem for memory consolidation. - - - **Reflect ** - -TBD - -# Self hosted vLLM server # - - **CT201 main reasoning orchestrator.** - - This is the internal brain of Lyra. - - Running in a privellaged LXC. - - Currently a locally served LLM running on a Radeon Instinct HI50, using a customized version of vLLM that lets it use ROCm. - - Accessible via 10.0.0.43:8000/v1/completions. - - **Stack Flow** - - [Proxmox Host] - └── loads AMDGPU driver - └── boots CT201 (order=2) - - [CT201 GPU Container] - β”œβ”€β”€ lyra-start-vllm.sh β†’ starts vLLM ROCm model server - β”œβ”€β”€ lyra-vllm.service β†’ runs the above automatically - β”œβ”€β”€ lyra-core.service β†’ launches Cortex + Intake Docker stack - └── Docker Compose β†’ runs Cortex + Intake containers - - [Cortex Container] - β”œβ”€β”€ Listens on port 7081 - β”œβ”€β”€ Talks to NVGRAM (mem API) + Intake - └── Main relay between Lyra UI ↔ memory ↔ model - - [Intake Container] - β”œβ”€β”€ Listens on port 7080 - β”œβ”€β”€ Summarizes every few exchanges - β”œβ”€β”€ Writes summaries to /app/logs/summaries.log - └── Future: sends summaries β†’ Cortex for reflection - - -# Additional information available in the trilium docs. # ---- - -## πŸ“¦ Requirements - -- Docker + Docker Compose -- Postgres + Neo4j (for NeoMem) -- Access to an open AI or ollama style API. 
-- OpenAI API key (for Relay fallback LLMs) - -**Dependencies:** - - fastapi==0.115.8 - - uvicorn==0.34.0 - - pydantic==2.10.4 - - python-dotenv==1.0.1 - - psycopg>=3.2.8 - - ollama - ---- - -πŸ”Œ Integration Notes - -Lyra-Core connects to neomem-api:8000 inside Docker or localhost:7077 locally. - -API endpoints remain identical to Mem0 (/memories, /search). - -History and entity graphs managed internally via Postgres + Neo4j. - ---- - -🧱 Architecture Snapshot - - User β†’ Relay β†’ Cortex - ↓ - [RAG Search] - ↓ - [Reflection Loop] - ↓ - Intake (async summaries) - ↓ - NeoMem (persistent memory) - -**Cortex v0.4.1 introduces the first fully integrated reasoning loop.** -- Data Flow: - - User message enters Cortex via /reason. - - Cortex assembles context: - - Intake summaries (short-term memory) - - RAG contextual data (knowledge base) - - LLM generates initial draft (call_llm). - - Reflection loop critiques and refines the answer. - - Intake asynchronously summarizes and sends snapshots to NeoMem. - -RAG API Configuration: -Set RAG_API_URL in .env (default: http://localhost:7090). - ---- - -## Setup and Operation ## - -## Beta Lyrae - RAG memory system ## -**Requirements** - -Env= python 3.10+ - -Dependences: pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq - -Persistent storage path: ./chromadb (can be moved to /mnt/data/lyra_rag_db) - -**Import Chats** - - Chats need to be formatted into the correct format of - ``` - "messages": [ - { - "role:" "user", - "content": "Message here" - }, - "messages": [ - { - "role:" "assistant", - "content": "Message here" - },``` - - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight. - - run "python3 rag_chat_import.py", chats will then be imported automatically. For reference, it took 32 Minutes to import 68 Chat logs (aprox 10.3MB). - -**Build API Server** - - Run: rag_build.py, this automatically builds the chromaDB using data saved in the /chatlogs/ folder. (docs folder to be added in future.) - - Run: rag_api.py or ```uvicorn rag_api:app --host 0.0.0.0 --port 7090``` - -**Query** - - Run: python3 rag_query.py "Question here?" - - For testing a curl command can reach it too - ``` - curl -X POST http://127.0.0.1:7090/rag/search \ - -H "Content-Type: application/json" \ - -d '{ - "query": "What is the current state of Cortex and Project Lyra?", - "where": {"category": "lyra"} - }' - ``` - -# Beta Lyrae - RAG System - -## πŸ“– License -NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0). -This fork retains the original Apache 2.0 license and adds local modifications. -Β© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0. - +##### Project Lyra - README v0.3.0 - needs fixing ##### + +Lyra is a modular persistent AI companion system. +It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**, +with optional subconscious annotation powered by **Cortex VM** running local LLMs. + +## Mission Statement ## + The point of project lyra is to give an AI chatbot more abilities than a typical chatbot. typical chat bots are essentially amnesic and forget everything about your project. Lyra helps keep projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/data base/ co-creator/collaborattor all with its own executive function. Say something in passing, Lyra remembers it then reminds you of it later. 
+
+---
+
+## Structure ##
+ Project Lyra exists as a series of Docker containers that run independently of each other but are all networked together. Think of it as how the brain has regions: Lyra has modules.
+ ## A. VM 100 - lyra-core:
+ 1. **Core v0.3.1 - Docker Stack**
+  - Relay - (docker container) - The main harness that connects the modules together and accepts input from the user.
+  - UI - (HTML) - This is how the user communicates with Lyra. At the moment it's a typical instant-message interface, but plans are to make it much more than that.
+  - Persona - (docker container) - This is the personality of Lyra; set how you want her to behave and give specific instructions for output. Basically prompt injection.
+  - All of this is built and controlled by a single .env and docker-compose.lyra.yml.
+ 2. **NeoMem v0.1.0 - (docker stack)**
+  - NeoMem is Lyra's main long-term memory database. It is a fork of Mem0 OSS. Uses vector and graph databases.
+  - NeoMem launches with a single separate docker-compose.neomem.yml.
+
+ ## B. VM 101 - lyra-cortex
+ 3. **Cortex - VM containing docker stack**
+  - This is the working reasoning layer of Lyra.
+  - Built to be flexible in deployment. Run it locally or remotely (via WAN/LAN).
+  - Intake v0.1.0 - (docker container) gives conversations context and purpose.
+   - Intake takes the last N exchanges and summarizes them into coherent short-term memories.
+   - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20, etc.
+   - Keeps the bot aware of what is going on without having to send it the whole chat every time.
+  - Cortex - Docker container containing:
+   - Reasoning Layer
+   - TBD
+  - Reflect - (docker container) - Not yet implemented; roadmap.
+   - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherent thoughts and reduces the noise.
+   - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams).
+   - This stage is not yet built; this is just an idea.
+
+ ## C. Remote LLM APIs:
+ 4. **AI Backends**
+  - Lyra doesn't run models herself; she calls out to APIs.
+  - Endlessly customizable as long as the backend outputs to the same schema.
+
+---
+
+
+## 🚀 Features ##
+
+# Lyra-Core VM (VM100)
+- **Relay**:
+  - The main harness and orchestrator of Lyra.
+  - OpenAI-compatible endpoint: `POST /v1/chat/completions`
+  - Injects persona + relevant memories into every LLM call
+  - Routes all memory storage/retrieval through **NeoMem**
+  - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`)
+
+- **NeoMem (Memory Engine)**:
+  - Forked from Mem0 OSS and fully independent.
+  - Drop-in compatible API (`/memories`, `/search`).
+  - Local-first: runs on FastAPI with Postgres + Neo4j.
+  - No external SDK dependencies.
+  - Default service: `neomem-api` (port 7077).
+  - Capable of adding new memories and updating previous memories. Compares existing embeddings and performs in-place updates when a memory is judged to be a semantic match.
+
+- **UI**:
+  - Lightweight static HTML chat page.
+  - Connects to Relay at `http://:7078`.
+  - Nice cyberpunk theme!
+  - Saves and loads sessions, which are then sent to Relay.
+
+# Beta Lyrae (RAG Memory DB) - added 11-3-25
+- **RAG Knowledge DB - Beta Lyrae (sheliak)**
+  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
+  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
+  - The system uses:
+    - **ChromaDB** for persistent vector storage
+    - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
+    - **FastAPI** (port 7090) for the `/rag/search` REST endpoint
+  - Directory Layout
+      rag/
+      ├── rag_chat_import.py      # imports JSON chat logs
+      ├── rag_docs_import.py      # (planned) PDF/EPUB/manual importer
+      ├── rag_build.py            # legacy single-folder builder
+      ├── rag_query.py            # command-line query helper
+      ├── rag_api.py              # FastAPI service providing /rag/search
+      ├── chromadb/               # persistent vector store
+      ├── chatlogs/               # organized source data
+      │   ├── poker/
+      │   ├── work/
+      │   ├── lyra/
+      │   ├── personal/
+      │   └── ...
+      └── import.log              # progress log for batch runs
+  - **OpenAI chatlog importer**
+    - Takes JSON-formatted chat logs and imports them into the RAG store.
+    - **Features include:**
+      - Recursive folder indexing with **category detection** from directory name
+      - Smart chunking for long messages (5,000 chars per slice)
+      - Automatic deduplication using SHA-1 hash of file + chunk
+      - Timestamps for both file modification and import time
+      - Full progress logging via tqdm
+      - Safe to run in the background with `nohup … &`
+  - Metadata per chunk:
+    ```json
+    {
+      "chat_id": "",
+      "chunk_index": 0,
+      "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
+      "title": "cortex LLMs 11-1-25",
+      "role": "assistant",
+      "category": "lyra",
+      "type": "chat",
+      "file_modified": "2025-11-06T23:41:02",
+      "imported_at": "2025-11-07T03:55:00Z"
+    }
+    ```
+
+# Cortex VM (VM101, CT201)
+  - **CT201 main reasoning orchestrator.**
+  - This is the internal brain of Lyra.
+  - Running in a privileged LXC.
+  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
+  - Accessible via `10.0.0.43:8000/v1/completions`.
+
+  - **Intake v0.1.1**
+    - Receives messages from Relay and summarizes them in a cascading format.
+    - Continues to summarize small batches of exchanges while also generating large-scale conversational summaries. (L20)
+    - Intake then sends its output to Cortex for self-reflection and to NeoMem for memory consolidation.
+
+  - **Reflect**
+    - TBD
+
+# Self-hosted vLLM server #
+  - **CT201 main reasoning orchestrator.**
+  - This is the internal brain of Lyra.
+  - Running in a privileged LXC.
+  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
+  - Accessible via `10.0.0.43:8000/v1/completions`.
+  - **Stack Flow**
+
+   [Proxmox Host]
+   └── loads AMDGPU driver
+   └── boots CT201 (order=2)
+
+   [CT201 GPU Container]
+   ├── lyra-start-vllm.sh   → starts vLLM ROCm model server
+   ├── lyra-vllm.service    → runs the above automatically (unit sketch below)
+   ├── lyra-core.service    → launches Cortex + Intake Docker stack
+   └── Docker Compose       → runs Cortex + Intake containers
+
+   [Cortex Container]
+   ├── Listens on port 7081
+   ├── Talks to NVGRAM (mem API) + Intake
+   └── Main relay between Lyra UI ↔ memory ↔ model
+
+   [Intake Container]
+   ├── Listens on port 7080
+   ├── Summarizes every few exchanges
+   ├── Writes summaries to /app/logs/summaries.log
+   └── Future: sends summaries → Cortex for reflection
+
+
+# Additional information available in the trilium docs. #
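+
+The systemd units referenced in the stack flow are not reproduced in this repo snapshot. As a rough sketch only — assuming the `/root/vllm` docker-compose layout described in `vllm-mi50.md`, rather than the actual contents of `lyra-start-vllm.sh` — a unit like `lyra-vllm.service` could be created inside CT201 along these lines:
+
+```bash
+# Hypothetical sketch, not the shipped unit. Assumes the compose file from
+# vllm-mi50.md lives in /root/vllm inside CT201.
+cat > /etc/systemd/system/lyra-vllm.service <<'EOF'
+[Unit]
+Description=Lyra vLLM (MI50 / gfx906) model server
+After=docker.service
+Requires=docker.service
+
+[Service]
+Type=oneshot
+RemainAfterExit=true
+WorkingDirectory=/root/vllm
+ExecStart=/usr/bin/docker compose up -d
+ExecStop=/usr/bin/docker compose down
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload
+systemctl enable --now lyra-vllm.service
+```
+
+`RemainAfterExit=true` keeps the unit marked active after `docker compose up -d` returns, so `systemctl status lyra-vllm` reflects whether bring-up succeeded.
+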
+---
+
+## 📦 Requirements
+
+- Docker + Docker Compose
+- Postgres + Neo4j (for NeoMem)
+- Access to an OpenAI- or Ollama-style API.
+- OpenAI API key (for Relay fallback LLMs)
+
+**Dependencies:**
+ - fastapi==0.115.8
+ - uvicorn==0.34.0
+ - pydantic==2.10.4
+ - python-dotenv==1.0.1
+ - psycopg>=3.2.8
+ - ollama
+
+---
+
+## 🔌 Integration Notes
+
+Lyra-Core connects to `neomem-api:8000` inside Docker or `localhost:7077` locally.
+
+API endpoints remain identical to Mem0 (`/memories`, `/search`).
+
+History and entity graphs are managed internally via Postgres + Neo4j.
+
+---
+
+## 🧱 Architecture Snapshot
+
+ User → Relay → Cortex
+        ↓
+   [RAG Search]
+        ↓
+  [Reflection Loop]
+        ↓
+ Intake (async summaries)
+        ↓
+ NeoMem (persistent memory)
+
+**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
+- Data Flow:
+  - User message enters Cortex via /reason.
+  - Cortex assembles context:
+    - Intake summaries (short-term memory)
+    - RAG contextual data (knowledge base)
+  - LLM generates initial draft (`call_llm`).
+  - Reflection loop critiques and refines the answer.
+  - Intake asynchronously summarizes and sends snapshots to NeoMem.
+
+**RAG API Configuration:**
+Set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).
+
+---
+
+## Setup and Operation ##
+
+## Beta Lyrae - RAG memory system ##
+**Requirements**
+ - Environment: Python 3.10+
+ - Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
+ - Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)
+
+**Import Chats**
+ - Chats need to be formatted as follows:
+ ```json
+ {
+   "messages": [
+     { "role": "user", "content": "Message here" },
+     { "role": "assistant", "content": "Message here" }
+   ]
+ }
+ ```
+ - Organize the chats into categorical folders. This step is optional, but it helped me keep it straight.
+ - Run `python3 rag_chat_import.py`; chats will then be imported automatically. For reference, it took 32 minutes to import 68 chat logs (approx. 10.3 MB).
+
+**Build API Server**
+ - Run `rag_build.py`; this automatically builds the ChromaDB store from data saved in the `/chatlogs/` folder. (A docs folder is to be added in the future.)
+ - Run `rag_api.py` or `uvicorn rag_api:app --host 0.0.0.0 --port 7090`
+
+**Query**
+ - Run: `python3 rag_query.py "Question here?"`
+ - For testing, a curl command can reach it too:
+ ```
+ curl -X POST http://127.0.0.1:7090/rag/search \
+   -H "Content-Type: application/json" \
+   -d '{
+     "query": "What is the current state of Cortex and Project Lyra?",
+     "where": {"category": "lyra"}
+   }'
+ ```
+
+## 📖 License
+NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
+This fork retains the original Apache 2.0 license and adds local modifications.
+© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.
+
diff --git a/vllm-mi50.md b/vllm-mi50.md
new file mode 100644
index 0000000..c8f6fd4
--- /dev/null
+++ b/vllm-mi50.md
@@ -0,0 +1,416 @@
+ +--- + +# **MI50 + vLLM + Proxmox LXC Setup Guide** + +### *End-to-End Field Manual for gfx906 LLM Serving* + +**Version:** 1.0 +**Last updated:** 2025-11-17 + +--- + +## **πŸ“Œ Overview** + +This guide documents how to run a **vLLM OpenAI-compatible server** on an +**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN, +and wire it into **Project Lyra's Cortex reasoning layer**. + +This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again. + +--- + +## **1. What This Stack Looks Like** + +``` +Proxmox Host + β”œβ”€ AMD Instinct MI50 (gfx906) + β”œβ”€ AMDGPU + ROCm stack + └─ LXC Container (CT 201: cortex-gpu) + β”œβ”€ Ubuntu 24.04 + β”œβ”€ Docker + docker compose + β”œβ”€ vLLM inside Docker (nalanzeyu/vllm-gfx906) + β”œβ”€ GPU passthrough via /dev/kfd + /dev/dri + PCI bind + └─ vLLM API exposed on :8000 +Lyra Cortex (VM/Server) + └─ LLM_PRIMARY_URL=http://10.0.0.43:8000 +``` + +--- + +## **2. Proxmox Host β€” GPU Setup** + +### **2.1 Confirm MI50 exists** + +```bash +lspci -nn | grep -i 'vega\|instinct\|radeon' +``` + +You should see something like: + +``` +0a:00.0 Display controller: AMD Instinct MI50 (gfx906) +``` + +### **2.2 Load AMDGPU driver** + +The main pitfall after **any host reboot**. + +```bash +modprobe amdgpu +``` + +If you skip this, the LXC container won't see the GPU. + +--- + +## **3. LXC Container Configuration (CT 201)** + +The container ID is **201**. +Config file is at: + +``` +/etc/pve/lxc/201.conf +``` + +### **3.1 Working 201.conf** + +Paste this *exact* version: + +```ini +arch: amd64 +cores: 4 +hostname: cortex-gpu +memory: 16384 +swap: 512 +ostype: ubuntu +onboot: 1 +startup: order=2,up=10,down=10 +net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth +rootfs: local-lvm:vm-201-disk-0,size=200G +unprivileged: 0 + +# Docker in LXC requires this +features: keyctl=1,nesting=1 +lxc.apparmor.profile: unconfined +lxc.cap.drop: + +# --- GPU passthrough for ROCm (MI50) --- +lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666 +lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir +lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir +lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir + +# Bind the MI50 PCI device +lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file + +# Allow GPU-related character devices +lxc.cgroup2.devices.allow: c 226:* rwm +lxc.cgroup2.devices.allow: c 29:* rwm +lxc.cgroup2.devices.allow: c 189:* rwm +lxc.cgroup2.devices.allow: c 238:* rwm +lxc.cgroup2.devices.allow: c 241:* rwm +lxc.cgroup2.devices.allow: c 242:* rwm +lxc.cgroup2.devices.allow: c 243:* rwm +lxc.cgroup2.devices.allow: c 244:* rwm +lxc.cgroup2.devices.allow: c 245:* rwm +lxc.cgroup2.devices.allow: c 246:* rwm +lxc.cgroup2.devices.allow: c 247:* rwm +lxc.cgroup2.devices.allow: c 248:* rwm +lxc.cgroup2.devices.allow: c 249:* rwm +lxc.cgroup2.devices.allow: c 250:* rwm +lxc.cgroup2.devices.allow: c 510:0 rwm +``` + +### **3.2 Restart sequence** + +```bash +pct stop 201 +modprobe amdgpu +pct start 201 +pct enter 201 +``` + +--- + +## **4. Inside CT 201 β€” Verifying ROCm + GPU Visibility** + +### **4.1 Check device nodes** + +```bash +ls -l /dev/kfd +ls -l /dev/dri +ls -l /opt/rocm +``` + +All must exist. 
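+
+If you want to script this check, here is a minimal pre-flight sketch. It assumes only the three paths above and is not part of the shipped tooling:
+
+```bash
+#!/usr/bin/env bash
+# Pre-flight check inside CT 201: confirm the ROCm device nodes and bind mount exist.
+set -euo pipefail
+
+for p in /dev/kfd /dev/dri /opt/rocm; do
+  if [ -e "$p" ]; then
+    echo "OK      $p"
+  else
+    echo "MISSING $p (re-check 'modprobe amdgpu' on the host and the 201.conf bind mounts)" >&2
+    exit 1
+  fi
+done
+echo "All ROCm paths present."
+```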
+ +### **4.2 Validate GPU via rocminfo** + +```bash +/opt/rocm/bin/rocminfo | grep -i gfx +``` + +You need to see: + +``` +gfx906 +``` + +If you see **nothing**, the GPU isn’t passed through β€” restart and re-check the host steps. + +--- + +## **5. Install Docker in the LXC (Ubuntu 24.04)** + +This container runs Docker inside LXC (nesting enabled). + +```bash +apt update +apt install -y ca-certificates curl gnupg + +install -m 0755 -d /etc/apt/keyrings +curl -fsSL https://download.docker.com/linux/ubuntu/gpg \ + | gpg --dearmor -o /etc/apt/keyrings/docker.gpg +chmod a+r /etc/apt/keyrings/docker.gpg + +echo \ + "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \ + https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \ + > /etc/apt/sources.list.d/docker.list + +apt update +apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin +``` + +Check: + +```bash +docker --version +docker compose version +``` + +--- + +## **6. Running vLLM Inside CT 201 via Docker** + +### **6.1 Create directory** + +```bash +mkdir -p /root/vllm +cd /root/vllm +``` + +### **6.2 docker-compose.yml** + +Save this exact file as `/root/vllm/docker-compose.yml`: + +```yaml +version: "3.9" + +services: + vllm-mi50: + image: nalanzeyu/vllm-gfx906:latest + container_name: vllm-mi50 + restart: unless-stopped + ports: + - "8000:8000" + environment: + VLLM_ROLE: "APIServer" + VLLM_MODEL: "/model" + VLLM_LOGGING_LEVEL: "INFO" + command: > + vllm serve /model + --host 0.0.0.0 + --port 8000 + --dtype float16 + --max-model-len 4096 + --api-type openai + devices: + - "/dev/kfd:/dev/kfd" + - "/dev/dri:/dev/dri" + volumes: + - /opt/rocm:/opt/rocm:ro +``` + +### **6.3 Start vLLM** + +```bash +docker compose up -d +docker compose logs -f +``` + +When healthy, you’ll see: + +``` +(APIServer) Application startup complete. +``` + +and periodic throughput logs. + +--- + +## **7. Test vLLM API** + +### **7.1 From Proxmox host** + +```bash +curl -X POST http://10.0.0.43:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"/model","prompt":"ping","max_tokens":5}' +``` + +Should respond like: + +```json +{"choices":[{"text":"-pong"}]} +``` + +### **7.2 From Cortex machine** + +```bash +curl -X POST http://10.0.0.43:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}' +``` + +--- + +## **8. Wiring into Lyra Cortex** + +In `cortex` container’s `docker-compose.yml`: + +```yaml +environment: + LLM_PRIMARY_URL: http://10.0.0.43:8000 +``` + +Not `/v1/completions` because the router appends that automatically. + +In `cortex/.env`: + +```env +LLM_FORCE_BACKEND=primary +LLM_MODEL=/model +``` + +Test: + +```bash +curl -X POST http://10.0.0.41:7081/reason \ + -H "Content-Type: application/json" \ + -d '{"prompt":"test vllm","session_id":"dev"}' +``` + +If you get a meaningful response: **Cortex β†’ vLLM is online**. + +--- + +## **9. Common Failure Modes (And Fixes)** + +### **9.1 β€œFailed to infer device type”** + +vLLM cannot see any ROCm devices. + +Fix: + +```bash +# On host +modprobe amdgpu +pct stop 201 +pct start 201 +# In container +/opt/rocm/bin/rocminfo | grep -i gfx +docker compose up -d +``` + +### **9.2 GPU disappears after reboot** + +Same fix: + +```bash +modprobe amdgpu +pct stop 201 +pct start 201 +``` + +### **9.3 Invalid image name** + +If you see pull errors: + +``` +pull access denied for nalanzeuy... 
+
+```
+
+Use:
+
+```
+image: nalanzeyu/vllm-gfx906
+```
+
+### **9.4 Double `/v1` in URL**
+
+Ensure:
+
+```
+LLM_PRIMARY_URL=http://10.0.0.43:8000
+```
+
+Router appends `/v1/completions`.
+
+---
+
+## **10. Daily / Reboot Ritual**
+
+### **On Proxmox host**
+
+```bash
+modprobe amdgpu
+pct stop 201
+pct start 201
+```
+
+### **Inside CT 201**
+
+```bash
+/opt/rocm/bin/rocminfo | grep -i gfx
+cd /root/vllm
+docker compose up -d
+docker compose logs -f
+```
+
+### **Test API**
+
+```bash
+curl -X POST http://10.0.0.43:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
+```
+
+---
+
+## **11. Summary**
+
+You now have:
+
+* **MI50 (gfx906)** correctly passed into LXC
+* **ROCm** inside the container via bind mounts
+* **vLLM** running inside Docker in the LXC
+* **OpenAI-compatible API** on port 8000
+* **Lyra Cortex** using it automatically as primary backend
+
+This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime.
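+
+---
+
+### **Appendix: One-Shot Bring-Up Sketch (Optional)**
+
+The reboot ritual in section 10 can be wrapped into a single host-side script. This is a sketch, not something shipped in the repo; it assumes CT 201, the `/root/vllm` compose directory, and the `10.0.0.43` address used throughout this guide.
+
+```bash
+#!/usr/bin/env bash
+# Sketch: one-shot bring-up after a Proxmox host reboot (run on the host).
+# Assumes CT 201 and the /root/vllm compose directory described above.
+set -euo pipefail
+
+modprobe amdgpu
+pct stop 201 || true        # ignore the error if the container is already stopped
+pct start 201
+
+# Give the container a moment, then verify the GPU and start vLLM inside it.
+sleep 10
+pct exec 201 -- bash -c '/opt/rocm/bin/rocminfo | grep -i gfx'
+pct exec 201 -- bash -c 'cd /root/vllm && docker compose up -d'
+
+# Smoke-test the OpenAI-compatible endpoint from the host.
+curl -s -X POST http://10.0.0.43:8000/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
+```
+
+`pct exec` runs the container-side steps from the host, so the whole ritual can be driven from one shell.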