# Project Lyra — Modular Changelog
All notable changes to Project Lyra are organized by component.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
and adheres to [Semantic Versioning](https://semver.org/).
# Last Updated: 2025-11-26
---
## 🧠 Lyra-Core ##############################################################################
## [Infrastructure v1.0.0] - 2025-11-26
### Changed
- **Environment Variable Consolidation** - Major reorganization to eliminate duplication and improve maintainability
- Consolidated 9 scattered `.env` files into single source of truth architecture
- Root `.env` now contains all shared infrastructure (LLM backends, databases, API keys, service URLs)
- Service-specific `.env` files minimized to only essential overrides:
- `cortex/.env`: Reduced from 42 to 22 lines (operational parameters only)
- `neomem/.env`: Reduced from 26 to 14 lines (LLM naming conventions only)
- `intake/.env`: Kept at 8 lines (already minimal)
- **Result**: ~24% reduction in total configuration lines (197 → ~150)
- **Docker Compose Consolidation**
- All services now defined in single root `docker-compose.yml`
- Relay service updated with complete configuration (env_file, volumes)
- Removed redundant `core/docker-compose.yml` (marked as DEPRECATED)
- Standardized network communication to use Docker container names
- **Service URL Standardization**
- Internal services use container names: `http://neomem-api:7077`, `http://cortex:7081`
- External services use IP addresses: `http://10.0.0.43:8000` (vLLM), `http://10.0.0.3:11434` (Ollama)
- Removed IP/container name inconsistencies across files
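
A minimal sketch of the resulting layout (URL values are taken from the entries in this changelog; `NEOMEM_API` and `CORTEX_API` are illustrative names, not necessarily the exact keys used):
```
# Root .env — shared infrastructure (single source of truth)
OPENAI_API_KEY=sk-REPLACE_ME
LLM_PRIMARY_URL=http://10.0.0.43:8000        # vLLM (external, by IP)
LLM_SECONDARY_URL=http://10.0.0.3:11434      # Ollama (external, by IP)
NEOMEM_API=http://neomem-api:7077            # internal, by container name
CORTEX_API=http://cortex:7081                # internal, by container name

# cortex/.env — operational overrides only
LLM_FORCE_BACKEND=primary
```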
### Added
- **Security Templates** - Created `.env.example` files for all services
- Root `.env.example` with sanitized credentials
- Service-specific templates: `cortex/.env.example`, `neomem/.env.example`, `intake/.env.example`, `rag/.env.example`
- All `.env.example` files safe to commit to version control
- **Documentation**
- `ENVIRONMENT_VARIABLES.md`: Comprehensive reference for all environment variables
- Variable descriptions, defaults, and usage examples
- Multi-backend LLM strategy documentation
- Troubleshooting guide
- Security best practices
- `DEPRECATED_FILES.md`: Deletion guide for deprecated files with verification steps
- **Enhanced .gitignore**
- Ignores all `.env` files (including subdirectories)
- Tracks `.env.example` templates for documentation
- Ignores `.env-backups/` directory
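
A sketch of the pattern (the exact rules in the repo may differ):
```
# Ignore every real .env file, including those in service subdirectories
.env
**/.env
# Keep sanitized templates tracked for documentation
!.env.example
!**/.env.example
# Ignore the migration backup directory
.env-backups/
```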
### Removed
- `core/.env` - Redundant with root `.env`, now deleted
- `core/docker-compose.yml` - Consolidated into main compose file (marked DEPRECATED)
### Fixed
- Eliminated duplicate `OPENAI_API_KEY` across 5+ files
- Eliminated duplicate LLM backend URLs across 4+ files
- Eliminated duplicate database credentials across 3+ files
- Resolved Cortex `environment:` section override in docker-compose (now uses env_file)
### Architecture
- **Multi-Backend LLM Strategy**: Root `.env` provides all backend OPTIONS (PRIMARY, SECONDARY, CLOUD, FALLBACK), services choose which to USE
- Cortex → vLLM (PRIMARY) for autonomous reasoning
- NeoMem → Ollama (SECONDARY) + OpenAI embeddings
- Intake → vLLM (PRIMARY) for summarization
- Relay → Fallback chain with user preference
- Preserves per-service flexibility while eliminating URL duplication
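
A hedged sketch of how a service can resolve its backend from these shared options (the helper and the `LLM_FALLBACK_URL` name are illustrative, not the repo's actual code):
```python
import os

# Shared OPTIONS provided by the root .env; each service chooses which to USE.
BACKENDS = {
    "primary": os.getenv("LLM_PRIMARY_URL"),      # vLLM (MI50)
    "secondary": os.getenv("LLM_SECONDARY_URL"),  # Ollama (3090)
    "cloud": os.getenv("LLM_CLOUD_URL"),          # OpenAI
    "fallback": os.getenv("LLM_FALLBACK_URL"),    # llama.cpp (CPU) — name assumed
}

def backend_url(service_default: str) -> str | None:
    """Return the URL for the backend this service chooses, e.g. backend_url('primary')."""
    choice = os.getenv("LLM_FORCE_BACKEND", service_default)
    return BACKENDS.get(choice) or BACKENDS["fallback"]
```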
### Migration
- All original `.env` files backed up to `.env-backups/` with timestamp `20251126_025334`
- Rollback plan documented in `ENVIRONMENT_VARIABLES.md`
- Verification steps provided in `DEPRECATED_FILES.md`
---
## [Lyra_RAG v0.1.0] - 2025-11-07
### Added
- Initial standalone RAG module for Project Lyra.
- Persistent ChromaDB vector store (`./chromadb`).
- Importer `rag_chat_import.py` with:
- Recursive folder scanning and category tagging.
- Smart chunking (~5 k chars).
- SHA-1 deduplication and chat-ID metadata.
- Timestamp fields (`file_modified`, `imported_at`).
- Background-safe operation (`nohup`/`tmux`).
- 68 Lyra-category chats imported:
- **6 556 new chunks added**
- **1 493 duplicates skipped**
- **7 997 total vectors** now stored.
### API
- `/rag/search` FastAPI endpoint implemented (port 7090).
- Supports natural-language queries and returns top related excerpts.
- Added answer synthesis step using `gpt-4o-mini`.
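
An illustrative query (host and request field name are assumptions; adjust to the deployed schema):
```bash
curl -s -X POST http://localhost:7090/rag/search \
  -H "Content-Type: application/json" \
  -d '{"query": "How did salience filtering work in Lyra-Core v0.3.0?"}'
```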
### Verified
- Successful recall of Lyra-Core development history (v0.3.0 snapshot).
- Correct metadata and category tagging for all new imports.
### Next Planned
- Optional `where` filter parameter for category/date queries.
- Graceful “no results” handler for empty retrievals.
- `rag_docs_import.py` for PDFs and other document types.
## [Lyra Core v0.3.2 + Web UI v0.2.0] - 2025-10-28
### Added
- **New UI**
  - Cleaned up UI look and feel.
- **Sessions**
  - Sessions now persist over time.
  - Ability to create new sessions or load sessions from a previous instance.
  - Switching sessions updates what the prompt sends to Relay (messages from other sessions are not included).
  - Relay is correctly wired in.
## [Lyra-Core 0.3.1] - 2025-10-09
### Added
- **NVGRAM Integration (Full Pipeline Reconnected)**
- Replaced legacy Mem0 service with NVGRAM microservice (`nvgram-api` @ port 7077).
- Updated `server.js` in Relay to route all memory ops via `${NVGRAM_API}/memories` and `/search`.
- Added `.env` variable:
```
NVGRAM_API=http://nvgram-api:7077
```
- Verified end-to-end Lyra conversation persistence:
- `relay → nvgram-api → postgres/neo4j → relay → ollama → ui`
- ✅ Memories stored, retrieved, and re-injected successfully.
### Changed
- Renamed `MEM0_URL` → `NVGRAM_API` across all relay environment configs.
- Updated Docker Compose service dependency order:
- `relay` now depends on `nvgram-api` healthcheck.
- Removed `mem0` references and volumes.
- Minor cleanup to Persona fetch block (null-checks and safer default persona string).
### Fixed
- Relay startup no longer crashes when NVGRAM is unavailable — deferred connection handling.
- `/memories` POST failures no longer crash Relay; now logged gracefully as `relay error Error: memAdd failed: 500`.
- Improved injected prompt debugging (`DEBUG_PROMPT=true` now prints clean JSON).
### Goals / Next Steps
- Add salience visualization (e.g., memory weights displayed in injected system message).
- Begin schema alignment with NVGRAM v0.1.2 for confidence scoring.
- Add relay auto-retry for transient 500 responses from NVGRAM.
---
## [Lyra-Core] v0.3.1 - 2025-09-27
### Changed
- Removed salience filter logic; Cortex is now the default annotator.
- All user messages stored in Mem0; no discard tier applied.
### Added
- Cortex annotations (`metadata.cortex`) now attached to memories.
- Debug logging improvements:
- Pretty-print Cortex annotations
- Injected prompt preview
- Memory search hit list with scores
- `.env` toggle (`CORTEX_ENABLED`) to bypass Cortex when needed.
### Fixed
- Parsing failures from Markdown-wrapped Cortex JSON via fence cleaner.
- Relay no longer “hangs” on malformed Cortex outputs.
---
### [Lyra-Core] v0.3.0 — 2025-09-26
#### Added
- Implemented **salience filtering** in Relay:
- `.env` configurable: `SALIENCE_ENABLED`, `SALIENCE_MODE`, `SALIENCE_MODEL`, `SALIENCE_API_URL`.
- Supports `heuristic` and `llm` classification modes.
- LLM-based salience filter integrated with Cortex VM running `llama-server`.
- Logging improvements:
- Added debug logs for salience mode, raw LLM output, and unexpected outputs.
- Fail-closed behavior for unexpected LLM responses.
- Successfully tested with **Phi-3.5-mini** and **Qwen2-0.5B-Instruct** as salience classifiers.
- Verified end-to-end flow: Relay → salience filter → Mem0 add/search → Persona injection → LLM reply.
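
A representative configuration of the variables above (values are illustrative; the llama-server URL and port are placeholders):
```
SALIENCE_ENABLED=true
SALIENCE_MODE=llm                  # "heuristic" or "llm"
SALIENCE_MODEL=phi-3.5-mini-instruct
SALIENCE_API_URL=http://<cortex-vm>:8080/v1/chat/completions
```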
#### Changed
- Refactored `server.js` to gate `mem.add()` calls behind salience filter.
- Updated `.env` to support `SALIENCE_MODEL`.
#### Known Issues
- Small models (e.g. Qwen2-0.5B) tend to over-classify as "salient".
- Phi-3.5-mini sometimes returns truncated tokens ("sali", "fi").
- CPU-only inference is functional but limited; larger models recommended once GPU is available.
---
### [Lyra-Core] v0.2.0 — 2025-09-24
#### Added
- Migrated Relay to use `mem0ai` SDK instead of raw fetch calls.
- Implemented `sessionId` support (client-supplied, fallback to `default`).
- Added debug logs for memory add/search.
- Cleaned up Relay structure for clarity.
---
### [Lyra-Core] v0.1.0 — 2025-09-23
#### Added
- First working MVP of **Lyra Core Relay**.
- Relay service accepts `POST /v1/chat/completions` (OpenAI-compatible).
- Memory integration with Mem0:
- `POST /memories` on each user message.
- `POST /search` before LLM call.
- Persona Sidecar integration (`GET /current`).
- OpenAI GPT + Ollama (Mythomax) support in Relay.
- Simple browser-based chat UI (talks to Relay at `http://<host>:7078`).
- `.env` standardization for Relay + Mem0 + Postgres + Neo4j.
- Working Neo4j + Postgres backing stores for Mem0.
- Initial MVP relay service with raw fetch calls to Mem0.
- Dockerized with basic healthcheck.
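
A minimal request against the Relay (host and model name are placeholders):
```bash
curl -s -X POST http://<host>:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mythomax", "messages": [{"role": "user", "content": "Hello, Lyra."}]}'
```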
#### Fixed
- Resolved crash loop in Neo4j by restricting env vars (`NEO4J_AUTH` only).
- Relay now correctly reads `MEM0_URL` and `MEM0_API_KEY` from `.env`.
#### Known Issues
- No feedback loop (thumbs up/down) yet.
- Forget/delete flow is manual (via memory IDs).
- Memory latency ~14s depending on embedding model.
---
## 🧩 Lyra-NeoMem (formerly NVGRAM / Lyra-Mem0) ##############################################################################
## [NeoMem 0.1.2] - 2025-10-27
### Changed
- **Renamed NVGRAM to NeoMem**
- All future updates will be under the name NeoMem.
- Features have not changed.
## [NVGRAM 0.1.1] - 2025-10-08
### Added
- **Async Memory Rewrite (Stability + Safety Patch)**
- Introduced `AsyncMemory` class with fully asynchronous vector and graph store writes.
- Added **input sanitation** to prevent embedding errors (`'list' object has no attribute 'replace'`).
- Implemented `flatten_messages()` helper in API layer to clean malformed payloads.
- Added structured request logging via `RequestLoggingMiddleware` (FastAPI middleware).
- Health endpoint (`/health`) now returns structured JSON `{status, version, service}`.
- Startup logs now include **sanitized embedder config** with API keys masked for safety:
```
>>> Embedder config (sanitized): {'provider': 'openai', 'config': {'model': 'text-embedding-3-small', 'api_key': '***'}}
✅ Connected to Neo4j on attempt 1
🧠 NVGRAM v0.1.1 — Neural Vectorized Graph Recall and Memory initialized
```
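
A hedged sketch of the `flatten_messages()` idea (the real helper may differ): coerce each message's content to a plain string before it reaches the embedder.
```python
def flatten_messages(messages: list[dict]) -> list[dict]:
    """Sanitize incoming payloads: content must be a plain string."""
    flat = []
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, list):
            # Join multi-part content into one string (avoids
            # "'list' object has no attribute 'replace'" in the embedder)
            content = " ".join(str(part) for part in content)
        flat.append({**msg, "content": str(content)})
    return flat
```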
### Changed
- Replaced synchronous `Memory.add()` with async-safe version supporting concurrent vector + graph writes.
- Normalized indentation and cleaned duplicate `main.py` references under `/nvgram/` vs `/nvgram/server/`.
- Removed redundant `FastAPI()` app reinitialization.
- Updated internal logging to INFO-level timing format:
  `2025-10-08 21:48:45 [INFO] POST /memories -> 200 (11189.1 ms)`
- Deprecated `@app.on_event("startup")` (FastAPI deprecation warning) → will migrate to `lifespan` handler in v0.1.2.
### Fixed
- Eliminated repeating 500 error from OpenAI embedder caused by non-string message content.
- Masked API key leaks from boot logs.
- Ensured Neo4j reconnects gracefully on first retry.
### Goals / Next Steps
- Integrate **salience scoring** and **embedding confidence weight** fields in Postgres schema.
- Begin testing with full Lyra Relay + Persona Sidecar pipeline for live session memory recall.
- Migrate from deprecated `on_event` → `lifespan` pattern in 0.1.2.
---
## [NVGRAM 0.1.0] - 2025-10-07
### Added
- **Initial fork of Mem0 → NVGRAM**:
- Created a fully independent local-first memory engine based on Mem0 OSS.
- Renamed all internal modules, Docker services, and environment variables from `mem0` → `nvgram`.
- New service name: **`nvgram-api`**, default port **7077**.
- Maintains same API endpoints (`/memories`, `/search`) for drop-in compatibility with Lyra Core.
- Uses **FastAPI**, **Postgres**, and **Neo4j** as persistent backends.
- Verified clean startup:
```
✅ Connected to Neo4j on attempt 1
INFO: Uvicorn running on http://0.0.0.0:7077
```
- `/docs` and `/openapi.json` confirmed reachable and functional.
### Changed
- Removed dependency on the external `mem0ai` SDK — all logic now local.
- Re-pinned requirements:
- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama
- Adjusted `docker-compose` and `.env` templates to use new NVGRAM naming and image paths.
### Goals / Next Steps
- Integrate NVGRAM as the new default backend in Lyra Relay.
- Deprecate remaining Mem0 references and archive old configs.
- Begin versioning as a standalone project (`nvgram-core`, `nvgram-api`, etc.).
---
## [Lyra-Mem0 0.3.2] - 2025-10-05
### Added
- Support for **Ollama LLM reasoning** alongside OpenAI embeddings:
- Introduced `LLM_PROVIDER=ollama`, `LLM_MODEL`, and `OLLAMA_HOST` in `.env.3090`.
- Verified local 3090 setup using `qwen2.5:7b-instruct-q4_K_M`.
- Split processing pipeline:
- Embeddings → OpenAI `text-embedding-3-small`
- LLM → Local Ollama (`http://10.0.0.3:11434/api/chat`).
- Added `.env.3090` template for self-hosted inference nodes.
- Integrated runtime diagnostics and seeder progress tracking:
- File-level + message-level progress bars.
- Retry/back-off logic for timeouts (3 attempts).
- Event logging (`ADD / UPDATE / NONE`) for every memory record.
- Expanded Docker health checks for Postgres, Qdrant, and Neo4j containers.
- Added GPU-friendly long-run configuration for continuous seeding (validated on RTX 3090).
### Changed
- Updated `main.py` configuration block to load:
- `LLM_PROVIDER`, `LLM_MODEL`, and `OLLAMA_BASE_URL`.
- Fallback to OpenAI if Ollama unavailable.
- Adjusted `docker-compose.yml` mount paths to correctly map `/app/main.py`.
- Normalized `.env` loading so `mem0-api` and host environment share identical values.
- Improved seeder logging and progress telemetry for clearer diagnostics.
- Added explicit `temperature` field to `DEFAULT_CONFIG['llm']['config']` for tuning future local inference runs.
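
A hedged sketch of that configuration block (the exact `DEFAULT_CONFIG` structure and the OpenAI fallback model are assumptions):
```python
import os

LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://10.0.0.3:11434")

if LLM_PROVIDER == "ollama":
    llm_config = {
        "provider": "ollama",
        "config": {
            "model": os.getenv("LLM_MODEL", "qwen2.5:7b-instruct-q4_K_M"),
            "ollama_base_url": OLLAMA_BASE_URL,
            "temperature": 0.1,  # explicit temperature for local runs
        },
    }
else:
    # Fall back to OpenAI if Ollama is not selected/available
    llm_config = {
        "provider": "openai",
        "config": {"model": "gpt-4o-mini", "temperature": 0.1},
    }

DEFAULT_CONFIG = {
    "llm": llm_config,
    "embedder": {  # embeddings stay on OpenAI regardless of LLM provider
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
}
```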
### Fixed
- Resolved crash during startup:
`TypeError: OpenAIConfig.__init__() got an unexpected keyword argument 'ollama_base_url'`.
- Corrected mount type mismatch (file vs directory) causing `OCI runtime create failed` errors.
- Prevented duplicate or partial postings when retry logic triggered multiple concurrent requests.
- “Unknown event” warnings now safely ignored (no longer break seeding loop).
- Confirmed full dual-provider operation in logs (`api.openai.com` + `10.0.0.3:11434/api/chat`).
### Observations
- Stable GPU utilization: ~8 GB VRAM @ 92 % load, ≈ 67 °C under sustained seeding.
- Next revision will re-format seed JSON to preserve `role` context (user vs assistant).
---
## [Lyra-Mem0 0.3.1] - 2025-10-03
### Added
- HuggingFace TEI integration (local 3090 embedder).
- Dual-mode environment switch between OpenAI cloud and local.
- CSV export of memories from Postgres (`payload->>'data'`).
### Fixed
- `.env` CRLF vs LF line ending issues.
- Local seeding now possible via a running Hugging Face TEI server.
---
## [Lyra-Mem0 0.3.0]
### Added
- Support for **Ollama embeddings** in Mem0 OSS container:
- Added ability to configure `EMBEDDER_PROVIDER=ollama` and set `EMBEDDER_MODEL` + `OLLAMA_HOST` via `.env`.
- Mounted `main.py` override from host into container to load custom `DEFAULT_CONFIG`.
- Installed `ollama` Python client into custom API container image.
- `.env.3090` file created for external embedding mode (3090 machine):
- EMBEDDER_PROVIDER=ollama
- EMBEDDER_MODEL=mxbai-embed-large
- OLLAMA_HOST=http://10.0.0.3:11434
- Workflow to support **multiple embedding modes**:
1. Fast LAN-based 3090/Ollama embeddings
2. Local-only CPU embeddings (Lyra Cortex VM)
3. OpenAI fallback embeddings
### Changed
- `docker-compose.yml` updated to mount local `main.py` and `.env.3090`.
- Built **custom Dockerfile** (`mem0-api-server:latest`) extending base image with `pip install ollama`.
- Updated `requirements.txt` to include `ollama` package.
- Adjusted Mem0 container config so `main.py` pulls environment variables with `dotenv` (`load_dotenv()`).
- Tested new embeddings path with curl `/memories` API call.
### Fixed
- Resolved container boot failure caused by missing `ollama` dependency (`ModuleNotFoundError`).
- Fixed config overwrite issue where rebuilding container restored stock `main.py`.
- Worked around Neo4j error (`vector.similarity.cosine(): mismatched vector dimensions`) by confirming OpenAI vs. Ollama embedding vector sizes and planning to standardize at 1536-dim.
---
## [Lyra-Mem0 v0.2.1]
### Added
- **Seeding pipeline**:
- Built Python seeder script to bulk-insert raw Cloud Lyra exports into Mem0.
- Implemented incremental seeding option (skip existing memories, only add new ones).
- Verified insert process with Postgres-backed history DB and curl `/memories/search` sanity check.
- **Ollama embedding support** in Mem0 OSS container:
- Added configuration for `EMBEDDER_PROVIDER=ollama`, `EMBEDDER_MODEL`, and `OLLAMA_HOST` via `.env`.
- Created `.env.3090` profile for LAN-connected 3090 machine with Ollama.
- Set up three embedding modes:
1. Fast LAN-based 3090/Ollama
2. Local-only CPU model (Lyra Cortex VM)
3. OpenAI fallback
### Changed
- Updated `main.py` to load configuration from `.env` using `dotenv` and support multiple embedder backends.
- Mounted host `main.py` into container so local edits persist across rebuilds.
- Updated `docker-compose.yml` to mount `.env.3090` and support swap between profiles.
- Built **custom Dockerfile** (`mem0-api-server:latest`) including `pip install ollama`.
- Updated `requirements.txt` with `ollama` dependency.
- Adjusted startup flow so container automatically connects to external Ollama host (LAN IP).
- Added logging to confirm model pulls and embedding requests.
### Fixed
- Seeder process originally failed on old memories — now skips duplicates and continues batch.
- Resolved container boot error (`ModuleNotFoundError: ollama`) by extending image.
- Fixed overwrite issue where stock `main.py` replaced custom config during rebuild.
- Worked around Neo4j `vector.similarity.cosine()` dimension mismatch by investigating OpenAI (1536-dim) vs Ollama (1024-dim) schemas.
### Notes
- To fully unify embedding modes, a Hugging Face / local model with **1536-dim embeddings** will be needed (to match OpenAI's schema and avoid Neo4j errors).
- Current Ollama model (`mxbai-embed-large`) works, but returns 1024-dim vectors.
- Seeder workflow validated but should be wrapped in a repeatable weekly run for full Cloud→Local sync.
---
## [Lyra-Mem0 v0.2.0] - 2025-09-30
### Added
- Standalone **Lyra-Mem0** stack created at `~/lyra-mem0/`
- Includes **Postgres (pgvector)**, **Qdrant**, **Neo4j**, and **SQLite** for history tracking.
- Added working `docker-compose.mem0.yml` and custom `Dockerfile` for building the Mem0 API server.
- Verified REST API functionality:
- `POST /memories` works for adding memories.
- `POST /search` works for semantic search.
- Successful end-to-end test with persisted memory:
*"Likes coffee in the morning"* → retrievable via search. ✅
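
The smoke test looked roughly like this (the port and request fields follow Mem0 OSS conventions and are assumptions here):
```bash
# Add a memory (user_id is a placeholder)
curl -s -X POST http://localhost:8000/memories \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Likes coffee in the morning"}], "user_id": "demo"}'

# Retrieve it via semantic search
curl -s -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "coffee", "user_id": "demo"}'
```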
### Changed
- Split architecture into **modular stacks**:
- `~/lyra-core` (Relay, Persona-Sidecar, etc.)
- `~/lyra-mem0` (Mem0 OSS memory stack)
- Removed old embedded mem0 containers from the Lyra-Core compose file.
- Added Lyra-Mem0 section in README.md.
### Next Steps
- Wire **Relay → Mem0 API** (integration not yet complete).
- Add integration tests to verify persistence and retrieval from within Lyra-Core.
---
## 🧠 Lyra-Cortex ##############################################################################
## [Cortex - v0.5] - 2025-11-13
### Added
- **New `reasoning.py` module**
- Async reasoning engine.
- Accepts user prompt, identity, RAG block, and reflection notes.
- Produces draft internal answers.
- Uses primary backend (vLLM).
- **New `reflection.py` module**
- Fully async.
- Produces actionable JSON “internal notes.”
- Enforces strict JSON schema and fallback parsing.
- Forces cloud backend (`backend_override="cloud"`).
- Integrated `refine.py` into Cortex reasoning pipeline:
- New stage between reflection and persona.
- Runs exclusively on primary vLLM backend (MI50).
- Produces final, internally consistent output for downstream persona layer.
- **Backend override system**
- Each LLM call can now select its own backend.
- Enables multi-LLM cognition: Reflection → cloud, Reasoning → primary.
- **identity loader**
- Added `identity.py` with `load_identity()` for consistent persona retrieval.
- **ingest_handler**
- Async stub created for future Intake → NeoMem → RAG pipeline.
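
A hedged sketch of how the override is used, assuming `call_llm()` from `llm_router.py` accepts a `backend_override` keyword as described above; function names and prompt assembly are illustrative:
```python
import json
from llm_router import call_llm  # Cortex's backend router (per this changelog)

async def reflect(user_prompt: str) -> dict:
    """Reflection is pinned to the cloud backend (backend_override='cloud')."""
    raw = await call_llm(
        f"Return strict JSON internal notes about: {user_prompt}",
        backend_override="cloud",   # OpenAI
    )
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"notes": [raw]}     # fallback parsing for malformed JSON

async def reason(user_prompt: str, identity: str, rag_block: str, notes: dict) -> str:
    """Reasoning stays on the primary vLLM backend (MI50)."""
    prompt = f"{identity}\n\n{rag_block}\n\n[Internal notes]\n{notes}\n\n{user_prompt}"
    return await call_llm(prompt, backend_override="primary")
```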
### Changed
- Unified LLM backend URL handling across Cortex:
- ENV variables must now contain FULL API endpoints.
- Removed all internal path-appending (e.g. `.../v1/completions`).
- `llm_router.py` rewritten to use env-provided URLs as-is.
- Ensures consistent behavior between draft, reflection, refine, and persona.
- **Rebuilt `main.py`**
- Removed old annotation/analysis logic.
- New structure: load identity → get RAG → reflect → reason → return draft+notes.
- Routes now clean and minimal (`/reason`, `/ingest`, `/health`).
- Async path throughout Cortex.
- **Refactored `llm_router.py`**
- Removed old fallback logic during overrides.
- OpenAI requests now use `/v1/chat/completions`.
- Added proper OpenAI Authorization headers.
- Distinct payload format for vLLM vs OpenAI.
- Unified, correct parsing across models.
- **Simplified Cortex architecture**
- Removed deprecated “context.py” and old reasoning code.
- Relay completely decoupled from smart behavior.
- Updated environment specification:
- `LLM_PRIMARY_URL` now set to `http://10.0.0.43:8000/v1/completions`.
- `LLM_SECONDARY_URL` remains `http://10.0.0.3:11434/api/generate` (Ollama).
- `LLM_CLOUD_URL` set to `https://api.openai.com/v1/chat/completions`.
### Fixed
- Resolved endpoint conflict where:
- Router expected base URLs.
- Refine expected full URLs.
- Refine always fell back due to hitting incorrect endpoint.
- Fixed by standardizing full-URL behavior across entire system.
- Reflection layer no longer fails silently (previously returned `[""]` due to MythoMax).
- Resolved 404/401 errors caused by incorrect OpenAI URL endpoints.
- No more double-routing through vLLM during reflection.
- Corrected async/sync mismatch in multiple locations.
- Eliminated double-path bug (`/v1/completions/v1/completions`) caused by previous router logic.
### Removed
- Legacy `annotate`, `reason_check` glue logic from old architecture.
- Old backend probing junk code.
- Stale imports and unused modules leftover from previous prototype.
### Verified
- Cortex → vLLM (MI50) → refine → final_output now functioning correctly.
- refine shows `used_primary_backend: true` and no fallback.
- Manual curl test confirms endpoint accuracy.
### Known Issues
- refine sometimes prefixes output with `"Final Answer:"`; next version will sanitize this.
- hallucinations in draft_output persist due to weak grounding (fix in reasoning + RAG planned).
### Pending
- **RAG service does not exist** — requires containerized FastAPI service.
- Reasoning layer lacks self-revision loop (deliberate thought cycle).
- No speak/persona generation layer yet (`speak.py` planned).
- Intake summaries not yet routing into RAG or reflection layer.
- No refinement engine between reasoning and speak.
### Notes
This is the largest structural change to Cortex so far.
It establishes:
- multi-model cognition
- clean layering
- identity + reflection separation
- correct async code
- deterministic backend routing
- predictable JSON reflection
The system is now ready for:
- refinement loops
- persona-speaking layer
- containerized RAG
- long-term memory integration
- true emergent-behavior experiments
## [Cortex - v0.4.1] - 2025-11-05
### Added
- **RAG integration**
  - Added `rag.py` with `query_rag()` and `format_rag_block()`.
  - Cortex now queries the local RAG API (`http://10.0.0.41:7090/rag/search`) for contextual augmentation.
  - Synthesized answers and top excerpts are injected into the reasoning prompt.
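
A hedged sketch of the two helpers (response field names such as `answer` and `excerpts` are assumptions):
```python
import httpx

RAG_URL = "http://10.0.0.41:7090/rag/search"

async def query_rag(query: str) -> dict:
    """POST the user query to the RAG API and return its JSON payload."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(RAG_URL, json={"query": query})
        resp.raise_for_status()
        return resp.json()

def format_rag_block(result: dict) -> str:
    """Build the [RAG] context block injected into the reasoning prompt."""
    excerpts = "\n".join(f"- {e}" for e in result.get("excerpts", []))
    return f"[RAG]\nAnswer: {result.get('answer', '')}\nExcerpts:\n{excerpts}"
```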
### Changed
- **Revised /reason endpoint.**
- Now builds unified context blocks:
- [Intake] → recent summaries
- [RAG] → contextual knowledge
- [User Message] → current input
- Calls call_llm() for the first pass, then reflection_loop() for meta-evaluation.
- Returns cortex_prompt, draft_output, final_output, and normalized reflection.
- **Reflection Pipeline Stability**
- Cleaned parsing to normalize JSON vs. text reflections.
- Added fallback handling for malformed or non-JSON outputs.
- Log system improved to show raw JSON, extracted fields, and normalized summary.
- **Async Summarization (Intake v0.2.1)**
- Intake summaries now run in background threads to avoid blocking Cortex.
- Summaries (L1–L∞) logged asynchronously with `[BG]` tags.
- **Environment & Networking Fixes**
- Verified .env variables propagate correctly inside the Cortex container.
- Confirmed Docker network connectivity between Cortex, Intake, NeoMem, and RAG (shared serversdown_lyra_net).
- Adjusted localhost calls to service-IP mapping (10.0.0.41 for Cortex host).
- **Behavioral Updates**
- Cortex now performs conversation reflection (on user intent) and self-reflection (on its own answers).
- RAG context successfully grounds reasoning outputs.
- Intake and NeoMem confirmed receiving summaries via /add_exchange.
- Log clarity pass: all reflective and contextual blocks clearly labeled.
- **Known Gaps / Next Steps**
- **NeoMem Tuning**
  - Improve retrieval latency and relevance.
  - Implement a dedicated `/reflections/recent` endpoint for Cortex.
  - Migrate to Cortex-first ingestion (Relay → Cortex → NeoMem).
- **Cortex Enhancements**
  - Add persistent reflection recall (use prior reflections as meta-context).
  - Improve reflection JSON structure ("insight", "evaluation", "next_action" → guaranteed fields).
  - Tighten temperature and prompt control for factual consistency.
- **RAG Optimization**
  - Add source ranking, filtering, and multi-vector hybrid search.
  - Cache RAG responses per session to reduce duplicate calls.
- **Documentation / Monitoring**
  - Add a health route for RAG and Intake summaries.
  - Include internal latency metrics in the `/health` endpoint.
  - Consolidate logs into a unified “Lyra Cortex Console” for tracing all module calls.
## [Cortex - v0.3.0] - 2025-10-31
### Added
- **Cortex Service (FastAPI)**
- New standalone reasoning engine (`cortex/main.py`) with endpoints:
- `GET /health` reports active backend + NeoMem status.
- `POST /reason` evaluates `{prompt, response}` pairs.
- `POST /annotate` experimental text analysis.
- Background NeoMem health monitor (5-minute interval).
- **Multi-Backend Reasoning Support**
- Added environment-driven backend selection via `LLM_FORCE_BACKEND`.
- Supports:
- **Primary** → vLLM (MI50 node @ 10.0.0.43)
- **Secondary** → Ollama (3090 node @ 10.0.0.3)
- **Cloud** → OpenAI API
- **Fallback** → llama.cpp (CPU)
- Introduced per-backend model variables:
`LLM_PRIMARY_MODEL`, `LLM_SECONDARY_MODEL`, `LLM_CLOUD_MODEL`, `LLM_FALLBACK_MODEL`.
- **Response Normalization Layer**
- Implemented `normalize_llm_response()` to merge streamed outputs and repair malformed JSON.
- Handles Ollama's multi-line streaming and MythoMax's missing punctuation issues (see the sketch after this list).
- Prints concise debug previews of merged content.
- **Environment Simplification**
- Each service (`intake`, `cortex`, `neomem`) now maintains its own `.env` file.
- Removed reliance on shared/global env file to prevent cross-contamination.
- Verified Docker Compose networking across containers.
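
A hedged sketch of the merge step for Ollama's line-delimited streaming (the real implementation also repairs malformed JSON; the `response` field follows Ollama's `/api/generate` chunk format):
```python
import json

def normalize_llm_response(raw: str) -> str:
    """Merge line-delimited streaming chunks into one string of content."""
    pieces = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            chunk = json.loads(line)
            # Ollama streams {"response": "...", "done": false} one object per line
            pieces.append(chunk.get("response", ""))
        except json.JSONDecodeError:
            # Not a streamed JSON chunk — keep the raw text
            pieces.append(line)
    return "".join(pieces).strip()
```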
### Changed
- Refactored `reason_check()` to dynamically switch between **prompt** and **chat** mode depending on backend.
- Enhanced startup logs to announce active backend, model, URL, and mode.
- Improved error handling with clearer “Reasoning error” messages.
### Fixed
- Corrected broken vLLM endpoint routing (`/v1/completions`).
- Stabilized cross-container health reporting for NeoMem.
- Resolved JSON parse failures caused by streaming chunk delimiters.
---
## Next Planned [v0.4.0]
### Planned Additions
- **Reflection Mode**
- Introduce `REASONING_MODE=factcheck|reflection`.
- Output schema:
```json
{ "insight": "...", "evaluation": "...", "next_action": "..." }
```
- **Cortex-First Pipeline**
- UI → Cortex → [Reflection + Verifier + Memory] → Speech LLM → User.
- Allows Lyra to “think before speaking.”
- **Verifier Stub**
- New `/verify` endpoint for search-based factual grounding.
- Asynchronous external truth checking.
- **Memory Integration**
- Feed reflective outputs into NeoMem.
- Enable “dream” cycles for autonomous self-review.
---
**Status:** 🟢 Stable core; multi-backend reasoning operational.
**Next milestone:** *v0.4.0 — Reflection Mode + Thought Pipeline orchestration.*
---
### [Intake] v0.1.0 - 2025-10-27
- Receives messages from Relay and summarizes them in a cascading format.
- Continues to summarize small batches of exchanges while also generating large-scale conversational summaries (L20).
- Currently logs summaries to a `.log` file in `/project-lyra/intake-logs/`.
**Next Steps**
- Feed Intake output into NeoMem.
- Generate daily/hourly/etc. overall summaries (e.g., “Today Brian and Lyra worked on x, y, and z”).
- Generate session-aware summaries, each with its own intake hopper.
### [Lyra-Cortex] v0.2.0 — 2025-09-26
#### Added
- Integrated **llama-server** on dedicated Cortex VM (Proxmox).
- Verified Phi-3.5-mini-instruct_Uncensored-Q4_K_M running with 8 vCPUs.
- Benchmarked Phi-3.5-mini performance:
- ~18 tokens/sec CPU-only on Ryzen 7 7800X.
- Salience classification functional but sometimes inconsistent ("sali", "fi", "jamming").
- Tested **Qwen2-0.5B-Instruct GGUF** as alternative salience classifier:
- Much faster throughput (~350 tokens/sec prompt, ~100 tokens/sec eval).
- More responsive but over-classifies messages as “salient.”
- Established `.env` integration for model ID (`SALIENCE_MODEL`), enabling hot-swap between models.
#### Known Issues
- Small models tend to drift or over-classify.
- CPU-only 7B+ models expected to be slow; GPU passthrough recommended for larger models.
- Need to set up a `systemd` service for `llama-server` to auto-start on VM reboot.
---
### [Lyra-Cortex] v0.1.0 — 2025-09-25
#### Added
- First deployment as dedicated Proxmox VM (5 vCPU / 18 GB RAM / 100 GB SSD).
- Built **llama.cpp** with `llama-server` target via CMake.
- Integrated **Phi-3.5 Mini Instruct (Uncensored, Q4_K_M GGUF)** model.
- Verified **API compatibility** at `/v1/chat/completions`.
- Local test successful via `curl` → ~523 token response generated.
- Performance benchmark: ~11.5 tokens/sec (CPU-only on Ryzen 7800X).
- Confirmed usable for salience scoring, summarization, and lightweight reasoning.