##### Project Lyra - README v0.3.0 #####
Lyra is a modular persistent AI companion system.
It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**,
with optional subconscious annotation powered by **Cortex VM** running local LLMs.
## Mission Statement ##
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesiac and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her abilities as a notepad/schedule/database/co-creator/collaborator, all with its own executive function. Say something in passing, and Lyra remembers it, then reminds you of it later.
---
## Structure ##
Project Lyra exists as a series of Docker containers that run independently of each other but are all networked together. Think of it as how the brain has regions; Lyra has modules:
## A. VM 100 - lyra-core:
1. **Core v0.3.1 - Docker Stack**
- Relay - (docker container) - The main harness that connects the modules together and accepts input from the user.
- UI - (HTML) - This is how the user communicates with Lyra. At the moment it is a typical instant-message interface, but plans are to make it much more than that.
- Persona - (docker container) - This is the personality of Lyra; set how you want her to behave and give specific instructions for output. Basically prompt injection.
- All of this is built and controlled by a single .env and docker-compose.lyra.yml.
2. **NeoMem v0.1.0 - (docker stack)**
- NeoMem is Lyra's main long-term memory database. It is a fork of mem0 OSS. Uses vector databases and a graph store.
- NeoMem launches with a single separate docker-compose.neomem.yml.
## B. VM 101 - lyra-cortex
3. **Cortex - VM containing docker stack**
- This is the working reasoning layer of Lyra.
- Built to be flexible in deployment. Run it locally or remotely (via WAN/LAN).
- Intake v0.1.0 - (docker container) gives conversations context and purpose.
  - Intake takes the last N exchanges and summarizes them into coherent short-term memories.
  - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20, etc. (see the sketch after this list).
  - Keeps the bot aware of what is going on without having to send it the whole chat every time.
- Cortex - Docker container containing:
  - Reasoning Layer
  - TBD
- Reflect - (docker container) - Not yet implemented; on the road map.
  - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherent thoughts and reduces the noise.
  - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams).
  - This stage is not yet built; this is just an idea.
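One plausible reading of that cascade, as a minimal sketch: `summarize` stands in for the real LLM call (which this README does not show), and the every-Nth-exchange trigger rule is an assumption.

```python
# Illustrative sketch of Intake's cascading summaries (levels from the README).
# summarize() is a stand-in; the trigger rule (every Nth exchange) is assumed.

SUMMARY_LEVELS = [2, 5, 10, 15, 20]

def summarize(exchanges: list[str]) -> str:
    """Placeholder for the actual LLM summarization call."""
    return " / ".join(exchanges)[:200]

def due_levels(exchange_count: int) -> list[int]:
    """Tiers that fire at this count, e.g. count 10 fires L2, L5, and L10."""
    return [n for n in SUMMARY_LEVELS if exchange_count % n == 0]

def on_new_exchange(history: list[str]) -> list[tuple[int, str]]:
    """Summarize the trailing window for every tier that is due."""
    return [(n, summarize(history[-n:])) for n in due_levels(len(history))]
```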
## C. Remote LLM APIs:
4. **AI Backends**
- Lyra doesn't run models herself; she calls out to APIs.
- Endlessly customizable as long as the backend outputs to the same schema.
---
## 🚀 Features ##
# Lyra-Core VM (VM100)
- **Relay**: (an example call appears after this feature list)
- The main harness and orchestrator of Lyra.
- OpenAI-compatible endpoint: `POST /v1/chat/completions`
- Injects persona + relevant memories into every LLM call
- Routes all memory storage/retrieval through **NeoMem**
- Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`)
- **NeoMem (Memory Engine)**:
- Forked from Mem0 OSS and fully independent.
- Drop-in compatible API (`/memories`, `/search`).
- Local-first: runs on FastAPI with Postgres + Neo4j.
- No external SDK dependencies.
- Default service: `neomem-api` (port 7077).
- Capable of adding new memories and updating previous memories. Compares existing embeddings and performs in-place updates when a memory is judged to be a semantic match.
- **UI**:
- Lightweight static HTML chat page.
- Connects to Relay at `http://<host>:7078`.
- Nice cyberpunk theme!
- Saves and loads sessions, which are in turn sent to Relay.
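As an illustration of Relay's OpenAI-compatible endpoint above, a minimal Python call might look like the sketch below. It assumes Relay is reachable on port 7078 (the port the UI uses) and that the standard chat-completions payload applies; the `model` value is illustrative.

```python
import json
import urllib.request

# POST /v1/chat/completions against Relay; host/port assumed from the UI section.
req = urllib.request.Request(
    "http://localhost:7078/v1/chat/completions",
    data=json.dumps({
        "model": "lyra",  # illustrative; Relay routes to its configured backends
        "messages": [{"role": "user", "content": "What did we decide about Intake?"}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```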
# Beta Lyrae (RAG Memory DB) - added 11-3-25
- **RAG Knowledge DB - Beta Lyrae (sheliak)**
- This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
- It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
The system uses:
- **ChromaDB** for persistent vector storage
- **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
- **FastAPI** (port 7090) for the `/rag/search` REST endpoint
- Directory Layout
```
rag/
├── rag_chat_import.py   # imports JSON chat logs
├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
├── rag_build.py         # legacy single-folder builder
├── rag_query.py         # command-line query helper
├── rag_api.py           # FastAPI service providing /rag/search
├── chromadb/            # persistent vector store
├── chatlogs/            # organized source data
│   ├── poker/
│   ├── work/
│   ├── lyra/
│   ├── personal/
│   └── ...
└── import.log           # progress log for batch runs
```
- **OpenAI chatlog importer**
  - Takes JSON-formatted chat logs and imports them into the RAG store.
- **Features include:**
  - Recursive folder indexing with **category detection** from directory name
  - Smart chunking for long messages (5,000 chars per slice)
  - Automatic deduplication using a SHA-1 hash of file + chunk (sketched after the metadata example below)
  - Timestamps for both file modification and import time
  - Full progress logging via tqdm
  - Safe to run in the background with `nohup … &`
- Metadata per chunk:
```json
{
  "chat_id": "<sha1 of filename>",
  "chunk_index": 0,
  "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
  "title": "cortex LLMs 11-1-25",
  "role": "assistant",
  "category": "lyra",
  "type": "chat",
  "file_modified": "2025-11-06T23:41:02",
  "imported_at": "2025-11-07T03:55:00Z"
}
```
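A sketch of how that SHA-1 deduplication might derive a stable chunk ID; the exact scheme inside `rag_chat_import.py` is an assumption here.

```python
import hashlib

def chunk_id(source_path: str, chunk_index: int) -> str:
    """Stable ID from file path + chunk index, so re-imports can skip known chunks."""
    return hashlib.sha1(f"{source_path}:{chunk_index}".encode()).hexdigest()

# The same inputs always yield the same ID, which makes duplicates detectable.
print(chunk_id("chatlogs/lyra/0002_cortex_LLMs_11-1-25.json", 0))
```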
# Cortex VM (VM101, CT201)
- **CT201 main reasoning orchestrator.**
  - This is the internal brain of Lyra.
  - Running in a privileged LXC.
  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized build of vLLM that lets it use ROCm.
  - Accessible via 10.0.0.43:8000/v1/completions.
- **Intake v0.1.1**
  - Receives messages from Relay and summarizes them in a cascading format.
  - Continues to summarize smaller batches of exchanges while also generating large-scale conversational summaries (L20).
  - Intake then sends to Cortex for self-reflection and to NeoMem for memory consolidation.
- **Reflect**
  - TBD
# Self-hosted vLLM server #
- **Stack Flow**
```
[Proxmox Host]
 └── loads AMDGPU driver
 └── boots CT201 (order=2)

[CT201 GPU Container]
 ├── lyra-start-vllm.sh → starts vLLM ROCm model server
 ├── lyra-vllm.service  → runs the above automatically
 ├── lyra-core.service  → launches Cortex + Intake Docker stack
 └── Docker Compose     → runs Cortex + Intake containers

[Cortex Container]
 ├── Listens on port 7081
 ├── Talks to NVGRAM (mem API) + Intake
 └── Main relay between Lyra UI ↔ memory ↔ model

[Intake Container]
 ├── Listens on port 7080
 ├── Summarizes every few exchanges
 ├── Writes summaries to /app/logs/summaries.log
 └── Future: sends summaries → Cortex for reflection
```
# Additional information is available in the Trilium docs. #
---
## 📦 Requirements
- Docker + Docker Compose
- Postgres + Neo4j (for NeoMem)
- Access to an OpenAI- or Ollama-style API.
- OpenAI API key (for Relay fallback LLMs)
**Dependencies:**
- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama
---
## 🔌 Integration Notes
Lyra-Core connects to `neomem-api:8000` inside Docker, or `localhost:7077` locally.
API endpoints remain identical to Mem0 (`/memories`, `/search`).
History and entity graphs are managed internally via Postgres + Neo4j.
---
## 🧱 Architecture Snapshot
```
User → Relay → Cortex
                ├── [RAG Search]
                ├── [Reflection Loop]
                ├── Intake (async summaries)
                └── NeoMem (persistent memory)
```
**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
- Data Flow (sketched in code below):
  - User message enters Cortex via `/reason`.
  - Cortex assembles context:
    - Intake summaries (short-term memory)
    - RAG contextual data (knowledge base)
  - LLM generates an initial draft (`call_llm`).
  - Reflection loop critiques and refines the answer.
  - Intake asynchronously summarizes and sends snapshots to NeoMem.

RAG API Configuration:
Set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).
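A compressed sketch of that loop. `fetch_intake_summaries`, the reflection depth, and the `/rag/search` response shape are assumptions; only `call_llm` and the `/reason` route are named above.

```python
import json
import os
import urllib.request

RAG_API_URL = os.getenv("RAG_API_URL", "http://localhost:7090")

def rag_search(query: str) -> list[str]:
    """Query Beta Lyrae's /rag/search; the response field names are assumed."""
    req = urllib.request.Request(
        f"{RAG_API_URL}/rag/search",
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [str(hit) for hit in json.load(resp).get("results", [])]

def fetch_intake_summaries(session_id: str) -> list[str]:
    return []  # stand-in: the real call goes to the Intake service

def call_llm(prompt: str, context: list[str]) -> str:
    return f"[draft for: {prompt!r}]"  # stand-in for the backend LLM call

def reason(prompt: str, session_id: str) -> str:
    context = fetch_intake_summaries(session_id) + rag_search(prompt)
    draft = call_llm(prompt, context)                       # initial draft
    critique = call_llm(f"Critique:\n{draft}", context)     # reflection loop
    return call_llm(f"Refine given:\n{critique}", context)  # refined answer
```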
---
## Setup and Operation ##
## Beta Lyrae - RAG memory system ##
**Requirements**
- Environment: Python 3.10+
- Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
- Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)
**Import Chats**
- Chats need to be formatted as follows (a quick validation sketch follows this section):
```json
{
  "messages": [
    { "role": "user", "content": "Message here" },
    { "role": "assistant", "content": "Message here" }
  ]
}
```
- Organize the chats into categorical folders. This step is optional, but it helped me keep things straight.
- Run `python3 rag_chat_import.py`; chats will then be imported automatically. For reference, it took 32 minutes to import 68 chat logs (approx. 10.3 MB).
**Build API Server**
- Run `python3 rag_build.py`; this automatically builds the ChromaDB store using data saved in the `chatlogs/` folder. (A docs folder is to be added in the future.)
- Run `python3 rag_api.py`, or `uvicorn rag_api:app --host 0.0.0.0 --port 7090`.
**Query**
- Run `python3 rag_query.py "Question here?"`
- For testing, a curl command can reach it too:
```
curl -X POST http://127.0.0.1:7090/rag/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is the current state of Cortex and Project Lyra?",
"where": {"category": "lyra"}
}'
```
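For orientation, the `/rag/search` service in `rag_api.py` plausibly looks something like this minimal FastAPI + ChromaDB sketch; the collection name, result count, and response shape are assumptions, not the actual source.

```python
import chromadb
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
oai = OpenAI()  # requires OPENAI_API_KEY in the environment
store = chromadb.PersistentClient(path="./chromadb")
collection = store.get_or_create_collection("chatlogs")  # collection name assumed

class SearchRequest(BaseModel):
    query: str
    where: dict | None = None  # optional metadata filter, e.g. {"category": "lyra"}

@app.post("/rag/search")
def rag_search(req: SearchRequest):
    # Embed the query with the same model used at import time, then search Chroma.
    emb = oai.embeddings.create(
        model="text-embedding-3-small", input=req.query
    ).data[0].embedding
    hits = collection.query(query_embeddings=[emb], n_results=5, where=req.where)
    return {"results": hits["documents"][0], "metadatas": hits["metadatas"][0]}
```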
## 📖 License
NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
This fork retains the original Apache 2.0 license and adds local modifications.
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

---
# **MI50 + vLLM + Proxmox LXC Setup Guide**
### *End-to-End Field Manual for gfx906 LLM Serving*
**Version:** 1.0
**Last updated:** 2025-11-17
---
## **📌 Overview**
This guide documents how to run a **vLLM OpenAI-compatible server** on an
**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN,
and wire it into **Project Lyra's Cortex reasoning layer**.
This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again.
---
## **1. What This Stack Looks Like**
```
Proxmox Host
├─ AMD Instinct MI50 (gfx906)
├─ AMDGPU + ROCm stack
└─ LXC Container (CT 201: cortex-gpu)
├─ Ubuntu 24.04
├─ Docker + docker compose
├─ vLLM inside Docker (nalanzeyu/vllm-gfx906)
├─ GPU passthrough via /dev/kfd + /dev/dri + PCI bind
└─ vLLM API exposed on :8000
Lyra Cortex (VM/Server)
└─ LLM_PRIMARY_URL=http://10.0.0.43:8000
```
---
## **2. Proxmox Host — GPU Setup**
### **2.1 Confirm MI50 exists**
```bash
lspci -nn | grep -i 'vega\|instinct\|radeon'
```
You should see something like:
```
0a:00.0 Display controller: AMD Instinct MI50 (gfx906)
```
### **2.2 Load AMDGPU driver**
This is the main pitfall after **any host reboot**.
```bash
modprobe amdgpu
```
If you skip this, the LXC container won't see the GPU.
---
## **3. LXC Container Configuration (CT 201)**
The container ID is **201**.
Config file is at:
```
/etc/pve/lxc/201.conf
```
### **3.1 Working 201.conf**
Paste this *exact* version:
```ini
arch: amd64
cores: 4
hostname: cortex-gpu
memory: 16384
swap: 512
ostype: ubuntu
onboot: 1
startup: order=2,up=10,down=10
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth
rootfs: local-lvm:vm-201-disk-0,size=200G
unprivileged: 0
# Docker in LXC requires this
features: keyctl=1,nesting=1
lxc.apparmor.profile: unconfined
lxc.cap.drop:
# --- GPU passthrough for ROCm (MI50) ---
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir
lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir
# Bind the MI50 PCI device
lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file
# Allow GPU-related character devices
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 29:* rwm
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.cgroup2.devices.allow: c 238:* rwm
lxc.cgroup2.devices.allow: c 241:* rwm
lxc.cgroup2.devices.allow: c 242:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
lxc.cgroup2.devices.allow: c 244:* rwm
lxc.cgroup2.devices.allow: c 245:* rwm
lxc.cgroup2.devices.allow: c 246:* rwm
lxc.cgroup2.devices.allow: c 247:* rwm
lxc.cgroup2.devices.allow: c 248:* rwm
lxc.cgroup2.devices.allow: c 249:* rwm
lxc.cgroup2.devices.allow: c 250:* rwm
lxc.cgroup2.devices.allow: c 510:0 rwm
```
### **3.2 Restart sequence**
```bash
pct stop 201
modprobe amdgpu
pct start 201
pct enter 201
```
---
## **4. Inside CT 201 — Verifying ROCm + GPU Visibility**
### **4.1 Check device nodes**
```bash
ls -l /dev/kfd
ls -l /dev/dri
ls -l /opt/rocm
```
All must exist.
### **4.2 Validate GPU via rocminfo**
```bash
/opt/rocm/bin/rocminfo | grep -i gfx
```
You need to see:
```
gfx906
```
If you see **nothing**, the GPU isn't passed through; restart and re-check the host steps.
---
## **5. Install Docker in the LXC (Ubuntu 24.04)**
This container runs Docker inside LXC (nesting enabled).
```bash
apt update
apt install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
> /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
Check:
```bash
docker --version
docker compose version
```
---
## **6. Running vLLM Inside CT 201 via Docker**
### **6.1 Create directory**
```bash
mkdir -p /root/vllm
cd /root/vllm
```
### **6.2 docker-compose.yml**
Save this exact file as `/root/vllm/docker-compose.yml`:
```yaml
version: "3.9"
services:
vllm-mi50:
image: nalanzeyu/vllm-gfx906:latest
container_name: vllm-mi50
restart: unless-stopped
ports:
- "8000:8000"
environment:
VLLM_ROLE: "APIServer"
VLLM_MODEL: "/model"
VLLM_LOGGING_LEVEL: "INFO"
command: >
vllm serve /model
--host 0.0.0.0
--port 8000
--dtype float16
--max-model-len 4096
--api-type openai
devices:
- "/dev/kfd:/dev/kfd"
- "/dev/dri:/dev/dri"
volumes:
- /opt/rocm:/opt/rocm:ro
```
### **6.3 Start vLLM**
```bash
docker compose up -d
docker compose logs -f
```
When healthy, you'll see:
```
(APIServer) Application startup complete.
```
and periodic throughput logs.
---
## **7. Test vLLM API**
### **7.1 From Proxmox host**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping","max_tokens":5}'
```
Should respond like:
```json
{"choices":[{"text":"-pong"}]}
```
### **7.2 From Cortex machine**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}'
```
---
## **8. Wiring into Lyra Cortex**
In the `cortex` container's `docker-compose.yml`:
```yaml
environment:
LLM_PRIMARY_URL: http://10.0.0.43:8000
```
Do not include `/v1/completions`; the router appends that automatically.
In `cortex/.env`:
```env
LLM_FORCE_BACKEND=primary
LLM_MODEL=/model
```
Test:
```bash
curl -X POST http://10.0.0.41:7081/reason \
-H "Content-Type: application/json" \
-d '{"prompt":"test vllm","session_id":"dev"}'
```
If you get a meaningful response: **Cortex → vLLM is online**.
---
## **9. Common Failure Modes (And Fixes)**
### **9.1 “Failed to infer device type”**
vLLM cannot see any ROCm devices.
Fix:
```bash
# On host
modprobe amdgpu
pct stop 201
pct start 201
# In container
/opt/rocm/bin/rocminfo | grep -i gfx
docker compose up -d
```
### **9.2 GPU disappears after reboot**
Same fix:
```bash
modprobe amdgpu
pct stop 201
pct start 201
```
### **9.3 Invalid image name**
If you see pull errors:
```
pull access denied for nalanzeuy...
```
Use:
```
image: nalanzeyu/vllm-gfx906
```
### **9.4 Double `/v1` in URL**
Ensure:
```
LLM_PRIMARY_URL=http://10.0.0.43:8000
```
Router appends `/v1/completions`.
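One way a router can guard against the double-`/v1` mistake is to normalize the base URL before appending the path. A hypothetical sketch (Cortex's actual router code is not shown in this guide):

```python
def completions_url(base_url: str) -> str:
    """Append /v1/completions exactly once, tolerating trailing slashes or /v1."""
    base = base_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return f"{base}/v1/completions"

assert completions_url("http://10.0.0.43:8000") == "http://10.0.0.43:8000/v1/completions"
assert completions_url("http://10.0.0.43:8000/v1/") == "http://10.0.0.43:8000/v1/completions"
```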
---
## **10. Daily / Reboot Ritual**
### **On Proxmox host**
```bash
modprobe amdgpu
pct stop 201
pct start 201
```
### **Inside CT 201**
```bash
/opt/rocm/bin/rocminfo | grep -i gfx
cd /root/vllm
docker compose up -d
docker compose logs -f
```
### **Test API**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping","max_tokens":5}'
```
---
## **11. Summary**
You now have:
* **MI50 (gfx906)** correctly passed into LXC
* **ROCm** inside the container via bind mounts
* **vLLM** running inside Docker in the LXC
* **OpenAI-compatible API** on port 8000
* **Lyra Cortex** using it automatically as primary backend
This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime.