serversdwn
2025-11-17 03:41:51 -05:00
3 changed files with 1324 additions and 908 deletions

##### Project Lyra - README v0.3.0 - needs fixing #####
Lyra is a modular persistent AI companion system.
It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**,
with optional subconscious annotation powered by **Cortex VM** running local LLMs.
## Mission Statement ##
The point of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator with its own executive function: say something in passing, and Lyra remembers it and reminds you of it later.
---
## Structure ##
Project Lyra exists as a series of Docker containers that run independently of each other but are all networked together. Just as the brain has regions, Lyra has modules:
## A. VM 100 - lyra-core:
1. **Core v0.3.1 - Docker Stack**
   - Relay - (Docker container) - The main harness that connects the modules together and accepts input from the user.
   - UI - (HTML) - How the user communicates with Lyra. At the moment it is a typical instant-message interface, but plans are to make it much more than that.
   - Persona - (Docker container) - The personality of Lyra: set how you want her to behave and give specific instructions for output. Basically prompt injection.
   - All of this is built and controlled by a single .env and docker-compose.lyra.yml.
2. **NeoMem v0.1.0 - (Docker stack)**
   - NeoMem is Lyra's main long-term memory database. It is a fork of Mem0 OSS and uses vector and graph databases.
   - NeoMem launches with a single separate docker-compose.neomem.yml.
## B. VM 101 - lyra-cortex
3. **Cortex - VM containing Docker stack**
   - This is the working reasoning layer of Lyra.
   - Built to be flexible in deployment: run it locally or remotely (via WAN/LAN).
   - Intake v0.1.0 - (Docker container) - gives conversations context and purpose.
     - Intake takes the last N exchanges and summarizes them into coherent short-term memories.
     - Uses a cascading summarization setup that quantizes the exchanges. Summaries occur at L2, L5, L10, L15, L20, etc.
     - Keeps the bot aware of what is going on without having to send it the whole chat every time.
   - Cortex - Docker container containing:
     - Reasoning Layer
     - TBD
   - Reflect - (Docker container) - Not yet implemented; on the roadmap.
     - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps contain memories to coherent thoughts and reduces noise.
     - Can be done actively and asynchronously, or on a time basis (think human sleep and dreams).
     - This stage is not yet built; this is just an idea.
## C. Remote LLM APIs:
4. **AI Backends**
   - Lyra doesn't run models herself; she calls out to APIs.
   - Endlessly customizable as long as the backend outputs to the same schema (see the example below).
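For reference, "the same schema" means the OpenAI-style completion shape that Relay and Cortex already speak. A minimal sketch, assuming an OpenAI-compatible backend on port 8000 (the same call shape used in the vLLM guide below; host and model name are placeholders):
```bash
curl -X POST http://<llm-host>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
```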
---
## 🚀 Features ##
# Lyra-Core VM (VM100)
- **Relay**:
  - The main harness and orchestrator of Lyra.
  - OpenAI-compatible endpoint: `POST /v1/chat/completions` (a smoke-test example follows this feature list)
  - Injects persona + relevant memories into every LLM call
  - Routes all memory storage/retrieval through **NeoMem**
  - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`)
- **NeoMem (Memory Engine)**:
  - Forked from Mem0 OSS and fully independent.
  - Drop-in compatible API (`/memories`, `/search`); example calls follow this feature list.
  - Local-first: runs on FastAPI with Postgres + Neo4j.
  - No external SDK dependencies.
  - Default service: `neomem-api` (port 7077).
  - Capable of adding new memories and updating previous ones: compares existing embeddings and performs in-place updates when a memory is judged to be a semantic match.
- **UI**:
  - Lightweight static HTML chat page.
  - Connects to Relay at `http://<host>:7078`.
  - Nice cyberpunk theme!
  - Saves and loads sessions, which are in turn sent to Relay.
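A minimal smoke test of the Relay endpoint, assuming it listens on the same port 7078 the UI uses and accepts a standard OpenAI-style chat payload (the model name here is a placeholder):
```bash
curl -X POST http://<host>:7078/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "lyra",
        "messages": [{"role": "user", "content": "Remind me what we decided about Intake summaries."}]
      }'
```
And a rough way to poke NeoMem directly, assuming the payload shapes follow the Mem0 OSS conventions the fork inherits (`messages` + `user_id` for adds, `query` + `user_id` for search); adjust field names if your fork has diverged:
```bash
# Add a memory (user_id is a placeholder)
curl -X POST http://localhost:7077/memories \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"The MI50 lives in CT201."}],"user_id":"demo"}'

# Search memories
curl -X POST http://localhost:7077/search \
  -H "Content-Type: application/json" \
  -d '{"query":"Where is the MI50?","user_id":"demo"}'
```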
# Beta Lyrae (RAG Memory DB) - added 11-3-25
- **RAG Knowledge DB - Beta Lyrae (Sheliak)**
  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
  The system uses:
  - **ChromaDB** for persistent vector storage
  - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
  - **FastAPI** (port 7090) for the `/rag/search` REST endpoint
- Directory Layout
  rag/
  ├── rag_chat_import.py   # imports JSON chat logs
  ├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
  ├── rag_build.py         # legacy single-folder builder
  ├── rag_query.py         # command-line query helper
  ├── rag_api.py           # FastAPI service providing /rag/search
  ├── chromadb/            # persistent vector store
  ├── chatlogs/            # organized source data
  │   ├── poker/
  │   ├── work/
  │   ├── lyra/
  │   ├── personal/
  │   └── ...
  └── import.log           # progress log for batch runs
- **OpenAI chatlog importer**
  - Takes JSON-formatted chat logs and imports them into the RAG store.
  - **Features include:**
    - Recursive folder indexing with **category detection** from directory name
    - Smart chunking for long messages (5,000 chars per slice)
    - Automatic deduplication using a SHA-1 hash of file + chunk (illustrated after this section)
    - Timestamps for both file modification and import time
    - Full progress logging via tqdm
    - Safe to run in the background with `nohup … &`
  - Metadata per chunk:
    ```json
    {
      "chat_id": "<sha1 of filename>",
      "chunk_index": 0,
      "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
      "title": "cortex LLMs 11-1-25",
      "role": "assistant",
      "category": "lyra",
      "type": "chat",
      "file_modified": "2025-11-06T23:41:02",
      "imported_at": "2025-11-07T03:55:00Z"
    }
    ```
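As a rough illustration of those dedup keys (the exact string the importer hashes is an assumption; the metadata above only pins `chat_id` to the SHA-1 of the filename):
```bash
# chat_id: SHA-1 of the source filename (per the metadata schema above)
printf '%s' "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json" | sha1sum | cut -d' ' -f1

# per-chunk dedup key: SHA-1 over file + chunk index (exact concatenation assumed)
printf '%s' "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json#0" | sha1sum | cut -d' ' -f1
```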
# Cortex VM (VM101, CT201)
- **CT201 main reasoning orchestrator.**
  - This is the internal brain of Lyra.
  - Runs in a privileged LXC.
  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
  - Accessible via 10.0.0.43:8000/v1/completions.
- **Intake v0.1.1**
  - Receives messages from Relay and summarizes them in a cascading format (see the sketch after this list).
  - Continues to summarize smaller batches of exchanges while also generating large-scale conversational summaries (L20).
  - Intake then sends output to Cortex for self-reflection and to NeoMem for memory consolidation.
- **Reflect**
  - TBD
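A minimal sketch of the cascade trigger described above, assuming summaries fire at exchange counts 2, 5, and every multiple of 5 after that, with a larger rollup every 20 (the thresholds come from the Structure section; the function name and exact rollup rule are hypothetical):
```bash
#!/usr/bin/env bash
# Hypothetical illustration of Intake's cascade thresholds (L2, L5, L10, L15, L20, ...).
should_summarize() {
  local n="$1"   # number of exchanges seen so far in the session
  if [ "$n" -eq 2 ]; then
    echo "trigger L2 micro-summary"
  elif [ "$n" -ge 5 ] && [ $((n % 5)) -eq 0 ]; then
    echo "trigger L${n} summary"
    if [ $((n % 20)) -eq 0 ]; then
      echo "also roll up a large-scale L20 conversation summary"
    fi
  fi
}

for i in $(seq 1 20); do should_summarize "$i"; done
```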
# Self-hosted vLLM server #
- **CT201 main reasoning orchestrator.**
  - This is the internal brain of Lyra.
  - Runs in a privileged LXC.
  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
  - Accessible via 10.0.0.43:8000/v1/completions.
- **Stack Flow**
  [Proxmox Host]
  └── loads AMDGPU driver
      └── boots CT201 (order=2)
  [CT201 GPU Container]
  ├── lyra-start-vllm.sh → starts vLLM ROCm model server
  ├── lyra-vllm.service → runs the above automatically
  ├── lyra-core.service → launches Cortex + Intake Docker stack
  └── Docker Compose → runs Cortex + Intake containers
  [Cortex Container]
  ├── Listens on port 7081
  ├── Talks to NVGRAM (mem API) + Intake
  └── Main relay between Lyra UI ↔ memory ↔ model
  [Intake Container]
  ├── Listens on port 7080
  ├── Summarizes every few exchanges
  ├── Writes summaries to /app/logs/summaries.log
  └── Future: sends summaries → Cortex for reflection
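Assuming the two units named above are installed inside CT201, the boot chain can be enabled and checked with standard systemd commands (the unit files themselves are not reproduced in this README):
```bash
# Enable the boot chain inside CT201
systemctl enable --now lyra-vllm.service lyra-core.service

# Verify both services came back after a reboot
systemctl status lyra-vllm.service lyra-core.service --no-pager
```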
# Additional information is available in the Trilium docs. #
---
## 📦 Requirements
- Docker + Docker Compose
- Postgres + Neo4j (for NeoMem)
- Access to an OpenAI- or Ollama-style API.
- OpenAI API key (for Relay fallback LLMs)
**Dependencies:**
- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama
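If you are installing these by hand rather than through the Docker images, a pinned install is simply (assumes Python 3.10+ with pip on the path):
```bash
pip install \
  fastapi==0.115.8 \
  uvicorn==0.34.0 \
  pydantic==2.10.4 \
  python-dotenv==1.0.1 \
  "psycopg>=3.2.8" \
  ollama
```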
---
🔌 Integration Notes
Lyra-Core connects to neomem-api:8000 inside Docker or localhost:7077 locally.
API endpoints remain identical to Mem0 (/memories, /search).
History and entity graphs are managed internally via Postgres + Neo4j.
---
🧱 Architecture Snapshot
User → Relay → Cortex
  [RAG Search]
  [Reflection Loop]
  Intake (async summaries)
  NeoMem (persistent memory)
**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**
- Data Flow:
  - User message enters Cortex via /reason.
  - Cortex assembles context:
    - Intake summaries (short-term memory)
    - RAG contextual data (knowledge base)
  - LLM generates an initial draft (call_llm).
  - Reflection loop critiques and refines the answer.
  - Intake asynchronously summarizes and sends snapshots to NeoMem.
RAG API Configuration:
Set RAG_API_URL in .env (default: http://localhost:7090).
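The whole loop can be exercised directly against Cortex (port 7081, per the stack flow above) using the same `/reason` call shown in the vLLM guide below; the host IP is the one from that guide, so substitute your own:
```bash
curl -X POST http://10.0.0.41:7081/reason \
  -H "Content-Type: application/json" \
  -d '{"prompt":"What did we decide about Intake summaries?","session_id":"dev"}'
```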
---
## Setup and Operation ##
## Beta Lyrae - RAG memory system ##
**Requirements**
- Env: Python 3.10+
- Dependencies: pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq
- Persistent storage path: ./chromadb (can be moved to /mnt/data/lyra_rag_db)
**Import Chats**
- Chats need to be formatted as:
```
{
  "messages": [
    {
      "role": "user",
      "content": "Message here"
    },
    {
      "role": "assistant",
      "content": "Message here"
    }
  ]
}
```
- Organize the chats into categorical folders. This step is optional, but it helped me keep things straight.
- Run `python3 rag_chat_import.py`; chats will then be imported automatically. For reference, it took 32 minutes to import 68 chat logs (approx. 10.3 MB).
**Build API Server**
- Run: rag_build.py; this automatically builds the ChromaDB using data saved in the /chatlogs/ folder. (A docs folder is to be added in the future.)
- Run: rag_api.py or `uvicorn rag_api:app --host 0.0.0.0 --port 7090`
**Query**
- Run: python3 rag_query.py "Question here?"
- For testing, a curl command can reach it too:
```
curl -X POST http://127.0.0.1:7090/rag/search \
  -H "Content-Type: application/json" \
  -d '{
        "query": "What is the current state of Cortex and Project Lyra?",
        "where": {"category": "lyra"}
      }'
```
# Beta Lyrae - RAG System
## 📖 License
NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
This fork retains the original Apache 2.0 license and adds local modifications.
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

vllm-mi50.md (new file)
---
# **MI50 + vLLM + Proxmox LXC Setup Guide**
### *End-to-End Field Manual for gfx906 LLM Serving*
**Version:** 1.0
**Last updated:** 2025-11-17
---
## **📌 Overview**
This guide documents how to run a **vLLM OpenAI-compatible server** on an
**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN,
and wire it into **Project Lyra's Cortex reasoning layer**.
This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again.
---
## **1. What This Stack Looks Like**
```
Proxmox Host
├─ AMD Instinct MI50 (gfx906)
├─ AMDGPU + ROCm stack
└─ LXC Container (CT 201: cortex-gpu)
├─ Ubuntu 24.04
├─ Docker + docker compose
├─ vLLM inside Docker (nalanzeyu/vllm-gfx906)
├─ GPU passthrough via /dev/kfd + /dev/dri + PCI bind
└─ vLLM API exposed on :8000
Lyra Cortex (VM/Server)
└─ LLM_PRIMARY_URL=http://10.0.0.43:8000
```
---
## **2. Proxmox Host — GPU Setup**
### **2.1 Confirm MI50 exists**
```bash
lspci -nn | grep -i 'vega\|instinct\|radeon'
```
You should see something like:
```
0a:00.0 Display controller: AMD Instinct MI50 (gfx906)
```
### **2.2 Load AMDGPU driver**
The main pitfall after **any host reboot**.
```bash
modprobe amdgpu
```
If you skip this, the LXC container won't see the GPU.
---
## **3. LXC Container Configuration (CT 201)**
The container ID is **201**.
Config file is at:
```
/etc/pve/lxc/201.conf
```
### **3.1 Working 201.conf**
Paste this *exact* version:
```ini
arch: amd64
cores: 4
hostname: cortex-gpu
memory: 16384
swap: 512
ostype: ubuntu
onboot: 1
startup: order=2,up=10,down=10
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth
rootfs: local-lvm:vm-201-disk-0,size=200G
unprivileged: 0
# Docker in LXC requires this
features: keyctl=1,nesting=1
lxc.apparmor.profile: unconfined
lxc.cap.drop:
# --- GPU passthrough for ROCm (MI50) ---
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir
lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir
# Bind the MI50 PCI device
lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file
# Allow GPU-related character devices
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 29:* rwm
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.cgroup2.devices.allow: c 238:* rwm
lxc.cgroup2.devices.allow: c 241:* rwm
lxc.cgroup2.devices.allow: c 242:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
lxc.cgroup2.devices.allow: c 244:* rwm
lxc.cgroup2.devices.allow: c 245:* rwm
lxc.cgroup2.devices.allow: c 246:* rwm
lxc.cgroup2.devices.allow: c 247:* rwm
lxc.cgroup2.devices.allow: c 248:* rwm
lxc.cgroup2.devices.allow: c 249:* rwm
lxc.cgroup2.devices.allow: c 250:* rwm
lxc.cgroup2.devices.allow: c 510:0 rwm
```
### **3.2 Restart sequence**
```bash
pct stop 201
modprobe amdgpu
pct start 201
pct enter 201
```
---
## **4. Inside CT 201 — Verifying ROCm + GPU Visibility**
### **4.1 Check device nodes**
```bash
ls -l /dev/kfd
ls -l /dev/dri
ls -l /opt/rocm
```
All must exist.
### **4.2 Validate GPU via rocminfo**
```bash
/opt/rocm/bin/rocminfo | grep -i gfx
```
You need to see:
```
gfx906
```
If you see **nothing**, the GPU isn't passed through — restart and re-check the host steps.
---
## **5. Install Docker in the LXC (Ubuntu 24.04)**
This container runs Docker inside LXC (nesting enabled).
```bash
apt update
apt install -y ca-certificates curl gnupg
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
> /etc/apt/sources.list.d/docker.list
apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
Check:
```bash
docker --version
docker compose version
```
---
## **6. Running vLLM Inside CT 201 via Docker**
### **6.1 Create directory**
```bash
mkdir -p /root/vllm
cd /root/vllm
```
### **6.2 docker-compose.yml**
Save this exact file as `/root/vllm/docker-compose.yml`:
```yaml
version: "3.9"
services:
vllm-mi50:
image: nalanzeyu/vllm-gfx906:latest
container_name: vllm-mi50
restart: unless-stopped
ports:
- "8000:8000"
environment:
VLLM_ROLE: "APIServer"
VLLM_MODEL: "/model"
VLLM_LOGGING_LEVEL: "INFO"
command: >
vllm serve /model
--host 0.0.0.0
--port 8000
--dtype float16
--max-model-len 4096
--api-type openai
devices:
- "/dev/kfd:/dev/kfd"
- "/dev/dri:/dev/dri"
volumes:
- /opt/rocm:/opt/rocm:ro
```
### **6.3 Start vLLM**
```bash
docker compose up -d
docker compose logs -f
```
When healthy, you'll see:
```
(APIServer) Application startup complete.
```
and periodic throughput logs.
---
## **7. Test vLLM API**
### **7.1 From Proxmox host**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping","max_tokens":5}'
```
Should respond like:
```json
{"choices":[{"text":"-pong"}]}
```
### **7.2 From Cortex machine**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}'
```
---
## **8. Wiring into Lyra Cortex**
In the `cortex` container's `docker-compose.yml`:
```yaml
environment:
LLM_PRIMARY_URL: http://10.0.0.43:8000
```
Do not include `/v1/completions`; the router appends that path automatically.
In `cortex/.env`:
```env
LLM_FORCE_BACKEND=primary
LLM_MODEL=/model
```
Test:
```bash
curl -X POST http://10.0.0.41:7081/reason \
-H "Content-Type: application/json" \
-d '{"prompt":"test vllm","session_id":"dev"}'
```
If you get a meaningful response: **Cortex → vLLM is online**.
---
## **9. Common Failure Modes (And Fixes)**
### **9.1 “Failed to infer device type”**
vLLM cannot see any ROCm devices.
Fix:
```bash
# On host
modprobe amdgpu
pct stop 201
pct start 201
# In container
/opt/rocm/bin/rocminfo | grep -i gfx
docker compose up -d
```
### **9.2 GPU disappears after reboot**
Same fix:
```bash
modprobe amdgpu
pct stop 201
pct start 201
```
### **9.3 Invalid image name**
If you see pull errors:
```
pull access denied for nalanzeuy...
```
Use:
```
image: nalanzeyu/vllm-gfx906
```
### **9.4 Double `/v1` in URL**
Ensure:
```
LLM_PRIMARY_URL=http://10.0.0.43:8000
```
Router appends `/v1/completions`.
---
## **10. Daily / Reboot Ritual**
### **On Proxmox host**
```bash
modprobe amdgpu
pct stop 201
pct start 201
```
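If you would rather not retype the `modprobe` after every host reboot, the standard systemd mechanism can load the module at boot; whether you want amdgpu bound automatically is up to you:
```bash
# On the Proxmox host: load amdgpu automatically at boot
echo amdgpu > /etc/modules-load.d/amdgpu.conf
```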
### **Inside CT 201**
```bash
/opt/rocm/bin/rocminfo | grep -i gfx
cd /root/vllm
docker compose up -d
docker compose logs -f
```
### **Test API**
```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"/model","prompt":"ping","max_tokens":5}'
```
---
## **11. Summary**
You now have:
* **MI50 (gfx906)** correctly passed into LXC
* **ROCm** inside the container via bind mounts
* **vLLM** running inside Docker in the LXC
* **OpenAI-compatible API** on port 8000
* **Lyra Cortex** using it automatically as primary backend
This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade/replace models anytime.
---