##### Project Lyra - README v0.3.0 #####

Lyra is a modular persistent AI companion system.
It provides memory-backed chat using **NeoMem** + **Relay** + **Persona Sidecar**,
with optional subconscious annotation powered by **Cortex VM** running local LLMs.

## Mission Statement ##

The goal of Project Lyra is to give an AI chatbot more abilities than a typical chatbot. Typical chatbots are essentially amnesic and forget everything about your project. Lyra keeps projects organized and remembers everything you have done. Think of her as a notepad, schedule, database, co-creator, and collaborator all in one, with her own executive function. Say something in passing, and Lyra remembers it and reminds you of it later.

---

## Structure ##

Project Lyra exists as a series of Docker containers that run independently of each other but are all networked together. Just as the brain has regions, Lyra has modules:
## A. VM 100 - lyra-core

1. **Core v0.3.1 - Docker Stack**
   - Relay (Docker container) - the main harness that connects the modules together and accepts input from the user.
   - UI (HTML) - how the user communicates with Lyra. At the moment it is a typical instant-message interface, but there are plans to make it much more than that.
   - Persona (Docker container) - the personality of Lyra; set how you want her to behave and give specific instructions for output. Basically prompt injection.
   - All of this is built and controlled by a single `.env` and `docker-compose.lyra.yml`.

2. **NeoMem v0.1.0 - Docker Stack**
   - NeoMem is Lyra's main long-term memory database. It is a fork of Mem0 OSS and uses vector and graph databases.
   - NeoMem launches with its own separate `docker-compose.neomem.yml`.

## B. VM 101 - lyra-cortex

3. **Cortex - VM containing a Docker stack**
   - This is the working reasoning layer of Lyra.
   - Built to be flexible in deployment: run it locally or remotely (via WAN/LAN).
   - Intake v0.1.0 (Docker container) - gives conversations context and purpose.
     - Intake takes the last N exchanges and summarizes them into coherent short-term memories.
     - Uses a cascading summarization setup that quantizes the exchanges; summaries occur at L2, L5, L10, L15, L20, etc.
     - Keeps the bot aware of what is going on without having to send it the whole chat every time.
   - Cortex (Docker container) containing:
     - Reasoning layer
     - TBD
   - Reflect (Docker container) - not yet implemented; on the roadmap.
     - Calls back to NeoMem after N exchanges and N summaries and edits memories created during the initial messaging step. This helps consolidate memories into coherent thoughts and reduces noise.
     - Can run actively and asynchronously, or on a schedule (think human sleep and dreams).
     - This stage is not yet built; it is just an idea.

## C. Remote LLM APIs

4. **AI Backends**
   - Lyra doesn't run models herself; she calls out to APIs.
   - Endlessly customizable, as long as the backend outputs to the same schema.

---

## 🚀 Features ##

# Lyra-Core VM (VM100)

- **Relay**:
  - The main harness and orchestrator of Lyra.
  - OpenAI-compatible endpoint: `POST /v1/chat/completions`
  - Injects persona + relevant memories into every LLM call
  - Routes all memory storage/retrieval through **NeoMem**
  - Logs spans (`neomem.add`, `neomem.search`, `persona.fetch`, `llm.generate`)

- **NeoMem (Memory Engine)**:
  - Forked from Mem0 OSS and fully independent.
  - Drop-in compatible API (`/memories`, `/search`).
  - Local-first: runs on FastAPI with Postgres + Neo4j.
  - No external SDK dependencies.
  - Default service: `neomem-api` (port 7077).
  - Capable of adding new memories and updating previous ones: it compares existing embeddings and performs in-place updates when a memory is judged to be a semantic match.

- **UI**:
  - Lightweight static HTML chat page.
  - Connects to Relay at `http://<host>:7078`.
  - Nice cyberpunk theme!
  - Saves and loads sessions, which are in turn sent to Relay.
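
Because Relay speaks the OpenAI chat-completions schema, any OpenAI-style client can talk to it directly. Here is a minimal sketch in Python using `requests`, assuming Relay serves the route on the same port the UI uses (7078); the model string is an illustrative placeholder, not a value Relay requires.

```python
# Minimal sketch: call Relay's OpenAI-compatible endpoint directly.
# Port 7078 and the /v1/chat/completions path come from this README;
# the model string is an illustrative placeholder.
import requests

resp = requests.post(
    "http://localhost:7078/v1/chat/completions",
    json={
        "model": "lyra",  # placeholder; Relay routes to its configured backend
        "messages": [{"role": "user", "content": "What did we decide yesterday?"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```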

# Beta Lyrae (RAG Memory DB) - added 11-3-25

- **RAG Knowledge DB - Beta Lyrae (Sheliak)**
  - This module implements the **Retrieval-Augmented Generation (RAG)** layer for Project Lyra.
  - It serves as the long-term searchable memory store that Cortex and Relay can query for relevant context before reasoning or response generation.
  - The system uses:
    - **ChromaDB** for persistent vector storage
    - **OpenAI Embeddings (`text-embedding-3-small`)** for semantic similarity
    - **FastAPI** (port 7090) for the `/rag/search` REST endpoint

- Directory layout:

      rag/
      ├── rag_chat_import.py   # imports JSON chat logs
      ├── rag_docs_import.py   # (planned) PDF/EPUB/manual importer
      ├── rag_build.py         # legacy single-folder builder
      ├── rag_query.py         # command-line query helper
      ├── rag_api.py           # FastAPI service providing /rag/search
      ├── chromadb/            # persistent vector store
      ├── chatlogs/            # organized source data
      │   ├── poker/
      │   ├── work/
      │   ├── lyra/
      │   ├── personal/
      │   └── ...
      └── import.log           # progress log for batch runs

- **OpenAI chatlog importer**
  - Takes JSON-formatted chat logs and imports them into the RAG.
  - **Features include:**
    - Recursive folder indexing with **category detection** from the directory name
    - Smart chunking for long messages (5,000 chars per slice)
    - Automatic deduplication using a SHA-1 hash of file + chunk
    - Timestamps for both file modification and import time
    - Full progress logging via tqdm
    - Safe to run in the background with `nohup … &`
  - Metadata per chunk:

    ```json
    {
      "chat_id": "<sha1 of filename>",
      "chunk_index": 0,
      "source": "chatlogs/lyra/0002_cortex_LLMs_11-1-25.json",
      "title": "cortex LLMs 11-1-25",
      "role": "assistant",
      "category": "lyra",
      "type": "chat",
      "file_modified": "2025-11-06T23:41:02",
      "imported_at": "2025-11-07T03:55:00Z"
    }
    ```
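
The chunking and dedup scheme above can be pictured with a small sketch; `chunk_text` and `make_chunk_id` are illustrative stand-ins, not the actual functions in `rag_chat_import.py`.

```python
# Minimal sketch of the importer's dedup scheme as described above:
# a SHA-1 over file identity plus chunk index yields a stable ID, so
# re-running the import can skip chunks ChromaDB already holds.
import hashlib

def make_chunk_id(source_path: str, chunk_index: int) -> str:
    return hashlib.sha1(f"{source_path}:{chunk_index}".encode("utf-8")).hexdigest()

def chunk_text(text: str, size: int = 5000) -> list[str]:
    # "Smart chunking" is approximated here as fixed 5,000-char slices.
    return [text[i:i + size] for i in range(0, len(text), size)]
```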

# Cortex VM (VM101, CT201)

- **CT201 - main reasoning orchestrator**
  - This is the internal brain of Lyra.
  - Runs in a privileged LXC.
  - Currently a locally served LLM running on a Radeon Instinct MI50, using a customized version of vLLM that lets it use ROCm.
  - Accessible via `http://10.0.0.43:8000/v1/completions`.

- **Intake v0.1.1**
  - Receives messages from Relay and summarizes them in a cascading format (a trigger sketch follows below).
  - Continues to summarize smaller batches of exchanges while also generating large-scale conversational summaries (L20).
  - Intake then sends output to Cortex for self-reflection and to NeoMem for memory consolidation.

- **Reflect**
  - TBD
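
A minimal sketch of the cascading trigger, assuming a summary level fires whenever the running exchange count hits a multiple of that level; the level set (L2, L5, L10, L15, L20) comes from this README, while the function name is illustrative.

```python
# Minimal sketch of Intake's cascading summarization trigger, assuming a
# summary level fires when the exchange count is a multiple of it.
LEVELS = (2, 5, 10, 15, 20)

def levels_due(exchange_count: int) -> list[int]:
    return [lvl for lvl in LEVELS if exchange_count % lvl == 0]

# Example: at exchange 20, the L2, L5, L10, and L20 summaries would all fire.
print(levels_due(20))  # [2, 5, 10, 20]
```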

# Self hosted vLLM server #

- **CT201 main reasoning orchestrator** - serves the locally hosted LLM described in the Cortex VM section above.
- **Stack Flow**

      [Proxmox Host]
      └── loads AMDGPU driver
          └── boots CT201 (order=2)

      [CT201 GPU Container]
      ├── lyra-start-vllm.sh  → starts vLLM ROCm model server
      ├── lyra-vllm.service   → runs the above automatically
      ├── lyra-core.service   → launches Cortex + Intake Docker stack
      └── Docker Compose      → runs Cortex + Intake containers

      [Cortex Container]
      ├── Listens on port 7081
      ├── Talks to NVGRAM (mem API) + Intake
      └── Main relay between Lyra UI ↔ memory ↔ model

      [Intake Container]
      ├── Listens on port 7080
      ├── Summarizes every few exchanges
      ├── Writes summaries to /app/logs/summaries.log
      └── Future: sends summaries → Cortex for reflection

# Additional information is available in the Trilium docs. #

---

## 📦 Requirements

- Docker + Docker Compose
- Postgres + Neo4j (for NeoMem)
- Access to an OpenAI- or Ollama-style API
- OpenAI API key (for Relay fallback LLMs)

**Dependencies:**

- fastapi==0.115.8
- uvicorn==0.34.0
- pydantic==2.10.4
- python-dotenv==1.0.1
- psycopg>=3.2.8
- ollama

---

## 🔌 Integration Notes

Lyra-Core connects to `neomem-api:8000` inside Docker, or `localhost:7077` locally.

API endpoints remain identical to Mem0 (`/memories`, `/search`).

History and entity graphs are managed internally via Postgres + Neo4j.
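
Since NeoMem keeps Mem0's endpoints, a Mem0-style client call works against it. A minimal sketch follows; the routes and port come from this README, but the payload field names are assumptions based on Mem0's API shape and may differ in NeoMem.

```python
# Minimal sketch of hitting NeoMem's Mem0-compatible API from the host.
# /memories, /search, and port 7077 come from this README; the payload
# fields (messages, user_id, query) are assumed from Mem0's API shape.
import requests

BASE = "http://localhost:7077"

add = requests.post(f"{BASE}/memories", json={
    "messages": [{"role": "user", "content": "Relay listens on port 7078."}],
    "user_id": "demo",  # illustrative user id
})
add.raise_for_status()

hits = requests.post(f"{BASE}/search", json={
    "query": "Which port is Relay on?",
    "user_id": "demo",
})
hits.raise_for_status()
print(hits.json())
```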

---

## 🧱 Architecture Snapshot

    User → Relay → Cortex
                     ↓
               [RAG Search]
                     ↓
            [Reflection Loop]
                     ↓
        Intake (async summaries)
                     ↓
        NeoMem (persistent memory)

**Cortex v0.4.1 introduces the first fully integrated reasoning loop.**

- Data flow:
  - The user message enters Cortex via `/reason`.
  - Cortex assembles context:
    - Intake summaries (short-term memory)
    - RAG contextual data (knowledge base)
  - The LLM generates an initial draft (`call_llm`).
  - The reflection loop critiques and refines the answer.
  - Intake asynchronously summarizes and sends snapshots to NeoMem.

RAG API configuration:
Set `RAG_API_URL` in `.env` (default: `http://localhost:7090`).
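
The data flow above can be read as a short program. This is a stub sketch only: every function body is a placeholder, and just the order of operations comes from this README.

```python
# Stub sketch of the Cortex v0.4.1 reasoning loop described above.
def fetch_summaries(session_id: str) -> list[str]:
    return [f"[Intake summary for {session_id}]"]      # short-term memory

def rag_search(query: str) -> list[str]:
    return [f"[RAG hit for {query!r}]"]                # knowledge-base context

def call_llm(prompt: str, context: list[str]) -> str:
    return f"draft({prompt!r}, {len(context)} ctx items)"  # initial draft

def reflect(draft: str) -> str:
    return draft + " -> refined"                       # critique-and-refine pass

def reason(user_msg: str, session_id: str) -> str:
    context = fetch_summaries(session_id) + rag_search(user_msg)
    final = reflect(call_llm(user_msg, context))
    # Intake summarization and the NeoMem snapshot happen asynchronously.
    return final

print(reason("test", "dev"))
```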

---

## Setup and Operation ##

## Beta Lyrae - RAG memory system ##

**Requirements**

- Environment: Python 3.10+
- Dependencies: `pip install chromadb openai tqdm python-dotenv fastapi uvicorn jq`
- Persistent storage path: `./chromadb` (can be moved to `/mnt/data/lyra_rag_db`)

**Import Chats**

- Chats need to be formatted as:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Message here"
    },
    {
      "role": "assistant",
      "content": "Message here"
    }
  ]
}
```

- Organize the chats into categorical folders. This step is optional, but it helped me keep things straight.
- Run `python3 rag_chat_import.py`; chats will then be imported automatically. For reference, it took 32 minutes to import 68 chat logs (approx. 10.3 MB).

**Build API Server**

- Run `python3 rag_build.py`; this automatically builds the ChromaDB from data saved in the `chatlogs/` folder. (A docs folder is to be added in the future.)
- Run `python3 rag_api.py`, or: `uvicorn rag_api:app --host 0.0.0.0 --port 7090`

**Query**

- Run: `python3 rag_query.py "Question here?"`
- For testing, a curl command can reach it too:

```
curl -X POST http://127.0.0.1:7090/rag/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the current state of Cortex and Project Lyra?",
    "where": {"category": "lyra"}
  }'
```

## 📖 License

NeoMem is a derivative work based on the Mem0 OSS project (Apache 2.0).
This fork retains the original Apache 2.0 license and adds local modifications.
© 2025 Terra-Mechanics / ServersDown Labs. All modifications released under Apache 2.0.

---

# **MI50 + vLLM + Proxmox LXC Setup Guide**

### *End-to-End Field Manual for gfx906 LLM Serving*

**Version:** 1.0
**Last updated:** 2025-11-17

---

## **📌 Overview**

This guide documents how to run a **vLLM OpenAI-compatible server** on an
**AMD Instinct MI50 (gfx906)** inside a **Proxmox LXC container**, expose it over LAN,
and wire it into **Project Lyra's Cortex reasoning layer**.

This file is long, specific, and intentionally leaves *nothing* out so you never have to rediscover ROCm pain rituals again.

---

## **1. What This Stack Looks Like**

```
Proxmox Host
├─ AMD Instinct MI50 (gfx906)
├─ AMDGPU + ROCm stack
└─ LXC Container (CT 201: cortex-gpu)
   ├─ Ubuntu 24.04
   ├─ Docker + docker compose
   ├─ vLLM inside Docker (nalanzeyu/vllm-gfx906)
   ├─ GPU passthrough via /dev/kfd + /dev/dri + PCI bind
   └─ vLLM API exposed on :8000

Lyra Cortex (VM/Server)
└─ LLM_PRIMARY_URL=http://10.0.0.43:8000
```

---

## **2. Proxmox Host — GPU Setup**

### **2.1 Confirm MI50 exists**

```bash
lspci -nn | grep -i 'vega\|instinct\|radeon'
```

You should see something like:

```
0a:00.0 Display controller: AMD Instinct MI50 (gfx906)
```

### **2.2 Load AMDGPU driver**

This is the main pitfall after **any host reboot**.

```bash
modprobe amdgpu
```

If you skip this, the LXC container won't see the GPU.

---

## **3. LXC Container Configuration (CT 201)**

The container ID is **201**.
Config file is at:

```
/etc/pve/lxc/201.conf
```

### **3.1 Working 201.conf**

Paste this *exact* version:

```ini
arch: amd64
cores: 4
hostname: cortex-gpu
memory: 16384
swap: 512
ostype: ubuntu
onboot: 1
startup: order=2,up=10,down=10
net0: name=eth0,bridge=vmbr0,hwaddr=BC:24:11:C6:3E:88,ip=dhcp,type=veth
rootfs: local-lvm:vm-201-disk-0,size=200G
unprivileged: 0

# Docker in LXC requires this
features: keyctl=1,nesting=1
lxc.apparmor.profile: unconfined
lxc.cap.drop:

# --- GPU passthrough for ROCm (MI50) ---
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file,mode=0666
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /sys/class/drm sys/class/drm none bind,ro,optional,create=dir
lxc.mount.entry: /opt/rocm /opt/rocm none bind,ro,optional,create=dir

# Bind the MI50 PCI device
lxc.mount.entry: /dev/bus/pci/0000:0a:00.0 dev/bus/pci/0000:0a:00.0 none bind,optional,create=file

# Allow GPU-related character devices
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 29:* rwm
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.cgroup2.devices.allow: c 238:* rwm
lxc.cgroup2.devices.allow: c 241:* rwm
lxc.cgroup2.devices.allow: c 242:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
lxc.cgroup2.devices.allow: c 244:* rwm
lxc.cgroup2.devices.allow: c 245:* rwm
lxc.cgroup2.devices.allow: c 246:* rwm
lxc.cgroup2.devices.allow: c 247:* rwm
lxc.cgroup2.devices.allow: c 248:* rwm
lxc.cgroup2.devices.allow: c 249:* rwm
lxc.cgroup2.devices.allow: c 250:* rwm
lxc.cgroup2.devices.allow: c 510:0 rwm
```

### **3.2 Restart sequence**

```bash
pct stop 201
modprobe amdgpu
pct start 201
pct enter 201
```

---

## **4. Inside CT 201 — Verifying ROCm + GPU Visibility**

### **4.1 Check device nodes**

```bash
ls -l /dev/kfd
ls -l /dev/dri
ls -l /opt/rocm
```

All must exist.

### **4.2 Validate GPU via rocminfo**

```bash
/opt/rocm/bin/rocminfo | grep -i gfx
```

You need to see:

```
gfx906
```

If you see **nothing**, the GPU isn’t passed through — restart and re-check the host steps.

---

## **5. Install Docker in the LXC (Ubuntu 24.04)**

This container runs Docker inside LXC (nesting enabled).

```bash
apt update
apt install -y ca-certificates curl gnupg

install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
chmod a+r /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
  > /etc/apt/sources.list.d/docker.list

apt update
apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

Check:

```bash
docker --version
docker compose version
```

---

## **6. Running vLLM Inside CT 201 via Docker**

### **6.1 Create directory**

```bash
mkdir -p /root/vllm
cd /root/vllm
```

### **6.2 docker-compose.yml**

Save this exact file as `/root/vllm/docker-compose.yml`:

```yaml
version: "3.9"

services:
  vllm-mi50:
    image: nalanzeyu/vllm-gfx906:latest
    container_name: vllm-mi50
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      VLLM_ROLE: "APIServer"
      VLLM_MODEL: "/model"
      VLLM_LOGGING_LEVEL: "INFO"
    command: >
      vllm serve /model
      --host 0.0.0.0
      --port 8000
      --dtype float16
      --max-model-len 4096
      --api-type openai
    devices:
      - "/dev/kfd:/dev/kfd"
      - "/dev/dri:/dev/dri"
    volumes:
      - /opt/rocm:/opt/rocm:ro
```

### **6.3 Start vLLM**

```bash
docker compose up -d
docker compose logs -f
```

When healthy, you’ll see:

```
(APIServer) Application startup complete.
```

and periodic throughput logs.

---

## **7. Test vLLM API**

### **7.1 From Proxmox host**

```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
```

It should respond with something like:

```json
{"choices":[{"text":"-pong"}]}
```

### **7.2 From Cortex machine**

```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"ping from cortex","max_tokens":5}'
```
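
The same check can be scripted. A minimal Python sketch using `requests`, with the host, port, and `/model` name taken from the compose file above:

```python
# Minimal sketch: hit the vLLM OpenAI-style completions endpoint from Python.
# Host, port, and the "/model" name come from the compose file above.
import requests

resp = requests.post(
    "http://10.0.0.43:8000/v1/completions",
    json={"model": "/model", "prompt": "ping", "max_tokens": 5},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```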

---

## **8. Wiring into Lyra Cortex**

In the `cortex` container’s `docker-compose.yml`:

```yaml
environment:
  LLM_PRIMARY_URL: http://10.0.0.43:8000
```

Not `/v1/completions`, because the router appends that automatically.

In `cortex/.env`:

```env
LLM_FORCE_BACKEND=primary
LLM_MODEL=/model
```

Test:

```bash
curl -X POST http://10.0.0.41:7081/reason \
  -H "Content-Type: application/json" \
  -d '{"prompt":"test vllm","session_id":"dev"}'
```

If you get a meaningful response: **Cortex → vLLM is online**.
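
For repeated checks, a quick end-to-end probe can replace the two curl commands. This is illustrative only; it reuses the same hosts and payloads shown above (vLLM on 10.0.0.43:8000, Cortex on 10.0.0.41:7081).

```python
# Quick end-to-end probe (illustrative): check vLLM directly, then Cortex's
# /reason route, using the same hosts and payloads as the curls above.
import requests

def up(url: str, payload: dict) -> bool:
    try:
        return requests.post(url, json=payload, timeout=30).ok
    except requests.RequestException:
        return False

print("vllm:", up("http://10.0.0.43:8000/v1/completions",
                  {"model": "/model", "prompt": "ping", "max_tokens": 5}))
print("cortex:", up("http://10.0.0.41:7081/reason",
                    {"prompt": "test vllm", "session_id": "dev"}))
```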

---

## **9. Common Failure Modes (And Fixes)**

### **9.1 “Failed to infer device type”**

vLLM cannot see any ROCm devices.

Fix:

```bash
# On host
modprobe amdgpu
pct stop 201
pct start 201

# In container
/opt/rocm/bin/rocminfo | grep -i gfx
docker compose up -d
```

### **9.2 GPU disappears after reboot**

Same fix:

```bash
modprobe amdgpu
pct stop 201
pct start 201
```

### **9.3 Invalid image name**

If you see pull errors:

```
pull access denied for nalanzeuy...
```

Use:

```
image: nalanzeyu/vllm-gfx906
```

### **9.4 Double `/v1` in URL**

Ensure:

```
LLM_PRIMARY_URL=http://10.0.0.43:8000
```

Router appends `/v1/completions`.
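
An illustrative guard against the double-`/v1` mistake: keep the base URL bare and append the API path in exactly one place, as the router does.

```python
# Illustrative guard: the base URL stays bare; the path is appended once.
BASE = "http://10.0.0.43:8000"              # no /v1 suffix here
url = BASE.rstrip("/") + "/v1/completions"  # router-style single append
assert "/v1/v1/" not in url + "/"
print(url)
```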

---

## **10. Daily / Reboot Ritual**

### **On Proxmox host**

```bash
modprobe amdgpu
pct stop 201
pct start 201
```

### **Inside CT 201**

```bash
/opt/rocm/bin/rocminfo | grep -i gfx
cd /root/vllm
docker compose up -d
docker compose logs -f
```

### **Test API**

```bash
curl -X POST http://10.0.0.43:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","prompt":"ping","max_tokens":5}'
```

---

## **11. Summary**

You now have:

* **MI50 (gfx906)** correctly passed into the LXC
* **ROCm** inside the container via bind mounts
* **vLLM** running inside Docker in the LXC
* **OpenAI-compatible API** on port 8000
* **Lyra Cortex** using it automatically as the primary backend

This is a complete, reproducible setup that survives reboots (with the modprobe ritual) and allows you to upgrade or replace models at any time.