Initial clean commit - unified Lyra stack
rag/chatlogs/lyra/0017_Model_inventory_review.json (new file, 57 lines)
@@ -0,0 +1,57 @@
{
"title": "Model inventory review",
"messages": [
{
"role": "user",
"content": "I’d like to take inventory of all the models we have running atm. Like what models are installed on ct201. What is on ollama etc"
},
{
"role": "tool",
"content": "All the files uploaded by the user have been fully loaded. Searching won't provide additional information."
},
{
"role": "assistant",
"content": "Perfect — based on your **README.md** and **CHANGELOG.md**, here’s your current model inventory by host and function:\n\n---\n\n### 🧠 **CT201 (Cortex VM)**\n**Purpose:** Local reasoning + salience classification \n**Model Stack (llama.cpp / vLLM / CPU & GPU mix):**\n- **Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M (GGUF)** → running on `llama-server` \n → Used for *salience filtering* and basic reasoning (Cortex v0.2.0). \n- **Qwen2-0.5B-Instruct (GGUF)** → alternative CPU model \n → Faster (~350 tok/s), used for fallback or classification tests. \n- (planned) Larger vLLM models for reflection/reasoning once GPU passthrough is stable.\n\n**Inference Host API:** \n`http://10.0.0.40:7081/v1/chat/completions` \n(`reason` endpoint for proofing/critique loop)\n\n---\n\n### 🧩 **3090 Machine (Ollama Node)**\n**Purpose:** LAN-based local LLM & embedding service \n**Models previously verified in CHANGELOG:**\n- `mythomax:13b` — default local model for general reasoning in Relay \n- `qwen2.5:7b-instruct-q4_K_M` — used for Lyra-Mem0 seeding and LLM reasoning \n- `mxbai-embed-large` — embedding model (1024-dim) for memory vectorization \n *(used in earlier Mem0 builds before NVGRAM fork)*\n\n**Ollama Host:** `http://10.0.0.3:11434` \nUsed by both Relay (LLM calls) and NVGRAM (embedding if configured).\n\n---\n\n### ⚙️ **Lyra-Core (VM100)**\n**Purpose:** Relay + Persona + NVGRAM orchestration \n**Model usage (by API proxy):**\n- **Primary model:** OpenAI GPT (fallback; via `OPENAI_API_KEY`) \n- **Alt local backend:** Ollama (`NVGRAM_API` → `10.0.0.3:11434`) \n- **Embeddings:** `text-embedding-3-small` (OpenAI) unless `EMBEDDER_PROVIDER=ollama` enabled.\n\n---\n\n### 📦 **Summary Table**\n\n| Host / VM | Function | Model | Framework | Status |\n|------------|-----------|--------|------------|---------|\n| **CT201** | Reasoning / Salience | Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M | llama.cpp | ✅ Running |\n| | | Qwen2-0.5B-Instruct | llama.cpp | ✅ Alternate |\n| **3090 Node** | General chat / embeddings | Mythomax-13B | Ollama | ✅ Active |\n| | | Qwen2.5-7B-Instruct-Q4_K_M | Ollama | ✅ Verified |\n| | | Mxbai-Embed-Large | Ollama | ⚙️ Used in Mem0 seeding |\n| **Relay / Lyra-Core** | Fallback LLM | GPT (OpenAI) | API | ✅ Online |\n| **NVGRAM** | Embeddings | text-embedding-3-small | OpenAI | ✅ Default |\n\n---\n\nIf you want, I can give you the **commands to auto-scan** what’s *actually installed right now* on:\n- CT201 (`llama-server` or vLLM instance list)\n- 3090 Ollama node (`ollama list` or API call)\n- Relay `.env` for current active model routing\n\nWould you like me to walk you through that verification next (one host at a time)?"
},
{
"role": "user",
"content": "I want to check each one, how do i do that? So far i've check ollama. How do i know what llama.cpp and vllm are running?"
},
{
"role": "assistant",
"content": "Perfect — let’s go one host at a time so it’s not overwhelming. \nSince you already checked **Ollama**, next we’ll confirm what’s active on **CT201 (Cortex)** where your `llama.cpp` and/or `vLLM` processes live.\n\n---\n\n## 🧩 Step 1 — Check if `llama.cpp` (llama-server) is running\n\nSSH into CT201:\n\n```bash\nssh brian@ct201\n```\n\nThen run:\n\n```bash\nps aux | grep llama\n```\n\nTypical outputs:\n```\nbrian 1432 1.7 6.2 ... ./llama-server --model models/Phi-3.5-mini.Q4_K_M.gguf --port 7081\n```\n\n✅ **If you see `llama-server`** in the list → good, it’s running. \nTake note of the model path after `--model`. That tells you **which GGUF file is loaded**.\n\nIf you want the quick API check (less noisy):\n\n```bash\ncurl http://localhost:7081/v1/models | jq\n```\n\nIf `llama-server` is compatible, it’ll return something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"phi-3.5-mini-instruct-uncensored-Q4_K_M\", \"object\": \"model\" }\n ]\n}\n```\n\n---\n\n## 🧠 Step 2 — Check if `vLLM` is running\n\nStill on CT201, try:\n\n```bash\nps aux | grep vllm\n```\n\nIf nothing shows, vLLM isn’t active. \nIf it *is* running, you’ll see something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /models/qwen2.5-7b-instruct\n```\n\nYou can confirm with the API too:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n(Replace port if you set something else in your compose file or startup command.)\n\n---\n\n## 🧩 Step 3 — Summary of what to expect\n| Tool | Command | What You’ll See |\n|------|----------|----------------|\n| llama.cpp | `ps aux | grep llama` | running process + model path |\n| llama.cpp API | `curl localhost:7081/v1/models` | JSON list of loaded models |\n| vLLM | `ps aux | grep vllm` | python process with model path |\n| vLLM API | `curl localhost:8000/v1/models` | JSON list of models |\n\n---\n\nWould you like me to help you interpret your `ps aux` or `curl` outputs once you run them on CT201? \n(We’ll move to Relay after that.)"
},
{
"role": "user",
"content": "root@cortex-gpu:~# curl http://localhost:8000/v1/models | jq\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n100 462 100 462 0 0 48837 0 --:--:-- --:--:-- --:--:-- 51333\nCommand 'jq' not found, but can be installed with:\napt install jq\nroot@cortex-gpu:~#\n\n"
},
{
"role": "assistant",
"content": "Perfect — that means your **vLLM API is responding** on port `8000`, so it’s active ✅ \nThe only reason you didn’t see the JSON nicely formatted is because `jq` isn’t installed yet.\n\nLet’s fix that first so it’s easier to read responses:\n\n```bash\napt update && apt install -y jq\n```\n\nThen rerun:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n\nThat will print something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"qwen2.5-7b-instruct\", \"object\": \"model\" }\n ]\n}\n```\n\nThat `\"id\"` field is the model currently loaded by vLLM.\n\n---\n\nOnce you have that output, copy or screenshot the model list for me — we’ll log exactly which model is live on CT201 under vLLM.\n\nAfter that, I’ll show you how to check **if llama.cpp (llama-server)** is running too, since both can coexist on that machine."
},
{
"role": "user",
"content": "{\n \"object\": \"list\",\n \"data\": [\n {\n \"id\": \"/model\",\n \"object\": \"model\",\n \"created\": 1761942546,\n \"owned_by\": \"vllm\",\n \"root\": \"/model\",\n \"parent\": null,\n \"max_model_len\": 4096,\n \"permission\": [\n {\n \"id\": \"modelperm-e942c9778305406295491cd90f6b4391\",\n \"object\": \"model_permission\",\n \"created\": 1761942546,\n \"allow_create_engine\": false,\n \"allow_sampling\": true,\n \"allow_logprobs\": true,\n \"allow_search_indices\": false,\n \"allow_view\": true,\n \"allow_fine_tuning\": false,\n \"organization\": \"*\",\n \"group\": null,\n \"is_blocking\": false\n }\n ]\n }\n ]\n}"
},
{
"role": "assistant",
"content": "Nice — that tells us **vLLM is up**, but it’s running a **generic mount named `/model`**, not a specifically labeled model like `qwen2.5-7b-instruct` or `phi-3.5-mini`. \n\nThat usually means one of two things:\n\n1. **You launched vLLM with a path argument instead of a model ID**, e.g. \n ```bash\n python3 -m vllm.entrypoints.openai.api_server --model /model\n ```\n where `/model` is a mounted folder (maybe from Docker Compose).\n\n2. The model inside `/model` hasn’t been registered with a name (it’s just the directory). \n\nLet’s confirm exactly what’s inside that path:\n\n```bash\nls -lh /model\n```\n\nTypical results might show something like:\n```\ntotal 30G\n-rw-r--r-- 1 root root 30G Oct 30 qwen2.5-7b-instruct\n```\n\nIf you see a folder (e.g., `/model/qwen2.5-7b-instruct`), that’s your actual model. \n\n---\n\nOnce you check what’s inside `/model`, we’ll know exactly what vLLM has loaded. \nThen we’ll move on to checking **llama.cpp** next."
},
{
"role": "user",
"content": "we def hgave this set up in an odd way. We had to use a custom fork of vLLM to get it running on the HI50. (too old for anything else) \nit also runs in venv stuff"
},
{
"role": "assistant",
"content": "That makes perfect sense — the MI50 (gfx906) setup would’ve forced a lot of odd workarounds. \nMost standard vLLM builds dropped ROCm ≤5.5 long ago, so a custom fork in a venv is the only way to make it play nice.\n\nLet’s adapt your check commands for that setup.\n\n---\n\n### 🧩 Step 1 — Activate your vLLM venv\nFrom your CT201 shell:\n```bash\ncd /path/to/your/vllm-venv/\nsource bin/activate\n```\n\n(If you’re not sure of the exact path, run `ls / | grep vllm` or `find / -type d -name \"vllm-venv\" 2>/dev/null`.)\n\n---\n\n### 🧩 Step 2 — Verify which vLLM fork + model is active\nOnce you’re inside the venv (your prompt should change to show `(venv)`), run:\n```bash\nps -ef | grep vllm\n```\n\nYou’re looking for something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /model --port 8000 ...\n```\nThat confirms the fork is actually running from your venv’s Python path.\n\nIf you see a different model folder there, note it.\n\n---\n\n### 🧩 Step 3 — Inspect `/model` mount from that venv\nYou can safely peek:\n```bash\nls -lh /model\n```\nIf it’s a symlink or a mounted directory, we’ll trace it with:\n```bash\nreadlink -f /model\n```\n\nThat’ll tell us where it actually points — often something like:\n```\n/home/brian/models/qwen2.5-7b-instruct/\n```\n\n---\n\n### 🧩 Step 4 — Optional sanity ping\nYou can also query the running fork for its model metadata (since some forks still expose `/version`):\n```bash\ncurl http://localhost:8000/version\n```\nIf supported, it’ll return something like:\n```\n{\"version\": \"vllm-rocm-hi50-custom\", \"backend\": \"rocm\", \"device_count\": 1}\n```\n\n---\n\nLet’s start with just **Step 1 → Step 2** first. \nPaste back what you get from:\n```bash\nps -ef | grep vllm\n``` \nThen we’ll identify exactly which model and fork your ROCm build is using."
},
{
"role": "user",
"content": "okay, it seems like you have no idea how to do this so i'll just figure it out myself no problem."
},
{
"role": "assistant",
"content": "Totally fair — you’re deep in custom-hardware territory there, and every MI50 + ROCm fork ends up a little Frankenstein-ed. 😅 \n\nIf you want a simpler starting point later, here’s the minimalist check pattern that works for any of your stacks (no matter how patched):\n\n```bash\nsudo netstat -tulpn | grep -E \"8000|7081|11434\"\n```\n→ shows which ports (vLLM, llama-server, Ollama) are actually listening and which process owns them.\n\nThen for any process you spot:\n```bash\nsudo lsof -p <PID> | grep model\n```\n→ reveals exactly which model file it loaded.\n\nThat’s the clean, vendor-agnostic way to inventory what’s running, even with custom venvs or forks. \nIf you hit a weird output and want help deciphering it, just paste the snippet here and I’ll tell you what’s what — no guessing this time."
}
]
}