{
"title": "Model inventory review",
"messages": [
{
"role": "user",
"content": "Id like to take inventory of all the models we have running atm. Like what models are installed on ct201. What is on ollama etc"
},
{
"role": "tool",
"content": "All the files uploaded by the user have been fully loaded. Searching won't provide additional information."
},
{
"role": "assistant",
"content": "Perfect — based on your **README.md** and **CHANGELOG.md**, heres your current model inventory by host and function:\n\n---\n\n### 🧠 **CT201 (Cortex VM)**\n**Purpose:** Local reasoning + salience classification \n**Model Stack (llama.cpp / vLLM / CPU & GPU mix):**\n- **Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M (GGUF)** → running on `llama-server` \n → Used for *salience filtering* and basic reasoning (Cortex v0.2.0). \n- **Qwen2-0.5B-Instruct (GGUF)** → alternative CPU model \n → Faster (~350 tok/s), used for fallback or classification tests. \n- (planned) Larger vLLM models for reflection/reasoning once GPU passthrough is stable.\n\n**Inference Host API:** \n`http://10.0.0.40:7081/v1/chat/completions` \n(`reason` endpoint for proofing/critique loop)\n\n---\n\n### 🧩 **3090 Machine (Ollama Node)**\n**Purpose:** LAN-based local LLM & embedding service \n**Models previously verified in CHANGELOG:**\n- `mythomax:13b` — default local model for general reasoning in Relay \n- `qwen2.5:7b-instruct-q4_K_M` — used for Lyra-Mem0 seeding and LLM reasoning \n- `mxbai-embed-large` — embedding model (1024-dim) for memory vectorization \n *(used in earlier Mem0 builds before NVGRAM fork)*\n\n**Ollama Host:** `http://10.0.0.3:11434` \nUsed by both Relay (LLM calls) and NVGRAM (embedding if configured).\n\n---\n\n### ⚙️ **Lyra-Core (VM100)**\n**Purpose:** Relay + Persona + NVGRAM orchestration \n**Model usage (by API proxy):**\n- **Primary model:** OpenAI GPT (fallback; via `OPENAI_API_KEY`) \n- **Alt local backend:** Ollama (`NVGRAM_API` → `10.0.0.3:11434`) \n- **Embeddings:** `text-embedding-3-small` (OpenAI) unless `EMBEDDER_PROVIDER=ollama` enabled.\n\n---\n\n### 📦 **Summary Table**\n\n| Host / VM | Function | Model | Framework | Status |\n|------------|-----------|--------|------------|---------|\n| **CT201** | Reasoning / Salience | Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M | llama.cpp | ✅ Running |\n| | | Qwen2-0.5B-Instruct | llama.cpp | ✅ Alternate |\n| **3090 Node** | General chat / embeddings | Mythomax-13B | Ollama | ✅ Active |\n| | | Qwen2.5-7B-Instruct-Q4_K_M | Ollama | ✅ Verified |\n| | | Mxbai-Embed-Large | Ollama | ⚙️ Used in Mem0 seeding |\n| **Relay / Lyra-Core** | Fallback LLM | GPT (OpenAI) | API | ✅ Online |\n| **NVGRAM** | Embeddings | text-embedding-3-small | OpenAI | ✅ Default |\n\n---\n\nIf you want, I can give you the **commands to auto-scan** whats *actually installed right now* on:\n- CT201 (`llama-server` or vLLM instance list)\n- 3090 Ollama node (`ollama list` or API call)\n- Relay `.env` for current active model routing\n\nWould you like me to walk you through that verification next (one host at a time)?"
},
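{
"role": "assistant",
"content": "A convenience for later: here's a minimal scan sketch that asks each endpoint what it has loaded in one pass. It assumes the hosts and ports from the inventory above (`10.0.0.40:7081` for llama-server, `10.0.0.3:11434` for Ollama) and plain `curl`; adjust to match your real layout.\n\n```bash\n#!/usr/bin/env bash\n# Sketch: ask each inference endpoint what it has loaded.\n# Hosts/ports are assumed from the inventory above; adjust as needed.\nfor url in \\\n  \"http://10.0.0.40:7081/v1/models\" \\\n  \"http://10.0.0.3:11434/api/tags\"; do\n  echo \"== $url ==\"\n  curl -s --max-time 5 \"$url\" || echo \"(no response)\"\n  echo\ndone\n```\n\n(`/api/tags` is Ollama's model-list endpoint; `/v1/models` is the OpenAI-style one that llama-server exposes.)"
},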
{
"role": "user",
"content": "I want to check each one, how do i do that? So far i've check ollama. How do i know what llama.cpp and vllm are running?"
},
{
"role": "assistant",
"content": "Perfect — lets go one host at a time so its not overwhelming. \nSince you already checked **Ollama**, next well confirm whats active on **CT201 (Cortex)** where your `llama.cpp` and/or `vLLM` processes live.\n\n---\n\n## 🧩 Step 1 — Check if `llama.cpp` (llama-server) is running\n\nSSH into CT201:\n\n```bash\nssh brian@ct201\n```\n\nThen run:\n\n```bash\nps aux | grep llama\n```\n\nTypical outputs:\n```\nbrian 1432 1.7 6.2 ... ./llama-server --model models/Phi-3.5-mini.Q4_K_M.gguf --port 7081\n```\n\n✅ **If you see `llama-server`** in the list → good, its running. \nTake note of the model path after `--model`. That tells you **which GGUF file is loaded**.\n\nIf you want the quick API check (less noisy):\n\n```bash\ncurl http://localhost:7081/v1/models | jq\n```\n\nIf `llama-server` is compatible, itll return something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"phi-3.5-mini-instruct-uncensored-Q4_K_M\", \"object\": \"model\" }\n ]\n}\n```\n\n---\n\n## 🧠 Step 2 — Check if `vLLM` is running\n\nStill on CT201, try:\n\n```bash\nps aux | grep vllm\n```\n\nIf nothing shows, vLLM isnt active. \nIf it *is* running, youll see something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /models/qwen2.5-7b-instruct\n```\n\nYou can confirm with the API too:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n(Replace port if you set something else in your compose file or startup command.)\n\n---\n\n## 🧩 Step 3 — Summary of what to expect\n| Tool | Command | What Youll See |\n|------|----------|----------------|\n| llama.cpp | `ps aux | grep llama` | running process + model path |\n| llama.cpp API | `curl localhost:7081/v1/models` | JSON list of loaded models |\n| vLLM | `ps aux | grep vllm` | python process with model path |\n| vLLM API | `curl localhost:8000/v1/models` | JSON list of models |\n\n---\n\nWould you like me to help you interpret your `ps aux` or `curl` outputs once you run them on CT201? \n(Well move to Relay after that.)"
},
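{
"role": "assistant",
"content": "One more trick before you run those: if the `ps` output is noisy, you can go port-first instead. A sketch assuming the default ports above (7081 for llama-server, 8000 for vLLM):\n\n```bash\n# Show which process owns each inference port (7081 = llama-server, 8000 = vLLM here).\nss -tlnp | grep -E ':(7081|8000)\\b'\n```\n\nWhatever PID shows up is the one to inspect with `ps` or `lsof`."
},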
{
"role": "user",
"content": "root@cortex-gpu:~# curl http://localhost:8000/v1/models | jq\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n100 462 100 462 0 0 48837 0 --:--:-- --:--:-- --:--:-- 51333\nCommand 'jq' not found, but can be installed with:\napt install jq\nroot@cortex-gpu:~#\n\n"
},
{
"role": "assistant",
"content": "Perfect — that means your **vLLM API is responding** on port `8000`, so its active ✅ \nThe only reason you didnt see the JSON nicely formatted is because `jq` isnt installed yet.\n\nLets fix that first so its easier to read responses:\n\n```bash\napt update && apt install -y jq\n```\n\nThen rerun:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n\nThat will print something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"qwen2.5-7b-instruct\", \"object\": \"model\" }\n ]\n}\n```\n\nThat `\"id\"` field is the model currently loaded by vLLM.\n\n---\n\nOnce you have that output, copy or screenshot the model list for me — well log exactly which model is live on CT201 under vLLM.\n\nAfter that, Ill show you how to check **if llama.cpp (llama-server)** is running too, since both can coexist on that machine."
},
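{
"role": "assistant",
"content": "Side note: if you'd rather not install anything on that box, Python's built-in pretty-printer does the same job as `jq` for reading these responses:\n\n```bash\ncurl -s http://localhost:8000/v1/models | python3 -m json.tool\n```\n\nHandy on minimal containers where adding packages is a hassle."
},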
{
"role": "user",
"content": "{\n \"object\": \"list\",\n \"data\": [\n {\n \"id\": \"/model\",\n \"object\": \"model\",\n \"created\": 1761942546,\n \"owned_by\": \"vllm\",\n \"root\": \"/model\",\n \"parent\": null,\n \"max_model_len\": 4096,\n \"permission\": [\n {\n \"id\": \"modelperm-e942c9778305406295491cd90f6b4391\",\n \"object\": \"model_permission\",\n \"created\": 1761942546,\n \"allow_create_engine\": false,\n \"allow_sampling\": true,\n \"allow_logprobs\": true,\n \"allow_search_indices\": false,\n \"allow_view\": true,\n \"allow_fine_tuning\": false,\n \"organization\": \"*\",\n \"group\": null,\n \"is_blocking\": false\n }\n ]\n }\n ]\n}"
},
{
"role": "assistant",
"content": "Nice — that tells us **vLLM is up**, but its running a **generic mount named `/model`**, not a specifically labeled model like `qwen2.5-7b-instruct` or `phi-3.5-mini`. \n\nThat usually means one of two things:\n\n1. **You launched vLLM with a path argument instead of a model ID**, e.g. \n ```bash\n python3 -m vllm.entrypoints.openai.api_server --model /model\n ```\n where `/model` is a mounted folder (maybe from Docker Compose).\n\n2. The model inside `/model` hasnt been registered with a name (its just the directory). \n\nLets confirm exactly whats inside that path:\n\n```bash\nls -lh /model\n```\n\nTypical results might show something like:\n```\ntotal 30G\n-rw-r--r-- 1 root root 30G Oct 30 qwen2.5-7b-instruct\n```\n\nIf you see a folder (e.g., `/model/qwen2.5-7b-instruct`), thats your actual model. \n\n---\n\nOnce you check whats inside `/model`, well know exactly what vLLM has loaded. \nThen well move on to checking **llama.cpp** next."
},
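{
"role": "assistant",
"content": "One more shortcut while you're looking at that path: if `/model` holds a Hugging Face-format checkpoint (the usual case for vLLM mounts), its `config.json` names the architecture. A sketch assuming that layout; the exact fields vary by model:\n\n```bash\n# Peek at the model's own metadata (assumes an HF-style directory in /model).\npython3 - <<'EOF'\nimport json\ncfg = json.load(open('/model/config.json'))\nprint('name:', cfg.get('_name_or_path'))\nprint('type:', cfg.get('model_type'))\nEOF\n```\n\nEven if `_name_or_path` is empty, `model_type` (e.g. `qwen2`, `phi3`) usually narrows it down."
},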
{
"role": "user",
"content": "we def hgave this set up in an odd way. We had to use a custom fork of vLLM to get it running on the HI50. (too old for anything else) \nit also runs in venv stuff"
},
{
"role": "assistant",
"content": "That makes perfect sense — the MI50 (gfx906) setup wouldve forced a lot of odd workarounds. \nMost standard vLLM builds dropped ROCm ≤5.5 long ago, so a custom fork in a venv is the only way to make it play nice.\n\nLets adapt your check commands for that setup.\n\n---\n\n### 🧩 Step 1 — Activate your vLLM venv\nFrom your CT201 shell:\n```bash\ncd /path/to/your/vllm-venv/\nsource bin/activate\n```\n\n(If youre not sure of the exact path, run `ls / | grep vllm` or `find / -type d -name \"vllm-venv\" 2>/dev/null`.)\n\n---\n\n### 🧩 Step 2 — Verify which vLLM fork + model is active\nOnce youre inside the venv (your prompt should change to show `(venv)`), run:\n```bash\nps -ef | grep vllm\n```\n\nYoure looking for something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /model --port 8000 ...\n```\nThat confirms the fork is actually running from your venvs Python path.\n\nIf you see a different model folder there, note it.\n\n---\n\n### 🧩 Step 3 — Inspect `/model` mount from that venv\nYou can safely peek:\n```bash\nls -lh /model\n```\nIf its a symlink or a mounted directory, well trace it with:\n```bash\nreadlink -f /model\n```\n\nThatll tell us where it actually points — often something like:\n```\n/home/brian/models/qwen2.5-7b-instruct/\n```\n\n---\n\n### 🧩 Step 4 — Optional sanity ping\nYou can also query the running fork for its model metadata (since some forks still expose `/version`):\n```bash\ncurl http://localhost:8000/version\n```\nIf supported, itll return something like:\n```\n{\"version\": \"vllm-rocm-hi50-custom\", \"backend\": \"rocm\", \"device_count\": 1}\n```\n\n---\n\nLets start with just **Step 1 → Step 2** first. \nPaste back what you get from:\n```bash\nps -ef | grep vllm\n``` \nThen well identify exactly which model and fork your ROCm build is using."
},
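{
"role": "assistant",
"content": "Also, once the venv is active, you can ask the fork to identify itself directly. This assumes it kept vLLM's standard package metadata, which most forks do:\n\n```bash\n# From inside the activated venv: print the fork's version string and install path.\npython3 -c \"import vllm; print(vllm.__version__, vllm.__file__)\"\n```\n\nThe printed path also confirms which venv the running server belongs to."
},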
{
"role": "user",
"content": "okay, it seems like you have no idea how to do this so i'll just figure it out myself no problem."
},
{
"role": "assistant",
"content": "Totally fair — youre deep in custom-hardware territory there, and every MI50 + ROCm fork ends up a little Frankenstein-ed. 😅 \n\nIf you want a simpler starting point later, heres the minimalist check pattern that works for any of your stacks (no matter how patched):\n\n```bash\nsudo netstat -tulpn | grep -E \"8000|7081|11434\"\n```\n→ shows which ports (vLLM, llama-server, Ollama) are actually listening and which process owns them.\n\nThen for any process you spot:\n```bash\nsudo lsof -p <PID> | grep model\n```\n→ reveals exactly which model file it loaded.\n\nThats the clean, vendor-agnostic way to inventory whats running, even with custom venvs or forks. \nIf you hit a weird output and want help deciphering it, just paste the snippet here and Ill tell you whats what — no guessing this time."
}
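{
"role": "assistant",
"content": "If you ever want that pattern as a one-shot script, here's a sketch that walks the three default ports and prints the owning process plus any model files it has open. It assumes `ss` and `lsof` are installed and that the ports match yours:\n\n```bash\n#!/usr/bin/env bash\n# Sketch: per-port process + loaded model files (ports assumed from above).\nfor port in 8000 7081 11434; do\n  pid=$(ss -tlnp \"sport = :$port\" 2>/dev/null | grep -oP 'pid=\\K[0-9]+' | head -1)\n  [ -z \"$pid\" ] && { echo \"port $port: nothing listening\"; continue; }\n  echo \"port $port: PID $pid ($(ps -p \"$pid\" -o comm=))\"\n  lsof -p \"$pid\" 2>/dev/null | grep -i model\ndone\n```\n\nNo guessing about which stack owns what; the kernel's socket table settles it."
}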
]
}