project-lyra/rag/chatlogs/lyra/0017_Model_inventory_review.json

{
  "title": "Model inventory review",
  "messages": [
    {
      "role": "user",
      "content": "I’d like to take inventory of all the models we have running atm. Like what models are installed on ct201. What is on ollama etc"
    },
    {
      "role": "tool",
      "content": "All the files uploaded by the user have been fully loaded. Searching won't provide additional information."
    },
    {
      "role": "assistant",
      "content": "Perfect — based on your **README.md** and **CHANGELOG.md**, here’s your current model inventory by host and function:\n\n---\n\n### 🧠 **CT201 (Cortex VM)**\n**Purpose:** Local reasoning + salience classification  \n**Model Stack (llama.cpp / vLLM / CPU & GPU mix):**\n- **Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M (GGUF)** → running on `llama-server`  \n  → Used for *salience filtering* and basic reasoning (Cortex v0.2.0).  \n- **Qwen2-0.5B-Instruct (GGUF)** → alternative CPU model  \n  → Faster (~350 tok/s), used for fallback or classification tests.  \n- (planned) Larger vLLM models for reflection/reasoning once GPU passthrough is stable.\n\n**Inference Host API:**  \n`http://10.0.0.40:7081/v1/chat/completions`  \n(`reason` endpoint for proofing/critique loop)\n\n---\n\n### 🧩 **3090 Machine (Ollama Node)**\n**Purpose:** LAN-based local LLM & embedding service  \n**Models previously verified in CHANGELOG:**\n- `mythomax:13b` — default local model for general reasoning in Relay  \n- `qwen2.5:7b-instruct-q4_K_M` — used for Lyra-Mem0 seeding and LLM reasoning  \n- `mxbai-embed-large` — embedding model (1024-dim) for memory vectorization  \n  *(used in earlier Mem0 builds before NVGRAM fork)*\n\n**Ollama Host:** `http://10.0.0.3:11434`  \nUsed by both Relay (LLM calls) and NVGRAM (embedding if configured).\n\n---\n\n### ⚙️ **Lyra-Core (VM100)**\n**Purpose:** Relay + Persona + NVGRAM orchestration  \n**Model usage (by API proxy):**\n- **Primary model:** OpenAI GPT (fallback; via `OPENAI_API_KEY`)  \n- **Alt local backend:** Ollama (`NVGRAM_API` → `10.0.0.3:11434`)  \n- **Embeddings:** `text-embedding-3-small` (OpenAI) unless `EMBEDDER_PROVIDER=ollama` enabled.\n\n---\n\n### 📦 **Summary Table**\n\n| Host / VM | Function | Model | Framework | Status |\n|------------|-----------|--------|------------|---------|\n| **CT201** | Reasoning / Salience | Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M | llama.cpp | ✅ Running |\n|  |  | Qwen2-0.5B-Instruct | llama.cpp | ✅ Alternate |\n| **3090 Node** | General chat / embeddings | Mythomax-13B | Ollama | ✅ Active |\n|  |  | Qwen2.5-7B-Instruct-Q4_K_M | Ollama | ✅ Verified |\n|  |  | Mxbai-Embed-Large | Ollama | ⚙️ Used in Mem0 seeding |\n| **Relay / Lyra-Core** | Fallback LLM | GPT (OpenAI) | API | ✅ Online |\n| **NVGRAM** | Embeddings | text-embedding-3-small | OpenAI | ✅ Default |\n\n---\n\nIf you want, I can give you the **commands to auto-scan** what’s *actually installed right now* on:\n- CT201 (`llama-server` or vLLM instance list)\n- 3090 Ollama node (`ollama list` or API call)\n- Relay `.env` for current active model routing\n\nWould you like me to walk you through that verification next (one host at a time)?"
    },
    {
      "role": "user",
      "content": "I want to check each one, how do i do that? So far i've check ollama. How do i know what llama.cpp and vllm are running?"
    },
    {
      "role": "assistant",
      "content": "Perfect — let’s go one host at a time so it’s not overwhelming.  \nSince you already checked **Ollama**, next we’ll confirm what’s active on **CT201 (Cortex)** where your `llama.cpp` and/or `vLLM` processes live.\n\n---\n\n## 🧩 Step 1 — Check if `llama.cpp` (llama-server) is running\n\nSSH into CT201:\n\n```bash\nssh brian@ct201\n```\n\nThen run:\n\n```bash\nps aux | grep llama\n```\n\nTypical outputs:\n```\nbrian     1432  1.7  6.2  ...  ./llama-server --model models/Phi-3.5-mini.Q4_K_M.gguf --port 7081\n```\n\n✅ **If you see `llama-server`** in the list → good, it’s running.  \nTake note of the model path after `--model`. That tells you **which GGUF file is loaded**.\n\nIf you want the quick API check (less noisy):\n\n```bash\ncurl http://localhost:7081/v1/models | jq\n```\n\nIf `llama-server` is compatible, it’ll return something like:\n```json\n{\n  \"object\": \"list\",\n  \"data\": [\n    { \"id\": \"phi-3.5-mini-instruct-uncensored-Q4_K_M\", \"object\": \"model\" }\n  ]\n}\n```\n\n---\n\n## 🧠 Step 2 — Check if `vLLM` is running\n\nStill on CT201, try:\n\n```bash\nps aux | grep vllm\n```\n\nIf nothing shows, vLLM isn’t active.  \nIf it *is* running, you’ll see something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /models/qwen2.5-7b-instruct\n```\n\nYou can confirm with the API too:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n(Replace port if you set something else in your compose file or startup command.)\n\n---\n\n## 🧩 Step 3 — Summary of what to expect\n| Tool | Command | What You’ll See |\n|------|----------|----------------|\n| llama.cpp | `ps aux | grep llama` | running process + model path |\n| llama.cpp API | `curl localhost:7081/v1/models` | JSON list of loaded models |\n| vLLM | `ps aux | grep vllm` | python process with model path |\n| vLLM API | `curl localhost:8000/v1/models` | JSON list of models |\n\n---\n\nWould you like me to help you interpret your `ps aux` or `curl` outputs once you run them on CT201?  \n(We’ll move to Relay after that.)"
    },
    {
      "role": "user",
      "content": "root@cortex-gpu:~# curl http://localhost:8000/v1/models | jq\n  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n100   462  100   462    0     0  48837      0 --:--:-- --:--:-- --:--:-- 51333\nCommand 'jq' not found, but can be installed with:\napt install jq\nroot@cortex-gpu:~#\n\n"
    },
    {
      "role": "assistant",
      "content": "Perfect — that means your **vLLM API is responding** on port `8000`, so it’s active ✅  \nThe only reason you didn’t see the JSON nicely formatted is because `jq` isn’t installed yet.\n\nLet’s fix that first so it’s easier to read responses:\n\n```bash\napt update && apt install -y jq\n```\n\nThen rerun:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n\nThat will print something like:\n```json\n{\n  \"object\": \"list\",\n  \"data\": [\n    { \"id\": \"qwen2.5-7b-instruct\", \"object\": \"model\" }\n  ]\n}\n```\n\nThat `\"id\"` field is the model currently loaded by vLLM.\n\n---\n\nOnce you have that output, copy or screenshot the model list for me — we’ll log exactly which model is live on CT201 under vLLM.\n\nAfter that, I’ll show you how to check **if llama.cpp (llama-server)** is running too, since both can coexist on that machine."
    },
    {
      "role": "user",
      "content": "{\n  \"object\": \"list\",\n  \"data\": [\n    {\n      \"id\": \"/model\",\n      \"object\": \"model\",\n      \"created\": 1761942546,\n      \"owned_by\": \"vllm\",\n      \"root\": \"/model\",\n      \"parent\": null,\n      \"max_model_len\": 4096,\n      \"permission\": [\n        {\n          \"id\": \"modelperm-e942c9778305406295491cd90f6b4391\",\n          \"object\": \"model_permission\",\n          \"created\": 1761942546,\n          \"allow_create_engine\": false,\n          \"allow_sampling\": true,\n          \"allow_logprobs\": true,\n          \"allow_search_indices\": false,\n          \"allow_view\": true,\n          \"allow_fine_tuning\": false,\n          \"organization\": \"*\",\n          \"group\": null,\n          \"is_blocking\": false\n        }\n      ]\n    }\n  ]\n}"
    },
    {
      "role": "assistant",
      "content": "Nice — that tells us **vLLM is up**, but it’s running a **generic mount named `/model`**, not a specifically labeled model like `qwen2.5-7b-instruct` or `phi-3.5-mini`.  \n\nThat usually means one of two things:\n\n1. **You launched vLLM with a path argument instead of a model ID**, e.g.  \n   ```bash\n   python3 -m vllm.entrypoints.openai.api_server --model /model\n   ```\n   where `/model` is a mounted folder (maybe from Docker Compose).\n\n2. The model inside `/model` hasn’t been registered with a name (it’s just the directory).  \n\nLet’s confirm exactly what’s inside that path:\n\n```bash\nls -lh /model\n```\n\nTypical results might show something like:\n```\ntotal 30G\n-rw-r--r-- 1 root root 30G Oct 30 qwen2.5-7b-instruct\n```\n\nIf you see a folder (e.g., `/model/qwen2.5-7b-instruct`), that’s your actual model.  \n\n---\n\nOnce you check what’s inside `/model`, we’ll know exactly what vLLM has loaded.  \nThen we’ll move on to checking **llama.cpp** next."
    },
    {
      "role": "user",
      "content": "we def hgave this set up in an odd way. We had to use a custom fork of vLLM to get it running on the HI50. (too old for anything else) \nit also runs in venv stuff"
    },
    {
      "role": "assistant",
      "content": "That makes perfect sense — the MI50 (gfx906) setup would’ve forced a lot of odd workarounds.  \nMost standard vLLM builds dropped ROCm ≤5.5 long ago, so a custom fork in a venv is the only way to make it play nice.\n\nLet’s adapt your check commands for that setup.\n\n---\n\n### 🧩 Step 1 — Activate your vLLM venv\nFrom your CT201 shell:\n```bash\ncd /path/to/your/vllm-venv/\nsource bin/activate\n```\n\n(If you’re not sure of the exact path, run `ls / | grep vllm` or `find / -type d -name \"vllm-venv\" 2>/dev/null`.)\n\n---\n\n### 🧩 Step 2 — Verify which vLLM fork + model is active\nOnce you’re inside the venv (your prompt should change to show `(venv)`), run:\n```bash\nps -ef | grep vllm\n```\n\nYou’re looking for something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /model --port 8000 ...\n```\nThat confirms the fork is actually running from your venv’s Python path.\n\nIf you see a different model folder there, note it.\n\n---\n\n### 🧩 Step 3 — Inspect `/model` mount from that venv\nYou can safely peek:\n```bash\nls -lh /model\n```\nIf it’s a symlink or a mounted directory, we’ll trace it with:\n```bash\nreadlink -f /model\n```\n\nThat’ll tell us where it actually points — often something like:\n```\n/home/brian/models/qwen2.5-7b-instruct/\n```\n\n---\n\n### 🧩 Step 4 — Optional sanity ping\nYou can also query the running fork for its model metadata (since some forks still expose `/version`):\n```bash\ncurl http://localhost:8000/version\n```\nIf supported, it’ll return something like:\n```\n{\"version\": \"vllm-rocm-hi50-custom\", \"backend\": \"rocm\", \"device_count\": 1}\n```\n\n---\n\nLet’s start with just **Step 1 → Step 2** first.  \nPaste back what you get from:\n```bash\nps -ef | grep vllm\n```  \nThen we’ll identify exactly which model and fork your ROCm build is using."
    },
    {
      "role": "user",
      "content": "okay, it seems like you have no idea how to do this so i'll just figure it out myself no problem."
    },
    {
      "role": "assistant",
      "content": "Totally fair — you’re deep in custom-hardware territory there, and every MI50 + ROCm fork ends up a little Frankenstein-ed. 😅  \n\nIf you want a simpler starting point later, here’s the minimalist check pattern that works for any of your stacks (no matter how patched):\n\n```bash\nsudo netstat -tulpn | grep -E \"8000|7081|11434\"\n```\n→ shows which ports (vLLM, llama-server, Ollama) are actually listening and which process owns them.\n\nThen for any process you spot:\n```bash\nsudo lsof -p <PID> | grep model\n```\n→ reveals exactly which model file it loaded.\n\nThat’s the clean, vendor-agnostic way to inventory what’s running, even with custom venvs or forks.  \nIf you hit a weird output and want help deciphering it, just paste the snippet here and I’ll tell you what’s what — no guessing this time."
    }
  ]
}