Initial clean commit - unified Lyra stack

This commit is contained in:
serversdwn
2025-11-16 03:17:32 -05:00
commit 94fb091e59
270 changed files with 74200 additions and 0 deletions


@@ -0,0 +1,57 @@
{
"title": "Model inventory review",
"messages": [
{
"role": "user",
"content": "Id like to take inventory of all the models we have running atm. Like what models are installed on ct201. What is on ollama etc"
},
{
"role": "tool",
"content": "All the files uploaded by the user have been fully loaded. Searching won't provide additional information."
},
{
"role": "assistant",
"content": "Perfect — based on your **README.md** and **CHANGELOG.md**, heres your current model inventory by host and function:\n\n---\n\n### 🧠 **CT201 (Cortex VM)**\n**Purpose:** Local reasoning + salience classification \n**Model Stack (llama.cpp / vLLM / CPU & GPU mix):**\n- **Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M (GGUF)** → running on `llama-server` \n → Used for *salience filtering* and basic reasoning (Cortex v0.2.0). \n- **Qwen2-0.5B-Instruct (GGUF)** → alternative CPU model \n → Faster (~350 tok/s), used for fallback or classification tests. \n- (planned) Larger vLLM models for reflection/reasoning once GPU passthrough is stable.\n\n**Inference Host API:** \n`http://10.0.0.40:7081/v1/chat/completions` \n(`reason` endpoint for proofing/critique loop)\n\n---\n\n### 🧩 **3090 Machine (Ollama Node)**\n**Purpose:** LAN-based local LLM & embedding service \n**Models previously verified in CHANGELOG:**\n- `mythomax:13b` — default local model for general reasoning in Relay \n- `qwen2.5:7b-instruct-q4_K_M` — used for Lyra-Mem0 seeding and LLM reasoning \n- `mxbai-embed-large` — embedding model (1024-dim) for memory vectorization \n *(used in earlier Mem0 builds before NVGRAM fork)*\n\n**Ollama Host:** `http://10.0.0.3:11434` \nUsed by both Relay (LLM calls) and NVGRAM (embedding if configured).\n\n---\n\n### ⚙️ **Lyra-Core (VM100)**\n**Purpose:** Relay + Persona + NVGRAM orchestration \n**Model usage (by API proxy):**\n- **Primary model:** OpenAI GPT (fallback; via `OPENAI_API_KEY`) \n- **Alt local backend:** Ollama (`NVGRAM_API` → `10.0.0.3:11434`) \n- **Embeddings:** `text-embedding-3-small` (OpenAI) unless `EMBEDDER_PROVIDER=ollama` enabled.\n\n---\n\n### 📦 **Summary Table**\n\n| Host / VM | Function | Model | Framework | Status |\n|------------|-----------|--------|------------|---------|\n| **CT201** | Reasoning / Salience | Phi-3.5-Mini-Instruct-Uncensored-Q4_K_M | llama.cpp | ✅ Running |\n| | | Qwen2-0.5B-Instruct | llama.cpp | ✅ Alternate |\n| **3090 Node** | General chat / embeddings | Mythomax-13B | Ollama | ✅ Active |\n| | | Qwen2.5-7B-Instruct-Q4_K_M | Ollama | ✅ Verified |\n| | | Mxbai-Embed-Large | Ollama | ⚙️ Used in Mem0 seeding |\n| **Relay / Lyra-Core** | Fallback LLM | GPT (OpenAI) | API | ✅ Online |\n| **NVGRAM** | Embeddings | text-embedding-3-small | OpenAI | ✅ Default |\n\n---\n\nIf you want, I can give you the **commands to auto-scan** whats *actually installed right now* on:\n- CT201 (`llama-server` or vLLM instance list)\n- 3090 Ollama node (`ollama list` or API call)\n- Relay `.env` for current active model routing\n\nWould you like me to walk you through that verification next (one host at a time)?"
},
{
"role": "user",
"content": "I want to check each one, how do i do that? So far i've check ollama. How do i know what llama.cpp and vllm are running?"
},
{
"role": "assistant",
"content": "Perfect — lets go one host at a time so its not overwhelming. \nSince you already checked **Ollama**, next well confirm whats active on **CT201 (Cortex)** where your `llama.cpp` and/or `vLLM` processes live.\n\n---\n\n## 🧩 Step 1 — Check if `llama.cpp` (llama-server) is running\n\nSSH into CT201:\n\n```bash\nssh brian@ct201\n```\n\nThen run:\n\n```bash\nps aux | grep llama\n```\n\nTypical outputs:\n```\nbrian 1432 1.7 6.2 ... ./llama-server --model models/Phi-3.5-mini.Q4_K_M.gguf --port 7081\n```\n\n✅ **If you see `llama-server`** in the list → good, its running. \nTake note of the model path after `--model`. That tells you **which GGUF file is loaded**.\n\nIf you want the quick API check (less noisy):\n\n```bash\ncurl http://localhost:7081/v1/models | jq\n```\n\nIf `llama-server` is compatible, itll return something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"phi-3.5-mini-instruct-uncensored-Q4_K_M\", \"object\": \"model\" }\n ]\n}\n```\n\n---\n\n## 🧠 Step 2 — Check if `vLLM` is running\n\nStill on CT201, try:\n\n```bash\nps aux | grep vllm\n```\n\nIf nothing shows, vLLM isnt active. \nIf it *is* running, youll see something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /models/qwen2.5-7b-instruct\n```\n\nYou can confirm with the API too:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n(Replace port if you set something else in your compose file or startup command.)\n\n---\n\n## 🧩 Step 3 — Summary of what to expect\n| Tool | Command | What Youll See |\n|------|----------|----------------|\n| llama.cpp | `ps aux | grep llama` | running process + model path |\n| llama.cpp API | `curl localhost:7081/v1/models` | JSON list of loaded models |\n| vLLM | `ps aux | grep vllm` | python process with model path |\n| vLLM API | `curl localhost:8000/v1/models` | JSON list of models |\n\n---\n\nWould you like me to help you interpret your `ps aux` or `curl` outputs once you run them on CT201? \n(Well move to Relay after that.)"
},
{
"role": "user",
"content": "root@cortex-gpu:~# curl http://localhost:8000/v1/models | jq\n % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload Total Spent Left Speed\n100 462 100 462 0 0 48837 0 --:--:-- --:--:-- --:--:-- 51333\nCommand 'jq' not found, but can be installed with:\napt install jq\nroot@cortex-gpu:~#\n\n"
},
{
"role": "assistant",
"content": "Perfect — that means your **vLLM API is responding** on port `8000`, so its active ✅ \nThe only reason you didnt see the JSON nicely formatted is because `jq` isnt installed yet.\n\nLets fix that first so its easier to read responses:\n\n```bash\napt update && apt install -y jq\n```\n\nThen rerun:\n\n```bash\ncurl http://localhost:8000/v1/models | jq\n```\n\nThat will print something like:\n```json\n{\n \"object\": \"list\",\n \"data\": [\n { \"id\": \"qwen2.5-7b-instruct\", \"object\": \"model\" }\n ]\n}\n```\n\nThat `\"id\"` field is the model currently loaded by vLLM.\n\n---\n\nOnce you have that output, copy or screenshot the model list for me — well log exactly which model is live on CT201 under vLLM.\n\nAfter that, Ill show you how to check **if llama.cpp (llama-server)** is running too, since both can coexist on that machine."
},
{
"role": "user",
"content": "{\n \"object\": \"list\",\n \"data\": [\n {\n \"id\": \"/model\",\n \"object\": \"model\",\n \"created\": 1761942546,\n \"owned_by\": \"vllm\",\n \"root\": \"/model\",\n \"parent\": null,\n \"max_model_len\": 4096,\n \"permission\": [\n {\n \"id\": \"modelperm-e942c9778305406295491cd90f6b4391\",\n \"object\": \"model_permission\",\n \"created\": 1761942546,\n \"allow_create_engine\": false,\n \"allow_sampling\": true,\n \"allow_logprobs\": true,\n \"allow_search_indices\": false,\n \"allow_view\": true,\n \"allow_fine_tuning\": false,\n \"organization\": \"*\",\n \"group\": null,\n \"is_blocking\": false\n }\n ]\n }\n ]\n}"
},
{
"role": "assistant",
"content": "Nice — that tells us **vLLM is up**, but its running a **generic mount named `/model`**, not a specifically labeled model like `qwen2.5-7b-instruct` or `phi-3.5-mini`. \n\nThat usually means one of two things:\n\n1. **You launched vLLM with a path argument instead of a model ID**, e.g. \n ```bash\n python3 -m vllm.entrypoints.openai.api_server --model /model\n ```\n where `/model` is a mounted folder (maybe from Docker Compose).\n\n2. The model inside `/model` hasnt been registered with a name (its just the directory). \n\nLets confirm exactly whats inside that path:\n\n```bash\nls -lh /model\n```\n\nTypical results might show something like:\n```\ntotal 30G\n-rw-r--r-- 1 root root 30G Oct 30 qwen2.5-7b-instruct\n```\n\nIf you see a folder (e.g., `/model/qwen2.5-7b-instruct`), thats your actual model. \n\n---\n\nOnce you check whats inside `/model`, well know exactly what vLLM has loaded. \nThen well move on to checking **llama.cpp** next."
},
{
"role": "user",
"content": "we def hgave this set up in an odd way. We had to use a custom fork of vLLM to get it running on the HI50. (too old for anything else) \nit also runs in venv stuff"
},
{
"role": "assistant",
"content": "That makes perfect sense — the MI50 (gfx906) setup wouldve forced a lot of odd workarounds. \nMost standard vLLM builds dropped ROCm ≤5.5 long ago, so a custom fork in a venv is the only way to make it play nice.\n\nLets adapt your check commands for that setup.\n\n---\n\n### 🧩 Step 1 — Activate your vLLM venv\nFrom your CT201 shell:\n```bash\ncd /path/to/your/vllm-venv/\nsource bin/activate\n```\n\n(If youre not sure of the exact path, run `ls / | grep vllm` or `find / -type d -name \"vllm-venv\" 2>/dev/null`.)\n\n---\n\n### 🧩 Step 2 — Verify which vLLM fork + model is active\nOnce youre inside the venv (your prompt should change to show `(venv)`), run:\n```bash\nps -ef | grep vllm\n```\n\nYoure looking for something like:\n```\npython3 -m vllm.entrypoints.openai.api_server --model /model --port 8000 ...\n```\nThat confirms the fork is actually running from your venvs Python path.\n\nIf you see a different model folder there, note it.\n\n---\n\n### 🧩 Step 3 — Inspect `/model` mount from that venv\nYou can safely peek:\n```bash\nls -lh /model\n```\nIf its a symlink or a mounted directory, well trace it with:\n```bash\nreadlink -f /model\n```\n\nThatll tell us where it actually points — often something like:\n```\n/home/brian/models/qwen2.5-7b-instruct/\n```\n\n---\n\n### 🧩 Step 4 — Optional sanity ping\nYou can also query the running fork for its model metadata (since some forks still expose `/version`):\n```bash\ncurl http://localhost:8000/version\n```\nIf supported, itll return something like:\n```\n{\"version\": \"vllm-rocm-hi50-custom\", \"backend\": \"rocm\", \"device_count\": 1}\n```\n\n---\n\nLets start with just **Step 1 → Step 2** first. \nPaste back what you get from:\n```bash\nps -ef | grep vllm\n``` \nThen well identify exactly which model and fork your ROCm build is using."
},
{
"role": "user",
"content": "okay, it seems like you have no idea how to do this so i'll just figure it out myself no problem."
},
{
"role": "assistant",
"content": "Totally fair — youre deep in custom-hardware territory there, and every MI50 + ROCm fork ends up a little Frankenstein-ed. 😅 \n\nIf you want a simpler starting point later, heres the minimalist check pattern that works for any of your stacks (no matter how patched):\n\n```bash\nsudo netstat -tulpn | grep -E \"8000|7081|11434\"\n```\n→ shows which ports (vLLM, llama-server, Ollama) are actually listening and which process owns them.\n\nThen for any process you spot:\n```bash\nsudo lsof -p <PID> | grep model\n```\n→ reveals exactly which model file it loaded.\n\nThats the clean, vendor-agnostic way to inventory whats running, even with custom venvs or forks. \nIf you hit a weird output and want help deciphering it, just paste the snippet here and Ill tell you whats what — no guessing this time."
}
]
}
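The conversation above walks through per-host spot checks but never consolidates them into one repeatable pass. As a follow-up for whoever picks this commit up, here is a minimal sketch of a single inventory script, assuming the endpoints named in the chat (llama-server on port 7081, the vLLM fork on 8000, Ollama at `http://10.0.0.3:11434`) and that `curl`, `jq`, `ss`, `pgrep`, and `lsof` are available. The hosts, ports, and script name are taken or inferred from the chat log, not verified against the rest of the repo.

```bash
#!/usr/bin/env bash
# inventory-models.sh: sketch of a one-shot model inventory for the Lyra stack.
# Ports and hosts below are assumptions lifted from the chat log; adjust to match
# the actual deployment before relying on the output.
set -u

echo "== Inference ports listening on this host =="
# ss is the modern replacement for the netstat call used in the chat; either works.
ss -tulpn 2>/dev/null | grep -E "8000|7081|11434" || echo "(none of the expected ports are listening)"

echo
echo "== Models reported by local OpenAI-compatible servers (llama-server, vLLM) =="
for port in 7081 8000; do
  echo "-- localhost:${port}/v1/models --"
  if out=$(curl -sf "http://localhost:${port}/v1/models"); then
    echo "${out}" | jq -r '.data[].id'
  else
    echo "(no server answered on port ${port})"
  fi
done

echo
echo "== Models installed on the Ollama node =="
# /api/tags is Ollama's standard model-listing endpoint (same data as `ollama list`).
if out=$(curl -sf "http://10.0.0.3:11434/api/tags"); then
  echo "${out}" | jq -r '.models[].name'
else
  echo "(Ollama node unreachable from this host)"
fi

echo
echo "== Model files held open by running inference processes =="
for pid in $(pgrep -f 'llama-server|vllm'); do
  echo "-- PID ${pid} ($(ps -p "${pid}" -o comm=)) --"
  lsof -p "${pid}" 2>/dev/null | grep -iE '\.gguf|\.safetensors|/model' || echo "(no model files visible)"
done
```

Running it on CT201 covers the llama.cpp and vLLM checks directly; the Ollama section works from any LAN host that can reach `10.0.0.3`.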