{ "title": "ROCm PyTorch setup quirks", "messages": [ { "role": "user", "content": "So we now have rocm running with PyTorch. Do you remember what little quirks we ran into?" }, { "role": "assistant", "content": "Yep — I remember the main **quirks** we ran into while getting ROCm + PyTorch working on your MI50. \nHere’s the short list of pain points and workarounds we hit:\n\n---\n\n### 🧩 ROCm / PyTorch Setup Quirks\n\n**1. ROCm + PyTorch compatibility matrix**\n- ROCm versions and PyTorch wheels are *very* version-sensitive. \n → We had to confirm that the wheel you used (`torch-hip` variant) matched your installed ROCm (6.0.x). \n → Earlier mismatched builds threw HIP runtime errors or failed to see the GPU.\n\n**2. `torch.cuda` returns `True` but not everything works**\n- Even though `torch.cuda.is_available()` returned `True`, we had to explicitly test a matmul to confirm full HIP functionality (`a @ b` benchmark). \n → The dummy matrix multiply was the sanity check: no `HIP_ERROR_LAUNCH_FAILED` = ✅.\n\n**3. Virtual environment (venv) confusion**\n- ROCm tools (`rocm-smi`, `rocminfo`) don’t care about the Python venv. \n → You can run them from the host shell — no need to activate the venv. \n → Only PyTorch/hip-enabled Python needs the venv active.\n\n**4. Missing ROCm utilities**\n- Some test packages (`rocblas-tests`, `hipinfo`, etc.) aren’t available in Ubuntu’s default repos. \n → We had to install core ROCm meta-packages (`rocblas`, `rocminfo`, `rocm-smi`) manually, and skip `rocblas-tests` since it’s deprecated/moved.\n\n**5. Device naming confusion**\n- `torch.cuda.get_device_name(0)` showed `AMD Instinct MI60 / MI50` — normal, both use the same Vega20 GPU ID. \n → Nothing broken there; it’s just a shared PCI device ID.\n\n**6. Temperature + cooling quirks**\n- MI50 is passively cooled — meant for front-to-back airflow in servers. \n → In your consumer case, junction temps hit ~96 °C under load, idled ~40 °C. \n → We planned to either:\n - Tie the chassis fan curve to GPU temp (via script), or \n - Force full fan above 90 °C if no precise control signals exposed.\n\n**7. Fan control via ROCm**\n- ROCm doesn’t expose fan PWM control for Instinct cards. \n → `rocm-smi` can *read* temps, but fan speed control is limited (and may show “N/A”). \n → Our workaround idea: read temps via `rocm-smi --showtemp` in a script and adjust chassis fans via motherboard sensors instead.\n\n**8. Vendor-reset module (Proxmox issue)**\n- The `vendor-reset` driver wasn’t needed for ROCm inside the VM since you passed through a single MI50 with working VFIO binding. \n → It only matters for hot-reset or multi-GPU setups.\n\n---\n\nWould you like me to write up a **clean “ROCm + PyTorch on MI50 install log”** (step-by-step with what worked and what didn’t) so you can drop it straight into your Lyra project folder?" }, { "role": "user", "content": "I thought we had to use 5.7 not 6.x" }, { "role": "assistant", "content": "Good memory — and you’re right to flag that. \nWe **originally expected to need ROCm 5.7**, not 6.x, for MI50 support, because:\n\n---\n\n### 🧠 What We Thought (and Why)\n- AMD’s **official ROCm 6.x documentation** dropped *explicit* support for older Vega 20-based cards (MI50/MI60, Radeon VII). \n- On paper, ROCm 6.x is for RDNA and CDNA 2 + 3 cards (MI100 and newer). 
\n- So the usual rule of thumb was:\n > *Vega 20 = ROCm 5.7 max* \n (since 6.x kernels sometimes exclude the Vega 20 device IDs)\n\n---\n\n### 🧩 What Actually Happened in Your Case\nDuring install, you ended up on **ROCm 6.0.3 (approx)** — and it still *worked*:\n```bash\nHIP version: 6.0.32830-d62f6a171\nGPU name: AMD Instinct MI60 / MI50\n```\nThat tells us:\n- The HIP runtime successfully recognized the card and linked against `rocblas`, so the driver stack still contained Vega 20 support.\n- AMD didn’t *remove* Vega 20 entirely — they just stopped officially **testing** or **guaranteeing** it.\n- Some distros’ ROCm 6.0.x builds (especially Ubuntu 22.04/24.04 packages) keep the old PCI IDs in the kernel module, so legacy cards still initialize fine.\n\n---\n\n### ⚠️ The Trade-Offs\n| Aspect | ROCm 5.7 | ROCm 6.0.x |\n|:--|:--|:--|\n| ✅ Official MI50 support | Yes | ❌ (unofficial / deprecated) |\n| 🧱 Kernel modules | Stable for Vega 20 | May drop IDs in future |\n| 🧮 PyTorch wheels | `torch==2.1.0+rocm5.7` | `torch==2.3.0+rocm6.0` (works) |\n| ⚙️ HIP version | 5.7.1 | 6.0.3 |\n| 🔧 Performance | Slightly older ROCBLAS | Newer kernels = faster math libs |\n| 🧩 Risk | Lower | Higher (future breakage possible) |\n\n---\n\n### 🧭 What This Means for You\nYou lucked out: \nYour combination of kernel + driver + PyTorch wheel + Ubuntu version still includes Vega 20 hooks. \n**So staying on ROCm 6.0.x is fine** as long as:\n1. `torch.cuda.is_available()` keeps returning `True`\n2. `rocminfo` shows the MI50 under `gfx906` \n3. Matmul benchmarks complete without HIP errors\n\nIf AMD drops `gfx906` from future ROCm builds, we can **pin** your install to 6.0.3 or roll back to 5.7 manually.\n\n---\n\nWant me to outline how to **lock ROCm 6.0.3** in apt so Ubuntu updates don’t silently replace it?
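\n\nFor reference, the lowest-effort version of that lock is simply holding the installed ROCm packages at their current version with `apt-mark hold`. A minimal sketch, with illustrative package names (check what `dpkg -l | grep rocm` actually shows on ct201 before holding anything):\n```bash\n# See which ROCm packages are actually installed (names vary by install method)\ndpkg -l | grep rocm\n\n# Hold the core packages so routine apt upgrades cannot move them off 6.0.3\n# (package names below are examples, not a definitive list)\nsudo apt-mark hold rocm-hip-libraries rocm-smi-lib rocminfo\n\n# Undo later with: sudo apt-mark unhold <package>\n```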
" }, { "role": "user", "content": "So with rocm 6.0.3 what all can we run? PyTorch seems cool but it’s also very confusing to me. Ollama and llama.cpp just sort of worked immediately with the 3090 and cpu respectively " }, { "role": "assistant", "content": "Totally fair — PyTorch under ROCm is *way* lower-level than Ollama or llama.cpp. \nLet’s cut through the noise and get you a simple mental map of what ROCm 6.0.3 actually unlocks on that MI50. \n\n---\n\n## 🧩 ROCm 6.0.3 Capability Map (for MI50)\n\n| Layer | What it actually lets you do | Notes |\n|-------|------------------------------|--------|\n| **HIP runtime** | AMD’s equivalent of CUDA. Lets frameworks call the GPU. | Works for `gfx906` (Vega 20). |\n| **Math libraries** | ROCBLAS (matmul), MIOpen (AMD’s cuDNN equivalent), rocFFT, rocRAND, etc. | Needed for ML/AI workloads. |\n| **Framework bindings** | PyTorch (official wheels), TensorFlow (community-maintained), ONNX Runtime, JAX (ROCm branch). | PyTorch 2.3.0 + ROCm 6.0 is stable for inference + training. |\n| **Inference engines** | 🟢 Stable: PyTorch models, ONNX Runtime, vLLM (if built with ROCm). ⚠️ Unstable: Ollama (no ROCm build yet), llama.cpp (via HIP backend only if you compile yourself). | You can run LLaMA/SD/etc through these, but you have to build them manually. |\n| **System utilities** | `rocm-smi`, `rocminfo`, `rocprof`, etc. | Monitoring / debugging. |\n\n---\n\n## 🔥 What You Can *Actually Run Right Now*\n1. **PyTorch models** \n - Any `.pt` model that doesn’t rely on CUDA-only ops. \n - Examples: Stable Diffusion v1.5, simple LLaMA-based chat models, audio models, small vision models. \n - Just change:\n ```python\n device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n model.to(device)\n ```\n - Your MI50 will handle fp16/fp32 math well, but not bf16 or int8.\n\n2. **ONNX Runtime (ROCm build)** \n - Good for *inference-only* workloads — i.e., once you’ve exported a trained PyTorch model to ONNX. \n - Faster startup, easier than full PyTorch training stacks.\n\n3. **llama.cpp (HIP build)** \n - You can compile with:\n ```bash\n cmake -DLLAMA_HIPBLAS=ON ..\n make -j\n ```\n - It uses the same ROCBLAS libs under the hood. \n - The HIP build also compiles llama.cpp’s bundled HTTP server (see the sketch right after this list). \n - Works fine for 7B-13B models, but slower than a 3090 due to lack of tensor cores.\n\n4. **ComfyUI / Automatic1111 (ROCm fork)** \n - ROCm-enabled Stable Diffusion builds exist — they’re the “plug-and-play” equivalent of Ollama for images. \n - Installation is heavier, but performance is solid on MI50.\n\n5. **Any custom script using Torch + NumPy + HIP** \n - Basically, anything that calls `.to(\"cuda\")` will now run on your MI50.
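\n\nPicking up the note in #3: a rough sketch of that bundled server, with the caveat that the binary name and flags vary by llama.cpp version (recent builds call it `llama-server`, older ones just `server`), and the model path and port here are placeholders:\n```bash\n# Serve a local GGUF model over HTTP; -ngl 99 offloads all layers to the MI50\n./llama-server -m /models/llama-2-7b-chat.Q4_K_M.gguf --host 0.0.0.0 --port 8082 -ngl 99\n# Recent versions also expose an OpenAI-style /v1/chat/completions endpoint on this port\n```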
\n\n---\n\n## 🧱 What You *Can’t* (Easily) Run\n| Tool | Reason |\n|------|--------|\n| **Ollama** | Only ships CUDA + Metal builds. No ROCm support. |\n| **Text-generation-webui (prebuilt)** | Same — you’d need to compile its back-end (`transformers`, `exllama`) with ROCm flags. |\n| **TensorRT / CUDA extensions** | NVIDIA-only. Ignore all `.cu` code paths. |\n\n---\n\n## 🧭 TL;DR for You\n\nThink of it like this:\n> **ROCm = AMD’s CUDA driver layer** \n> **PyTorch = programming interface on top of that** \n> **Ollama / llama.cpp = finished apps built *on top of* those lower layers**\n\nSo your MI50 + ROCm setup gives you the **raw horsepower** — you just need frameworks that *talk HIP instead of CUDA*. \n\nYou can:\n- Train or run PyTorch models directly. \n- Export to ONNX for simpler deployment. \n- Recompile llama.cpp for ROCm. \n- Use ROCm forks of image or text UIs.\n\n---\n\nIf you want, I can draw up a **“ROCm App Compatibility Table”** — green/yellow/red status for major AI tools (Ollama, vLLM, llama.cpp, SD, Whisper, etc.) — tuned to your MI50 + ROCm 6.0.3. \nWant me to do that next?" }, { "role": "user", "content": "Okay so PyTorch is the only “plug and play” option?" }, { "role": "assistant", "content": "Pretty much, yeah — **PyTorch is your main “plug-and-play” option** on AMD ROCm right now. \nEverything else that “just works” on a 3090 either **depends on CUDA** or **wraps around PyTorch anyway**.\n\nLet’s break this down clearly:\n\n---\n\n### 🧩 The Layers (Simplified)\n| Layer | NVIDIA World | AMD World |\n|:--|:--|:--|\n| Hardware | RTX 3090 | MI50 |\n| Driver runtime | **CUDA** | **ROCm (HIP)** |\n| Core framework | **PyTorch / TensorFlow** | **PyTorch (ROCm build)** |\n| End-user apps | **Ollama, ComfyUI, Stable Diffusion, llama.cpp, etc.** | 🔧 Must be rebuilt or tweaked for ROCm |\n\n---\n\n### ✅ So What’s Plug-and-Play Right Now\n1. **PyTorch (official ROCm wheels)** \n - You can `pip install torch torchvision torchaudio --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0/` and you’re up. \n - Immediate GPU access via `torch.cuda.is_available() → True`. \n - Run models like Whisper, CLIP, SD, or small LLaMA checkpoints with minor code edits (`.to('cuda')`).\n\n2. **Anything built *on top of* ROCm PyTorch** \n - Some Hugging Face Transformers and Diffusers pipelines just work if you swap `device='cuda'`. \n - Example:\n ```python\n from transformers import pipeline\n pipe = pipeline(\"text-generation\", model=\"gpt2\", device=\"cuda\")\n ```\n → will run fine under ROCm once Torch sees your MI50.\n\n---\n\n### ⚙️ What’s *Almost* Plug-and-Play (Minor Tweaks)\n| Tool | ROCm Status |\n|------|--------------|\n| **llama.cpp (HIP build)** | Works well if compiled with `LLAMA_HIPBLAS=ON`. Just a few build commands. |\n| **ComfyUI / A1111 ROCm forks** | Stable builds exist, but need a bit of setup. |\n| **ONNX Runtime ROCm** | Great for inference-only models; just install ROCm flavor. |\n\n---\n\n### 🚫 Not Plug-and-Play\n| Tool | Why |\n|------|-----|\n| **Ollama** | Only prebuilt for CUDA and Metal — no ROCm yet. |\n| **text-generation-webui** | CUDA-specific by default (ExLLaMA, bitsandbytes, etc.). |\n| **Anything “TensorRT” flavored** | NVIDIA-exclusive acceleration. |\n\n---\n\n### 🧭 TL;DR\n> ✅ PyTorch ROCm = your only real plug-and-play stack. \n> ⚙️ A few others (llama.cpp, ComfyUI, ONNX) are close with minor effort. \n> 🚫 Ollama and CUDA-only apps won’t ever run without rebuilds.\n\n---\n\nWould you like me to sketch a **“ROCm ecosystem map”** (visual chart showing what’s plug-and-play vs. rebuild-needed vs. not-supported)? It helps a lot when planning which AI tools are actually usable on your MI50." }, { "role": "user", "content": "I am just trying to figure out the best way to make this work. I want the ability to have programs like relay and cortex call up the LLM running on the MI50 just like how it can call up ollama (on 3090), OpenAI cloud, or llama.cpp (running in the cortex vm). Ideally I would have just dropped in the gpu to the llama.cpp vm but we were having tons of issues with GPU passthrough. So ct201 was the answer. But it’s only PyTorch for now. We need a way to have the programs able to call it up like the other backends " }, { "role": "assistant", "content": "Yeah — I see exactly what you’re trying to do now. \nYou don’t actually care about *using PyTorch directly* — you want the **MI50 box (ct201)** to behave like a *drop-in backend endpoint* that your Lyra ecosystem (Relay, Cortex, etc.) can call, same as the others.\n\nLet’s map this clearly.\n\n---\n\n## 🧭 Goal in Plain English\n> “Make the MI50/ROCm machine act like another LLM server endpoint — callable via HTTP or API — without needing to run CUDA or Ollama.”\n\n---\n\n## 💡 Core Problem\n- ROCm PyTorch works — but there’s no **out-of-the-box HTTP API** like Ollama or llama.cpp give you. \n- You need a **wrapper service** that:\n 1. Loads a model in ROCm-enabled PyTorch \n 2. Exposes `/generate` or `/chat` endpoints \n 3. Returns text responses just like Ollama or OpenAI APIs 
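\n\nTo make that concrete before picking a wrapper: this is roughly what the Relay side of such a call could look like. It is a sketch, not something tested against your stack; the host `ct201`, the port 8081, and the model name are placeholders that happen to match the vLLM example further down, and the request shape is the standard OpenAI chat-completions format, so from Relay’s point of view ct201 becomes just another base URL.\n```python\nimport requests\n\n# Hypothetical endpoint on the MI50 box; any OpenAI-compatible wrapper looks the same to Relay\nROCM_LLM_URL = \"http://ct201:8081/v1/chat/completions\"\n\nresp = requests.post(\n    ROCM_LLM_URL,\n    json={\n        \"model\": \"meta-llama/Llama-2-7b-chat-hf\",  # whatever model the server was launched with\n        \"messages\": [{\"role\": \"user\", \"content\": \"Hello from Relay\"}],\n        \"max_tokens\": 100,\n    },\n    timeout=120,\n)\nresp.raise_for_status()\nprint(resp.json()[\"choices\"][0][\"message\"][\"content\"])\n```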
\n\n---\n\n## 🧩 Options to Turn PyTorch Into a Callable LLM Server\n\n### **Option A — Text Generation Inference (TGI, by Hugging Face)**\n✅ *Best drop-in choice for your use case.*\n\n- Official ROCm-compatible builds exist (built on PyTorch ROCm). \n- Provides a REST API (with streaming) and OpenAI-style endpoints.\n- Supports LLaMA, Mistral, Falcon, etc.\n\n**Install (on ct201):**\n```bash\n# TGI is normally run from its Docker image; the `text-generation` pip package is only the client.\n# Image tag and device flags here are from memory; check the TGI docs for the current ROCm image.\ndocker run --device=/dev/kfd --device=/dev/dri --group-add video \\\n -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest-rocm \\\n --model-id meta-llama/Llama-2-7b-chat-hf\n```\n\nThen your Relay/Cortex can call:\n```\nPOST http://ct201:8080/generate\n{\"inputs\": \"Hello, Lyra!\", \"parameters\": {\"max_new_tokens\": 100}}\n```\n\n> ⚙️ TGI = essentially Ollama-like behavior on ROCm.\n\n---\n\n### **Option B — vLLM (if you want OpenAI-compatible API)**\n- vLLM recently added partial ROCm support.\n- Exposes `/v1/chat/completions` exactly like OpenAI.\n- You’d launch it like this (note: plain `pip install vllm` pulls the CUDA build, so on ROCm you currently build vLLM from source or use AMD’s ROCm container):\n ```bash\n python3 -m vllm.entrypoints.openai.api_server \\\n --model meta-llama/Llama-2-7b-chat-hf \\\n --port 8081\n ```\n- Relay can then switch `base_url` to `http://ct201:8081/v1`.\n\n> ⚠️ ROCm backend in vLLM 0.6+ is still “experimental,” but stable for inference.\n\n---\n\n### **Option C — Tiny Custom Flask API**\nIf you just want full control and simplicity:\n```python\nfrom flask import Flask, request, jsonify\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\napp = Flask(__name__)\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ntok = AutoTokenizer.from_pretrained(\"meta-llama/Llama-2-7b-chat-hf\")\nmodel = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-2-7b-chat-hf\").to(device)\n\n@app.route(\"/generate\", methods=[\"POST\"])\ndef generate():\n text = request.json[\"prompt\"]\n tokens = tok(text, return_tensors=\"pt\").to(device)\n out = model.generate(**tokens, max_new_tokens=200)\n return jsonify({\"output\": tok.decode(out[0])})\n\napp.run(host=\"0.0.0.0\", port=7079)\n```\n- You’d access it via `http://ct201:7079/generate`.\n- Simple and local — just slower startup and no request batching.\n\n---\n\n## 🧱 Integration Plan Summary\n| Task | Tool | Result |\n|------|------|--------|\n| Load model on MI50 (ROCm) | PyTorch ROCm | ✅ working now |\n| Add API interface | TGI **or** vLLM **or** Flask | Gives `/generate` endpoint |\n| Connect from Relay/Cortex | Point to `http://ct201:<port>` | Works like other backends |\n\n---\n\n## 🧩 Suggested Path for You\n1. **Try Option A (TGI)** first — easiest and closest to “Ollama-style.” \n2. If you need *OpenAI-format API* for drop-in compatibility → switch to **vLLM**. \n3. Keep the Flask route in your back pocket as a minimalist fallback.\n\n---\n\nWant me to write a **setup script** (bash + systemd service) that would install and run TGI on ct201 so it autostarts as a local “rocm-llm-server”? That’d make it plug-and-play for Relay." }, { "role": "user", "content": "Man this is confusing lol " }, { "role": "assistant", "content": "Yeah 😂 — you’re not wrong. 
\nThis part of the AMD ecosystem is *like trying to run a Tesla motor with Ford’s wiring harness*: all the power’s there, but the connectors don’t match.\n\nLet’s demystify it in “Brian-friendly plain English,” minimal jargon:\n\n---\n\n## 🧠 The Whole Point\nYou want to say to Relay or Cortex:\n\n> “Hey, ask the MI50 box for a completion.”\n\nJust like you already can with:\n- **Ollama** → `http://3090box:11434/api/generate`\n- **Llama.cpp VM** → `http://cortex:7078`\n- **OpenAI** → the cloud\n\nYou want the MI50 box to give that same kind of **HTTP endpoint**.\n\n---\n\n## 🧩 The Reality\nRight now ROCm + PyTorch gives you a **raw engine**, not the “API shell.” \nSo we bolt one on.\n\nThink of it like building a food truck:\n1. **Engine (ROCm + PyTorch)** → runs great, but no window for customers.\n2. **API wrapper (TGI / vLLM / Flask)** → cuts a window in the side so Relay can order food.\n3. **Relay / Cortex** → your customers yelling “Two completions, extra tokens!”\n\n---\n\n## 🔧 The Simplified Roadmap\n1. **Pick one “window”**\n - **Option A: Hugging Face TGI** → easiest. Pretends to be an API server out-of-the-box. \n - **Option B: vLLM** → looks exactly like OpenAI’s API (for future drop-in). \n - **Option C: Tiny Flask script** → dead simple, manual but fool-proof.\n\n2. **Point Relay to it**\n - Same as you point to Ollama: \n `base_url=http://ct201:8080` (or whatever port).\n\n3. **Done.** \n From Relay’s perspective, MI50 is now *just another model endpoint.*\n\n---\n\nIf you want, I can build you a **“lowest-pain install plan”**—a single script that:\n- installs PyTorch ROCm + TGI \n- downloads a small model (LLaMA-2-7B or similar) \n- runs it as a background service (`rocm-llm.service`) \n- prints the endpoint URL for Relay\n\nWould you like me to sketch that 5-minute setup next?" }, { "role": "user", "content": "No currently I’m just brainstorming and trying to learn " }, { "role": "assistant", "content": "Perfect — then we’ll stay in “learn mode” for a bit, not “build mode.” \nLet’s unpack this cleanly so you actually *understand* what’s going on, not just follow commands. \n\n---\n\n## 🧩 The Big Picture — Why It Feels Confusing\n\nYou’re juggling **three layers that normally get blended together**:\n\n| Layer | What It Actually Does | Analogy |\n|-------|----------------------|----------|\n| **Driver/runtime** | ROCm (AMD) or CUDA (NVIDIA) | The engine under the hood — pure horsepower. |\n| **Framework** | PyTorch, TensorFlow | The steering wheel — tells the engine *how* to run math. |\n| **Server/API layer** | Ollama, vLLM, TGI, etc. | The car’s dashboard — gives you buttons and knobs to interact. |\n\nNVIDIA’s world (3090 + CUDA) bundles all three really tightly — that’s why Ollama just “works.” \nAMD’s world (MI50 + ROCm) only gives you the **engine and steering wheel**, not the dashboard. \nThat’s why PyTorch runs, but nothing’s listening for HTTP requests.\n\n---\n\n## 🧭 What You’re Really Trying to Recreate\nWhen Relay or Cortex calls an LLM, it expects an **API** that looks something like this:\n```http\nPOST /v1/chat/completions\n{\n \"model\": \"whatever\",\n \"messages\": [{\"role\":\"user\",\"content\":\"hello\"}]\n}\n```\n\nThat format comes from the **OpenAI API spec** — and Ollama, llama.cpp, vLLM, etc. \nall imitate it. \nSo, what you need on the MI50 side is *any server* that speaks that protocol.\n\n---\n\n## 🧱 The Three Families of “LLM Servers”\n\n1. **App-style servers (Ollama, LM Studio, Kobold, etc.)** \n - Usually CUDA-only. \n - Meant for end users. 
\n - Very easy, very opinionated.\n\n2. **Framework servers (TGI, vLLM)** \n - Python-based, use PyTorch or TensorFlow. \n - Expose APIs like OpenAI’s. \n - Can be adapted to ROCm easily.\n\n3. **Bare metal (PyTorch directly)** \n - Just a Python REPL. \n - You manually load models and run `.generate()`. \n - No API unless you make one.\n\n---\n\n## ⚙️ Why the MI50 Path Is Tricky\n\nBecause AMD doesn’t have the same “consumer polish” layer that NVIDIA does. \nThe ROCm stack is *aimed at researchers and HPC people* who write their own pipelines. \nSo when you install it, you’re expected to already know how to wrap it in your own interface — like TGI or vLLM.\n\nThat’s why it feels half-baked compared to the 3090 experience.\n\n---\n\n## 🧠 TL;DR Mental Model\n```\n3090 path:\n Ollama → CUDA → GPU → Works\nMI50 path:\n TGI/vLLM → PyTorch ROCm → GPU → Works\n```\n\nIf you remember that, everything else will make sense.\n\n---\n\nWould you like me to give a “visual mental map” — like a diagram showing how all these layers (Relay → API → Framework → GPU) connect together across your machines? \nIt’s honestly the best way to make the spaghetti make sense." } ] }