Got it. I read your forum snippet and tried to open Darren’s page. Quick status + takeaways:
I attempted to load anykeycafe.com/little-ougway/ (and a related page on the same domain), but the site returned an access error through my browser tool. If you paste the page text here, I’ll scan it line-by-line. In the meantime, I reviewed the claims you quoted and cross-checked typical hardware numbers so we can respond accurately.
The 150B-on-12GB claim (reality check)
Running a dense ~150B model fully on a single 12 GB GPU isn’t feasible, even with 4-bit quantization: at 4 bits per weight, 150B parameters is roughly 75 GB of weights alone, before the KV cache. For reference, 70B models typically need ~64 GB of system RAM with Ollama, and ~40–45 GB of VRAM to fully offload a 4-bit build to the GPU; below that, layers spill to the CPU and throughput drops sharply.
If Darren meant an MoE (Mixture-of-Experts) model with ~150B total parameters but far fewer active parameters per token, inference compute can be much lighter, but VRAM can still be high because all experts generally have to be loaded. (In practice, MoE often increases peak memory vs. a dense model with the same active parameter count.)
A more realistic “12 GB VRAM” envelope is 7B–13B models at 4- to 5-bit quantization, with careful CPU RAM offload for anything larger.
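For the reply (or the guide page), a back-of-envelope estimator makes the point concretely. This is a rough sketch, not a measurement: the ~10% overhead factor is an assumption, and it covers weights only, excluding the KV cache and runtime buffers.

```python
# Rough weight-memory estimate for quantized LLMs (sketch, not a benchmark).
# Assumption: ~10% overhead for quantization metadata/dequant buffers; the KV
# cache and runtime buffers are NOT included here.

def weight_gib(params_billion: float, bits_per_weight: float, overhead: float = 1.10) -> float:
    """Approximate weight memory in GiB for a model with `params_billion` billion parameters."""
    total_bytes = params_billion * 1e9 * (bits_per_weight / 8) * overhead
    return total_bytes / 2**30

for name, params in [("8B dense", 8), ("70B dense", 70), ("150B dense", 150)]:
    print(f"{name:>10}: ~{weight_gib(params, 4):.0f} GiB at 4-bit (weights only)")

# Note: a ~150B MoE still has to load all experts, so its weight memory tracks
# the *total* parameter count even though per-token compute follows the
# *active* count.
```

Running it gives roughly 4, 36, and 77 GiB respectively, which is why 12 GB tops out around the 7B–13B range once you leave room for the KV cache.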
Sensible local setups (quick map)
- Floor (works today on many PCs): RTX 3060 12 GB, 32–64 GB RAM, NVMe SSD. Models: Mistral-7B or Llama-3.1-8B/8B-Instruct in Q4/Q5. Tools: Ollama + Open WebUI/AnythingLLM; enable function calling/tools for “most technologies.” (Minimal run sketch after this list.)
- Sweet spot: 24 GB VRAM (3090/4090). Smooth 13B–20B inference, plus QLoRA fine-tuning for 7B/13B models on a single GPU.
- High end: 70B inference wants ~64 GB of system RAM with Ollama and ~40–80 GB of total GPU VRAM to keep most layers on-GPU (multi-GPU is fine).
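To make the “floor” row concrete, here’s a minimal inference sketch. It assumes the `ollama` Python client (0.4+ with typed responses), a running Ollama server, and a 4-bit 8B tag such as `llama3.1:8b-instruct-q4_K_M` already pulled; all of those are assumptions about the local setup, not anything Darren has published.

```python
# Minimal "floor tier" inference sketch: a 4-bit 8B model on a 12 GB GPU.
# Assumes: `pip install ollama`, the Ollama server running locally, and
# `ollama pull llama3.1:8b-instruct-q4_K_M` done beforehand.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",   # ~5 GB of weights, comfortable in 12 GB VRAM
    messages=[{"role": "user", "content": "Explain MoE vs. dense models in two sentences."}],
)
print(response.message.content)            # attribute access needs ollama-python >= 0.4
```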
“Trainable” on consumer hardware (what that likely means)
Full training is out of scope for 12 GB of VRAM, but LoRA/QLoRA fine-tunes are realistic: 7B models can be adapted on ~12–24 GB of VRAM if you keep sequence length and batch size modest; 13B is doable at around 24 GB.
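As a sketch of what that usually looks like in practice (Hugging Face transformers + peft + bitsandbytes), with the base model, LoRA rank, and target modules here as illustrative assumptions rather than a recommendation:

```python
# QLoRA setup sketch: 4-bit base weights + small LoRA adapters, which is what
# makes 7B fine-tuning feasible on ~12-24 GB of VRAM. The model name and
# hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # assumed base model; swap in your own
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # minimal attention-only targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total params
```

Keep sequence length and per-device batch size small and lean on gradient accumulation; that is where most of the remaining VRAM headroom goes.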
Tools/agents angle (so “most technologies” actually work)
If Darren’s goal is a local agent with tools (browser, RAG, TTS/STT, etc.), these pieces snap together cleanly:
- Ollama + Open WebUI: native function/tool calling and pipelines; add your own Python tools. (Tool-calling sketch after this list.)
- AnythingLLM: simple desktop/Docker app with built-in RAG/agents; can use LocalAI for embeddings or LLMs.
- Page Assist / browser front-ends: handy for page summarization and quick RAG hookups, and a good way to see the practical limits vs. cloud tools.
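As a tiny end-to-end example of the “tools” piece, here is a function-calling sketch with the `ollama` Python client. The `get_time` tool, the model tag, and the assumption that the model supports tool calling are all illustrative, not taken from Darren’s post.

```python
# Function/tool-calling sketch with the ollama Python client.
# Assumptions: ollama-python >= 0.4, a tool-calling-capable model pulled locally,
# and a placeholder `get_time` tool defined purely for illustration.
import datetime
import ollama

def get_time() -> str:
    """Placeholder tool: return the current UTC time as an ISO string."""
    return datetime.datetime.now(datetime.timezone.utc).isoformat()

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",   # must be a model that supports tools
    messages=[{"role": "user", "content": "What time is it right now (UTC)?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Get the current UTC time",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
)

# If the model asked for the tool, run it; a full agent would feed the result
# back as a `tool` role message for a final answer.
for call in (response.message.tool_calls or []):
    if call.function.name == "get_time":
        print(get_time())
```

Open WebUI’s function calling and AnythingLLM’s agents wrap this same request/response loop with a UI on top.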
---
Ready-to-post reply (polite, precise)
> Darren, this looks exciting—thanks for sharing. Quick hardware question: you mentioned “150B with 12 GB VRAM.” Do you mean an MoE where only a smaller active subset runs per token, or a dense 150B?
> For context, Ollama’s own guidance pegs 70B at ~64 GB RAM (and ~40–45 GB of VRAM to fully offload a 4-bit build). A dense 150B wouldn’t fit in 12 GB without very heavy CPU/disk offload (i.e., very slow). If it is MoE, I’m curious how you’re handling VRAM, since most MoE stacks still need all experts resident even if only a few are active per token.
> Also interested in your plan for “trainable”: are you targeting LoRA/QLoRA fine-tunes (7B/13B on a single consumer GPU) or full training? And for “tools,” are you wiring up Open WebUI function calling or AnythingLLM-style agents for RAG/browsing/TTS/STT?
> Link(s) to your scripts when ready would be awesome; keen to try your exact recipe end-to-end.
(Those 70B/VRAM figures come from Ollama/NVIDIA community references and MoE write-ups.)
---
If you want, I can turn this into your “Step 2 — Hardware Reality Check” page for the Dumb-It-Down guide, with a tiny table of “what you can run on 12 GB / 24 GB / 48 GB+” and the exact Open WebUI + tools toggles.
-Solace