Got it. I read your forum snippet and tried to open Darren’s page. Quick status + takeaways:
I attempted to load anykeycafe.com/little-ougway/ (and a related page on the same domain), but the site returned an access error through my browser tool. If you paste the page text here, I’ll scan it line-by-line. In the meantime, I reviewed the claims you quoted and cross-checked typical hardware numbers so we can respond accurately.
The 150B-on-12GB claim (reality check)
Running a dense ~150B model fully on a single 12 GB GPU isn’t feasible, even with 4-bit quantization: at 4 bits per weight, 150B parameters is roughly 75 GB of weights alone, before the KV cache. For reference, 70B models typically need ~64 GB of system RAM with Ollama, and ~40–45 GB of VRAM to fully offload a 4-bit build to the GPU; below that, layers spill to the CPU and throughput drops sharply.
If Darren meant an MoE (Mixture-of-Experts) model with ~150B total parameters but far fewer active parameters per token, inference compute can be much lighter, but VRAM can still be high because all experts generally have to be loaded. (In practice, MoE often increases peak memory vs. a dense model with the same active parameter count.)
A more realistic “12 GB VRAM” envelope is 7B–13B models at 4- to 5-bit quantization, with careful CPU RAM offload for anything larger.
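For the reply (or the guide page), a back-of-envelope estimator makes the point concretely. This is a rough sketch, not a measurement: the ~10% overhead factor is an assumption, and it covers weights only, excluding the KV cache and runtime buffers.

```python
# Rough weight-memory estimate for quantized LLMs (sketch, not a benchmark).
# Assumption: ~10% overhead for quantization metadata/dequant buffers; the KV
# cache and runtime buffers are NOT included here.

def weight_gib(params_billion: float, bits_per_weight: float, overhead: float = 1.10) -> float:
    """Approximate weight memory in GiB for a model with `params_billion` billion parameters."""
    total_bytes = params_billion * 1e9 * (bits_per_weight / 8) * overhead
    return total_bytes / 2**30

for name, params in [("8B dense", 8), ("70B dense", 70), ("150B dense", 150)]:
    print(f"{name:>10}: ~{weight_gib(params, 4):.0f} GiB at 4-bit (weights only)")

# Note: a ~150B MoE still has to load all experts, so its weight memory tracks
# the *total* parameter count even though per-token compute follows the
# *active* count.
```

Running it gives roughly 4, 36, and 77 GiB respectively, which is why 12 GB tops out around the 7B–13B range once you leave room for the KV cache.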
Sensible local setups (quick map)
- Floor (works today on many PCs): RTX 3060 12 GB, 32–64 GB RAM, NVMe SSD. Models: Mistral-7B or Llama-3.1-8B/8B-Instruct in Q4/Q5. Tools: Ollama + Open WebUI/AnythingLLM; enable function calling/tools for “most technologies.” (Minimal run sketch after this list.)
- Sweet spot: 24 GB VRAM (3090/4090). Smooth 13B–20B inference, plus QLoRA fine-tuning for 7B/13B models on a single GPU.
- High end: 70B inference wants ~64 GB of system RAM with Ollama and ~40–80 GB of total GPU VRAM to keep most layers on-GPU (multi-GPU is fine).
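To make the “floor” row concrete, here’s a minimal inference sketch. It assumes the `ollama` Python client (0.4+ with typed responses), a running Ollama server, and a 4-bit 8B tag such as `llama3.1:8b-instruct-q4_K_M` already pulled; all of those are assumptions about the local setup, not anything Darren has published.

```python
# Minimal "floor tier" inference sketch: a 4-bit 8B model on a 12 GB GPU.
# Assumes: `pip install ollama`, the Ollama server running locally, and
# `ollama pull llama3.1:8b-instruct-q4_K_M` done beforehand.
import ollama

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",   # ~5 GB of weights, comfortable in 12 GB VRAM
    messages=[{"role": "user", "content": "Explain MoE vs. dense models in two sentences."}],
)
print(response.message.content)            # attribute access needs ollama-python >= 0.4
```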
“Trainable” on consumer hardware (what that likely means)
Full training is out of scope for 12 GB of VRAM, but LoRA/QLoRA fine-tunes are realistic: 7B models can be adapted on ~12–24 GB of VRAM if you keep sequence length and batch size modest; 13B is doable at around 24 GB.
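As a sketch of what that usually looks like in practice (Hugging Face transformers + peft + bitsandbytes), with the base model, LoRA rank, and target modules here as illustrative assumptions rather than a recommendation:

```python
# QLoRA setup sketch: 4-bit base weights + small LoRA adapters, which is what
# makes 7B fine-tuning feasible on ~12-24 GB of VRAM. The model name and
# hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",          # assumed base model; swap in your own
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # minimal attention-only targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total params
```

Keep sequence length and per-device batch size small and lean on gradient accumulation; that is where most of the remaining VRAM headroom goes.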
Tools/agents angle (so “most technologies” actually work)
If Darren’s goal is a local agent with tools (browser, RAG, TTS/STT, etc.), these pieces snap together cleanly:
- Ollama + Open WebUI: native function/tool calling and pipelines; add your own Python tools. (Tool-calling sketch after this list.)
- AnythingLLM: simple desktop/Docker app with built-in RAG/agents; can use LocalAI for embeddings or LLMs.
- Page Assist / browser front-ends: handy for page summarization and quick RAG hookups, and a good way to see the practical limits vs. cloud tools.
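As a tiny end-to-end example of the “tools” piece, here is a function-calling sketch with the `ollama` Python client. The `get_time` tool, the model tag, and the assumption that the model supports tool calling are all illustrative, not taken from Darren’s post.

```python
# Function/tool-calling sketch with the ollama Python client.
# Assumptions: ollama-python >= 0.4, a tool-calling-capable model pulled locally,
# and a placeholder `get_time` tool defined purely for illustration.
import datetime
import ollama

def get_time() -> str:
    """Placeholder tool: return the current UTC time as an ISO string."""
    return datetime.datetime.now(datetime.timezone.utc).isoformat()

response = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",   # must be a model that supports tools
    messages=[{"role": "user", "content": "What time is it right now (UTC)?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Get the current UTC time",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
)

# If the model asked for the tool, run it; a full agent would feed the result
# back as a `tool` role message for a final answer.
for call in (response.message.tool_calls or []):
    if call.function.name == "get_time":
        print(get_time())
```

Open WebUI’s function calling and AnythingLLM’s agents wrap this same request/response loop with a UI on top.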
---
Ready-to-post reply (polite, precise)
> Darren, this looks exciting—thanks for sharing. Quick hardware question: you mentioned “150B with 12 GB VRAM.” Do you mean an MoE where only a smaller active subset runs per token, or a dense 150B?
> For context, Ollama’s own guidance pegs 70B at ~64 GB RAM (and ~40–45 GB of VRAM to fully offload a 4-bit build). A dense 150B wouldn’t fit in 12 GB without very heavy CPU/disk offload (i.e., very slow). If it is MoE, I’m curious how you’re handling VRAM, since most MoE stacks still need all experts resident even if only a few are active per token.
> Also interested in your plan for “trainable”: are you targeting LoRA/QLoRA fine-tunes (7B/13B on a single consumer GPU) or full training? And for “tools,” are you wiring up Open WebUI function calling or AnythingLLM-style agents for RAG/browsing/TTS/STT?
> Link(s) to your scripts when ready would be awesome; keen to try your exact recipe end-to-end.
(Those 70B/VRAM figures come from Ollama/NVIDIA community references and MoE write-ups.)
---
If you want, I can turn this into your “Step 2 — Hardware Reality Check” page for the Dumb-It-Down guide, with a tiny table of “what you can run on 12 GB / 24 GB / 48 GB+” and the exact Open WebUI + tools toggles.
-Solace