Let me see if I can clarify.
The article talks about running models on consumer hardware. I am making the point that this is not a new concept. The GUI is optional but, as I mentioned, llama.cpp and other open-source tools provide an OpenAI-compatible API just like the product described in the article.
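To make that concrete, here's roughly what talking to a locally running llama.cpp server looks like from the standard OpenAI Python client. This is just a sketch: the port is llama-server's usual default and the model name is a placeholder for whatever you actually loaded.

```python
# Talking to a local llama.cpp server (llama-server) through its
# OpenAI-compatible endpoint. URL/port and model name are placeholders
# for whatever your own setup uses.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # adjust to wherever llama-server is listening
    api_key="not-needed",                 # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="local-model",  # the server answers with whichever model it loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```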
No. LLMs, as we know them, aren't that old, were harder to run, and required some coding knowledge and environment setup until three-ish years ago, give or take, which is when these more polished tools started coming out.
Ollama matches that description. Llama is a model family from Facebook. Llama.cpp, which is what I was talking about, is an inference and quantization tool suite made for efficient deployment on a variety of hardware, including consumer hardware.
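If you'd rather skip the HTTP server entirely, the llama-cpp-python bindings can load a quantized GGUF file in-process. Rough sketch only; the model path and settings below are made-up examples, swap in your own.

```python
# Running a quantized GGUF model in-process via llama-cpp-python
# (pip install llama-cpp-python). Model path and settings are examples only.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-7b.Q4_K_M.gguf",  # hypothetical quantized model file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is quantization, in one sentence?"}]
)
print(out["choices"][0]["message"]["content"])
```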
Map reduce, in very simplified terms, means spreading compute work across many parallel workers. This is, conceptually, how all LLMs are run at scale. You can't map-reduce or parallelize LLMs any more than they already are. The article doesn't imply map reduce beyond talking about using multiple computers.
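For anyone unfamiliar, here's a toy map-reduce in plain Python, just to show the "spread the work out, then combine the results" shape. The word-count example is mine, not something from the article.

```python
# Toy map-reduce: map a function over chunks in parallel workers,
# then reduce the partial results into one answer.
from multiprocessing import Pool
from functools import reduce

def count_words(chunk: str) -> dict:
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a: dict, b: dict) -> dict:
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

if __name__ == "__main__":
    chunks = ["the cat sat", "the dog sat", "the cat ran"]
    with Pool() as pool:
        partials = pool.map(count_words, chunks)  # map: independent workers
    totals = reduce(merge, partials, {})          # reduce: combine partial results
    print(totals)
```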
They don't talk about how the models are run in the article. But I know a tiny bit about how they're run. LLMs require very simple and consistent math operations on extremely large matrices of numbers. The bottleneck is almost always data transfer, not compute. Basically, every LLM deployment tool already tries to use as much parallelism as possible while reducing data transfer as much as possible.
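Back-of-envelope version of why data transfer dominates: during single-user generation, roughly all of the active weights get streamed through memory for every token, so a crude estimate (my assumption, not something from the article) is tokens/sec ≈ memory bandwidth ÷ bytes of weights read per token.

```python
# Crude single-stream decoding estimate: each token requires streaming roughly
# all active weights through memory, so throughput is bandwidth-limited.
# The numbers below are illustrative assumptions, not measurements.

def rough_tokens_per_sec(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / weights_gb

# Same 8 GB of active weights, two different places to keep them:
print(rough_tokens_per_sec(8, 1000))  # ~1 TB/s GPU VRAM   -> ~125 tokens/sec
print(rough_tokens_per_sec(8, 60))    # ~60 GB/s system RAM -> ~7.5 tokens/sec
```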
The article talks about GPT-OSS 120b, so we aren't talking about novel approaches to how the data is laid out or how the models are used. We're talking about transformer models, and they're huge and require a lot of data transfer. So the preference is to keep your model on the fastest-transfer part of your machine. On consumer hardware, which was the key point of the article, you are best off keeping your model in your GPU's memory. If you can't, you'll run into bottlenecks with PCIe, RAM, and network transfer speed. But consumers don't have GPUs with 63+ GB of VRAM, which is how big GPT-OSS 120b is, so they MUST contend with these speed bottlenecks. This article doesn't address that. That's what I'm talking about.
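To put rough numbers on those bottlenecks, here's how long just moving a ~63 GB model once takes over each link. The bandwidth figures are ballpark assumptions on my part and vary a lot between specific GPUs, boards, and networks.

```python
# Time to move ~63 GB (the size mentioned above for GPT-OSS 120b) across
# different links. Bandwidth figures are rough assumptions, not measurements.
MODEL_GB = 63

links_gb_per_s = {
    "high-end GPU VRAM": 1000,       # on-card memory bandwidth
    "dual-channel DDR5 RAM": 80,
    "PCIe 4.0 x16": 32,
    "10 GbE network": 1.25,
}

for name, bw in links_gb_per_s.items():
    print(f"{name:>22}: ~{MODEL_GB / bw:7.2f} s per full pass over the weights")
```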