I've not found them useful yet for more than basic things. I tried Ollama, it let's you run locally, has simple setup, stays out of the way.
Selfhosted
A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.
Rules:
-
Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.
-
No spam.
-
Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.
-
Don't duplicate the full text of your blog or git here. Just post the link for folks to click.
-
Submission headline should match the article title.
-
No trolling.
-
Promotion posts require your active participation in selfhosting or related communities, or the post will be removed. No more than 10% of your posts or comments may be self-promotional, or your post will be removed. F/LOSS Exception: If your post is about a project that is completely open source & can be self-hosted in full without payment, and your account is at least 7 days old, your post is exempt from this rule as long as you continue to engage in comments.
Resources:
- selfh.st Newsletter and index of selfhosted software and apps
- awesome-selfhosted software
- awesome-sysadmin resources
- Self-Hosted Podcast from Jupiter Broadcasting
Any issues on the community? Report it using the report flag.
Questions? DM the mods!
I have heard good things about LM Studio from several professional coders and tinkers alike. Not tried it myself yet though, but I might have to bite the bullet because I can't seem to get ollama to perform how I want.
TabbyML is another thing to try.
Thanks for the reply!
I had noticed TabbyML but something about their wording made me rethink and then the next day I saw a post on here regarding the same phrasing, I decided to leave it alone after that
Yeah I tried tabby too and they had like a mandatory "we share your code " line and I hoped out. Like if you're going to do that I might as well just use claude
I get good mileage out of the Jan client and Void editor, various models will work but Jan-4B tends to do OK, maybe a Meta-Llama model could do alright too. The Jan client has settings where you can start up a local OpenAI-compatible server, and Void can be configured to point to that localhost URL+port and specific models. If you want to go the extra mile for privacy and you're on a Linux distro, install firejail from your package manager and run both Void and Jan inside the same namespace with outside networking disabled so it only can talk on localhost. E.g.: firejail --noprofile --net=none --name=nameGoesHere Jan and firejail --noprofile --net=none --join=nameGoesHere void, where one of them sets up the namespace (--name=) and the other one joins the namespace (--join=)
Qwen coder model from Huggingface, following the instructions there to run it in llama.cpp. Once that’s up: OpenCode and use the custom OpenAI API to connect it.
You’ll get far better results than trying to use other local options out of the box.
There may be better models potentially but I’ve found Qwen 2.5 etc to be pretty fantastic overall, and definitely a fine option beside Claude/ChatGPT/Gemini. I’ve tested the lot and it’s usually far more down to instruction and AGENTS.md instructions/layout than it is down to just the model.
Do you mind sharing your agents md?
This. Llama.cpp with Vulkan backend running in docker-compose, some Qwen3-Coder quantization from huggingface and pointing Opencode to that local setup with a OpenAI-compatible is working great for me.
The main thing that has stopped me from running models like this so far is VRAM. My server has a RTX 4060 with 8GB, and not sure that can reasonably run a model like this.
Edit:
This calculator seems pretty useful: https://apxml.com/tools/vram-calculator
According to this, I can run Qwen3 14B with 4B quant and 15-20% CPU/NVMe offloading and get 41 tokens / s. It seems 4B quant reduces accuracy by 5-15%.
The calculator even says I can run the flagship model with 100% NVMe offloading and get 4 tokens / s.
I didn’t realize NVMe offloading was even a thing and not sure if it actually is supported or works well in practice. If so, it’s a game changer.
Edit:
The llama.cpp docs do mention that models are memory mapped by default and loaded into memory as needed. Not sure if that means that a MoE model like qwen3 235b can run with 8GB of VRAM and 16GB of RAM, albeit at a speed that is an order of magnitude slower like the calculator suggests is possible.
LM Studio in combination with Kilo Code for IDE integration works pretty nicely locally. Here is a good video covering the basics to get you going: https://www.youtube.com/watch?v=rp5EwOogWEw
I recommend llama.cpp instead of LM Studio.
Why? I use LM studio today, but I'm always interested in futzing with things. Is there a good reason to switch?
Llama.cpp is quite a bit faster than lm studio and ollama. It's easy to find benchmarks showing 2-3x speed ups. I recently switched and am liking it.
I use Ollama with qwen-coder-2.5, integrated with Cline in VSCodium. Works great.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I've seen in this thread:
| Fewer Letters | More Letters |
|---|---|
| DNS | Domain Name Service/System |
| Git | Popular version control system, primarily for code |
| IoT | Internet of Things for device controllers |
| NAS | Network-Attached Storage |
| NVMe | Non-Volatile Memory Express interface for mass storage |
| Plex | Brand of media server package |
| SAN | Storage Area Network |
| SBC | Single-Board Computer |
| VPS | Virtual Private Server (opposed to shared hosting) |
[Thread #252 for this comm, first seen 22nd Apr 2026, 12:50] [FAQ] [Full list] [Contact] [Source code]