this post was submitted on 16 Jan 2026
70 points (88.0% liked)

Selfhosted

60093 readers
933 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam.

  3. Posts here are to be centered around self-hosting. Please ensure it is clear in your post how it relates to self-hosting.

  4. Don't duplicate the full text of your blog or git here. Just post the link for folks to click.

  5. Submission headline should match the article title.

  6. No trolling.

  7. Promotion posts require your active participation in selfhosting or related communities, or the post will be removed. No more than 10% of your posts or comments may be self-promotional, or your post will be removed. F/LOSS Exception: If your post is about a project that is completely open source & can be self-hosted in full without payment, and your account is at least 7 days old, your post is exempt from this rule as long as you continue to engage in comments.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 3 years ago
MODERATORS
 

I'd like to set up a local coding assistant so that I can stop using Google to ask complex questions to for search results.

I really don't know what I'm doing or if there's anything that's available that respects privacy. I don't necessarily trust search results for this kind of query either.

I want to run it on my desktop, Ryzen 7 5800xt + Radeon RX 6950xt + 32gb of RAM. I don't need or expect data center performance out of this thing. I'm also a strict Sublime user so I'd like to avoid VS Code suggestions as much as possible.

My coding laptop is an oooooold MacBook Air so I'd like something that can be ran on my desktop and used from my laptop if possible. No remote access needed, just to use from the same home network.

Something like LM Studio and Qwen sounds like it's what I'm looking for, but since I'm unfamiliar with what exists I figured I would ask for Lemmy's opinion.

Is LM Studio + Qwen a good combo for my needs? Are there alternatives?

I'm on Lemmy Connect and can't see comments from other instances when I'm logged in, but to whomever melted down from this question your relief is in my very first sentence:

to ask complex questions to for search results.

you are viewing a single comment's thread
view the rest of the comments
[–] perry@aussie.zone 8 points 5 months ago (3 children)

Qwen coder model from Huggingface, following the instructions there to run it in llama.cpp. Once that’s up: OpenCode and use the custom OpenAI API to connect it.

You’ll get far better results than trying to use other local options out of the box.

There may be better models potentially but I’ve found Qwen 2.5 etc to be pretty fantastic overall, and definitely a fine option beside Claude/ChatGPT/Gemini. I’ve tested the lot and it’s usually far more down to instruction and AGENTS.md instructions/layout than it is down to just the model.

[–] madcaesar@lemmy.world 3 points 5 months ago

Do you mind sharing your agents md?

[–] 70k32@sh.itjust.works 1 points 5 months ago

This. Llama.cpp with Vulkan backend running in docker-compose, some Qwen3-Coder quantization from huggingface and pointing Opencode to that local setup with a OpenAI-compatible is working great for me.

[–] melfie@lemy.lol 1 points 5 months ago* (last edited 5 months ago)

The main thing that has stopped me from running models like this so far is VRAM. My server has a RTX 4060 with 8GB, and not sure that can reasonably run a model like this.

Edit:

This calculator seems pretty useful: https://apxml.com/tools/vram-calculator

According to this, I can run Qwen3 14B with 4B quant and 15-20% CPU/NVMe offloading and get 41 tokens / s. It seems 4B quant reduces accuracy by 5-15%.

The calculator even says I can run the flagship model with 100% NVMe offloading and get 4 tokens / s.

I didn’t realize NVMe offloading was even a thing and not sure if it actually is supported or works well in practice. If so, it’s a game changer.

Edit:

The llama.cpp docs do mention that models are memory mapped by default and loaded into memory as needed. Not sure if that means that a MoE model like qwen3 235b can run with 8GB of VRAM and 16GB of RAM, albeit at a speed that is an order of magnitude slower like the calculator suggests is possible.