brucethemoose

joined 1 year ago
[–] brucethemoose@lemmy.world 48 points 10 hours ago* (last edited 9 hours ago) (4 children)

That literally almost happened:

https://en.m.wikipedia.org/wiki/Wagner_Group_rebellion

Backstory: there was this infamous Russian mercenary company called Wagner. They had a ton of support from Russian nationalists. During the Ukraine war, Putin decided its populist leader's head was getting a little too big, so he plopped them on the front lines, undersupplied, and ground them down to keep Wagner in their place and keep them from growing too powerful.

The leader was not stupid.

So, seeing the existential problem, Wagner, out of the blue, rushed out of Ukraine and made a beeline for Moscow with a big military convoy. There was a lot of political noise, but basically it was a coup.

Ultimately, Wagner realized not enough of the Russian military was turning to their side and stopped (and its leader was literally blown up in a plane some months later), but the crazy thing is some Russian military did defect! Others let them pass without any resistance! And like you said, Moscow didn't have a lot of defense.

But it was so close to working, it's nuts! Foreign militaries watched it very closely. The Wikipedia article and ISW reports on it are worth reading, and it's all an interesting window into internal Russian politics/culture, as Wagner is not the only internal faction that's a potential threat to Putin.

EDIT: Found the ISW post, so you can see how tense things felt in real time.

https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-june-23-2023

(More generally, those reports are a great source for detailed Ukraine war info.)

[–] brucethemoose@lemmy.world 17 points 1 day ago* (last edited 1 day ago)

Yeah, that's the thing. Even if you buy the idea of Trump's policies (which TBH have a few grains of truth), the implementations of them are so full of nonsense. Like, ok, get Canada into the US, let's just roll with that for the sake of argument... It might make Canada and the US stronger, like the openness between the states does. It would consolidate many federal functions. Canada could retain their culture like individual states do. Sounds plausible.

...And your plan is to get them to join as one state, and only if they grovel to you, by harassing them on Twitter, offering zero details? Like, what world is he living in?

[–] brucethemoose@lemmy.world 4 points 1 day ago* (last edited 1 day ago)

Oh yeah, it's more than that. Low weight helps acceleration, braking (so safety), handling, range, wear on every component, and most of all, cost. The same sized tires will need less pressure, wear much less, and grip harder. If the car is lighter, you don't need as stiff a chassis or as much braking force to lock the wheels, and you can get away with a smaller battery and motor, which means you can take even more weight off the car... You get where I'm going.

Racecars are fast because they are light, not because they have big engines and expensive bodies. Little 1500lb cars can lap a $3 million 1500hp (and quite heavy, because of all the stuff in it) Bugatti around a track.

Heavy cars can handle OK, but the cost is big.

[–] brucethemoose@lemmy.world 8 points 1 day ago* (last edited 1 day ago) (3 children)

+1

Weight is everything. Removing it makes almost literally every aspect of a car better, and excess weight is usually a terrible negative for EVs.

[–] brucethemoose@lemmy.world 3 points 2 days ago* (last edited 2 days ago)

The source doesn't matter; it's more about the example and the idea.

More bluntly, horrible people (like nazis) can go on to do good things in life. That’s okay.

On the other hand, posting the picture without a ton of context seems to reinforce the very thing you are worried about:

someone MIGHT get the impression there was something fishy going on at NATO from the get-go

When that doesn't seem to be the case. Nazis, tankies, whatever populist group you can name, operate on quick negative impressions to sow doubt and anger toward institutions.

[–] brucethemoose@lemmy.world 9 points 2 days ago* (last edited 2 days ago) (2 children)

…So?

Poking through some of their history (Ernst and Karl), it looks like they were indeed Nazi commanders. They served in lower ranks after the war, got more education/experience, and rose again to perform well within NATO.

Maybe I'm naive, but I believe horrible people can go on to do good things, and that's fine. I think my favorite character archetype for this is General Iroh in Avatar, who was involved in unspeakable genocide, changed, and ultimately toppled his own dynasty. He's one of the most beloved characters in fiction, but a quick bio of him in an image would get him utterly crucified as a terrible human being.

Drive-by image posts like this, without context/history, on the other hand, mostly just provoke outrage. It's exactly the kind of thing that would trend on the Twitter algorithm and obliterate any nuance. That's not necessarily your intent, but it's kinda the aggregate effect.

[–] brucethemoose@lemmy.world 117 points 2 days ago (21 children)

His social media posts show a distinct change in his views — one identified by Important Stories as belonging to Gloss suggest he believed in conspiracy theories involving Ukraine, and claimed NATO was an evolution of Adolf Hitler's Nazi party.

As always, the root villain is engagement-driven social media.

[–] brucethemoose@lemmy.world 14 points 4 days ago* (last edited 4 days ago) (3 children)

The implication is walking away from all US military support, I believe.

[–] brucethemoose@lemmy.world 6 points 4 days ago (1 children)
[–] brucethemoose@lemmy.world 51 points 4 days ago* (last edited 4 days ago) (8 children)

I’m hoping Arc survives all this?

I know they want to focus, but no one’s going to want their future SoCs if the GPU part sucks or is nonexistent. Heck, it’s important for servers, eventually.

Battlemage is good!

 

The U.S. expects Ukraine's response Wednesday to a peace framework that includes U.S. recognition of Crimea as part of Russia and unofficial recognition of Russian control of nearly all areas occupied since the 2022 invasion, sources with direct knowledge of the proposal tell Axios.

What Russia gets under Trump's proposal:

  • "De jure" U.S. recognition of Russian control in Crimea.
  • "De-facto recognition" of the Russia's occupation of nearly all of Luhansk oblast and the occupied portions of Donetsk, Kherson and Zaporizhzhia.
  • A promise that Ukraine will not become a member of NATO. The text notes that Ukraine could become part of the European Union.
  • The lifting of sanctions imposed since 2014.
  • Enhanced economic cooperation with the U.S., particularly in the energy and industrial sectors.

What Ukraine gets under Trump's proposal:

  • "A robust security guarantee" involving an ad hoc group of European countries and potentially also like-minded non-European countries. The document is vague in terms of how this peacekeeping operation would function and does not mention any U.S. participation.
  • The return of the small part of Kharkiv oblast Russia has occupied.
  • Unimpeded passage of the Dnieper River, which runs along the front line in parts of southern Ukraine.
  • Compensation and assistance for rebuilding, though the document does not say where the funding will come from.

The whole article is worth a read; it's quite short and dense, as Axios usually is. For those outside the US, this is an outlet that's been well sourced in Washington for years.

[–] brucethemoose@lemmy.world 0 points 1 week ago

It was selectively given to institutions and "major" celebrities before that.

Selling them dilutes any meaning of "verified" because any joe can just pay for extra engagement. It's a perverse incentive, as the people most interested in grabbing attention buy it and get amplified.

It really has little to do with Musk.

 

I see a lot of talk of Ollama here, which I personally don't like because:

  • The quantizations they use tend to be suboptimal

  • It abstracts away llama.cpp in a way that, frankly, leaves a lot of performance and quality on the table.

  • It abstracts away things that you should really know for hosting LLMs.

  • I don't like some things about the devs. I won't rant, but I especially don't like the hint they're cooking up something commercial.

So, here's a quick guide to get away from Ollama.

  • First step is to pick your OS. Windows is fine, but if setting up something new, linux is best. I favor CachyOS in particular, for its great python performance. If you use Windows, be sure to enable hardware accelerated scheduling and disable shared memory.

  • Ensure the latest version of CUDA (or ROCm, if using AMD) is installed. Linux is great for this, as many distros package them for you. (There's a quick snippet after this list to check that Python actually sees your GPU.)

  • Install Python 3.11.x, 3.12.x, or at least whatever your distro supports, and git. If on linux, also install your distro's "build tools" package.
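
If you want a quick sanity check that Python can actually see your GPU (and how much VRAM it has), something like this works. It assumes PyTorch is already installed in whatever environment you're using (the runtimes below all pull it in as a dependency anyway):

```python
# Minimal check that the GPU stack is visible from Python.
# Assumes PyTorch is installed; ROCm builds of PyTorch report
# through the same torch.cuda API.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU visible -- check your CUDA/ROCm install and drivers.")
```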

Now for actually installing the runtime. There are a great number of inference engines supporting different quantizations, forgive the Reddit link but see: https://old.reddit.com/r/LocalLLaMA/comments/1fg3jgr/a_large_table_of_inference_engines_and_supported/

As far as I am concerned, 3 matter to "home" hosters on consumer GPUs:

  • Exllama (and by extension TabbyAPI), as a very fast, very memory efficient "GPU only" runtime, supports AMD via ROCM and Nvidia via CUDA: https://github.com/theroyallab/tabbyAPI

  • Aphrodite Engine. While not strictly as vram efficient, it's much faster with parallel API calls, reasonably efficient at very short context, and supports just about every quantization under the sun and more exotic models than exllama. AMD/Nvidia only: https://github.com/PygmalionAI/Aphrodite-engine

  • This fork of kobold.cpp, which supports more fine grained kv cache quantization (we will get to that). It supports CPU offloading and I think Apple Metal: https://github.com/Nexesenex/croco.cpp

Now, there are also reasons I don't like llama.cpp, but one of the big ones is that sometimes its model implementations have... quality-degrading issues, or odd bugs. Hence I would generally recommend TabbyAPI if you have enough vram to avoid offloading to CPU, and can figure out how to set it up. So: clone the TabbyAPI repo and follow the install instructions on its README/wiki (it ships start scripts that set up the Python environment for you).

This can go wrong; if anyone gets stuck, I can help with that.

  • Next, figure out how much VRAM you have.

  • Figure out how much "context" you want, aka how much text the LLM can ingest. If a model has a context length of, say, "8K", that means it can support 8K tokens as input, or a bit less than 8K words. Not all tokenizers are the same: some, like Qwen 2.5's, can fit nearly a word per token, while others are more in the ballpark of half a word per token or less (see the token-counting sketch after this list).

  • Keep in mind that the actual context length of many models is an outright lie, see: https://github.com/hsiehjackson/RULER

  • Exllama has a feature called "kv cache quantization" that can dramatically shrink the VRAM the "context" of an LLM takes up. Unlike llama.cpp, its Q4 cache is basically lossless, and on a model like Command-R, an 80K+ context can take up less than 4GB! It's essential to enable Q4 or Q6 cache to squeeze as much LLM as you can into your GPU.

  • With that in mind, you can search huggingface for your desired model. Since we are using tabbyAPI, we want to search for "exl2" quantizations: https://huggingface.co/models?sort=modified&search=exl2

  • There are all sorts of finetunes... and a lot of straight-up garbage. But I will post some general recommendations based on total vram:

  • 4GB: A very small quantization of Qwen 2.5 7B. Or maybe Llama 3B.

  • 6GB: IMO llama 3.1 8B is best here. There are many finetunes of this depending on what you want (horny chat, tool usage, math, whatever). For coding, I would recommend Qwen 7B coder instead: https://huggingface.co/models?sort=trending&search=qwen+7b+exl2

  • 8GB-12GB: Qwen 2.5 14B is king! Unlike its 7B counterpart, I find the 14B version of the model incredible for its size, and it will squeeze into this vram pool (albeit with very short context/tight quantization for the 8GB cards). I would recommend trying Arcee's new distillation in particular: https://huggingface.co/bartowski/SuperNova-Medius-exl2

  • 16GB: Mistral 22B, Mistral Coder 22B, and very tight quantizations of Qwen 2.5 32B are possible. Honorable mention goes to InternLM 2.5 20B, which is alright even at 128K context.

  • 20GB-24GB: Command-R 2024 35B is excellent for "in context" work, like asking questions about long documents, continuing long stories, anything involving working "with" the text you feed to an LLM rather than pulling from its internal knowledge pool. It's also quite good at longer contexts, out to 64K-80K more-or-less, all of which fits in 24GB. Otherwise, stick to Qwen 2.5 32B, which still has a very respectable 32K native context, and a rather mediocre 64K "extended" context via YaRN: https://huggingface.co/DrNicefellow/Qwen2.5-32B-Instruct-4.25bpw-exl2

  • 32GB: same as 24GB, just with a higher bpw quantization. But this is also the threshold where lower bpw quantizations of Qwen 2.5 72B (at short context) start to make sense.

  • 48GB: Llama 3.1 70B (for longer context) or Qwen 2.5 72B (for 32K context or less)
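
As for the token-counting sketch mentioned above: here's a rough way to see how many tokens your typical prompts actually are, so you know how much context to budget. It uses the transformers tokenizer; the model ID is just an example and my_prompt.txt is a hypothetical file, so swap in whatever you're actually targeting:

```python
# Rough token count for a prompt, to help budget context length.
# Requires `pip install transformers`. The model ID and file name
# are examples only; use the tokenizer of the family you plan to run.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

with open("my_prompt.txt") as f:  # hypothetical file with a typical input
    text = f.read()

tokens = tokenizer.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```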

Again, browse huggingface and pick an exl2 quantization that will cleanly fill your vram pool + the amount of context you want to specify in TabbyAPI. Many quantizers such as bartowski will list how much space they take up, but you can also just look at the available filesize.
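
If you'd rather ballpark it than eyeball file sizes, the napkin math goes roughly like this. It's an approximation, not gospel: it ignores activation buffers, and embeddings are often quantized differently, so real usage runs a bit higher. The layer/head numbers in the example are illustrative; check the model's config.json for the real ones.

```python
# Napkin math for fitting an exl2 quant + quantized kv cache into VRAM.
# All numbers are rough; real usage is somewhat higher (activations, buffers).
def weights_gb(params_billion: float, bpw: float) -> float:
    # bits per weight -> GB of weights (1B params at 8 bpw ~= 1 GB)
    return params_billion * bpw / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    # K and V caches, one pair per layer; Q4 cache is roughly 0.5 bytes/element
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Example: a hypothetical ~32B model at 4.25 bpw, 32K context, Q4 cache.
total = weights_gb(32, 4.25) + kv_cache_gb(
    layers=64, kv_heads=8, head_dim=128, context=32768, bytes_per_elem=0.5
)
print(f"~{total:.1f} GB of VRAM, plus some overhead")  # ~19 GB -> fits in 24GB
```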

  • Now... you have to download the model. Bartowski has instructions here, but I prefer to use this nifty standalone tool instead: https://github.com/bodaay/HuggingFaceModelDownloader (there's also a pure Python option sketched after this list).

  • Put it in your TabbyAPI models folder, and follow the documentation on the wiki.

  • There are a lot of options. Some to keep in mind are chunk_size (higher than 2048 will process long contexts faster but take up lots of vram, lower will save a little vram), cache_mode (use Q4 for long context, Q6/Q8 for short context if you have room), max_seq_len (this is your context length), tensor_parallel (for faster inference with 2 identical GPUs), and max_batch_size (parallel processing if you have multiple users hitting the tabbyAPI server, but more vram usage).

  • Now... pick your frontend. The tabbyAPI wiki has a good compilation of community projects, but Open Web UI is very popular right now: https://github.com/open-webui/open-webui I personally use exui: https://github.com/turboderp/exui

  • And be careful with your sampling settings when using LLMs. Different models behave differently, but one of the most common mistakes people make is using "old" sampling parameters for new models. In general, keep temperature very low (<0.1, or even zero) and rep penalty low (1.01?) unless you need long, creative responses. If available in your UI, enable DRY sampling to tamp down repetition without "dumbing down" the model with too much temperature or repetition penalty. Always use a MinP of 0.05 or higher and disable other samplers. This is especially important for Chinese models like Qwen, as MinP cuts out "wrong language" answers from the response. (See the request sketch after this list for how these map onto API parameters.)

  • Now, once this is all setup and running, I'd recommend throttling your GPU, as it simply doesn't need its full core speed to maximize its inference speed while generating. For my 3090, I use something like sudo nvidia-smi -pl 290, which throttles it down from 420W to 290W.
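
On the download step: if you'd rather not install a separate tool, the huggingface_hub Python package can fetch a quant too. This is just a sketch; the repo ID, branch name, and folder are examples. Exl2 quantizers like bartowski usually put each bpw in its own branch, so check the model page for the branches that actually exist:

```python
# Fetch an exl2 quant straight into your TabbyAPI models folder.
# Repo ID, revision (branch), and local_dir are examples only --
# check the model page for the bpw branches it really has.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="bartowski/SuperNova-Medius-exl2",
    revision="4_25",   # hypothetical branch name for a 4.25 bpw quant
    local_dir="models/SuperNova-Medius-exl2-4.25bpw",
)
```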
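
And for the sampling settings: here's roughly how they map onto a raw request against TabbyAPI's OpenAI-style chat endpoint, in case your frontend doesn't expose them. The port, API key header, and the min_p/repetition_penalty field names are assumptions on my part, so double-check them against the TabbyAPI docs and your config:

```python
# Example chat request with conservative sampling (low temperature + MinP).
# URL, port, API key, and the extra sampler fields are assumptions --
# verify them against your TabbyAPI config and its API docs.
import requests

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_TABBY_API_KEY"},
    json={
        "model": "SuperNova-Medius-exl2-4.25bpw",  # whatever you loaded
        "messages": [{"role": "user", "content": "Summarize this document: ..."}],
        "temperature": 0.1,
        "min_p": 0.05,              # assumed extension field for MinP sampling
        "repetition_penalty": 1.01,
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```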

Sorry for the wall of text! I can keep going, discussing kobold.cpp/llama.cpp, Aphrodite, exotic quantization and other niches like that if anyone is interested.
