this post was submitted on 20 Oct 2025
73 points (96.2% liked)

Selfhosted

52543 readers
356 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules:

  1. Be civil: we're here to support and learn from one another. Insults won't be tolerated. Flame wars are frowned upon.

  2. No spam posting.

  3. Posts have to be centered around self-hosting. There are other communities for discussing hardware or home computing. If it's not obvious why your post topic revolves around selfhosting, please include details to make it clear.

  4. Don't duplicate the full text of your blog or github here. Just post the link for folks to click.

  5. Submission headline should match the article title (don’t cherry-pick information from the title to fit your agenda).

  6. No trolling.

Resources:

Any issues on the community? Report it using the report flag.

Questions? DM the mods!

founded 2 years ago
MODERATORS
 

A while back I played a round with the HASS Voice Assistant, and pretty easily got to a point where STT and TTS were working really well on my local installation. Also got the hardware to build wyoming satellites with wakeword recognition.

However, what kept me from going through the effort of setting everything up properly (and finally getting fucking Alexa out of my house) was the "all or nothing" approach HASS seemingly has to intent recognition. You either:

  • use the build in Assistant conversation agent, which is a pain in the ass because it matches what your STT recognized 1:1, letter by letter, so it's almost impossible to actually get it to do something unless you spoke perfectly (and forget, for example, about putting something on your ToDo list; Todo, todo, To-Do,... are all not recognized, and have fun getting your STT to reliably generate the ToDo spelling!), or
  • you slap a full-blown LLM behind it, either forcing you to again rely on a shitty company, or host the LLM locally; but even in the latter case and on decent (not H100, of course, but with a GPU at least) hardware, the results were slow and shit, and due to context size limitations, you can just forget about exposing all your entities to the LLM Agent.
  • You also have the option of combining the two approaches; match exactly first, if no intent recognized, forward to LLM; but in practice, that just means that sometimes, you get what you wanted ("all lights off" with a 70% success rate, I'd say), and still a lot of the time you have to wait for ages for a response that may be correct, but often isn't from the LLM.

What I'd like is a third option, doing fuzzy matching on what the STT generated. Indeed, there seems to have been multiple options for that through rhasspy, but that project appears to be dead? The HASS integration has not been updated in over 4 years, and the rhasspy repos are archived as of earlier this month.

Besides, it was not entirely clear to me if you could just use the intent recognition part of the project, forgoing the rest in favor of what HASS already brings to the table.

At this point, I am willing to implement a custom conversation agent, but wanted to make sure first that I haven't simply missed an obvious setting/addon/... for HASS.

My questions are:

  • are you using the HASS Voice Assistant without an LLM?
  • if so, how do you get your intents to be recognized reliably?
  • do you know of any setting/project/addon helping with that?

Cheers! Have a good start into the working week...!

you are viewing a single comment's thread
view the rest of the comments
[–] smiletolerantly@awful.systems 5 points 6 days ago (1 children)

Thanks for your input! The problem with the LLM approach for me is mostly that I have so many entities, HASS exposing them all (or even the subset of those I really, really want) is already big enough to slow everything to a crawl, and to get bad results from all models I've tried. I'll give the model you mentioned another shot though.

However, I really don't want to use an LLM for this. It seems brittle and like overkill at the same time. As you said, intent classification is a wee bit older than LLMs.

Unfortunately, the sentence template matching approach alone isn't sufficient, because quite frequently, the STT is imperfect. With HomeAssistant, currently the intent "turn off all lights" is, for example, not understood if STT produces "turn off all light". And sure, you can extend the template for that. But what about

  • turn of all lights
  • turn off wall lights
  • turnip off all lights
  • off all lights
  • off all fights
  • ...

A human would go "huh? oh, sure, I'll turn off all lights". An LLM might as well. But a fuzzy matching / closest Levensthein distance approach should be more than sufficient for this, too.

Basically, I generally like the sentence template approach used by HASS, but it just needs that little bit of additional robustness against imperfections.

[–] Jayjader@jlai.lu 9 points 6 days ago (1 children)

From my understanding of word embeddings (as used by LLMs), you could skip the LLM and directly compare the similarity of what the STT outputs to each task or phrase in a list you have prepared. You'd need to test it out a few times to see what threshold works, but even testing against dozens of phrases should be much faster than spinning up an LLM - and it should be fully deterministic.

[–] smiletolerantly@awful.systems 10 points 6 days ago

Yep, that's the idea! This post basically boils down to "does this exist for HASS already, or do I need to implement it?" and the answer, unfortunately, seems to be the latter.