I'm sure this works, but from my recent experience you need your STT on a machine more powerful than a Pi atm. Tiny models are just not accurate enough, and the bigger ones need more than the Pi has to give any sort of decent response time. Compared to where this was two years ago, I look forward to where it is in two more.
One of the largest improvements imo has been microWakeWord and the ability to run the hotword detection "on device", but I believe it only runs on ESP32 devices, so it's not an option if you want everything on a Pi.
I spent a little bit of time getting a fully local voice pipeline set up in Home Assistant last month, and I'd say it is near perfect (after adding a few additional community integrations), with the exception of the microphones on current hardware. I look forward to the next HA voice device from Nabu Casa.
Don't misunderstand. openWakeWord works great. I just think it's awesome that the hotword detection can run so well directly on low-powered devices now.
About STT, I'll agree that the smaller models can run well; I just found the experience a lot better with a heavier model running on a beefier machine. The small ones do well in a silent space but struggle when you add background noise.