If you're less concerned about privacy, I use Gemini 2.5 Flash for this and it's exceptionally good and fast as a HA assistant while being much cheaper than the electricity that would be needed to keep a 3090 awake.
The thing that kills this for me (and they even mentioned it) is wake word detection. I have both the HA voice preview and FPH Satellite1 devices, and I've experimented with a few other options like a Raspberry Pi with a conference mic.
Somehow nothing is even 50% as good as my Echo devices at picking up the wake word. The assistant itself is far better, but that doesn't matter if it takes 2-3 tries to get it to listen to you. If someone solves this problem with open hardware I'll be immediately buying several.
What's been surprising in my experience regarding the wake word is that it recognizes me (adult male) saying the wake word ~95% of the time. However, it only registers the rest of my family (women and children) ~30% of the time.
I have a feeling beamforming microphone arrays might help here. Something like this could substantially improve the audio being processed: https://www.minidsp.com/products/usb-audio-interface/uma-8-m....
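For anyone wondering what beamforming buys you, here's a rough sketch of the simplest variant, delay-and-sum. This is the generic textbook technique, not a claim about what any particular mic array does internally. The function name `delay_and_sum` and the idea that you already know each mic's arrival delay in whole samples are assumptions made purely for illustration.

```python
def delay_and_sum(channels, delays):
    """Average several mic channels after undoing each one's arrival delay.

    channels: list of equal-length lists of float samples, one per microphone.
    delays:   per-channel arrival delay in whole samples (assumed known here;
              a real array would estimate these from geometry or correlation).
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d  # a channel delayed by d holds source sample t at index t + d
            if 0 <= idx < n:
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


# Toy example: one pulse reaching the second mic 3 samples later.
source = [0.0] * 32
source[10] = 1.0
mic_a = source[:]
mic_b = [0.0] * 3 + source[:-3]

aligned = delay_and_sum([mic_a, mic_b], [0, 3])     # steered at the talker
misaligned = delay_and_sum([mic_a, mic_b], [0, 0])  # not steered
```

With the right delays the pulse adds up coherently, while sound from other directions (which would need different delays) gets smeared and averaged down. That's the whole trick, and it's why a multi-mic array can hand the wake word model cleaner audio than a single mic.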
What about your wifi APs sensing which room you are in, with your choice of hilarious dance moves as the trigger?
Funky chicken for Gemini
Penguin dance for OpenAI
Claude?
Why not use an easier to detect wake “word”, like two claps in quick succession? Or a couple of notes of a melody?
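A double clap really is much cheaper to detect than a spoken word. Here's a minimal sketch, assuming mono float samples in the range -1..1. The names (`find_onsets`, `is_double_clap`) and every threshold are made up for illustration and would need tuning on real audio; a plain loudness threshold like this will also fire on door slams and dropped cutlery.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of consecutive non-overlapping frames."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def find_onsets(samples, sample_rate, threshold=0.3, frame_ms=10, refractory_ms=100):
    """Times (seconds) where the level rises above threshold.

    Re-triggers within refractory_ms of the previous onset are ignored so one
    clap and its echo count once. All defaults are guesses, not tuned values.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    onsets = []
    prev_loud = False
    for i, level in enumerate(frame_rms(samples, frame_len)):
        loud = level >= threshold
        t = i * frame_len / sample_rate
        if loud and not prev_loud:
            if not onsets or t - onsets[-1] >= refractory_ms / 1000:
                onsets.append(t)
        prev_loud = loud
    return onsets


def is_double_clap(samples, sample_rate, min_gap=0.15, max_gap=0.6):
    """True if two consecutive onsets are between min_gap and max_gap seconds apart."""
    onsets = find_onsets(samples, sample_rate)
    return any(min_gap <= b - a <= max_gap for a, b in zip(onsets, onsets[1:]))
```

You'd feed this a rolling one or two second buffer from the mic and only start the speech pipeline when `is_double_clap` returns True.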
How about a button?
I'd rather physically press a button on an intercom box than have something churning away, constantly processing sound.