This isn't fully "Hidden", but I've always wondered if AI scraping is the reason why short-form videos on YouTube/TikTok/Instagram featuring film/TV clips will sometimes have 2 audio tracks: one with the actual audio from the clip, a little louder, and one with a computer-generated narrator providing running commentary on what is happening and why. As a human I'm able to tune it out, but it is very weird/jarring.
In case anyone hasn't had the displeasure of viewing these, I'll link some in a comment below once I scroll through my feed and find one.
Isn't this an attack on transcribers, not on "Voice AI systems"? ASR transcribers predate LLMs and all the AI hype.
If you are transcribing audio from unknown sources and feeding the output to agents that can perform authorized actions on your behalf, you are kind of screwed anyway. I guess it would be dangerous if you tricked authorized users into playing the sounds in the background while transcribing something.
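To make that concrete, here is a minimal sketch of keeping transcribed text away from privileged actions. Everything in it (the action names, `Transcript`, `authorize`) is hypothetical and not taken from any real agent framework; it only illustrates treating transcripts of unknown audio as untrusted data.

```python
from dataclasses import dataclass

# Hypothetical low-risk actions that an untrusted transcript may trigger.
SAFE_ACTIONS = {"summarize", "search_notes"}
# Hypothetical actions that need a trusted source AND explicit confirmation.
PRIVILEGED_ACTIONS = {"send_email", "transfer_funds"}


@dataclass
class Transcript:
    text: str
    trusted_source: bool  # False for audio from unknown sources


def authorize(action: str, transcript: Transcript, user_confirmed: bool = False) -> bool:
    """Return True only if the requested action may run for this transcript."""
    if action in SAFE_ACTIONS:
        return True
    if action in PRIVILEGED_ACTIONS:
        # Privileged actions never run on the transcript's say-so alone.
        return transcript.trusted_source and user_confirmed
    return False  # unknown actions are denied by default
```

With this shape, a hidden "transfer funds" command inside a transcript from an unknown source is refused even if a confirmation flag is somehow set, because the source itself is not trusted.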
> "Audio modality is really challenging to comprehend because of how limited our hearing is"
Would it help to significantly lower the hearing capabilities of the AI system? At Juvoly, we always encouraged GPs to invest in a high-quality microphone like the Jabra Speak, connected through USB. A good mic results in much better audio transcriptions, but maybe that was all for the wrong reasons?
I believe that will depend purely on how the AI models store voices in their neural networks. If we can debug that, then we would be able to send a secret sound that an AI model might understand due to its internal connections, but that doesn't make sense to us. Until then, there's no harm, in my view.
Related: Benn Jordan shows how to poison-pill AI harvesting music for training
The Art Of Poison-Pilling Music Files
Does this transfer to Whisper / CLAP-type audio models or is it ASR-decoder specific? Whisper would be interesting given how widely it's used in prod.
The Bene Gesserit have entered the chat!
I'd like to commend Apple on being ahead of the curve with this kind of attack; I don't think Siri is susceptible to this at all. Mostly due to it not being able to hear/understand what I say at the best of times /s
It's insane to me how much of a nose-dive Siri or any Apple-based STT takes when there is _any_ noise in the background. I like to play music at low levels in my house just as background noise and I've noticed that if I'm playing any music my STT just goes to complete shit (often missing the last 2-3+ words and messing up things in the middle). On the other hand, in the exact same environment, Parakeet v3 (via MacWhisper) has zero issues even with background noise.
Isn't this the "adversarial image" attack, well known from (earlier) visual recognition models [1]? That would be quite an obvious vector.
[1]: https://www.science.org/content/article/turtle-or-rifle-hack...
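For anyone who hasn't seen how those adversarial examples work, here is a toy sketch in the style of the fast gradient sign method. It is not the attack from the linked article or from the audio work under discussion: the linear "classifier", its weights, the input, and the turtle/rifle labels are all made up for illustration.

```python
def score(weights, x):
    """Toy linear classifier: positive score means 'turtle', otherwise 'rifle'."""
    return sum(w * xi for w, xi in zip(weights, x))


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_perturb(weights, x, epsilon):
    """Shift each input feature by at most epsilon so as to lower the score.

    For a linear model the gradient of the score with respect to x is just
    `weights`, so stepping against sign(weights) is the gradient-sign direction.
    """
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]


weights = [0.5, -0.25, 0.75, -0.5]  # made-up model parameters
x = [0.25, 0.5, 0.25, 0.25]         # made-up input, classified as 'turtle'
x_adv = fgsm_perturb(weights, x, epsilon=0.125)

print(score(weights, x) > 0)      # original input lands on the 'turtle' side
print(score(weights, x_adv) > 0)  # perturbed input lands on the 'rifle' side
```

Each feature moves by only 0.125, yet the score drops by epsilon times the sum of the absolute weights, which is enough to flip the label. Attacks on real image or audio models apply the same idea to a much higher-dimensional input, where the per-feature change can be small enough for a human not to notice.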