Hacker News

daemonologist yesterday at 5:28 PM

It's interesting to me that all AI music sounds slightly sibilant - like someone taped a sheet of paper to the speaker or covered my head in dry leaves. I know no model is perfect but I'd have thought they'd have ironed out this problem by now, given how pervasive it is and how significantly it degrades the end product.


Replies

recursive yesterday at 10:50 PM

I've noticed this too. I have a few theories about this. Disclosure: I know a little about audio, and very little about audio generative AI.

First, perhaps the models are trained on relatively low-bitrate encodings. Just as image models sometimes reproduce JPEG compression artifacts, we could be hearing the characteristic high-frequency loss of low-bitrate audio encodings. Another idea: 'S' and 'T' sounds and similar are relatively broad-spectrum, not unlike white noise, and that kind of sound is known to be difficult to encode for lossy frequency-domain encoding schemes. Perhaps these models work in a similar domain and are subject to similar constraints. There may also be a trade-off here between low-pass filtering and "warbly" artifacts, and we're hearing a middle-ground compromise.
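
The broad-spectrum point can be illustrated with a toy experiment (pure-Python DFT; the helper names like `bins_holding_90pct` are mine, not from any codec): a pure tone concentrates its energy in a single frequency bin, while 's'-like noise smears it across many bins, which is what makes it expensive for a frequency-domain lossy codec to represent sparsely.

```python
import cmath, math, random

def dft_mag(x):
    """Naive DFT magnitude spectrum (fine for small N)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) for k in range(N // 2)]

def bins_holding_90pct(mags):
    """How many frequency bins are needed to capture 90% of the energy."""
    energies = sorted((m * m for m in mags), reverse=True)
    total = sum(energies)
    acc, count = 0.0, 0
    for e in energies:
        acc += e
        count += 1
        if acc >= 0.9 * total:
            break
    return count

N = 256
random.seed(0)
tone = [math.sin(2 * math.pi * 8 * n / N) for n in range(N)]  # steady pitched sound
hiss = [random.uniform(-1.0, 1.0) for n in range(N)]          # 's'-like broadband noise

print(bins_holding_90pct(dft_mag(tone)))  # tone: energy lands in very few bins
print(bins_holding_90pct(dft_mag(hiss)))  # hiss: energy is spread over many bins
```

A codec that keeps only the strongest few coefficients per frame handles the tone well and mangles the hiss, which is one plausible mechanism for the sibilant artifact.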

I don't know how it happens, but when I hear the "AI" sound in music, this is usually one of the first tells.

handbanana_ today at 6:07 AM

It's because the dataset is all lossy-compressed music, not the original sources.

Basically, it's made with pirated MP3s.

AlphaAndOmega0 yesterday at 5:32 PM

Agreed. I find that particularly annoying, and I also notice that the spatial arrangement, or stereo image, is muted for most instruments (or the model simply doesn't use that dimension as well as a good human musician would).

userbinator today at 2:06 AM

Perhaps this is what the human is for - to apply an EQ curve.

conradfr today at 7:14 AM

Taping tissue paper over the tweeters of NS-10s was popular in studios back in the day ;)

gowld today at 1:13 AM

I suspect it's because AI generates music as a waveform, incrementally rather than globally, so it favors smoothly varying sounds over sharp contrasts. If it generated MIDI data and then used a MIDI synth to render the audio, you wouldn't get that.
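
A sketch of that symbolic route (a toy sine synth in plain Python; the event format and all names are mine for illustration): the model would only have to emit discrete note events, and the deterministic synthesis step supplies the waveform, so note onsets come out exactly as sharp as the synth makes them.

```python
import math

def midi_note_to_freq(note):
    """Equal-temperament frequency for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(events, sample_rate=8000):
    """Render (midi_note, start_sec, dur_sec) events with a toy sine synth.
    Onsets are sample-exact: waveform detail comes from the synth, not the model."""
    end = max(start + dur for _, start, dur in events)
    out = [0.0] * int(end * sample_rate)
    for note, start, dur in events:
        freq = midi_note_to_freq(note)
        i0 = int(start * sample_rate)
        for i in range(i0, min(i0 + int(dur * sample_rate), len(out))):
            out[i] += 0.3 * math.sin(2 * math.pi * freq * (i - i0) / sample_rate)
    return out

melody = [(60, 0.0, 0.25), (64, 0.25, 0.25), (67, 0.5, 0.25)]  # C-E-G arpeggio
audio = render(melody)
```

The trade-off, of course, is that a symbolic representation can't carry vocals or timbral nuance the way a raw waveform can, which may be why the waveform route is favored despite these artifacts.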