Hacker News

4TB of voice samples just stolen from 40k AI contractors at Mercor

559 points by Oravys yesterday at 9:57 AM | 211 comments | view on HN

Comments

oefrha yesterday at 2:50 PM

> If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge.

Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!

> Audio is never used to train commercial models without explicit consent

I'm sure Mercor has explicit consent as well; legal teams are reasonably good at covering their asses with license terms.

show 9 replies
Oravys yesterday at 10:02 AM

Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.

Happy to discuss the forensic detection side: AudioSeal watermarks, AASIST anti-spoofing, and how the detection landscape changes once voice biometrics start leaking at scale.
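To make the watermark idea concrete: below is a toy spread-spectrum sketch of the principle a detector like AudioSeal builds on. This is not AudioSeal's actual algorithm (that uses a learned neural embedder and detector); the key, strength, and threshold here are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
KEY = rng.standard_normal(16_000)  # secret pseudorandom carrier, 1 s at 16 kHz

def embed(audio: np.ndarray, strength: float = 0.01) -> np.ndarray:
    """Mix a low-amplitude copy of the key into the signal (spread spectrum)."""
    return audio + strength * KEY[: len(audio)]

def detect(audio: np.ndarray, threshold: float = 3.0) -> bool:
    """Correlate against the key; a watermark pushes the score far above chance.

    For unwatermarked audio the normalised score behaves like N(0, 1),
    so a threshold of ~3 sigma keeps false positives rare.
    """
    k = KEY[: len(audio)]
    score = float(audio @ k) / (np.linalg.norm(audio) * np.linalg.norm(k) + 1e-9)
    return bool(score * np.sqrt(len(audio)) > threshold)
```

The catch, which applies to the real systems too: detection only helps with audio that was watermarked at generation time. It says nothing about raw stolen samples like the Mercor dump.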
show 5 replies
eqvinox yesterday at 1:22 PM

The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.

show 9 replies
kleiba2 today at 9:02 AM

If you had a company, why not just tell all customers that their data is safe and not waste any money on security at all? In case of a breach, just write an apology email to your clients, promise a full investigation, and move on.

Obviously, you don't have to face any legal consequences, so why worry?

Sorry for the rant... but I just find this lack of liability frustrating.

show 1 reply
ethagnawl yesterday at 2:57 PM

So, they should all just rotate their voices ... right?

I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.

show 3 replies
GS_Projects today at 11:39 AM

Mercor had a SOC 2, an MSA, all the right clauses. Voices still leaked. The apology email writes itself.

Why is voice and biometric stuff still server-side at all in 2026? Whisper.cpp runs on a phone. WebGPU works. Half these "we keep your voice secure" pipelines could run in the browser today.

The real reason isn't capability. It's cost. Centralised compute is cheaper to run, but that math only holds if you don't price in the periodic breach. Which nobody does until it's their own employees on the leak list.
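Back-of-envelope version of that argument, with entirely made-up numbers, just to show how the comparison can flip once breach risk is priced in:

```python
# All figures are hypothetical, purely to illustrate the expected-cost framing.
users = 40_000
server_cost_per_user = 2.00   # $/user/year, centralised voice pipeline
device_cost_per_user = 5.00   # $/user/year, on-device inference (amortised eng.)
p_breach = 0.05               # chance of a serious leak in any given year
breach_cost = 5_000_000       # incident response, fines, churn

# Expected annual cost: centralised pays the breach lottery every year;
# on-device holds little sensitive audio server-side, so near-zero breach term.
centralised = users * server_cost_per_user + p_breach * breach_cost
on_device = users * device_cost_per_user

print(f"centralised: ${centralised:,.0f}/yr, on-device: ${on_device:,.0f}/yr")
```

With these assumptions the centralised pipeline is cheaper per user but more expensive in expectation; the real inputs are anyone's guess, which is rather the point.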

VladVladikoff yesterday at 1:37 PM

Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.

show 2 replies
barrenko yesterday at 2:41 PM

It looks more like the purpose of such a company was to steal data like this in the first place.

show 1 reply
ChrisMarshallNY yesterday at 5:13 PM

> What does an attacker actually do with thirty seconds of someone's clean read voice plus a scan of their driver's license?

I could think of quite a few things. I know that my bank and brokerage use voice ID.

kumarski yesterday at 11:46 PM

I was floating near some ex-agency and GS-15 folks yesterday in Houston. They explained to me that the Israeli cybersecurity apparatus has had everyone's voicemails for the last 20 years, because they inserted themselves into the voicemail supply chain somehow or other.

Kind of nuts all the ways audio data can be used now.

show 1 reply
embedding-shape yesterday at 2:05 PM

I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.

show 4 replies
yesman_x yesterday at 6:42 PM

If this is real, the bigger issue might not even be the leak itself. It could be that we are quietly moving into a world where voice plus ID is enough to fully impersonate someone, and most systems are still not built for that reality.

deferredgrant yesterday at 10:27 PM

There is also an ugly labor story here. The people labeling and training these systems are often the least protected when the data pipeline itself turns into the attack surface.

john_strinlai yesterday at 2:07 PM

>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.

Good luck with this. Most finance people deal with hundreds to thousands of clients; they obviously can't remember everyone's codeword. Commonly used finance systems aren't set up to securely store these codewords, and firms don't have processes or policies in place to implement or adhere to any sort of codeword verification.
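To be fair to the checklist, the storage half is a solved problem even if the process half isn't. A minimal sketch (hypothetical helper names, stdlib only) of how a system could keep codewords without ever storing the phrase itself:

```python
import hashlib
import hmac
import os

def enroll_codeword(codeword: str) -> tuple[bytes, bytes]:
    """Keep only a salted, slow hash on file; the phrase itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", codeword.encode(), salt, 200_000)
    return salt, digest

def verify_codeword(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare in constant time, like a password check."""
    cand = hashlib.pbkdf2_hmac("sha256", candidate.encode(), salt, 200_000)
    return hmac.compare_digest(cand, digest)
```

The hard part is exactly what the comment says: the human process of enrolling thousands of clients and training staff to refuse transfers without the phrase, not the crypto.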

>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.

Would this even have an effect? I have never heard of "rotating" a voiceprint. Isn't the whole point of a voiceprint that you can't really change it? If simply switching your acoustic environment completely changes your voiceprint, that would make voiceprints utterly useless to begin with.

show 3 replies
eolgun yesterday at 3:08 PM

The biometric pairing is what makes this particularly bad. A leaked password is recoverable; a leaked voiceprint combined with ID scans is permanent. You cannot rotate your voice.

The deeper problem is that most of these companies collected this data because they could, not because they needed it for the core service. 'Datensparsamkeit' is the right frame: the voice samples were a liability sitting on a server waiting for exactly this.

show 1 reply
tracker1 yesterday at 4:06 PM

I'm pretty sure Google and Apple already have decent samples of a LOT of people's voices, in concert with other data collation. IIRC, Google Voice was bought in the first place for sampling voicemail audio. Not sure if Apple has done similar, but I'd be more surprised if they hadn't... let alone the voice search options for both.

flockonus yesterday at 7:15 PM

> How to check if your voice is being misused

I love that the answer here is basically "you don't."

But maybe you can mitigate, at unreasonable personal cost.

How about services simply stop taking public information as proof of identity?

amarcheschi yesterday at 1:49 PM

I've been doing similar things on a different platform because, as a uni student, the pay is kinda nice. But I limit myself to tasks without voice or video, just mouse/keyboard input for reinforcement learning and data tagging. No way I'm trusting these companies, or the companies they contract the work to.

meric_ yesterday at 4:59 PM

Is this post not just an ad for a vibe-coded site/product? It adds no new info on the Mercor breach, and it advertises something that I presume has even worse safety practices.

AntiUSAbah yesterday at 4:13 PM

I'm curious: if I put a sample of my voice online myself, might that make it a lot harder for an AI model to identify me, if every training dataset ends up containing that particular sample?

throwaway67743 yesterday at 6:15 PM

I saw the red flags immediately when I stumbled across them a year ago maybe. I'm really not surprised.

hedora yesterday at 4:05 PM

Isn’t this going to immediately become daily news?

Half the time I call a company they say “we are recording your voice for security / authentication purposes”.

The companies that do that have all the information on me that they require for me to set up an account, so their data breaches will be just like this one, but 1000x larger.

Can we just fast-forward through the part where this works for ID theft, past the Firefox age-verification plugin that uses these datasets, and even through the part where people in the plugin dataset are digital outcasts ("this voice has been used too many times. Want to try another?")?

At the end of this dark predictable tunnel, maybe there will be a ban on biometrics for important stuff, a repeal of the age verification laws, and actual privacy legislation with teeth.

AtNightWeCode yesterday at 4:47 PM

Where I live there was a common scam that manipulated voice recordings from phone calls, so I was very careful with phone calls back when I ran my own business, like 15 years ago. Kinda crazy that any service would use voice recognition today, as stated.

gyanchawdhary yesterday at 7:28 PM

I'm the founder of a company that runs deepfake phishing simulations for enterprises, so I'm biased on this one. But the operational thing the piece misses is that this is the first widely circulated dump where voice, government ID, and selfie all came from the same onboarding session; most enterprise call-center auth still treats those as three independent factors.

The scarier piece is that an attacker can pull a contractor from the dump, find their employer on LinkedIn, then call that company's IT helpdesk for a password reset with the cloned voice.

FWIW, we put up a free realtime face-swap demo a while back at https://www.callstrike.ai/deepfake-security-training; worth a look if you want to actually feel how trivial this has gotten.

show 1 reply
josefritzishere yesterday at 1:32 PM

This kind of event is the best argument against needless data hoarding. But it would help if the law provided real consequences for negligence.

squirrelon yesterday at 4:53 PM

It's not just 40k people under threat; I get AI contractor job offers every month on Upwork. I'm glad I haven't accepted more than one, as it's just not worth doing.

jacquesm yesterday at 1:31 PM

You could have seen this coming a mile away. So far I have gotten away with never uploading my ID or interacting with one of these companies (though one idiot working for some VC thought it was OK to sign a document on my behalf by uploading my signature!! Never mind a bit of fraud), but it is getting harder and harder. Banks and, in some cases, even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?

I had to open a bank account for a company here a few years ago, right on the bubble of this happening, and they still had an option to come by in person with the proper documentation, which I did. Now it is all outsourced.

These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.

show 2 replies
Havoc yesterday at 1:43 PM

I love how checking whether you're affected involves giving a voice sample to whatever the fuck that website is.

show 1 reply
interludead yesterday at 8:14 PM

This is exactly why "voice as authentication" feels like a dead end to me

show 1 reply
kristopherleads yesterday at 3:20 PM

I'm at the point where I might start professionally using a voice changer. I mean what in the world, my guy?

terobyte yesterday at 11:42 PM

Open Source now?

sharadov yesterday at 3:45 PM

Mercor is the most scummy company out there, run by a bunch of sleazeball 20 somethings who are getting a lot of press as the youngest billionaires in the making.

Can't wait for them to crash and burn.

show 2 replies
miohtama yesterday at 10:57 PM

Now open Chinese models can catch up

throw0101c yesterday at 1:43 PM

"My voice is my passport. Verify Me."

:)

show 2 replies
globalnode yesterday at 2:51 PM

Not to be conspiratorial, but stolen? Or given away...

immanuwell yesterday at 3:09 PM

They literally handed over their voice, their face, and their government ID to train AI models for peanuts, and now Lapsus$ is sitting on 4 TB of "you" that can never be changed like a password.
