I had to block Meta's ASN on my personal cgit server a few weeks ago because they were ignoring robots.txt and torching it. Like hundreds of megabytes of access logs just from them, spread across different network blocks in a clear attempt to defeat IP-based limiting. I couldn't believe it.
IMO ASN-based blocking should be much more common, but unfortunately it is not supported as a first-class configuration option in many common tools.
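For anyone wondering what ASN-based matching amounts to when the tool doesn't do it for you: it is roughly "check the client IP against every prefix the ASN announces". A minimal sketch using only Python's standard library follows. The prefix table is hypothetical (documentation-only address ranges and ASNs, not real data for any company), and the helper names are made up for illustration; in practice you would fill the table from whatever routing or WHOIS data source you trust.

```python
import ipaddress
from collections import Counter

# Hypothetical table: ASN -> prefixes it announces.
# These are documentation-only ASNs and address ranges, NOT real data.
ASN_PREFIXES = {
    64496: ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"],
    64497: ["203.0.113.0/24"],
}


def build_table(asn_prefixes):
    """Flatten the mapping into a list of (network, asn) pairs."""
    return [
        (ipaddress.ip_network(prefix), asn)
        for asn, prefixes in asn_prefixes.items()
        for prefix in prefixes
    ]


def lookup_asn(ip, table):
    """Return the ASN whose longest matching prefix contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, asn in table:
        # Only compare addresses and networks of the same IP version.
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, asn)
    return best[1] if best else None


def requests_per_asn(client_ips, table):
    """Count requests per ASN; unmatched addresses are counted under None."""
    return Counter(lookup_asn(ip, table) for ip in client_ips)


TABLE = build_table(ASN_PREFIXES)
BLOCKED_ASNS = {64496}  # hypothetical block list


def is_blocked(ip):
    """True if ip falls inside any prefix of a blocked ASN."""
    return lookup_asn(ip, TABLE) in BLOCKED_ASNS
```

The linear scan is fine for a few hundred prefixes on a personal server; for a full routing table you would want a radix tree or a prebuilt lookup database instead.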
Hey, how do you identify them? Is there a service that can tell you which of these companies scraped you?
I had to last year too: nonstop crawling, random URLs that didn't exist. It looked like they were trying to proxy user queries through to a search endpoint too. The ASN matched, so I know it wasn't someone spoofing them.