Are they in this space? [1] One could load the ranges into a web daemon and rate-limit them, or just 'ip route add blackhole ${cidr}' each CIDR block.
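For anyone wanting to script that: a minimal sketch that builds the null-route command for each block. The CIDRs below are hypothetical placeholders, not a vendor's actual ranges, and actually applying the routes needs root.

```python
import subprocess  # only needed if you uncomment the "apply" line below

# Hypothetical crawler CIDRs -- substitute each vendor's published ranges.
CIDRS = ["44.192.0.0/11", "20.160.0.0/11"]

def blackhole_cmd(cidr: str) -> list[str]:
    """The `ip route` invocation that null-routes one block."""
    return ["ip", "route", "add", "blackhole", cidr]

for cidr in CIDRS:
    print(" ".join(blackhole_cmd(cidr)))              # dry run: just print
    # subprocess.run(blackhole_cmd(cidr), check=True)  # apply (needs root)
```

The nice thing about blackhole routes over firewall rules is that they drop the traffic before it ever touches the web daemon, so there is zero per-request cost.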
I just do this for the IP ranges of Amazon, OpenAI, Huawei and other companies that run these insane crawlers: it's 100% effective and it doesn't annoy real users with a captcha or some PoW thing. There's simply no reason for them to reach my homeserver other than to scrape the hell out of it.
That list is a tad too long. Why isn't there a rule forcing these big corps to publicly state which range does what?
I didn't check thoroughly, but the first one I happened to grep out was not on that list:
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
"x-forwarded-for":"44.210.204.255" "x-real-ip":"44.210.204.255"
This is a bit outside my area of expertise, so I don't know how reliable these x-forwarded-for and x-real-ip headers are.
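X-Forwarded-For is an ordinary request header, so a client can send whatever it likes; it's only trustworthy when your own reverse proxy is the thing setting or appending it. Independent of that, you can check whether the claimed address actually sits inside a vendor's published ranges. A sketch using Python's ipaddress module, with a hypothetical prefix list (AWS does publish its real one at https://ip-ranges.amazonaws.com/ip-ranges.json, under the "prefixes" key):

```python
import ipaddress

# Hypothetical subset for illustration; load the real "prefixes" array
# from https://ip-ranges.amazonaws.com/ip-ranges.json in practice.
prefixes = ["44.192.0.0/11", "52.0.0.0/11"]

def in_ranges(addr: str, cidrs: list[str]) -> list[str]:
    """Return the CIDR blocks that contain addr (empty list if none)."""
    ip = ipaddress.ip_address(addr)
    return [c for c in cidrs if ip in ipaddress.ip_network(c)]

print(in_ranges("44.210.204.255", prefixes))  # → ['44.192.0.0/11']
```

If the address matches a published range, the header is at least consistent with the bot's claim; if it doesn't, that's a reason to treat the user-agent string as decoration.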