Ukraine has a sound-based version of this, supposedly using cell phones as the primary hardware element. The idea is to scatter hundreds of sensors along the front in some depth, then use simple on-device models to classify sounds and send an alert when a sound matching a known drone signature is detected.
That's not even complicated.
You can use an ESP32 with a GPS module and its PPS signal. The PPS output of such a module is often precise to roughly 60 ns against GPS time.
With that signal you can PID-control an internal timer of the ESP32, which can then be used to timestamp audio frames. Send those to a central host over Wi-Fi and you can use your standard localization math.
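For anyone wondering what the "standard localization math" could look like on the central host, here is a minimal sketch and not a definitive implementation. It assumes a flat 2-D layout, known sensor positions in metres, arrival times already on the common GPS timebase, and a fixed speed of sound. The function name `locate_tdoa` and all parameters are made up for illustration, and plain Gauss-Newton on time differences of arrival is just one of several possible solvers.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumes roughly 20 °C air; changes with temperature


def locate_tdoa(sensors, arrival_times, iterations=50):
    """Estimate a 2-D source position from arrival times at known sensors.

    sensors: list of (x, y) positions in metres (hypothetical layout).
    arrival_times: GPS-aligned arrival times in seconds, same order.
    Uses Gauss-Newton on range differences relative to sensor 0.
    """
    # Start from the centroid of the sensor positions.
    x = sum(s[0] for s in sensors) / len(sensors)
    y = sum(s[1] for s in sensors) / len(sensors)
    x0, y0 = sensors[0]
    t0 = arrival_times[0]
    for _ in range(iterations):
        d0 = max(math.hypot(x - x0, y - y0), 1e-9)
        # Accumulate the 2x2 normal equations (J^T J) * step = -J^T r.
        a11 = a12 = a22 = b1 = b2 = 0.0
        for (sx, sy), t in zip(sensors[1:], arrival_times[1:]):
            di = max(math.hypot(x - sx, y - sy), 1e-9)
            # Residual: predicted minus measured range difference.
            r = (di - d0) - SPEED_OF_SOUND * (t - t0)
            jx = (x - sx) / di - (x - x0) / d0
            jy = (y - sy) / di - (y - y0) / d0
            a11 += jx * jx
            a12 += jx * jy
            a22 += jy * jy
            b1 -= jx * r
            b2 -= jy * r
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break  # degenerate geometry, give up on further refinement
        dx = (b1 * a22 - a12 * b2) / det
        dy = (a11 * b2 - a12 * b1) / det
        x += dx
        y += dy
        if math.hypot(dx, dy) < 1e-6:
            break
    return x, y
```

The emission time never has to be known because only differences between arrival times enter the residual. For scale: 60 ns of clock error corresponds to about 0.02 mm of acoustic path at 343 m/s, so with PPS-disciplined clocks the timing is unlikely to be the dominant error source.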
The trick is to use the ESP32's internal 10 MHz capture hardware, which automatically latches a timestamp into a register whenever a GPIO changes state, rather than high-level C constructs that have to eat their way through x API layers.
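Once the capture hardware hands you raw tick counts for PPS edges and audio frames, turning them into GPS-aligned timestamps is not much code. The sketch below shows the logic in Python for readability, not the actual firmware. It assumes a 32-bit free-running counter at the nominal 10 MHz mentioned above, and it swaps the full PID loop for a simpler stand-in: a proportional correction of the frequency estimate, with the phase re-anchored at every PPS edge. The class and method names are hypothetical.

```python
TICK_HZ_NOMINAL = 10_000_000  # nominal capture timer rate, taken from the comment
COUNTER_MODULUS = 1 << 32     # assumption: 32-bit capture register that wraps


class PpsDisciplinedClock:
    """Maps captured timer ticks to GPS seconds using PPS edge captures."""

    def __init__(self, gain=0.5):
        self.gain = gain  # how strongly each PPS interval corrects the rate
        self.ticks_per_second = float(TICK_HZ_NOMINAL)
        self.anchor_tick = None
        self.anchor_second = None

    def on_pps(self, captured_tick, gps_second):
        """Call with the tick latched on a PPS edge and its GPS second."""
        if self.anchor_tick is not None:
            elapsed_s = gps_second - self.anchor_second
            if elapsed_s <= 0:
                return  # ignore duplicate or out-of-order edges
            # Modular subtraction keeps this correct across counter wraparound.
            elapsed_ticks = (captured_tick - self.anchor_tick) % COUNTER_MODULUS
            measured = elapsed_ticks / elapsed_s
            self.ticks_per_second += self.gain * (measured - self.ticks_per_second)
        # Re-anchor the phase on every edge so offset errors cannot accumulate.
        self.anchor_tick = captured_tick
        self.anchor_second = gps_second

    def timestamp(self, captured_tick):
        """Convert a tick latched after the last PPS edge into GPS seconds."""
        delta = (captured_tick - self.anchor_tick) % COUNTER_MODULUS
        return self.anchor_second + delta / self.ticks_per_second
```

Usage would be: call `on_pps` from the PPS capture handler, then call `timestamp` with the tick latched at the start of each audio frame before sending the frame to the host. At 10 MHz one tick is 100 ns, so the counter resolution is in the same range as the PPS precision.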
This costs like 20€.