Humans aren't using only fixed vision for driving. This is such a tiresome thing to see repeated in every discussion about self driving.
You're listening to the road and car sounds around you. You're feeling vibration on the road. You're feeling feedback on the steering wheel. You're using a combination of monocular and binocular depth perception - plus, your eyes are not a fixed focal length "cameras". You're moving your head to change the perspective you see the road at. Your inner ear is telling you about your acceleration and orientation.
In theory, a computer should be able to do the same. It could do sensor fusion with even more sense modalities than we have. It could have an array of cameras and potentially out-do our stereo vision, or perhaps even use some lightfield magic to (virtually) analyze the same scene with multiple optical paths.
However, there is also a lot of interaction between our perceptual system and cognition. Just for depth perception, we're doing a lot of temporal analysis. We track moving objects and infer distance from assumptions about scale and object permanence. We don't just repeatedly make depth maps from 2D imagery.
The brute-force approach is something like training visual language models (VLMs). E.g. you could train on lots of movies and be able to predict "what happens next" in the imaging world.
But, compared to LLMs, there is a bigger gap between the model and the application domain with VLMs. It may seem like LLMs are being applied to lots of domains, but most are just tiny variations on the same task of "writing what comes next", which is exactly what they were trained on. Unfortunately, driving is not "painting what comes next" in the same way as all these LLM writing hacks. There is still a big gap between that predictive layer, planning, and executing. Our giant corpus of movies does not really provide the ready-made training data to go after those bigger problems.
In India (among others), honking is essential to reducing crashes
We often greatly underestimate / undervalue the role of our ears relative to vision. As my film director friend says, 80% of the impact in a movie is in the sound
Most of what you said has nothing to do with lidar vs camera
20 meters away motion vision is more accurate than stereoscopic vision. What is lidar helping to solve here?
And also, even with the suite of sensors that humans have, their vision perception is frequently inadequate and leads to crashes. If vision was good enough, "SMIDSY" wouldn't be such an infamous acronym in vehicle injury cases.