Hacker News

ra7 yesterday at 5:02 PM | 11 replies

> The insight driving the program, Naga said, is that the limiting factor for AV development is no longer the underlying technology. “The bottleneck is data,” he said. “[Companies like Waymo] need to go around and collect the data, collect different scenarios. You may be able to say: in San Francisco, ‘At this school intersection, I want some data at this time of day so I can train my models.’ The problem for all these companies is access to that data, because they don’t have the capital to deploy the cars and go collect all this information.”

You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong.

Waymo’s bottleneck has never been data. When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...

Waymo is able to deploy with less (but targeted, high-quality) data collection because of its world-class simulation capabilities. Not that they haven't collected huge amounts of data, which is no doubt important (I've heard their onboard storage is transferred and emptied every few days); it's just not a bottleneck. They have the most efficient operation in the AV industry.

The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.


Replies

simmonmt yesterday at 5:55 PM

> When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate

I think it's more about detecting changes to the world. You need boots on the ground, so to speak, to see that new speed limit sign or the new lane paint. The Waymo vehicle can no doubt react to changes in the world when it encounters them, relaying them back to the mothership, but it's better to know about them in advance.

suddenexample yesterday at 5:56 PM

Yeah I'm not so sure this CTO is on the mark here, but to be fair, I do think some of this IRL long tail/edge case data is important for Waymo. The simulation software is super interesting to me - the real world can be so chaotic, and even if they could generate every possible real life case, there needs to be validation on whether the Waymo driver is responding in the optimal way. They certainly haven't solved this problem, you can see some of their growing pains in all of these articles - floods in Austin, more and more interactions with emergency vehicles that first responders seem to believe are getting worse, etc.

Tesla, on the other hand, has billions of miles of data, yet because there is a limit to camera-only techniques, that data isn't that useful, is it? They have no ground truth data to evaluate their camera system against, which is why you sometimes see those Teslas driving around with lidar rigs mounted on them. Going camera-only is just asking for trouble.

KaiserPro yesterday at 7:19 PM

> The best example of why data collection isn’t the bottleneck is Tesla.

Exactly. Plus, any delivery or dashcam company can provide a bunch of data wherever there is a sizeable population.

About 8 years ago, that data would have been really valuable, but now it's a nice-to-have at best.

The only thing that is valuable is the breadth of different cars, but even then it's not that much of a differentiator.

Sardtok yesterday at 7:06 PM

The biggest difference is that Uber has vehicles around the world, so there's more data from countries whose rules differ from those in the US. Signage is definitely different between the US and Europe.

iugtmkbdfil834 yesterday at 9:04 PM

I.. am amused by the confidence on display, but I can't say that I am not concerned that people are confidently stating that real world data is not useful, because it can be just simulated. One would think that, by now at least, we know that simulation is at best an imperfect copy.

And I don't like the idea of even more data being harvested and used.. I just find the dismissal.. odd.

cogman10 yesterday at 5:59 PM

> The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.

Well, TBF, the Tesla data from earlier vehicles was complete garbage. Those vehicles had cheap and somewhat bad cameras that were only fairly recently upgraded. And even then, I don't think Tesla is at the end of their hardware journey. I think they don't think that either, which is why they've gone to a subscription-only model for self-driving vehicles.

Waymo, on the other hand, has gathered less data, but higher-quality data. They do the expensive mapping of a city, which is a big part of why their vehicles were able to pull off some pretty impressive feats early on. The drawback is that getting that high-quality data takes a lot of time and resources.

gcheong yesterday at 6:15 PM

Didn't they need the data from the 200 million miles or so of actual driving before they could get to the generative model, though? Data isn't everything, as you point out with Tesla (mainly because they decided to forgo lidar, it would seem), but it is pretty fundamental.

jsemrau today at 12:27 AM

> You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong.

Problem 1: Cost and privacy constraints limit data collection.

Problem 2: It doesn't make much sense to collect and store data that you already have, yet at collection time you don't know whether the data is useful or not.

Problem 3: P2P in urban settings fails at edge cases, which by definition are rarely captured.

All of these problems limit AV scaling.

whiplash451 yesterday at 6:12 PM

Waymo might very well be missing specific kinds of data (e.g., more incidents/accidents, near-collisions, etc.).

Also, Uber’s data might be useful for eval rather than training (e.g., "here is how Waymo would behave vs. human drivers, therefore it is safer").

bobro yesterday at 6:40 PM

I find the idea of learning from simulated data so unintuitive. How can you radically improve your model with just your model? I take it people do it, so it must work, but I just don’t understand it at all.

cyanydeez yesterday at 7:41 PM

Yes, the way to make these things safer is to make up data and simulate on that.

Do you hear yourself?
