the observability point hits hard - you're right that the real review cost isn't any single diff, it's having no visibility across runs and no way to track quality over time.

the CI gate approach makes sense in theory, but most teams I've seen don't have the test coverage to make it safe, so you end up needing manual review anyway. and the 'does this do what I asked' question is harder than it sounds: sometimes the AI builds the wrong thing correctly, and your existing tests pass because they encode different assumptions than the ones the task actually needed.

but yeah, the lack of cost-per-task tracking and quality metrics is brutal. you're flying blind on whether you're actually saving time or just moving the work from writing code to reviewing it.
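for the tracking piece specifically, even something dumb like an append-only jsonl log per task gets you surprisingly far. rough sketch of what I mean below - the field names, the `agent_tasks.jsonl` file, and the `record_task`/`summarize` helpers are all made up for illustration, not any particular tool's schema:

```python
# minimal sketch of per-task cost/quality logging - all names here
# are hypothetical, adapt to whatever your agent setup exposes
import json
import time
from pathlib import Path

LOG = Path("agent_tasks.jsonl")  # hypothetical log location

def record_task(task_id: str, prompt_tokens: int, completion_tokens: int,
                cost_usd: float, review_minutes: float, accepted: bool) -> None:
    """append one task's cost and review outcome so trends are queryable later"""
    entry = {
        "ts": time.time(),
        "task_id": task_id,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,
        "review_minutes": review_minutes,  # the human time you're supposedly saving
        "accepted": accepted,              # did the change survive review as-is
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def summarize() -> None:
    """crude rollup: total spend, acceptance rate, avg review time per task"""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()]
    if not entries:
        print("no tasks logged yet")
        return
    n = len(entries)
    accepted = sum(e["accepted"] for e in entries)
    print(f"tasks: {n}")
    print(f"total spend: ${sum(e['cost_usd'] for e in entries):.2f}")
    print(f"acceptance rate: {accepted / n:.0%}")
    print(f"avg review time: {sum(e['review_minutes'] for e in entries) / n:.1f} min")

if __name__ == "__main__":
    record_task("fix-login-bug", 4200, 1800, 0.19, review_minutes=12, accepted=True)
    summarize()
```

nothing fancy, but once that file exists you can at least answer 'is the acceptance rate trending up or down' instead of guessing.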