You might like SWE-WebDevBench, which tries to do this kind of comprehensive eval for webapp development: https://webdevbench.com/