This seems like a viable eval strategy. Presumably, finding a bug requires some degree of understanding of the code, beyond just information retrieval. However, it probably does not measure things like prompt adherence or the ability to write code that implements a specification.