Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
Does folding a protein count? How about increasing performance at Go?