Two follow-ups:
1) How do you compare accuracy? By checking if the answer is in any of the returned grep/BM25/semble snippets?
2) How do you measure token use without the agent, prompt, and tools?
1) Yes! The metric is NDCG, though, not accuracy. 2) We assume that if the agent gets the correct answer in the returned snippets, it does not need to read further.
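For anyone unfamiliar with NDCG, here is a minimal sketch of how it could be computed over a ranked list of returned snippets. This is not the actual benchmark code: the function names, the binary relevance labels (1 if a snippet contains the answer, 0 otherwise), and the choice to normalize against the best ordering of the returned list only are all assumptions made for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 1 gets discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal ordering of the same list.

    Assumption: the ideal ranking is built from the returned snippets only,
    not from every relevant snippet in the corpus.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical labels: 1 = snippet contains the answer, 0 = it does not.
answer_first = [1, 0, 0]
answer_second = [0, 1, 0]
print(ndcg_at_k(answer_first, 3), ndcg_at_k(answer_second, 3))
```

Unlike a plain "is the answer anywhere in the results" check, this rewards a tool for ranking the relevant snippet higher, which is why the score drops when the answer moves from rank 1 to rank 2.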