So far we take a largely rule-based approach and try not to use LLMs any more than necessary. We extract and parse the citations using a mix of ML and rule-based methods, run a set of predetermined queries, apply various fuzzy-matching strategies to the metadata components, and keep a set of rules around the risk level of what we should have found/matched given the type of source, which venue we should have found it in, and so on.
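To give a flavor of the fuzzy-matching step: something like the sketch below, using only the stdlib, is one way to compare parsed citation metadata against a candidate record. The field names and the 0.9 threshold are illustrative assumptions, not the tool's actual values.

```python
# Hypothetical sketch of fuzzy metadata matching; thresholds and
# field names are made up for illustration.
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't count against the match.
    return " ".join(s.lower().split())

def fuzzy_score(a: str, b: str) -> float:
    # Similarity ratio in [0, 1]; 1.0 means identical after normalization.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_citation(parsed: dict, candidate: dict,
                   title_threshold: float = 0.9) -> bool:
    # A candidate "matches" if the title is close enough and the
    # year, when present on both sides, agrees exactly.
    if fuzzy_score(parsed["title"], candidate["title"]) < title_threshold:
        return False
    if parsed.get("year") and candidate.get("year"):
        return parsed["year"] == candidate["year"]
    return True

print(match_citation(
    {"title": "Attention Is All You Need", "year": 2017},
    {"title": "Attention is all you need", "year": 2017},
))  # → True
```

In practice each metadata component (authors, venue, pages) would get its own comparison logic, with the per-field scores feeding into the risk rules mentioned above.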
So there are absolutely a bunch of tasks that could be evaluated or benchmarked, but "hallucination rate" isn't a particularly applicable or interesting metric for how good the tool is.
That said, we do use various LLMs (mostly local, fine-tuned, and small, for things like NER, parsing, and metadata comparison), and they can and do hallucinate. But we put very hard constraints on validation: for example, any extraction result that doesn't match 1:1 back to the input text is discarded. So again, rather than managing hallucination risk, we prefer hard constraints.