January 10, 2025 2:05 PM
Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly on more complex tasks and when users are looking for specific, highly detailed responses.
It’s a challenge data scientists have struggled to overcome, and now, researchers from Google DeepMind say they have come a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses based on long-form documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.
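To make the scoring idea concrete, below is a minimal sketch of how a FACTS-style grounded-factuality score could be computed. Everything here is illustrative: `facts_style_score` and `naive_judge` are hypothetical names, and the toy substring check stands in for the automated LLM judges the actual benchmark relies on.

```python
"""Minimal, illustrative sketch of a FACTS-style grounded-factuality score.

Assumption: each example pairs a source document and prompt with a model
response, and one or more automated "judges" decide whether the response
is fully supported by the document.
"""
from typing import Callable, List, Dict

# A judge maps (document, prompt, response) to a verdict:
# True if the response is grounded in the document, False otherwise.
Judge = Callable[[str, str, str], bool]


def facts_style_score(examples: List[Dict[str, str]], judges: List[Judge]) -> float:
    """Return the fraction of responses that every judge deems grounded.

    Requiring agreement across several judges (modeled here as unanimity)
    is one way to dampen any single judge model's biases.
    """
    grounded = 0
    for ex in examples:
        if all(j(ex["document"], ex["prompt"], ex["response"]) for j in judges):
            grounded += 1
    return grounded / len(examples)


def naive_judge(document: str, prompt: str, response: str) -> bool:
    """Toy stand-in judge: every sentence of the response must appear
    verbatim in the document. A real judge would be an LLM prompted to
    verify that each claim is supported by the source text."""
    return all(
        sentence.strip().lower() in document.lower()
        for sentence in response.split(".")
        if sentence.strip()
    )


if __name__ == "__main__":
    data = [{
        "document": "The plant opened in 1998 and employs 450 people.",
        "prompt": "When did the plant open?",
        "response": "The plant opened in 1998.",
    }]
    print(f"factuality score: {facts_style_score(data, [naive_judge]):.1%}")
```

In this toy setup the single example passes the single judge, yielding a score of 100%; the published leaderboard scores are aggregates of this kind of per-response verdict over the full benchmark set.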
Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.
As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top 9 include Google’s Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. All of these scored above 61.7% for factual accuracy.