LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

Minsuk Kahng - Google, Atlanta, United States

Ian Tenney - Google Research, Seattle, United States

Mahima Pushkarna - Google Research, Cambridge, United States

Michael Xieyang Liu - Google Research, Pittsburgh, United States

James Wexler - Google Research, Cambridge, United States

Emily Reif - Google, Cambridge, United States

Krystal Kallarackal - Google Research, Mountain View, United States

Minsuk Chang - Google Research, Seattle, United States

Michael Terry - Google, Cambridge, United States

Lucas Dixon - Google, Paris, France

Room: Bayshore V

2024-10-18T12:42:00Z
Exemplar figure: LLM Comparator is a visual analytics tool consisting of multiple views: an interactive table that displays individual prompts and model responses, and a visualization summary comprising multiple panels, including score distribution, metrics by prompt category, rationale clusters, n-grams, and custom functions.
Keywords

Visual analytics, large language models, model evaluation, responsible AI, machine learning interpretability.

Abstract

Evaluating large language models (LLMs) presents unique challenges. While automatic side-by-side evaluation, also known as LLM-as-a-judge, has become a promising solution, model developers and researchers face difficulties with scalability and interpretability when analyzing these evaluation outcomes. To address these challenges, we introduce LLM Comparator, a new visual analytics tool designed for side-by-side evaluations of LLMs. This tool provides analytical workflows that help users understand when and why one LLM outperforms or underperforms another, and how their responses differ. Through close collaboration with practitioners developing LLMs at Google, we have iteratively designed, developed, and refined the tool. Qualitative feedback from these users highlights that the tool facilitates in-depth analysis of individual examples while enabling users to visually overview and flexibly slice data. This empowers users to identify undesirable patterns, formulate hypotheses about model behavior, and gain insights for model improvement. LLM Comparator has been integrated into Google's LLM evaluation platforms and open-sourced.
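To make the side-by-side (LLM-as-a-judge) setup concrete, below is a minimal sketch of what a per-example evaluation record and a simple per-category aggregation might look like. The field names (`response_a`, `score`, `category`) and the scoring convention are illustrative assumptions for this sketch, not the tool's actual data schema.

```python
# Hypothetical sketch of side-by-side (LLM-as-a-judge) evaluation records
# and a per-category win-rate aggregation, in the spirit of the tool's
# "metrics by prompt category" view. Field names are assumptions.
from collections import defaultdict

# Each record: a prompt, responses from two models, and a judge score in
# [-1, 1], where positive values favor model A and negative favor model B.
records = [
    {"category": "coding", "prompt": "Write a function to reverse a string.",
     "response_a": "def rev(s): return s[::-1]", "response_b": "Use a loop ...",
     "score": 0.8},
    {"category": "coding", "prompt": "Explain the big-O of binary search.",
     "response_a": "O(log n) because ...", "response_b": "O(n) ...",
     "score": 1.0},
    {"category": "writing", "prompt": "Draft a polite follow-up email.",
     "response_a": "Hi team, ...", "response_b": "Dear all, ...",
     "score": -0.5},
]

def win_rates_by_category(records, threshold=0.0):
    """Fraction of examples per category where the judge favors model A."""
    wins, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["score"] > threshold:
            wins[r["category"]] += 1
    return {c: wins[c] / totals[c] for c in totals}

print(win_rates_by_category(records))
# e.g. {'coding': 1.0, 'writing': 0.0}
```

Slicing win rates by prompt category in this way is one example of the kind of aggregate view that helps users see where one model outperforms another before drilling into individual examples.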