IEEE VIS 2024

Representing Charts as Text for Language Models: An In-Depth Study of Question Answering for Bar Charts

Victor S. Bursztyn - Adobe Research, San Jose, United States

Jane Hoffswell - Adobe Research, Seattle, United States

Shunan Guo - Adobe Research, San Jose, United States

Eunyee Koh - Adobe Research, San Jose, United States


Room: Bayshore VI

Session time: 2024-10-17T14:42:00Z
Exemplar figure, described by caption below
We explore two main tasks related to chart-grounded Q&A: question answering (QA) and visual explanation generation (VEG). QA leverages templated domain facts (DF) from the chart's CSV file, whereas VEG relies on visual context (VC) from its JSON file. In the first fine-tuning step, the charts' underlying text files are injected into the language models (LMs). We then fine-tune the QA and VEG steps on 90% of the charts, with 10% held out for testing during our evaluation in §4. To understand the robustness of our LMs to natural language variation, we also perform a question paraphrasing task to rephrase our template-generated questions more naturally.
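Below is a minimal sketch (not the authors' released code) of how a chart's underlying CSV and JSON files might be serialized into fine-tuning records for the chart-injection step described in the caption, along with the 90/10 chart-level split. File names, the "category"/"value" columns, and the fact template are illustrative assumptions.

```python
import csv
import json
import random

def load_domain_facts(csv_path):
    """Turn each CSV row into a templated domain-fact sentence for QA (hypothetical template)."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [f"{r['category']} has a value of {r['value']}." for r in rows]

def load_visual_context(json_path):
    """Serialize the chart's JSON specification as text for VEG."""
    with open(json_path) as f:
        spec = json.load(f)
    return json.dumps(spec, sort_keys=True)

def build_records(chart_ids):
    """Create (prompt, completion) pairs that inject each chart's text files into the LM."""
    records = []
    for cid in chart_ids:
        facts = load_domain_facts(f"charts/{cid}.csv")      # assumed file layout
        context = load_visual_context(f"charts/{cid}.json")  # assumed file layout
        records.append({"prompt": f"Chart {cid} data:", "completion": " ".join(facts)})
        records.append({"prompt": f"Chart {cid} spec:", "completion": context})
    return records

# 90/10 train/test split over charts, as described in the caption.
chart_ids = [f"bar_{i:03d}" for i in range(359)]
random.seed(0)
random.shuffle(chart_ids)
split = int(0.9 * len(chart_ids))
train_ids, test_ids = chart_ids[:split], chart_ids[split:]
# train_records = build_records(train_ids)  # requires the chart files assumed above
```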
Keywords

Machine Learning Techniques; Charts, Diagrams, and Plots; Datasets; Computational Benchmark Studies

Abstract

Machine learning models for chart-grounded Q&A (CQA) often treat charts as images, but performing CQA on pixel values has proven challenging. We thus investigate a resource overlooked by current ML-based approaches: the declarative documents describing how charts should visually encode data (i.e., chart specifications). In this work, we use chart specifications to enhance language models (LMs) for chart-reading tasks, such that the resulting system can robustly understand language for CQA. Through a case study with 359 bar charts, we test novel fine-tuning schemes on both GPT-3 and T5 using a new dataset curated for two CQA tasks: question answering and visual explanation generation. Our text-only approaches strongly outperform vision-based GPT-4 on explanation generation (99% vs. 63% accuracy), and show promising results for question answering (57-67% accuracy). Through in-depth experiments, we also show that our text-only approaches are mostly robust to natural language variation.
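As a rough illustration of the text-only idea in the abstract, the sketch below assembles a CQA prompt from a chart's declarative specification rather than from pixels. The Vega-Lite-style spec, the question, and the prompt format are illustrative assumptions, not the paper's exact templates.

```python
import json

# Hypothetical bar-chart specification (Vega-Lite-style), used as visual context.
spec = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "country", "type": "nominal"},
        "y": {"field": "exports", "type": "quantitative", "title": "Exports (USD, millions)"},
    },
}

question = "Which axis encodes the quantitative variable, and what is its title?"

# Text-only prompt: the chart is represented by its specification, not an image.
prompt = (
    "Chart specification:\n"
    + json.dumps(spec, indent=2)
    + "\n\nQuestion: "
    + question
    + "\nAnswer:"
)
print(prompt)  # This prompt would then be passed to a fine-tuned LM such as T5 or GPT-3.
```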