ScienceQA: Science Question Answering

Data and code for the NeurIPS 2022 paper "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering".

For more details, please refer to the project page with dataset exploration and visualization tools: https://scienceqa.github.io.

🔔 If you have any questions or suggestions, please don't hesitate to let us know. You can email Pan Lu at UCLA at lupantech@gmail.com, comment on Twitter, or open an issue in this repository.

💥 News 💥

[2023.12.29] 🚨 We have a major update featuring over 100 recent models! We appreciate your contributions and feedback. 🚀
[2023.05.04] ScienceQA is featured in the leaked internal Google document "We Have No Moat, And Neither Does OpenAI", which highlights the advancements and impact of ScienceQA within the AI research community. 🎯
[2023.05.03] In April, our ScienceQA dataset was downloaded 1,421 times from HuggingFace Datasets, showcasing its growing popularity in the community. 🌟
[2023.04.19] Chameleon: Developed by UCLA and Microsoft, this innovative project achieves a new SOTA in the few-shot setting, reaching an impressive 86.54%. ⭐
[2023.04.17] LLaVA: A collaborative effort by UW–Madison and Microsoft, this groundbreaking work sets a new SOTA at 92.53%. ⭐
[2023.04.01] Our work is accepted to the CVPR 2023 O-DRUM Workshop.
[2023.04.01] Our work is covered by Towards AI.
[2023.04.01] Our ScienceQA dataset was downloaded 377 times from HuggingFace Datasets in March.
[2023.03.30] The ScienceQA dataset is now included in OpenDataLab.
[2023.03.28] The ScienceQA dataset has served as the primary benchmark for LLaMA-Adapter, developed by Shanghai AI Laboratory, UCLA, and CUHK. ⭐
[2023.02.13] Pan Lu gives an oral presentation of our work at the AAAI 2023 KnowledgeNLP Workshop.
[2023.02.05] Our work is covered by MarkTechPost.
[2023.02.24] The ScienceQA dataset is now available on HuggingFace Datasets. ⭐
[2023.02.02] The ScienceQA dataset has served as the primary benchmark for the new generation of multimodal reasoning systems, Multimodal-CoT, developed by Amazon Science.
[2022.11.29] Pan Lu gives a poster presentation of our work at NeurIPS 2022.
[2022.11.20] Our work is covered by Geek Culture | Medium.
[2022.11] Our work is now listed on Papers with Code.
[2022.09.22] Our work is accepted to NeurIPS 2022. 🌟
[2022.09.20] Our work is featured in Deep AI.

🔥 Leaderboard 🔥

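Evaluation of different methods on the test split (whole: 4,241 examples; mini: 1,000 examples). Accuracies (%) are reported per question category along with the overall average (Avg). Categories: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. #Size is the model size and #P is the number of trainable parameters.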

😀 You are invited to contribute your results on the ScienceQA test split! Please send your scores to the contact email above or open a new issue on the GitHub repository.

⚠️⚠️⚠️ Caveat: The data in the leaderboard is collected manually from existing papers. There might be some errors in the data, ambiguous data due to different interpretations, and missing data due to the lack of information in the papers. Make sure to double-check the data before using it. Please contact us at this email if you find any errors or have any suggestions. We appreciate your contributions and feedback.

The interactive leaderboard is available at https://scienceqa.github.io/leaderboard.html.
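
As a rough illustration of how per-category accuracies and an overall average like those in the table below could be computed, here is a minimal Python sketch. The record fields (`subject`, `contexts`, `grade`, `correct`) and the function names are hypothetical and are not the repository's actual schema or evaluation script. The sketch also assumes that a question may have text context, image context, both, or neither, and that Avg is the accuracy over all questions rather than the mean of the category columns.

```python
from collections import defaultdict

# Hypothetical per-question records (illustrative field names only).
# subject: "NAT", "SOC", or "LAN"; contexts: any of "TXT"/"IMG", or empty.
SAMPLE = [
    {"subject": "NAT", "contexts": ["IMG"], "grade": 3, "correct": True},
    {"subject": "NAT", "contexts": ["TXT", "IMG"], "grade": 8, "correct": False},
    {"subject": "SOC", "contexts": [], "grade": 5, "correct": True},
    {"subject": "LAN", "contexts": ["TXT"], "grade": 10, "correct": True},
]


def category_tags(record):
    """Return every leaderboard column this question counts toward."""
    # A question with no text or image context falls in the "NO" column.
    context_tags = list(record["contexts"]) or ["NO"]
    grade_tag = "G1-6" if record["grade"] <= 6 else "G7-12"
    # "Avg" is assumed to cover every question once.
    return [record["subject"], *context_tags, grade_tag, "Avg"]


def accuracy_by_category(records):
    """Compute accuracy in percent for each category and overall."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        for tag in category_tags(record):
            totals[tag] += 1
            hits[tag] += int(record["correct"])
    return {tag: round(100.0 * hits[tag] / totals[tag], 2) for tag in totals}


if __name__ == "__main__":
    for tag, score in accuracy_by_category(SAMPLE).items():
        print(f"{tag}: {score:.2f}")
```

Because the three groupings (subject, context, grade) overlap, each question contributes to several columns at once, which is why the category scores do not simply average to Avg.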

#	Model	Method	Learning	#Size	#P	Link	Date	NAT	SOC	LAN	TXT	IMG	NO	G1-6	G7-12	Avg
*	Human Performance	-	-	-	-	Link	22-09-20	90.23	84.97	87.48	89.60	87.50	88.10	91.59	82.42	88.40
*	Random Chance	-	-	-	-	Link	22-09-20	40.28	46.13	29.25	47.45	40.08	33.66	39.35	40.67	39.83
1	Multimodal-T-SciQ_Large 🥇	LLM	Fine-tune	738M	738M	Link	23-05-05	96.89	95.16	95.55	96.53	94.70	96.79	96.44	95.72	96.18
2	MC-CoT_F-Large 🥈	VLM	Fine-tune	783M	-	Link	23-11-23	97.47	90.44	93.18	96.97	93.75	94.49	95.30	94.13	94.88
3	Honeybee (Vicuna-13B) 🥉	VLM	Fine-tune	13B	-	Link	23-12-11	95.20	96.29	91.18	94.48	93.75	93.17	95.04	93.21	94.39
4	Enigma-COT_Large	LLM	Fine-tune	793M	793M	Link	23-07-24	97.51	84.70	94.73	96.68	91.37	95.89	94.46	93.47	94.11
5	MC-CoT_Large	VLM	Fine-tune	738M	-	Link	23-11-23	95.47	89.99	91.82	95.11	92.66	93.24	94.27	91.76	93.37
6	DPMM-CoT_Large	VLM	Fine-tune	738M	738M	Link	23-12-14	95.52	90.33	91.36	95.50	93.26	92.68	93.28	93.47	93.35
7	LLaVA (GPT-4 judge)	VLM	Fine-tune	13B	13B	Link	23-04-17	91.56	96.74	91.09	90.62	88.99	93.52	92.73	92.16	92.53
8	CoMD (Vicuna-7B)	VLM	Fine-tune	7B	-	Link	23-11-14	91.83	95.95	88.91	90.91	89.94	91.08	92.47	90.97	91.94
9	Multimodal-T-SciQ_Base	LLM	Fine-tune	223M	223M	Link	23-05-05	91.52	91.45	92.45	91.94	90.33	92.26	92.11	91.10	91.75
10	Multimodal-CoT_Large	VLM	Fine-tune	738M	738M	Link	23-02-02	95.91	82.00	90.82	95.26	88.80	92.89	92.44	90.31	91.68
11	PILL (LLaMA-7B)	VLM	Fine-tune	7B	45M	Link	23-11-03	90.36	95.84	89.27	89.39	88.65	91.71	92.11	89.65	91.23
12	LLaVA (ViT-L/16-224)	VLM	Fine-tune	13B	-	Link	23-12-04	-	-	-	-	-	-	-	-	91.2
13	DPMM-CoT_Base	VLM	Fine-tune	223M	223M	Link	23-12-14	92.72	87.85	89.91	92.72	90.48	91.29	91.45	90.11	90.97
14	LLaVA	VLM	Fine-tune	13B	13B	Link	23-04-17	90.36	95.95	88.00	89.49	88.00	90.66	90.93	90.90	90.92
15	LaVIN-13B	VLM	Fine-tune	13B	5.4M	Link	23-05-24	89.88	94.49	89.82	88.95	87.61	91.85	91.45	89.72	90.83
16	MC-CoT_F-Base	VLM	Fine-tune	248M	-