Files
Download Full Text (280 KB)
Description
Artificial intelligence language models (AI-LMs) show promise in medical education and clinical problem solving, but their performance in medical board exams has been inconsistent (1,2,3).
This study investigated how prompt engineering and providing essential reference material influence AI-LM performance on board exam questions.
We tested ChatGPT and Claude-2 on 360 questions in various formats under three conditions: a simple prompt, an elaborate CRAFTS prompt (context, role, action, format, tone, and style), and a CRAFTS prompt supplemented with reference material.
Results showed significant improvement in accuracy from simple prompts (67-83.5%) to CRAFTS prompts (76.2-92.5%) to CRAFTS prompts with references (97.3-99.4%).
These findings suggest that AI-LM performance on medical exams is heavily influenced by instruction quality and provided context, highlighting the need for standardized evaluation methods.
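The CRAFTS conditions described above can be sketched as a simple prompt-template builder. This is a minimal illustration only: the field contents and function name below are assumptions, not the study's actual prompts or code.

```python
# Minimal sketch of a CRAFTS-style prompt builder (context, role, action,
# format, tone, style). Field wording is illustrative, not from the study.

def build_crafts_prompt(question: str, references: str = "") -> str:
    parts = {
        "Context": "You are answering a pathology board exam question.",
        "Role": "Act as an experienced pathologist.",
        "Action": "Select the single best answer and justify it briefly.",
        "Format": "Answer letter first, then a one-sentence rationale.",
        "Tone": "Professional and concise.",
        "Style": "Use standard medical terminology.",
    }
    prompt = "\n".join(f"{key}: {value}" for key, value in parts.items())
    if references:
        # Third study condition: CRAFTS prompt plus reference material.
        prompt += f"\nReferences:\n{references}"
    return prompt + f"\nQuestion:\n{question}"
```

The simple-prompt condition would correspond to sending the question text alone; the second and third conditions prepend the structured fields with and without the `references` block.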
Publication Date
7-2024
Disciplines
Pathology
Recommended Citation
Qu Z, Elzieny M, Arora K. Optimize artificial intelligence language model use in medical board exams: insights from instruction quality and domain context analysis. Presented at: Association for Academic Pathology Annual Meeting; 2024 Jul 21-24; Washington, DC.
Comments
Association for Academic Pathology Annual Meeting, July 21-24, 2024, Washington, DC