Optimize Artificial Intelligence Language Model Use in Medical Board Exams: Insights From Instruction Quality and Domain Context Analysis
Document Type
Conference Proceeding
Publication Date
4-2024
Publication Title
Academic Pathology
Abstract
Objectives: The standard board exam provides an excellent test of the usefulness of artificial intelligence language models (AI-LMs) in medical education and clinical problem-solving. However, the performance of AI-LMs on standard board exams varies greatly: the same models have shown impressive performance in some studies and unsatisfactory performance in others (ref). We hypothesize that this discrepancy largely results from how the AI-LMs were tested. This study aims to delineate how two key strategies affect the performance of AI-LMs. We reasoned that if such key determinants are identified, they can be implemented to improve AI-LM performance.

Methods: 360 examination questions with correct answers in four different formats (multiple choice, true-or-false, fill-in-the-blank, and answer matching) were used to cross-test two different AI-LMs (ChatGPT and Claude-2) in three different experimental settings. In the first, the questions were submitted to the AI-LMs with a simple instructive prompt: "Please follow the test instruction and provide the correct answer to each of the following test questions: [Questions]." In the second, the same questions were submitted with an elaborate prompt following our CRAFTS formula (i.e., context, role, action, format, tone, and style). The third setting replicated the second but additionally provided journal review articles or book chapters on the topic of the questions as reference material for the AI-LM.

Results: Accuracy rates ranged from 67% to 83.5%, 76.2% to 92.5%, and 97.3% to 99.4% for the test groups with the simple prompt only, the CRAFTS prompt, and the CRAFTS prompt plus a reference article, respectively. Thus, both prompt engineering and expert reference articles enhance the performance of AI-LMs. The performance of ChatGPT and Claude-2 varies with the type of test question, but the statistical significance of the difference cannot be assessed because of the small sample sizes at the current stage of the study. Nonetheless, the trend in accuracy rates above holds for both ChatGPT and Claude-2.

Conclusions: The performance of AI language models on board exams is significantly influenced by the quality of the instructions (prompt) and the domain context provided. Our findings suggest that the variation across previous studies can largely be attributed to differences in these aspects. Evaluating AI language models should therefore involve well-defined standards. Our methodology and results may also be relevant for employing AI language models in clinical problem-solving scenarios.
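For illustration, the sketch below (Python) shows one way the three experimental settings could be assembled as prompt templates. Only the simple instructive prompt is quoted from the Methods; the CRAFTS field contents, the sample question, and the helper names are hypothetical placeholders, not the study's exact prompts.

```python
# A minimal sketch of the three prompting conditions described in the
# abstract. Field contents and helper names are hypothetical illustrations.

# Setting 1: the simple instructive prompt quoted in the Methods.
SIMPLE_PROMPT = (
    "Please follow the test instruction and provide the correct answer "
    "to each of the following test questions: {questions}"
)

def crafts_prompt(context: str, role: str, action: str,
                  fmt: str, tone: str, style: str, questions: str) -> str:
    """Setting 2: assemble an elaborate prompt following the CRAFTS
    formula (context, role, action, format, tone, style)."""
    return (
        f"Context: {context}\n"
        f"Role: {role}\n"
        f"Action: {action}\n"
        f"Format: {fmt}\n"
        f"Tone: {tone}\n"
        f"Style: {style}\n\n"
        f"Questions:\n{questions}"
    )

def crafts_with_reference(reference_text: str, **crafts_fields) -> str:
    """Setting 3: the CRAFTS prompt preceded by an expert reference
    article or book chapter on the topic of the questions."""
    return (
        "Use the following reference material when answering:\n"
        f"{reference_text}\n\n"
        + crafts_prompt(**crafts_fields)
    )

if __name__ == "__main__":
    questions = "Q1. Which special stain demonstrates amyloid? (A)...(D)..."
    print(SIMPLE_PROMPT.format(questions=questions))
    print()
    print(crafts_prompt(
        context="Pathology board examination practice session",
        role="You are an expert pathologist coaching board candidates",
        action="Answer each question and briefly justify your choice",
        fmt="Question number followed by the correct option letter",
        tone="Professional and concise",
        style="Board-exam answer key",
        questions=questions,
    ))
```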
Volume
11
Issue
2 Suppl
First Page
5
Recommended Citation
Qu Z, Elzieny M, Arora K. Optimize artificial intelligence language model use in medical board exams: insights from instruction quality and domain context analysis. Acad Pathol. 2024 Apr;11(2 Suppl):5. doi:10.1016/j.acpath.2024.100138
DOI
10.1016/j.acpath.2024.100138
Comments
Association for Academic Pathology Annual Meeting, July 21-24, 2024, Washington, DC