Optimize Artificial Intelligence Language Model Use in Medical Board Exams: Insights From Instruction Quality and Domain Context Analysis

Document Type

Conference Proceeding

Publication Date

4-2024

Publication Title

Academic Pathology

Abstract

Objectives: The standard board exam provides an excellent test of the usefulness of artificial intelligence language models (AI-LMs) in medical education and clinical problem-solving. However, the performance of AI-LMs on standard board exams varies greatly: the same models performed impressively in some studies but unsatisfactorily in others (ref). We hypothesize that this discrepancy largely results from how the AI-LMs were tested. This study aims to delineate how two key strategies affect AI-LM performance, reasoning that once such key determinants are identified, they can be applied to improve it.

Methods: 360 examination questions with correct answers in four formats (multiple choice, true-or-false, fill-in-the-blank, and answer-matching) were used to cross-test two AI-LMs (ChatGPT and Claude-2) in three experimental settings. In the first, the questions were submitted to the AI-LMs with a simple instructive prompt: "Please follow the test instruction and provide the correct answer to each of the following test questions: [Questions]." In the second, the same questions were submitted with an elaborate prompt following our CRAFTS formula (i.e., context, role, action, format, tone, and style). The third setting replicated the second but additionally supplied journal review articles or book chapters on the topics of the questions as reference material for the AI-LMs.

Results: Accuracy rates ranged from 67% to 83.5%, 76.2% to 92.5%, and 97.3% to 99.4% for the test groups given the simple prompt only, the CRAFTS prompt, and the CRAFTS prompt plus a reference article, respectively. Thus, both prompt engineering and expert reference articles enhance AI-LM performance. The performance of ChatGPT and Claude-2 varied with the type of test question, but the statistical significance of the difference could not be assessed because of the small sample sizes at the current stage of the study. Nonetheless, the trend in accuracy rates described above held for both ChatGPT and Claude-2.

Conclusions: The performance of AI language models on board exams is significantly influenced by the quality of the instructions (prompt) and the domain-knowledge context provided. Our findings suggest that the variation across previous studies can largely be attributed to differences in these two factors; evaluating AI language models should therefore follow well-defined standards. Our methodology and results may also be relevant to employing AI language models in clinical problem-solving scenarios.
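To illustrate the Methods, the sketch below shows how the three experimental settings could be assembled as prompts in Python. It is a minimal illustration only, not the authors' implementation: the CRAFTS field values, the sample question text, and the helper names are hypothetical, and the step of actually submitting the prompt to ChatGPT or Claude-2 is omitted.

```python
# Illustrative sketch of the three prompt settings (hypothetical values, not the study's code).

# Setting 1: the simple instructive prompt quoted in the abstract.
SIMPLE_PROMPT = (
    "Please follow the test instruction and provide the correct answer "
    "to each of the following test questions: [Questions]"
)

def crafts_prompt(context: str, role: str, action: str,
                  fmt: str, tone: str, style: str, questions: str) -> str:
    """Setting 2: assemble an elaborate prompt from the six CRAFTS elements."""
    return (
        f"Context: {context}\n"
        f"Role: {role}\n"
        f"Action: {action}\n"
        f"Format: {fmt}\n"
        f"Tone: {tone}\n"
        f"Style: {style}\n"
        f"Questions:\n{questions}"
    )

def crafts_with_reference(crafts: str, reference_text: str) -> str:
    """Setting 3: append a review article or book chapter as domain context."""
    return f"{crafts}\n\nReference material:\n{reference_text}"

# Hypothetical example usage:
questions = "1. [sample multiple-choice question with options A-D]"
prompt_setting_2 = crafts_prompt(
    context="Pathology board examination practice",
    role="You are an expert pathologist taking a board exam.",
    action="Answer each question, selecting the single best option.",
    fmt="Return 'question number: answer' on one line per question.",
    tone="Concise and professional",
    style="Exam style, no extraneous commentary",
    questions=questions,
)
prompt_setting_3 = crafts_with_reference(
    prompt_setting_2,
    reference_text="[full text of a review article on the question topic]",
)
```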

Volume

11

Issue

2 Suppl

First Page

5

Comments

Association for Academic Pathology Annual Meeting, July 21-24, 2024, Washington, DC

DOI

10.1016/j.acpath.2024.100138
