Can AI Match Human Experts? Evaluating LLM-Generated Feedback on Resident Scholarly Projects
Authors: van Allen, Z., Forgues-Martel, S., Venables, M. J., Ghanney, Y., Villeneuve, A.
RSCT Score Breakdown
RSCT Certification: κ=0.550 (pending) | RSN: 0.37/0.32/0.31 | Topics: ai-safety
Core Contribution

This paper presents an innovative approach to generating high-quality, rubric-aligned feedback on resident scholarly projects using a large language model (LLM), specifically the open-weight LLaMA 3.1 model. The key challenge addressed is the labor-intensive nature of providing timely, personalized feedback, especially in large medical residency programs. The authors developed an AI-assisted evaluation system that can ingest diverse resident submissions (PDFs, scans, photos) via optical character recognition (OCR) and produce section-by-section feedback aligned with program-specific rubrics. This represents a significant step towards automating a critical element of resident education and research mentorship.
Technical Approach

The core of the system is the LLaMA 3.1 LLM, which the authors fine-tuned and paired with carefully curated prompts to generate the feedback reports. The system ingests heterogeneous resident project submissions, extracts the relevant text using OCR, and then leverages the LLM to produce feedback aligned with the program's evaluation rubrics. This approach allows the system to deliver structured, standardized feedback at scale, with the potential to free up human experts to focus on higher-level mentorship and strategic guidance.
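To make the pipeline concrete, here is a minimal sketch of how such an ingest, OCR, and feedback flow could be wired together. The specific libraries (pytesseract, pdf2image, an OpenAI-compatible client pointed at a local Llama server), the model name, and the rubric format are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the ingest -> OCR -> rubric-aligned feedback flow.
# Library choices and the local model endpoint are assumptions, not the
# authors' actual stack.
from dataclasses import dataclass

import pytesseract
from pdf2image import convert_from_path
from openai import OpenAI

# Assumed: a local OpenAI-compatible server hosting a Llama 3.1 model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

@dataclass
class RubricSection:
    name: str
    criteria: str

def extract_text(pdf_path: str) -> str:
    """OCR every page of a scanned or native PDF submission."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

def feedback_for_section(project_text: str, section: RubricSection) -> str:
    """Ask the LLM for feedback on one rubric section at a time."""
    prompt = (
        f"You are evaluating a resident scholarly project.\n"
        f"Rubric section: {section.name}\nCriteria: {section.criteria}\n\n"
        f"Project text:\n{project_text}\n\n"
        f"Give specific, constructive feedback for this section only."
    )
    resp = client.chat.completions.create(
        model="llama-3.1-70b-instruct",  # hypothetical model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

rubric = [RubricSection("Methods", "Design is appropriate and clearly described"),
          RubricSection("Results", "Findings are reported accurately with limitations")]
text = extract_text("submission.pdf")
report = {s.name: feedback_for_section(text, s) for s in rubric}
```

Prompting one rubric section at a time keeps each request focused and makes the output easy to map back onto the program's evaluation form.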
The authors conducted a three-phase study to evaluate the performance of the AI-generated feedback against that of human expert evaluators. In each phase, they generated 40 AI-produced feedback reports and 40 human-produced reports across four project types: Quality Improvement, Survey-Based, Research, and Literature Review. Blinded raters then evaluated the feedback reports using a 25-item survey across five key constructs: understanding and reasoning, trust and confidence, quality of information, expression style and persona, and safety and harm.
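As a rough illustration of how such ratings might be analyzed, the sketch below averages the 25 items into their five construct scores and compares the two arms with a nonparametric test. The column naming scheme and the choice of a Mann-Whitney U test are assumptions; the paper's actual statistical procedure is not reproduced here.

```python
# Illustrative aggregation of blinded ratings into the five constructs.
# Column names and the nonparametric test are assumptions, not the
# paper's reported analysis.
import pandas as pd
from scipy.stats import mannwhitneyu

CONSTRUCTS = ["understanding_reasoning", "trust_confidence",
              "quality_of_information", "expression_persona", "safety_harm"]

# ratings.csv (assumed layout): one row per (rater, report), item columns
# named "<construct>_<i>", plus a "source" column ("ai" or "human").
ratings = pd.read_csv("ratings.csv")

for construct in CONSTRUCTS:
    items = [c for c in ratings.columns if c.startswith(construct)]
    score = ratings[items].mean(axis=1)  # construct score = mean of its items
    ai = score[ratings["source"] == "ai"]
    human = score[ratings["source"] == "human"]
    stat, p = mannwhitneyu(ai, human, alternative="two-sided")
    print(f"{construct}: AI={ai.mean():.2f} human={human.mean():.2f} p={p:.3f}")
```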
Key Results

The results of the study are mixed: human feedback generally outperformed the AI-generated feedback, with some notable exceptions. On short feedback reports, human experts scored higher on quality and trust metrics. On final feedback reports, the differences were smaller: humans scored higher on quality and persona, while the AI system scored higher on safety assessments.
Interestingly, the performance of the AI system varied by project type. In survey-based final reports, the AI scored higher on quality and safety, while in quality improvement short reports, human experts were markedly superior in reasoning. These findings suggest the AI system may be better suited to some project types and feedback stages than others.
Significance and Limitations

The significance of this work lies in its potential to reshape how feedback is delivered in medical residency programs, where timely, high-quality mentorship is crucial to the development of future physicians. By using AI to automate the generation of rubric-aligned feedback, the authors have demonstrated a viable approach to scaling this process and freeing human experts to focus on the more strategic aspects of mentorship.
However, the study also highlights the limitations of the current AI system, as human experts outperformed it on most of the evaluated constructs. The authors acknowledge that performance is likely to improve as newer, more capable open-weight LLMs are developed and incorporated into the system. The study was also conducted in a relatively controlled environment, and the authors note that real-world deployment may present further challenges.
Through the RSCT Lens

This paper's approach aligns well with key RSCT concepts, particularly in the way it aims to enhance the relevance (R) and stability (S) of the feedback provided to residents.
By leveraging an LLM and carefully curating the prompts, the authors have sought to improve the relevance (R) of the feedback, ensuring that it directly addresses the core elements of the resident projects and the program's evaluation rubrics. That the AI system outperformed human experts in certain contexts, such as safety assessments and the quality of survey-based feedback, suggests it captures and represents the essential information and requirements in a consistent, reliable manner.
However, the paper's κ-gate score of 0.55 indicates that the compatibility of the system's contributions with existing knowledge and practices is not yet optimal. The relatively low Stability (S) score of 0.32 suggests that the findings may not be as consistently robust across different project types and contexts, as evidenced by the performance variations observed. Additionally, the Noise (N) score of 0.31 implies that the system still generates some irrelevant or contradictory elements that dilute the core contribution.
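The κ-gate's exact definition is not given here, but if it behaves like a standard inter-rater agreement coefficient, a value of 0.55 sits in the range conventionally read as moderate agreement. A minimal sketch, assuming a plain Cohen's kappa over two raters' categorical judgments:

```python
# Standard Cohen's kappa between two raters' categorical labels.
# Whether the RSCT κ-gate is computed this way is an assumption.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail"]
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")  # 0.41-0.60 is conventionally read as moderate agreement
```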
To improve the RSCT certification of this work, the authors could focus on enhancing the stability and consistency of the AI-generated feedback, potentially by further refining the prompts, incorporating more diverse training data, or exploring more advanced LLM architectures and fine-tuning techniques. Additionally, addressing the sources of noise, such as misaligned or irrelevant information, could help strengthen the overall compatibility and reliability of the system's outputs.
Paper Details
- Authors: van Allen, Z., Forgues-Martel, S., Venables, M. J., Ghanney, Y., Villeneuve, A.
- Source: arXiv
- Published: 2026-03-05
This analysis was generated by the Swarm-It RSCT pipeline using Claude.