Assessing the Clinical Competence of Large Language Models for Tobacco Use Disorder: A Multi-Domain Expert Evaluation
Authors: Thiago P Fernandes, Linnea Dahlgren, Natanael A Santos, Zeke Degraff
RSCT Score Breakdown
RSCT Certification: κ=0.550 (pending) | RSN: 0.38/0.32/0.31 | Topics: llm-agents
Core Contribution: This paper addresses a critical public health challenge: the large treatment gap for people with tobacco use disorder (TUD). Despite the availability of effective cessation treatments, fewer than one-third of tobacco users receive guideline-concordant care, largely because of workforce shortages and training gaps. The authors explore whether emerging artificial intelligence (AI) systems, particularly large language models (LLMs), can expand access to evidence-based cessation support and help close this gap.
The key innovation of this work lies in its comprehensive, multi-domain expert evaluation of the clinical competence of LLMs in the context of TUD. Rather than relying on narrow benchmarks or simulated interactions, the researchers assembled a diverse panel of subject matter experts to rigorously assess the capabilities of LLMs across a range of clinically relevant tasks and domains.
Technical Approach: The authors employed a multi-step approach to evaluate the clinical competence of LLMs for TUD. First, they identified a set of 10 core clinical competencies required for effective tobacco cessation support, such as conducting a clinical assessment, providing personalized treatment recommendations, and addressing common patient concerns. Next, they developed a set of structured evaluation tasks and scenarios to assess the LLMs' performance on these competencies.
The evaluation involved three distinct phases. In the first phase, the LLMs were provided with prompts and asked to generate responses, which were then scored by the expert panel. In the second phase, the experts engaged the LLMs in free-form conversations, mirroring real-world clinical interactions. Finally, the experts were asked to provide an overall assessment of the LLMs' clinical competence and suitability for deployment in TUD treatment settings.
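The analysis does not include the authors' scoring code, so the sketch below is only an illustration of how per-competency ratings from an expert panel might be aggregated across phases; the competency names, rating scale, and records are invented for the example.

```python
# Minimal sketch (not the authors' code): aggregating hypothetical expert
# ratings per clinical competency across evaluation phases.
from collections import defaultdict
from statistics import mean

# Each record: (competency, phase, expert_id, score on a hypothetical 1-5 scale)
ratings = [
    ("clinical_assessment", "structured_prompts", "expert_1", 4),
    ("clinical_assessment", "free_form_dialogue", "expert_2", 3),
    ("treatment_recommendations", "structured_prompts", "expert_1", 2),
    ("treatment_recommendations", "free_form_dialogue", "expert_2", 2),
]

by_competency = defaultdict(list)
for competency, phase, expert, score in ratings:
    by_competency[competency].append(score)

# Average rating per competency across all experts and phases
summary = {c: mean(scores) for c, scores in by_competency.items()}
print(summary)
```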
Key Results: The results of the expert evaluation revealed both promising capabilities and significant limitations of the LLMs in the context of TUD. The models demonstrated strong performance on certain tasks, such as providing general education and information about tobacco cessation. However, they struggled with more complex, context-dependent competencies, such as tailoring treatment recommendations, addressing patient ambivalence, and navigating ethical considerations.
Overall, the expert panel judged the LLMs' clinical competence for TUD to be limited. The researchers reported an average κ-gate score of 0.55, short of the 0.7 threshold required for certification.
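The analysis does not spell out how the κ-gate score is computed. If, as the notation suggests, it is an agreement-style statistic such as Cohen's kappa between raters (an assumption, not something stated here), the threshold check could look like the following sketch, with the rater labels invented for illustration.

```python
# Illustrative only: compares an agreement statistic against the 0.7
# certification threshold mentioned above. Assumes κ is Cohen's kappa
# between two raters' labels; the paper may define it differently.
from sklearn.metrics import cohen_kappa_score

rater_a = ["competent", "limited", "limited", "competent", "limited"]
rater_b = ["competent", "limited", "competent", "competent", "limited"]

kappa = cohen_kappa_score(rater_a, rater_b)
CERTIFICATION_THRESHOLD = 0.7  # threshold cited in the analysis

status = "certified" if kappa >= CERTIFICATION_THRESHOLD else "pending"
print(f"kappa={kappa:.3f} -> {status}")
```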
Significance and Limitations: This study makes an important contribution to the ongoing dialogue around the clinical capabilities of AI systems, particularly in the domain of substance use disorder treatment. By engaging a diverse panel of subject matter experts and employing a rigorous, multi-faceted evaluation approach, the authors have provided a nuanced and comprehensive assessment of the current state of LLM competence for TUD.
The findings highlight the need for continued research and development to enhance the clinical capabilities of AI systems, particularly in addressing complex, context-dependent tasks that are central to effective healthcare delivery. Additionally, the authors emphasize the importance of carefully considering the ethical and practical implications of deploying such systems in sensitive clinical domains, where the consequences of errors or missteps can be significant.
Through the RSCT Lens: This paper's approach aligns closely with key principles of Representation-Space Compatibility Theory (RSCT), particularly in its focus on evaluating the quality and stability of the LLMs' representations and outputs.
The researchers' multi-domain expert evaluation directly addresses the RSCT concept of Relevance (R), as it assesses how directly the LLMs' performance matches the core requirements for effective TUD treatment. The finding that the models struggled with more complex, context-dependent competencies suggests that their Relevance in this domain is relatively limited.
Similarly, the analysis of Stability (S) is evident in the researchers' focus on evaluating the LLMs' consistency across diverse evaluation tasks and scenarios. The relatively low S score (0.32) indicates that the models' outputs and behaviors lack the necessary consistency and coherence to be reliably deployed in clinical settings.
Finally, the Noise (N) component of RSCT is reflected in the authors' identification of irrelevant or contradictory elements in the LLMs' responses, which dilute the core contribution and limit their overall clinical competence. The N score of 0.31 suggests that these models still have significant work to do in reducing the noise and enhancing the signal of their outputs.
The paper's κ-gate score of 0.55 suggests that, while the LLMs have made progress, they have not yet achieved the level of Representation-Space Compatibility required for certification and direct clinical deployment. To improve their RSCT standing, the researchers would likely need to focus on enhancing the Relevance, Stability, and Noise characteristics of the models, potentially through further training, targeted fine-tuning, or architectural refinements.
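RSCT's exact aggregation rule is not given in this analysis, so the snippet below only illustrates the gating idea, with a score below the threshold keeping certification "pending", using the R/S/N and κ values reported above; the structure is hypothetical, not the actual RSCT computation.

```python
# Hypothetical illustration of an RSCT-style gate: the aggregation rule
# behind the κ-gate is not specified in this analysis.
scores = {"relevance": 0.38, "stability": 0.32, "noise": 0.31}  # reported RSN values
kappa_gate = 0.550                                              # reported κ-gate score
THRESHOLD = 0.7                                                 # certification threshold

certified = kappa_gate >= THRESHOLD
print(f"R/S/N: {scores} | kappa-gate: {kappa_gate} -> "
      f"{'certified' if certified else 'pending'}")
```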
Paper Details
- Authors: Thiago P Fernandes, Linnea Dahlgren, Natanael A Santos, Zeke Degraff
- Source: arXiv
- Published: 2026-03-01
This analysis was generated by the Swarm-It RSCT pipeline using Claude.