arXiv:2603.05399v1

Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Authors: Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler

Pending (κ=0.55) | Beginner | cs-ai | llm | ai-safety

RSCT Score Breakdown

  • Relevance (R): 0.38
  • Superfluous (S): 0.32
  • Noise (N): 0.31

TL;DR

We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, m...


RSCT Certification: κ=0.550 (pending) | RSN: 0.38/0.32/0.31 | Topics: ai-safety

Title: Stress Testing the Reliability of LLM Judges: The Judge Reliability Harness

  1. Core Contribution: The paper presents the Judge Reliability Harness, an open-source library for constructing validation suites that assess the reliability of large language model (LLM) judges. As LLM-based scoring is widely deployed in AI benchmarks, the authors identify a critical need for tools to efficiently evaluate the robustness of these judging methods. The key innovation of this work is providing a systematic framework to stress test LLM judges across a variety of perturbations, surfacing inconsistencies and weaknesses that can undermine the reliability of these judging systems.

  2. Technical Approach: The Judge Reliability Harness takes as input a benchmark dataset and an LLM judge configuration, and generates a suite of reliability tests. These tests evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. The harness applies various perturbations to the input, such as changes in text formatting, paraphrasing, alterations in verbosity, and label flipping. By measuring the judges' performance under these perturbations, the authors can assess the stability and robustness of LLM-based scoring systems. A minimal sketch of this perturb-and-rejudge workflow appears at the end of this section.

The paper evaluates four state-of-the-art LLM judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior. The benchmarks cover a diverse set of tasks, allowing the authors to examine the reliability of the judges in different contexts. The results of these evaluations reveal meaningful variations in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges.
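
To make this workflow concrete, the following is a minimal sketch of a perturbation-based reliability check of the kind described above. It assumes a binary pass/fail judge; every name in it (Example, run_suite, the perturbation functions) is hypothetical and does not reflect the Judge Reliability Harness API.

```python
# Illustrative sketch of a perturbation-based reliability suite for a binary
# LLM judge. All names here (Example, run_suite, the perturbation functions)
# are hypothetical and are NOT the Judge Reliability Harness API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str    # task given to the model under test
    response: str  # model response to be judged
    label: int     # ground-truth judgment (1 = pass, 0 = fail)

def perturb_whitespace(text: str) -> str:
    """Formatting perturbation: collapse line breaks and runs of spaces."""
    return " ".join(text.split())

def perturb_verbosity(text: str) -> str:
    """Verbosity perturbation: pad the response with an uninformative sentence."""
    return text + " To summarize, the points above address the request in full."

PERTURBATIONS: dict[str, Callable[[str], str]] = {
    "formatting": perturb_whitespace,
    "verbosity": perturb_verbosity,
    # paraphrasing and label flipping would be added analogously
}

def run_suite(judge: Callable[[str, str], int],
              dataset: list[Example]) -> dict[str, float]:
    """Return judge accuracy on clean inputs and under each perturbation."""
    def accuracy(transform: Callable[[str], str]) -> float:
        correct = sum(
            judge(ex.prompt, transform(ex.response)) == ex.label for ex in dataset
        )
        return correct / len(dataset)

    results = {"clean": accuracy(lambda text: text)}
    for name, perturb in PERTURBATIONS.items():
        results[name] = accuracy(perturb)
    return results
```

A large gap between the clean accuracy and any perturbed accuracy flags the judge as unstable with respect to that perturbation, which is exactly the kind of weakness the harness is meant to surface.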

  3. Key Results: The key finding of this paper is that none of the LLM judges evaluated is uniformly reliable across the tested benchmarks. The authors report consistency issues, measured as accuracy in judging another LLM's ability to complete a task, with judges exhibiting sensitivity to simple text formatting changes, paraphrasing, changes in verbosity, and flipping of ground-truth labels in LLM-produced responses. A simple flip-rate metric, sketched at the end of this section, is one way to quantify such inconsistencies.

These results underscore the importance of comprehensive reliability testing for LLM-based scoring systems, as they reveal that even state-of-the-art judges can be susceptible to performance degradation under various perturbations. The Judge Reliability Harness provides a systematic approach to uncover these vulnerabilities, which is crucial for developing more robust and trustworthy AI systems.
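
One simple way to quantify the consistency issues described above is a flip rate: the fraction of examples whose verdict changes when only a semantics-preserving perturbation is applied. The sketch below assumes binary pass/fail judgments; the function name and metric are illustrative and are not taken from the paper or the harness.

```python
def flip_rate(clean_verdicts: list[int], perturbed_verdicts: list[int]) -> float:
    """Fraction of examples whose binary verdict changes under a perturbation.

    0.0 means the judge is perfectly stable for this perturbation; values
    approaching 0.5 mean its verdicts are nearly uncorrelated with the input.
    """
    assert len(clean_verdicts) == len(perturbed_verdicts)
    flips = sum(c != p for c, p in zip(clean_verdicts, perturbed_verdicts))
    return flips / len(clean_verdicts)

# Example: a judge that reverses 3 of 10 verdicts after paraphrasing
# has a paraphrase flip rate of 0.3.
print(flip_rate([1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
                [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]))  # -> 0.3
```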

  4. Significance and Limitations: This work is significant because it addresses a critical gap in the AI research community's ability to assess the reliability of LLM-based judging systems. As these scoring methods become increasingly prevalent in benchmarking and evaluation, it is essential to have rigorous tools to stress test their performance and identify potential weaknesses. The Judge Reliability Harness provides a valuable framework for this purpose, empowering researchers and developers to build more robust and trustworthy AI systems.

A limitation of this work is that it focuses on the reliability of LLM judges and does not directly address the broader issue of the reliability and robustness of LLM-based systems in general. While the harness can be used to test the judges, the findings may not generalize to the overall performance and reliability of the LLMs themselves. Additionally, the paper does not provide a comprehensive analysis of the causes behind the observed inconsistencies in the judges' performance, which could be a fruitful area for future research.

  5. Through the RSCT Lens: This paper's approach is highly relevant to the Representation-Space Compatibility Theory (RSCT), as it directly addresses the stability (S) and noise (N) components of the theory.

By applying various perturbations to the input data and measuring the resulting changes in the LLM judges' performance, the Judge Reliability Harness effectively probes the stability of these judging systems. The finding that none of the evaluated judges are uniformly reliable across benchmarks indicates that the representations learned by the LLMs exhibit inconsistencies, or instability, in their ability to accurately judge the quality of other LLM-generated outputs.

Furthermore, the sensitivities the paper identifies in the LLM judges, such as susceptibility to text formatting changes and label flipping, act as "irrelevant or contradictory elements that dilute the core contribution", which maps directly onto the RSCT concept of noise (N). These noise factors undermine the reliability and trustworthiness of the LLM-based scoring systems, reducing their overall compatibility (κ) with existing knowledge and use cases.

The paper's κ-gate score of 0.55 falls short of the 0.7 certification threshold, suggesting that while the Judge Reliability Harness is a valuable contribution, it would benefit from additional context and stabilizing elements to improve its overall compatibility. Specifically, the low Relevance (R=0.38) and Stability (S=0.32) scores, together with the comparatively high Noise (N=0.31), indicate that the paper's core contribution could be strengthened by further reducing noise factors, improving the consistency of the LLM judges' performance, and more directly addressing the field's key research questions on AI reliability and robustness.
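
For context, the certification decision itself reduces to a threshold check. The sketch below encodes only the 0.7 cutoff mentioned in this review; it does not reproduce how the RSCT pipeline actually derives κ from the R/S/N scores, since that formula is not given here.

```python
def certification_status(kappa: float, threshold: float = 0.7) -> str:
    """RSCT certification implied by a kappa-gate score (threshold assumed to be 0.7)."""
    return "certified" if kappa >= threshold else "pending"

print(certification_status(0.55))  # -> "pending", matching this paper's status
```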

Paper Details

  • Authors: Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler
  • Source: arXiv
  • Published: 2026-03-05

This analysis was generated by the Swarm-It RSCT pipeline using Claude.

About This Review

This review was auto-generated by the Swarm-It research discovery platform. Quality is certified using RSCT (RSN Certificate Technology) with a κ-gate score of 0.55. RSN scores: Relevance=0.38, Superfluous=0.32, Noise=0.31.