arXiv:2603.12165v1

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang

🥉 Certified (κ=0.78) · Intermediate · cs-cl · vision-language-action-models-vla · vision-language-models

RSCT Score Breakdown

  • Relevance (R): 0.42
  • Superfluous (S): 0.46
  • Noise (N): 0.12

TL;DR

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods such as Instruction-Following Difficulty (IFD) score only how hard it is to generate the answer from the query; QAQ instead scores the reverse direction with Reverse Mutual Information (RMI), measuring how well the answer predicts the query, and selects just 25% of the WarriorCoder data while matching full-data training.


RSCT Certification: κ=0.778 (certified) | RSN: 0.70/0.75/0.20 | Topics: Vision-Language Models, Vision Language Action Models (VLA), Graph-Based LLM Reasoning

Overview

One-Sentence Summary

This paper proposes a novel data selection framework called QAQ that evaluates the quality of synthetic code instructions by measuring the bidirectional semantic coherence between the instruction (query) and the generated code (answer), outperforming existing approaches on the WarriorCoder dataset.

Key Innovation

The key innovation in this paper is the introduction of Reverse Mutual Information (RMI) as a novel metric to quantify the information gain about the query (instruction) conditioned on the answer (generated code). This contrasts with existing approaches like Instruction-Following Difficulty (IFD) that only assess the difficulty of generating the answer given the query. The authors show that both low and excessively high RMI can signal quality issues in synthetic data, and they leverage the disagreement between strong and weak models to identify samples that are valid yet challenging.
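The RMI idea can be made concrete with a small, self-contained sketch. Everything here is an illustrative assumption — `ToyLM` and `token_logprob` are hypothetical stand-ins for a real language model's scoring interface, and the paper's exact estimator may differ — but it captures the core quantity: the per-token gain in the query's log-probability once the answer is in context.

```python
import math

class ToyLM:
    """Hypothetical scoring model. A real implementation would return
    per-token log-probabilities from an actual LM; this toy version
    simply assigns higher probability to query tokens whenever the
    answer token is present in the context (simulating coherence)."""
    def token_logprob(self, token, context):
        return math.log(0.5) if "ans" in context else math.log(0.25)

def sequence_logprob(model, tokens, context=()):
    # sum of log P(token_i | context, tokens_<i)
    return sum(model.token_logprob(t, context + tuple(tokens[:i]))
               for i, t in enumerate(tokens))

def reverse_mutual_information(model, query, answer):
    # RMI ~ (1/|Q|) * [log P(Q|A) - log P(Q)]: information gained
    # about the query once the answer is conditioned on
    lp_q_given_a = sequence_logprob(model, query, tuple(answer))
    lp_q = sequence_logprob(model, query)
    return (lp_q_given_a - lp_q) / max(len(query), 1)
```

Under this reading, RMI near zero means the answer tells the model nothing about the query (semantic misalignment), while a very large RMI means the query is trivially recoverable from the answer — the "defect patterns that LLMs easily recognize" extreme the authors flag.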

Should You Read This?

If you work on vision-language models or vision-language-action models: Yes, with a caveat: the experiments are on synthetic code instructions rather than multimodal data, but the problem the paper tackles — curating noisy synthetic training data — is just as pressing for these models, and the QAQ framework and the insights around bidirectional semantic coherence could inform your own work on improving data quality.

If you work on graph-based LLM reasoning: Maybe, as the paper touches on the potential of the proposed approach to reduce computational costs without sacrificing model capability, which could be of interest. However, the core contribution is more focused on data selection for code generation, so the direct applicability may be limited.

The Good

  • The paper presents a well-motivated and thoughtful approach to addressing the challenge of noisy synthetic data in code generation models.
  • The authors provide a clear and intuitive explanation of the RMI metric and how it can be used to identify both semantic misalignment and overly simplistic samples.
  • The experiments on the WarriorCoder dataset demonstrate the effectiveness of the QAQ framework, showing that it can achieve comparable performance to full-data training while using only 25% of the data.
  • The paper is well-written and accessible, with a good balance of technical details and high-level insights.
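The stratified-RMI selection with strong/weak disagreement can be sketched as follows. The equal-width binning, the per-bin quota, and the use of the raw RMI difference as the disagreement score are illustrative assumptions, not the paper's exact procedure.

```python
def stratified_rmi_select(samples, rmi_strong, rmi_weak,
                          keep_frac=0.25, n_bins=4):
    """Sketch: bucket samples into equal-width RMI bins (strong model),
    then within each bin keep the fraction with the largest strong-weak
    RMI disagreement -- samples a strong model finds coherent but a
    weak model does not are treated as valid yet challenging."""
    lo, hi = min(rmi_strong), max(rmi_strong)
    width = (hi - lo) / n_bins or 1.0          # guard: all-equal RMIs
    bins = [[] for _ in range(n_bins)]
    for i, s in enumerate(samples):
        b = min(int((rmi_strong[i] - lo) / width), n_bins - 1)
        bins[b].append((rmi_strong[i] - rmi_weak[i], s))
    selected = []
    for bucket in bins:
        bucket.sort(key=lambda t: t[0], reverse=True)  # most disagreement first
        quota = max(1, round(keep_frac * len(bucket))) if bucket else 0
        selected.extend(s for _, s in bucket[:quota])
    return selected
```

Stratifying before ranking means both RMI extremes stay represented but capped, rather than letting a single band of scores dominate the selected 25%.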

The Gaps

  • The paper focuses on the WarriorCoder dataset, which may limit the generalizability of the findings. It would be helpful to see the QAQ framework evaluated on other synthetic code generation datasets.
  • The authors do not provide a detailed analysis of the types of quality issues (e.g., hallucinations, semantic inconsistencies) that the QAQ framework is able to detect. More insight into the specific problems the approach can address would be valuable.
  • The paper does not discuss potential failure modes or limitations of the QAQ framework, such as cases where the disagreement between strong and weak models may not be a reliable signal of data quality.

How to Read This Paper

If you're from the vision-language or vision-language-action community: Focus on the sections that introduce the QAQ framework and the RMI metric, as well as the experiments and results on the WarriorCoder dataset. You can skim the background material on code generation and synthetic data curation.

If you're from the graph-based LLM reasoning community: Pay attention to the sections that discuss the potential of the QAQ framework to reduce computational costs without sacrificing model capability. The background and related work sections on synthetic data curation may also be relevant.

Must read (everyone): The sections that describe the QAQ framework, the RMI metric, and the core experimental results and insights.

Verify: Claims about the generalizability of the QAQ framework and its ability to detect specific types of quality issues in synthetic data.

Bottom Line

This paper presents a compelling approach to addressing the challenge of noisy synthetic data in code generation models. The proposed QAQ framework and the RMI metric offer a novel way to assess the bidirectional semantic coherence of synthetic instructions and generated code, outperforming existing data selection methods. The insights around the relationship between RMI and data quality issues are particularly valuable and could inform future work on synthetic data curation. While the focus on the WarriorCoder dataset limits the generalizability of the findings, the core contribution is most directly relevant to code-generation data curation, and the underlying idea may also interest researchers curating synthetic data for vision-language and vision-language-action models.

Quality Assessment

Trust Level: MODERATE - Verify key results first

What the scores mean:

  • 70% signal - This much of the paper directly supports its claims
  • 75% context - Background material for readers from other fields (this is a bridge paper - high context is a feature!)
  • 20% noise - Content that may mislead if taken at face value

Reliability score: 78% (certified)

Practical interpretation: Good foundation but some gaps. Read critically and verify key claims before building on this work.

Paper Details

  • Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang
  • Published: 2026-03-12
  • Source: arxiv
  • PDF: Download
  • Primary Topic: Vision-Language Models
  • Difficulty: Intermediate

Abstract

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.


About This Review

This review was auto-generated by the Swarm-It research discovery platform. Quality is certified using RSCT (RSN Certificate Technology) with a κ-gate score of 0.78. RSN scores: Relevance=0.42, Superfluous=0.46, Noise=0.12.