arXiv:2603.06459v1

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

Authors: Yakov Pyotr Shkolnikov

Status: Pending (κ=0.49) | Level: Beginner | Topics: representation-learning, cs-cv

RSCT Score Breakdown

  • Relevance (R): 0.37
  • Superfluous (S): 0.32
  • Noise (N): 0.31

TL;DR

Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the models' own text outputs only reach roughly 20 degrees error.


RSCT Certification: κ=0.494 (pending) | RSN: 0.37/0.32/0.31 | Topics: representation-learning

Analyzing "Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement" through the RSCT Lens

Core Contribution

This paper tackles a fundamental question in vision-language representation learning: do large pre-trained models like CLIP truly understand continuous physical properties like hand joint angles, or do they merely learn superficial associations between images and text? The key contribution is a linear-probe analysis showing that high-fidelity geometric information can be extracted from frozen vision encoder features, far surpassing the accuracy of the models' own text outputs.

The paper provides compelling evidence that foundation models do encode rich geometric information, but that this signal is not well expressed through standard text generation. By probing the frozen vision features directly, the authors extract continuous joint-angle predictions at 6.1 degrees mean absolute error, a 3.3x improvement over the models' own text outputs. This suggests a "pathway-training deficit": the text decoding process fails to fully leverage the geometric representations learned by the vision backbone.

Technical Approach

The core of the technical approach is a linear regression probe trained on frozen vision encoder features to predict continuous hand joint angles. This simple linear layer achieves much higher accuracy than the models' own text outputs, indicating that the geometric information is present in the visual representations but not well expressed through language generation.
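Such a probe amounts to a regularized least-squares fit on cached features. The sketch below is a minimal illustration under invented assumptions, not the authors' code: the function names, feature dimensions, and toy data are made up for the example, and ridge regularization is one common choice for fitting the probe.

```python
import numpy as np

def fit_linear_probe(features, angles, reg=1e-3):
    """Closed-form ridge regression mapping frozen features to joint angles."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ angles)

def probe_predict(W, features):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ W

# Toy stand-ins for cached encoder features and joint-angle labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))               # 256 images, 64-dim features
angles = feats @ rng.normal(size=(64, 3)) + 0.1 * rng.normal(size=(256, 3))

W = fit_linear_probe(feats, angles)
mae = float(np.mean(np.abs(probe_predict(W, feats) - angles)))
```

Because the backbone stays frozen and only `W` is learned, the probe's parameter count is just (feature dim + 1) x (number of angles), which is how a probe on real encoder features can stay in the few-thousand-parameter range.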

The authors experiment with a diverse set of foundation models spanning self-supervised, contrastive, and hybrid training objectives. Notably, they find that despite having very different internal representations (as low as 0.41 CKA similarity), these models all converge to statistically equivalent geometric prediction accuracy. This "functional convergence without representational convergence" suggests that the training objective, rather than model architecture, is the primary driver of geometric encoding.
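The CKA figure quoted above is a standard representation-similarity measure. A minimal linear-CKA implementation, with synthetic matrices standing in for two models' features (the demo data is illustrative, not from the paper):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two (samples x dim) feature
    matrices; 1.0 means identical up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))                        # "model 1" features
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))        # random orthogonal basis
sim_rot = linear_cka(A, A @ Q)                        # same geometry, rotated basis
sim_ind = linear_cka(A, rng.normal(size=(200, 50)))   # unrelated features
```

Linear CKA is invariant to orthogonal transforms, so `sim_rot` is ~1.0 while independent features score low; a value like 0.41 between two real models indicates substantially different internal representations.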

The authors also analyze the inner workings of these models, finding a universal "accuracy peak" in the mid-layers and identifying specific attention heads that carry disproportionate geometric signal. This enables them to construct a single frozen "geometric sensor" backbone that can be probed with lightweight linear layers to extract continuous physical measurements.
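The layer-wise analysis can be mimicked by fitting one probe per layer and selecting the layer with the lowest error. A toy sketch under stated assumptions (the function name and the synthetic "layers" are invented; here the middle layer is constructed to carry the cleanest signal, echoing the mid-layer peak):

```python
import numpy as np

def layer_probe_sweep(layer_features, angles, reg=1e-3):
    """Fit one ridge probe per layer; return the per-layer training MAE."""
    maes = []
    for feats in layer_features:
        X = np.hstack([feats, np.ones((feats.shape[0], 1))])
        W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ angles)
        maes.append(float(np.mean(np.abs(X @ W - angles))))
    return maes

# Synthetic "layers": noise level varies, middle layer is least noisy.
rng = np.random.default_rng(1)
base = rng.normal(size=(256, 32))
angles = base @ rng.normal(size=(32, 3))
layers = [base + s * rng.normal(size=base.shape) for s in (1.0, 0.1, 1.5)]

maes = layer_probe_sweep(layers, angles)
best = int(np.argmin(maes))  # layer to freeze as the "geometric sensor"
```

In the paper's setting, `best` would pick out the mid-layer accuracy peak, and that frozen layer then serves as the shared backbone for lightweight per-task probes.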

Key Results

The key finding is that foundation models like CLIP do in fact encode rich continuous geometric information: the linear probes extract hand joint angles at 6.1 degrees mean absolute error, a 3.3x improvement over the models' own text outputs, which only achieve 20 degrees error.

Notably, the authors find that diverse model architectures and training objectives all converge to statistically equivalent geometric prediction accuracy, despite having very different internal representations. This "functional convergence" points to the training objective as the primary driver of geometric encoding, rather than model design choices.

Significance and Limitations

This work is significant because it provides concrete evidence that today's large vision-language models do learn rich geometric representations, even though their text outputs fail to fully leverage this information. By probing the frozen vision features directly, the authors extract continuous physical measurements with high fidelity.

This has important implications for practical applications like robotic control and human-machine interaction, where seamless integration of visual and geometric reasoning is crucial. The ability to leverage a single frozen backbone as a "geometric sensor" is a promising direction for efficient multi-task perception.

That said, the work also has notable limitations. The probing is limited to a single task, hand joint angle prediction, and it remains to be seen whether these findings generalize to other continuous physical properties. Additionally, the authors do not investigate the mechanisms by which the geometric signal is lost during text generation; further work is needed to understand and address this "pathway-training deficit."

Through the RSCT Lens

This paper's approach is highly aligned with the core principles of RSCT (Representation-Space Compatibility Theory). By directly probing the frozen vision features, the authors isolate and quantify the representational quality (R) of the geometric information encoded by these models. Their finding that diverse architectures and objectives achieve statistically equivalent accuracy, despite differing internal representations, speaks directly to the notion of "functional convergence", where multiple paths can lead to compatible representations.

Importantly, the paper's κ-gate score of 0.49 indicates that the work is not yet fully compatible with existing knowledge, and likely requires further contextualization or expert review before direct application. The imbalanced RSN simplex (R=0.37, S=0.32, N=0.31) suggests that while the core contribution is relevant, the stability and noise properties could use improvement.

The fact that this paper flags at the stability gate is a clear indication that more work is needed to demonstrate the consistency of these findings across a broader range of contexts and tasks. Improving stability (S), perhaps by exploring multimodal fine-tuning or other techniques to better align the vision and text pathways, could strengthen the overall RSCT compatibility. Resolving open questions, such as the precise mechanisms behind the "pathway-training deficit", would reduce extraneous noise (N) and enhance the paper's integration potential.

Overall, this work makes a valuable contribution by providing direct evidence of geometric representation in foundation models, and laying the groundwork for efficient multi-task perception through lightweight probing. By framing the analysis through the RSCT lens, we can better understand the specific strengths and limitations of this approach, and identify promising directions to improve its compatibility with existing knowledge and practical applications.

Paper Details

  • Authors: Yakov Pyotr Shkolnikov
  • Source: arXiv
  • PDF: Download
  • Published: 2026-03-06

This analysis was generated by the Swarm-It RSCT pipeline using Claude.

About This Review

This review was auto-generated by the Swarm-It research discovery platform. Quality is certified using RSCT (RSN Certificate Technology) with a κ-gate score of 0.49. RSN scores: Relevance=0.37, Superfluous=0.32, Noise=0.31.