> But rather: "[LLMs condition] on latents ABOUT truth, falsity, reliability, and calibration".
Yes, I know. And the paper didn't show that. It projected some activations into a low-dimensional space and claimed that, because there was a pattern in the plots, this counts as a "latent".
The other experiments were similarly hand-wavy.
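Concretely, the kind of analysis being criticised here amounts to something like the sketch below: project per-example activations onto a couple of principal components and look for visual separation by label. The activations and labels are placeholders; in the setting being discussed they would be hidden states from a particular layer of the model on the texts in question.

```python
# Sketch of the "project activations and eyeball the plot" analysis discussed
# above. Activations and labels are random placeholders, not real model data.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder: (n_examples, hidden_dim) activation matrix and binary labels.
acts = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)  # e.g. 1 = "true" statement, 0 = "false"

# Project into 2D and look for visual separation between the two classes.
proj = PCA(n_components=2).fit_transform(acts)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="coolwarm", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Activations projected onto top-2 principal components")
plt.show()
```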
> Your aside about classifiers is not only very apt, it is also exactly OP's point! LLMs are implicit classifiers, and the features they classify have been shown to include those that seem necessary to effectively predict text!
That's what's called a truism: "if it classifies successfully, it must be conditioned on latents about truth".
> "if it classifies successfully, it must be conditioned on latents about truth"
Yes, stated that broadly it is a truism: successful classification does not, in general, depend on the latents being about truth.
However, successfully classifying between texts intended to be read as either:
- deceptive or honest
- farcical or tautological
- sycophantic or sincere
- controversial or anodyne
does depend on latent representations being about truth (assuming no memorisation, data leakage, or spurious features).
If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.
But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.
So to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details conditions the sampling of text from the LLM".
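The standard way to make that claim operational is a linear probe: fit a linear classifier on frozen activations and measure held-out accuracy. A minimal sketch, with placeholder activations and labels standing in for hidden states extracted from an LLM on texts labelled with one of the distinctions above (e.g. deceptive vs. honest):

```python
# Minimal linear-probe sketch: check whether a linear classifier on frozen
# LLM activations can recover an implicit property of the text.
# Activations and labels are random placeholders, not real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder: acts[i] stands in for a hidden-state vector for text i,
# and labels[i] marks the implicit property (e.g. 1 = deceptive, 0 = honest).
acts = rng.normal(size=(500, 768))
labels = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))

# Accuracy well above chance on held-out texts is the evidence usually cited
# for "the representation encodes this property", modulo the caveats about
# memorisation, leakage, and spurious features noted above.
```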