
astrange 3 days ago | on: Tracing the thoughts of a large language model

That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
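
To make the distinction concrete, here is a toy sketch, not how any particular lab trains its models. It assumes a "model" that is just a list of logits over three possible one-token outputs, and it uses plain REINFORCE as a stand-in for RL. The names `sft_step`, `rl_step`, and `reward_fn` are made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    # Gradient of log softmax(logits)[action] w.r.t. logits: onehot(action) - probs
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def sft_step(logits, target, lr=0.5):
    # SFT: the dataset hands over the entire desired output (here a single token)
    # and the model is pushed toward reproducing it.
    g = grad_log_prob(logits, target)
    return [l + lr * gi for l, gi in zip(logits, g)]

def rl_step(logits, reward_fn, rng, lr=0.5):
    # RL (REINFORCE, as an assumed stand-in): the model samples its own output,
    # and the only feedback is a scalar score for that sample.
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    g = grad_log_prob(logits, action)
    return [l + lr * reward * gi for l, gi in zip(logits, g)]

# SFT sees the answer itself.
sft_logits = sft_step([0.0, 0.0, 0.0], target=2)

# RL only sees a verifier's verdict on whatever the model happened to produce.
rng = random.Random(0)
rl_logits = [0.0, 0.0, 0.0]
for _ in range(200):
    rl_logits = rl_step(rl_logits, lambda a: 1.0 if a == 2 else 0.0, rng)
```

Note that `sft_step` takes the target output as an argument, while `rl_step` never does: it only receives `reward_fn`, which plays the role of the preference, rating, or verification.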