
Add to your prompt: "For every factual statement, assign a certainty float 0..1, where 0 means you're very uncertain, and 1 means you're absolutely certain it is true".

Specific example: "why do we have first-person subjective experiences? List current theories. For every theory, assign a truthiness float 0..1, where 0 means you're sure it is wrong, and 1 means you're absolutely sure it is true"

From experimenting with this, it will shift the output, sometimes drastically so, as the model now has to reason about its own certainty; it tends to make significantly less shit up (for example, the non-truth-marked version of the output for the query above also listed panpsychism, whereas the truth-marked version listed only scientific hypotheses).

So the model _can_ reason about its own certainty and truth-values; and I strongly suspect it was simply not rewarded during RLHF for omitting things it knew to be false (basically, percolating the social lies people tell each other), which seems to show up in coding as well.

Edit: see https://twitter.com/sdrinf/status/1629084909422931969 for results
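For anyone who wants to reproduce this programmatically, here's a minimal sketch of appending the certainty instruction to a query via an API call. The openai client, the model name, and the exact suffix wording are my assumptions for illustration, not part of the original experiment:

    # pip install openai
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Suffix appended to every question, per the prompt described above.
    CERTAINTY_SUFFIX = (
        " For every factual statement, assign a certainty float 0..1, "
        "where 0 means you're very uncertain, and 1 means you're "
        "absolutely certain it is true."
    )

    def ask_with_certainty(question: str) -> str:
        # Send the question plus the certainty instruction as a single user message.
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": question + CERTAINTY_SUFFIX}],
        )
        return response.choices[0].message.content

    print(ask_with_certainty(
        "Why do we have first-person subjective experiences? List current theories."
    ))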




I initialized a conversation with that prompt and it did not give me any 0..1 certainty values in any subsequent output to my queries.


Or maybe it will just hallucinate this number too.



