* List of topics that are "controversial" (models tend to evade these)
* List of arguments that are "controversial" (models wont allow you to think differently. For example, models would never say arguments that "encourage" animal cruelty)
* On average, how willing is the model to take a neutral position on a "controversial" topic (sometimes models say something along the lines of "this is on debate", but still lean heavily towards the less controversial position instead of having no position at all. For example, if you ask it what "lolicon" is, it will tell you what it is and tell you that japanese society is moving towards banning it)
I think that's the wrong level to attack the problem; you can do that also with actual humans, but it won't tell you what the human is unable to think, but rather what they just didn't think of given their stimulus — and this difference is easily demonstrated, e.g. with Duncker's candle problem: https://en.wikipedia.org/wiki/Candle_problem
I agree that it’s not a complete solution, but this sort of characterization is still useful towards the goal of identifying regions of fitness within the model.
Maybe you can’t explore the entire forest, but maybe you can clear the area around your campsite sufficiently. Even if there are still bugs in the ground.
Can you define that in a way that's actually testable? I can't, and I've been thinking about "unthinkable thoughts" for quite some time now: https://kitsunesoftware.wordpress.com/2018/06/26/unlearnable...