From quizzing it a bit, it seems to have good knowledge but limited reasoning. For example, it will tell you all about the life and death of Ho Chi Minh (and, as far as I can verify, factual and in more detail than what's in the English Wikipedia), but when quizzed on whether 2 kg of feathers is heavier than 1 kg of lead, it gets it wrong.
Though I wouldn't treat it as a domain expert on anything. For example, when I asked about the safety advantages of Rust over Python, it oversold Rust a bit and claimed Python had issues it doesn't actually have.
Well, the feathers-heavier-than-lead thing is definitely somewhere in the training data.
Imo we should be testing these models' reasoning by presenting things or situations that neither the human nor the machine has seen or experienced.
Think: how often do humans have a truly new experience with no basis in past ones? Very rarely - even learning to ride a bike presumably draws on walking/running and movement in general.
Even human "creativity" (much ado about nothing) is creating drama in the AI space...but I find this a super interesting topic as essentially 99.9999% of all human "creativity" is just us rehashing and borrowing heavily from stuff we've seen or encountered in nature. What are elves, dwarves, etc than people with slightly unusual features. Even aliens we create are based on: humans/bipedal, squid/sea creature, dragon/reptile, etc. How often does human creativity really, _really_ come up with something novel? Almost never!
Edit: I think my overarching point is that we need to come up with better exercises to test these models, but it's almost impossible for us to do this because most of us are incapable of creating purely novel concepts and ideas. AGI perhaps isn't that far off given that humans have been the stochastic parrots all along.