> Digging in, it turned out “Apple” belonged to the category Steve Jobs which eventually belonged to… “People,” of course. It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.
This feels like a problem worth solving on Wikipedia itself. It would be nice if categories could be marked as non-hierarchical, so that for a given category, you could know whether its articles could be classified under all of the ancestor categories.
Unfortunately, Wikidata has a lot less structured data than the semi-structured data you can get out of Wikipedia. A database like Wikidata simply doesn't attract the same broad user base as Wikipedia, and therefore gets fewer contributions.
>> It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.
> This feels like a problem worth solving on Wikipedia itself.
Why would that be a problem at all? "Apple" (I assume this refers to the company) makes perfect sense in the "Steve Jobs" category. If you looked up the listing for Category:Steve_Jobs you'd expect to find it there. "Fixing" this problem would just make Wikipedia worse.
The better fix would be to realize that categories aren't strictly hierarchical, and you shouldn't operate as if they are.
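To make the failure mode concrete, here's a toy sketch of what happens when you walk category parents as if they were a strict hierarchy. The graph edges and the `NON_HIERARCHICAL` flag are made up for illustration; Wikipedia has no such flag today, and these are not real category listings.

```python
from collections import deque

# Hypothetical toy category graph: child -> parent categories.
# These edges are illustrative, not actual Wikipedia data.
PARENTS = {
    "Apple Inc.": ["Steve Jobs", "Technology companies"],
    "Steve Jobs": ["American businesspeople"],
    "American businesspeople": ["People"],
    "Technology companies": ["Companies"],
}

# Hypothetical flag: categories that mean "related to", not "is a kind of".
NON_HIERARCHICAL = {"Steve Jobs"}


def ancestors(start, parents, skip=frozenset()):
    """Return every category reachable upward from `start`, ignoring `skip`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent in skip or parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
    return seen


naive = ancestors("Apple Inc.", PARENTS)
filtered = ancestors("Apple Inc.", PARENTS, skip=NON_HIERARCHICAL)

print("People" in naive)     # True: the naive walk makes the company a person
print("People" in filtered)  # False: skipping the "related to" category avoids it
```

Either fix works in this toy case: mark the category upstream, or keep your own skip list and stop treating every parent link as "is a".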
The constraint programming part isn't the bit that makes the difference between a good and bad crossword. The difference is in the quality of the clues. Generating plausible 'cryptic' clues is probably well beyond the ability of current AI.
If you wanted 'Jeopardy'-style clues, that's easier.
Creating a grid to fit the chosen words in wouldn't be so hard. The real difficulty would be coming up with clever clues that weren't just the first paragraph of the wiki article with the names removed.
Like others, I did the test and was rather disappointed at my score :). And yes, there seem to be a lot of rappers and Bollywood stars and movies in the quiz that don't really appeal to my European sense of "culture". I wonder if instead of (or better, in addition to) page popularity it wouldn't be wise to use the number of other languages an entry has been translated into. That should at least ensure that an item is considered important across local cultures, which is usually a good indicator of cultural importance. Did you try that?
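Roughly what I have in mind, as a minimal sketch. The entries and numbers below are made-up placeholders, and the weighting (pageviews scaled by the share of language editions) is just one possible choice, not something the quiz does.

```python
# Made-up placeholder entries: (label, yearly pageviews, language editions).
ENTRIES = [
    ("regional pop star", 900_000, 12),
    ("classical philosopher", 300_000, 180),
    ("blockbuster film", 600_000, 60),
]


def cross_cultural_score(pageviews, languages, max_languages):
    """Scale raw popularity by the share of language editions covering the entry."""
    return pageviews * (languages / max_languages)


max_languages = max(langs for _, _, langs in ENTRIES)

# Sort from highest to lowest combined score.
ranked = sorted(
    ENTRIES,
    key=lambda e: cross_cultural_score(e[1], e[2], max_languages),
    reverse=True,
)

for label, views, langs in ranked:
    print(label, round(cross_cultural_score(views, langs, max_languages)))
```

With these placeholder numbers the entry with the most pageviews drops to last place, because it exists in few language editions.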
Isn't it a bit much to say "Scientifically Accurate" when people are more or less just checking boxes? My feeling is that people are massively overrepresenting their own knowledge.
That's an amazing project, and I loved reading all the ways you worked through the problems as you encountered them.
There was just one final problem: occasionally items would pop up that were definitely NSFW, or just made you feel icky when reading the description. To make the quiz more family-friendly, I filtered out anything related to adult entertainment (quite a few porn stars in the top 10,000), as well as contemporary people notable principally for violent crime (whether as perpetrators or victims). There are just… some things you’d rather not read about while eating lunch.
I can't help but think it would be fun to take the version with NSFW content still in, or even limited to only those items.
It would be really interesting to use that to see how aware certain subgroups are of this content, e.g. certain subreddits and 4chan...
I'd love to do exactly that, but right now I get my popularity stats from English-language Wikipedia pageviews, which is majority-influenced by the US but also has significant traffic from India and the UK.
If Wikipedia ever breaks down its pageview stats by country, so I can generate different quizzes per-country, that's the first thing I'll do!
Not to nag you, but have you seen my suggestion of using the number of translations of each entry as an indicator of its cross-cultural value? What do you think of it?
You may wish to check out the MIT Pantheon project which ranks people not just by page views, but also by the log of birth year and number of languages that their biography has been translated into. With that metric, knowing Aristotle would be much more valuable than knowing Justin Bieber, whose name is likely to decline in importance well within one lifetime, and is perhaps hardly known at all outside certain countries.
I wonder if you're mostly measuring someone's interpretation of "uniquely identify" and "already knew existed"?
Did you consider making it an actual quiz with options to verify if someone actually knows what they claim? (Only skimmed the article, sorry if you mentioned it)
At the end of the day, your personal result will be for however you interpreted the instructions. But population-wide results should remain comparatively valid assuming each population is comparable in terms of being conservative/liberal with what they claim to know. E.g. you can still measure the difference between 25-year-olds and 30-year-olds because they're answering it in the same way collectively, and it's the differences I'm more interested in for research, than the absolute values.
I thought about making it something like a multiple-choice quiz, but it would have made the quiz a lot slower, and therefore either a lot longer or a lot less accurate, so checkboxes seemed like the only way to go.
Agreed that this is the hardest part of taking the quiz. Not sure if you'll see consistent interpretations based on age/education, but curious to see your analysis of this!
I thought the explanations in the "more info" box were very useful. I'd suggest keeping it open by default on the first page (closed on other pages), as it seems critical to read that box before taking the quiz. Just a thought!
This was also a bit confusing to me. "Goodfellas"? I knew that was a film (I've heard the name before), but I haven't seen it, and I couldn't tell you who's in it or what it's about. Does that count as "uniquely identifying?"
> Uniquely identify means you can identify the specific person or thing — not just a category.
> Just knowing someone is an “actor” isn't enough, because there are lots of actors. But if you can name their role in a specific movie or recognize them by sight, that’s uniquely identifying — so check the box!
I realize you had to boil down a giant pile into a representation, but how? Some cultural categories seem to be under-represented. E.g. I didn't see a single composer in the whole lot ... or cathedral ... very few artworks ... too many films ... And it's pop-culture heavy.
Reminds me of Kenneth Clark's definition of 'Civilization' including only Europe ...
You've got the start of something here, but is it culture or People magazine?
I would have loved to use Wikidata -- I actually first attempted a prototype of this several years ago using the similar Freebase as a datasource, until it was bought and shut down by Google.
Wikidata looked very promising, but I was worried about whether it would contain all the data I would wind up needing, and whether it would be in the same format 2 or 5 years from now. Wikipedia is a household name and the information in it has a lot of eyeballs on it constantly, while I couldn't tell whether I could be equally confident in Wikidata as a project -- so really, taking a conservative approach is the only reason.
Wikimedia 'controls' the software and infrastructure, but is relatively hands-off about the content. There is, shall we say, some give-and-take between Wikimedia and the various projects' communities.
The communities overlap between Wikipedia and Wikidata to a large extent, but are distinct.
More than that - pop culture. I hardly know any rappers or RnB singers in the 21st century but I can name, for a start, a number of contemporary philosophers, computer languages and their creators, artists and composers. I don't think I'm in the minority for not knowing pop culture? Still, impressive project from the author and I understand that the result is ultimately pop-culture skewed, given the restrictions.
That’s true, but this is culture as defined by Internet traffic, which is pretty reasonable. Everyone knows a lot about the aspects of culture they care about, by definition, so it doesn’t make much sense to have a test for that.
I had this experience when I visited Madame Tussauds in London for the first time in about 20 years and I was like: "Who are all these people?"
What I remembered was a good mixture of artists and scientists from different epochs. Now they have crammed a few scientists in a dusty corner and the rest of the museum is full of people I don't recognize.
You're right, I mixed up the terms while writing this. It doesn't use the logistic function, it's a more general case of binomial regression [1]. (The example is a line, but the site actually uses a logarithmic function as its link function.) Just corrected the post, thanks.
Very interesting article. In the end, I was a bit confused about how you converted the binomial regression to a single number. I understood that the output was a probability that I know each of the 10,000 items, so did you then need to use some cutoff to decide that I "knew it"?
Anyways, I am interested to see what analysis you do after you get more data.
Thanks for the interest -- it's actually just a sum of the probabilities for the items from 1 to 10,000. For example, if there's a 0.1 chance you know each of 10 items, it adds up to a total value of 1 -- no cutoff needed.
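As a quick sketch of that arithmetic (the probabilities here are just the example values from above, not real quiz output):

```python
import math

# Ten items, each with a 0.1 chance of being known (example values only).
probabilities = [0.1] * 10

# The expected number of known items is the sum of the per-item probabilities.
# math.fsum avoids the small rounding drift of naive float addition.
expected_known = math.fsum(probabilities)

print(expected_known)
```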
Mathematically, there's a trick where you don't even need to compute the sum item-by-item... I calculate the binomial regression which gives me the two relevant parameters, from which I can calculate the probability density function (PDF) [1] for an item of given rank. Then I just calculate the associated cumulative distribution function (CDF) with the same two parameters [2] for rank 10,000 -- and that's the final result.
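Here's a rough illustration of why a closed form can replace the item-by-item sum. Big caveat: the model below, where log-probability is linear in rank, and the parameter values `a` and `b` are assumptions picked only to show the idea. They are not the site's actual fitted function or values.

```python
import math

# Assumed, illustrative parameters; NOT the quiz's fitted values.
a = math.log(0.9)  # hypothetical intercept on the log-probability scale
b = -0.0005        # hypothetical change in log-probability per rank step
n_items = 10_000


def p_known(rank):
    """Assumed probability of knowing the item at `rank` (log link, linear in rank)."""
    return math.exp(a + b * rank)


# Item-by-item: add up all 10,000 probabilities.
item_by_item = math.fsum(p_known(r) for r in range(1, n_items + 1))

# Closed form: the terms form a geometric series with ratio exp(b), so
# sum_{r=1..N} exp(a + b*r) = exp(a + b) * (1 - exp(b*N)) / (1 - exp(b)).
# expm1(x) = exp(x) - 1; the two sign flips cancel in the ratio.
closed_form = math.exp(a + b) * math.expm1(b * n_items) / math.expm1(b)

print(item_by_item, closed_form)
```

Both numbers agree up to floating-point error, and the second one only needs the two parameters and the final rank.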
Following the instructions pedantically (e.g. emphasizing "uniquely identify," especially for the individuals) probably leads to an enormous biasing of the score.
> This feels like a problem worth solving on Wikipedia itself. It would be nice if categories could be marked as non-hierarchical, so that for a given category, you could know whether its articles could be classified under all of the ancestor categories.
https://en.wikipedia.org/wiki/Category:Eponymous_categories would be a good place to start. Probably most of those are not hierarchical.