Transforming Wikipedia into a cultural knowledge quiz

testplzignore · on Oct 31, 2018

> Digging in, it turned out “Apple” belonged to the category Steve Jobs which eventually belonged to… “People,” of course. It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.

This feels like a problem worth solving on Wikipedia itself. It would be nice if categories could be marked as non-hierarchical, so that for a given category, you could know whether its articles could be classified under all of the ancestor categories.

https://en.wikipedia.org/wiki/Category:Eponymous_categories would be a good place to start. Probably most of those are not hierarchical.
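
To make the idea concrete, here is a minimal sketch of how a "non-hierarchical" flag could be used when walking up the category graph. The category graph below is a made-up, simplified toy (not real Wikipedia data), and the function name and the flag set are hypothetical:

```python
from collections import deque

# Toy, simplified parent links (assumption: not actual Wikipedia categories).
PARENTS = {
    "Apple": ["Steve Jobs", "Technology companies"],
    "Steve Jobs": ["American businesspeople"],
    "American businesspeople": ["People"],
    "Technology companies": ["Companies"],
}

def classifying_ancestors(article, parents, non_hierarchical=frozenset()):
    """Collect ancestor categories, skipping any category flagged as
    non-hierarchical so that nothing is inherited through it."""
    seen = set()
    queue = deque(parents.get(article, []))
    while queue:
        category = queue.popleft()
        if category in seen or category in non_hierarchical:
            continue  # already visited, or a "related things" category
        seen.add(category)
        queue.extend(parents.get(category, []))
    return seen

# Without the flag, "Apple" ends up under "People".
naive = classifying_ancestors("Apple", PARENTS)
# With "Steve Jobs" flagged, only the company branch remains.
flagged = classifying_ancestors("Apple", PARENTS, {"Steve Jobs"})
print(sorted(naive))
print(sorted(flagged))
```

The article would still be listed in the flagged category on Wikipedia; the flag only tells a consumer not to treat that membership as an "is a" relationship.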

tyingq · on Oct 31, 2018

Wikidata is structured. Here's the entry for Apple: https://m.wikidata.org/wiki/Q312

yorwba · on Oct 31, 2018

Unfortunately, Wikidata has a lot less structured data than the semi-structured data you can get out of Wikipedia. A database like Wikidata simply doesn't get the same broad user base as Wikipedia, and therefore also gets fewer contributions.

_0nac · on Nov 1, 2018

Wikidata is one of the backends of Wikipedia and much of its information has been extracted from there.

thaumasiotes · on Oct 31, 2018

>> It turned out Wikipedia categories aren’t strictly hierarchical at all, but are used for so many “related” things as to make them useless for determining what kind of a thing an article represented.

> This feels like a problem worth solving on Wikipedia itself.

Why would that be a problem at all? "Apple" (I assume this refers to the company) makes perfect sense in the "Steve Jobs" category. If you looked up the listing for Category:Steve_Jobs you'd expect to find it there. "Fixing" this problem would just make Wikipedia worse.

The better fix would be to realize that categories aren't strictly hierarchical, and you shouldn't operate as if they are.

thanatropism · on Oct 31, 2018

The next step is to make crossword puzzles from this.

(I'm next to illiterate about constraint satisfaction programming. How hard is it to make reasonable crossword puzzles?)

theoh · on Oct 31, 2018

The constraint programming part isn't the bit that makes the difference between a good and bad crossword. The difference is in the quality of the clues. Generating plausible 'cryptic' clues is probably well beyond the ability of current AI.

If you wanted 'Jeopardy'-style clues, that's easier.

WAthrowaway · on Oct 31, 2018

Creating a grid to fit the chosen words in wouldn't be so hard. The real difficulty would be coming up with clever clues that weren't just the first paragraph of the wiki article with the names removed.

Udik · on Nov 1, 2018

Like others, I did the test and was rather disappointed at my score :). And yes, there seem to be a lot of rappers and Bollywood stars and movies in the quiz that don't really appeal to my European sense of "culture". I wonder if instead of (or better, in addition to) page popularity it wouldn't be wise to use the number of translations of an entry in other languages. That should at least ensure that an item is considered important across local cultures, which is usually a good indicator of cultural importance. Did you try that?
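
A rough sketch of what that could look like, assuming each candidate item already comes with a pageview count and a count of language editions (both fetched elsewhere). The item names, numbers, and the `min_languages` threshold are placeholders invented for illustration, not real statistics and not how the quiz actually works:

```python
def rerank(items, min_languages=50):
    """Keep items that exist in at least `min_languages` language editions,
    then order the survivors by pageviews, most viewed first.

    `items` is a list of (title, pageviews, language_count) tuples.
    """
    kept = [item for item in items if item[2] >= min_languages]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Placeholder data (hypothetical values, for illustration only).
candidates = [
    ("item_a", 900_000, 12),   # very popular, but few translations
    ("item_b", 400_000, 140),  # less popular, widely translated
    ("item_c", 650_000, 95),
]

for title, views, languages in rerank(candidates):
    print(title, views, languages)
```

Using the translation count as a filter on top of popularity, rather than as a replacement for it, matches the "in addition to" variant of the suggestion.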

JauntyHatAngle · on Oct 31, 2018

Isn't it a bit much to say "Scientifically Accurate" when people are more or less just checking boxes? My feeling would be that people are massively overrepresenting their own knowledge.

crazygringo · on Oct 31, 2018

Author/creator here, happy to answer any questions.

kbenson · on Oct 31, 2018

That's an amazing project, and I loved reading all the ways you worked through the problems as you encountered them.

> There was just one final problem: occasionally items would pop up that were definitely NSFW, or just made you feel icky when reading the description. To make the quiz more family-friendly, I filtered out anything related to adult entertainment (quite a few porn stars in the top 10,000), as well as contemporary people notable principally for violent crime (whether as perpetrators or victims). There are just… some things you’d rather not read about while eating lunch.

I can't help but think it would be fun to take the version with NSFW content still in, or even limited to only those items.

It would be really interesting to use that to see how certain subgroups are aware of this content. E.g. Certain subreddits and 4chan...

psychometry · on Oct 31, 2018

Wouldn't it make more sense to ask country first and then generate examples? I'm from the U.S. and 10% of my examples were Bollywood actors.

crazygringo · on Oct 31, 2018

I'd love to do exactly that, but right now I get my popularity stats from English-language Wikipedia pageviews, which is majority-influenced by the US but also has significant traffic from India and the UK.

If Wikipedia ever breaks down its pageview stats by country, so I can generate different quizzes per-country, that's the first thing I'll do!

Udik · on Nov 1, 2018

Not to nag you, but have you seen my suggestion of using the number of translations of each entry as an indicator of their cross-cultural value? What do you think of it?

ppereira · on Nov 1, 2018

You may wish to check out the MIT Pantheon project which ranks people not just by page views, but also by the log of birth year and number of languages that their biography has been translated into. With that metric, knowing Aristotle would be much more valuable than knowing Justin Bieber, whose name is likely to decline in importance well within one lifetime, and is perhaps hardly known at all outside certain countries.

tobr · on Oct 31, 2018

I wonder if you're mostly measuring someone's interpretation of "uniquely identify" and "already knew existed"?

Did you consider making it an actual quiz with options to verify if someone actually knows what they claim? (Only skimmed the article, sorry if you mentioned it)

crazygringo · on Oct 31, 2018

Thanks -- it's something I thought about a ton.

At the end of the day, your personal result will be for however you interpreted the instructions. But population-wide results should remain comparatively valid assuming each population is comparable in terms of being conservative/liberal with what they claim to know. E.g. you can still measure the difference between 25-year-olds and 30-year-olds because they're answering it in the same way collectively, and it's the differences I'm more interested in for research, than the absolute values.

I thought about making it something like a multiple-choice quiz, but it would have made the quiz a lot slower, and therefore either a lot longer or a lot less accurate, so checkboxes seemed like the only way to go.

rahulcap · on Oct 31, 2018

Agreed that the hardest part is taking the quiz. Not sure if you'll see consistent interpretations based on age/education, but curious to see your analysis of this!

I thought the explanations in the "more info" box were very useful. I'd suggest keeping it open by default on the first page (closed on other pages), as it seems critical to read that box before taking the quiz. Just a thought!

skykooler · on Oct 31, 2018

This was also a bit confusing to me. "Goodfellas"? I knew that was a film (I've heard the name before), but I haven't seen it, and I couldn't tell you who's in it or what it's about. Does that count as "uniquely identifying?"

Liquid_Fire · on Nov 1, 2018

> Uniquely identify means you can identify the specific person or thing — not just a category.

> Just knowing someone is an “actor” isn't enough, because there are lots of actors. But if you can name their role in a specific movie or recognize them by sight, that’s uniquely identifying — so check the box!

So I would say no.

throwawaw666 · on Oct 31, 2018

Have you ever seen this sketch? https://www.youtube.com/watch?v=vZ9myHhpS9s

8bitsrule · on Nov 1, 2018

I realize you had to boil down a giant pile into a representation, but how? Some cultural categories seem to be under-represented. E.g. I didn't see a single composer in the whole lot ... or cathedral ... very few artworks ... too many films ... And it's pop-culture heavy.

Reminds me of Kenneth Clark's definition of 'Civilization' including only Europe ...

You've got the start of something here, but is it culture or people magazine?

webwanderings · on Oct 31, 2018

How are you throwing Indian cultural entities? It doesn't seem to make sense, though I have not read your literature.

rcMgD2BwE72F · on Oct 31, 2018

Why use Wikipedia infoboxes instead of Wikidata items?

crazygringo · on Oct 31, 2018

I would have loved to use Wikidata -- I actually first attempted a prototype of this several years ago using the similar Freebase as a datasource, until it was bought and shut down by Google.

Wikidata looked very promising, but I was worried if it would contain all the data I would wind up needing, or if it would be in the same format 2 or 5 years from now. Wikipedia is a household name and the information in it has a lot of eyeballs on it constantly, while Wikidata as a project I couldn't tell if I could be equally confident in -- so really just taking a conservative approach is the only reason.

ReverseCold · on Oct 31, 2018

Wikidata is part of Wikimedia, which is the same org that controls Wikipedia.

afandian · on Oct 31, 2018

Wikimedia 'controls' the software and infrastructure, but is relatively hands-off about the content. There is, shall we say, some give-and-take between Wikimedia and the various projects' communities.

The communities overlap between Wikipedia and Wikidata to a large extent, but are distinct.

young_unixer · on Oct 31, 2018

It's worth pointing out that "Your culture" here means "first world, English speaking countries' culture".

telesilla · on Oct 31, 2018

More than that - pop culture. I hardly know any rappers or RnB singers in the 21st century but I can name, for a start, a number of contemporary philosophers, computer languages and their creators, artists and composers. I don't think I'm in the minority for not knowing pop culture? Still, impressive project from the author and I understand that the result is ultimately pop-culture skewed, given the restrictions.

tshaddox · on Nov 1, 2018

That’s true, but this is culture as defined by Internet traffic, which is pretty reasonable. Everyone knows a lot about the aspects of culture they care about, by definition, so it doesn’t make much sense to have a test for that.

weinzierl · on Oct 31, 2018

I had this experience when I visited Madame Tussauds in London for the first time after about 20 years and I was like: "Who are all these people?".

What I remembered was a good mixture of artists and scientists from different epochs. Now they have crammed a few scientists in a dusty corner and the rest of the museum is full of people I don't recognize.

l9k · on Oct 31, 2018

That's why it is called POPULAR culture, because that's what the majority of people know.

RLN · on Nov 1, 2018

I'd go as far as to say it generally just means United States culture.

dang · on Oct 31, 2018

Recent discussion: https://news.ycombinator.com/item?id=18175910

kevinwang · on Oct 31, 2018

Shouldn't the logistic regression picture have a picture that looks more like this instead of a line? https://en.m.wikipedia.org/wiki/Logistic_regression#/media/F.... Or is the curve just really flat?

crazygringo · on Oct 31, 2018

You're right, I mixed up the terms while writing this. It doesn't use the logistic function, it's a more general case of binomial regression [1]. (The example is a line, but the site actually uses a logarithmic function as its link function.) Just corrected the post, thanks.

[1] https://en.wikipedia.org/wiki/Binomial_regression

rahulcap · on Oct 31, 2018

Very interesting article. In the end, I was a bit confused on how you converted the binomial regression to a single number. I understood that the output was a probability that I know each of the 10,000 items, so then did you need to use some cutoff to decide that I "knew it"?

Anyways, I am interested to see what analysis you do after you get more data.

crazygringo · on Oct 31, 2018

Thanks for the interest -- it's actually just a sum of the probabilities for the items from 1 to 10,000. For example, if there's a 0.1 chance you know each of 10 items, it adds up to a total value of 1 -- no cutoff needed.

Mathematically, there's a trick where you don't even need to compute the sum item-by-item... I calculate the binomial regression which gives me the two relevant parameters, from which I can calculate the probability density function (PDF) [1] for an item of given rank. Then I just calculate the associated cumulative distribution function (CDF) with the same two parameters [2] for rank 10,000 -- and that's the final result.

[1] https://en.wikipedia.org/wiki/Probability_density_function

[2] https://en.wikipedia.org/wiki/Cumulative_distribution_functi...
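
For anyone who wants to see the "sum of probabilities" idea in code, here is a minimal sketch. The first part reproduces the 10 items at 0.1 example above. The second part uses an assumed curve of the form log p = a + b * rank with made-up parameters `a` and `b`; that form and those values are illustrative guesses, not the site's actual fitted model, and the closed form shown is just the geometric series for that assumed curve, not the CDF calculation described above:

```python
import math

def expected_known(probabilities):
    """Expected number of known items: the sum of per-item probabilities."""
    return math.fsum(probabilities)

# Ten items, each with a 0.1 chance of being known, add up to about 1.
print(expected_known([0.1] * 10))

# Assumed curve (hypothetical parameters): log p = a + b * rank.
A, B = -0.5, -0.0005

def p_known(rank, a=A, b=B):
    return math.exp(a + b * rank)

def expected_known_closed_form(n, a=A, b=B):
    """Same sum over ranks 1..n without looping, via the geometric series."""
    q = math.exp(b)
    return math.exp(a) * q * (1.0 - q ** n) / (1.0 - q)

item_by_item = expected_known(p_known(r) for r in range(1, 10_001))
shortcut = expected_known_closed_form(10_000)
print(item_by_item, shortcut)
```

The point is only that no per-item cutoff is involved: once the curve's parameters are known, the score is the total probability mass over ranks 1 to 10,000.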

arielbaz · on Oct 31, 2018

Is combosaurus one of the 10,000? https://www.quora.com/Whatever-happened-to-Combosaurus

dpatrick86 · on Nov 1, 2018

Following the instructions pedantically (e.g. emphasizing "uniquely identify," especially for the individuals) probably leads to an enormous biasing of the score.

personjerry · on Nov 1, 2018

How did you drive traffic to your quiz site? I feel like this will significantly affect your results.

jordiburgos · on Oct 31, 2018

How are the items selected?

asianthrowaway · on Oct 31, 2018

I guess I'm not cultured for not knowing about all the rappers and actors who seem to make up 90% of the list.

I did have a good laugh at Jared Kushner being listed as an "investor".

jonwachob91 · on Oct 31, 2018

His money is in real estate investments. Might not be tech VC, but an investor is still an investor.

mwfunk · on Nov 1, 2018

It is funny (sad and maybe a red flag but also funny). His job description is based on where his money goes, not where it comes from.