Google Books Ngram of George Carlin's Seven Dirty Words (googlelabs.com)
43 points by wookiehangover on Dec 18, 2010 | 15 comments



Many false positives in the early years are due to 's' being written like 'f' and misread by the OCR software. Both the short and tall forms were used, but there seem to have been style preferences that led to the tall 's' being used at the beginning of a word and the short 's' at the end.

In handwriting or italic, the tall 's' was rather like the integral symbol, but when set in a serifed font it looks pretty much like an 'f' that is missing half or all of the crossbar.
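
For anyone who wants to screen OCR output for this, here is a rough sketch of the idea: treat every non-final 'f' as a possible tall 's' and see whether any resulting spelling is a known word. The function name `long_s_candidates` and the tiny `VOCAB` set are made up for illustration; a real check would need a proper period dictionary, and this says nothing about how Google's OCR actually works.

```python
from itertools import product

def long_s_candidates(token, vocabulary):
    """Return vocabulary words obtainable by reading some 'f' as a tall s."""
    token = token.lower()
    # Per the style preference described above, the tall s was not used at
    # the end of a word, so a final 'f' is never reinterpreted.
    positions = [i for i, ch in enumerate(token[:-1]) if ch == "f"]
    found = set()
    # Try every combination of keeping 'f' or reading it as 's'.
    # This is exponential in the number of 'f's, which is fine for single words.
    for choice in product("fs", repeat=len(positions)):
        chars = list(token)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != token and candidate in vocabulary:
            found.add(candidate)
    return sorted(found)

# Hypothetical, deliberately tiny vocabulary for the demo.
VOCAB = {"first", "suck", "self", "fist"}

print(long_s_candidates("firft", VOCAB))
print(long_s_candidates("fuck", VOCAB))
```

A token that comes back with candidates is only a suspect, not a confirmed misread: "fuck" and "suck" are both real words, so the scan of the page still has to be checked.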


Interestingly, the short s seems to have replaced the tall one in a relatively short timeframe. Apart from the transition period around 1800, this looks like a smooth curve: http://ngrams.googlelabs.com/graph?content=fuck,suck&yea...


Andrew West provides an excellent (and detailed) list of the rules for using "long s" (as well as some Ngram graphs) at http://babelstone.blogspot.com/2006/06/rules-for-long-s.html.


Thank you, this is a great overview.


Just one example of why people need to be careful using this data. It's a very useful source, but I've been seeing a lot of uncritical uses cropping up. For example, some people are using it to track intellectual trends: compare the graphs of Heidegger and Russell, and so on. This can sometimes work, but depends heavily on: 1) uniqueness of names; and 2) the particular set of books included in Google's corpus (especially if comparing people not from exactly the same field, like a scientist versus an artist).

Even with relatively unique names, it can be tricky. The case of completely or almost completely unique last names (like "Nietzsche") is easy, but with the available interface to the data, it's difficult to handle cases where First+Last is unique, but last alone isn't. You need to count things like "First Last" and "Last, First", plus variants like "First M. Last", without double-counting.
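
If you had the raw text rather than just the Ngram interface (a big assumption, since the interface only exposes aggregated counts), one way to avoid double-counting is to put all the variants into a single regular expression, so that each stretch of text can match at most once. A minimal sketch, where `count_name_mentions` is a hypothetical helper and the three variants are just the ones listed above:

```python
import re

def count_name_mentions(text, first, last):
    """Count mentions of a person, matching each stretch of text at most once."""
    first_re, last_re = re.escape(first), re.escape(last)
    pattern = re.compile(
        # "First Last" or "First M. Last" (optional single-letter middle initial)
        rf"\b(?:{first_re}\s+(?:[A-Z]\.\s+)?{last_re}"
        # "Last, First"
        rf"|{last_re},\s+{first_re})\b"
    )
    # findall returns non-overlapping matches, so no span is counted twice.
    return len(pattern.findall(text))

sample = (
    "Bertrand Russell met Jane Russell. "
    "See Russell, Bertrand (1912). "
    "Bertrand A. Russell replied."
)
print(count_name_mentions(sample, "Bertrand", "Russell"))
```

Bare "Russell" is deliberately not counted here, which is exactly the trade-off: you lose every mention that uses the last name alone.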


Multiple meanings for 'tits' are also a consideration. (Maybe my monitor or eyes suck, but I had to do a tits vs fuck only ngram search to see that the big spike is in fact tits.)


I think either your monitor or your eyes must in fact fu- I mean suck, because the big bulge (it's not exactly a spike) is for "fuck", not for "tits".

(And most of it does in fact seem to be for wrongly-OCRed "suck", unsurprisingly.)


Note the long-s discussion earlier in this submission.

Maybe life just really fucked back then.


Yes, a quick look at the actual search hits shows that most of these books are in fact about birds, and a couple hits are citations and refer to a last name.


If a positive is a hit on one of the seven words, then aren't you saying there are a lot more "f"-"uck"s than there should be (via "suck")? I would have thought the reverse: that there would be more "f"-"hit"s, i.e. false negatives.


Does anyone else have a really hard time telling the difference between the green that's used for 'fuck', and the green that's used for 'tits'?

I don't think I'm suddenly going colour-blind, but it strikes me as odd that Google would pick two colours that are so close to each other for a graph that needs so few separate colours...


In the first page of search results for 1650 - 1724 I saw 4 different words OCR'd as "shit": This, that, first, and shit.

"first" looks like "firft", but "This" and "that" look pretty standard.


Interesting to see how shit and piss diverged after WWII. I wonder what propelled shit's cultural ascent.

(http://ngrams.googlelabs.com/graph?content=shit,piss&yea...)



It looks like somewhere near the 60s, people stopped caring about not using expletives! :)



