Google Books Ngram of George Carlin's Seven Dirty Words

Isamu · on Dec 18, 2010

There are many false positives in the early years because the tall 's' was printed much like an 'f' and is misread by the OCR software. Both the short and tall forms were used, but there seem to have been style preferences that led to the tall 's' being used at the beginning or in the middle of a word, and the short 's' at the end.

In handwriting or italic type, the tall 's' was rather like the integral symbol, but set in a serifed font it looks pretty much like an 'f' that is missing half or all of its crossbar.
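
To make the effect concrete, here is a rough sketch of what an OCR engine would produce if it read every tall 's' as an 'f'. It assumes a simplified rule (every lowercase 's' except a word-final one is printed tall); the real typographic rules have more exceptions, and the function name `long_s_misread` is made up for illustration.

```python
def long_s_misread(word: str) -> str:
    """Return the word as OCR might read it if each tall 's' is taken for 'f'.

    Assumption (simplified): every lowercase 's' except a word-final one
    was printed as a tall 's'. Historical practice had more exceptions.
    """
    if not word:
        return word
    body, last = word[:-1], word[-1]
    # Only non-final positions are treated as tall 's' and therefore misread.
    return body.replace("s", "f") + last


if __name__ == "__main__":
    for w in ["suck", "first", "shit", "tits"]:
        print(w, "->", long_s_misread(w))
```

Under that assumption, a word that starts with 's' gains a spurious 'f' reading, while a word that only ends in 's' is left alone.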

phreeza · on Dec 18, 2010

Interestingly, the short s seems to have replaced the tall one in a relatively short timeframe. Apart from the transition period around 1800, this looks like a smooth curve: http://ngrams.googlelabs.com/graph?content=fuck,suck&yea...

bgrainger · on Dec 18, 2010

Andrew West provides an excellent (and detailed) list of the rules for using "long s" (as well as some Ngram graphs) at http://babelstone.blogspot.com/2006/06/rules-for-long-s.html.

Isamu · on Dec 19, 2010

Thank you, this is a great overview.

_delirium · on Dec 19, 2010

Just one example of why people need to be careful using this data. It's a very useful source, but I've been seeing a lot of uncritical uses cropping up. For example, some people are using it to track intellectual trends: comparing the graphs of Heidegger and Russell, and so on. This can sometimes work, but depends heavily on: 1) uniqueness of names; and 2) the particular set of books included in Google's corpus (especially if comparing people not from the same exact area, like a scientist versus an artist).

Even with relatively unique names, it can be tricky. The case of completely or almost completely unique last names (like "Nietzsche") is easy, but with the available interface to the data, it's difficult to handle cases where First+Last is unique, but last alone isn't. You need to count things like "First Last" and "Last, First", plus variants like "First M. Last", without double-counting.
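
If you had the raw text rather than just the n-gram counts (an assumption; the public interface doesn't give you that), one way to avoid double-counting is to match all the variants with a single alternation, so each mention is consumed exactly once. This is only a sketch: `count_name_mentions` is a made-up helper, and it handles just the three variants listed above.

```python
import re


def count_name_mentions(text: str, first: str, last: str) -> int:
    """Count mentions of a person, covering three variants in one pass:
    "First Last", "First M. Last", and "Last, First".

    A single pattern with non-overlapping matches means no mention is
    counted twice. A bare "Last" on its own is deliberately not counted.
    """
    f, l = re.escape(first), re.escape(last)
    pattern = re.compile(
        rf"\b(?:{f}\s+(?:[A-Z]\.\s+)?{l}"  # First Last / First M. Last
        rf"|{l},\s+{f})\b"                 # Last, First
    )
    return sum(1 for _ in pattern.finditer(text))


if __name__ == "__main__":
    sample = (
        "Bertrand Russell wrote it. Russell, Bertrand (1912). "
        "Bertrand A. Russell and Jane Russell."
    )
    print(count_name_mentions(sample, "Bertrand", "Russell"))
```

With only per-n-gram totals you can't do this, because the same occurrence shows up in the counts for several overlapping n-grams.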

celticjames · on Dec 18, 2010

Multiple meanings for 'tits' are also a consideration. (Maybe my monitor or eyes suck, but I had to do a tits vs fuck only ngram search to see that the big spike is in fact tits.)

gjm11 · on Dec 18, 2010

I think either your monitor or your eyes must in fact fu- I mean suck, because the big bulge (it's not exactly a spike) is for "fuck", not for "tits".

(And most of it does in fact seem to be for wrongly-OCRed "suck", unsurprisingly.)

pavel_lishin · on Dec 19, 2010

Note the long-s discussion earlier in this submission.

Maybe life just really fucked back then.

_l4lu · on Dec 18, 2010

Yes, a quick look at the actual search hits shows that most of these books are in fact about birds, and a couple hits are citations and refer to a last name.

mdda · on Dec 18, 2010

If a positive is a hit on one of the seven words, then aren't you saying there are a lot more "f"-"uck"s than there should be (via "suck")? I would have thought the reverse: that there would be more "f"-"hit"s, i.e. false negatives.

burgerbrain · on Dec 19, 2010

Does anyone else have a really hard time telling the difference between the green that's used for 'fuck', and the green that's used for 'tits'?

I don't think I'm suddenly going colour-blind, but it strikes me as odd that Google would pick two colours that are so close to each other for a graph that needs so few separate colours...

davidmathers · on Dec 18, 2010

In the first page of search results for 1650 - 1724 I saw 4 different words OCR'd as "shit": This, that, first, and shit.

"first" looks like "firft", but "This" and "that" look pretty standard.

iskander · on Dec 19, 2010

Interesting to see how shit and piss diverged after WWII. I wonder what propelled shit's cultural ascent.

(http://ngrams.googlelabs.com/graph?content=shit,piss&yea...)

david_p · on Dec 18, 2010

1800 was the year ... :) http://ngrams.googlelabs.com/graph?content=tits%2Cboobs%2Cbr...

makmanalp · on Dec 18, 2010

It looks like somewhere near the 60s, people stopped caring about not using expletives! :)