There are many false positives in the early years because the tall 's' was printed much like an 'f' and gets misread by the OCR software. Both the short and tall forms were used, but there seem to have been style preferences that led to the tall 's' being used at the beginning of a word, and the small 's' at the end.
In handwriting or italic, the tall 's' was rather like the integral symbol, but when set in a serifed font it looks pretty much like an 'f' that is missing half or all of the crossbar.
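To get a feel for which words are exposed to this, here is a rough sketch that lists the spellings an OCR pass could produce if it reads a tall 's' as 'f'. The function name `long_s_misreadings` is made up for illustration, and the rule it uses (any lowercase 's' except the last letter of the word may have been printed tall) is an assumption, not something the Ngram data documents.

```python
from itertools import product


def long_s_misreadings(word):
    """Return spellings that could result from reading a tall 's' as 'f'.

    Assumption: any lowercase 's' except the final letter of the word
    may have been printed as a tall 's'.
    """
    # Positions of every 's' that is not the last letter of the word.
    slots = [i for i, ch in enumerate(word[:-1]) if ch == "s"]
    variants = set()
    # Try every combination of "misread" / "read correctly" per position.
    for choice in product([False, True], repeat=len(slots)):
        if not any(choice):
            continue  # skip the unchanged spelling
        chars = list(word)
        for i, swap in zip(slots, choice):
            if swap:
                chars[i] = "f"
        variants.add("".join(chars))
    return variants


print(long_s_misreadings("suck"))
print(long_s_misreadings("tits"))
```

Under that assumption "suck" can turn into "fuck", while "tits" has no candidates because its only 's' is the final letter.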
This is just one example of why people need to be careful using this data. It's a very useful source, but I've been seeing a lot of uncritical uses cropping up. For example, some people are using it to track intellectual trends: comparing the graphs of Heidegger and Russell, and so on. This can sometimes work, but depends heavily on: 1) uniqueness of names; and 2) the particular set of books included in Google's corpus (especially if comparing people not from the same exact area, like a scientist versus an artist).
Even with relatively unique names, it can be tricky. The case of completely or almost completely unique last names (like "Nietzsche") is easy, but with the available interface to the data, it's difficult to handle cases where First+Last is unique, but last alone isn't. You need to count things like "First Last" and "Last, First", plus variants like "First M. Last", without double-counting.
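If you had the raw text rather than just the ngram interface, the no-double-counting part could be sketched roughly like this. The function name `count_mentions` and the three name forms it handles are assumptions for illustration; real citations come in more variants than this.

```python
import re


def count_mentions(text, first, last):
    """Count mentions of a person, matching each span of text at most once.

    Handles "First Last", "First M. Last" and "Last, First". A bare
    "Last" is deliberately not counted, since it may not be unique.
    """
    f, l = re.escape(first), re.escape(last)
    # One pattern with alternation: the regex engine consumes each match,
    # so the same stretch of text cannot be counted under two variants.
    pattern = re.compile(
        rf"\b(?:{f}\s+(?:[A-Z]\.\s+)?{l}|{l},\s+{f})\b"
    )
    return len(pattern.findall(text))


sample = (
    "Bertrand Russell wrote it; see Russell, Bertrand (1872). "
    "Bertrand A. Russell and Russell alone."
)
print(count_mentions(sample, "Bertrand", "Russell"))
```

Using a single alternation instead of summing three separate searches is what avoids the double count; with separate ngram queries you would have to subtract the overlaps yourself.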
Multiple meanings for 'tits' are also a consideration. (Maybe my monitor or eyes suck, but I had to do a tits vs fuck only ngram search to see that the big spike is in fact tits.)
Yes, a quick look at the actual search hits shows that most of these books are in fact about birds, and a couple of hits are citations that refer to a last name.
If a positive is a hit on one of the seven words, then aren't you saying there are a lot more "f"-"uck"s than there should be (via "suck")? I would have thought the reverse: that there would be more "f"-"hit"s, i.e. false negatives.
Does anyone else have a really hard time telling the difference between the green that's used for 'fuck', and the green that's used for 'tits'?
I don't think I'm suddenly going colour-blind, but it strikes me as odd that Google would pick two colours that are so close to each other for a graph that needs so few separate colours...