Here's a discussion of the original upload of Hacker News data to Google BigQuer...

gkst · on Jan 18, 2016

Discounting multiple links by the same user is a good idea. Your second suggestion raises some rather complex problems. For example, if a comment says "Code Complete is the worst book I ever read", it is certainly not an endorsement, while linking to a book in most cases is. Also, a sentence like "programming perl is fun" does not necessarily refer to the book.

So this would require some form of sentiment analysis and also require book titles to be uniquely identifiable.

jkyle · on Jan 18, 2016

> if a comment goes like "Code Complete is the worst book I ever read" it is certainly not an endorsement

True, but they read it. Which is a testament to its popularity.

If the list were treated like a New York Times Best Sellers list and not a testament of absolute quality, mentions would be sufficient.

flubert · on Jan 18, 2016

>Discounting multiple links by the same user is a good idea.

Yes, at least 14 of the "The Rent Is Too Damn High" links seem to come from the user "jseliger":

https://www.google.com/?gws_rd=ssl#q=site:news.ycombinator.c...
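Discounting repeats like these could be sketched roughly as follows. This is only an illustration, assuming the links have already been extracted as (user, book) pairs; the function name, the pair layout, and the sample data are made up and are not the actual BigQuery columns.

```python
from collections import defaultdict

def count_distinct_linkers(links):
    """Count, per book, how many different users linked to it.

    `links` is an iterable of (user, book) pairs. Both fields are
    hypothetical names, not the actual BigQuery columns.
    """
    users_per_book = defaultdict(set)
    for user, book in links:
        users_per_book[book].add(user)  # a set ignores repeat links by one user
    return {book: len(users) for book, users in users_per_book.items()}

# Made-up sample data, for illustration only.
sample = [
    ("alice", "Book A"),
    ("alice", "Book A"),
    ("alice", "Book A"),
    ("bob", "Book A"),
    ("carol", "Book B"),
]
print(count_distinct_linkers(sample))  # {'Book A': 2, 'Book B': 1}
```

With this counting, one user posting the same link many times contributes a single vote for that book.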

flubert · on Jan 18, 2016

>Also a sentence like "programming perl is fun" does not necessarily refer to the book.

...but the counts might be low enough to manually check for those instances. I'm surprised the counts are so low.

tedmiston · on Jan 18, 2016

I know we traditionally process tokens as case-insensitive, but... it seems reasonable to assume that book titles in HN comments would be capitalized properly (so we could ignore non-capitalized titles). I'm not sure whether capitalization is preserved in the version on BigQuery, though.
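A minimal sketch of that idea, assuming a list of known titles is available up front (which is itself an open problem here). The helper name `find_titles` and the sample title list are hypothetical.

```python
import re

def find_titles(comment, titles):
    """Return the known titles that appear in `comment` with exact capitalization."""
    found = []
    for title in titles:
        # \b avoids matching inside a longer word. re.IGNORECASE is
        # deliberately not used, so "programming perl" is skipped.
        if re.search(r"\b" + re.escape(title) + r"\b", comment):
            found.append(title)
    return found

titles = ["Code Complete", "Programming Perl"]
print(find_titles("programming perl is fun", titles))         # []
print(find_titles("I liked Programming Perl a lot", titles))  # ['Programming Perl']
```

This would still miss properly titled books written in lowercase by the commenter, and the word-boundary check only behaves as intended for titles that start and end with a letter or digit.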

gkst · on Jan 19, 2016

The full text of the comment is available on BigQuery, but I can't write an SQL query that returns all comments containing potential book titles.

To do such an analysis, I'd need to download all 8M comments, process them individually, and find a good way to detect book titles.
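The local processing step might look something like the sketch below. It assumes the comments were exported as one JSON object per line with a `text` field, and that a title list already exists; the export layout, the field names, and the function name are all assumptions, and detecting titles without such a list remains unsolved.

```python
import json
from collections import Counter

def count_title_mentions(lines, titles):
    """Count comments mentioning each title, reading one JSON comment per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        text = json.loads(line).get("text") or ""  # deleted comments may have no text
        for title in titles:
            if title in text:  # case-sensitive substring match
                counts[title] += 1
    return counts

# Made-up sample records; "by" and "text" are assumed field names.
sample_lines = [
    '{"by": "alice", "text": "Code Complete is worth reading."}',
    '{"by": "bob", "text": "programming perl is fun"}',
    '{"by": "carol", "text": null}',
]
result = count_title_mentions(sample_lines, ["Code Complete", "Programming Perl"])
print(result)
```

Passing an open file object as `lines` would stream the dump one comment at a time instead of loading everything into memory.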

gkst · on Jan 18, 2016

I guess the counts would be a lot higher if I searched for book titles instead of links to a particular shop.

minimaxir · on Jan 18, 2016

> At 4 GB, I'd just as soon query this locally, but this looks like a fun exercise.

This requires scraping all the Hacker News data manually. I have a tool for that (https://github.com/minimaxir/get-all-hacker-news-submissions...), which I mentioned in the post you linked, but it still takes a significant amount of time to get and process the data, which is why the BigQuery dataset has a significant advantage.