
Here's a discussion of the original upload of Hacker News data to Google BigQuery:

https://news.ycombinator.com/item?id=10440502

At 4 GB, I'd just as soon query this locally, but this looks like a fun exercise.

I notice that there were 10,729 distinct ASINs out of 15,583 Amazon links in 8,399,417 comments. Since I don't generally (ever?) post Amazon links, I'd be interested in expanding on this in two ways.

First, I'd reduce/eliminate the weight of repeated links to the same book by the same commenter.

Second, I'd search for references to the linked books that aren't Amazon links. Someone links to Code Complete? Add it to the list. In a second pass, increment its count every time you see "Code Complete," whether it's in a link or not.
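The first idea above could be sketched roughly like this in Python. This is only an illustration, not what the original analysis did: it assumes comments are available as hypothetical (author, text) pairs with unescaped URLs, and that an ASIN is the 10-character code following /dp/ or /gp/product/ in an Amazon URL. Each (author, ASIN) pair is counted once, so one commenter linking the same book many times adds only one to that book's count.

```python
import re
from collections import Counter

# Assumption: an ASIN is 10 uppercase alphanumerics after /dp/ or /gp/product/.
ASIN_RE = re.compile(
    r"amazon\.[a-z.]+/(?:\S*?/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)

def count_distinct_endorsers(comments):
    """Return a Counter mapping ASIN -> number of distinct authors linking it.

    `comments` is an iterable of (author, text) pairs (a hypothetical shape,
    not the BigQuery schema).
    """
    seen = set()
    for author, text in comments:
        for asin in ASIN_RE.findall(text):
            # The set collapses repeated links by the same author.
            seen.add((author, asin))
    return Counter(asin for _, asin in seen)

if __name__ == "__main__":
    sample = [
        ("alice", "see http://www.amazon.com/dp/0735619670"),
        ("alice", "again: http://www.amazon.com/Code-Complete/dp/0735619670/ref=x"),
        ("bob", "https://www.amazon.com/gp/product/0735619670"),
    ]
    print(count_distinct_endorsers(sample))
```

Reducing the weight instead of eliminating it would mean counting occurrences per pair and applying some damping function, which is left out here.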



Discounting multiple links by the same user is a good idea. Your second suggestion raises some rather complex problems. For example, a comment like "Code Complete is the worst book I ever read" is certainly not an endorsement, while a link to a book in most cases is. Also, a sentence like "programming perl is fun" does not necessarily refer to the book.

So this would require some form of sentiment analysis and also require book titles to be uniquely identifiable.


> if a comment goes like "Code Complete is the worst book I ever read" it is certainly not an endorsement

True, but they read it. Which is a testament to its popularity.

If the list were treated like a New York Times Best Sellers list and not a testament of absolute quality, mentions would be sufficient.


>Discounting multiple links by the same user is a good idea.

Yes, at least 14 of the "The Rent Is Too Damn High" links seem to come from the user "jseliger":

https://www.google.com/?gws_rd=ssl#q=site:news.ycombinator.c...


>Also a sentence like "programming perl is fun" does not necessarily refer to the book.

...but the counts might be low enough to manually check for those instances. I'm surprised the counts are so low.


I know we traditionally process tokens as case-insensitive, but... it seems reasonable to assume in HN comments that book titles would be capitalized properly (so we could ignore non-capitalized titles). Whether or not this information is present in the version on BigQuery, I'm not sure though.
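A minimal sketch of that case-sensitive matching, assuming the comment text has already been downloaded and that a list of known titles exists (both `comments` and `titles` here are hypothetical inputs). Each comment counts at most once per title, and a lowercase phrase like "programming perl" is ignored.

```python
import re
from collections import Counter

def count_title_mentions(comments, titles):
    """Count comments mentioning each title with exact capitalization.

    `comments` is an iterable of comment strings; `titles` is a list of
    book titles gathered elsewhere (for example, from the linked books).
    """
    # No re.IGNORECASE: "programming perl" must not match "Programming Perl".
    patterns = {
        title: re.compile(r"\b" + re.escape(title) + r"\b")
        for title in titles
    }
    counts = Counter()
    for text in comments:
        for title, pattern in patterns.items():
            if pattern.search(text):
                # One increment per comment, however often it repeats the title.
                counts[title] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Programming Perl is a classic.",
        "programming perl is fun",
        "Code Complete, then Code Complete again",
    ]
    print(count_title_mentions(sample, ["Programming Perl", "Code Complete"]))
```

This does nothing about sentiment, and one-word titles that are also ordinary capitalized words would still need a manual check.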


The full text of the comment is available on BigQuery, but I can't write an SQL query that returns all comments containing potential book titles.

To do such an analysis, I'd need to download all 8M comments, process them individually, and find a good way to detect book titles.


I guess the counts would be a lot higher if I searched for book titles instead of links to a particular shop.


> At 4 GB, I'd just as soon query this locally, but this looks like a fun exercise.

This requires scraping all the Hacker News data manually. I have a tool for that (https://github.com/minimaxir/get-all-hacker-news-submissions...), which I mentioned in the post you linked, but it still takes a significant amount of time to get and process the data, which is why the BigQuery dataset has a significant advantage.



