At 4 GB, I'd just as soon query this locally, but this looks like a fun exercise.
I notice that there were 10,729 distinct ASINs out of 15,583 Amazon links in 8,399,417 comments. Since I don't generally (ever?) post Amazon links, I'd be interested in expanding on this in two ways.
First, I'd reduce/eliminate the weight of repeated links to the same book by the same commenter.
Second, I'd search for references to the linked books that aren't Amazon links. Someone links to Code Complete? Add it to the list. In a second pass, increment its count every time you see "Code Complete," whether it's in a link or not.
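A rough sketch of those two passes, under some loud assumptions: comments arrive as (author, text) pairs, `titles_by_asin` is a hypothetical ASIN-to-title lookup that would have to be built separately, and the URL pattern only covers the common `/dp/` and `/gp/product/` link shapes. It also applies the once-per-commenter rule to plain-text mentions, which goes a bit beyond what was proposed above.

```python
import re
from collections import Counter

# Captures the 10-character ASIN from /dp/ASIN and /gp/product/ASIN style links.
# Other Amazon URL shapes exist and are not handled here.
ASIN_RE = re.compile(
    r"amazon\.[a-z.]+/(?:[^\s\"'<>]*?/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)

def count_books(comments, titles_by_asin):
    """comments: iterable of (author, text) pairs.
    titles_by_asin: hypothetical dict mapping ASIN -> book title."""
    comments = list(comments)
    seen = set()        # (author, asin) pairs already counted
    counts = Counter()  # asin -> number of distinct commenters

    # Pass 1: count Amazon links, at most once per commenter per book.
    for author, text in comments:
        for asin in ASIN_RE.findall(text):
            if (author, asin) not in seen:
                seen.add((author, asin))
                counts[asin] += 1

    # Pass 2: count plain-text title mentions of books that were linked
    # at least once, again at most once per commenter per book.
    linked = list(counts)
    for author, text in comments:
        for asin in linked:
            title = titles_by_asin.get(asin)
            if title and title in text and (author, asin) not in seen:
                seen.add((author, asin))
                counts[asin] += 1
    return counts
```

The title check here is a plain case-sensitive substring test, so it says nothing about whether the mention is positive.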
Discounting multiple links by the same user is a good idea. Your second suggestion raises some rather complex problems. For example, a comment like "Code Complete is the worst book I ever read" is certainly not an endorsement, while linking to a book in most cases is. Also, a sentence like "programming perl is fun" does not necessarily refer to the book.
So this would require some form of sentiment analysis and also require book titles to be uniquely identifiable.
I know we traditionally process tokens as case-insensitive, but... it seems reasonable to assume that book titles in HN comments would be capitalized properly (so we could ignore non-capitalized titles). I'm not sure whether the version on BigQuery preserves the original capitalization, though.
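A minimal sketch of that kind of case-sensitive matching, assuming the comment text keeps its original capitalization. The function name `mentions_title` is made up for illustration. It requires the exact capitalized title and rejects matches that run into a longer word.

```python
import re

def mentions_title(text, title):
    """Return True if `title` appears in `text` with its exact capitalization,
    not embedded in a longer word."""
    # re.escape guards against titles containing regex metacharacters (e.g. "C++").
    pattern = r"(?<!\w)" + re.escape(title) + r"(?!\w)"
    return re.search(pattern, text) is not None

# "programming perl is fun" is skipped; the capitalized title is picked up.
print(mentions_title("programming perl is fun", "Programming Perl"))
print(mentions_title("I learned a lot from Programming Perl.", "Programming Perl"))
```

This still gets fooled by a sentence that happens to start with "Programming Perl is fun", and it does nothing about sentiment.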
> At 4 GB, I'd just as soon query this locally, but this looks like a fun exercise.
Querying locally requires scraping all the Hacker News data yourself. I have a tool for that (https://github.com/minimaxir/get-all-hacker-news-submissions...), which I mentioned in the post you linked, but getting and processing the data still takes a significant amount of time, which is why the BigQuery dataset has a significant advantage.
https://news.ycombinator.com/item?id=10440502