Hacker News
Grand Prize Awarded To BellKor Pragmatic Chaos (netflixprize.com)
85 points by physcab on Sept 21, 2009 | 18 comments



Did they figure out the Napoleon Dynamite conundrum?


I did a search on Google, but couldn't find the reference. Care to share?


From what I remember reading about the contest, the problem was with using similar movies to generate suggestions: people either hated or loved Napoleon Dynamite, and their rating for it didn't correlate at all with their previous movie ratings. That's how I understood it at a basic level.



I'll repost from another thread about the NYTimes article, http://news.ycombinator.com/item?id=834681 : The info about the next contest surprised me:

"The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes..." It's enough to identify 87% of the people, apparently: http://www.freedom-to-tinker.com/blog/paul/netflixs-impendin...

I hadn't realized that someone identified some of the raters in the previous contest with their imdb ratings: http://www.cs.utexas.edu/~shmat/netflix-faq.html

Also, why this matters, page 44 of a research paper (PDF): http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006


so why didn't the ensemble win? in the rules it says, 'At the end of this period, qualifying submissions will be judged (see Judging below) in order of the largest improvement over the qualifying RMSE on the test subset.' this isn't the icfp. i assume the person on the leaderboard is in fact the leader.

the difference between the 'test' and 'quiz' sounds b.s. to me. at this rate, i know i won't even be contemplating netflix prize 2. at the very least, netflix owes it to the community to explain why they (at this point, seemingly corruptly) made the decision they did.

i suppose they say they'll post the final 'test subset' scores on the leaderboard. it doesn't appear to be a very 'open' contest if they don't actually publish exactly what the test subset is and how the rankings are determined. when it's that close, everything should be shown to be exactly done within the rules. otherwise, they really risk delegitimizing the whole contest, imho.
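For what it's worth, the RMSE the rules refer to is simple to compute; a toy sketch with made-up ratings (not contest data):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(predicted))

# Submissions were ranked by improvement over Cinematch's baseline
# RMSE (roughly 0.95 on the hidden subsets); the grand prize required
# a 10% improvement on the test subset.
preds = [3.8, 2.1, 4.5, 1.9]
truth = [4.0, 2.0, 5.0, 2.0]
print(round(rmse(preds, truth), 4))  # 0.2784
```

The subtlety the thread is arguing about isn't the metric itself but which hidden subset it's computed on.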


I'm not familiar with the details in any way, but it sounds like you could probably produce an algorithm that gives a perfect result on the test set if that set is drawn from publicly available data. That doesn't mean your algorithm is good in general; it just means you essentially encoded the test data into your algorithm. The real test is always how well you do on data you've never seen.

As for why the interim leaderboard shows results based on the public dataset: using a specially crafted algorithm, you could leak information about hidden data through any feedback you get from the test environment. That information could then be used to build an algorithm of the sort described above that does well on this specific case.

see also: http://en.wikipedia.org/wiki/Overfitting
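To make that concrete, here's a toy simulation (made-up data, nothing to do with the actual contest): 200 equally mediocre "models" are scored on a hidden 50/50 split, the one that looks best on the quiz half is selected, and its quiz score turns out to be an optimistic estimate of its score on the test half:

```python
import random

random.seed(0)
N = 1000
truth = [random.gauss(0, 1) for _ in range(N)]

# Hidden 50/50 split into "quiz" (feedback given) and "test" (no feedback).
quiz_idx = set(random.sample(range(N), N // 2))
test_idx = [i for i in range(N) if i not in quiz_idx]

def rmse(pred, idx):
    se = sum((pred[i] - truth[i]) ** 2 for i in idx)
    return (se / len(idx)) ** 0.5

# 200 "models" that are all equally mediocre: truth plus unit noise.
models = [[t + random.gauss(0, 1) for t in truth] for _ in range(200)]

# Select on quiz feedback, as a competitor watching the leaderboard would.
best = min(models, key=lambda m: rmse(m, quiz_idx))

print(rmse(best, quiz_idx))  # flattering: this is the score that drove the selection
print(rmse(best, test_idx))  # typically noticeably worse on the unseen half
```

Nothing about the selected model is actually better; selecting on quiz feedback just biases its quiz score downward, which is exactly why the final ranking has to come from the subset nobody got feedback on.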


i see. but if they don't make their private dataset public, what's to prevent netflix from fitting it to a particular entry?


Since the contest is over, all of the data has been donated to the UCI ML repo: http://archive.ics.uci.edu/ml/datasets/Netflix+Prize

The "secret" data set is in the file "judging.txt" in the grand_prize download.


Why would Netflix corrupt their own competition in this way? I don't understand what you're concerned about.


maybe because 'bellkor's pragmatic chaos' won the progress prize and they have some special relationship with them. could be any number of reasons. this is i think the danger of a private corporation running a programming contest.

i'm not saying netflix is unfair or corrupt. i hope not. but i think they owe it to the contestants to explain exactly why they chose the #2 entry on the leaderboard over the #1 entry. and not just (though i appreciate people's comments) because of overfitting. how is the private 'test' subset chosen? i have no affiliation with this contest or any of the teams in any way, but how would you like to work on this for years only to be told that even though you beat everyone in the quiz, in the private 'test', you lost.


Did you read the contest rules and FAQ? (http://www.netflixprize.com/rules, http://www.netflixprize.com/faq -- and if you didn't, what are you doing throwing around words like "corrupt"?) It looks to me like they're being very transparent and explained things very clearly. You have a bunch of data in the test set; these are partitioned randomly into two equally-sized subsets, "quiz" and "test". You submit predictions for both subsets, but are only told how you did on "quiz". Netflix provides an MD5 checksum for the judging file that defines what the partition is; this file will be made available "at the end of the Contest". So this will be verifiable by anyone soon.

Also, don't brush overfitting aside. In the (paraphrased) words of one machine learning researcher, "life is a battle against entropy. In the same way, machine learning research is a battle against overfitting." Any data whose test results you use to select your algorithm or to tune its parameters is no longer properly part of the test set; after that optimization it provides an optimistically biased estimate of your true error. Since competitors got regular updates on their performance on the quiz subset, one must assume they were optimizing that performance, and so quiz performance was not a good estimate of their true error. You can only get a good estimate of a method's true error by testing against data that has played no part in its development.


> So this will be verifiable by anyone soon

Yeah....what happens when the checksums don't check out?


They don't need to make it public beforehand, but they really should after the fact, so people can verify the calculations on it. Proving it's unbiased is then a simple matter of releasing the hash of the dataset before the contest starts, to prove that the dataset they eventually release is the same one they generated at the beginning.
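That commit-then-reveal idea is standard practice, and checking it needs nothing exotic; a sketch in Python, using a toy stand-in for the judging file (the real one is judging.txt in the released data):

```python
import hashlib
import os
import tempfile

def file_md5(path):
    """MD5 digest of a file, read in chunks so large files are fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Commit: the organizer publishes the digest of the judging file up front.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"movie_id,user_id,subset\n1,42,quiz\n1,99,test\n")  # toy judging file
    path = f.name
published = file_md5(path)

# Reveal: after the contest, anyone recomputes the digest on the released
# file and checks it matches what was published before judging.
assert file_md5(path) == published
os.unlink(path)
print("checksum verified")
```

If the recomputed digest didn't match, that would be evidence the partition was changed after submissions came in.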

Now, on the front of selecting the winners before the contest has even started, and telling them about how to bias their algorithm so it will do better on the final--I can't really help you. I'd just hope that the programming teams would be honorable enough to mention if one of them were offered such a thing.


IIRC it was very clear: once you submitted against the public quiz set and obtained a score that broke the 10% barrier, there was a 30 day wait period for final submissions.

Then the submissions over 10% were run against the final test set and the winner determined from that.

Obviously releasing specific details about the final test set doesn't work because teams can fit to that :)


OK, I agree that more transparency would be nice, but I don't think it's a serious concern. It's clearly in Netflix's best interest to promote a fair competition, since the whole point of the exercise (the reason they're spending a million dollars) is to produce the best possible general algorithm for their own use.


This is great and all, but it still thinks I'll love Once: http://www.imdb.com/title/tt0907657/



