How about this one: tune your hyper-parameters based on the results on your test...

_ea1k · on July 17, 2023

Yep, I've been guilty of that one lately. That and solving problems by simply overfitting a neural net to the data in the problem domain.

I mean, it works, but the result is less interesting than what I should have done. :)

godelski · on July 17, 2023

Definitely. The problem is that doing this helps you get published rather than hurting you. I think this is why there's often confusion when industry tries to use academic models: they don't generalize well because of this overfitting. But also, evaluation is fucking hard, and there's just no way around that. Trying to make it easy (i.e. benchmarkism) just ends up creating more noise instead of the intended decrease.

eyegor · on July 17, 2023

What about: add more dropout or noise layers and train an ensemble of models. Submit the best one. Is this considered dirty?

godelski · on July 20, 2023

It is unclear what you are actually saying. An ensemble combines the predictions of multiple models, so I'm not sure how you submit "the best one." But what is standard practice is to do some hyper-parameter search and submit the best one. It is status quo that "the best one" is determined by test performance rather than by validation performance (which is the proper way, with no information leakage). These hyper-parameters do also include the number of dropout layers and the dropout percentage. For "noise layer" I'm also not sure what you mean. Noise injection? I don't actually see this commonly in practice, though it definitely is a useful training technique.

But if, in general, you're talking about trying multiple configurations and parameters and submitting the best one, then no, that's not dirty, and it is standard practice. Realistically, I'm not sure how else you could do this, because that's indistinguishable from having a new architecture anyway. People should put variance on their reported values, but this can also be computationally expensive, so I definitely understand why it doesn't happen.
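To make the validation-versus-test distinction concrete, here is a minimal sketch of picking "the best one" without leakage. Everything in it is an assumption for illustration: `evaluate` is a hypothetical stand-in for training and scoring a model, and the toy scores are made up so that the example runs on its own.

```python
import random

def evaluate(config, split, seed=0):
    """Hypothetical stand-in for training with `config` and scoring on `split`."""
    rng = random.Random(f"{sorted(config.items())}-{split}-{seed}")
    # Toy score: a config-dependent signal plus split-specific noise.
    signal = 0.8 - abs(config["dropout"] - 0.3)
    return signal + rng.gauss(0, 0.02)

# Candidate hyper-parameter settings (here only the dropout rate varies).
configs = [{"dropout": d} for d in (0.0, 0.1, 0.3, 0.5)]

# Select using validation scores only.
best = max(configs, key=lambda c: evaluate(c, "val"))

# Touch the test split exactly once, for the already chosen config.
test_score = evaluate(best, "test")
print(best, round(test_score, 3))
```

Swapping `"val"` for `"test"` in the `max` call is the one-line change that turns this into the status-quo version described above.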

Der_Einzige · on July 18, 2023

They banged cross validation into our heads in school and then no one in NLP uses it and I just can’t even understand why not.
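For anyone who hasn't seen it since school, k-fold cross validation can be sketched roughly like this. The helper name `k_fold_indices` is made up for the example, and it only produces index splits; the training and scoring loop is left out.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        # Every fold except the i-th goes into the training set.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, val

for train_idx, val_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(val_idx))  # prints "8 2" five times
```

Each sample lands in the validation fold exactly once, so the k validation scores can be averaged into a single estimate.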

godelski · on July 18, 2023

Not only that, but I've argued with people substantially, where people claim that it isn't information leakage. The other thing I've significantly argued about is random sampling. You wonder why "random samples" in generative model papers are so good, especially compared to samples you get? Because a significant number of people believe that as long as you don't hand select individual images it is a "random sample." Like they generate a batch of samples, don't like it, so generate a new batch until they do. That's definitely not a random sample. You're just re-rolling the dice until you get a good outcome. But if you don't do this, and do an actual random sample, reviewers will criticize you on this even if your curated ones are good and all your benchmarks beat others. Ask me how I know...
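A toy simulation of the re-rolling effect, under loud assumptions: each "sample" is reduced to a single quality score drawn from a standard normal, and `sample_batch`, `honest_batch`, and `rerolled_batch` are hypothetical names, not anything from a real generative-model codebase.

```python
import random
import statistics

rng = random.Random(0)

def sample_batch(n=8):
    """Hypothetical generator: each sample gets a quality score from N(0, 1)."""
    return [rng.gauss(0, 1) for _ in range(n)]

def honest_batch():
    # One draw, shown as-is.
    return sample_batch()

def rerolled_batch(tries=10):
    # Draw several batches and keep the one that looks best on average.
    # No individual sample is hand-picked, yet the result is still curated.
    return max((sample_batch() for _ in range(tries)), key=statistics.mean)

honest = statistics.mean(statistics.mean(honest_batch()) for _ in range(2000))
rerolled = statistics.mean(statistics.mean(rerolled_batch()) for _ in range(2000))
print(f"honest: {honest:.2f}  re-rolled: {rerolled:.2f}")
```

In this toy setup the re-rolled batches score clearly higher on average, even though the underlying generator is identical in both cases.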

sghael · on July 19, 2023

The convention should be to use 1337 as your seed, and disclose that in your publication.

godelski · on July 20, 2023

This is never going to happen given the need to SOTA chase. I say need because reviewers care a lot about this (or rather, they are looking for any reason to reject, and lack of SOTA is a common one). Fwiw, the checkpoints I release with my works include a substantial amount of information, including the random seed and rng_state. My current project tracks a lot more. The reason I do this is both selfish and for promoting good science, because it is not uncommon to forget what arguments or parameters you had in a particular run, and the checkpoint serves as a great place to store that information, ensuring it is always tied to the model and can never be lost.
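A rough sketch of the idea, using only the Python standard library so it runs anywhere. This is not my actual checkpoint format: the function names and dictionary keys are assumptions for the example. In PyTorch the analogous pieces would be `torch.get_rng_state()`, `torch.set_rng_state()`, and `torch.save`.

```python
import pickle
import random

def save_checkpoint(path, weights, args, seed):
    """Bundle weights with the run's arguments, seed, and current RNG state."""
    ckpt = {
        "weights": weights,               # placeholder for real model parameters
        "args": args,                     # the arguments used for this run
        "seed": seed,                     # the starting seed
        "rng_state": random.getstate(),   # RNG state at save time
    }
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)

def load_checkpoint(path):
    """Load a checkpoint and restore the RNG to where it was when saved."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    random.setstate(ckpt["rng_state"])
    return ckpt
```

Because the arguments travel inside the same file as the weights, there is no separate log to lose track of.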

You could also use the deterministic mode in pytorch to create a standard. But I actually don't believe that we should do this. The solution space is quite large, and it would be unsurprising if certain seeds made certain models perform really well while causing others to fail. Ironically, a standardization of seeds can increase the noise in our ability to evaluate! Instead, I think we should evaluate multiple times and place some standard error (or variance) on the results. This depends on the metric of course, but metrics that take subsets (such as FID or other sampling-based measurements) especially should have these. Unfortunately, it is not standard to report these, and doing so can also result in reviewer critique. It can also be computationally expensive (especially if we're talking about training parameters), so I wouldn't require this, but it is definitely helpful. Everyone reports the best result, just like they tend to show off the best samples. I don't think there is inherently a problem with showing off the best samples, because in practice we would select these, but I think it is problematic that reviewers make substantial evaluations based on them, as they are highly biased.
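As a minimal sketch of what reporting spread could look like: `run_metric` below is a hypothetical stand-in for one full evaluation (say, one FID computation on a fresh subset), and the numbers it returns are synthetic.

```python
import math
import random
import statistics

def run_metric(seed):
    """Hypothetical stand-in for one full evaluation run with a given seed."""
    return 10.0 + random.Random(seed).gauss(0, 0.5)

def summarize(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    sem = std / math.sqrt(len(scores))
    return mean, std, sem

scores = [run_metric(seed) for seed in range(5)]
mean, std, sem = summarize(scores)
# Report the spread next to the best run, assuming lower is better (as with FID).
print(f"{mean:.2f} ± {sem:.2f} (n={len(scores)}, best={min(scores):.2f})")
```

Reporting the best run alongside the mean and its standard error keeps the "how good can it get" number while showing how much of it is seed luck.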

Noise is inherent to ML, and rather than getting rid of it, I'd rather we embrace it. It is good to know how stable your network is to random seeds. It is good to know how good it can get. It is good to have metrics. But all these evaluations are guides, not measures. Evaluation is fucking hard, and there's no way around this. Getting lazy in evaluation just results in reward hacking/Goodhart's Law. The irony is that the laziness is built on an over-reliance on metrics, in an attempt to create meritocratic and less subjective evaluations, but this ends up introducing more noise than if we had just used metrics as a guide. There is no absolute metric; all metrics have biases and limitations, and metrics are not always perfectly aligned with our goals.

We should be clear that starting seed isn't related to what I'm talking about in the previous comment. The previous comment was exclusively about sampling and not training.