When we did search relevance experimentation at Shopify, we made lots of mistakes, so I can empathize with the authors. I've had plenty of my own public screw-ups.
At the end of my time at Shopify I learned good science requires good software engineering. It’s really easy to make mistakes at so many places in the stack.
We spent a lot of time building rigorous, heavily tested, high-quality software for our experiments so we could trust our numbers and reproduce each other's experiments. We discouraged one-off evaluation methods; if we did create a new one, we added it to our suite and tested the metric to understand what it actually measured.
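To make "testing the metric" concrete, here's a minimal sketch (not our actual code, and the function name is hypothetical) of what that can look like for a common relevance metric: a small NDCG implementation, using the linear-gain formulation, pinned down by sanity checks that document what the numbers mean.

```python
import math

def ndcg_at_k(relevances, k):
    """Normalized DCG at rank k for a list of graded relevance labels,
    in the order the ranker returned them (linear gain variant)."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Sanity tests that pin down what the metric means:
assert ndcg_at_k([3, 2, 1], 3) == 1.0   # perfect ordering scores exactly 1
assert ndcg_at_k([0, 0, 0], 3) == 0.0   # no relevant results scores 0
assert ndcg_at_k([1, 3, 2], 3) < 1.0    # any inversion is penalized
```

Tests like these are cheap, and they catch the silent failure modes (off-by-one rank discounting, unnormalized scores, zero-division on empty result sets) that otherwise corrupt every experiment built on top of the metric.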
It seems obvious, but in my experience it's sadly not as common as I wish it were in this kind of experimentation. Companies want velocity, and thinking deeply about statistics and building internal tools are not priorities for most higher-ups.
> At the end of my time at Shopify I learned good science requires good software engineering.
This is a positive side of industry research. First, you tend to have more software engineering expertise available, and second, you have more of an incentive not to exaggerate your claims: if you say it works, you'll be expected to put it into production.