I have a question about trade secrets and working in machine learning. How sho...

sometimesijust · on Dec 22, 2018

Protect the data, not the model. You need to do it anyway, and it has the side benefit of making overfitting more difficult. Protecting the model is impossible anyway, since most advancements tend to come from trying lots of publicly known techniques and discovering that a particular combination works best for your data. Once one knows which techniques those are, there's nothing stopping a competent engineer from reimplementing them at another company.

GuiA · on Dec 22, 2018

Indeed. For commercially useful applications, collecting the data, labeling it, etc, costs orders of magnitude more than a team of PhDs building the models.

trott · on Dec 22, 2018

> For commercially useful applications, collecting the data, labeling it, etc, costs orders of magnitude more than a team of PhDs building the models.

I don't think it's typical. For example, JFT has 350e6 images, and it probably cost ~$35M to hand-label, but Google has paid people far in excess of that to work on image classification.

wil421 · on Dec 22, 2018

Google doesn’t even have to pay people. Anyone who has picked out cars or fire hydrants in their reCAPTCHAs is helping label their dataset.

trott · on Dec 22, 2018

JFT has 17K classes. I'm assuming that they used specialized experts to tell them apart (dog breeds, plant and animal species, etc.)

wil421 · on Dec 22, 2018

Thanks.

From Google:

>Of course, the elephant in the room is where can we obtain a dataset that is 300x larger than ImageNet? At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses complex mixture of raw web signals, connections between web-pages and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.

https://ai.googleblog.com/2017/07/revisiting-unreasonable-ef...

sangnoir · on Dec 22, 2018

That doesn't sound like reCAPTCHA: it's more likely that they label the pictures that N (or n%) of people click after searching for "Golden Retriever" in image search (as the "raw web signal").

ndnxhs · on Dec 22, 2018

Pay your employees enough that they never want to leave. If you can't afford that, then you just have to accept that you don't own the people who work for you.

baybal2 · on Dec 22, 2018

If you are even bothering with this question, I believe you have nothing genuinely worth protecting.

Nowadays, the higher up the tech chain companies are, the less they really have to protect.

TSMC could easily dump each and every one of their "family jewels" onto the Internet, and I guarantee that no mainland fab would ever manage to extract any value from that.

P.S. Stay away from the whole M.L. space; it is filled with outright frauds, pump-and-dumpers, and people seeking to sell their companies at the first opportunity. If you are an established professional, you do yourself a disservice by working in it.