
I have a question about trade secrets and working in machine learning. How should an ML company go about hiring people and protecting its tech? I once interviewed at a company that claimed above-state-of-the-art results on some datasets (they actually weren't), and they wanted me to sign a contract forbidding me from ever working for their current and future clients (that's not enforceable). In ML you usually don't need a flash drive: you learn enough about what makes your team achieve great results that you can go to another company and use what you learned, and it seems impossible to prove that you "stole trade secrets" from the previous company.



Protect the data, not the model. You need to do that anyway, and it has the side benefit of making overfitting more difficult. Protecting the model is impossible anyway, since most advancements tend to come from trying lots of publicly known techniques and discovering that a particular combination works best for your data. Once someone knows which techniques those are, there's nothing stopping a competent engineer from reimplementing them at another company.


Indeed. For commercially useful applications, collecting the data, labeling it, etc, costs orders of magnitude more than a team of PhDs building the models.


> For commercially useful applications, collecting the data, labeling it, etc, costs orders of magnitude more than a team of PhDs building the models.

I don't think it's typical. For example, JFT has 350e6 images, and it probably cost ~$35M to hand-label, but Google has paid people far in excess of that to work on image classification.
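
As a rough check on those figures, here is a back-of-the-envelope sketch in Python. The image count and labeling cost are the estimates from the comment above, not published numbers, and the cost per engineer-year is purely a hypothetical value picked for illustration.

```python
def cost_per_image(total_cost_usd: float, num_images: float) -> float:
    """Implied labeling cost for a single image, in USD."""
    return total_cost_usd / num_images


def engineer_years(budget_usd: float, cost_per_engineer_year_usd: float) -> float:
    """How many engineer-years the same budget would pay for."""
    return budget_usd / cost_per_engineer_year_usd


# Rough estimates taken from the comment above.
NUM_IMAGES = 350e6
LABELING_COST_USD = 35e6

# Hypothetical fully loaded cost of one engineer for one year (an assumption).
ASSUMED_COST_PER_ENGINEER_YEAR_USD = 500_000

if __name__ == "__main__":
    per_image = cost_per_image(LABELING_COST_USD, NUM_IMAGES)
    years = engineer_years(LABELING_COST_USD, ASSUMED_COST_PER_ENGINEER_YEAR_USD)
    print(f"Implied cost per image: ${per_image:.2f}")
    print(f"Same budget in engineer-years: {years:.0f}")
```

Under that assumed salary figure, the guessed labeling budget covers only a few dozen engineer-years, which is the comparison the comment is making.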


Google doesn’t even have to pay people. Anyone who has picked out cars or fire hydrants in a reCAPTCHA is helping label their dataset.


JFT has 17K classes. I'm assuming that they used specialized experts to tell them apart (dog breeds, plant and animal species, etc.)


Thanks.

From Google:

>Of course, the elephant in the room is where can we obtain a dataset that is 300x larger than ImageNet? At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses complex mixture of raw web signals, connections between web-pages and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.

https://ai.googleblog.com/2017/07/revisiting-unreasonable-ef...


That doesn't sound like reCAPTCHA: it's more likely that the "raw web signal" is clicks, i.e. they label the pictures that N (or n%) of people click after searching for "Golden Retriever" in image search.


Pay your employees enough that they never want to leave. If you can't afford that then you just have to accept that you don't own the people who work for you.


If you are even bothering with this question, I believe you have nothing genuinely worth protecting.

Nowadays, the higher up the tech chain a company is, the less it really has to protect.

TSMC could easily dump each and every one of their "family jewels" onto the Internet, and I guarantee that no mainland fab would ever manage to extract any value from them.

P.S. Stay away from the whole M.L. space: it is filled with plain frauds, pump-and-dumpers, and people seeking to sell their companies at the first opportunity. If you are an established professional, you do yourself a disservice by working in it.



