This paper is really important: it shows that transfer learning can be applied to a wide variety of NLP problems with great success. The authors show state-of-the-art results on nearly every major class of NLP problem.
The basic approach is the same as our ULMFiT (http://nlp.fast.ai/classification/2018/05/15/introducting-ul...) model - pre-train a language model (a model that predicts the next word in a sequence) on a large corpus, and then modify the language model slightly for whatever task you wish to do (e.g. text classification). Finally, fine-tune that model using your target corpus (e.g. texts labeled with classes).
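For concreteness, here's a minimal sketch of that pipeline in PyTorch (hypothetical code, not the paper's or ULMFiT's actual implementation): pretrain a backbone to predict the next token, then reuse that backbone under a new task head and fine-tune on the labeled target data.

    import torch
    import torch.nn as nn

    class LanguageModel(nn.Module):
        """Backbone plus a head that predicts the next token at every position."""
        def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # the new paper swaps this RNN for a transformer
            self.lm_head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):                        # tokens: (batch, seq_len) int64
            hidden, _ = self.encoder(self.embed(tokens))
            return self.lm_head(hidden)                   # next-token logits, (batch, seq_len, vocab)

    class TextClassifier(nn.Module):
        """Keeps the pretrained embed/encoder, swaps the LM head for a class head."""
        def __init__(self, pretrained_lm, num_classes):
            super().__init__()
            self.embed = pretrained_lm.embed              # transferred weights
            self.encoder = pretrained_lm.encoder          # transferred weights
            self.head = nn.Linear(pretrained_lm.lm_head.in_features, num_classes)  # new, random init

        def forward(self, tokens):
            hidden, _ = self.encoder(self.embed(tokens))
            return self.head(hidden[:, -1])               # classify from the final hidden state

    # Step 1: train LanguageModel on the large unlabeled corpus (next-word prediction).
    # Step 2: fine-tune TextClassifier(pretrained_lm, num_classes) on the small labeled target corpus.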
This new paper has two significant leaps over ULMFiT:
- Replace the RNN with a transformer model
- Apply it to many more types of problems.
Note that although the original language model takes them a long time to train (a month on 8 GPUs), there's almost no reason for anyone else to create their own model from scratch, except if you need to use this approach on a language that doesn't have a pre-trained model yet. The transfer learning fine-tuning doesn't take anywhere close to as long as the language model pre-training, and you can just use the existing pre-trained weights.
Is it really true that there is no reason to train your own language model? What if you have hundreds of millions of unlabeled examples of something that looks different from formal language? E.g. you're analyzing Slack messages.
I tested adding things like tweets and reddit comments to our language model, and it didn't help the target model at all, even when that used less formal language.
Note however that the fine tuning stage adapts to the target corpus - it just doesn't require starting from random weights (so it's orders of magnitude faster).
> Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.
Good guy Jeremy. Works hard during day, open sources it at night
Should be noted this idea is not that novel though - it's just replacing word vectors with a pre-trained model. Interesting that it works so well but not very surprising.
It may not be novel, but then why do we still have commercial APIs (Microsoft, Google, IBM Watson, etc.) where there is pretty much no way to "fine tune" them to your domain with a small set of supervised examples? We all know domain adaptation is a real problem.
Instead you either have to roll your own models in-house (which defeats the whole point of using a ready-made cloud solution) or deal with whatever accuracy you happen to get from those APIs.
IMHO this is an area where you can make some serious competitive headway in commoditised AI/ML. Do all the heavy lifting of pretraining and give your customers an API to "fine-tune" with. Who is currently doing this?
Wait, what is the difference between word vectors and a pre-trained model? Aren't word vectors generated by a neural network trained to predict the next word, or to recognize noise (noise-contrastive training)? What is the difference between a "pre-trained model" and the training needed to generate word vectors?
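The difference is in how much of the network gets transferred, not in how the pretraining objective looks. A hypothetical sketch (the random matrix below just stands in for real word2vec/GloVe weights):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hidden_dim, num_classes = 10_000, 100, 256, 2

    # Word vectors: only this lookup table is pretrained. Every layer above it
    # starts from random weights and has to be learned from the labeled task data.
    embed = nn.Embedding.from_pretrained(torch.randn(vocab_size, emb_dim), freeze=False)
    encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # random init, trained from scratch
    head = nn.Linear(hidden_dim, num_classes)                  # random init

    # Pretrained language model: the embedding *and* the encoder arrive already
    # trained on the unlabeled corpus, so context-sensitive features (word order,
    # syntax, some semantics) transfer too; only the small task head is new.

So yes, word vectors also come out of a model trained with a language-modeling or noise-contrastive objective; the difference is that only the first layer of that model is kept, whereas here the whole network is kept and fine-tuned.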
This is fabulous, great work, with broad applicability.
Train this transformer model on a good amount of text (or grab a pretrained model), and then, with minimal fuss and very little tweaking, you can repurpose it to obtain state-of-the-art (or near state-of-the-art) results in a wide range of tasks, from document classification to textual entailment to semantic similarity.
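For a sense of how little task-specific machinery is involved, here is a rough sketch of the input transformations the paper describes: each task is rewritten as a single token sequence with special start/delimiter/extract tokens, and a small linear head reads off the transformer's representation of the final extract token (token names below are illustrative, not the paper's exact vocabulary).

    # Hypothetical token-level sketches of the task-specific input transformations.

    def classification_input(text_tokens):
        return ["<start>"] + text_tokens + ["<extract>"]

    def entailment_input(premise_tokens, hypothesis_tokens):
        return ["<start>"] + premise_tokens + ["<delim>"] + hypothesis_tokens + ["<extract>"]

    def similarity_inputs(text_a, text_b):
        # Similarity has no natural ordering, so both orderings are fed through
        # the transformer and their final representations are combined before the head.
        return (["<start>"] + text_a + ["<delim>"] + text_b + ["<extract>"],
                ["<start>"] + text_b + ["<delim>"] + text_a + ["<extract>"])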
This approach stands in contrast to prior methods that involve much more tweaking and/or careful discriminative finetuning of the pretrained model, such as Jeremy Howard and Sebastian Ruder's also-impressive ULMFiT.[a]
The main downside to this new approach is that pretraining takes a long time.
Anyone working on ML/DL/AI with text should take a look at this, right now.
What's really great about these OpenAI papers is that they release the source code as well. I've read so many papers where people have tried to reproduce the results and failed, and you're left wondering if they screwed up the implementation somehow or maybe didn't initialise something "just right". That's a huge amount of time wasted for no reason at all. It really should become a standard requirement that you at the very least release your code.
Another excellent example I came across recently (it also happens to be about unsupervised pretraining and transfer learning) https://github.com/bfelbo/DeepMoji
It's an absolute joy working with such papers and I suspect one of the best ways to get people to actually pay attention to your work in an era of Arxiv Sanity Preserver.
Also notable this week is https://arxiv.org/abs/1806.02847 from Google where they got an 11% improvement on Winograd schema performance using a similar idea.
> Our approach requires an expensive pre-training step - 1 month on 8 GPUs.
Wow! I enjoy playing with neural networks but this kind of thing reminds me that I'm not really doing deep learning...
I have no idea how researchers could have the patience and confidence to wait that long for a result. In my own (small-data) work, I get frustrated if it doesn't converge in half an hour.. I constantly end up Ctrl-C'ing and tweaking things if it doesn't behave as expected, or appear to be continuing to improve.
You pretty much always need at least 2 GPUs, one to keep running jobs for 30 minutes or so and debugging, and the other for longer jobs. It also takes a lot of patience to only make ONE change at a time. Often, changes you make which feel intuitive would actually hurt performance, so it's important to verify that each new change is actually improving performance.
I'm guessing they were running a bunch of experiments in parallel and killed ones that weren't yielding good results in the first few hours/days/weeks - I doubt they are wanting for GPU resources unlike me!
The critical thing to note is that although the initial language model takes a month to train on 8 GPUs, fine-tuning the language model for a specific task is much much cheaper (a few hours on a single GPU).
Very nice. I just requested the ROCStories data so I can experiment with this.
I enjoyed several talks at NAACL 2016 that referenced the ROCStories data, but didn't really have the (personal, not work) compute power to do much. OpenAI's nice contribution fixes that.
Fascinating. I've used my own hacked-together unsupervised learning to facilitate topic labeling, and this seems to address the bottleneck problem for supervised learning. Will have to dig in deeper though to pick out specifics.
Also interesting that one of the fundamental problems the authors note is "The limits and bias of learning about the world through text", which is essentially a Gödelian incompleteness problem. One could say the reverse also applies to embodied/visual data, and it's a good argument for studying established literature in the abstract.
The previous HN discussion on ULMFit may also be of interest: https://news.ycombinator.com/item?id=17076222