This paper is really important: it shows that transfer learning can be applied to a wide variety of NLP problems with great success. The authors show state-of-the-art results on nearly every major class of NLP problem.
The basic approach is the same as our ULMFiT (http://nlp.fast.ai/classification/2018/05/15/introducting-ul...) model - pre-train a language model (a model that predicts the next word in a sequence) on a large corpus, and then modify the language model slightly for whatever task you wish to do (e.g. text classification). Finally, fine-tune that model using your target corpus (e.g. texts labeled with classes).
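For concreteness, here's a minimal sketch of that pipeline in PyTorch (hypothetical code, not the paper's or ULMFiT's actual implementation): pretrain a backbone to predict the next token, then reuse that backbone under a new task head and fine-tune on the labeled target data.

    import torch
    import torch.nn as nn

    class LanguageModel(nn.Module):
        """Backbone plus a head that predicts the next token at every position."""
        def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # the new paper swaps this RNN for a transformer
            self.lm_head = nn.Linear(hidden_dim, vocab_size)

        def forward(self, tokens):                        # tokens: (batch, seq_len) int64
            hidden, _ = self.encoder(self.embed(tokens))
            return self.lm_head(hidden)                   # next-token logits, (batch, seq_len, vocab)

    class TextClassifier(nn.Module):
        """Keeps the pretrained embed/encoder, swaps the LM head for a class head."""
        def __init__(self, pretrained_lm, num_classes):
            super().__init__()
            self.embed = pretrained_lm.embed              # transferred weights
            self.encoder = pretrained_lm.encoder          # transferred weights
            self.head = nn.Linear(pretrained_lm.lm_head.in_features, num_classes)  # new, random init

        def forward(self, tokens):
            hidden, _ = self.encoder(self.embed(tokens))
            return self.head(hidden[:, -1])               # classify from the final hidden state

    # Step 1: train LanguageModel on the large unlabeled corpus (next-word prediction).
    # Step 2: fine-tune TextClassifier(pretrained_lm, num_classes) on the small labeled target corpus.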
This new paper has two significant leaps over ULMFiT:
- Replace the RNN with a transformer model
- Apply it to many more types of problems.
Note that although the original language model takes them a long time to train (a month on 8 GPUs), there's almost no reason for anyone else to create their own model from scratch, except if you need to use this approach on a language that doesn't have a pre-trained model yet. The transfer learning fine-tuning doesn't take anywhere close to as long as the language model pre-training, and you can just use the existing pre-trained weights.
Is it really true that there is no reason to train your own language model? What if you have hundreds of millions of unlabeled examples of something that looks different from formal language? E.g. you're analyzing Slack messages.
I tested adding things like tweets and reddit comments to our language model, and it didn't help the target model at all, even when that used less formal language.
Note however that the fine tuning stage adapts to the target corpus - it just doesn't require starting from random weights (so it's orders of magnitude faster).
> Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.
Good guy Jeremy. Works hard during day, open sources it at night
Should be noted this idea is not that novel though - it's just replacing word vectors with a pre-trained model. Interesting that it works so well but not very surprising.
It may not be novel, but then why do we still have commercial APIs (Microsoft, Google, IBM Watson, etc.) where there is pretty much no way to "fine tune" them to your domain with a small set of supervised examples? We all know domain adaptation is a real problem.
Instead you either have to roll your own models in-house (which defeats the whole point of using a ready-made cloud solution) or deal with whatever accuracy you happen to get from those APIs.
IMHO this is an area where you can make some serious competitive headway in commoditised AI/ML. Do all the heavy lifting of pretraining and give your customers an API to "fine-tune" with. Who is currently doing this?
Wait, what is the difference between word vectors and a pre-trained model? Aren't word vectors generated by a neural network trained to predict the next word, or to recognize noise (noise-contrastive training)? What is the difference between a "pre-trained model" and the training needed to generate word vectors?
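The difference is in how much of the network gets transferred, not in how the pretraining objective looks. A hypothetical sketch (the random matrix below just stands in for real word2vec/GloVe weights):

    import torch
    import torch.nn as nn

    vocab_size, emb_dim, hidden_dim, num_classes = 10_000, 100, 256, 2

    # Word vectors: only this lookup table is pretrained. Every layer above it
    # starts from random weights and has to be learned from the labeled task data.
    embed = nn.Embedding.from_pretrained(torch.randn(vocab_size, emb_dim), freeze=False)
    encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # random init, trained from scratch
    head = nn.Linear(hidden_dim, num_classes)                  # random init

    # Pretrained language model: the embedding *and* the encoder arrive already
    # trained on the unlabeled corpus, so context-sensitive features (word order,
    # syntax, some semantics) transfer too; only the small task head is new.

So yes, word vectors also come out of a model trained with a language-modeling or noise-contrastive objective; the difference is that only the first layer of that model is kept, whereas here the whole network is kept and fine-tuned.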
This is fabulous, great work, with broad applicability.
Train this transformer model on a good amount of text (or grab a pretrained model), and then, with minimal fuss and very little tweaking, you can repurpose it to obtain state-of-the-art (or near state-of-the-art) results in a wide range of tasks, from document classification to textual entailment to semantic similarity.
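For a sense of how little task-specific machinery is involved, here is a rough sketch of the input transformations the paper describes: each task is rewritten as a single token sequence with special start/delimiter/extract tokens, and a small linear head reads off the transformer's representation of the final extract token (token names below are illustrative, not the paper's exact vocabulary).

    # Hypothetical token-level sketches of the task-specific input transformations.

    def classification_input(text_tokens):
        return ["<start>"] + text_tokens + ["<extract>"]

    def entailment_input(premise_tokens, hypothesis_tokens):
        return ["<start>"] + premise_tokens + ["<delim>"] + hypothesis_tokens + ["<extract>"]

    def similarity_inputs(text_a, text_b):
        # Similarity has no natural ordering, so both orderings are fed through
        # the transformer and their final representations are combined before the head.
        return (["<start>"] + text_a + ["<delim>"] + text_b + ["<extract>"],
                ["<start>"] + text_b + ["<delim>"] + text_a + ["<extract>"])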
This approach stands in contrast to prior methods that involve much more tweaking and/or careful discriminative finetuning of the pretrained model, such as Jeremy Howard and Sebastian Ruder's also-impressive ULMFiT.[a]
The main downside to this new approach is that pretraining takes a long time.
Anyone working on ML/DL/AI with text should take a look at this, right now.
What's really great about these OpenAI papers is that they release the source code as well. I've read so many papers where people have tried to reproduce the results and failed, and you're left wondering if they screwed up the implementation somehow or maybe didn't initialise something "just right". That's a huge amount of time wasted for no reason at all. It really should become a standard requirement that you at the very least release your code.
Another excellent example I came across recently (it also happens to be about unsupervised pretraining and transfer learning) https://github.com/bfelbo/DeepMoji
It's an absolute joy working with such papers and I suspect one of the best ways to get people to actually pay attention to your work in an era of Arxiv Sanity Preserver.
Also notable this week is https://arxiv.org/abs/1806.02847 from Google where they got an 11% improvement on Winograd schema performance using a similar idea.
> Our approach requires an expensive pre-training step - 1 month on 8 GPUs.
Wow! I enjoy playing with neural networks but this kind of thing reminds me that I'm not really doing deep learning...
I have no idea how researchers could have the patience and confidence to wait that long for a result. In my own (small-data) work, I get frustrated if it doesn't converge in half an hour.. I constantly end up Ctrl-C'ing and tweaking things if it doesn't behave as expected, or appear to be continuing to improve.
You pretty much always need at least 2 GPUs, one to keep running jobs for 30 minutes or so and debugging, and the other for longer jobs. It also takes a lot of patience to only make ONE change at a time. Often, changes you make which feel intuitive would actually hurt performance, so it's important to verify that each new change is actually improving performance.
I'm guessing they were running a bunch of experiments in parallel and killed ones that weren't yielding good results in the first few hours/days/weeks - I doubt they are wanting for GPU resources unlike me!
The critical thing to note is that although the initial language model takes a month to train on 8 GPUs, fine-tuning the language model for a specific task is much much cheaper (a few hours on a single GPU).
Very nice. I just requested the ROCStories data so I can experiment with this.
I enjoyed several talks at NAACL 2016 that referenced the ROCStories data, but didn't really have the (personal, not work) compute power to do much. OpenAI's nice contribution fixes that.
Fascinating. I've used my own hacked-together unsupervised learning to facilitate topic labeling, and this seems to address the bottleneck problem for supervised learning. Will have to dig in deeper though to pick out specifics.
Also interesting that one of the fundamental problems the authors note is "The limits and bias of learning about the world through text", which is essentially a Gödelian incompleteness problem. One could say the reverse also applies to embodied/visual data, and it's a good argument for studying established literature in the abstract.
The previous HN discussion on ULMFit may also be of interest: https://news.ycombinator.com/item?id=17076222