Since BERT is pretrained on masked word prediction (masked language modelling) and next sentence prediction, I'd suggest roughly a million sentences from the same domain, averaging about 7 tokens per sentence at minimum. Longer sentences should help: BERT's transformer encoder relies on multi-head self-attention, so more surrounding tokens per sequence give the model richer context and better contextual representations to learn.
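
If it helps, here is a minimal sketch of what that pretraining setup could look like with the Hugging Face `transformers` and `datasets` libraries. The file name `domain_corpus.txt` (one sentence per line) and all model sizes and hyperparameters are placeholder assumptions, and only the masked-language-modelling objective is shown; the next-sentence-prediction head is omitted for brevity.

```python
# Sketch: MLM pretraining of a small BERT-style model on a domain corpus.
# Paths, model size, and hyperparameters are illustrative, not a recommended recipe.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Reuse the standard BERT WordPiece vocabulary; training a domain-specific
# tokenizer is another option if the domain vocabulary differs a lot.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# ~1M domain sentences, one per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    # Truncate to a modest max length; longer sequences give the self-attention
    # layers more context at the cost of memory and compute.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens for the masked-word-prediction objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Deliberately small configuration; scale up if compute allows.
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

training_args = TrainingArguments(
    output_dir="bert-domain-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=1e-4,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```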