BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Summary
BERT is a language representation model pre-trained on large unlabeled corpora using masked language modeling and next-sentence prediction, yielding deeply bidirectional contextual representations. The pre-trained model can be fine-tuned with a single additional output layer to achieve strong performance across diverse downstream tasks. It set new state-of-the-art results on eleven NLP tasks at the time of publication.
Key findings
- Introduced masked language modeling to enable jointly conditioning on left and right context (true bidirectionality)
- A single pre-trained model, fine-tuned separately for each task, achieved state-of-the-art results on 11 NLP tasks: the eight GLUE tasks (including MultiNLI), SQuAD v1.1, SQuAD v2.0, and SWAG
- Pushed the GLUE score to 80.5% (a 7.7-point absolute improvement) and SQuAD v1.1 test F1 to 93.2 (a 1.5-point absolute improvement)
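The masked language modeling objective named in the findings above can be illustrated with a minimal sketch. It assumes the corruption scheme described in the paper: about 15% of token positions are selected for prediction, and a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens`, its parameters, and the per-token sampling are hypothetical choices made for illustration and are not taken from the authors' code.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict and None everywhere else.
    Assumption: each position is selected independently with
    probability select_prob, which only approximates choosing
    exactly 15% of the positions.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never corrupt special tokens, and skip unselected positions.
        if token in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
    print(corrupted)
    print(labels)
```

Because the model sees tokens on both sides of each corrupted position, predicting the original token requires conditioning on left and right context together, which is the sense in which the resulting representations are bidirectional.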
Cite this paper
Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT 2019. https://doi.org/10.18653/v1/N19-1423
@inproceedings{devlin2019bert,
author = {Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova},
title = {BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
booktitle = {Proceedings of NAACL-HLT 2019},
year = {2019},
doi = {10.18653/v1/N19-1423},
url = {https://doi.org/10.18653/v1/N19-1423}
}