Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Summary
The paper introduced the Transformer, a sequence-transduction architecture based entirely on attention mechanisms, dispensing with the recurrence and convolutions used by prior state-of-the-art models. By relying on multi-head self-attention, the model is more parallelizable and trains substantially faster, while achieving new state-of-the-art results on machine translation. The architecture became the foundation for subsequent large language models and much of modern deep learning.
Key findings
- Proposed multi-head self-attention as the sole mechanism for modeling dependencies, removing recurrence and convolution and enabling far greater training parallelism.
- Achieved state-of-the-art results on WMT 2014 translation, reporting 28.4 BLEU on English-to-German and 41.8 BLEU on English-to-French, the latter after 3.5 days of training on eight GPUs, a small fraction of the training cost of the best prior models.
- Introduced scaled dot-product attention and sinusoidal positional encodings within an encoder-decoder Transformer architecture, and showed that the model generalizes well to other tasks such as English constituency parsing.
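
The two building blocks named in the key findings above can be sketched in a few lines. This is a minimal, single-head illustration in plain Python, not the authors' implementation: it omits the learned query/key/value projections, multiple heads, and masking, and the function names `softmax`, `scaled_dot_product_attention`, and `positional_encoding` are chosen here for illustration. It assumes the formulas as given in the paper: attention weights are softmax(QK^T / sqrt(d_k)) applied to V, and positional encodings use sine on even dimensions and cosine on odd dimensions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (lists of floats). K and V must
    have the same number of rows; Q and K must share the same width d_k.
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output


def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pe = []
    for pos in range(num_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 10000^(2i/d_model).
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe


# Self-attention over three positions, using the encodings themselves as
# stand-in inputs (an assumption made only to keep the example short).
x = positional_encoding(3, 4)
out = scaled_dot_product_attention(x, x, x)
```

Because each output row depends only on matrix-style products over the whole sequence, all positions can be computed at once, which is the source of the training parallelism the summary describes.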
Cite this paper
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, & Illia Polosukhin (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762
@inproceedings{vaswani2017attention,
author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
title = {Attention Is All You Need},
booktitle = {Advances in Neural Information Processing Systems 30 (NeurIPS 2017)},
year = {2017},
url = {https://arxiv.org/abs/1706.03762}
}