Pre-trained Language Models (PTLM) in NLP

The main purpose behind Natural Language Processing (NLP) is to make computer systems capable of understanding and communicating in language the way a human does. NLP tasks are challenging because it is not enough to understand individual words; the context in which they are used matters just as much. Each word has a meaning, but that meaning is strongly influenced by the context in which it appears. A language model is a machine learning model that predicts the next word from the preceding part of a sentence. A pre-trained language model has been trained on a large corpus of data; through this training, the model learns the language's general rules for how words are used and how text is written. The model is then fine-tuned on a task-specific dataset.

Pretrained language model concept
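
As a rough illustration of this pretrain-then-fine-tune workflow (not from any of the papers cited below), the sketch assumes the Hugging Face transformers and datasets libraries; the checkpoint name, dataset, and training settings are placeholder choices.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

# 1. Start from a model that was pre-trained on a large unlabelled corpus.
checkpoint = "distilbert-base-uncased"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2. Continue training on a task-specific, labelled dataset (here: sentiment labels).
dataset = load_dataset("imdb")
encoded = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=1),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()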

Text-to-Text Transfer Transformer Model (T5)

T5 (Raffel et al., 2019) is Google’s state-of-the-art (SOTA) encoder-decoder model. In T5, every NLP problem is cast into a text-to-text format: the model takes text as input and produces text as output. T5 uses the same architecture as the original Transformer but adds this text-to-text framework. For all NLP tasks, the framework uses the same model, hyperparameters, and loss function; the input is phrased so that the model can recognize the task, and the expected outcome is produced as text. The model was trained on the Colossal Clean Crawled Corpus (C4) dataset.


                                     T5 model Text-to-Text Framework (Source: Raffel et al., 2019)
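
The snippet below is a small usage sketch (not taken from the paper), assuming the Hugging Face transformers library plus sentencepiece; the "summarize:" prefix is what tells T5 which text-to-text task to perform, and the model size and generation settings are arbitrary choices.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is encoded in the input text itself via a prefix such as
# "summarize:" or "translate English to German:".
text = ("summarize: Pre-trained language models learn general rules of a language "
        "from large corpora and are then fine-tuned on task-specific datasets.")
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))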

Bidirectional and Auto-Regressive Transformers (BART)

With a bidirectional encoder and a left-to-right (autoregressive) decoder, BART (Lewis et al., 2019) is a sequence-to-sequence model. For pretraining, BART combines existing and new noising techniques, such as sentence permutation and text infilling: sentences are shuffled randomly, and a novel masking scheme replaces spans of text. During pre-training, BART corrupts text with a random noise function and then learns to restore it to its original state. Besides text generation, BART also performs well on comprehension tasks. The base model has roughly 140 million parameters. The base-sized BART uses 6 layers in both the encoder and the decoder, while the large-sized model uses 12 layers in each.



                                     The BART model architecture (Source: Lewis et al., 2019)
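
To make the text-infilling idea concrete, here is a hedged sketch using the publicly released facebook/bart-large checkpoint through the Hugging Face transformers library: a span of the input is hidden behind a single <mask> token, and the pretrained model reconstructs the missing text. The example sentence is an arbitrary choice.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Text infilling: a whole span is replaced by one <mask> token.
text = "BART corrupts text with a noise function and then learns to <mask> the original text."
batch = tokenizer(text, return_tensors="pt")
generated_ids = model.generate(batch["input_ids"], max_new_tokens=40)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))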

PEGASUS

Google open-sourced PEGASUS (Zhang et al., 2019), its model for abstractive summarization, in June 2020. The name PEGASUS is an abbreviation of Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models. It is a sequence-to-sequence transformer model whose pre-training task is deliberately similar to summarization: as in an extractive summary, important sentences are removed/masked from an input document, and the model must generate them as one output sequence from the remaining sentences.


                                  Working of PEGASUS model (Source: Zhang et al., 2019)
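
The snippet below is a rough usage sketch, assuming the Hugging Face transformers library (plus sentencepiece) and the publicly released google/pegasus-xsum checkpoint, which was fine-tuned for abstractive summarization after gap-sentence pre-training; the input document is a placeholder.

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = ("Pre-trained language models are first trained on large unlabelled corpora "
            "and then fine-tuned on task-specific data such as summarization datasets.")
batch = tokenizer(document, truncation=True, return_tensors="pt")
summary_ids = model.generate(**batch, max_new_tokens=40)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))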

Bidirectional Encoder Representations from Transformers (BERT)

BERT (Devlin et al., 2018) is a stack of Transformer encoders. BERT is mainly available in two sizes: BERT base and BERT large. The base version has twelve encoder layers, a hidden size of 768, and 12 attention heads; the large version has twenty-four encoder layers, a hidden size of 1024, and 16 attention heads. BERT was trained with two objectives: masked language modelling (MLM) and next-sentence prediction (NSP). In all layers, BERT conditions on both the left and the right context, pre-training deep bidirectional representations from unlabeled text. In MLM, a word in a sentence is masked (hidden), forcing BERT to use the words on either side of the masked position to predict the hidden word. In NSP, BERT learns relationships between sentences by predicting whether one sentence follows another.
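As an illustration of the MLM objective (a sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint), the fill-mask pipeline below lets BERT predict a hidden word from the context on both sides of the [MASK] token.

from transformers import pipeline

# BERT uses the words on both sides of [MASK] to guess the hidden word.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The man went to the [MASK] to buy some milk."):
    print(prediction["token_str"], round(prediction["score"], 3))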

RoBERTa

A Robustly Optimized BERT Pretraining Approach (RoBERTa) (Liu et al., 2019) builds on BERT. It removes the next-sentence prediction pretraining objective and modifies key hyperparameters. The model is used for different NLP tasks but is applied primarily to classification tasks. RoBERTa has the same architecture as BERT; however, it uses a byte-level BPE tokenizer and a different pretraining procedure.
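The short sketch below (assuming the Hugging Face transformers library and the roberta-base checkpoint) highlights the two differences mentioned above: the byte-level BPE tokenizer, which marks word boundaries with "Ġ", and the <mask> token used instead of BERT's [MASK]. The example strings are arbitrary.

from transformers import AutoTokenizer, pipeline

# Byte-level BPE: leading spaces become part of the token, shown as "Ġ".
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(tokenizer.tokenize("Pre-trained language models"))

# RoBERTa was pre-trained with masked language modelling only (no NSP).
unmasker = pipeline("fill-mask", model="roberta-base")
print(unmasker("RoBERTa removes the next sentence <mask> objective.")[0]["token_str"])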

References:

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.48550/ARXIV.1810.04805

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L., 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. https://doi.org/10.48550/ARXIV.1910.13461

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V., 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://doi.org/10.48550/ARXIV.1907.11692

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://doi.org/10.48550/ARXIV.1910.10683

Zhang, J., Zhao, Y., Saleh, M., Liu, P.J., 2019. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. https://doi.org/10.48550/ARXIV.1912.08777
