Persian ULMFIT
This notebook contains the code for fine-tuning an AWD-LSTM language model using the ULMFIT approach for sentiment analysis in Persian.
Following the previous post, pretraining a Persian AWD-LSTM language model, here we fine-tune that language model, which was trained on a Persian corpus, for sentiment analysis. I use the ULMFIT approach introduced by Jeremy Howard and Sebastian Ruder and implemented nicely in the fastai library. For more information about ULMFIT and the ideas behind it, you can refer to this post: Introducing state of the art text classification with universal language models.
First, we download the model, the tokenizer, and the data that we will use to fine-tune our language model. After running this cell, a file named ULMFIT_FA.zip will be downloaded, which includes the model, the tokenizer, and the data.
%%capture
import gdown
url = "https://drive.google.com/uc?id=1-VftZs-XxQD6KvmeNT8Io03MU1mQKylO"
output = 'ULMFIT_FA.zip'
gdown.download(url, output, quiet=False)
!unzip ULMFIT_FA.zip
!pip install -U fastai
!pip install sentencepiece
from fastai import *
from fastai.text import *
from fastai.text.all import *
import pandas as pd
import pickle
import fastai
import torch
print(f"fastai version: {fastai.__version__}")
print(f"GPU which is used : {torch.cuda.get_device_name(0)}")
## parameters for dataloader and tokenizer
lang = "fa"
backwards=False
bs=128
vocab_sz = 30000
drop_mult = 0.5
num_workers=18
## setting up the paths
base = Path(".").absolute()
print(f"our base directory: {base}")
ulmfit_dir = base / "ULMFIT_FA"
print(f"our model and data directory: {ulmfit_dir}")
lm_fns = [ulmfit_dir / f"model_out/{lang}_ULMFIT", ulmfit_dir / f"model_out/{lang}_ULMFIT_vocab"]
This is a preview of the data. I chose the Snappfood comment dataset and will use it for sentiment analysis of the comments.
df = pd.read_csv(ulmfit_dir / "snapp.csv")
print(f"shape of the data: {df.shape}")
df.sample(5)
Loading the pretrained SentencePiece model for tokenization. Then we create a dataloader for feeding the language model learner.
tok = SentencePieceTokenizer(lang="fa", max_vocab_sz=vocab_sz, sp_model=ulmfit_dir / "spm/spm.model")
dblock_lm = DataBlock(
    blocks=TextBlock.from_df('comment', is_lm=True, tok=tok, backwards=False),
    get_x=ColReader('text'))
dls_lm = dblock_lm.dataloaders(df, bs=bs)
dls_lm.show_batch(max_n=4)
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=drop_mult, pretrained=True, pretrained_fnames=lm_fns,
metrics=[accuracy, Perplexity()]).to_fp16()
Using fastai's learning rate finder. Here we plot the loss versus the learning rate. We're interested in finding a good order of magnitude for the learning rate, so we plot with a log scale. Then, we choose a value that is approximately in the middle of the sharpest downward slope.
For more information on finding a good learning rate, you can refer to this post: how do you find a good learning rate
learn.lr_find()
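If we wanted the suggestion programmatically rather than reading it off the plot, we could capture the return value of lr_find. This is just a small sketch; the exact field names of the returned namedtuple differ between fastai versions.
# Optional: capture lr_find's suggested learning rate(s) instead of eyeballing the plot.
# Recent fastai versions return e.g. SuggestedLRs(valley=...), older ones lr_min/lr_steep.
suggestion = learn.lr_find()
print(suggestion)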
Next, we fine-tune the model. By default, a pretrained Learner is in a frozen state, meaning that only the head of the model will train while the body stays frozen.
lr = 1e-3
lr *= bs/48  # scale the learning rate linearly with the batch size (48 is the reference batch size)
learn.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
We can then fine-tune the model after unfreezing:
learn.unfreeze()
learn.fit_one_cycle(6, lr, moms=(0.8,0.7,0.8))
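As a quick sanity check (not part of the original pipeline), we could sample a few words from the fine-tuned language model; the Persian prompt and the generation parameters below are just placeholder example values.
# Hypothetical sanity check: generate a short continuation from the fine-tuned LM.
print(learn.predict("غذا خیلی", n_words=20, temperature=0.75))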
Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the encoder. We can save it with save_encoder
learn.save_encoder(ulmfit_dir / 'finetuned')
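Since the classifier below needs the exact same vocabulary as the language model, it can also be handy to persist dls_lm.vocab to disk so the classifier can be rebuilt in a later session. This is an optional sketch using the already-imported pickle module; the file name is just an example.
# Optional: persist the language model vocabulary for reuse in a later session.
# The file name 'finetuned_vocab.pkl' is arbitrary.
with open(ulmfit_dir / 'finetuned_vocab.pkl', 'wb') as f:
    pickle.dump(dls_lm.vocab, f)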
Here we gather our data for text classification almost exactly like before:
dblocks_clas = DataBlock(blocks=(TextBlock.from_df('comment', tok=tok, vocab=dls_lm.vocab, backwards=backwards), CategoryBlock),
get_x=ColReader('text'),
get_y=ColReader('label_id'),
)
dls_clas = dblocks_clas.dataloaders(df, bs=bs, num_workers=num_workers)
dls_clas.show_batch(max_n=4)
The main difference is that we have to use the exact same vocabulary as when we were fine-tuning our language model, or the learned weights won't make any sense. We pass that vocabulary with vocab.
Then we can define our text classifier like before:
metrics=[accuracy,F1Score()]
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=1, pretrained=False,
metrics=metrics).to_fp16()
learn = learn.load_encoder(ulmfit_dir / 'finetuned')
learn.freeze()
learn.lr_find()
lr = 1e-3
learn.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
The last step is to train with discriminative learning rates and gradual unfreezing. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7,0.8))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7,0.8))
learn.unfreeze()
learn.fit_one_cycle(2, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7,0.8))
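Once training is done, the classifier can be used directly for inference and exported for deployment. Below is a minimal sketch; the example comment and the export file name are placeholders, not part of the original notebook.
# Predict the sentiment of a single comment; predict returns the decoded label,
# the label index, and the class probabilities.
pred_class, pred_idx, probs = learn.predict("کیفیت غذا عالی بود")
print(pred_class, probs)
# Export the whole Learner (model plus dataloaders setup) for later inference,
# then reload it with load_learner when needed.
learn.export(ulmfit_dir / 'snapp_classifier.pkl')
learn_inf = load_learner(ulmfit_dir / 'snapp_classifier.pkl')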
With this approach we got better results than ParsBERT on this dataset. Note that we did far less preprocessing compared to the ParsBERT fine-tuning.
You can check out ULMFIT in other languages in this repo: fastai_ulmfit, which is excellent work by Florian.
I really enjoyed working with fastai, especially this approach, and I hope that others contribute to this project to make it even better.
Here are other useful links: