Estonian ULMFIT
This notebook contains the code for fine-tuning an AWD-LSTM language model with the ULMFIT approach and using it for text classification in Estonian.
In this post, I want to finetune our language model and do text classification with the ULMFIT approach for Estonian. A note on the language model:
Here I use an AWD-LSTM language model that I trained beforehand. The procedure is almost the same as in Pretraining Persian AWD-LSTM Language model, except for the data. I used the OSCAR dataset; after cleaning, I had around 200k articles (~800 MB). Training took about 15 hours for 10 epochs on an AWS p3 instance. Here are the metrics for the last epoch:
| epoch | train_loss | valid_loss | accuracy | perplexity |
|---|---|---|---|---|
| 9 | 4.3451 | 4.3609 | 0.29820 | 78.3298 |
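For context, here is a minimal sketch of what that pretraining run could have looked like with fastai; the corpus dataframe, batch size, and file names below are illustrative assumptions, not the exact script used.
from fastai.text.all import *
import pandas as pd
import pickle
# Assumption: the cleaned OSCAR articles sit in a dataframe with a 'text' column.
oscar_df = pd.read_csv("oscar_et_clean.csv")  # hypothetical file name
tok = SentencePieceTokenizer(lang="et", max_vocab_sz=30000)
dls = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok),
    get_x=ColReader('text')
).dataloaders(oscar_df, bs=128)
learn = language_model_learner(dls, AWD_LSTM, pretrained=False,
                               metrics=[accuracy, Perplexity()]).to_fp16()
learn.fit_one_cycle(10, 3e-3)  # roughly 15 hours on a p3 instance for this corpus
# Save the weights and vocabulary in the format expected later by pretrained_fnames.
learn.to_fp32().save("et_ULMFIT", with_opt=False)
with open(learn.path / learn.model_dir / "et_ULMFIT_vocab.pkl", "wb") as f:
    pickle.dump(learn.dls.vocab, f)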
For fine-tuning, I couldn't find a standard dataset for benchmarking, so I crawled some news articles (around 2700) for the sake of demonstration. Maybe in the future I will gather such a dataset that can also be useful for other tasks. The classes in the data are:
- poliitika (politics)
- kultuur (culture)
- majandus (economy)
Alright, let's get started.
First, we download the model, tokenizer, and the data that we will use to finetune our language model. After running this cell, a file named ULMFIT_ET.zip will be downloaded; it contains the model, tokenizer, and the data.
%%capture
import gdown
url = "https://drive.google.com/uc?id=1yWUixE3SpALPtaJjeHdCDIg5bAkhGbz0"
output = 'ULMFIT_ET.zip'
gdown.download(url, output, quiet=False)
!unzip ULMFIT_ET.zip
!pip install -U fastai
!pip install sentencepiece
from fastai.text.all import *
import pandas as pd
import pickle
import fastai
import torch
print(f"fastai version: {fastai.__version__}")
print(f"GPU which is used : {torch.cuda.get_device_name(0)}")
## parameters for dataloader and tokenizer
lang = "et"
backwards=False
bs=128
vocab_sz = 30000
drop_mult = 0.5
num_workers=18
## setting up the paths
base = Path(".").absolute()
print(f"our base directory: {base}")
ulmfit_dir = base / "ULMFIT_ET"
print(f"our model and data directory: {ulmfit_dir}")
lm_fns = [ulmfit_dir / f"model_out/{lang}_ULMFIT", ulmfit_dir / f"model_out/{lang}_ULMFIT_vocab"]
df = pd.read_csv(ulmfit_dir / "data_finetune.csv")
print(f"shape of the data: {df.shape}")
df.sample(5)
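Before building the dataloaders, it's worth glancing at the class balance; a quick check, assuming the label column is named label, as it is used later for the classifier.
# Class distribution of the crawled news articles (column name 'label' assumed).
print(df['label'].value_counts())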
We use the pretrained SentencePiece model for tokenization, then create the DataLoaders that feed the language model learner.
tok = SentencePieceTokenizer(lang="et", max_vocab_sz=vocab_sz, sp_model=ulmfit_dir / "spm/spm.model")
dblock_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, tok=tok, backwards=backwards),
    get_x=ColReader('text'))
dls_lm = dblock_lm.dataloaders(df, bs=bs)
dls_lm.show_batch(max_n=4)
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=drop_mult, pretrained=True, pretrained_fnames=lm_fns,
metrics=[accuracy, Perplexity()]).to_fp16()
We use fastai's learning rate finder. Here we plot the loss versus the learning rate; since we're interested in finding a good order of magnitude for the learning rate, we plot with a log scale. We then choose a value that is approximately in the middle of the sharpest downward slope.
For more information on finding a good learning rate, you can refer to this post: how do you find a good learning rate.
learn.lr_find()
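Recent fastai versions can also return explicit suggestions from the finder instead of leaving you to read them off the plot; a small variation, assuming the suggestion functions valley and steep are available in your fastai version.
# Ask lr_find for explicit learning-rate suggestions rather than eyeballing the plot.
suggestions = learn.lr_find(suggest_funcs=(valley, steep))
print(suggestions)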
Next, we finetune the model. By default, a pretrained Learner is in a frozen state, meaning that only the head of the model will train while the body stays frozen.
lr = 2e-3
learn.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
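If you want to verify that only the head is trainable at this point, a small sanity check (not part of the original workflow) is to count the parameters that still have requires_grad set.
# Count trainable vs. total parameters; with the body frozen, only the head should train.
trainable = sum(p.numel() for p in learn.model.parameters() if p.requires_grad)
total = sum(p.numel() for p in learn.model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")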
We can then fine-tune the whole model after unfreezing:
learn.unfreeze()
learn.fit_one_cycle(7, lr, moms=(0.8,0.7,0.8))
learn.recorder.plot_loss()
According to this plot, we have some overfitting; obviously, we would need more data to finetune our language model properly. Another thing to mention is the notable gap between the training and validation loss. Here is an interesting point that I borrowed from Jeremy Howard in this thread, and I quote:
Funnily enough, some over-fitting is nearly always a good thing. All that matters in the end is: is the validation loss as low as you can get it (and/or the val accuracy as high)? This often occurs when the training loss is quite a bit lower.
So it's a good idea to keep training and watch for the point where the validation loss starts to grow; that becomes our stopping threshold. This is not ideal, but when we can't get more data, we have to make some compromises, as we did here. Another option is data augmentation; I may experiment with that in the future and share the results.
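If you prefer to automate that stop-when-validation-loss-grows rule instead of watching it by hand, fastai's tracker callbacks can do it; here is a sketch of how the unfrozen training could look with them (callback names as in recent fastai versions).
# Keep the best checkpoint and stop once valid_loss has not improved for 2 epochs.
learn.unfreeze()
learn.fit_one_cycle(
    10, lr, moms=(0.8, 0.7, 0.8),
    cbs=[
        SaveModelCallback(monitor='valid_loss', fname='best_lm'),
        EarlyStoppingCallback(monitor='valid_loss', patience=2),
    ],
)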
Once that's done, we save all of our model except the final layer, which converts activations into probabilities of picking each token in our vocabulary. The model without that final layer is called the encoder, and we can save it with save_encoder:
learn.save_encoder(ulmfit_dir / 'finetuned')
Here we gather our data for text classification almost exactly like before:
dblocks_clas = DataBlock(blocks=(TextBlock.from_df('text', tok=tok, vocab=dls_lm.vocab, backwards=backwards), CategoryBlock),
get_x=ColReader('text'),
get_y=ColReader('label'),
)
dls_clas = dblocks_clas.dataloaders(df, bs=bs, num_workers=num_workers)
dls_clas.show_batch(max_n=4)
The main difference is that we have to use the exact same vocabulary as when we were fine-tuning our language model, or the learned weights won't make any sense. We pass that vocabulary with vocab.
Then we can define our text classifier like before:
metrics=[accuracy,F1Score(average="macro")]
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=1, pretrained=False,
metrics=metrics).to_fp16()
learn = learn.load_encoder(ulmfit_dir / 'finetuned')
learn.freeze()
learn.lr_find()
lr = 3e-3
learn.fit_one_cycle(1, lr, moms=(0.8,0.7,0.8))
The last step is to train with discriminative learning rates and gradual unfreezing. In computer vision, we often unfreeze the model all at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(lr/(2.6**4),lr), moms=(0.8,0.7,0.8))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7,0.8))
learn.unfreeze()
learn.fit_one_cycle(2, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7,0.8))
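To see where the classifier still struggles, fastai's ClassificationInterpretation gives a confusion matrix and the examples with the highest loss (plot_top_losses also works for text in recent versions).
# Inspect per-class behaviour of the finetuned classifier.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(5)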
We still get good results, given that our fine-tuning dataset was small. Maybe others will contribute to this project to make it even better.
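Finally, the trained classifier can be exported and used for inference on new text; a quick sketch, where the file name and example sentence are only illustrative.
# Convert back to fp32 for CPU-friendly inference, then export the whole pipeline.
learn.to_fp32().export(ulmfit_dir / 'et_news_classifier.pkl')
# Later, e.g. in a serving environment:
inf_learn = load_learner(ulmfit_dir / 'et_news_classifier.pkl')
pred_class, pred_idx, probs = inf_learn.predict("Valitsus arutas täna uut eelarvet.")
print(pred_class, probs)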
You can check out ULMFIT in other languages in this repo: fastai_ulmfit, which is excellent work by Florian.
Here are some other useful links: