How to use HuggingFace datasets library
This post is a quick HOWTO for the HuggingFace datasets library, covering preprocessing and batch processing, with persian_news_dataset as the running example.
The datasets library is one of the best ways to work with data at any scale, and it's the most efficient and easy-to-use tool I have seen for the job. According to the docs, these are the three main features of the library:
It provides a very efficient way to load and process data from raw files (CSV/JSON/text) or in-memory data (Python dict, pandas DataFrame), with a special focus on memory efficiency and speed. For example, loading an 18 GB dataset like English Wikipedia allocates only 9 MB in RAM, and you can iterate over the dataset at 1-2 GBit/s in Python.
It provides a very simple way to access and share datasets with the research and practitioner communities (over 1,000 datasets are already accessible in one line with the library as we’ll see below).
It was designed with a particular focus on interoperability with frameworks like pandas, NumPy, PyTorch and TensorFlow.
All the datasets are available at the HuggingFace Dataset Hub.
%%capture
!pip install datasets
!pip install -q hazm
!pip install -q clean-text[gpl]
!pip install git+https://github.com/huggingface/transformers.git
!pip install tokenizers
import pandas as pd
from datasets import load_dataset
import re
import hazm
from cleantext import clean
Here we use persian_news_dataset, which is a collection of 5M news articles.
dataset = load_dataset("RohanAiLab/persian_news_dataset")
Alternatively, you can load just 5% of the data (or any other amount) if you are low on resources.
sub_dataset = load_dataset("RohanAiLab/persian_news_dataset", split="train[:5%]")
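The split string accepts other slices as well; a minimal sketch with illustrative values (the slice sizes here are arbitrary, not recommendations):
first_1000 = load_dataset("RohanAiLab/persian_news_dataset", split="train[:1000]")    # first 1000 rows
middle_slice = load_dataset("RohanAiLab/persian_news_dataset", split="train[50%:60%]")  # a 10% slice from the middle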
Now let's look at some useful datasets features that come in very handy when you want to train models, especially at large scale.
shuffling and taking a subsample of the data
Here we take just 200 news articles for the sake of demonstration and run our experiments on this subsample.
shuffled_dataset = dataset.shuffle(seed=42)
small_dataset = shuffled_dataset["train"].select(range(200))
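If you also want a held-out set from this subsample, Datasets can split it for you; a minimal sketch with illustrative sizes:
# split the 200-example subsample into 90% train / 10% test
small_splits = small_dataset.train_test_split(test_size=0.1, seed=42)
small_splits["train"], small_splits["test"]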
accessing the data
We can access the data by indexing the relevant split (train, test, validation, ...). We can also turn it into a pandas DataFrame, but as you'll see there is usually no need for that, since the datasets library already provides great functionality.
dataset["train"][0:5]['text']
df = pd.DataFrame(small_dataset);df.head()
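As an alternative to building the DataFrame by hand, the dataset itself can be asked to return pandas objects when indexed; a minimal sketch:
small_dataset.set_format("pandas")
small_dataset[:5]             # a pandas DataFrame of the first 5 rows
small_dataset.reset_format()  # switch back to plain Python objects for the rest of the post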
filtering the data
We can easily filter the data by any characteristic we want and obtain the desired subsample of the dataset.
print(f"dataset length before dropping the texts that had len(text)<600: {len(small_dataset)}")
data_600_len = small_dataset.filter(lambda x: len(x["text"])>600)
print(f"dataset length after dropping the texts that had len(text)<600: {len(data_600_len)}")
using map
According to the docs:
you can use map to apply a processing function to each example in a dataset, independently or in batch and even generate new rows or columns.
This becomes especially handy when we want to preprocess our dataset. Here we apply a preprocessing function to small_dataset; the function is borrowed from this notebook: Taaghche Sentiment Analysis
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
def cleaning(text):
    text = text.strip()

    # regular cleaning
    text = clean(text,
                 fix_unicode=True,
                 to_ascii=False,
                 lower=True,
                 no_line_breaks=True,
                 no_urls=True,
                 no_emails=True,
                 no_phone_numbers=True,
                 no_numbers=False,
                 no_digits=False,
                 no_currency_symbols=True,
                 no_punct=False,
                 replace_with_url="",
                 replace_with_email="",
                 replace_with_phone_number="",
                 replace_with_number="",
                 replace_with_digit="0",
                 replace_with_currency_symbol="",
                 )

    # cleaning htmls
    text = cleanhtml(text)

    # normalizing
    normalizer = hazm.Normalizer()
    text = normalizer.normalize(text)

    # removing weird patterns
    weird_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u'\U00010000-\U0010ffff'
                               u"\u200d"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\u3030"
                               u"\ufe0f"
                               u"\u2069"
                               u"\u2066"
                               # u"\u200c"  # half spaces
                               u"\u2068"
                               u"\u2067"
                               "]+", flags=re.UNICODE)
    text = weird_pattern.sub(r'', text)

    # removing extra spaces, hashtags
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)
    return text
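A quick sanity check of cleaning on a made-up toy string, just to show the URL, HTML and whitespace handling (any Persian article from the dataset goes through exactly the same path):
cleaning("some <b>sample</b> text   with a url https://example.com and a #hashtag")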
This function is used to update the dataset with our custom cleaning function via map:
def clean_map(example):
    example["text"] = cleaning(example["text"])
    return example
clean_dataset = small_dataset.map(clean_map)
print(clean_dataset[1]["text"])
do processing in batches
according to docs:
This is particularly interesting if you have a mapped function which can efficiently handle batches of inputs like the tokenizers of the fast HuggingFace tokenizers library.
Here we do some toy tokenization with BPE for the sake of demonstration, using the GPT-2 config for the tokenizer.
from pathlib import Path
from tokenizers import trainers, ByteLevelBPETokenizer
from transformers import AutoConfig, AutoTokenizer
Here we download the config file and save it to the directory where we want to save our pretrained tokenizer.
model_config = "gpt2"
model_dir = model_config + "pretrained-fa"
Path(model_dir).mkdir(parents=True, exist_ok=True)
Having imported the ByteLevelBPETokenizer, we instantiate it,
tokenizer = ByteLevelBPETokenizer()
config = AutoConfig.from_pretrained("gpt2")
config.save_pretrained(f"{model_dir}")
define a training iterator,
def batch_iterator(batch_size=1000):
    for i in range(0, len(clean_dataset), batch_size):
        yield clean_dataset[i: i + batch_size]["text"]
## training a toy tokenizer
tokenizer.train_from_iterator(batch_iterator(), vocab_size=config.vocab_size, min_frequency=2, special_tokens=["<|endoftext|>"])
tokenizer.save(f"{model_dir}/tokenizer.json")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
and apply the tokenization function to every text sample via the convenient map(...) function of Datasets. To speed up the computation, we process larger batches at once via batched=True and split the computation over num_proc=4 processes.
def tokenize_function(examples):
    return tokenizer(examples["text"])
tokenized_datasets = clean_dataset.map(tokenize_function, batched=True, num_proc=4);tokenized_datasets
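If these tokenized examples are headed for a PyTorch model, the interoperability mentioned at the top comes in handy; a minimal sketch (assumes torch is installed):
# ask the dataset to hand back PyTorch tensors for the token columns
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask"])
tokenized_datasets[0]["input_ids"]   # now a torch.Tensor instead of a Python list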
There are lots more features, which you can check out in the HuggingFace Datasets documentation. I think these functionalities come in very handy, especially map, which we can apply to the whole dataset for tokenization and preprocessing and get those tasks done very fast!
Quick note: here we selected a subsample of the dataset, so if we want to do this on the whole data, we should put dataset["train"] instead of clean_dataset in the batch_iterator function.
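A minimal sketch of that full-data variant (the name batch_iterator_full is just for illustration; it also assumes you have the time and resources to train the tokenizer on all of persian_news_dataset):
def batch_iterator_full(batch_size=1000):
    # iterate over the full train split instead of the 200-example subsample
    for i in range(0, len(dataset["train"]), batch_size):
        yield dataset["train"][i: i + batch_size]["text"]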