Build A Custom NER Pipeline With Hugging Face
Introduction
A peaceful greeting to you. Today, I will show you how to train a Named Entity Recognition (NER) pipeline to extract street names and points of interest (POI, e.g. a building's name, nearby construction…) from raw Indonesian addresses. This was originally the data science challenge from Shopee Code League 2021, which took place in April and May 2021. My friends and I took part in the competition, and this post is an effort to replicate the result we achieved, based on HuggingFace's tutorial (link at the end of the post).
Huge disclaimer: this is not the code that we used to make submissions in the competition. At that time, I was not familiar with NLP, so the code was extremely messy and hard to interpret. Now, I have decided to re-challenge myself by working through the full pipeline using HuggingFace's transformers library.
To begin, let’s install the dependencies.
!pip install transformers datasets sentencepiece seqeval
Data Preprocessing
Alright, first things first, let's take a look at the dataset we are provided. We will use the DatasetDict class of the datasets library. Since we only have two datasets, one for training and one for submission, we will split the training data into a train set and a validation set with a 90/10 ratio.
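Here is a minimal sketch of how that loading and splitting could look. The CSV file names and the split seed below are assumptions, adjust them to your own setup:

```python
from datasets import load_dataset, DatasetDict

# Load the competition CSVs (file names are assumptions; adjust to your paths)
raw = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

# Hold out 10% of the training data for validation
split = raw["train"].train_test_split(test_size=0.1, seed=42)
raw_data = DatasetDict({
    "train": split["train"],
    "valid": split["test"],
    "test": raw["test"],
})
```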
Here is what our raw_data looks like:
The dataset contains 3 columns: id, raw_address, and POI/street. To make it suitable for our training pipeline, here is what we need to do:
- Clean the raw_address field (strip and remove punctuation) and split it into tokens.
- Split the POI/street field into 2 separate columns: POI and STR.
- Tag the corresponding tokens as POI and STR using the IOB format and save them as labels.
The following functions are implemented so that they process a batch of examples rather than a single input. This lets us take advantage of the batched option in the map method, which greatly speeds up the cleaning process.
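Below is a simplified sketch of such a batched function. The cleaning and tagging logic here is an assumption, a naive exact-match search for the POI and street spans, not the exact code I used:

```python
import string

def clean_and_tag(batch):
    # Batched preprocessing: clean raw_address, split POI/street,
    # and produce IOB labels for each token.
    table = str.maketrans("", "", string.punctuation)
    all_tokens, all_labels = [], []
    for raw_address, poi_street in zip(batch["raw_address"], batch["POI/street"]):
        # Strip punctuation and split the address into word tokens
        tokens = raw_address.translate(table).strip().split()

        # The label column is formatted as "<POI>/<street>"
        poi_street = poi_street or "/"  # guard against empty rows
        poi, street = poi_street.split("/", maxsplit=1)
        poi_tokens = poi.translate(table).strip().split()
        str_tokens = street.translate(table).strip().split()

        # Default tag is "O"; mark the POI and street spans in IOB format
        labels = ["O"] * len(tokens)
        for span, tag in ((poi_tokens, "POI"), (str_tokens, "STR")):
            if not span:
                continue
            for i in range(len(tokens) - len(span) + 1):
                if tokens[i : i + len(span)] == span:
                    labels[i] = f"B-{tag}"
                    for j in range(i + 1, i + len(span)):
                        labels[j] = f"I-{tag}"
                    break

        all_tokens.append(tokens)
        all_labels.append(labels)
    return {"tokens": all_tokens, "labels": all_labels}

# Apply to the labelled splits only (the submission set has no POI/street column)
for split_name in ("train", "valid"):
    raw_data[split_name] = raw_data[split_name].map(clean_and_tag, batched=True)
```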
To use our custom labels with our tokenizer and model, we need to define the following dicts. Yes, we need both of them. They will come in handy later on, I promise you.
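Something along these lines, assuming the IOB tag set from the sketch above:

```python
# Both mappings are needed: label2id encodes the string tags into integer ids
# for training, while id2label lets the model config map predictions back
# to readable tags.
label_names = ["O", "B-POI", "I-POI", "B-STR", "I-STR"]
label2id = {label: i for i, label in enumerate(label_names)}
id2label = {i: label for i, label in enumerate(label_names)}
```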
Next, we need to convert the labels containing the actual names of the tags to their integer codes and name the new column ner_tags.
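A sketch of that conversion with a batched map (column and split names follow the earlier snippets):

```python
def to_ner_tags(batch):
    # Replace each string tag with its integer id from label2id
    return {"ner_tags": [[label2id[tag] for tag in tags] for tags in batch["labels"]]}

for split_name in ("train", "valid"):
    raw_data[split_name] = raw_data[split_name].map(to_ner_tags, batched=True)
```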
Let’s print out an example to see our results. We can see that it is easier to interpret now as the labels are aligned with the raw address tokens.
Phew, so we have done the first part of the preprocessing. Wait, what? First part? We are not done yet? That's right. Though our data looks pretty neat for now, it is not yet suitable for our tokenizer. There is one small step left before we proceed to the next part.
First, let’s load the pre-trained tokenizer from the cahya/clm-roberta-base-indonesian-NER checkpoint.
Now here is the problem. Our tokenizer will split our tokens into subwords (you can learn more about subword embeddings here). Thus, our input will become longer than our labels (it contains more tokens than the labels). That's why we have to write a function to align the labels with our newly tokenized address. Another thing to note is that the tokenizer will automatically add 2 special tokens to the beginning and the end of the input sentence: <s> and </s>. We need to mask them with label = -100 so the trainer will skip them in the training process.
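Here is a sketch of such an alignment function, following the common pattern from HuggingFace's token-classification tutorial (labelling only the first subword of each word and masking the rest with -100):

```python
def tokenize_and_align_labels(batch):
    # Tokenize pre-split words; is_split_into_words keeps the token/word mapping
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)

    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        labels = []
        for word_id in word_ids:
            if word_id is None:
                # Special tokens (<s>, </s>) get -100 so the loss ignores them
                labels.append(-100)
            elif word_id != previous_word:
                # First subword of a word keeps the original tag
                labels.append(tags[word_id])
            else:
                # Remaining subwords of the same word are also ignored by the loss
                labels.append(-100)
            previous_word = word_id
        all_labels.append(labels)

    tokenized["labels"] = all_labels
    return tokenized
```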
Let’s check our new alignment function. Looks neat enough for me.
It's time to bring the alignment function above into the map method to align every element in the dataset in a single call.
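Roughly like this, with split names and column handling following the earlier snippets:

```python
tokenized_data = DatasetDict({
    split_name: raw_data[split_name].map(
        tokenize_and_align_labels,
        batched=True,
        remove_columns=raw_data[split_name].column_names,  # keep only model inputs
    )
    for split_name in ("train", "valid")
})
```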
Fine-tuning the XLM-RoBERTa model
Finally, it’s time to put our preprocessed data to use. We will fine-tune the pre-trained model from the same checkpoint as the tokenizer above.
Data collator and metrics
First, let's define the data collator to feed into the Trainer API of HuggingFace. We also define the metric using the seqeval framework, which provides a nice evaluation method (precision, recall, F1 score, and accuracy) for chunking tasks (e.g. NER, POS tagging…).
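A possible version of both pieces, using DataCollatorForTokenClassification and the seqeval metric loaded through the datasets library:

```python
import numpy as np
from transformers import DataCollatorForTokenClassification
from datasets import load_metric

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
metric = load_metric("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Convert ids back to tag strings, skipping the -100 (masked) positions
    true_labels = [
        [id2label[l] for l in label_seq if l != -100]
        for label_seq in labels
    ]
    true_predictions = [
        [id2label[p] for p, l in zip(pred_seq, label_seq) if l != -100]
        for pred_seq, label_seq in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```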
Training
Now, all we need to do is load the pre-trained model and specify some training arguments, such as the number of epochs and the initial learning rate… Then, simply call the train method on the Trainer and the rest will be taken care of for us.
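Here is a sketch of that setup. The hyperparameters below (learning rate, batch size, output directory) are illustrative assumptions, not the exact values I used; ignore_mismatched_sizes is there because the checkpoint's original NER head was trained on a different label set:

```python
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,  # replace the checkpoint's NER head with ours
)

args = TrainingArguments(
    output_dir="xlm-roberta-poi-street",  # assumed output directory
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```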
Hooray, the training has been completed. It took me approximately 2 hours of training on a Tesla P100 on Google Colab for 2 epochs. Let's look at the performance of our model.
It achieves an accuracy of 93% with an F1 score of 0.81. This is not too bad, since the dataset we began with is obviously quite “raw” and needs more cleaning steps. According to the host of the competition, some of the labels overlap between POI and street, and some are even abbreviated (meaning that some labels are not in the token set).
I have pushed the fine-tuned model to HuggingFace’s Hub here. Feel free to use it as you like. Or if you want a notebook version, you can visit this repo.
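For inference, loading the fine-tuned checkpoint into a token-classification pipeline could look like this (the model id and the sample address below are placeholders; swap in the actual Hub id):

```python
from transformers import pipeline

# "your-username/xlm-roberta-poi-street" is a placeholder model id
ner = pipeline(
    "token-classification",
    model="your-username/xlm-roberta-poi-street",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

print(ner("jl jend sudirman no 5 graha niaga jakarta"))
```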
Conclusion
In this post, we have walked through how to build a custom NER model with HuggingFace. I chose this problem from Shopee Code League 2021 as an example because I had so much fun during the week I spent competing in the challenge. If you are curious about the result, my colleagues and I ranked 93rd out of more than a thousand competitors. Not a superior result, but since that was my first time touching NLP, I would consider it a winning trade 😉. This is also the moment when I realized I really love Natural Language Processing.
References
This work is inspired by the public solution of the winning team and the HuggingFace tutorial. I really recommend checking out these two posts, since I have learned a lot from them.
Thank you so much for reading this. I am really looking forward to seeing you in the next post!
Goodbye traveler, may your road lead you to warm sands.