Build A Custom NER Pipeline With Hugging Face

Photo by Edan Cohen on Unsplash


Huge disclaimer: this is not the code that we used to make submissions in the competition. At that time, I was not familiar with NLP, so the code was extremely messy and hard to interpret. Now, I have decided to challenge myself again by rebuilding the full pipeline with Hugging Face’s transformers library.

To begin, let’s install the dependencies.

!pip install transformers datasets sentencepiece seqeval

Data Preprocessing

Here is what our raw_data looks like:

The dataset contains 3 columns: id, raw_address, and POI/street. To make it suitable for our training pipeline, we need to do the following:

  1. Clean the raw_address field (strip whitespace and remove punctuation) and split it into tokens.
  2. Split the POI/street field into 2 separate columns: POI and STR.
  3. Tag the corresponding tokens as POI and STR using the IOB format, and save the tags as labels.

The following functions are implemented so that they can process a batch of examples rather than a single input. By doing this, we can take advantage of the batched option in the map method, which will greatly speed up the cleaning process.
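As a rough illustration of the three steps above, here is a minimal sketch of batched cleaning functions. It is not the exact code from the notebook. The function names, the tokens and labels column names, the choice to split POI/street at the first slash, and the exact-match tagging rule are all assumptions made for this example.

```python
import string

# Assumption: "remove punctuation" means deleting ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_and_tokenize(batch):
    """Strip each raw address, drop punctuation, and split on whitespace."""
    tokens = [
        address.strip().translate(_PUNCT_TABLE).split()
        for address in batch["raw_address"]
    ]
    return {"tokens": tokens}


def split_poi_street(batch):
    """Split the POI/street field into two columns at the first slash (assumed)."""
    pois, streets = [], []
    for value in batch["POI/street"]:
        poi, _, street = value.partition("/")
        pois.append(poi.strip())
        streets.append(street.strip())
    return {"POI": pois, "STR": streets}


def _tag_span(tokens, labels, phrase, tag):
    """Tag the first exact occurrence of `phrase` in `tokens` with B-/I- labels."""
    target = phrase.translate(_PUNCT_TABLE).split()
    if not target:
        return
    for start in range(len(tokens) - len(target) + 1):
        if tokens[start:start + len(target)] == target:
            labels[start] = f"B-{tag}"
            for i in range(start + 1, start + len(target)):
                labels[i] = f"I-{tag}"
            return


def add_iob_labels(batch):
    """Build one IOB label per token; tokens outside any span stay 'O'."""
    all_labels = []
    for tokens, poi, street in zip(batch["tokens"], batch["POI"], batch["STR"]):
        labels = ["O"] * len(tokens)
        _tag_span(tokens, labels, street, "STR")
        # POI is tagged last, so it wins if the two spans overlap (an assumption).
        _tag_span(tokens, labels, poi, "POI")
        all_labels.append(labels)
    return {"labels": all_labels}
```

Each function takes a dict of columns and returns a dict of new columns, which is the shape the map method expects when batched=True is set, for example raw_data.map(clean_and_tokenize, batched=True). Abbreviated labels that never appear verbatim in the address are simply left untagged by this sketch.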

To use our custom labels with our tokenizer and model, we need to define the following dicts. Yes, we need both of them. They will come in handy later on, I promise you.
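A minimal sketch of the two dicts might look like this. The label names follow the IOB tags for POI and STR described earlier; the ordering of the ids is an assumption, and any consistent ordering would work.

```python
# Assumed label order; "O" marks tokens outside any entity.
label_names = ["O", "B-POI", "I-POI", "B-STR", "I-STR"]

# Name -> integer code, used to encode the training labels.
label2id = {name: idx for idx, name in enumerate(label_names)}

# Integer code -> name, used to decode model predictions.
id2label = {idx: name for name, idx in label2id.items()}
```

One dict encodes the labels on the way into the model, and the other decodes the predictions on the way out, which is why both are needed.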

Next, we need to convert the labels from their tag names to their integer codes and store the result in a new column named ner_tags.
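This conversion can be sketched as one more batched function. The function name encode_labels and the labels input column are assumptions for illustration; the label-to-id mapping is passed in as an argument so the function stays self-contained.

```python
def encode_labels(batch, label2id):
    """Map each list of tag names to a list of integer codes."""
    ner_tags = [
        [label2id[name] for name in labels]
        for labels in batch["labels"]
    ]
    return {"ner_tags": ner_tags}
```

With the datasets library, the extra argument can be supplied through the fn_kwargs parameter of map, for example dataset.map(encode_labels, batched=True, fn_kwargs={"label2id": label2id}).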

Let’s print out an example to see our results. We can see that it is easier to interpret now as the labels are aligned with the raw address tokens.

Output of the above code snippet

Phew, so we have done the first part of the preprocessing. Wait what? First part? We are not done yet? That’s right. Though our data looks pretty neat for now, it is not yet suitable for our tokenizer. There is a tiny step to do before we proceed to the next part.

First, let’s load the pre-trained tokenizer from the cahya/xlm-roberta-base-indonesian-NER checkpoint.

Now here is the problem. Our tokenizer will split our tokens into subwords (you can learn more about subword embedding here). As a result, the tokenized input contains more tokens than we have labels. That’s why we have to write a function to align the labels with our new tokenized address. Another thing to note is that the tokenizer automatically adds 2 special tokens, <s> and </s>, to the beginning and the end of the input sentence. We need to mask them with the label -100 so the trainer will skip them when computing the loss.
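The alignment logic can be sketched in plain Python, given the word index of every subword. This is a minimal sketch rather than the notebook's exact function: the name align_labels is hypothetical, and masking the continuation subwords with -100 (instead of repeating the word's label) is one common convention that I am assuming here.

```python
def align_labels(word_ids, ner_tags):
    """Expand word-level tags to subword level.

    word_ids: one entry per subword; None for special tokens such as <s> and </s>,
              otherwise the index of the word the subword came from.
    ner_tags: one integer tag per original word.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens are masked so the loss ignores them.
            aligned.append(-100)
        elif word_id != previous_word_id:
            # The first subword of a word carries the word's tag.
            aligned.append(ner_tags[word_id])
        else:
            # Assumption: continuation subwords are masked as well.
            aligned.append(-100)
        previous_word_id = word_id
    return aligned
```

With a fast tokenizer, the word indices come from the encoding itself: calling the tokenizer on a list of tokens with is_split_into_words=True returns an object whose word_ids() method gives exactly this list.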

Let’s check our new alignment function. Looks neat enough for me.

Output of the above code snippet

It’s time to bring the alignment function above to the map method to align every element in the dataset in a single call.

Fine-tuning the XLM-RoBERTa model

Data collator and metrics
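Before any metric can be computed, the predictions have to be decoded: positions labeled -100 are dropped and the remaining ids are turned back into tag names. The sketch below shows that step in plain Python. The function names are hypothetical, and the predictions are assumed to already be label ids (that is, the argmax of the model's logits over the last axis).

```python
def decode_for_scoring(predictions, labels, id2label):
    """Drop masked positions (-100) and convert ids to tag names."""
    true_labels, true_preds = [], []
    for pred_row, label_row in zip(predictions, labels):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != -100]
        true_preds.append([id2label[p] for p, _ in kept])
        true_labels.append([id2label[l] for _, l in kept])
    return true_labels, true_preds


def token_accuracy(true_labels, true_preds):
    """Fraction of unmasked tokens whose predicted tag matches the gold tag."""
    pairs = [
        (gold, pred)
        for gold_row, pred_row in zip(true_labels, true_preds)
        for gold, pred in zip(gold_row, pred_row)
    ]
    if not pairs:
        return 0.0
    return sum(gold == pred for gold, pred in pairs) / len(pairs)
```

The decoded lists are in the format that seqeval expects, so inside a compute_metrics function they can be passed to seqeval.metrics.f1_score(true_labels, true_preds) to get an entity-level F1 score. For batching, the transformers library provides DataCollatorForTokenClassification, which pads both the inputs and the labels.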


Load model
Config and train!

Hooray, the training has been completed. Training for 2 epochs took me approximately 2 hours on a Tesla P100 on Google Colab. Let’s look at the performance of our model.

Training result

It achieves an accuracy of 93% with an F1 score of 0.81. This is not too bad, since the dataset we began with is quite “raw” and needs more cleaning. According to the host of the competition, some of the labels overlap between POI and street, and some are even abbreviated (meaning that some labels do not appear in the token set).

I have pushed the fine-tuned model to HuggingFace’s Hub here. Feel free to use it as you like. Or if you want a notebook version, you can visit this repo.



Thank you so much for reading this. I am really looking forward to seeing you in the next post!

Goodbye traveler, may your road lead you to warm sands.



Khang Pham

Language enthusiast 🇫🇷 🇬🇧 🇻🇳 | NLP Researcher | Contact me at: