Build A Custom NER Pipeline With Hugging Face



A peaceful greeting to you. Today, I will show you how to train a Named Entity Recognition (NER) pipeline to extract street names and points of interest (POI, e.g. a building’s name, a nearby construction…) from raw Indonesian addresses. This was originally the data science challenge from Shopee Code League 2021, which took place in April and May 2021. My friends and I took part in the competition, and this post is an effort to replicate the result we achieved, based on HuggingFace’s tutorial (link at the end of the post).

Huge disclaimer: this is not the code that we used to make submissions in the competition. At that time, I was not familiar with NLP so the code was extremely messy and hard to interpret. Now, I have decided to re-challenge myself by working on the full pipeline using HuggingFace’s transformers library.

To begin, let’s install the dependencies.

!pip install transformers datasets sentencepiece seqeval

Data Preprocessing

Alright, first things first, let’s take a look at the dataset we are provided with. We will use the DatasetDict class of the datasets library. Since we only have two datasets, one for training and one for submission, we will split the training data into a train set and a validation set with a 90/10 ratio.

Here is what our raw_data looks like:

The dataset contains 3 columns: id, raw_address, and POI/street. To make it suitable for our training pipeline, here is what we need to do:

  1. Clean the raw_address field (strip and remove punctuation) and split them into tokens.
  2. Split the POI/street field into 2 separate columns: POI and STR.
  3. Tag the corresponding tokens as POI and STR using IOB format, save them as labels.

The following functions are implemented so that they can process a batch of examples rather than a single input. By doing this, we can take advantage of the batched option in the map method, which will greatly speed up the cleaning process.
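One possible sketch of those batched functions follows. The names and the exact cleaning rules here are mine, not the original notebook’s; note that every field in a batched map is a list of examples.

```python
import string

def clean_and_tokenize(batch):
    """Strip punctuation, lowercase, and split each raw address into tokens."""
    table = str.maketrans("", "", string.punctuation)
    return {"tokens": [addr.translate(table).lower().split()
                       for addr in batch["raw_address"]]}

def split_poi_street(batch):
    """Split the 'POI/street' column on the slash into two columns."""
    pois, streets = [], []
    for value in batch["POI/street"]:
        poi, _, street = value.partition("/")
        pois.append(poi.strip())
        streets.append(street.strip())
    return {"POI": pois, "STR": streets}

def iob_tags(tokens, span, prefix):
    """Mark where the span's tokens occur inside the address, IOB style."""
    tags = ["O"] * len(tokens)
    span_tokens = span.split()
    if not span_tokens:
        return tags
    for i in range(len(tokens) - len(span_tokens) + 1):
        if tokens[i:i + len(span_tokens)] == span_tokens:
            tags[i] = "B-" + prefix
            for j in range(i + 1, i + len(span_tokens)):
                tags[j] = "I-" + prefix
            break
    return tags
```

With these, a call like raw_data.map(clean_and_tokenize, batched=True) processes the whole dataset in large chunks instead of one row at a time.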

To use our custom labels with our tokenizer and model, we need to define the following dicts. Yes, we need both of them. They will come in handy later on, I promise you.
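For this task the tag set is presumably O plus B/I tags for POI and street (the exact label names are my assumption):

```python
# The tag set: outside, plus B(egin)/I(nside) for each entity type.
label_names = ["O", "B-POI", "I-POI", "B-STR", "I-STR"]

# label2id encodes string tags into the integer ner_tags the model trains
# on; id2label lets the model config decode predictions back to strings.
label2id = {label: i for i, label in enumerate(label_names)}
id2label = {i: label for label, i in label2id.items()}
```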

Next, we need to convert the labels, which contain the actual tag names, to their integer codes, and name the new column ner_tags.
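The conversion itself can be another small batched function (label2id is repeated here so the snippet stands alone; the function name is mine):

```python
label2id = {"O": 0, "B-POI": 1, "I-POI": 2, "B-STR": 3, "I-STR": 4}

def encode_labels(batch):
    # Replace each string tag with its integer code; the new column is
    # called ner_tags so the later tokenization step can pick it up.
    return {"ner_tags": [[label2id[tag] for tag in tags]
                         for tags in batch["labels"]]}
```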

Let’s print out an example to see our results. We can see that it is easier to interpret now as the labels are aligned with the raw address tokens.

Output of the above code snippet

Phew, so we have done the first part of the preprocessing. Wait what? First part? We are not done yet? That’s right. Though our data looks pretty neat for now, it is not yet suitable for our tokenizer. There is a tiny step to do before we proceed to the next part.

First, let’s load the pre-trained tokenizer from the cahya/xlm-roberta-base-indonesian-NER checkpoint.
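That load is a one-liner with AutoTokenizer; a fast tokenizer is what we get by default here, and we will need its word_ids() method for the alignment step below.

```python
from transformers import AutoTokenizer

# An XLM-RoBERTa model fine-tuned for Indonesian NER.
checkpoint = "cahya/xlm-roberta-base-indonesian-NER"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# is_split_into_words=True tells the tokenizer the input is already
# a list of word-level tokens, not a single string.
encoding = tokenizer("jl kapuk timur".split(), is_split_into_words=True)
```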

Now here is the problem. Our tokenizer will split our tokens into subwords (you can learn more about subword embedding here). Thus, our input will end up longer than our labels, so we have to write a function to align the labels with the newly tokenized address. Another thing to note is that the tokenizer automatically adds two special tokens, <s> and </s>, to the beginning and the end of the input sentence. We need to mask them with the label -100 so the trainer will skip them in the training process.
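A minimal version of that alignment helper might look like this, taking the word_ids list that a fast tokenizer returns (one entry per subword: the index of the word it came from, or None for special tokens). The HuggingFace tutorial additionally flips B- tags to I- on continuation subwords; that refinement is omitted here for brevity.

```python
def align_labels_with_tokens(labels, word_ids):
    """Expand word-level labels to subword level.

    Special tokens (word_id is None) get -100 so the loss skips them;
    every subword inherits the label of the word it came from.
    """
    return [-100 if word_id is None else labels[word_id]
            for word_id in word_ids]
```

Inside the map function, word_ids would come from tokenized_inputs.word_ids(i) for each example i in the batch.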

Let’s check our new alignment function. Looks neat enough to me.

Output of the above code snippet

It’s time to bring the alignment function above into the map method to align every element of the dataset in a single call.

Fine-tuning the XLM Roberta model

Finally, it’s time to put our preprocessed data to use. We will fine-tune the pre-trained model from the same checkpoint as the tokenizer above.

Data collator and metrics

First, let’s define the data collator to feed into HuggingFace’s Trainer API. We also define the metric using the Seqeval framework, which provides a solid evaluation method (precision, recall, F1 score, and accuracy) for chunking tasks such as NER and POS tagging.


Now, all we need to do is load the pre-trained model and specify some training arguments, such as the number of epochs and the initial learning rate… Then, simply call the train method on the Trainer, and the rest will be taken care of for us.

Load model
Config and train!
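Filling in that step, the whole load-and-train call might look roughly like this. The hyperparameters and the output folder name are illustrative, not the ones actually used; tokenized_data, data_collator, tokenizer, and compute_metrics are the objects built in the previous sections.

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

checkpoint = "cahya/xlm-roberta-base-indonesian-NER"
label_names = ["O", "B-POI", "I-POI", "B-STR", "I-STR"]  # assumed tag set
id2label = dict(enumerate(label_names))
label2id = {label: i for i, label in id2label.items()}

# The checkpoint ships its own NER head, so we swap it for a fresh one
# sized to our five labels.
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)

# Illustrative hyperparameters -- tune to your hardware and budget.
args = TrainingArguments(
    output_dir="xlm-roberta-address-ner",
    learning_rate=2e-5,
    num_train_epochs=2,
    per_device_train_batch_size=16,
)

# tokenized_data, data_collator, tokenizer and compute_metrics come from
# the earlier preprocessing and metric sections.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["valid"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```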

Hooray, the training has been completed. It took me approximately 2 hours to train for 2 epochs on a Tesla P100 on Google Colab. Let’s look at the performance of our model.

Training result

It achieves an accuracy of 93% with an F1 score of 0.81. This is not too bad, since the dataset we began with is obviously quite “raw” and would benefit from more cleaning steps. According to the competition host, some labels overlap between POI and street, and some are even abbreviated (meaning that some labels do not appear verbatim in the token set).

I have pushed the fine-tuned model to HuggingFace’s Hub here. Feel free to use it as you like. Or if you want a notebook version, you can visit this repo.


In this post, we have walked through how to build a custom NER model with HuggingFace. I chose this problem from Shopee Code League 2021 as an example because I had so much fun during the week I spent competing in the challenge. If you are curious about the result, my colleagues and I ranked 93rd out of more than a thousand competitors. Not a stellar result, but since that was my first time touching NLP, I would consider it a winning trade 😉. This was also the moment when I realized I really love Natural Language Processing.


This work is inspired by the winning team’s public solution and the HuggingFace tutorial. I really recommend you check out these two posts, since I have learned a lot from them.

Thank you so much for reading this. I am really looking forward to seeing you in the next post!

Goodbye traveler, may your road lead you to warm sands.




Khang Pham
Language enthusiast 🇫🇷 🇬🇧 🇻🇳 | NLP Researcher