BERT also works very well as a feature extractor in NLP!

Two years ago, I developed car classification models by ResNet. I use transfer learning to develop models as I can prepare only small amount of images. My model is already pre-trained by a huge amount of data such as ImageNet. I can extract features of each image of cars and train classification models on top of that. It works very well. If you are interested in it, could you see the article?

Then, I am wondering how BERT(1) works as a feature extractor. If it works well, it can be applied to many downstream tasks with ease. Let us try the experiment here. BERT is one of the best Natural Language Processing (NLP) models by Google. I wrote how BERT works in my article before. It is amazing!

Let me explain features a little. Feature means “How texts can be represented by vectors”. Each word can be converted to a number before inputting to BERT then whole sentence can be converted to 768-length-vectors by BERT. In this experiment, feature extraction can be done by TensorFlow Hub of BERT. Let us see its website. It says there are two kinds of outputs by BERT…

It means that when text data is input to BERT, the model returns two type of vectors. One is “one vector for each sentence”, the other is “sequence of vectors for each sentence”. In this task, we need “one vector for each sentence” because it is classification task and one vector is enough to input classification models. We can see the first 3 vectors out of 3503 samples below.

This is a training result of the classification model. Accuracy is 82.99% at 105 epoch. Although it is reasonable it is worsen than the result of the last article 88.58%. The deference is considered as advantage of fine tuning. In this experiment, weights of BERT are fixed and there is no fine tuning. So if you need more accuracy, let us try fine tuning just like the experiment in the last article.

BERT means “Bidirectional Encoder Representations from Transformers”. So it looks good as a tool for feature extractions. Especially this is multi-language model therefore we can use it for 104 languages. It is amazing!

I will perform other experiments about BERT in my article. Stay tuned!

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    11 Oct 2018, Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova, Google AI Language

Notice: Toshi Stats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. Toshi Stats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on Toshi Stats Co., Ltd. and me to correct any errors or defects in the codes and the software