This is our cross-lingual intelligent system. It is smarter than I am as it recognizes 16 languages!

When I lived in Kuala Lumpur, Malaysia, I always felt I was in a multilingual environment. Most people speak Bahasa Malaysia, but when they talk to me, they speak English. Some of them understand Japanese. Chinese Malaysians speak Mandarin or Cantonese. Indian Malaysians speak Hindi or other languages. Yes, I am sure Asia is a "multilingual environment". Since then, I have always wondered how we could develop a system that accepts many languages as input. Now I have found it.

This is the first cross-lingual intelligent system by TOSHI STATS. It can accept 16 languages and perform sentiment analysis. Let me explain the details.

1. Inputs in 16 languages

As we use the Multilingual Universal Sentence Encoder (MUE)(1) models from TensorFlow Hub, the system can accept 16 languages (see the list below). Usually, a Japanese system cannot accept English input and an English system cannot accept Japanese input. But this system can accept both and works well. This is amazing! The first screenshot of our system shows input in English and the second shows input in Japanese.

[Screenshot: the system with input in English]

[Screenshot: the system with input in Japanese]

We do not need to build a separate system for each language, one by one. The secret is that the model maps each sentence into the same embedding space, even though the sentences are written in different languages.
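
To make the idea concrete, here is a minimal sketch, assuming TensorFlow 2 and the multilingual Universal Sentence Encoder module on TensorFlow Hub. The module version and the example sentences are my own illustrative choices, not the exact code behind the system above: two sentences with the same meaning in different languages are embedded and land close together in the shared space.

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # registers the SentencePiece ops the multilingual model needs

# Load the multilingual Universal Sentence Encoder from TensorFlow Hub
# (the exact module version here is an assumption).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Two sentences with the same meaning in different languages (illustrative examples).
sentences = [
    "The weather is beautiful today.",   # English
    "今日はとても良い天気です。",            # Japanese
]
vectors = embed(sentences).numpy()       # shape: (2, 512)

# Cosine similarity: the two sentences map to nearby points in the shared space.
cos = np.inner(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(round(float(cos), 3))
```

The similarity printed at the end is high precisely because both sentences are mapped into the same 512-dimensional space, which is what makes a single system for all 16 languages possible.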

2. Transfer learning from English to other languages

I think this is the biggest breakthrough in the system. Because the languages share the same embedding space, we can train a model in English and transfer its knowledge to other languages. For example, there is a lot of text data for training models in English but only a little in Japanese. In such a case, it is difficult to train models effectively in Japanese. But we can train a model in English and use it in Japanese. It is great! Of course, we can also train the model in another language and transfer it to the rest. It is extraordinary, as it enables us to transfer knowledge and expertise from one language to another.
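
Here is a hedged sketch of what this transfer looks like in code, under the same assumptions as above (multilingual encoder from TensorFlow Hub, TensorFlow 2) and with toy data I made up for illustration: the classifier is trained only on English embeddings, then applied directly to a Japanese sentence.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # needed by the multilingual encoder

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

# Tiny English-only training set (illustrative labels: 1 = positive, 0 = negative).
train_texts = ["I love this product.", "Absolutely fantastic experience.",
               "This is terrible.", "I am very disappointed."]
train_labels = tf.constant([1, 1, 0, 0], dtype=tf.float32)
train_vectors = embed(train_texts)                # shape: (4, 512)

# A small classifier trained only on English embeddings.
clf = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
clf.fit(train_vectors, train_labels, epochs=50, verbose=0)

# The same classifier scores a Japanese sentence, because the embedding space is shared.
print(clf.predict(embed(["この製品が大好きです。"])))  # "I love this product." in Japanese
```

Nothing about the classifier is language-specific; the multilingual encoder does all the cross-lingual work.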

3. Experiment and result

I chose one news headline(2) from The Japan Times and performed sentiment analysis with the system. The headline is "Naomi Osaka cruises to victory in Pan Pacific Open final to capture first title on Japanese soil". I think it should be positive.

This English sentence is translated into the other 15 languages by Google Translate. Then each sentence is input to the system and we measure the "probability of positive sentiment". Here is the result: 90% of the scores are over 0.8. It means that in most languages, the system recognizes the sentence as clearly "positive". This is amazing! It works pretty well across the 16 languages, although the model is trained only in English.

[Chart: probability of positive sentiment for each of the 16 languages]

While developing this cross-lingual intelligent system, I realized it is already smarter than I am, as I do not understand 14 of the 16 languages; I only know Japanese and English. Based on this method, we can develop many intelligent systems that would have been difficult to build a year ago. I will keep updating you on the progress of our intelligent system. Stay tuned!

1. Multilingual Universal Sentence Encoder for Semantic Retrieval, Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil, Google, July 9, 2019.

2. Naomi Osaka cruises to victory in Pan Pacific Open final to capture first title on Japanese soil, The Japan Times, Sep 22, 2019.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software

BERT also works very well as a feature extractor in NLP!

Two years ago, I developed car classification models with ResNet. I used transfer learning to develop the models, as I could prepare only a small number of images. The model was already pre-trained on a huge dataset, ImageNet. I extracted features from each car image and trained classification models on top of them. It worked very well. If you are interested, please see that article.

Then I wondered how BERT(1) would work as a feature extractor. If it works well, it can be applied to many downstream tasks with ease. Let us try the experiment here. BERT is one of the best natural language processing (NLP) models, developed by Google. I wrote about how BERT works in an earlier article. It is amazing!

Let me explain features a little. A feature here means "how a text is represented as a vector". Each word is converted to a number (a token id) before being input to BERT, and the whole sentence is then converted by BERT into a 768-dimensional vector. In this experiment, feature extraction is done with the BERT module on TensorFlow Hub. Let us look at its website. It says there are two kinds of outputs from BERT…

It means that when text data is input to BERT, the model returns two types of vectors. One is "one vector for each sentence"; the other is "a sequence of vectors for each sentence", one per token. In this task, we need "one vector for each sentence", because this is a classification task and a single vector per sentence is enough as input to a classification model. We can see the first 3 vectors out of 3,503 samples below.
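
For illustration, here is a minimal sketch, assuming TensorFlow 2 and the multilingual BERT encoder plus its matching preprocessing model on TensorFlow Hub (the exact module used in the original experiment may differ), showing both outputs: the pooled per-sentence vector and the per-token sequence.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # required by the BERT preprocessing model

# Multilingual BERT encoder and its matching preprocessing model on TensorFlow Hub
# (the module versions are assumptions, not necessarily the article's exact setup).
preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4",
    trainable=False)  # frozen weights: feature extraction only, no fine-tuning

sentences = tf.constant([
    "This movie was fantastic.",
    "The service was slow and disappointing.",
])
outputs = encoder(preprocess(sentences))

pooled = outputs["pooled_output"]      # one 768-dimensional vector per sentence
sequence = outputs["sequence_output"]  # one 768-dimensional vector per token
print(pooled.shape, sequence.shape)    # (2, 768) and (2, 128, 768)
```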

This is the training result of the classification model. Accuracy is 82.99% at epoch 105. Although this is reasonable, it is worse than the 88.58% reported in the last article. The difference can be attributed to the advantage of fine-tuning: in this experiment, the weights of BERT are fixed and there is no fine-tuning. So if you need more accuracy, try fine-tuning, just like the experiment in the last article.
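
As a sketch of the kind of classifier used here (not the article's exact model, and under the same TensorFlow Hub assumptions as above), the pooled BERT vector can feed a small sigmoid head; flipping trainable to True on the encoder layer is what turns this feature-extraction setup into the fine-tuning discussed above.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # required by the BERT preprocessing model

# Raw text goes in; BERT stays frozen and acts purely as a feature extractor.
text_input = tf.keras.Input(shape=(), dtype=tf.string)
encoder_inputs = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3")(text_input)
bert_outputs = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4",
    trainable=False)(encoder_inputs)       # set trainable=True to fine-tune instead
pooled = bert_outputs["pooled_output"]     # 768-dimensional sentence feature
output = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)  # binary sentiment head

model = tf.keras.Model(text_input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_texts, train_labels, epochs=...)  # placeholder names; 0 = negative, 1 = positive
```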

BERT stands for "Bidirectional Encoder Representations from Transformers". It looks like a good tool for feature extraction. In particular, the multilingual model covers 104 languages, so we can use it across many of them. It is amazing!

I will perform other experiments with BERT in future articles. Stay tuned!

1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Google AI Language, 11 Oct 2018.

Notice: Toshi Stats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. Toshi Stats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on Toshi Stats Co., Ltd. and me to correct any errors or defects in the codes and the software

How can we develop machine intelligence with a little data in text analysis?


Whenever you want to create a machine intelligence model, the first question to ask is "Where is my data?". It is usually difficult to find good data for building models, because collecting it is time-consuming and can be costly. Unless you work at a company such as Google or Facebook, it can be a headache. Fortunately, there is a good way to address this problem: "transfer learning". Let us find out!

1. Transfer learning

When we need to train machine intelligence models, we usually use "supervised learning". It means we need "teachers" who can tell us which answer is right. For example, when we need to classify "is this a cat or a dog?", we have to tell the computer "this is a cat and that is a dog". It is a powerful method of learning that achieves high accuracy, so most current AI applications are developed with supervised learning. But a problem arises here: there is little data labeled for supervised learning. While we have many images on our smartphones, each image carries no information about "what it is", so we need to add this information to each image manually. That takes a long time, because training needs a massive number of images. I explained this a little for computer vision in my blog before. We can say the same thing about text analysis, or natural language processing. We have many tweets on the internet, but no one tells you which have positive and which have negative sentiment. Therefore we would need to label each tweet as "positive" or "negative" ourselves, and no one wants to do that. This is where "transfer learning" comes in. You do not need to train from scratch; you simply transfer someone else's results to your model, because someone did similar training before you did! The beauty of transfer learning is that we need only a little data for our own training. No need for a massive amount of data anymore. It makes preparing data far easier for us!


2. “Transformer”

This model(1) is one of the most sophisticated models for machine translation from 2017. It was created by Google Brain. As you know, it achieved state-of-the-art accuracy in neural machine translation at the time it was published. The key architecture of the Transformer is "self-attention". It can tell the model where to pay attention among all the words in a sentence, regardless of their respective positions, by using a "query, key, and value" mechanism. The research paper "Attention Is All You Need" is available here. The self-attention mechanism takes time to explain in detail; if you want to know more, this blog is strongly recommended. I just want to say that the self-attention mechanism might be a game changer for developing machine intelligence in the future.
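
To give a feel for the mechanism, here is a toy NumPy sketch of scaled dot-product self-attention; it leaves out the multi-head structure and the learned query/key/value projection matrices of the real Transformer, and the data is random, purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each row of Q asks 'which positions matter to me?'; a softmax over the
    scaled Q·K scores gives attention weights; the output is a weighted sum of V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V, weights

# Toy setup: 3 "words", 4-dimensional vectors. In the real Transformer, Q, K and V
# come from learned linear projections of the word embeddings; here we reuse X directly.
np.random.seed(0)
X = np.random.randn(3, 4)
output, attention = scaled_dot_product_attention(X, X, X)
print(attention)  # each row sums to 1: how much each word attends to every word,
                  # regardless of position in the sentence
```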

3.  Transfer learning based on “Transformer”

It has been more than a year since the "Transformer" was made public, and there are now several variations based on it. I found a good model for the "transfer learning" I mentioned earlier in this article: the "Universal Sentence Encoder"(2). On its website, we can find a good explanation of what it is.

“The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.”

The model takes sentences, phrases or short paragraphs and outputs vectors to be fed into the next process. "universal-sentence-encoder-large" is trained with the "Transformer" (the lighter version is trained with a different model). The beauty is that the Universal Sentence Encoder is already trained by Google, and the results are available for us to perform "transfer learning" ourselves. This is great! The chart below tells you how it works.

[Chart: how the Universal Sentence Encoder turns sentences into embedding vectors]

The team at Google claimed that "With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task." So let me confirm how it works with a little data. I performed a small experiment based on this awesome article. I modified the classification model and changed the number of training samples. With only 100 training samples, I could achieve 79.2% accuracy; with 300 samples, 95.8% accuracy. This is great! I believe these results come from the power of transfer learning with the Universal Sentence Encoder.

[Chart: accuracy against the number of training samples]
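
Here is a hedged sketch of the kind of small-data setup described above, assuming TensorFlow 2 and the current TF2 handle of the Universal Sentence Encoder on TensorFlow Hub (the original 2018 experiment used the earlier hub module); the dataset variable names are placeholders for a small labelled text set.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pre-trained Universal Sentence Encoder as a frozen Keras layer
# (the TF2 module handle below is an assumption).
use = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder-large/5",
                     input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    use,                                             # sentences -> 512-dimensional embeddings
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g. positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With transfer learning, even ~100 labelled examples can be enough to start:
# model.fit(small_train_texts, small_train_labels, epochs=10, validation_split=0.2)
```

Because the encoder is frozen and already captures general sentence meaning, only the small dense head has to be learned from the handful of labelled samples.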

In this article, I introduce transfer learning and perform a small experiment with the latest model “Universal Sentence Encoder”.  It looks very promising so far. I would like to continue transfer learning experiments and update the results here.  Stay tuned!

When you need AI consulting, please visit the TOSHI STATS website.

1. Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Google, 12 June 2017.
2. Universal Sentence Encoder, Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil, Google, 29 March 2018.

Notice: Toshi Stats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. Toshi Stats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on Toshi Stats Co., Ltd. and me to correct any errors or defects in the codes and the software