“Monte Carlo tree search” is the key to AlphaGo Zero!

In October last year, Google DeepMind released “AlphaGo Zero”. It is stronger than all previous versions of AlphaGo, even though this new version uses no human knowledge of Go for training. It performs self-play and gets stronger by itself. I was very surprised to hear the news, because in general we need a lot of data to train a model.

Today, I would like to consider why AlphaGo Zero works so well, from the viewpoint of a Go player, as I have played the game for fun for many years. I am not a professional Go player, but I have some expertise in both Go and deep learning, so this is a good opportunity for me to think it through.

When I play Go, I often decide the next move based on intuition, because I am confident that “it is right”. But when the situation is more complex and I am not so sure what the best move is, I try out, in my mind (not on the real board), many sequences of moves that my opponent and I could take each turn, and then choose the best move based on those trials. We call this “Yomi” in Japanese. Unfortunately, I sometimes perform “Yomi” incorrectly and end up making a bad move. Professional Go players perform “Yomi” far more accurately than I do. This is the key to being a strong Go player.

 

So I wonder how AlphaGo Zero can perform “Yomi” effectively. I think this is the key to understanding AlphaGo Zero. Let me consider the following points.

 

1. Monte Carlo tree search (MCTS) performs “Yomi” effectively.

The next move could be decided by the policy/value network alone, but there might be an even better move, so we need to search for it. MCTS is used for this search in AlphaGo Zero. According to the paper, MCTS can find a better move than the one originally proposed by the policy/value network. DeepMind says that MCTS works as a “powerful policy improvement operator” and that an “improved MCTS-based policy” can be obtained. This is great, because it means that AlphaGo Zero can perform “Yomi” just as we do.
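To make this a little more concrete, here is a minimal sketch of how MCTS can turn the network’s raw policy into an improved one. This is my own illustration, not DeepMind’s code: the `game` interface, the `policy_value(state)` network and the constant `C_PUCT` are hypothetical placeholders.

```
# Simplified, illustrative MCTS in the AlphaGo Zero style (not the real code).
# Assumed interfaces: game.legal_moves(state), game.next_state(state, move),
# game.is_over(state), game.winner_value(state), and a network
# policy_value(state) returning (move_priors, value).
import math

C_PUCT = 1.5  # exploration constant (made-up value)

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) from the policy network
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def search(root_state, game, policy_value, n_simulations=100):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        node, state, path = root, root_state, []
        # 1. Selection: walk down the tree with a PUCT-style rule.
        while node.children:
            total = sum(child.visits for child in node.children.values())
            move, node = max(
                node.children.items(),
                key=lambda kv: kv[1].q()
                + C_PUCT * kv[1].prior * math.sqrt(total + 1) / (1 + kv[1].visits),
            )
            state = game.next_state(state, move)
            path.append(node)
        # 2. Expansion + evaluation: use the network instead of random rollouts.
        if game.is_over(state):
            value = game.winner_value(state)   # from the current player's view
        else:
            priors, value = policy_value(state)
            for move in game.legal_moves(state):
                node.children[move] = Node(prior=priors[move])
        # 3. Backup: propagate the value along the path, flipping the sign
        #    at each level because the players alternate.
        for n in reversed(path):
            value = -value
            n.visits += 1
            n.value_sum += value
    # The improved ("MCTS-based") policy is read off from the visit counts.
    total = sum(child.visits for child in root.children.values())
    return {move: child.visits / total for move, child in root.children.items()}
```

The key point is the last line: the improved policy comes from the visit counts, which is exactly the “policy improvement” the paper describes.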

 

2. A whole game can be played by self-play, without human knowledge.

I wondered how a whole game of Go can be played without human knowledge. The paper explains it as follows: “Self-play with search—using the improved MCTS-based policy to select each move, then using the game-winner z as a sample of the value—may be viewed as a powerful policy evaluation operator.” So, simply by playing games against itself, the winner of each game can be obtained as a training sample. These results are used in the next learning iteration, so the “Yomi” performed by AlphaGo Zero becomes more and more accurate.
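Here is a similarly hedged sketch of that self-play loop, reusing the `search` function above; the `game` helpers are again hypothetical placeholders, and the real system records these samples in a much more elaborate pipeline.

```
import random

def self_play_game(game, policy_value, n_simulations=100):
    """Play one game against itself and return training samples.
    Each sample is (state, mcts_policy, z), where z is +1 if the player
    to move at that state eventually won, -1 if they lost, 0 for a draw."""
    states, policies, players = [], [], []
    state, player = game.initial_state(), +1
    while not game.is_over(state):
        pi = search(state, game, policy_value, n_simulations)  # improved policy
        states.append(state)
        policies.append(pi)
        players.append(player)
        # Sample the next move from the MCTS visit-count distribution.
        moves, probs = zip(*pi.items())
        move = random.choices(moves, weights=probs)[0]
        state = game.next_state(state, move)
        player = -player
    winner = game.winner(state)  # +1, -1, or 0 (hypothetical helper)
    return [(s, p, winner * pl) for s, p, pl in zip(states, policies, players)]
```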

 

 

3. This training algorithm learns very efficiently from scratch

Computers are very good at running simulations many times automatically. So, even without human knowledge provided in advance, AlphaGo Zero becomes stronger and stronger as it performs “self-play” over and over. According to the paper, starting from random play, AlphaGo Zero outperformed the previous version of AlphaGo, the one that beat Lee Sedol in March 2016, after just 72 hours of training. This is incredible: only 72 hours were needed to train, from scratch and without human knowledge, a model that can beat professional players.
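For reference, the paper trains the network by gradient descent on a loss that combines the value error, the cross-entropy between the network policy and the MCTS policy, and L2 weight regularization: l = (z − v)² − πᵀ log p + c‖θ‖², where p and v are the network’s policy and value outputs, π is the MCTS policy, z the game winner and θ the network weights. The tiny NumPy sketch below only shows the shape of that loss for a single position; the names and the constant are illustrative, not the actual implementation.

```
import numpy as np

def alphazero_loss(p, v, pi, z, params, c=1e-4):
    """Loss for one position, roughly (z - v)^2 - pi^T log p + c * ||theta||^2.
    p:      predicted move probabilities from the network
    v:      predicted value in [-1, 1]
    pi:     MCTS visit-count policy (the training target)
    z:      actual game outcome from this player's perspective
    params: flattened network weights, for L2 regularization"""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + 1e-10))
    l2 = c * np.sum(params ** 2)
    return value_loss + policy_loss + l2
```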

 

 

Overall, AlphaGo Zero is incredible. If the AlphaGo Zero training algorithm could be applied to our businesses, an AI professional businessman might be created in 72 hours without human knowledge. That would be amazing!

I hope you have enjoyed this story of how AlphaGo Zero works. This time I gave an overview of the mechanism of AlphaGo Zero. If you are interested in more details, I recommend watching the video by DeepMind. In my next article, I would like to go a little deeper into MCTS and the training of the models. It should be exciting! See you again soon!

 

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel & Demis Hassabis, “Mastering the game of Go without human knowledge”
Published in Nature, vol. 550, 19 October 2017

Notice: TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.


Could you win the game “Go” against computers? They are smarter now!

There are many board games all over the world; I think you enjoy chess, Othello, backgammon, Go, and so on. Go is a game in which two players place white and black stones in turn, and the winner is decided by comparing the areas owned by each player. You may have seen Go before, as in the image above. I learned how to play Go when I was in elementary school and have enjoyed it ever since.

In most board games, even top professional human players now find it difficult to beat artificial intelligence (AI) players. One of the most famous stories about competitions between humans and computers is “Deep Blue vs. Garry Kasparov”, the six-game chess match between chess champion Garry Kasparov and an IBM supercomputer called Deep Blue. In 1997 Deep Blue defeated Garry Kasparov. It was the first win by a computer against a world chess champion under tournament regulations1.

However, Go is still dominated by human players. Top professional Go players are still stronger than AI players, although the AI players keep getting better as their algorithms improve. Crazy Stone is one of the strongest Go-playing engines, developed by Rémi Coulom. On 21 March 2014, at the second annual Densei-sen competition, Crazy Stone defeated Norimoto Yoda, a Japanese professional 9-dan, in a 19×19 game with four handicap stones, by a margin of 2.5 points2. On 17 March 2015, Chikun Cho (the 25th Hon’inbo) defeated Crazy Stone in a 19×19 game with three handicap stones, winning by resignation after 185 moves (a margin of 0.5 points)3. The human player won against the AI player in that game, but the handicap shrank from four stones in 2014 to three stones in 2015. I am not sure whether human players will keep winning this competition in 2016.

For AI players like Crazy Stone, the secret is a technology called “reinforcement learning”, which is used to select actions that maximize future reward. It can therefore also be used to support decision making in areas such as investment management, helicopter control and advertising optimization. Let me look at the details.

 

1. Reinforcement learning can handle delayed rewards

Unlike in a quiz show, in board games it takes time to find out whether each action was good or bad. For example, a Go board has a grid of 19 lines by 19 lines, so at the beginning of a game it is difficult to know whether each move is good or bad, because there is still a long way to go to the end of the game. In other words, the reward for each action is not provided immediately after it is taken. Reinforcement learning has a mechanism to handle such cases, as sketched below.
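One standard ingredient for handling delayed rewards is the discounted return: a reward that arrives far in the future still contributes, with a smaller weight, to the value of an early action. A tiny illustrative sketch, with a made-up reward sequence and discount factor:

```
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed backwards so each step reuses the next step's return."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A whole game of Go gives no reward until the final result:
rewards = [0.0] * 200 + [1.0]          # the win arrives only at the very end
print(discounted_returns(rewards)[0])  # the first move still gets credit: ~0.134
```

Even though the only non-zero reward arrives at the last move, the very first move still receives credit for it.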

2. Reinforcement learning can calculate optimal sequential actions

In reinforcement learning, agents play a major role. An agent takes actions based on its observations and its strategy (policy). Actions can form a whole path, not just a one-off choice. This is similar to our own decision-making process, so reinforcement learning can support human decision making. Note that each action also changes the state of the environment, which in turn affects future rewards.
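As a concrete, deliberately simplified example of learning sequential actions, here is a bare-bones tabular Q-learning loop. The `env` interface is a hypothetical placeholder, and real problems like Go are far too large for a table like this; it only illustrates the idea.

```
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Learn Q(state, action): the expected future reward of taking an action
    and acting well afterwards. Playing greedily with respect to Q then gives
    a whole sequence of actions, not just a single one-off choice."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly follow the current strategy, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(state, action)
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            # Move Q toward the reward plus the discounted value of what follows.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```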

3. Reinforcement learning is flexible enough to use many search methods

This is practically important. In problems like Go, the space to search for optimal actions is huge, so we need to try several methods to explore it. Reinforcement learning is flexible enough to accommodate these search methods, for example the selection rule sketched below.
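As one example of such a search method, here is a small sketch of UCB1 action selection, which balances trying promising actions against trying under-explored ones; the statistics in the example are made up. The same idea, applied inside a tree, is what MCTS variants such as UCT use.

```
import math

def ucb1_select(counts, values, total, c=1.4):
    """UCB1: pick the action with the best 'mean value + exploration bonus'.
    counts[a]: how often action a has been tried, values[a]: its mean reward,
    total: total number of trials so far."""
    def score(a):
        if counts[a] == 0:
            return float("inf")          # always try an untried action first
        return values[a] + c * math.sqrt(math.log(total) / counts[a])
    return max(counts, key=score)

# Example with made-up statistics for three candidate moves:
counts = {"A": 10, "B": 3, "C": 0}
values = {"A": 0.6, "B": 0.4, "C": 0.0}
print(ucb1_select(counts, values, total=13))   # "C": untried, so it is explored first
```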

If you would like to study it in more detail, I recommend the lectures by David Silver, Google DeepMind London, Royal Society University Research Fellow, University College London.

 

In the future, a lot of devices will have sensors in them and be connected to the internet. Each device will periodically send information such as location, temperature and weather, so a massive amount of time-series data will be generated and collected automatically through the internet. Based on these data, we will need sequential actions that maximize rewards. If we have data from automobile engines, we should know when a minor repair is needed and when an overhaul is needed to keep the engines working for a longer period. If we have data from customers, we should know when sales notifications should be sent to maximize the amount of sales in the long run. Reinforcement learning might be used to support these kinds of business decisions.

I would like to develop my own AI Go player that is stronger than I am. It must be fun to play games against it! Would you like to try it?

 Source
1. Deep Blue versus Garry Kasparov
2. Densei-sen (Japanese only)

Note: Toshifumi Kuga’s opinions and analyses are personal views and are intended to be for informational purposes and general interest only, and should not be construed as individual investment advice or a solicitation to buy, sell or hold any security or to adopt any investment strategy. The information in this article is rendered as at the publication date, may change without notice, and is not intended as a complete analysis of every material fact regarding any country, region, market or investment.

Data from third-party sources may have been used in the preparation of this material, and I, the author of the article, have not independently verified or validated such data. I accept no liability whatsoever for any loss arising from the use of this information, and reliance upon the comments, opinions and analyses in the material is at the sole discretion of the user.