In October last year, Google DeepMind released “AlphaGo Zero”. It is stronger than all previous versions of AlphaGo, even though this new version uses no human knowledge of Go for training. It plays against itself and gets stronger on its own. I was very surprised to hear the news, because in general we need a lot of data to train a model.
Today, I would like to consider why AlphaGo Zero works so well from the viewpoint of a Go player, as I have played the game for fun for many years. I am not a professional Go player, but I have experience with both Go and deep learning, so this is a good opportunity for me to think it through.
When I play Go, I often decide my next move by intuition, because I feel very confident that “it is right”. But in a more complex position, when I am not sure what the best move is, I try out in my mind (not on the real board) many of the paths that my opponent and I could take turn by turn, and I choose the best move based on those trials. We call this “Yomi” in Japanese. Unfortunately, I sometimes perform “Yomi” incorrectly and then choose the wrong move. Professional Go players perform “Yomi” much more accurately than I do. This is the key to being a strong Go player.
So I wonder how AlphaGo Zero can perform “Yomi” effectively. I think this is the key to understanding AlphaGo Zero. Let me consider the following points.
1. Monte Carlo tree search (MCTS) performs “Yomi” effectively.
The next move can be chosen directly from the policy/value function. But there might be a better move, so we need to search for it. AlphaGo Zero uses MCTS for this search. According to the paper, MCTS can find a better move than the one originally suggested by the policy/value function. DeepMind says MCTS works as a “powerful policy improvement operator” and that an “improved MCTS-based policy” can be obtained. This is great, as it means that AlphaGo Zero can perform “Yomi” just like us.
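To make the idea of “Yomi” by search more concrete, here is a minimal sketch in Python. It is not DeepMind's implementation. I assume a toy game instead of Go (a pile of stones, each player takes 1 or 2, and whoever takes the last stone wins), and the hypothetical function `evaluate` stands in for the policy/value function by returning uniform priors and a value of 0, that is, no knowledge at all. The selection rule follows the PUCT style described in the paper, with a small variation (`total + 1` under the square root) so that the priors matter on the first visit.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # prior probability of the move leading here
        self.visits = 0         # how many simulations went through this move
        self.value_sum = 0.0    # summed value, seen from the parent's player
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def evaluate(stones):
    """Hypothetical stand-in for the policy/value function (knows nothing)."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def simulate(node, stones, c_puct=1.5):
    """One simulation; returns the value for the player to move at `stones`."""
    if stones == 0:
        return -1.0  # the opponent took the last stone, so we have lost
    if not node.children:  # leaf: expand it and use the evaluation
        priors, value = evaluate(stones)
        for move, p in priors.items():
            node.children[move] = Node(p)
        return value
    total = sum(ch.visits for ch in node.children.values())

    def score(item):
        _, ch = item
        bonus = c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
        return ch.q() + bonus

    move, child = max(node.children.items(), key=score)
    value = -simulate(child, stones - move, c_puct)  # flip the point of view
    child.visits += 1
    child.value_sum += value
    return value

def search(stones, n_simulations):
    """Return the visit-count distribution over moves (the improved policy)."""
    root = Node(1.0)
    for _ in range(n_simulations):
        simulate(root, stones)
    total = sum(ch.visits for ch in root.children.values())
    return {m: ch.visits / total for m, ch in root.children.items()}
```

Even though `evaluate` is uniform, the visit counts returned by `search` concentrate on the move that leaves the opponent with a multiple of 3 stones, which is the winning move in this toy game. This is the sense in which search improves on the raw policy.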
2. A game can be played to the end by self-play without human knowledge.
I wondered how a whole game of Go can be played without human knowledge. The paper explains it as follows: “Self-play with search—using the improved MCTS-based policy to select each move, then using the game-winner z as a sample of the value—may be viewed as a powerful policy evaluation operator.” So, just by playing games against itself, AlphaGo Zero obtains the winner of each game as a sample. These results are used in the next round of learning. Therefore, “Yomi” by AlphaGo Zero becomes more and more accurate.
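The way a finished game turns into training samples can be sketched as follows. Again this is only an illustration under my own assumptions: the same toy stone-taking game as a stand-in for Go, and a hypothetical `search_policy` function that plays the role of the improved MCTS-based policy (here it is simply uniform). Every position in the game is labelled with z = +1 if the player to move at that position went on to win, and z = -1 otherwise.

```python
import random

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def search_policy(stones):
    """Hypothetical stand-in for the MCTS-improved policy (uniform here)."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}

def self_play_game(start_stones, rng):
    """Play one game against itself and return (stones, pi, z) samples."""
    history = []                      # (stones, player to move, pi)
    stones, player = start_stones, 0
    while stones > 0:
        pi = search_policy(stones)
        history.append((stones, player, pi))
        moves = list(pi)
        move = rng.choices(moves, weights=[pi[m] for m in moves])[0]
        stones -= move
        player = 1 - player
    winner = 1 - player               # the player who took the last stone
    # z is the final result seen from the player to move at each position
    return [(s, pi, 1 if p == winner else -1) for s, p, pi in history]

samples = self_play_game(7, random.Random(0))
for stones, pi, z in samples:
    print(stones, pi, z)
```

No human game record appears anywhere in this loop. The only signal is who won, which is why the winner z can serve as a sample of the value.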
3. This training algorithm is very efficient at learning from scratch.
Computers are very good at running simulations automatically, over and over again. So even without any human knowledge in advance, AlphaGo Zero can get stronger and stronger as it repeats self-play many times. According to the paper, starting from random play, AlphaGo Zero outperformed the previous version of AlphaGo that beat Lee Sedol in March 2016 after just 72 hours of training. This is incredible, because only 72 hours were required to develop, from scratch and without human knowledge, a model that beats professional players.
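Each round of self-play is followed by a training step. As far as I understand the paper, the network output (p, v) is pulled toward the search policy π and the game winner z with a loss of the form (z - v)^2 - π·log p + c‖θ‖^2. The small function below is only my sketch of that formula for a single position. The function name and the plain-list arguments are my own choices for illustration, not something from DeepMind's code.

```python
import math

def alphago_zero_loss(z, v, pi, p, weights=(), c=1e-4):
    """Loss for one position: value error + policy cross-entropy + L2 penalty.

    z: game result from the current player's view (+1 or -1)
    v: value predicted for the position
    pi: search (MCTS) probabilities over moves
    p: move probabilities predicted by the policy
    weights: model parameters, used only for the L2 penalty
    """
    value_loss = (z - v) ** 2
    # Terms with pi_a == 0 contribute nothing, so skip them to avoid log(0).
    policy_loss = -sum(pi_a * math.log(p_a)
                       for pi_a, p_a in zip(pi, p) if pi_a > 0)
    l2_penalty = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty

# The loss shrinks as v approaches z and p approaches pi.
print(alphago_zero_loss(1.0, 0.0, [0.5, 0.5], [0.5, 0.5]))
print(alphago_zero_loss(1.0, 0.9, [0.9, 0.1], [0.9, 0.1]))
```

Repeating the cycle of self-play, search and this kind of update is what lets the model improve without any human game records.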
Overall, AlphaGo Zero is incredible. If the AlphaGo Zero training algorithm could be applied to our businesses, a professional AI businessman might be created in 72 hours without human knowledge. That would be incredibly sophisticated!
I hope you enjoyed the story of how AlphaGo Zero works. This time I gave an overview of the mechanism of AlphaGo Zero. If you are interested in more details, I recommend watching the video by DeepMind. In my next article, I would like to go a little deeper into MCTS and the training of the models. It should be exciting! See you again soon!
Notice: TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.