<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AlphaZero Archives - Artificial Intelligence</title>
	<atom:link href="https://www.aiuniverse.xyz/tag/alphazero/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aiuniverse.xyz/tag/alphazero/</link>
	<description>Exploring the universe of Intelligence</description>
	<lastBuildDate>Thu, 09 Apr 2020 08:59:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Creator David Silver On AlphaZero&#8217;s (Infinite?) Strength</title>
		<link>https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/</link>
					<comments>https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Thu, 09 Apr 2020 08:59:53 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[AlphaZero]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[DeepMind]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=8079</guid>

					<description><![CDATA[<p>Source: chess.com Making an appearance in&#160;Lex Fridman&#8217;s Artificial Intelligence Podcast, DeepMind&#8217;s&#160;David Silver gave lots of insights into the history of AlphaGo and AlphaZero and deep reinforcement learning <a class="read-more-link" href="https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/">Creator David Silver On AlphaZero&#8217;s (Infinite?) Strength</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Source: chess.com</p>



<p>Making an appearance on&nbsp;Lex Fridman&#8217;s Artificial Intelligence Podcast, DeepMind&#8217;s&nbsp;David Silver offered many insights into the history of AlphaGo and AlphaZero and into deep reinforcement learning in general.</p>



<p>Today, the finals of the Chess.com Computer Chess Championship (CCC) start between Stockfish and Lc0 (Leela Chess Zero). It&#8217;s a clash between a conventional chess engine that implements an advanced alpha–beta search (Stockfish) and a neural-network based engine (Lc0).</p>



<p>One could say that Leela Chess Zero is the open-source version of DeepMind&#8217;s AlphaZero, which controversially crushed Stockfish in a 100-game match (and then repeated the feat). </p>



<p>Even a few years on, the basic concept behind engines like AlphaZero and Leela Zero is breathtaking: learning to play chess just by reinforcement learning from repeated self-play. This idea, and its meaning for the wider world, was discussed in episode 86 of Lex Fridman&#8217;s Artificial Intelligence Podcast, where Fridman had DeepMind&#8217;s David Silver as a guest.</p>



<p>Silver leads the reinforcement learning research group at DeepMind and was lead researcher on AlphaGo and AlphaZero, as well as co-lead on AlphaStar and MuZero. He has done much important work in reinforcement learning, the study of how agents ought to take actions in an environment in order to maximize cumulative reward.</p>



<p>Silver explains: &#8220;The goal is clear: The agent has to take actions, those actions have some effect on the environment, and the environment gives back an observation to the agent saying: This is what you see or sense.&nbsp;One special thing it gives back is called the reward signal: how well it&#8217;s doing in the environment. The reinforcement learning problem is to simply take actions over time so as to maximize that reward signal.&#8221;</p>
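<p>The loop Silver describes (act, observe, receive a reward, repeat) can be sketched in a few lines of Python. This is a generic illustration with a made-up toy environment, not anything from DeepMind:</p>

```python
import random

class Environment:
    """A toy one-dimensional world: the agent starts at 0 and the goal is position 5."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the environment responds with
        # an observation, a reward signal, and a done flag
        self.position += action
        observation = self.position
        reward = 1 if self.position == 5 else 0
        return observation, reward, self.position == 5

env = Environment()
total_reward = 0
for _ in range(100):
    action = random.choice([-1, 1])               # the agent's policy (here: random)
    observation, reward, done = env.step(action)  # act, then observe
    total_reward += reward                        # the signal to maximize
    if done:
        break
```

<p>The reinforcement learning problem, in Silver&#8217;s framing, is to replace the random policy above with one that learns from the reward signal.</p>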



<p>The first part of the podcast is mostly about the board game go and DeepMind&#8217;s successful quest in building a system that can beat the best players in the world—something that had been achieved in many other board games much earlier, including chess. The story was also depicted in a motion picture.</p>



<p>While AlphaGo still used human knowledge to some extent (in the form of patterns from games played by humans), the next step for DeepMind was to create a system that wasn&#8217;t fed such knowledge.&nbsp;The move from go to chess, from AlphaGo to AlphaZero, was an exercise in removing that initial knowledge to see how far self-play alone could go. The ultimate goal is to apply these algorithms in other systems and solve problems in the real world.</p>



<p>The first new system to be developed was a fully self-learning version of AlphaGo, with no prior knowledge but the same algorithm. It beat the original&nbsp;AlphaGo 100-0.</p>



<p>It was then applied in chess (AlphaZero) and Japanese chess (shogi), and in both cases, it beat the best engines in the world.</p>



<p>&#8220;It worked out of the box. There&#8217;s something beautiful about that principle. You can take an algorithm, and not twiddle anything, it just works,&#8221; said Silver.</p>



<p>In one of the most interesting parts of the podcast, Silver suggests that the (already incredibly strong) AlphaZero that crushed Stockfish could become stronger still, with a future version potentially crushing the current one. To be fair, he starts by calling this a falsifiable hypothesis:</p>



<p>&#8220;If someone in the future was to take AlphaZero as an algorithm and run it with greater computational resources than we have available today, then I will predict that they would be able to beat the previous system 100-0. If they were then to do the same thing a couple of years later, that system would beat the previous system 100-0. That process would continue indefinitely throughout at least my human lifetime.&#8221;</p>



<p>Earlier in the podcast, Silver explained the mind-boggling idea of AlphaZero losing to a future generation that benefits from greater computing power and learns from itself even more:&nbsp;</p>



<p>&#8220;Whenever you have errors in a system, how can you remove all of these errors? The only way to address them in any complex system is to give the system the ability to correct its own errors. It must be able to correct them; it must be able to learn for itself when it’s doing something wrong and correct for it.&nbsp;And so it seems to me that the way to correct delusions was indeed to have more iterations of reinforcement learning. (&#8230;)</p>



<p>&#8220;Now if you take that same idea and trace it back all the way to the beginning, it should be able to take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain.&#8221;</p>



<p>There is already a successor to AlphaZero, called MuZero. In this version, the algorithm, combined with tree search, works without even being given the rules of a particular game. Perhaps unsurprisingly, it performs at a superhuman level as well.</p>



<p>Why skip the step of feeding in the rules? Because ultimately DeepMind is working toward systems that can matter in the real world. And, as Silver notes, for that we need to&nbsp;acknowledge that &#8220;The world is a really messy place, and no one gives us the rules.&#8221;</p>
<p>The post <a href="https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/">Creator David Silver On AlphaZero&#8217;s (Infinite?) Strength</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/creator-david-silver-on-alphazeros-infinite-strength/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Reinforcement learning explained</title>
		<link>https://www.aiuniverse.xyz/reinforcement-learning-explained/</link>
					<comments>https://www.aiuniverse.xyz/reinforcement-learning-explained/#respond</comments>
		
		<dc:creator><![CDATA[aiuniverse]]></dc:creator>
		<pubDate>Fri, 07 Jun 2019 07:07:31 +0000</pubDate>
				<category><![CDATA[Reinforcement Learning]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[AlphaGo program]]></category>
		<category><![CDATA[AlphaZero]]></category>
		<category><![CDATA[DeepMind]]></category>
		<category><![CDATA[Reinforcement]]></category>
		<category><![CDATA[Stockfish]]></category>
		<category><![CDATA[TPUs]]></category>
		<guid isPermaLink="false">http://www.aiuniverse.xyz/?p=3602</guid>

					<description><![CDATA[<p>Source:- itworld.com Reinforcement learning uses rewards and penalties to teach computers how to play games and robots how to perform tasks independently You have probably heard about Google <a class="read-more-link" href="https://www.aiuniverse.xyz/reinforcement-learning-explained/">Read More</a></p>
<p>The post <a href="https://www.aiuniverse.xyz/reinforcement-learning-explained/">Reinforcement learning explained</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Source:- itworld.com</p>
<h3>Reinforcement learning uses rewards and penalties to teach computers how to play games and robots how to perform tasks independently</h3>
<p>You have probably heard about Google DeepMind’s AlphaGo program, which attracted significant news coverage when it beat a 2-dan professional Go player in 2015. Later, improved evolutions of AlphaGo went on to beat a 9-dan (the highest rank) professional Go player in 2016, and the #1-ranked Go player in the world in May 2017. A new generation of the software, AlphaZero, was significantly stronger than AlphaGo in late 2017, and not only learned Go but also chess and shogi (Japanese chess).</p>
<p>AlphaGo and AlphaZero both rely on reinforcement learning to train. They also use deep neural networks as part of the reinforcement learning network, to predict outcome probabilities.</p>
<p>In this article, I’ll explain a little about reinforcement learning, how it has been used, and how it works at a high level. I won’t dig into the math, or Markov Decision Processes, or the gory details of the algorithms used. Then I’ll get back to AlphaGo and AlphaZero.</p>
<h2>What is reinforcement learning?</h2>
<p>There are three kinds of machine learning: unsupervised learning, supervised learning, and reinforcement learning. Each of these is good at solving a different set of problems.</p>
<p>Unsupervised learning, which works on a complete data set without labels, is good at uncovering structures in the data. It is used for clustering, dimensionality reduction, feature learning, and density estimation, among other tasks.</p>
<p>Supervised learning, which works on a complete labeled data set, is good at creating classification models for discrete data and regression models for continuous data. The machine learning or neural network model produced by supervised learning is usually used for prediction, for example to answer “What is the probability that this borrower will default on his loan?” or “How many widgets should we stock next month?”</p>
<p>Reinforcement learning trains an <em>actor</em> or <em>agent</em> to respond to an <em>environment</em> in a way that maximizes some <em>value</em>. That’s easier to understand in more concrete terms.</p>
<p>For example, AlphaGo, in order to learn to play (the action) the game of Go (the environment), first learned to mimic human Go players from a large data set of historical games (apprenticeship learning). It then improved its play through trial and error (reinforcement learning), by playing large numbers of Go games against independent instances of itself.</p>
<p>Note that AlphaGo doesn’t try to maximize the size of the win, like dan (black belt)-level human players usually do. It also doesn’t try to optimize the immediate position, like a novice human player would. AlphaGo maximizes the estimated probability of an eventual win to determine its next move. It doesn’t care whether it wins by one stone or 50 stones.</p>
<h2>Reinforcement learning applications</h2>
<p>Learning to play board games such as Go, shogi, and chess is not the only area where reinforcement learning has been applied. Two other areas are playing video games and teaching robots to perform tasks independently.</p>
<p>In 2013, DeepMind published a paper about learning control policies directly from high-dimensional sensory input using reinforcement learning. The applications were seven Atari 2600 games from the Arcade Learning Environment. A convolutional neural network, trained with a variant of Q-learning (one common method for reinforcement learning training), outperformed all previous approaches on six of the games and surpassed a human expert on three of them.</p>
<p>The convolutional neural network’s input was raw pixels and its output was a value function estimating future rewards. The convolutional-neural-network-based value function worked better than more common linear value functions. The choice of a convolutional neural network when the input is an image is unsurprising, as convolutional neural networks were designed to mimic the visual cortex.</p>
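<p>The Q-learning method mentioned above can be illustrated in its simplest tabular form, without the convolutional network. This is a hedged sketch with invented numbers, not the DQN implementation: each update nudges the estimated value of a state-action pair toward the observed reward plus the discounted value of the best next action:</p>

```python
from collections import defaultdict

# Q maps (state, action) pairs to estimated long-term reward; unseen pairs start at 0
Q = defaultdict(float)
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example: in state 0, taking action "right" yielded reward 1.0 and led to state 1
q_update(0, "right", 1.0, 1, actions=["left", "right"])
```

<p>Run once with these example numbers, Q(0, "right") moves from 0 to 0.5; repeated experience keeps refining the table. A deep Q-network replaces the table with a neural network that generalizes across states.</p>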
<p>DeepMind has since expanded this line of research to the real-time strategy game StarCraft II. The AlphaStar program learned StarCraft II by playing against itself to the point where it could almost always beat top players, at least for Protoss versus Protoss games. (Protoss is one of the alien races in StarCraft.)</p>
<p>Robotic control is another problem that has been attacked with deep reinforcement learning methods, meaning reinforcement learning plus deep neural networks, with the deep neural networks often being convolutional neural networks trained to extract features from video frames. Training with real robots is time-consuming, however. To reduce training time, many of the studies start off with simulations before trying out their algorithms on physical drones, robot dogs, humanoid robots, or robotic arms.</p>
<h2>How reinforcement learning works</h2>
<p>We’ve already discussed that reinforcement learning involves an agent interacting with an environment. The environment may have many <em>state variables</em>. The agent performs actions according to a <em>policy</em>, which may change the state of the environment. The environment or the training algorithm can send the agent <em>rewards</em> or <em>penalties</em> to implement the reinforcement. These may modify the policy, which constitutes learning.</p>
<p>For background, this is the scenario explored in the early 1950s by Richard Bellman, who developed dynamic programming to solve optimal control and Markov decision process problems. Dynamic programming is at the heart of many important algorithms for a variety of applications, and the Bellman equation is very much part of reinforcement learning.</p>
<p>A reward signifies what is good immediately. A <em>value</em>, on the other hand, specifies what is good in the long run. In general, the value of a state is the expected sum of future rewards. Action choices—policies—need to be computed on the basis of long-term values, not immediate rewards.</p>
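<p>The expected sum of future rewards is usually discounted, so that near-term rewards count for more than distant ones. A minimal illustration (the reward sequences here are invented):</p>

```python
def discounted_return(rewards, gamma=0.9):
    """Value of a state: the sum over future steps t of gamma**t * reward_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A large delayed reward can outweigh a small immediate one:
delayed = discounted_return([0, 0, 10])   # 10 * 0.9**2, about 8.1
immediate = discounted_return([1, 0, 0])  # exactly 1.0
```

<p>A policy chosen by long-term value therefore waits for the delayed reward here, even though the other sequence pays off sooner.</p>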
<p>Effective policies for reinforcement learning need to balance <em>greed</em> or <em>exploitation</em>—going for the action that the current policy thinks will have the highest value—against <em>exploration</em>, randomly driven actions that may help improve the policy. There are many algorithms to control this, some using exploration a small fraction of the time ε, and some starting with pure exploration and slowly converging to nearly pure greed as the learned policy becomes strong.</p>
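<p>The simplest such scheme is ε-greedy: with probability ε take a random, exploratory action; otherwise take the greedy one. A sketch with a hypothetical value table:</p>

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping each action to its estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: random action
    return max(q_values, key=q_values.get)    # exploit: greedy action

q_values = {"left": 0.2, "right": 0.7, "stay": 0.1}
action = epsilon_greedy(q_values)  # usually "right", occasionally a random action
```

<p>Schedules that decay ε over time give the pure-exploration-to-near-pure-greed behavior described above.</p>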
<p>There are many algorithms for reinforcement learning, both model-based (e.g. dynamic programming) and model-free (e.g. Monte Carlo). Model-free methods tend to be more useful for actual reinforcement learning, because they are learning from experience, and exact models tend to be hard to create.</p>
<p>If you want to get into the weeds with reinforcement learning algorithms and theory, and you are comfortable with Markov decision processes, I’d recommend <em>Reinforcement Learning: An Introduction</em> by Richard S. Sutton and Andrew G. Barto. You want the 2<sup>nd</sup> edition, revised in 2018.</p>
<h2>AlphaGo and AlphaZero</h2>
<p>I mentioned earlier that AlphaGo started learning Go by training against a database of human Go games. That bootstrap got its deep-neural-network-based value function working at a reasonable strength.</p>
<p>For the next step in AlphaGo’s training, it played against itself—a lot—and used the game results to update the weights in its value and policy networks. That made the strength of the program rise above most human Go players.</p>
<p>At each move while playing a game, AlphaGo applies its value function to every legal move at that position, to rank them in terms of probability of leading to a win. Then it runs a Monte Carlo tree search algorithm from the board positions resulting from the highest-value moves, picking the move most likely to win based on those look-ahead searches. It uses the win probabilities to weight the amount of attention it gives to searching each move tree.</p>
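<p>That prior-weighted search can be illustrated with a PUCT-style selection rule similar in spirit to the one AlphaGo uses. This is a simplified sketch with invented visit statistics, not DeepMind’s implementation: moves with a high network prior or a high average value are searched more, with a bonus for rarely tried moves:</p>

```python
import math

def select_move(moves):
    """moves: list of dicts with prior probability p, visit count n, and total value w."""
    total_visits = sum(m["n"] for m in moves)
    c = 1.5  # exploration constant

    def puct(m):
        q = m["w"] / m["n"] if m["n"] else 0.0                   # average value so far
        u = c * m["p"] * math.sqrt(total_visits) / (1 + m["n"])  # prior-weighted bonus
        return q + u

    return max(moves, key=puct)

moves = [
    {"name": "e4", "p": 0.6, "n": 10, "w": 6.0},
    {"name": "d4", "p": 0.3, "n": 2,  "w": 1.4},
    {"name": "h4", "p": 0.1, "n": 0,  "w": 0.0},
]
best = select_move(moves)
```

<p>With these numbers the rule picks d4: its average value is high, and its low visit count earns it a large exploration bonus despite a smaller prior.</p>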
<p>The later AlphaGo Zero and AlphaZero programs skipped training against the database of human games. They started with no baggage except for the rules of the game and reinforcement learning. At the beginning they played random moves, but after learning from millions of games against themselves they played very well indeed. AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days.</p>
<p>AlphaZero, as I mentioned earlier, was generalized from AlphaGo Zero to learn chess and shogi as well as Go. According to DeepMind, the amount of reinforcement learning training the AlphaZero neural network needs depends on the style and complexity of the game, taking roughly nine hours for chess, 12 hours for shogi, and 13 days for Go, running on multiple TPUs. In chess, AlphaZero’s move guidance is much better than that of conventional chess-playing programs, greatly reducing the tree space it needs to search. AlphaZero only needs to evaluate tens of thousands of positions per decision, versus tens of millions for Stockfish, the strongest handcrafted chess engine.</p>
<p>These board games are not easy to master, and AlphaZero’s success says a lot about the power of reinforcement learning, neural network value and policy functions, and guided Monte Carlo tree search. It also says a lot about the skill of the researchers, and the power of TPUs.</p>
<p>Robotic control is a harder AI problem than playing board games or video games. As soon as you have to deal with the physical world, unexpected things happen. Nevertheless, there has been progress on this at a demonstration level, and the most powerful approaches currently seem to involve reinforcement learning and deep neural networks.</p>
<p>The post <a href="https://www.aiuniverse.xyz/reinforcement-learning-explained/">Reinforcement learning explained</a> appeared first on <a href="https://www.aiuniverse.xyz">Artificial Intelligence</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aiuniverse.xyz/reinforcement-learning-explained/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
