DeepMind hopes to teach AI to cooperate by playing Diplomacy
DeepMind, the Alphabet-backed machine learning lab that’s tackled chess, Go, StarCraft II, Montezuma’s Revenge, and beyond, believes the board game Diplomacy could motivate a promising new direction in reinforcement learning research. In a paper published on the preprint server arXiv.org, the firm’s researchers describe an AI system that achieves high scores in Diplomacy while yielding “consistent improvements.”
AI systems have achieved strong competitive play in complex, large-scale games like Hex, shogi, and poker, but the bulk of these are two-player zero-sum games where a player can win only by causing another player to lose. That doesn’t necessarily reflect the real world; tasks like route planning around congestion, contract negotiations, and interacting with customers all involve compromise and consideration of how the preferences of group members coincide and conflict. Even when AI software agents are self-interested, they might gain by coordinating and cooperating, so interacting within diverse groups requires complex reasoning about others’ goals and motivations.
The game Diplomacy forces these interactions by tasking seven players with controlling multiple units on a province-level map of Europe. Each turn, all players move all their units simultaneously, and one unit may support another unit owned by the same or another player to allow it to overcome resistance by other units. (Alternatively, units, which all have equal strength, can hold a province or move to an adjacent one.) Of the map’s 75 provinces, 34 are supply centers, and units capture a supply center by occupying its province. Owning more supply centers allows a player to build more units, and the game is won by owning a majority of the supply centers.
Due to the interdependencies between units, players must negotiate the moves of their own units. They stand to gain by coordinating their moves with those of other players, and they must anticipate how other players will act and reflect these expectations in their actions.
“We propose using games like Diplomacy to study the emergence and detection of manipulative behaviors … to make sure that we know how to mitigate such behaviors in real-world applications,” the coauthors wrote. “Research on Diplomacy could pave the way towards creating artificial agents that can successfully cooperate with others, including handling difficult questions that arise around establishing and maintaining trust and alliances.”
DeepMind focused on the “no press” variant of Diplomacy, where no explicit communication is allowed. It trained reinforcement learning agents, meaning agents that take actions to maximize some reward, using an approach called Sampled Best Responses (SBR), which handles the large number of actions (10⁶⁴) players can take in Diplomacy. SBR is paired with a policy iteration technique that approximates both the best responses to players’ actions and fictitious play.
At each iteration, DeepMind’s system creates a data set of games, with actions chosen by a module called an improvement operator that uses a previous strategy (policy) and value function to find a policy that defeats the previous policy. It then trains the policy and value functions to predict the actions the improvement operator will choose as well as the game results.
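The loop described above can be illustrated on a much smaller game. What follows is a minimal sketch, not DeepMind's implementation: it assumes rock-paper-scissors in place of Diplomacy, an exact best response in place of the learned improvement operator, and a table of action frequencies in place of the neural policy (there is no value network at all). The names `RPS_PAYOFF`, `improvement_operator`, and `iterate_policy` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence

# Row player's payoff in rock-paper-scissors (0 = rock, 1 = paper, 2 = scissors).
# This toy game is an assumption standing in for Diplomacy.
RPS_PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def improvement_operator(payoff: Sequence[Sequence[float]],
                         policy: Sequence[float]) -> int:
    """Return an action that does best against the previous policy.

    Stand-in for the article's improvement operator: here it is an exact
    best response computed from the payoff table.
    """
    expected = [sum(p * q for p, q in zip(row, policy)) for row in payoff]
    return max(range(len(payoff)), key=lambda a: expected[a])


def iterate_policy(payoff: Sequence[Sequence[float]],
                   iterations: int) -> List[float]:
    """Alternate between generating data with the operator and refitting."""
    n = len(payoff)
    policy = [1.0 / n] * n      # start from a uniform policy
    dataset_counts = [0] * n    # how often the operator chose each action
    for _ in range(iterations):
        # 1. Generate data: ask the operator for an action that beats the policy.
        action = improvement_operator(payoff, policy)
        dataset_counts[action] += 1
        # 2. "Train": fit the policy to every action the operator has chosen so far.
        total = sum(dataset_counts)
        policy = [count / total for count in dataset_counts]
    return policy


if __name__ == "__main__":
    print(iterate_policy(RPS_PAYOFF, 3000))
```

Because the policy is refit to all of the operator's past choices, this toy version behaves like fictitious play, and in rock-paper-scissors the frequencies drift toward an even mix of the three moves.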
SBR identifies policies that maximize the expected return for the system’s agents against opponents’ policies. It is coupled with Best Response Policy Iteration (BRPI), a family of algorithms tailored to using SBR in many-player games. The most sophisticated of these trains the policy to predict only the latest best response and explicitly averages historical checkpoints to provide the current empirical strategy.
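The sampling idea behind SBR can be sketched in a few lines. This is a hedged illustration under stated assumptions rather than the paper's code: instead of searching every possible action, it draws a handful of candidate actions and a handful of opponent actions, scores each candidate by its average value against the sampled opponents, and keeps the best one. The function `sampled_best_response`, its parameters, and the rock-paper-scissors helper `rps_value` are all hypothetical names introduced here, and the one-shot toy game stands in for a full Diplomacy turn.

```python
import random
from typing import Callable, Hashable, List

Action = Hashable

# Each move beats the move it maps to (toy game, an assumption for illustration).
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def rps_value(mine: Action, theirs: Action) -> float:
    """Toy value function: +1 for a win, -1 for a loss, 0 for a draw."""
    if mine == theirs:
        return 0.0
    return 1.0 if BEATS[mine] == theirs else -1.0


def sampled_best_response(
    candidate_policy: Callable[[random.Random], Action],
    opponent_policy: Callable[[random.Random], Action],
    value_fn: Callable[[Action, Action], float],
    num_candidates: int,
    num_opponent_samples: int,
    rng: random.Random,
) -> Action:
    """Pick the sampled candidate with the highest average estimated value."""
    # Sample opponent actions once so every candidate faces the same set.
    opponent_samples: List[Action] = [
        opponent_policy(rng) for _ in range(num_opponent_samples)
    ]
    # Sample a small set of candidates instead of enumerating all actions.
    candidates: List[Action] = [
        candidate_policy(rng) for _ in range(num_candidates)
    ]

    def estimated_value(action: Action) -> float:
        total = sum(value_fn(action, other) for other in opponent_samples)
        return total / len(opponent_samples)

    return max(candidates, key=estimated_value)


if __name__ == "__main__":
    rng = random.Random(0)
    moves = list(BEATS)
    choice = sampled_best_response(
        candidate_policy=lambda r: r.choice(moves),
        opponent_policy=lambda r: "rock",   # opponent that always plays rock
        value_fn=rps_value,
        num_candidates=20,
        num_opponent_samples=5,
        rng=rng,
    )
    print(choice)
```

The result is only as good as the candidates that happen to be sampled, which is the trade-off that makes the approach workable when the full action space is far too large to enumerate.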
To evaluate the system’s performance, DeepMind measured the head-to-head win rates against six agents from different algorithms and against a population of six players independently drawn from a reference corpus. They also considered “meta-games” between checkpoints of one training run to test for consistent improvement and examined the exploitability (the margin by which an adversary would defeat a population of agents) of the game-playing agents.
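For intuition about exploitability, a simplified two-player version is easy to compute. The sketch below assumes a symmetric zero-sum game (rock-paper-scissors again, not Diplomacy), where a perfectly balanced policy cannot be beaten on average, so exploitability reduces to the best average payoff any single adversary move earns against a fixed policy. The paper's measure concerns seven-player games and populations of agents, which this toy does not reproduce; `RPS_PAYOFF` and `exploitability` are hypothetical names.

```python
from typing import Sequence

# Row player's payoff in rock-paper-scissors (0 = rock, 1 = paper, 2 = scissors).
RPS_PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def exploitability(payoff: Sequence[Sequence[float]],
                   policy: Sequence[float]) -> float:
    """Best expected payoff an adversary can earn against a fixed policy.

    Valid as an exploitability measure only under the assumption of a
    symmetric two-player zero-sum game, whose value is zero.
    """
    return max(
        sum(p * q for p, q in zip(row, policy))  # adversary plays this row
        for row in payoff
    )


if __name__ == "__main__":
    print(exploitability(RPS_PAYOFF, [1.0, 0.0, 0.0]))        # always rock
    print(exploitability(RPS_PAYOFF, [1 / 3, 1 / 3, 1 / 3]))  # even mix
```

An agent that always plays rock loses every round to an adversary that plays paper, while an even mix of the three moves gives the adversary nothing to gain on average.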
The system’s win rates weren’t especially high — averaged over five seeds of each game, they ranged between 12.7% and 32.5% — but DeepMind notes that they represent a large improvement over agents trained with supervised learning. Against one algorithm in particular — DipNet — in a 6-to-1 game, where six of the agents were controlled by DeepMind’s system, the win rates of DeepMind’s agents improved steadily through training.
In future work, the researchers plan to investigate ways to reduce the agents’ exploitability and build agents that reason about the incentives of others, potentially through communication. “Using [reinforcement learning] to improve game-play in … Diplomacy is a prerequisite for investigating the complex mixed motives and many-player aspects of this game … Beyond the direct impact on Diplomacy, possible applications of our method include business, economic, and logistics domains … In providing the capability of training a tactical baseline agent for Diplomacy or similar games, this work also paves the way for research into agents that are capable of forming alliances and use more advanced communication abilities, either with other machines or with humans.”