Machine Learning Pwns Old-School Atari Games

26 Feb - by aiuniverse - In Machine Learning

Source – https://www.scientificamerican.com/

You can call it the ‘revenge of the computer scientist.’ An algorithm that made headlines for mastering the notoriously difficult Atari 2600 game Montezuma’s Revenge can now beat more games, achieve near-perfect scores, and help robots explore real-world environments. Pakinam Amer reports.

This is Scientific American’s 60 Second Science. I’m Pakinam Amer.

Whether you’re a pro gamer or you dip your toes in that world every once in a while, chances are you’ve gotten stuck in a video game at least once, or were even gloriously defeated by one.

I know I have.

Maybe, in your frustration, you kicked the console a little. Maybe you took it out on the controllers or—if you’re an ’80s kid like me—made the joystick pay.

Now, a group of computer scientists from UberAI are taking revenge for all of us who’ve been in this situation before.     

Using a family of simple algorithms dubbed ‘Go-Explore’, they went back and beat some of the most notoriously difficult Atari games, whose chunky blocks of pixels and 8-bit tunes had once challenged, taunted and even enraged us.

<swish>

But what does revisiting those games from the 80s and 90s accomplish, besides fulfilling a childhood fantasy?

According to the scientists, who published their work in Nature, experimenting with video games that require complex, hard exploration leads to better learning algorithms: they become more intelligent and perform better in real-world scenarios.

Joost Huizinga: One of the nice things of Go-Explore is that it’s not just limited to video games, but that you can also apply it to practical applications like robotics.

That was Joost Huizinga, one of the principal researchers at UberAI. Joost developed Go-Explore with Adrien Ecoffet and other scientists.

So how does it actually work?

Let’s start with the basics. When an AI processes images of the world in the form of pixels, it does not know which changes should count and which should be ignored. For instance, a slight change in the pattern of the clouds in the sky of a game environment is probably unimportant when exploring that game, but finding a missing key certainly is. To the AI, though, both involve changing a few pixels in that world.

This is where deep reinforcement learning comes in. It’s an area of machine learning that helps an agent analyze an environment to decide what matters and which actions count through feedback signals in the form of extrinsic and intrinsic rewards.

Joost Huizinga: This is something that animals, basically, constantly do. You can imagine, if you touch a hot stove, you immediately get strong negative feedback like, ‘hey, this is something you shouldn’t do in the future.’ If you eat a bar of chocolates, assuming you like chocolates, you immediately get a positive feedback signal like, ‘hey, maybe I should seek out chocolate more in the future.’ The same is true for machine learning. These are problems where the agent has to take some actions, and then maybe it wins a game.
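That feedback loop—act, receive a reward signal, adjust future behavior—can be sketched in a few lines. The following is a toy illustration of reinforcement learning (tabular Q-learning on a five-cell corridor), not the researchers' code; the environment, parameters, and function names are all invented for the example.

```python
import random

# Toy reinforcement learning: a 5-cell corridor where the agent starts
# at cell 0 and gets a reward of +1 only upon reaching cell 4
# (the "bar of chocolate"). Everything else gives no feedback.
N_STATES = 5
ACTIONS = [-1, +1]  # step left or step right

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: mostly exploit what was learned, sometimes explore
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(s, act)])
            s2, r, done = step(s, a)
            # Q-learning update: nudge the estimate toward
            # reward + discounted value of the best next action
            target = r + (0.0 if done else gamma * max(q[(s2, b)] for b in ACTIONS))
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q

q = train()
# After training, the learned policy steps right from every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

The sparse reward here (feedback only at the very end) is exactly the regime Joost describes next as hard for plain reinforcement learning.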

Creating an algorithm that can navigate rooms with traps, obstacles to jump over, rewards to collect and pitfalls to avoid, means that you have to create an artificial intelligence that is curious and that can explore an environment in a smart way.

This helps it decide what brings it closer to a goal, or how to collect hard-to-get treasures.

Reinforcement learning is great for that but it isn’t perfect in every situation.

Joost Huizinga: In practice, reinforcement learning works very well, if you have very rich feedback, if you can tell, ‘hey, this move is good, that move is bad, this move is good, that move is bad.’

In Atari games like Montezuma’s Revenge, the game environment offers little feedback and its rewards can intentionally lead to dead ends. Randomly exploring the space just doesn’t cut it.

Joost Huizinga: You could imagine, and this is especially true in video games like Montezuma’s Revenge, that sometimes you have to take a lot of very specific actions, you have to dodge hazards, jump over enemies, you can imagine that random actions like, ‘hey, maybe I should jump here,’ in this new place, is just going to lead to a ‘Game Over’ because that was a bad place to jump … especially if you’re already fairly deep into the game. So let’s say you want to explore level two, if you start taking random actions in level one and just randomly dying, you’re not going to make progress on exploring level two.

You can’t rely on ‘intrinsic motivation’ alone, which in the context of artificial intelligence typically comes from exploring new or unusual situations.

Joost Huizinga: Let’s say you have a robot and it can go left into the house and right into the house, let’s say at first it goes left, it explores left, meaning that it gets this intrinsic reward for a while. It doesn’t quite finish exploring left and at some point, the episode ends and it starts anew in the starting room. This time it goes right, it goes fairly far into the room on the right, it doesn’t quite explore it. And then it goes back to the starting room. Now the problem is because it has gone both left and right and basically it’s already seen the start, it no longer gets as much intrinsic motivation from going there.

In short, it stops exploring before the job is done, and counts that as a win.
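A common way to formalize this kind of intrinsic motivation is a count-based novelty bonus. The snippet below is a generic sketch of that idea, not Uber's implementation; the `1/sqrt(count)` form is one standard choice, and the state names are made up. It shows exactly the failure Joost describes: the more often the starting room is revisited, the smaller the reward for being there.

```python
import math
from collections import Counter

# Count-based novelty bonus: intrinsic reward shrinks as 1/sqrt(visits).
# An agent that keeps restarting in the same room gradually loses its
# motivation to pass through that room again.
visit_counts = Counter()

def intrinsic_reward(state):
    visit_counts[state] += 1
    return 1.0 / math.sqrt(visit_counts[state])

# The starting room is seen at the beginning of every episode...
bonuses = [intrinsic_reward("start_room") for _ in range(4)]
print([round(b, 2) for b in bonuses])  # bonus shrinks: [1.0, 0.71, 0.58, 0.5]
```

Once the bonus for the start room (and the paths through it) has decayed, a purely novelty-driven agent has little pull toward the half-explored rooms beyond it.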

Abandoning a previously visited place once its novelty reward has been collected doesn’t work in difficult games, because you might leave important clues behind.

Go-Explore gets around this by not relying solely on rewarding actions such as going somewhere new.

Instead, it encourages “sufficient exploration” of a space, even with few or no hints, by enabling its agent to explicitly ‘remember’ promising places or states in a game.

Once the agent has a record of that state, it can reload it and explore from there intentionally, following what Adrien and Joost call the “first return, then explore” principle.
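The "first return, then explore" loop can be sketched in miniature. This is a heavily simplified illustration in the spirit of Go-Explore, not the published algorithm: the "game" is just a number line with a goal at position 10, the archive keys are raw positions rather than downscaled game frames, and the cell-selection rule (always pick the furthest cell) is a stand-in for the paper's weighted selection.

```python
import random

# Sketch of "first return, then explore": keep an archive of visited
# states ("cells"), deterministically return (reload) to a remembered
# state, then take exploratory actions from there.
rng = random.Random(0)
GOAL = 10

# archive maps cell -> shortest known trajectory of actions reaching it
archive = {0: []}

def explore_from(cell, trajectory, n_steps=3):
    """Take a few random actions from a reloaded state; return cells found."""
    found = []
    pos, traj = cell, list(trajectory)
    for _ in range(n_steps):
        action = rng.choice([-1, +1])
        pos += action
        traj.append(action)
        found.append((pos, list(traj)))
    return found

for _ in range(200):
    # Phase 1, "return": pick a promising cell (here: the furthest one)
    # and reload it exactly, instead of replaying from scratch and
    # randomly dying on the way, as Joost described earlier.
    cell = max(archive)
    # Phase 2, "explore": act from that remembered state, and archive
    # any new cell (or any shorter route to a known cell).
    for new_cell, traj in explore_from(cell, archive[cell]):
        if new_cell not in archive or len(traj) < len(archive[new_cell]):
            archive[new_cell] = traj
    if GOAL in archive:
        break

print(GOAL in archive)
```

The key design point is that returning is deterministic (reload the saved state), so all of the randomness budget is spent on genuinely new territory at the frontier.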

According to Adrien, by leaning on another form of learning called imitation learning, in which agents mimic demonstrated solutions, their AI can go a long way, especially in the field of robotics.

Adrien Ecoffet: You have a difference between the world that you can train in and the real world. So one example would be if you’re doing robotics … you know, in robotics, it’s possible to have simulations of your robotics environments. But then, of course, you want your robot to run in the real world, right? And so what you can do, then? If you’re in a situation like that, of course, the simulation is not exactly the same as the environment, so just having something that works in simulation is not necessarily sufficient. We show that in our work … What we’re doing is that we’re using existing algorithms that are called ‘imitation learning’. And what it is, is it just takes an existing solution to a problem and just makes sure that you can reliably use that solution, even when, you know, there are slight variations in your environment, including, you know, it being the real world rather than a simulation.
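Adrien's point—take an existing solution and make it hold up under slight variations of the environment—can be illustrated with the simplest possible imitation learner. This is a toy sketch, not the algorithms used in the paper: the demonstration data and the nearest-neighbor policy are invented for the example.

```python
# Imitation learning in miniature: record a demonstrated solution as
# (state, action) pairs, then have a policy mimic it by looking up the
# nearest demonstrated state. Because the lookup tolerates small
# differences, the cloned behavior still works when the observed state
# is slightly perturbed (e.g., the real world vs. the simulation).
demo = [(0.0, +1), (1.0, +1), (2.0, +1), (3.0, -1)]  # toy demonstration

def imitation_policy(state):
    # pick the action taken at the closest demonstrated state
    nearest_state, action = min(demo, key=lambda sa: abs(sa[0] - state))
    return action

# Even with noise in the observed state, the behavior holds up.
print(imitation_policy(1.9))   # near demonstrated state 2.0 -> +1
print(imitation_policy(3.15))  # near demonstrated state 3.0 -> -1
```

In the Go-Explore setting, the "demonstration" is the high-scoring trajectory the exploration phase found in simulation; imitation learning then robustifies it so small real-world deviations don't derail it.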

Adrien and Joost say their model’s strength lies in its simplicity.

It can be adapted and expanded easily into real-life applications such as language learning or drug design.

That was 60 Second Science, and this is Pakinam Amer. Thank you for listening.
