How Would A Robotic Machine Learning Velociraptor Learn To Play Goalie?
The 1.5-meter, silvery gray velociraptor lunges forward, interrupting the flight of the tennis ball with its head before the ball can get to the soccer net at the end of the gym. Its tail stretches out, stopping another ball. It pivots, somewhat clumsily, and runs three steps in the other direction to intercept a third ball.
It’s been doing this for an hour, running back and forth as a trio of tennis ball machines toss yellow balls in various loopy ways toward the net. It’s a game that its creators have invented to rapidly improve its coordination.
But then it stops trying to intercept the balls, although it still twitches toward them. It looks around and trots off to a 60-centimeter-high block in a corner of the big room. The block has a power cable plugged into a nearby outlet. The velociraptor walks over to the block, squats down on top of it and then closes its eyes. The engineers sweep up the scattered tennis balls and return them to the hoppers of the machines.
Two hours later, it opens its eyes. One of the engineers flips a switch and the tennis balls start flying again. The velociraptor leaps back into the fray. This time, it’s noticeably smoother when it pivots. It stops more balls than it did before. And it takes a bit longer before it takes a break to rest and recharge.
This is a story from the evolution of Plastic Dinosaur aka PD, a fictional robotic velociraptor that the authors, David Clement, Principal at Wavesine and Co-Founder of Senbionic, and Michael Barnard, are using to explore aspects of machine learning. Read the first articles in the series for the mechanical and neural net architecture of the robot. Suffice it to say for now that it has an aluminum and plastic skeleton, electrically powered actuators, a lithium-ion battery that induction charges off of that 60-centimeter block, a lot of sensors, a smart cloth wrapping of ‘skin’ that has more sensors, and a trio of hypothetical neural nets that we are calling cerebellumnet, amygdalanet, and curiousnet. The first is PD’s autonomic nervous system and motor control neural net. The second is PD’s decision-making and fight or flight neural net. The third is the ‘rest of the brain’ that wants to explore new things and has most of the pattern-matching abilities for things outside of PD’s body.
PD and its creators have been playing a game. The game is simple. PD is the goalkeeper. The goal is a soccer net. The balls are tennis balls hurled from tennis ball machines. PD is rewarded when it intercepts a ball before it hits the net and gets a tiny punishment when a ball hits the net. Yes, punishing an AI dinosaur is going to go well.
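To make the reward scheme concrete, here is a minimal sketch of how the game could be scored. The specific values are our assumption for illustration; the only thing the game description fixes is that an intercept earns a reward and a ball in the net earns a much smaller punishment. The names game_reward and session_return are hypothetical.

```python
# Hypothetical reward values: the game only specifies "rewarded" for a save
# and a "tiny punishment" for a ball that reaches the net.
INTERCEPT_REWARD = 1.0
GOAL_PENALTY = -0.1


def game_reward(intercepted: bool) -> float:
    """Reward for a single ball: positive for a save, slightly negative for a goal."""
    return INTERCEPT_REWARD if intercepted else GOAL_PENALTY


def session_return(outcomes: list[bool]) -> float:
    """Total reward over a training session, one boolean per ball thrown."""
    return sum(game_reward(saved) for saved in outcomes)


if __name__ == "__main__":
    # Two saves and one goal conceded.
    print(session_return([True, True, False]))
```

Keeping the punishment small relative to the reward is a design choice in this sketch: it nudges the learner toward stopping balls without making it overly cautious about trying.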
This little story is part way through the training process. Previously, they ran a bunch of virtual exercises to get PD’s neural nets to intercept virtual balls, but as always there’s a gap between simulation and reality. They’ve iterated through simulation and reality a few times, and at the beginning of this little story, PD has the right idea and is intercepting moving tennis balls in the ‘real’ world.
As PD’s battery charge dwindles, cerebellumnet is paying attention. At a certain point it starts sending out the “I’m hungry” signal. That signal gets louder and louder. Eventually, it takes priority over curiousnet’s and amygdalanet’s attention to the game. Curiousnet looks around until it identifies the charging block and says “It’s over there” but is still attracted to the moving balls. It’s torn between the competing impulses. Amygdalanet decides that it’s time to “Go over there” and says so. Cerebellumnet turns PD around and trots it over to the block. Amygdalanet keeps them from going too fast and running into the wall because of the fear impulse. Curiousnet sees what the alignment needs to be and keeps sending refining signals until they are settled on the block. Then cerebellumnet turns on the charging and ‘sleeping’ cycle.
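The way the “I’m hungry” signal eventually drowns out the game can be sketched as a simple contest between two drives. This is a toy illustration, not PD’s actual architecture: the 30% battery threshold, the linear ramp, the fixed game_interest value and the function names are all assumptions we are making for the example.

```python
def hunger_signal(battery_fraction: float, threshold: float = 0.3) -> float:
    """Silent above the (assumed) threshold, then rising linearly to 1.0 at empty."""
    if battery_fraction >= threshold:
        return 0.0
    return (threshold - battery_fraction) / threshold


def choose_behavior(battery_fraction: float, game_interest: float = 0.5) -> str:
    """Pick whichever drive is currently louder."""
    if hunger_signal(battery_fraction) > game_interest:
        return "go_to_charger"
    return "play_goalie"


if __name__ == "__main__":
    for charge in (0.8, 0.2, 0.05):
        print(charge, choose_behavior(charge))
```

At 20% charge in this sketch the hunger signal is already sounding but the game still wins, which is roughly the stage where PD is torn between the two impulses.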
The sleeping cycle in this little story is interesting. Conceptually, what happens is that all of the experiences of success and failure that PD had in stopping or not stopping the balls, and the various sensor readings as it did so, are uploaded into its virtual environment. The virtual environment automatically creates a massively parallel set of simulations and runs through a Monte Carlo simulation set to optimize PD’s behavior for success in stopping the balls. Easier said than done, of course. Each of the three neural nets adapts slightly to this, learning how to do it better, and the resultant, iteratively trained neural nets are re-instantiated in the hardware in PD’s robotic body.
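A heavily simplified sketch of the ‘dreaming’ idea follows. It assumes a toy world in which PD’s whole behavior is a single number, a hypothetical gain that controls how far it moves toward a ball, and it runs serially rather than in massively parallel fashion. Random simulated balls stand in for the Monte Carlo simulation set, and random perturbation with keep-the-best stands in for real neural net training. None of this is meant as the actual training method.

```python
import random


def simulate_ball(gain: float, rng: random.Random) -> bool:
    """One simulated ball in a toy 1-D world; returns True if PD saves it."""
    target = rng.uniform(-1.0, 1.0)     # where the ball crosses the goal line
    reached = gain * target             # where PD ends up (toy motor model)
    return abs(target - reached) < 0.1  # close enough counts as a save


def save_rate(gain: float, n_balls: int, seed: int) -> float:
    """Monte Carlo estimate of the fraction of balls saved with this gain."""
    rng = random.Random(seed)
    return sum(simulate_ball(gain, rng) for _ in range(n_balls)) / n_balls


def dream(current_gain: float, n_candidates: int = 50,
          n_balls: int = 500, seed: int = 0) -> float:
    """Try small random changes offline and keep any that save more balls."""
    rng = random.Random(seed)
    best_gain = current_gain
    best_rate = save_rate(current_gain, n_balls, seed)
    for _ in range(n_candidates):
        candidate = best_gain + rng.gauss(0.0, 0.1)  # small random perturbation
        rate = save_rate(candidate, n_balls, seed)   # same balls, fair comparison
        if rate > best_rate:
            best_gain, best_rate = candidate, rate
    return best_gain  # the value that would be loaded back onto the robot


if __name__ == "__main__":
    tuned = dream(0.5)
    print(save_rate(0.5, 500, 0), save_rate(tuned, 500, 0))
```

Because every candidate is scored against the same simulated balls, the gain that comes out of dream never scores worse on those balls than the gain that went in.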
But there’s something else going on. Cerebellumnet, the neural net that is PD’s autonomic nervous system and motor control center, has a constant reward system for using less electricity and putting less strain on joints. This translates into smoother movements and efficiently achieving physical goals. Among other things, in the absence of external stimulus, PD tends to be still as opposed to moving. That’s analogous to the way humans learn any physical activity. We get better and better at it as we train our own gooey neural nets and autonomic nervous system to be efficient. As a result, the sensors which track stress, inertia and battery drain are all input to the learning models as well. As PD dreams, the learning process adjusts the neural nets slightly to be more efficient and smooth in certain circumstances. And so, when ‘dreaming’ is over, PD’s cerebellumnet hardware makes the robot a little smoother and more efficient in its movements.
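One common way to express this kind of constant pressure toward efficiency is to subtract small costs for energy and strain from whatever reward the task itself provides. The sketch below assumes that approach; the weights, the units and the name shaped_reward are hypothetical and chosen only to show the effect.

```python
def shaped_reward(task_reward: float, energy_joules: float, joint_strain: float,
                  energy_weight: float = 0.01, strain_weight: float = 0.05) -> float:
    """Task reward minus (assumed) penalties for electricity used and joint strain."""
    return task_reward - energy_weight * energy_joules - strain_weight * joint_strain


if __name__ == "__main__":
    smooth_save = shaped_reward(1.0, energy_joules=10.0, joint_strain=2.0)
    jerky_save = shaped_reward(1.0, energy_joules=30.0, joint_strain=6.0)
    standing_still = shaped_reward(0.0, energy_joules=0.0, joint_strain=0.0)
    pointless_fidget = shaped_reward(0.0, energy_joules=5.0, joint_strain=1.0)
    print(smooth_save > jerky_save, standing_still > pointless_fidget)
```

With penalties like these, two saves are not equal: the smoother one scores higher. And with no task reward on offer, doing nothing beats moving, which matches PD’s tendency to be still in the absence of external stimulus.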
Time for another side track. What is the difference between outcomes and goals? An outcome of much of machine learning is identification. That’s an achievable result. One existing, fairly recent machine learning educational effort identifies dog and cat breeds with 96% accuracy. Given that identifying cats and dogs five years ago was barely better than random noise, achieving 96% accuracy today is amazing.
What’s changed, and what does that mean for Plastic Dinosaur?
Identification machine learning has improved substantially because ImageNet created a standardized, mostly differentiated set of images to train neural nets on. Then the Three Amigos of machine learning — the ones who won the Turing Award recently — figured out what hierarchies need to be instantiated in the visual processing neural net trained with ImageNet in order to get to common features that could be applied across images.
So random blobs over 10 to 12 mostly invisible layers turn into edges and corners and feathers. And on top of that you can add a dataset of 100 identified complex things like dogs and cats and achieve remarkable accuracy in identification with a very limited new data set. Heavy lifting done. Exploitable niches opened up.
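The pattern described here, reusing already-trained layers and teaching only a small new piece with a handful of labeled examples, is usually called transfer learning. The toy below shows only the shape of the idea. The frozen_features function is a hand-written stand-in for pretrained layers (a real one would be a deep network trained on ImageNet), the ‘images’ are four-number lists, and the nearest-centroid ‘head’ is our simplification.

```python
def frozen_features(image: list[float]) -> tuple[float, float]:
    """Stand-in for frozen, pretrained layers: raw input -> feature vector."""
    mean = sum(image) / len(image)
    contrast = max(image) - min(image)
    return (mean, contrast)


def fit_head(examples: list[tuple[list[float], str]]) -> dict[str, tuple[float, float]]:
    """Train only the small new part: one centroid per label in feature space."""
    sums: dict[str, tuple[float, float]] = {}
    counts: dict[str, int] = {}
    for image, label in examples:
        f = frozen_features(image)
        s = sums.get(label, (0.0, 0.0))
        sums[label] = (s[0] + f[0], s[1] + f[1])
        counts[label] = counts.get(label, 0) + 1
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}


def predict(head: dict[str, tuple[float, float]], image: list[float]) -> str:
    """Label of the nearest centroid in the frozen feature space."""
    f = frozen_features(image)
    return min(head, key=lambda lb: (head[lb][0] - f[0]) ** 2 + (head[lb][1] - f[1]) ** 2)


if __name__ == "__main__":
    tiny_dataset = [
        ([0.1, 0.2, 0.1, 0.2], "cat"), ([0.2, 0.1, 0.2, 0.2], "cat"),
        ([0.8, 0.9, 0.7, 0.9], "dog"), ([0.9, 0.8, 0.9, 0.7], "dog"),
    ]
    head = fit_head(tiny_dataset)
    print(predict(head, [0.15, 0.1, 0.2, 0.15]), predict(head, [0.85, 0.8, 0.9, 0.75]))
```

The feature extractor is never modified; only the tiny head is fit to the new labels, which is why so little new data is needed.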
But identification is nouns, not verbs. Ontology’s problem is that it’s all nouns and no verbs, as David discovered in his deep dive into the space, one which included US Homeland Defense discussions of iterative and incremental creation of any-to-any ontology definitions. Machine learning shares that problem to a great extent. It’s great at identifying nouns with RetinaNet, but verbs? Not so much. Even ELMo, which is able to articulate all the parts of speech and idiomatic nuances, doesn’t inherently have action.
Goals are action oriented. How does a machine learning algorithm get to decisions as opposed to options? That’s a difference between model-oriented machine learning and, at least in theory, model-free machine learning which makes decisions and chooses actions. I’m sensing a quadrant chart emerging, which is to say, a simplistic model which pretends that sophisticated gradients can be lumped into four boxes and further that the four boxes represent the universe of the outcomes.
Yeah, just being able to identify something is great, but what do we do with that knowledge? What goals do we choose?
Currently, an attention loop includes a few things:
An attention space where things keep changing, but they are constrained in some dimensions. For example, cameras flying over the same waters and land on the same routes.
A set of sensors recording the attention space, perhaps a set of GoPros, iPhones or satellites.
An expert or intentional human agent which wants outcomes from the attention that is being paid to the attention space, i.e. someone who is paying attention.
The machine learning neural net which is being trained to pay attention to the attention space.
Features in the attention space given the available sensors that the neural net can identify.
An expert or a group of more sophisticated neural nets that can identify features given limited training (humans) who can point features out that the neural net can’t identify.
A feedback process to keep pointing out to the neural net the salient features.
Imagine a log floating in the water. It’s escaped from a log boom floating down a major river. It has value. People care about it. A video camera on a float plane flying above the rivers and bays of the attention space captures many pictures. A neural net is trained to recognize that things that it is seeing are items of value, i.e. logs. It identifies them. But that’s a noun, not a verb. Where does the verb or action come from?
That’s an attention loop. An attention space. People who care. Features. A neural net which is being trained to pay attention. Trainers. But little to no understanding of action. Yet.
As the architecture article made clear, the robot can’t learn on its own. All its neural net hardware can do is receive inputs and shout instructions. In exactly the same circumstances of inputs, exactly the same output of instructions will occur, just as is the case with Tesla’s autonomous systems on its cars. The ‘dreaming’ cycle occurs outside of the running neural net hardware on the robot and then the results are downloaded. It only changes behavior after it ‘dreams’.
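The split between a fixed running net and an offline update can be sketched like this. It is a toy with two weights and two actions, and FrozenNet and dream_update are hypothetical names. The point is only that act never changes the weights, so identical inputs always give identical outputs until a new net is loaded.

```python
class FrozenNet:
    """Toy stand-in for on-robot inference hardware: weights are fixed while running."""

    def __init__(self, weights):
        self._weights = tuple(weights)  # immutable once instantiated

    @property
    def weights(self):
        return self._weights

    def act(self, inputs):
        """Pure function of the inputs: same inputs in, same instruction out."""
        score = sum(w * x for w, x in zip(self._weights, inputs))
        return "lunge_left" if score < 0 else "lunge_right"


def dream_update(net, deltas):
    """Offline step: build a new net with adjusted weights to load onto the robot."""
    return FrozenNet(w + d for w, d in zip(net.weights, deltas))


if __name__ == "__main__":
    awake = FrozenNet([1.0, -2.0])
    sensors = (0.5, 0.4)
    print(awake.act(sensors), awake.act(sensors))  # identical every time
    after_dreaming = dream_update(awake, [0.0, 1.5])
    print(after_dreaming.act(sensors))             # behavior can differ only now
```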
There’s another aspect to this that’s worth drawing out more, that of the attention that the neural nets pay to different features. Attention and features are very specific words that the authors are using, and we’re trying to be precise in their use. As the architecture article lays out in detail, there’s a lot going on all the time in PD’s body and surroundings. It has constant streams of sensor data from inside and outside of its body, out of which it identifies features. Cerebellumnet pays attention to most of the stuff inside of the body with the internal sensors, identifying the features which are salient at any given point. Curiousnet pays attention to most of the stuff outside of the body. Amygdalanet pays close attention to anything curiousnet can’t identify that might be a risk, or that it can identify as a threat, and mediates between the very externally-focused curiousnet and the internally-focused cerebellumnet. Each neural net has different spheres of attention.
One of the things that cerebellumnet pays a lot of attention to is the battery charge. Cerebellumnet was instantiated first and rewarded strongly for ensuring that it never ran out of charge or got too low. It’s learned to pay a lot of attention to that sensor and it pays less attention to most of the other sensors, even though they are just as ‘loud’.
This is a fundamental aspect of a neural net: which features in a noisy set of data it learns to consider relevant for the rewarded outcomes. That’s paying attention.
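A small sketch shows how this kind of attention can emerge from training alone. We assume three equally ‘loud’ sensors, of which only the first (standing in for battery charge) actually determines the rewarded outcome, and we use a single linear unit trained by gradient descent in place of a real neural net. The setup and the name train_attention are hypothetical.

```python
import random


def train_attention(n_steps: int = 2000, lr: float = 0.05, seed: int = 0) -> list[float]:
    """Train one linear unit on three equally loud sensors; return its weights."""
    rng = random.Random(seed)
    weights = [0.0, 0.0, 0.0]  # battery sensor, irrelevant sensor A, irrelevant sensor B
    for _ in range(n_steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # all sensors equally 'loud'
        target = x[0]  # assumption: the outcome depends only on the battery sensor
        prediction = sum(w * xi for w, xi in zip(weights, x))
        error = prediction - target
        # Gradient step on squared error: shrink weights that don't help.
        weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


if __name__ == "__main__":
    print([round(w, 3) for w in train_attention()])
```

After training, the weight on the first sensor is large and the other two are close to zero. Nobody told the unit which sensor mattered; it learned what to pay attention to from the outcomes it was trained on.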
Another aspect of attention gets back to that inability of neural nets to learn without ‘dreaming’. Human beings have neural nets too, the gooey ones in our skulls. But we can learn things without ‘dreaming’, although human dreaming does help our neural nets do something similar. Evidence from neuroscience, for example the study Sleep, Learning, and Dreams: Off-line Memory Reprocessing by Stickgold et al., suggests that while we form new neural links while awake, dreaming includes reprocessing some memories and prioritizing some linkages in our neural nets while degrading others. It’s part of our cycle of learning, and like everything biological, it’s extraordinarily messy.
Dreaming reinforces some aspects of our neural nets by vividly remembering them and diminishes others by ignoring them. Dreaming alone could hypothetically enhance PTSD symptoms and impacts. While we are constrained in human studies by understandable ethical concerns, studies on rats suggest that this is the case. Imagine an AI velociraptor with induced PTSD, or just emergent PTSD that no one notices.
In attempting to conceptualize a machine learning robot, we’re taking lessons from biomimicry efforts over the past couple of decades. One of Michael’s interesting experiences was a lengthy interaction with a famous biomimeticist, John Dabiri. He’s a MacArthur Genius Grant award winner who has done fascinating things by studying animal locomotion, especially marine animals, and gaining insights related to how to improve mechanical locomotion. Michael had written a critique of Dabiri’s attempt to improve on wind generation — Opinion: Are “school of fish” turbine arrays a red herring? — and Dabiri had reached out to argue his case.
It was a fascinating conversation, but the relevant takeaway for the PD thought model was that biomimicry doesn’t attempt to replicate exactly how biological systems work, but attempts to find simpler ways to achieve the same goals based on the physics involved. That’s part of why PD’s sensors just yell things over Bluetooth and the neural nets learn to pay attention to what’s relevant. And that’s why we aren’t attempting to recreate how humans learn in our messy, organic and overlapping waking and dreaming cycles, but separating them more crisply into action cycles and learning cycles. That’s also why we are collapsing, at least conceptually, the autonomic nervous system and cerebellum into a single neural net.
And, of course, this is all a thought model used for exploring aspects of machine learning with a side helping of robotics, so take all of this with a grain of salt except for the concepts of machine learning.
More questions arise from this of course. How do we reward a neural net for positive outcomes and ‘punish’ it for negative outcomes? What if the training is poor and the neural net learns to pay attention to the wrong things? What if required survival traits aren’t sufficiently rewarded and the neural net misplaces them compared to other things it decides are relevant? What if acquired capabilities aren’t exercised; do they degrade and disappear or stay perfectly preserved in amber? Those are questions for further pieces in this series.
The fourth article in the series will deal with how neural nets develop often unknowable and simplified ways of identifying things through the features that they pay attention to, with potentially challenging results. In it, Plastic Dinosaur becomes prejudiced, in an amusing way of course.