Here is an implementation of Q-Learning, combined with an evolutionary algorithm, applied to learning how to make an arbitrary robot crawl.

- Implementation Details

This is the most basic Q-Learning implementation, using a Q-Table that specifies the value of each action for each possible state. The state is simply the angle of each joint on the arm. The angle values are measured in degrees, discretized to whole integers, relative to the minimum value of the joint, which makes it easy to map the values received from the physics engine to Q-Table indices. There are six possible actions, three for each of the arm joint motors: forward, reverse and hold.

The goal is to move the whole robot's body to the right, so the reward signal is a linear combination of the body's speed and acceleration, taken directly from the physics engine. A visualization of the reward signal is shown above each robot, which lets the user clearly see, and verify, when the robot is being rewarded and when it is being punished. Simply adding a minus sign in front of the reward signal makes the robot learn a different complex sequence of arm movements that moves the body to the left instead.

The search policy is a version of ε-greedy: the robot chooses its next action as the best action it knows at the time, found by simply looking at the Q-Value of each action at the current state, except that with some probability, e, it chooses a random action. The amount of randomness is adjusted based on recent rewards. When the robot has not seen a positive reward signal for some time it begins to increase the probability of acting randomly, on the assumption that its knowledge is not very good. Once a positive reward signal is seen, the probability of choosing a random move begins to decline.

At each time-step the current reward is used to update the Q-Table entry for the previous time-step's state and chosen action. The update target is taken as the current reward plus the maximum possible Q-Value of any action from the current state. In this way the robot learns to chain together the best possible action at each state in order to maximize the reward signal, that is, to achieve the goal. Each robot has a learning rate that controls how much of its old Q-Value, its existing knowledge, is kept and how much the new value contributes to what is stored in the Q-Table.

The policy therefore has a number of hyper-parameters: initial e, max e, min e, rate of changing e, learning rate… Different combinations of values for these hyper-parameters result in very different methods and rates of learning, and given an arbitrary robot and goal it is difficult to have an intuition for their optimal values. To help explore this hyper-parameter space, and to speed up exploration of the state-action space, an evolutionary component is added: a population of robots is instantiated, each with parameters selected randomly within reasonable ranges. Each robot has its own Q-Table, or memory. At set points in time the robot that is in the lead shares its Q-Table with the rest of the population, with some learning rate below 1, so that the other robots retain some of their past knowledge. Each robot then begins to move much like the leader, but because each has its own hyper-parameters they branch out in different ways, searching for better ways of moving forward. Eventually one may overtake the leader and be the next to share its Q-Table.
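As a rough sketch of the per-time-step loop described above, assuming a two-joint arm, six actions, and a Q-Table indexed by the two discretized joint angles and the action. All class, field and method names here are illustrative, not the project's actual code, and the e-adaptation rule is simplified to a fixed step per update:

```java
import java.util.Random;

// Minimal sketch of one robot's tabular Q-Learning, under the assumptions above.
public class CrawlerQLearning {
    static final int NUM_ACTIONS = 6;       // forward/reverse/hold for each of 2 motors

    final double[][][] qTable;               // [angle0][angle1][action]
    final double learningRate;               // how much the new value overwrites the old
    double epsilon;                          // current random-action probability
    final double minEpsilon, maxEpsilon, epsilonStep;
    final Random rng = new Random();

    CrawlerQLearning(int angleRange0, int angleRange1, double learningRate,
                     double initialEpsilon, double minEpsilon,
                     double maxEpsilon, double epsilonStep) {
        this.qTable = new double[angleRange0][angleRange1][NUM_ACTIONS];
        this.learningRate = learningRate;
        this.epsilon = initialEpsilon;
        this.minEpsilon = minEpsilon;
        this.maxEpsilon = maxEpsilon;
        this.epsilonStep = epsilonStep;
    }

    // Joint angle from the physics engine (degrees), shifted by the joint's
    // minimum angle and rounded so it can index the Q-Table directly.
    int discretize(float angleDeg, float minAngleDeg) {
        return Math.round(angleDeg - minAngleDeg);
    }

    // e-greedy selection: usually the best known action, sometimes a random one.
    int chooseAction(int a0, int a1) {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(NUM_ACTIONS);
        }
        return argMax(qTable[a0][a1]);
    }

    // Update the previous state/action using the reward just observed:
    // target = reward + best Q-Value reachable from the current state
    // (no separate discount factor is mentioned in the write-up, so none is applied).
    void update(int prevA0, int prevA1, int action,
                double reward, int curA0, int curA1) {
        double target = reward + max(qTable[curA0][curA1]);
        double old = qTable[prevA0][prevA1][action];
        qTable[prevA0][prevA1][action] =
                (1.0 - learningRate) * old + learningRate * target;

        // Simplified exploration adaptation: shrink e after a positive reward,
        // grow it toward maxEpsilon otherwise (the write-up waits for a period
        // without positive reward before growing e).
        if (reward > 0) {
            epsilon = Math.max(minEpsilon, epsilon - epsilonStep);
        } else {
            epsilon = Math.min(maxEpsilon, epsilon + epsilonStep);
        }
    }

    private static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    private static double max(double[] values) {
        double m = values[0];
        for (double v : values) m = Math.max(m, v);
        return m;
    }
}
```

In use, each physics step would call chooseAction for the current discretized angles, apply the chosen motor command, and on the following step call update with the reward observed from the physics engine.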
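The knowledge-sharing step could look roughly like the following, again with illustrative names and the same assumed double[][][] table layout: at set points in time the leading robot's Q-Table is blended into every other robot's table with a blending rate below 1, so each robot keeps part of its own past knowledge.

```java
// Sketch of the leader-sharing ("meme") step, under the assumptions above.
public class QTableSharing {

    // populationTables[r] is robot r's Q-Table, indexed [angle0][angle1][action].
    static void shareLeaderKnowledge(double[][][][] populationTables,
                                     int leaderIndex,
                                     double blendRate) {
        double[][][] leader = populationTables[leaderIndex];
        for (int r = 0; r < populationTables.length; r++) {
            if (r == leaderIndex) continue;            // the leader keeps its own table
            double[][][] table = populationTables[r];
            for (int i = 0; i < table.length; i++) {
                for (int j = 0; j < table[i].length; j++) {
                    for (int a = 0; a < table[i][j].length; a++) {
                        // Keep (1 - blendRate) of the robot's own value and take
                        // blendRate of the leader's value for the same entry.
                        table[i][j][a] = (1.0 - blendRate) * table[i][j][a]
                                       + blendRate * leader[i][j][a];
                    }
                }
            }
        }
    }
}
```

Blending rather than copying outright is what lets a robot that receives the leader's knowledge still branch off in its own direction.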
In this way, exploration policies that do not work well at first, such as both a low random-move probability and a low learning rate, might help later on to fine-tune an already fairly good way of moving learned by another robot. Since the information shared between the robots controls their behavior, it is like evolving memes.

- Program Details

Language: Java
Frameworks: Libgdx, Box2d