3 – Deep Q-Networks

In 2015, DeepMind made a breakthrough by designing an agent that learned to play video games better than humans. Yes, it's probably easy to write a program that plays Pong perfectly if you have access to the underlying game state: the position of the ball, the paddles, et cetera. But this agent was only given raw pixel data, what a human player would see on screen, and it learned to play a bunch of different Atari games, all from scratch. They called this agent a Deep Q-Network. Let's take a closer look at how it works.

True to its name, at the heart of the agent is a deep neural network that acts as a function approximator. You pass in images from your favorite video game, one screen at a time, and it produces a vector of action values, with the max value indicating the action to take. As a reinforcement signal, it is fed back the change in game score at each time step. In the beginning, when the neural network is initialized with random values, the actions taken are all over the place. It's really bad, as you would expect, but over time it begins to associate situations and sequences in the game with appropriate actions and learns to actually play the game well.

Consider how complex the input space is. Atari games are displayed at a resolution of 210 by 160 pixels, with 128 possible colors for each pixel. This is still technically a discrete state space, but it is very large to process as is. To reduce this complexity, the DeepMind team decided to perform some minimal processing: convert the frames to grayscale and scale them down to a square 84 by 84 pixel block. Square images allowed them to use more optimized neural network operations on GPUs. To give the agent access to a sequence of frames, they stacked four such frames together, resulting in a final state space size of 84 by 84 by 4. There might be other approaches to dealing with sequential data, but this was a simple approach that seemed to work pretty well.

On the output side, unlike a traditional reinforcement learning setup where only one Q value is produced at a time, the Deep Q-Network is designed to produce a Q value for every possible action in a single forward pass. Without this, you would have to run the network individually for every action. Instead, you can now simply use this vector to take an action, either stochastically or by choosing the one with the maximum value. Neat, isn't it?

These innovative input and output transformations support a powerful yet simple neural network architecture under the hood. The screen images are first processed by convolutional layers, which allows the system to exploit spatial relationships in the image. And since four frames are stacked and provided as input, these convolutional layers also extract some temporal properties across those frames. The original DQN agent used three such convolutional layers with ReLU activation (rectified linear units). They were followed by one fully-connected hidden layer with ReLU activation, and one fully-connected linear output layer that produced the vector of action values. This same architecture was used for all the Atari games they tested on, but each game was learned from scratch with a freshly initialized network.

Training such a network requires a lot of data, but even then, it is not guaranteed to converge on the optimal value function. In fact, there are situations where the network weights can oscillate or diverge, due to the high correlation between actions and states. This can result in a very unstable and ineffective policy.
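Before turning to those stability issues, it helps to see the architecture laid out concretely. Here is a minimal sketch in PyTorch (my framework choice for illustration, not DeepMind's original implementation). The layer widths, kernel sizes, and strides follow the configuration reported in the Nature DQN paper; the action count of 6 is just an example.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q value per action."""

    def __init__(self, n_actions):
        super().__init__()
        # Three convolutional layers with ReLU activations extract spatial
        # (and, across the stacked frames, some temporal) features.
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4x84x84 -> 32x20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> 64x9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> 64x7x7
            nn.ReLU(),
        )
        # One fully-connected hidden layer with ReLU, then a linear output
        # layer producing the full vector of action values in one pass.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # x: batch of states, shape (batch, 4, 84, 84)
        return self.head(self.features(x))

# Acting greedily from a single forward pass:
q_net = DQN(n_actions=6)                    # e.g. 6 actions, purely illustrative
state = torch.rand(1, 4, 84, 84)            # dummy preprocessed state
action = q_net(state).argmax(dim=1).item()  # index of the max Q value
```

Because the network emits all action values at once, picking the greedy action is a single cheap argmax rather than one network evaluation per action.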
In order to overcome these challenges, the researchers came up with several techniques that slightly modified the base Q-learning algorithm. We'll take a look at two of these techniques that I feel are the most important contributions of their work: experience replay and fixed Q targets.
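As a preview of what those two ideas look like in practice, here is a rough sketch, again in PyTorch and reusing the DQN class from the sketch above. The buffer capacity and synchronization interval are illustrative placeholders, not the values used in the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples
    them uniformly at random, which breaks up the correlation between
    consecutive experiences."""

    def __init__(self, capacity=100_000):  # capacity is an arbitrary example
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

# Fixed Q targets: a separate target network supplies the bootstrap values
# and is only synchronized with the online network every so often, so the
# learning target is not chasing the very weights being updated.
q_net = DQN(n_actions=6)        # online network (class defined in the sketch above)
target_net = DQN(n_actions=6)
target_net.load_state_dict(q_net.state_dict())  # start from identical weights

# ... then, every C learning steps (C is a hyperparameter) ...
target_net.load_state_dict(q_net.state_dict())
```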
