[DQN] Understanding DeepMind's Deep Reinforcement Learning
The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (for example one round of the game) forms a finite sequence of states, actions and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n-1}, a_{n-1}, r_n, s_n

Here s_i is the state, a_i is the action and r_{i+1} is the reward received after performing that action. The episode ends with the terminal state s_n (for example the "game over" screen). An MDP relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on the preceding states and actions.

Discounted Future Reward

To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. How should we go about that?

Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

R = r_1 + r_2 + r_3 + … + r_n

Given that, the total future reward from time point t onward can be expressed as:

R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcomes may diverge. For that reason it is common to use the discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + …)) = r_t + γ R_{t+1}

If we set the discount factor to γ = 0, our strategy will be short-sighted and rely only on immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ = 0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set γ = 1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.

Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:

Q(s_t, a_t) = max R_{t+1}

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Just close your eyes and repeat to yourself five times: "Q(s, a) exists, Q(s, a) exists, …". Feel it?

If you're still not convinced, consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of the game. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!

π(s) = argmax_a Q(s, a)

Here π represents the policy, the rule for how we choose an action in each state.

OK, how do we get that Q-function then? Let's focus on just one transition ⟨s, a, r, s'⟩. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s':

Q(s, a) = r + γ max_{a'} Q(s', a')

This is called the Bellman equation. If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation.
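To make the recursion R_t = r_t + γ R_{t+1} concrete, here is a minimal Python sketch, not part of the original article, that computes the discounted return of a toy episode both as the direct sum and via the backwards recursion; the reward values and function names are illustrative only.

# Minimal sketch: discounted future reward for a plain list of rewards.
# The rewards, function names and gamma values are illustrative assumptions.

def discounted_return(rewards, gamma):
    """Direct sum: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursion from the text: R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    for r in reversed(rewards):   # work backwards from the terminal step
        R = r + gamma * R
    return R

rewards = [0, 0, 1, 0, 5]         # toy episode: rewards r_1 .. r_n
for gamma in (0.0, 0.9, 1.0):     # short-sighted, balanced, undiscounted
    direct = discounted_return(rewards, gamma)
    recursive = discounted_return_recursive(rewards, gamma)
    assert abs(direct - recursive) < 1e-9
    print(f"gamma={gamma}: R_1 = {direct:.3f}")

Both forms agree for every choice of γ, which is exactly the identity used later to turn the Bellman equation into an update rule.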
In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm is as simple as applying the following update after every observed transition ⟨s, a, r, s'⟩ (a minimal Python sketch of the full loop is given at the end of this section):

Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] - Q[s, a])

Here α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α = 1, the two Q[s, a] terms cancel and the update is exactly the same as the Bellman equation.

The max_{a'} Q[s', a'] that we use to update Q[s, a] is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value.

Deep Q Network

The state of the environment can be defined by the location of the paddle, the location and direction of the ball, and whether or not each brick has been knocked out. This intuitive representation, however, is specific to one particular game. Could we come up with something more universal that would suit all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens, however, would have these covered as well.

If we preprocess the game screens as in the DeepMind paper, that is, take the four last screen images, resize them to 84x84 and convert them to 256 grey levels, we would have 256^(84x84x4) ≈ 10^67970 possible game states. This means our Q-table would need 10^67970 rows, far more than the number of atoms in the known universe! One could argue that many pixel combinations, and therefore states, never occur, so we could get by with a sparse table containing only the visited states. Even so, most states are still visited very rarely, and it would take the lifetime of the universe for the Q-table to converge. Ideally, we would also like to have a good guess for the Q-values of states we have never seen before.

This is where deep learning steps in. Neural networks are exceptionally good at extracting features from highly structured data. We could represent the Q-function with a neural network that takes the state (four screen images) and an action as input and outputs the corresponding Q-value. Alternatively, we could take only the game screens as input and output a Q-value for each possible action. This latter approach has the advantage that when we want to perform a Q-value update or pick the action with the highest Q-value, we only need one forward pass through the network to obtain the Q-values of all actions at once.

Figure 3: Left: a naive formulation of the DQN; Right: the more optimized DQN architecture used in the DeepMind paper.
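The tabular update described above can be written out as a short Q-learning loop. The sketch below is illustrative and not from the original article: it assumes a small environment object exposing reset() returning a state and step(action) returning (next_state, reward, done); that interface, the epsilon-greedy exploration and all parameter values are assumptions made for the example.

# Minimal tabular Q-learning sketch (illustrative; not the DeepMind implementation).
# Assumes a tiny environment with reset() -> state and
# step(action) -> (next_state, reward, done); this interface is an assumption.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table: states as rows, actions as columns (here a dict of lists).
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly the greedy policy pi(s) = argmax_a Q[s][a]
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # Update rule: Q[s,a] += alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

With alpha = 1 the update overwrites Q[s][a] with the Bellman target directly, matching the remark about the two Q[s, a] terms cancelling.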
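To illustrate the architecture on the right of Figure 3, where the game screens go in and one Q-value per action comes out in a single forward pass, here is a sketch of such a network. PyTorch and the concrete layer sizes are assumptions made for this example (they loosely follow the convolutional stack reported in the published DQN work); the input is the preprocessed stack of four 84x84 grey-scale screens described above.

# Sketch of the "screens in, one Q-value per action out" network (Figure 3, right).
# PyTorch and the exact layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked 84x84 frames in
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                    # one Q-value per action
        )

    def forward(self, screens):
        # screens: float tensor of shape (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(screens))

# One forward pass yields the Q-values of every action for a batch of states.
net = DQN(n_actions=4)
state = torch.rand(1, 4, 84, 84)
q_values = net(state)                  # shape (1, 4)
greedy_action = q_values.argmax(dim=1)

Because all action values come out of one pass, both the greedy action choice and the Bellman-style target max_{a'} Q(s', a') are cheap to compute, which is exactly the advantage of this formulation noted in the text.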