[DQN] Understanding DeepMind's Deep Reinforcement Learning
The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process. One episode of this process (for example one round of the game) forms a finite sequence of states, actions and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n-1}, a_{n-1}, r_n, s_n

Here s_i is the state, a_i is the action and r_{i+1} is the reward received after performing that action. The episode ends with the terminal state s_n (for example the "game over" screen). An MDP relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on the preceding states and actions.

Discounted Future Reward

To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. How should we go about that?

Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

R = r_1 + r_2 + r_3 + … + r_n

Given that, the total future reward from time point t onward can be expressed as:

R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcomes may diverge. For that reason it is common to use the discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + …)) = r_t + γ R_{t+1}

If we set the discount factor to γ = 0, our strategy will be short-sighted and rely only on immediate rewards. If we want to balance immediate and future rewards, we should set the discount factor to something like γ = 0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set γ = 1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.

Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:

Q(s_t, a_t) = max R_{t+1}

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function because it represents the "quality" of a certain action in a given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Just close your eyes and repeat to yourself five times: "Q(s, a) exists, Q(s, a) exists, …". Feel it?

If you're still not convinced, consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of the game. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!

π(s) = argmax_a Q(s, a)

Here π represents the policy, the rule for how we choose an action in each state.

OK, how do we get that Q-function then? Let's focus on just one transition ⟨s, a, r, s'⟩. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s':

Q(s, a) = r + γ max_{a'} Q(s', a')

This is called the Bellman equation. If you think about it, it is quite logical: the maximum future reward for this state and action is the immediate reward plus the maximum future reward for the next state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation.
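To make the recursion R_t = r_t + γ R_{t+1} concrete, here is a minimal Python sketch, not part of the original article, that computes the discounted return of a toy episode both as the direct sum and via the backwards recursion; the reward values and function names are illustrative only.

# Minimal sketch: discounted future reward for a plain list of rewards.
# The rewards, function names and gamma values are illustrative assumptions.

def discounted_return(rewards, gamma):
    """Direct sum: R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursion from the text: R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    for r in reversed(rewards):   # work backwards from the terminal step
        R = r + gamma * R
    return R

rewards = [0, 0, 1, 0, 5]         # toy episode: rewards r_1 .. r_n
for gamma in (0.0, 0.9, 1.0):     # short-sighted, balanced, undiscounted
    direct = discounted_return(rewards, gamma)
    recursive = discounted_return_recursive(rewards, gamma)
    assert abs(direct - recursive) < 1e-9
    print(f"gamma={gamma}: R_1 = {direct:.3f}")

Both forms agree for every choice of γ, which is exactly the identity used later to turn the Bellman equation into an update rule.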
In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns. The gist of the Q-learning algorithm is as simple as applying the following update after every observed transition ⟨s, a, r, s'⟩ (a minimal Python sketch of the full loop is given at the end of this section):

Q[s, a] ← Q[s, a] + α (r + γ max_{a'} Q[s', a'] - Q[s, a])

Here α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α = 1, the two Q[s, a] terms cancel and the update is exactly the same as the Bellman equation.

The max_{a'} Q[s', a'] that we use to update Q[s, a] is only an approximation, and in the early stages of learning it may be completely wrong. However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value.

Deep Q Network

The state of the environment can be defined by the location of the paddle, the location and direction of the ball, and whether or not each brick has been knocked out. This intuitive representation, however, is specific to one particular game. Could we come up with something more universal that would suit all games? The obvious choice is screen pixels: they implicitly contain all of the relevant information about the game situation, except for the speed and direction of the ball. Two consecutive screens, however, would have these covered as well.

If we preprocess the game screens as in the DeepMind paper, that is, take the four last screen images, resize them to 84x84 and convert them to 256 grey levels, we would have 256^(84x84x4) ≈ 10^67970 possible game states. This means our Q-table would need 10^67970 rows, far more than the number of atoms in the known universe! One could argue that many pixel combinations, and therefore states, never occur, so we could get by with a sparse table containing only the visited states. Even so, most states are still visited very rarely, and it would take the lifetime of the universe for the Q-table to converge. Ideally, we would also like to have a good guess for the Q-values of states we have never seen before.

This is where deep learning steps in. Neural networks are exceptionally good at extracting features from highly structured data. We could represent the Q-function with a neural network that takes the state (four screen images) and an action as input and outputs the corresponding Q-value. Alternatively, we could take only the game screens as input and output a Q-value for each possible action. This latter approach has the advantage that when we want to perform a Q-value update or pick the action with the highest Q-value, we only need one forward pass through the network to obtain the Q-values of all actions at once.

Figure 3: Left: a naive formulation of the DQN; Right: the more optimized DQN architecture used in the DeepMind paper.
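The tabular update described above can be written out as a short Q-learning loop. The sketch below is illustrative and not from the original article: it assumes a small environment object exposing reset() returning a state and step(action) returning (next_state, reward, done); that interface, the epsilon-greedy exploration and all parameter values are assumptions made for the example.

# Minimal tabular Q-learning sketch (illustrative; not the DeepMind implementation).
# Assumes a tiny environment with reset() -> state and
# step(action) -> (next_state, reward, done); this interface is an assumption.
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q-table: states as rows, actions as columns (here a dict of lists).
    Q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly the greedy policy pi(s) = argmax_a Q[s][a]
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # Update rule: Q[s,a] += alpha * (r + gamma * max_a' Q[s',a'] - Q[s,a])
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

With alpha = 1 the update overwrites Q[s][a] with the Bellman target directly, matching the remark about the two Q[s, a] terms cancelling.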
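To illustrate the architecture on the right of Figure 3, where the game screens go in and one Q-value per action comes out in a single forward pass, here is a sketch of such a network. PyTorch and the concrete layer sizes are assumptions made for this example (they loosely follow the convolutional stack reported in the published DQN work); the input is the preprocessed stack of four 84x84 grey-scale screens described above.

# Sketch of the "screens in, one Q-value per action out" network (Figure 3, right).
# PyTorch and the exact layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked 84x84 frames in
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),                    # one Q-value per action
        )

    def forward(self, screens):
        # screens: float tensor of shape (batch, 4, 84, 84), pixels scaled to [0, 1]
        return self.head(self.features(screens))

# One forward pass yields the Q-values of every action for a batch of states.
net = DQN(n_actions=4)
state = torch.rand(1, 4, 84, 84)
q_values = net(state)                  # shape (1, 4)
greedy_action = q_values.argmax(dim=1)

Because all action values come out of one pass, both the greedy action choice and the Bellman-style target max_{a'} Q(s', a') are cheap to compute, which is exactly the advantage of this formulation noted in the text.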