[DQN] An Analysis of DeepMind's Deep Reinforcement Learning Technology
The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process (MDP). One episode of this process (for example, one run of a game) forms a finite sequence of states, actions, and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{n-1}, a_{n-1}, r_n, s_n

Here s_i denotes the state, a_i the action, and r_{i+1} the reward received after performing the action. The episode ends with a terminal state s_n (for example, the "game over" screen). An MDP relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on any preceding states and actions.
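As a concrete illustration, such an episode can be written down as a plain sequence of (state, action, reward) steps. The following is a minimal sketch of my own; the states, actions, and reward values are made up and do not come from the original article:

# Minimal sketch of one MDP episode as a finite sequence of steps.
# All states, actions, and reward values are invented for illustration.
from typing import NamedTuple, Optional

class Step(NamedTuple):
    state: str              # s_i
    action: Optional[str]   # a_i (None once the terminal state is reached)
    reward: float           # r_{i+1}, the reward received after acting

episode = [
    Step(state="s0", action="left",  reward=0.0),
    Step(state="s1", action="right", reward=1.0),
    Step(state="s2", action="fire",  reward=5.0),
    Step(state="game_over", action=None, reward=0.0),  # terminal state s_n
]

# Markov assumption: the next state depends only on the current state and
# action, not on any of the earlier states and actions in the episode.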
Discounted Future Reward

To perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that? Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

R = r_1 + r_2 + r_3 + ... + r_n

Given that, the total future reward from time point t onward can be expressed as:

R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^(n-t) r_n

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

R_t = r_t + γ R_{t+1}
If we set the discount factor γ = 0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set the discount factor to something like 0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor to 1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
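To make the discounting concrete, here is a small sketch of my own (the reward values are made up) that computes R_t = r_t + γ R_{t+1} for every time step by sweeping backwards through an episode, and shows how the choice of discount factor changes the result:

# Compute the discounted return R_t = r_t + gamma * R_{t+1} for every step
# of an episode by iterating backwards from the final reward.
def discounted_returns(rewards, gamma):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 0.0, 10.0]  # made-up episode rewards
print(discounted_returns(rewards, 0.0))  # short-sighted: each R_t equals r_t
print(discounted_returns(rewards, 0.9))  # balances immediate and future rewards
print(discounted_returns(rewards, 1.0))  # undiscounted sum, suits deterministic environments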
Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:

Q(s_t, a_t) = max R_{t+1}

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function, because it represents the "quality" of a certain action in a given state. This may sound like quite a puzzling definition. How can we estimate the score at the end of the game, if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Just close your eyes and repeat to yourself five times.
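As an illustration of this definition and of the greedy strategy above, the following sketch of mine uses a hypothetical tabular Q-function with invented values; the policy simply picks the action with the highest Q(s, a) in the current state:

# Illustrative sketch: a tiny tabular Q-function and the greedy policy
# pi(s) = argmax_a Q(s, a). States, actions, and Q-values are made up.
ACTIONS = ["left", "right"]

Q = {
    ("s0", "left"): 1.2,   # "best possible final score" after going left in s0
    ("s0", "right"): 3.4,
    ("s1", "left"): 0.5,
    ("s1", "right"): 2.1,
}

def greedy_action(state):
    # Always choose the action that maximizes the (discounted) future reward.
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(greedy_action("s0"))  # -> "right", since Q(s0, right) = 3.4 is the largest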