[DQN] An Analysis of DeepMind's Deep Reinforcement Learning Technology
The set of states and actions, together with the rules for transitioning from one state to another, make up a Markov decision process (MDP). One episode of this process (for example, one run of a game) forms a finite sequence of states, actions, and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{n-1}, a_{n-1}, r_n, s_n

Here s_i denotes the state, a_i the action, and r_{i+1} the reward received after performing the action. The episode ends with a terminal state s_n (for example, the "game over" screen). An MDP relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and action a_i, and not on any preceding states and actions.
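As a concrete illustration, such an episode can be written down as a plain sequence of (state, action, reward) steps. The following is a minimal sketch of my own; the states, actions, and reward values are made up and do not come from the original article:

# Minimal sketch of one MDP episode as a finite sequence of steps.
# All states, actions, and reward values are invented for illustration.
from typing import NamedTuple, Optional

class Step(NamedTuple):
    state: str              # s_i
    action: Optional[str]   # a_i (None once the terminal state is reached)
    reward: float           # r_{i+1}, the reward received after acting

episode = [
    Step(state="s0", action="left",  reward=0.0),
    Step(state="s1", action="right", reward=1.0),
    Step(state="s2", action="fire",  reward=5.0),
    Step(state="game_over", action=None, reward=0.0),  # terminal state s_n
]

# Markov assumption: the next state depends only on the current state and
# action, not on any of the earlier states and actions in the episode.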
Discounted Future Reward

To perform well in the long term, we need to take into account not only the immediate rewards, but also the future rewards we are going to get. How should we go about that? Given one run of the Markov decision process, we can easily calculate the total reward for one episode:

R = r_1 + r_2 + r_3 + ... + r_n

Given that, the total future reward from time point t onward can be expressed as:

R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more it may diverge. For that reason it is common to use the discounted future reward instead:

R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^(n-t) r_n

Here γ is the discount factor between 0 and 1: the further into the future a reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

R_t = r_t + γ R_{t+1}
If we set the discount factor γ = 0, then our strategy will be short-sighted and we rely only on the immediate rewards. If we want to balance between immediate and future rewards, we should set the discount factor to something like 0.9. If our environment is deterministic and the same actions always result in the same rewards, then we can set the discount factor to 1.

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
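To make the discounting concrete, here is a small sketch of my own (the reward values are made up) that computes R_t = r_t + γ R_{t+1} for every time step by sweeping backwards through an episode, and shows how the choice of discount factor changes the result:

# Compute the discounted return R_t = r_t + gamma * R_{t+1} for every step
# of an episode by iterating backwards from the final reward.
def discounted_returns(rewards, gamma):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 0.0, 10.0]  # made-up episode rewards
print(discounted_returns(rewards, 0.0))  # short-sighted: each R_t equals r_t
print(discounted_returns(rewards, 0.9))  # balances immediate and future rewards
print(discounted_returns(rewards, 1.0))  # undiscounted sum, suits deterministic environments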
Q-learning

In Q-learning we define a function Q(s, a) representing the maximum discounted future reward when we perform action a in state s and continue optimally from that point on:

Q(s_t, a_t) = max R_{t+1}

The way to think about Q(s, a) is that it is "the best possible score at the end of the game after performing action a in state s". It is called the Q-function, because it represents the "quality" of a certain action in a given state. This may sound like quite a puzzling definition. How can we estimate the score at the end of the game, if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function. Just close your eyes and repeat to yourself five times.
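As an illustration of this definition and of the greedy strategy above, the following sketch of mine uses a hypothetical tabular Q-function with invented values; the policy simply picks the action with the highest Q(s, a) in the current state:

# Illustrative sketch: a tiny tabular Q-function and the greedy policy
# pi(s) = argmax_a Q(s, a). States, actions, and Q-values are made up.
ACTIONS = ["left", "right"]

Q = {
    ("s0", "left"): 1.2,   # "best possible final score" after going left in s0
    ("s0", "right"): 3.4,
    ("s1", "left"): 0.5,
    ("s1", "right"): 2.1,
}

def greedy_action(state):
    # Always choose the action that maximizes the (discounted) future reward.
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(greedy_action("s0"))  # -> "right", since Q(s0, right) = 3.4 is the largest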