Chapter 10: Reinforcement Learning, lecture notes (第10章-强化学习讲课教案.ppt)

Uploaded by: 豆****

Document ID: 65785053

Upload date: 2022-12-08

Format: PPT

Pages: 90

Size: 1.69 MB


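Chapter 10: Reinforcement Learning
2022/12/8  Reinforcement Learning  Shi Zhongzhi

Introduction
Humans usually learn by interacting with their environment. Reinforcement learning is the learning of a mapping from environment states to actions such that the cumulative reward the system obtains from the environment is maximized. In reinforcement learning we design algorithms that turn situations in the environment into actions in a way that maximizes the amount of reward. The agent is not told directly what to do or which action to take; it discovers for itself which actions yield the most reward by trying them. An action affects not only the immediate reward but also the subsequent actions and the eventual reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Introduction
Reinforcement learning grew out of control theory, statistics, psychology and related disciplines, and can be traced back to Pavlov's conditioning experiments. Only in the late 1980s and early 1990s, however, did it become widely studied and applied in artificial intelligence, machine learning and automatic control, where it came to be regarded as one of the core techniques for designing intelligent systems. Once research on its mathematical foundations achieved breakthroughs, work on reinforcement learning and its applications expanded steadily, and it is now one of the active research topics in machine learning.

Introduction
• The idea of reinforcement originated in psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (it is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal uncomfortable becomes less strongly associated with the situation and is unlikely to recur. In other words, which behaviors are "remembered" and tied to a stimulus depends on the effect the behavior produces.
• Trial-and-error learning in animals has two aspects, selectional and associative, which correspond computationally to search and memory. In 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis. In the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence" and has been widely used since. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
• From 1953 to 1957 Bellman developed an effective method for solving optimal control problems: dynamic programming.
• In 1957 Bellman also introduced the stochastic, discrete version of the optimal control problem, the well-known Markov decision process (MDP). In 1960 Howard proposed the policy iteration method for Markov decision processes. These results became the theoretical foundation of modern reinforcement learning.
• In 1972 Klopf combined trial-and-error learning with temporal-difference learning. From 1978 on, Sutton, Barto, Moore, Klopf and others studied this combination in depth.
• In 1989 Watkins proposed Q-learning [Watkins 1989], which also tied the three main threads of reinforcement learning together.
• In 1992 Tesauro successfully applied reinforcement learning to backgammon in the program TD-Gammon.

Outline
• Introduction
• The reinforcement learning model
• Dynamic programming
• Monte Carlo methods
• Temporal-difference learning
• Q-learning
• Function approximation in reinforcement learning
• Applications

The reinforcement learning model
[Figure: agent-environment loop. Legend: i: input, r: reward, s: state, a: action. The agent receives state s_i and reward r_i and issues action a_i; the environment returns s_{i+1} and r_{i+1}. Trajectory: s_0, a_0, s_1, a_1, s_2, a_2, s_3.]

Describing an environment (problem)
• Accessible vs. inaccessible
• Deterministic vs. non-deterministic
• Episodic vs. non-episodic
• Static vs. dynamic
• Discrete vs. continuous
The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The reinforcement learning problem
• Agent-environment interaction
• States, actions, rewards
• To define a finite MDP:
  • state and action sets: S and A
  • one-step "dynamics" defined by transition probabilities (Markov property): $P(s, a, s') = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  • reward probabilities: $R(s, a, s') = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
[Figure: the RL agent sends an action to the environment and receives a state and a reward.]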
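The agent-environment interaction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-state chain environment, the function names `step`, `run_episode` and `discounted_return`, and the discount parameter `gamma` are all hypothetical and do not come from the slides.

```python
from typing import Callable, List, Tuple

State = int
Action = int


def step(state: State, action: Action) -> Tuple[State, float]:
    """Hypothetical deterministic chain with states 0..3.

    Action 1 moves one state to the right, action 0 stays in place.
    Entering state 3 pays reward 1.0; every other step pays 0.0.
    """
    next_state = min(state + action, 3)
    reward = 1.0 if next_state == 3 and state != 3 else 0.0
    return next_state, reward


def run_episode(policy: Callable[[State], Action], start: State = 0,
                max_steps: int = 10) -> List[float]:
    """Interaction loop: observe s_i, choose a_i, receive r_{i+1} and s_{i+1}."""
    state, rewards = start, []
    for _ in range(max_steps):
        action = policy(state)
        state, reward = step(state, action)
        rewards.append(reward)
        if state == 3:  # terminal state reached
            break
    return rewards


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Cumulative reward sum_k gamma^k * r_{k+1}; gamma = 1 gives the plain sum."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = run_episode(lambda s: 1)  # a fixed policy that always moves right
print(rewards, discounted_return(rewards, 0.9))
```

The policy here is fixed in advance; a learning agent would instead adjust it from the rewards it observes.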

Comparison with supervised learning
• Reinforcement learning: learn from interaction
  • The learner learns from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.
  [Figure: inputs → RL system → outputs ("actions"); training info = evaluations ("rewards"/"penalties").]
• Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of reinforcement learning
• Policy: stochastic rule for selecting actions
• Return/Reward: the function of future rewards the agent tries to maximize
• Value: what is good because it predicts reward
• Model: what follows what
[Figure labels: Policy, Reward, Value, Model of environment; "Is unknown", "Is my goal", "Is I can get", "Is my method".]

Bellman equation for a policy $\pi$
The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$
So: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$
Or, without the expectation operator: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P(s, a, s')\,[R(s, a, s') + \gamma V^\pi(s')]$
$\gamma$ is the discount rate.

Bellman optimality equation
$V^*(s) = \max_a \sum_{s'} P(s, a, s')\,[R(s, a, s') + \gamma V^*(s')]$
where:
V*: state value mapping
S: environment states
R: reward function
P: state transition probability function
$\gamma$: discount factor

Markov decision process (MDP)
An MDP is defined by the four-tuple ⟨S, A, R, P⟩:
• the set of environment states S
• the set of system actions A
• the reward function R: S × A → ℝ
• the state transition function P: S × A → PD(S)
R(s, a, s') denotes the immediate reward the system receives when it takes action a in state s and the environment moves to state s'; P(s, a, s') denotes the probability that taking action a in state s moves the environment to state s'.

Markov decision process (MDP)
• The essence of a Markov decision process is that the probability of moving from the current state to the next state, and the reward received, depend only on the current state and the chosen action, not on earlier states and actions. Hence, when the environment model (the state transition probability function P and the reward function R) is known, dynamic programming can be used to find the optimal policy. Reinforcement learning focuses on how the system can learn an optimal policy when P and R are unknown.

Markov decision process
Characteristics of an MDP:
• a set of states: S
• a set of actions: A
• a reward function R: S × A → ℝ
• a state transition function T: S × A → Π(S)
T(s, a, s'): probability of transition from s to s' using action a

MDP example
Transition function; states and rewards; Bellman equation (greedy policy selection). [Figures and formulas not preserved.]

MDP graphical representation
Arc labels: T(s, action, s')
Similarity to hidden Markov models (HMMs)

Reinforcement learning
Deterministic transitions vs. stochastic transitions: $M^a_{ij}$ is the probability of reaching state j when taking action a in state i.
[Figure: 4 × 3 grid world with a start square and terminal squares +1 and -1.]
A simple environment that presents the agent with a sequential decision problem: move cost = 0.04
(Temporal) credit assignment problem; sparse reinforcement problem
Offline algorithm: action sequences are determined ex ante
Online algorithm: action sequences are conditional on observations along the way; important in stochastic environments (e.g. jet flying)

Reinforcement learning
M = 0.8 in the direction you want to go, 0.2 in perpendicular directions (0.1 left, 0.1 right)
Policy: mapping from states to actions
An optimal policy for the stochastic environment: [figure not preserved]
Utilities of states:
       1       2       3       4
  3    0.812   0.868   0.912   +1
  2    0.762   (wall)  0.660   -1
  1    0.705   0.655   0.611   0.388
Environment:
• Observable (accessible): the percept identifies the state
• Partially observable
Markov property: transition probabilities depend on the state only, not on the path to the state.
Markov decision problem (MDP).
Partially observable MDP (POMDP): percepts do not carry enough information to identify transition probabilities.

Dynamic programming
• Dynamic programming computes the value function by backing up from successor states to predecessor states. It updates states one after another based on a model of the next-state distribution. Dynamic programming for reinforcement learning rests on the fact that, for any policy $\pi$ and any state s, the following iterative consistency equation (10.9) holds:
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \pi(s \to s' \mid a)\,[R^a(s \to s') + \gamma V^\pi(s')]$
$\pi(a \mid s)$ is the probability of action a in state s under the stochastic policy $\pi$. $\pi(s \to s' \mid a)$ is the probability of moving from state s to state s' under action a. This is the Bellman (1957) equation for $V^\pi$.

Dynamic programming: problem
• A discrete-time dynamic system
• States {1, …, n} + termination state 0
• Control U(i)
• Transition probability $p_{ij}(u)$
• Accumulative cost structure
• Policies

Dynamic programming: iterative solution
• Finite horizon problem
• Infinite horizon problem
• Value iteration

Policy iteration and value iteration in dynamic programming
Policy evaluation; policy improvement ("greedification"); policy iteration; value iteration. [Formulas not preserved.]

Dynamic programming methods
[Figure: backup diagram; T marks terminal states.]

Adaptive dynamic programming (ADP)
Idea: use the constraints (state transition probabilities) between states to speed learning.
Solve $U(i) = R(i) + \sum_j M_{ij} U(j)$ using DP (= value determination). No maximization over actions, because the agent is passive, unlike in value iteration.
Large state space, e.g. backgammon: $10^{50}$ equations in $10^{50}$ variables

Value iteration algorithm
[Algorithm not preserved.]
An alternative iteration (Singh, 1993), important for model-free learning: [formula not preserved]
Stop iterating when V(s) changes by less than $\epsilon$.
Policy difference ratio = $2\epsilon\gamma/(1-\gamma)$ (Williams & Baird 1993b)
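The value iteration scheme and its stopping rule (stop when V(s) changes by less than ε) can be illustrated with a short Python sketch. The three-state MDP below, the table layout `P[s][a] = [(probability, next_state, reward), ...]` and the function names are assumptions made for this example; the slides do not give a concrete algorithm listing.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
Transitions = Dict[int, Dict[int, List[Tuple[float, int, float]]]]

P: Transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing terminal state
}


def q_value(P: Transitions, V: Dict[int, float], s: int, a: int,
            gamma: float) -> float:
    """Expected one-step reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P: Transitions, gamma: float = 0.9,
                    epsilon: float = 1e-8):
    """Repeat the Bellman optimality backup until no value moves by epsilon."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    # Greedy policy selection with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
              for s in P}
    return V, policy


V, policy = value_iteration(P)
print(V, policy)
```

In this toy problem action 1 ("advance") is optimal in states 0 and 1; the values can be checked by hand from V(1) = 0.8 + 0.18 V(1).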

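Policy iteration algorithm
[Algorithm not preserved.]
Policies converge faster than values. Why faster convergence?

Dynamic programming
• Classical dynamic programming models are of limited use, because for many problems it is hard to give a complete model of the environment. Simulated robot soccer is one such problem; it can be tackled with real-time dynamic programming. Real-time dynamic programming does not need the environment model in advance; the model is obtained by repeated trials in the real environment. A backpropagation neural network can be used to generalize over states: the network's input units are the environment state s, and its output is the evaluation V(s) of that state.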
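Policy iteration, named on the slides above as policy evaluation followed by policy improvement ("greedification"), can be sketched as follows. The toy MDP, the table format and all function names are hypothetical choices for this illustration, and the strict-improvement test with a small tolerance is one possible way to avoid flipping between equally good actions.

```python
from typing import Dict, List, Tuple

# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
Transitions = Dict[int, Dict[int, List[Tuple[float, int, float]]]]

P: Transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing terminal state
}


def q_value(P, V, s, a, gamma):
    """Expected one-step reward plus discounted value of the successor."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def evaluate_policy(P, policy, gamma, epsilon=1e-10):
    """Policy evaluation: iterate the Bellman equation for a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < epsilon:
            return V


def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: min(P[s]) for s in P}  # arbitrary start: lowest action id
    while True:
        V = evaluate_policy(P, policy, gamma)
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            # Switch only on a strict improvement so ties cannot cause cycling.
            if q_value(P, V, s, best, gamma) > q_value(P, V, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return V, policy


V, policy = policy_iteration(P)
print(V, policy)
```

On this example the policy stabilizes after three evaluation rounds, which gives a small concrete instance of the remark that policies tend to settle before the values do.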

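Model-free methods
Models of the environment: T: S × A → Π(S) and R: S × A → ℝ
Do we know them? Do we have to know them?
• Monte Carlo methods
• Adaptive heuristic critic
• Q-learning

Monte Carlo methods
• Monte Carlo methods do not need a complete model. Instead they sample whole trajectories of states and update the value function from the final outcomes of the samples. Monte Carlo methods require only experience, that is, sampled sequences of states, actions and rewards from online or simulated interaction with the environment. Online experience is interesting because it requires no prior knowledge of the environment yet can still lead to optimal behavior. Learning from simulated experience is also powerful. It needs a model, but the model can be generative rather than analytical: a model that can generate trajectories without being able to compute explicit probabilities. It therefore does not need the complete probability distributions over all possible transitions that dynamic programming requires.

Monte Carlo methods
[Figure: backup diagram; T marks terminal states.]

Monte Carlo methods
• Idea: hold statistics about rewards for each state; take the average; this is V(s)
• Based only on experience
• Assumes episodic tasks (experience is divided into episodes, and all episodes terminate regardless of the actions selected)
• Incremental in an episode-by-episode sense, not a step-by-step sense

Monte Carlo policy evaluation
• Goal: learn $V^\pi(s)$ when P and R are unknown in advance
• Given: some number of episodes under $\pi$ which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically
[Figure: states 1 to 5.]

Monte Carlo methods
Problem: unvisited ⟨s, a⟩ pairs (the problem of maintaining exploration)
For every ⟨s, a⟩ make sure that: P(⟨s, a⟩ selected as a start state and action) > 0 (assumption of exploring starts)

Monte Carlo control
How to select policies (similar to policy evaluation):
MC policy iteration: policy evaluation using MC methods, followed by policy improvement
Policy improvement step: greedify with respect to the value (or action-value) function

Temporal-difference learning
Temporal-difference learning has no environment model and learns from experience. It updates at every step and does not need to wait for the task to finish. It is a control algorithm built on a predictive model: it judges future inputs and outputs from historical information, and emphasizes the function of the model rather than its structure. Like Monte Carlo methods, temporal-difference methods sample the immediate reward feedback obtained during a learning episode; like dynamic programming, they estimate the state value function by bootstrapping. Repeated iteration then brings the estimate closer to the true state value function.

Temporal-difference learning (TD)
[Figure: backup diagram; T marks terminal states.]

Temporal-difference learning
Monte Carlo: $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
target: the actual return after time t
TD: $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
target: an estimate of the return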
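The two targets contrasted on the slide above, the actual return for Monte Carlo and a bootstrapped estimate for TD, can be compared on a single step. The sketch below is illustrative only: the function names, the state labels "A", "B", "T" and the parameter values are assumptions, and the update rules are the standard constant-step-size forms rather than anything specific to this lecture.

```python
from typing import Dict


def mc_update(V: Dict[str, float], state: str, actual_return: float,
              alpha: float) -> None:
    """Monte Carlo: move V(s_t) toward the actual return R_t after time t."""
    V[state] += alpha * (actual_return - V[state])


def td0_update(V: Dict[str, float], state: str, reward: float,
               next_state: str, alpha: float, gamma: float) -> None:
    """TD(0): move V(s_t) toward the estimate r_{t+1} + gamma * V(s_{t+1})."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


# Hypothetical episode: A --(reward 0)--> B --(reward 1)--> T (terminal).
initial = {"A": 0.0, "B": 0.5, "T": 0.0}

V_mc = dict(initial)
# MC must wait for the end of the episode: the return from A is 0 + 1 = 1.
mc_update(V_mc, "A", actual_return=1.0, alpha=0.1)

V_td = dict(initial)
# TD can update right after the first step, using the current guess V(B) = 0.5.
td0_update(V_td, "A", reward=0.0, next_state="B", alpha=0.1, gamma=1.0)

print(V_mc["A"], V_td["A"])
```

The same transition produces different updates because the TD target depends on the current estimate of the successor state, which is what bootstrapping means here.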

Temporal-difference learning (TD)
Idea: do ADP backups on a per-move basis, not for the whole state space.
Theorem: the average value of U(i) converges to the correct value.
Theorem: if $\alpha$ is appropriately decreased as a function of the number of times a state is visited ($\alpha = \alpha(N_i)$), then U(i) itself converges to the correct value.

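TD($\lambda$): a forward view
• TD($\lambda$) is a method for averaging all n-step backups
• weight by $\lambda^{n-1}$ (time since visitation)
• $\lambda$-return: $R_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
• Backup using the $\lambda$-return: $\Delta V_t(s_t) = \alpha\,[R_t^\lambda - V_t(s_t)]$

Temporal-difference learning algorithm TD($\lambda$)
Idea: update from the whole epoch, not just on a state transition.
Special cases:
$\lambda$ = 1: least-mean-square (LMS), Monte Carlo
$\lambda$ = 0: TD
An intermediate choice of $\lambda$ (between 0 and 1) is best. Interplay with $\alpha$.

Temporal-difference learning algorithm TD($\lambda$)
Algorithm 10.1: TD(0) learning algorithm
Initialize V(s) arbitrarily, $\pi$ to the policy to be evaluated
Repeat (for each episode)
    Initialize s
    Repeat (for each step of episode)
        Choose a from s using policy derived from V (e.g., $\epsilon$-greedy)
        Take action a, observe r, s'
        V(s) ← V(s) + $\alpha$[r + $\gamma$V(s') − V(s)]
        s ← s'
    Until s is terminal

Convergence of the temporal-difference learning algorithm TD($\lambda$)
Theorem: converges w.p. 1 under certain boundary conditions.
Decrease $\alpha_i(t)$ s.t. $\sum_t \alpha_i(t) = \infty$ and $\sum_t \alpha_i(t)^2 < \infty$.
In practice, a fixed $\alpha$ is often used for all i and t.

Q-learning (Watkins, 1989)
In Q-learning, the backup starts from action nodes and maximizes over all possible actions in the next state and their rewards. In the fully recursive definition of Q-learning, the bottom nodes of the backup tree are all terminal nodes reachable by a sequence of actions starting from the root node, together with the rewards of those successor actions. Online Q-learning expands forward from the possible actions and does not require building a complete world model. Q-learning can also be performed offline. As we can see, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, Q is a function from state-action pairs to learned values. For all states and actions:
Q: (state × action) → value
One step of Q-learning (10.15):
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + c\,[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
where c and $\gamma$ are both ≤ 1, and $r_{t+1}$ is the reward at state $s_{t+1}$.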
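A minimal tabular sketch of the one-step Q-learning update (10.15) with ε-greedy action selection is given below. Everything concrete in it is an assumption for illustration: the four-state corridor environment, the function names, the random tie-breaking, and the parameter values (`alpha` plays the role of c in the text).

```python
import random
from typing import List, Tuple

N_STATES, GOAL = 4, 3
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(s: int, a: int) -> Tuple[int, float]:
    """Hypothetical deterministic corridor: reward 1.0 on entering the goal."""
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == GOAL else 0.0)


def q_learning(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.2, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[s][a], goal row stays zero
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)  # explore
            else:
                best = max(Q[s])  # exploit, breaking ties at random
                a = rng.choice([b for b in ACTIONS if Q[s][b] == best])
            s2, r = step(s, a)
            # One-step Q-learning backup toward r + gamma * max_a' Q(s', a').
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
            if s == GOAL:
                break
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(Q, greedy)
```

Note that the update never consults a transition model: only the sampled next state and reward are used, which is why the method can run online.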

Q-learning
• Estimate the Q-function using some approximator (for example, linear regression, neural networks or decision trees).
• Derive the estimated policy as an argument of the maximum of the estimated Q-function.
• Allow different parameter vectors at different time points.
• Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

Q-learning
Q(a, i)
The direct approach (ADP) would require learning a model $M^a_{ij}$. Q-learning does not.
Do this update after each state transition: $Q(a, i) \leftarrow Q(a, i) + \alpha\,[R(i) + \max_{a'} Q(a', j) - Q(a, i)]$

Exploration
Tradeoff between exploitation (control) and exploration (identification)
Extremes: greedy vs. random acting (n-armed bandit models)
Q-learning converges to optimal Q-values if
* every state is visited infinitely often (due to exploration),
* the action selection becomes greedy as time approaches infinity, and
* the learning rate $\alpha$ is decreased fast enough but not too fast (as discussed for TD learning)

Common exploration methods
1. In value iteration in an ADP agent: optimistic estimate of utility U+(i)
   Exploration function: $f(u, n) = R^+$ if $n < N$, $u$ otherwise
2. $\epsilon$-greedy method: nongreedy actions vs. the greedy action
3. Boltzmann exploration

Q-learning algorithm
• Initialize Q(s, a) arbitrarily
• Repeat (for each episode)
•     Initialize s
•     Repeat (for each step of episode)
•         Choose a from s using policy derived from Q (e.g., $\epsilon$-greedy)
•         Take action a, observe r, s'
•         Q(s, a) ← Q(s, a) + $\alpha$[r + $\gamma$ max_a' Q(s', a') − Q(s, a)]
•         s ← s'
•     Until s is terminal

Q-learning algorithm
• Set …
• For …
• The estimated policy satisfies …
[Formulas not preserved.]

What is the intuition?
• The Bellman equation gives …
• If … and the training set were infinite, then Q-learning minimizes … which is equivalent to minimizing …
[Formulas not preserved.]

A-learning (Murphy, 2003; Robins, 2004)
• Estimate the A-function (advantages) using some approximator, as in Q-learning.
• Derive the estimated policy as an argument of the maximum of the estimated A-function.
• Allow different parameter vectors at different time points.
• Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-learning algorithm (inefficient version)
• For …
• The estimated policy satisfies …
[Formulas not preserved.]

Differences between Q-learning and A-learning
• Q-learning
  • At time t we model the main effects of the history $(S_t, A_{t-1})$ and the action $A_t$, and their interaction
  • Our $Y_{t-1}$ is affected by how we modeled the main effect of the history at time t, $(S_t, A_{t-1})$
• A-learning
  • At time t we only model the effects of $A_t$ and its interaction with $(S_t, A_{t-1})$
  • Our $Y_{t-1}$ does not depend on a model of the main effect of the history at time t, $(S_t, A_{t-1})$

Q-learning vs. A-learning
• Relative merits and demerits are not completely known till now.
• Q-learning has low variance but high bias.
• A-learning has high variance but low bias.
• Comparing Q-learning with A-learning involves a bias-variance trade-off.

POMDP: partially observable Markov decision process
• Rather than observing the state, we observe some function of the state.
• Ob: observable function, a random variable for each state.
• Problem: different states may look similar.
The optimal strategy might need to consider the history.

Framework of POMDP
A POMDP is defined by the six-tuple ⟨S, A, R, P, Ω, O⟩, where ⟨S, A, R, P⟩ defines the underlying Markov decision model of the environment, Ω is the set of observations, i.e. the set of world states the system can perceive, and O: S × A → PD(Ω) is the observation function. When the system takes action a and moves to state s', the observation function determines the probability distribution over the possible observations, written O(s', a, o).
[1] Ω may be a subset of S, or unrelated to S.

POMDPs
What if state information (from sensors) is noisy? Mostly the case!
MDP techniques are suboptimal!
Two halls are not the same.

POMDPs: a solution strategy
SE: belief state estimator (can be based on an HMM)
$\pi$: MDP techniques

POMDP: the belief-state method
• Idea: given a history of actions and observable values, we compute a posterior distribution for the state we are in (the belief state)
• The belief-state MDP
  • States: distributions over S (the states of the POMDP)
  • Actions: as in the POMDP
  • Transition: the posterior distribution (given the observation)
Open problem: how to deal with the continuous distribution?

The learning process of the belief MDP
[Figure not preserved.]

Major methods to solve POMDP

  Algorithm                         Basic idea
  Learning a value function
    Memoryless policies             Apply a standard reinforcement learning algorithm directly
    Simple memory-based approaches  Represent the current state by the last k observations
    UDM (Utile Distinction Memory)  Split states and build a finite state machine model
    NSM (Nearest Sequence Memory)   Store state histories and apply a distance measure
    USM (Utile Suffix Memory)       Combine the UDM and NSM methods
    Recurrent-Q                     Use a recurrent neural network for state prediction
  Policy search
    Evolutionary algorithms         Use genetic algorithms to search for policies directly
    Gradient ascent method          Search using gradient descent (ascent)
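The belief-state estimator (SE) from the POMDP slides above computes a posterior over states from the previous belief, the action taken and the observation received. A minimal sketch using the transition probabilities P(s, a, s') and the observation function O(s', a, o) follows; the dictionary layout, the function name `update_belief` and the two-hall example numbers are hypothetical.

```python
from typing import Dict, Hashable, Tuple

Belief = Dict[Hashable, float]


def update_belief(belief: Belief, action: Hashable, observation: Hashable,
                  P: Dict[Tuple, float], O: Dict[Tuple, float]) -> Belief:
    """Bayes update: b'(s') is proportional to O(s', a, o) * sum_s P(s, a, s') * b(s)."""
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(P[(s, action, s2)] * belief[s] for s in states)
        unnormalized[s2] = O[(s2, action, observation)] * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: v / total for s2, v in unnormalized.items()}


# Hypothetical example: two halls that look alike to a noisy sensor.
P = {
    ("left", "stay", "left"): 1.0, ("left", "stay", "right"): 0.0,
    ("right", "stay", "left"): 0.0, ("right", "stay", "right"): 1.0,
}
O = {
    ("left", "stay", "wall"): 0.8,   # the sensor reports "wall" more often on the left
    ("right", "stay", "wall"): 0.4,
}

belief = update_belief({"left": 0.5, "right": 0.5}, "stay", "wall", P, O)
print(belief)
```

The resulting belief is a point in a continuous space of distributions, which is the difficulty raised by the open problem on the belief-state slide.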

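Function approximation in reinforcement learning
[Figure: RL → FA. RL supplies a subset of states with value estimates as targets V(s); FA generalizes the value function to the entire state space.]
One operator is the TD operator; the other is the function approximation operator.

Two iterative processes running in parallel
• The value function iteration process
• The value function approximation process
How to construct the M function? Using state clu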


About this document

Title: 第10章-强化学习讲课教案.ppt
Link: https://www.taowenge.com/p-65785053.html