piecewise-linear.ppt
POMDP Tutorial, Part III
Teg Grenager, Stanford University
December 4, 2002

Outline
- Review of the POMDP model
- Exhaustive enumeration algorithms
- Linear support algorithm
- Incremental pruning algorithms
- Witness algorithm
- Conclusion

Formal POMDP Model
The base model:
- S is a finite set of states of the world;
- A is a finite set of actions;
- τ: S × A → Π(S) is the state transition function;
- r: S × A → ℝ is the reward function;
- Z is a finite set of observations the agent can experience of the world;
- o: S × A → Π(Z) is the observation function, which gives, for each action and resulting state, a distribution over observations.
We define B = Π(S), the set of all belief states; the belief state is a sufficient statistic for the past history and initial belief.

Belief Updating
Given belief b, action a, and observation z, compute the new belief b_a^z:
  b_a^z(s') = o(z | s', a) Σ_s τ(s' | s, a) b(s) / Pr(z | a, b)
Don't worry about the denominator: it's a normalizing term.

The Observable "Belief MDP"
- B is the state space (of beliefs);
- A is the set of actions (the same);
- τ: B × A → Π(B) is the state transition function;
- ρ: B × A → ℝ is the reward function.

t-step Policy Tree
So, what's a policy in the belief MDP? It's a tree of actions, conditioned on observations.
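The belief update above can be sketched in a few lines. This is a minimal sketch with assumed dictionary layouts (T[s][a][s2] = Pr(s2 | s, a) and O[s2][a][z] = Pr(z | s2, a)); the function and argument names are illustrative, not from the tutorial.

```python
# Belief update for a generic POMDP, with assumed dict layouts:
# T[s][a][s2] = Pr(s2 | s, a), O[s2][a][z] = Pr(z | s2, a).

def update_belief(b, a, z, states, T, O):
    """Return the new belief b_a^z after taking action a and observing z."""
    new_b = {}
    for s2 in states:
        # o(z | s', a) * sum_s tau(s' | s, a) * b(s)
        new_b[s2] = O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
    norm = sum(new_b.values())  # Pr(z | a, b), the normalizing term
    return {s: v / norm for s, v in new_b.items()}
```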
t-step Value Function
Let V_p(s) be the value of executing the t-step policy tree p starting from state s; V_p is a vector of length |S|. Then the value of executing policy tree p starting from belief state b is
  V_p(b) = Σ_s b(s) V_p(s) = b · V_p.
V_p defines a hyperplane in B: V_p(b) is linear in b. The value of the optimal t-step policy starting at b is
  V_t*(b) = max_p V_p(b),
where the max is taken over all t-step policy trees p. V_t*(b) is piecewise-linear and convex (PWLC).

Policy Trees and Value Function
[Figure: the set of 2-step policy trees for the tiger problem — p20 = (Li; TL→Ri, TR→Ri), p21 = (Li; TL→Le, TR→Le), p22 = (Li; TL→Li, TR→Li), p23 = (Li; TL→Li, TR→Ri), p24 = (Li; TL→Le, TR→Li) — and the corresponding piecewise-linear and convex value function (a set of hyperplanes, represented as vectors), plotted over b(SR) ∈ [0, 1] with expected t-step discounted value on the vertical axis.]

Dominated Policy Trees
Represent a value function as a set of policy trees and their associated value functions, given as |S|-vectors. Some vectors (and policy trees) may be dominated, or useless. [Figure: a dominated tree p25 = (Li; TL→Ri, TR→Le), whose hyperplane lies below the upper surface everywhere.] We prefer a parsimonious representation: a set of vectors with no dominated policy trees.

Exhaustive Enumeration Overview
We want to do value iteration. Insight: given
- a parsimonious set Γ_{t-1} of (t-1)-step policy trees, and
- the associated PWLC value function V_{t-1}(·),
we want to generate a parsimonious set Γ_t of t-step policy trees and the associated value function V_t.
Naive method:
1. Generate an exhaustive set Γ_t^+ of policy trees from the value function V_{t-1} and Γ_{t-1}.
2. Prune the set Γ_t^+ to a parsimonious representation Γ_t.
Due to Sondik (1971) and Monahan (1982).

Exhaustive Enumeration
How many trees do we generate in the first step? We can choose any action a ∈ A for the root node, and for each observation z ∈ Z we can choose any of the (t-1)-step policy trees (assuming we have a parsimonious representation of V_{t-1}). Thus the number of trees we generate is |A| · |Γ_{t-1}|^{|Z|}, which is exponential in the number of observations (bad!).
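Since each policy tree contributes one |S|-vector, evaluating the PWLC value function at a belief is just maximizing a dot product over the set. A small sketch (the helper names are mine; the example vectors in the test are the tiger problem's 1-step reward vectors):

```python
# Evaluating a PWLC value function represented as a set of alpha vectors,
# one |S|-vector per policy tree. V(b) is the upper surface of the
# hyperplanes; best_tree identifies which tree attains it.

def value(b, vectors):
    """V(b) = max_p sum_s b[s] * V_p[s]."""
    return max(sum(bs * vs for bs, vs in zip(b, v)) for v in vectors)

def best_tree(b, vectors):
    """Index of the policy tree whose hyperplane is maximal at belief b."""
    return max(range(len(vectors)),
               key=lambda i: sum(bs * vs for bs, vs in zip(b, vectors[i])))
```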
Computing the Vector at a Point
Insight: given the set Γ_{t-1} of (t-1)-step policy trees and the associated PWLC value function V_{t-1}(·), it is easy to compute the optimal t-step policy tree p* and value V_t(b) for a given belief state b. How? For each action a and each observation z, compute the new belief state b_a^z, and select the policy tree in Γ_{t-1} that has the maximal value at b_a^z.

Linear Support Overview
Basic idea: in order to check whether our partial set Γ_t of policy trees is complete, we need only check all of the vertices formed by the value functions in Γ_t. Problem (foreshadowing): the number of vertices may be exponential in the number of dimensions and the number of hyperplanes.
[Figure: the hyperplanes of p20, p21, and p22 over b(SR) ∈ [0, 1], with their vertices marked.]

Linear Support Example
Recall the tiger problem: S = {SL, SR}, A = {left, right, listen}, Z = {TL, TR}.

Transition function τ(s, a, s'):
- τ(s, left, s') = τ(s, right, s') = 0.5 for all s, s' (opening a door resets the problem);
- τ(s, listen, s') = 1 if s' = s, and 0 otherwise.

Observation function o(TL, s, a):
- o(TL, SL, listen) = 0.85, o(TL, SR, listen) = 0.15;
- o(TL, s, left) = o(TL, s, right) = 0.5 for both states.

Reward function r(s, a):
         left   listen   right
  SL:   -100      -1       10
  SR:     10      -1     -100
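The tiger model above can be written down directly. A sketch using plain Python dictionaries (this layout is an assumption of the sketch, not a standard POMDP format):

```python
# The tiger problem as plain Python dictionaries.
S = ["SL", "SR"]                # tiger behind the left / right door
A = ["left", "right", "listen"]
Z = ["TL", "TR"]                # hear the tiger on the left / right

# tau[a][s][s2] = Pr(s2 | s, a): opening a door resets the problem to a
# uniform state; listening leaves the state unchanged.
tau = {
    "left":   {s: {s2: 0.5 for s2 in S} for s in S},
    "right":  {s: {s2: 0.5 for s2 in S} for s in S},
    "listen": {s: {s2: 1.0 if s2 == s else 0.0 for s2 in S} for s in S},
}

# obs[a][s2][z] = Pr(z | s2, a): listening is 85% accurate; opening a
# door yields an uninformative observation.
obs = {
    "listen": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}},
    "left":   {s: {z: 0.5 for z in Z} for s in S},
    "right":  {s: {z: 0.5 for z in Z} for s in S},
}

# r[s][a]: -100 for opening the tiger's door, +10 for the other door,
# -1 for listening.
r = {
    "SL": {"left": -100, "listen": -1, "right": 10},
    "SR": {"left": 10, "listen": -1, "right": -100},
}
```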
Linear Support Example
[Figure: the 1-step policy trees p10 = Le, p11 = Ri, p12 = Li and their value hyperplanes over b(SR) ∈ [0, 1].]
We start with the set Γ_1 of 1-step policy trees (and the associated value function); we are computing the set of 2-step policy trees. Initialize a set Σ of belief states to check: Σ = {(1,0)}. Initialize the set of 2-step policy trees to be empty: Γ_t = ∅.

Linear Support Example
- Remove the belief (1,0) from Σ. Pick a policy tree that is optimal for belief (1,0): p20 = (Li; TL→Ri, TR→Ri). Add the new tree: Γ_t = {p20}. Compute the region: R(p20, Γ_t) = (0,1). Add the new vertex: Σ = {(0,1)}.
- Remove the belief (0,1) from Σ. Pick a policy tree that is optimal for belief (0,1): p21 = (Li; TL→Le, TR→Le). Add the new tree: Γ_t = {p20, p21}. Compute the region: R(p21, Γ_t) = (0.5,1). Add the new vertex: Σ = {(0.5,0.5)}.
- Remove the belief (0.5,0.5) from Σ. Pick a policy tree that is optimal for belief (0.5,0.5): p22 = (Li; TL→Li, TR→Li). Add the new tree: Γ_t = {p20, p21, p22}. Compute the region: R(p22, Γ_t) = (x,y). Add the new vertices: Σ = {(x,1-x), (y,1-y)}.
The process continues until we get the final set Γ_t = {p20, p21, p22, p23, p24}, where p23 = (Li; TL→Li, TR→Ri) and p24 = (Li; TL→Le, TR→Li).

Linear Support Algorithm
- Initialize the fringe set Σ to contain an arbitrary belief state b.
- Initialize the result set Γ_t to be empty.
- Iterate until Σ is empty:
  - Remove b from Σ.
  - Let p, α be the policy tree and vector that are optimal for b.
  - If α is not in Γ_t, add α to Γ_t.
  - Let R be the region R(α, Γ_t).
  - For each vertex b' of R: if b' is not in Σ, add b' to Σ.
- Return Γ_t.

Linear Support Running Time
Unfortunately, the number of vertices in a convex polytope is exponential in either the dimension of the state space or the number of binding constraints. It is not practical to check all vertices for more than 4 or 5 states.
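For a two-state POMDP the loop above is easy to sketch, since beliefs are one-dimensional and region vertices are line crossings. This is an illustrative simplification, not the general algorithm: it takes the exhaustively generated candidate vectors as input, whereas a real implementation computes the optimal vector at b directly.

```python
# Linear support sketch for a 2-state POMDP. A belief is one number
# b = b(s2); each vector (v1, v2) is the line v1*(1-b) + v2*b over
# b in [0, 1]; region vertices are crossings of chosen vectors, plus
# the simplex corners.

def value_at(b, vec):
    return vec[0] * (1 - b) + vec[1] * b

def linear_support_1d(candidates, eps=1e-9):
    sigma = [0.0]                 # fringe set of belief states to check
    gamma = []                    # result set of vectors
    while sigma:
        b = sigma.pop()
        best = max(candidates, key=lambda v: value_at(b, v))
        if best in gamma:
            continue              # the optimal vector at b is already known
        gamma.append(best)
        sigma.extend([0.0, 1.0])  # recheck the simplex corners
        for other in gamma:
            # vertex where `best` and `other` cross, if inside [0, 1]
            d = (best[0] - other[0]) - (best[1] - other[1])
            if abs(d) > eps:
                x = (best[0] - other[0]) / d
                if 0.0 <= x <= 1.0:
                    sigma.append(x)
    return gamma
```

Running this on the tiger problem's 1-step vectors recovers the three undominated lines and discards any vector that lies below the upper surface everywhere.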
Batch Enumeration
Separately compute Γ_t^a, the set of vectors representing the value function corresponding to policy trees that take a particular action a. Note that this is similar in spirit to a Q-function, since we work on each action separately. Recall that, given Γ_{t-1}, an action a, and an observation z, it is easy to generate the best t-step policy tree and associated value function. We define Γ_t^{a,z} to be the set of all such vectors. Since there are only |Γ_{t-1}| possible (t-1)-step subtrees, we have |Γ_t^{a,z}| = |Γ_{t-1}|.

Batch Enumeration (cont.)
Thus we can define Γ_t^a to be the set of vectors obtained for all ways of selecting a vector from Γ_t^{a,z} for each z. We express this mathematically with the cross-sum operator ⊕, which takes two (or more) sets of vectors and produces the set of vectors consisting of all ways of adding one vector from each set:
  Γ_t^a = ⊕_{z ∈ Z} Γ_t^{a,z}.
Once we do this for each action, we need to union the sets:
  Γ_t = ∪_{a ∈ A} Γ_t^a.

Incremental Pruning
Of course, after we union all the sets, we need to prune. Thus the final batch enumeration algorithm can be summarized as:
  Γ_t = prune(∪_a ⊕_z Γ_t^{a,z}).
Of course, this is exponential in |Z|. We'd like to push the pruning as far inside as we can:
  Γ_t = prune(∪_a prune(⊕_z prune(Γ_t^{a,z}))).
Note: even though we can push it inside the cross-sum, it is still needed outside.

Incremental Pruning (cont.)
We can do even better: interleave the cross-sum operations with the pruning operations:
  prune(… prune(prune(Γ_t^{a,z1} ⊕ Γ_t^{a,z2}) ⊕ Γ_t^{a,z3}) …).

Incremental Pruning (cont.)
Optimization: note that in computing we could do better if we could figure out which vectors of were goin
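The cross-sum and interleaved pruning above can be sketched as follows. For simplicity, prune() here removes only pointwise-dominated vectors; the real incremental pruning algorithms use a linear-program-based prune that also removes vectors dominated by combinations of others.

```python
# Cross-sum of vector sets and an interleaved pruning pass.
from itertools import product

def cross_sum(*vector_sets):
    """All ways of adding one vector from each set: A (+) B (+) ..."""
    return [tuple(sum(c) for c in zip(*combo)) for combo in product(*vector_sets)]

def prune(vectors):
    """Drop vectors that some single other vector dominates in every state."""
    return [v for v in vectors
            if not any(w != v and all(wi >= vi for wi, vi in zip(w, v))
                       for w in vectors)]

def incremental_prune(vector_sets):
    """prune(... prune(prune(G1 (+) G2) (+) G3) ...)"""
    result = vector_sets[0]
    for vs in vector_sets[1:]:
        result = prune(cross_sum(result, vs))
    return result
```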