piecewise-linear.ppt
POMDP Tutorial, Part III
Teg Grenager, Stanford University
December 4, 2002

Outline
- Review of the POMDP model
- Exhaustive enumeration algorithms
- Linear support algorithm
- Incremental pruning algorithms
- Witness algorithm
- Conclusion

Formal POMDP Model
The base model:
- S is a finite set of states of the world;
- A is a finite set of actions;
- τ: S × A → Π(S) is the state transition function;
- r: S × A → ℝ is the reward function;
- Z is a finite set of observations the agent can experience of the world;
- o: S × A → Π(Z) is the observation function, which gives, for each action and resulting state, a distribution over observations.
We define B = Π(S), the set of all belief states; the belief state is a sufficient statistic for the past history and initial belief.

Belief Updating
Given belief b, action a, and observation z, compute the new belief b_a^z:
  b_a^z(s') = o(z | s', a) Σ_s τ(s' | s, a) b(s) / Pr(z | a, b)
Don't worry about the denominator: it's a normalizing term.

The Observable "Belief MDP"
- B is the state space (of beliefs);
- A is the set of actions (the same);
- τ: B × A → Π(B) is the state transition function;
- ρ: B × A → ℝ is the reward function.

t-step Policy Tree
So, what's a policy in the belief MDP? It's a tree of actions, conditioned on observations.
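The belief update above can be sketched in a few lines. This is a minimal sketch with assumed dictionary layouts (T[s][a][s2] = Pr(s2 | s, a) and O[s2][a][z] = Pr(z | s2, a)); the function and argument names are illustrative, not from the tutorial.

```python
# Belief update for a generic POMDP, with assumed dict layouts:
# T[s][a][s2] = Pr(s2 | s, a), O[s2][a][z] = Pr(z | s2, a).

def update_belief(b, a, z, states, T, O):
    """Return the new belief b_a^z after taking action a and observing z."""
    new_b = {}
    for s2 in states:
        # o(z | s', a) * sum_s tau(s' | s, a) * b(s)
        new_b[s2] = O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in states)
    norm = sum(new_b.values())  # Pr(z | a, b), the normalizing term
    return {s: v / norm for s, v in new_b.items()}
```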
t-step Value Function
Let V_p(s) be the value of executing the t-step policy tree p starting from state s; V_p is a vector of length |S|. Then the value of executing policy tree p starting from belief state b is
  V_p(b) = Σ_s b(s) V_p(s) = b · V_p.
V_p defines a hyperplane in B: V_p(b) is linear in b. The value of the optimal t-step policy starting at b is
  V_t*(b) = max_p V_p(b),
where the max is taken over all t-step policy trees p. V_t*(b) is piecewise-linear and convex (PWLC).

Policy Trees and Value Function
[Figure: the set of 2-step policy trees for the tiger problem — p20 = (Li; TL→Ri, TR→Ri), p21 = (Li; TL→Le, TR→Le), p22 = (Li; TL→Li, TR→Li), p23 = (Li; TL→Li, TR→Ri), p24 = (Li; TL→Le, TR→Li) — and the corresponding piecewise-linear and convex value function (a set of hyperplanes, represented as vectors), plotted over b(SR) ∈ [0, 1] with expected t-step discounted value on the vertical axis.]

Dominated Policy Trees
Represent a value function as a set of policy trees and their associated value functions, given as |S|-vectors. Some vectors (and policy trees) may be dominated, or useless. [Figure: a dominated tree p25 = (Li; TL→Ri, TR→Le), whose hyperplane lies below the upper surface everywhere.] We prefer a parsimonious representation: a set of vectors with no dominated policy trees.

Exhaustive Enumeration Overview
We want to do value iteration. Insight: given
- a parsimonious set Γ_{t-1} of (t-1)-step policy trees, and
- the associated PWLC value function V_{t-1}(·),
we want to generate a parsimonious set Γ_t of t-step policy trees and the associated value function V_t.
Naive method:
1. Generate an exhaustive set Γ_t^+ of policy trees from the value function V_{t-1} and Γ_{t-1}.
2. Prune the set Γ_t^+ to a parsimonious representation Γ_t.
Due to Sondik (1971) and Monahan (1982).

Exhaustive Enumeration
How many trees do we generate in the first step? We can choose any action a ∈ A for the root node, and for each observation z ∈ Z we can choose any of the (t-1)-step policy trees (assuming we have a parsimonious representation of V_{t-1}). Thus the number of trees we generate is |A| · |Γ_{t-1}|^{|Z|}, which is exponential in the number of observations (bad!).
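Since each policy tree contributes one |S|-vector, evaluating the PWLC value function at a belief is just maximizing a dot product over the set. A small sketch (the helper names are mine; the example vectors in the test are the tiger problem's 1-step reward vectors):

```python
# Evaluating a PWLC value function represented as a set of alpha vectors,
# one |S|-vector per policy tree. V(b) is the upper surface of the
# hyperplanes; best_tree identifies which tree attains it.

def value(b, vectors):
    """V(b) = max_p sum_s b[s] * V_p[s]."""
    return max(sum(bs * vs for bs, vs in zip(b, v)) for v in vectors)

def best_tree(b, vectors):
    """Index of the policy tree whose hyperplane is maximal at belief b."""
    return max(range(len(vectors)),
               key=lambda i: sum(bs * vs for bs, vs in zip(b, vectors[i])))
```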
Computing the Vector at a Point
Insight: given the set Γ_{t-1} of (t-1)-step policy trees and the associated PWLC value function V_{t-1}(·), it is easy to compute the optimal t-step policy tree p* and value V_t(b) for a given belief state b. How? For each action a and each observation z, compute the new belief state b_a^z, and select the policy tree in Γ_{t-1} that has the maximal value at b_a^z.

Linear Support Overview
Basic idea: in order to check whether our partial set Γ_t of policy trees is complete, we need only check all of the vertices formed by the value functions in Γ_t. Problem (foreshadowing): the number of vertices may be exponential in the number of dimensions and the number of hyperplanes.
[Figure: the hyperplanes of p20, p21, and p22 over b(SR) ∈ [0, 1], with their vertices marked.]

Linear Support Example
Recall the tiger problem: S = {SL, SR}, A = {left, right, listen}, Z = {TL, TR}.

Transition function τ(s, a, s'):
- τ(s, left, s') = τ(s, right, s') = 0.5 for all s, s' (opening a door resets the problem);
- τ(s, listen, s') = 1 if s' = s, and 0 otherwise.

Observation function o(TL, s, a):
- o(TL, SL, listen) = 0.85, o(TL, SR, listen) = 0.15;
- o(TL, s, left) = o(TL, s, right) = 0.5 for both states.

Reward function r(s, a):
         left   listen   right
  SL:   -100      -1       10
  SR:     10      -1     -100
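The tiger model above can be written down directly. A sketch using plain Python dictionaries (this layout is an assumption of the sketch, not a standard POMDP format):

```python
# The tiger problem as plain Python dictionaries.
S = ["SL", "SR"]                # tiger behind the left / right door
A = ["left", "right", "listen"]
Z = ["TL", "TR"]                # hear the tiger on the left / right

# tau[a][s][s2] = Pr(s2 | s, a): opening a door resets the problem to a
# uniform state; listening leaves the state unchanged.
tau = {
    "left":   {s: {s2: 0.5 for s2 in S} for s in S},
    "right":  {s: {s2: 0.5 for s2 in S} for s in S},
    "listen": {s: {s2: 1.0 if s2 == s else 0.0 for s2 in S} for s in S},
}

# obs[a][s2][z] = Pr(z | s2, a): listening is 85% accurate; opening a
# door yields an uninformative observation.
obs = {
    "listen": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}},
    "left":   {s: {z: 0.5 for z in Z} for s in S},
    "right":  {s: {z: 0.5 for z in Z} for s in S},
}

# r[s][a]: -100 for opening the tiger's door, +10 for the other door,
# -1 for listening.
r = {
    "SL": {"left": -100, "listen": -1, "right": 10},
    "SR": {"left": 10, "listen": -1, "right": -100},
}
```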
Linear Support Example
[Figure: the 1-step policy trees p10 = Le, p11 = Ri, p12 = Li and their value hyperplanes over b(SR) ∈ [0, 1].]
We start with the set Γ_1 of 1-step policy trees (and the associated value function); we are computing the set of 2-step policy trees. Initialize a set Σ of belief states to check: Σ = {(1,0)}. Initialize the set of 2-step policy trees to be empty: Γ_t = ∅.

Linear Support Example
- Remove the belief (1,0) from Σ. Pick a policy tree that is optimal for belief (1,0): p20 = (Li; TL→Ri, TR→Ri). Add the new tree: Γ_t = {p20}. Compute the region: R(p20, Γ_t) = (0,1). Add the new vertex: Σ = {(0,1)}.
- Remove the belief (0,1) from Σ. Pick a policy tree that is optimal for belief (0,1): p21 = (Li; TL→Le, TR→Le). Add the new tree: Γ_t = {p20, p21}. Compute the region: R(p21, Γ_t) = (0.5,1). Add the new vertex: Σ = {(0.5,0.5)}.
- Remove the belief (0.5,0.5) from Σ. Pick a policy tree that is optimal for belief (0.5,0.5): p22 = (Li; TL→Li, TR→Li). Add the new tree: Γ_t = {p20, p21, p22}. Compute the region: R(p22, Γ_t) = (x,y). Add the new vertices: Σ = {(x,1-x), (y,1-y)}.
The process continues until we get the final set Γ_t = {p20, p21, p22, p23, p24}, where p23 = (Li; TL→Li, TR→Ri) and p24 = (Li; TL→Le, TR→Li).

Linear Support Algorithm
- Initialize the fringe set Σ to contain an arbitrary belief state b.
- Initialize the result set Γ_t to be empty.
- Iterate until Σ is empty:
  - Remove b from Σ.
  - Let p, α be the policy tree and vector that are optimal for b.
  - If α is not in Γ_t, add α to Γ_t.
  - Let R be the region R(α, Γ_t).
  - For each vertex b' of R: if b' is not in Σ, add b' to Σ.
- Return Γ_t.

Linear Support Running Time
Unfortunately, the number of vertices in a convex polytope is exponential in either the dimension of the state space or the number of binding constraints. It is not practical to check all vertices for more than 4 or 5 states.
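For a two-state POMDP the loop above is easy to sketch, since beliefs are one-dimensional and region vertices are line crossings. This is an illustrative simplification, not the general algorithm: it takes the exhaustively generated candidate vectors as input, whereas a real implementation computes the optimal vector at b directly.

```python
# Linear support sketch for a 2-state POMDP. A belief is one number
# b = b(s2); each vector (v1, v2) is the line v1*(1-b) + v2*b over
# b in [0, 1]; region vertices are crossings of chosen vectors, plus
# the simplex corners.

def value_at(b, vec):
    return vec[0] * (1 - b) + vec[1] * b

def linear_support_1d(candidates, eps=1e-9):
    sigma = [0.0]                 # fringe set of belief states to check
    gamma = []                    # result set of vectors
    while sigma:
        b = sigma.pop()
        best = max(candidates, key=lambda v: value_at(b, v))
        if best in gamma:
            continue              # the optimal vector at b is already known
        gamma.append(best)
        sigma.extend([0.0, 1.0])  # recheck the simplex corners
        for other in gamma:
            # vertex where `best` and `other` cross, if inside [0, 1]
            d = (best[0] - other[0]) - (best[1] - other[1])
            if abs(d) > eps:
                x = (best[0] - other[0]) / d
                if 0.0 <= x <= 1.0:
                    sigma.append(x)
    return gamma
```

Running this on the tiger problem's 1-step vectors recovers the three undominated lines and discards any vector that lies below the upper surface everywhere.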
Batch Enumeration
Separately compute Γ_t^a, the set of vectors representing the value function corresponding to policy trees that take a particular action a. Note that this is similar in spirit to a Q-function, since we work on each action separately. Recall that, given Γ_{t-1}, an action a, and an observation z, it is easy to generate the best t-step policy tree and associated value function. We define Γ_t^{a,z} to be the set of all such vectors. Since there are only |Γ_{t-1}| possible (t-1)-step subtrees, we have |Γ_t^{a,z}| = |Γ_{t-1}|.

Batch Enumeration (cont.)
Thus we can define Γ_t^a to be the set of vectors obtained for all ways of selecting a vector from Γ_t^{a,z} for each z. We express this mathematically with the cross-sum operator ⊕, which takes two (or more) sets of vectors and produces the set of vectors consisting of all ways of adding one vector from each set:
  Γ_t^a = ⊕_{z ∈ Z} Γ_t^{a,z}.
Once we do this for each action, we need to union the sets:
  Γ_t = ∪_{a ∈ A} Γ_t^a.

Incremental Pruning
Of course, after we union all the sets, we need to prune. Thus the final batch enumeration algorithm can be summarized as:
  Γ_t = prune(∪_a ⊕_z Γ_t^{a,z}).
Of course, this is exponential in |Z|. We'd like to push the pruning as far inside as we can:
  Γ_t = prune(∪_a prune(⊕_z prune(Γ_t^{a,z}))).
Note: even though we can push it inside the cross-sum, it is still needed outside.

Incremental Pruning (cont.)
We can do even better: interleave the cross-sum operations with the pruning operations:
  prune(… prune(prune(Γ_t^{a,z1} ⊕ Γ_t^{a,z2}) ⊕ Γ_t^{a,z3}) …).

Incremental Pruning (cont.)
Optimization: note that in computing we could do better if we could figure out which vectors of were goin
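The cross-sum and interleaved pruning above can be sketched as follows. For simplicity, prune() here removes only pointwise-dominated vectors; the real incremental pruning algorithms use a linear-program-based prune that also removes vectors dominated by combinations of others.

```python
# Cross-sum of vector sets and an interleaved pruning pass.
from itertools import product

def cross_sum(*vector_sets):
    """All ways of adding one vector from each set: A (+) B (+) ..."""
    return [tuple(sum(c) for c in zip(*combo)) for combo in product(*vector_sets)]

def prune(vectors):
    """Drop vectors that some single other vector dominates in every state."""
    return [v for v in vectors
            if not any(w != v and all(wi >= vi for wi, vi in zip(w, v))
                       for w in vectors)]

def incremental_prune(vector_sets):
    """prune(... prune(prune(G1 (+) G2) (+) G3) ...)"""
    result = vector_sets[0]
    for vs in vector_sets[1:]:
        result = prune(cross_sum(result, vs))
    return result
```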