Multi-stage lookahead in Approximate Policy Iteration

A lot of planning tasks involve an ad-hoc or estimated value function that scores future states. Since this value function is typically inaccurate, some lookahead is necessary to better estimate the values of next-step states: if the value function is optimal then the best next state is trivially found. This project will examine the interaction of multi-step lookaheads and the online estimation of the value function: what is the optimal look-ahead strategy in terms of convergence of the value function to optimality?

Planning tasks can be solved using value functions V(s). Then, by performing a lookahead, we can select the action leading to the state with the highest value. When this value function is the optimal value function, a one-stage lookahead suffices. However, most of the time we have an approximation of the optimal value function. Then, effects of the approximation error are diminished through a k-stage lookahead. This can occur in two cases:

a) When the value function is that of discounted expected rewards, in which case the error dimished by a factor exponential in k. This is a common case in reinforcement learning tasks.

b) When approximation error is smaller in future states than in current states. In such cases, selective lookahead, such as used in chess, may be advantageous.

However the above discussion only applies to having already obtained a given approximation to the optimal value function. In a lot of cases, however, we want to iteratively create a good approximation to the optimal value function by starting with an initial policy, evaluating it -- and then creating an improved policy using this evaluation. Such methods are fall within the framework of approximate policy iteration. The main question that this projects wishes to address is how to make approximate policy iteration perform better or converge faster by tweaking the lookahead procedure.


Status:
Open
Location:
Universiteit van Amsterdam
Contact:
Christos Dimitrakakis