3. Bellman optimality equation
3.1 Definition of optimal policy
- Definition:
A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \ge v_\pi(s)$ for all $s$ and for any other policy $\pi$.
Key point: the state value $v$ is the largest at every state.
Several questions:
- Does the optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How to obtain the optimal policy?
3.2 BOE: Introduction
- Bellman optimality equation (elementwise form):
$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v(s^\prime) \right) = \max_\pi \sum_a \pi(a \mid s)\, q(s, a), \quad \forall s \in \mathcal{S}$$
| Bellman equation | Bellman optimality equation |
| --- | --- |
| $v_\pi(s)$ | $v(s)$ |
| without $\max_\pi$ | with $\max_\pi$ |
| for a given $\pi$ | to solve/get the optimal $\pi$ |
- Remarks:
- $p(r\mid s, a), p(s^\prime \mid s, a)$ are ==known==
- $v(s), v(s^\prime)$ are ==unknown== and to be calculated
- Bellman optimality equation (matrix-vector form):
$$\boldsymbol{v} = \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right)$$
- where the elements corresponding to $s$ or $s^\prime$ are $[\boldsymbol{r}_\pi]_s = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$ and $[\boldsymbol{P}_\pi]_{s, s^\prime} = \sum_a \pi(a \mid s)\, p(s^\prime \mid s, a)$
Here $\max_\pi$ is performed elementwise, i.e., $\max_\pi$ can be moved inside the matrix-vector expression and applied to each element separately.
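Concretely, this elementwise reading can be sketched as follows (same notation as above; the $s$-th component depends on $\pi$ only through $\pi(\cdot \mid s)$, so the policy at each state can be optimized independently):
$$\left[ \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right) \right]_s = \max_{\pi(\cdot \mid s)} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v(s^\prime) \right), \quad \forall s \in \mathcal{S}$$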
==Tricky==
A maximization problem is embedded on the right-hand side, so it is not immediately obvious how to compute it.
==Elegant==
It describes the optimal policy and the optimal state value in an elegant way: the form is concise and characterizes both directly.
Several questions:
- Algorithm: how to solve it?
- Existence: does it have a solution?
- Uniqueness: is the solution unique?
- Optimality: how is it related to the optimal policy?
3.3 BOE: Maximization on the right-hand side
- For the BOE, a single equation contains two unknowns, $\pi$ and $v$, so they have to be solved together by some other means.
Consider the toy equation $x = \max_a (2x - 1 - a^2)$:
- First, ignoring $x$, the maximum over $a$ is attained at $a = 0$;
- Setting $a = 0$, the equation becomes $x = 2x - 1$, so $x = 1$.
Therefore $x = 1, a = 0$ is the final solution.
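A minimal numerical check of this two-step solution (plain Python; the grid over $a$ is just for illustration):

```python
# Two-step check of x = max_a (2x - 1 - a^2):
# step 1: for a fixed x, the inner maximum over a is attained at a = 0;
# step 2: with a = 0 the equation reduces to x = 2x - 1, i.e. x = 1.

a_grid = [i / 100 for i in range(-200, 201)]   # candidate values of a in [-2, 2]

x = 1.0                                        # solution of x = 2x - 1
best_a = max(a_grid, key=lambda a: 2 * x - 1 - a ** 2)
rhs = max(2 * x - 1 - a ** 2 for a in a_grid)

print(best_a)    # 0.0  -> the maximizer of the right-hand side
print(rhs == x)  # True -> x = 1 satisfies the equation
```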
In the same spirit, when solving the BOE we can fix $v(s^\prime)$ first and solve for $\pi$:
begin with an initial guess of $v(s^\prime)$, such as $0$,
from which the action values $q(s, a)$ are obtained (see the elementwise BOE above).
How is this maximization solved? Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given, and we look for $c^{*}_1, c^{*}_2, c^{*}_3$ maximizing $c_1 q_1 + c_2 q_2 + c_3 q_3$,
where $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \ge 0$ (the defining properties of a probability distribution).
Without loss of generality, suppose $q_3 > q_1, q_2$; then the optimal solution is $c^{*}_3 = 1$ and $c^{*}_1 = c^{*}_2 = 0$, since $c_1 q_1 + c_2 q_2 + c_3 q_3 \le (c_1 + c_2 + c_3)\, q_3 = q_3$.
Returning to the elementwise BOE,
- considering $\sum_a \pi(a \mid s) = 1$, we have
$$\max_\pi \sum_a \pi(a \mid s)\, q(s, a) = \max_{a \in \mathcal{A}(s)} q(s, a),$$
and the maximum is attained when $\pi(a^* \mid s) = 1$ and $\pi(a \mid s) = 0$ for $a \ne a^*$,
where $a^{*} = \arg \max_a q(s, a)$.
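A small numerical illustration of this step, using hypothetical action values $q(s,\cdot)$ for a single state and comparing random stochastic policies against the greedy one:

```python
import random

# Hypothetical action values q(s, a) for a single state s with 4 actions.
q = [0.3, -1.2, 2.5, 0.7]

def expected_q(pi):
    """Expected action value sum_a pi(a|s) * q(s, a) under a policy over this state."""
    return sum(p * qa for p, qa in zip(pi, q))

# Greedy (deterministic) policy: all probability on a* = argmax_a q(s, a).
a_star = max(range(len(q)), key=lambda a: q[a])
greedy = [1.0 if a == a_star else 0.0 for a in range(len(q))]

# No random stochastic policy does better than the greedy one.
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in q]
    pi = [x / sum(w) for x in w]          # normalize to a probability distribution
    assert expected_q(pi) <= expected_q(greedy) + 1e-12

print(a_star, expected_q(greedy))         # 2 2.5 == max_a q(s, a)
```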
3.4 Rewrite as $v = f(v)$
Considering the matrix-vector form of the BOE, let:
$$f(\boldsymbol{v}) := \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right)$$
Then the BOE becomes:
$$\boldsymbol{v} = f(\boldsymbol{v})$$
Here $f(\boldsymbol{v})$ is a vector whose $s$-th element is $[f(\boldsymbol{v})]_s = \max_\pi \sum_a \pi(a \mid s)\, q(s, a)$.
3.5 Contraction mapping theorem
Concepts
- ==Fixed point==: $x \in X$ is a fixed point of $f: X \rightarrow X$ if $f(x) = x$
- ==Contraction mapping== / contractive function: $f$ is a contraction mapping if $||f(x_1) - f(x_2)|| \le \gamma ||x_1 - x_2||$ for all $x_1, x_2$,
where $\gamma \in (0,1)$.
- $\gamma$ must be strictly less than $1$, so that limits such as $\gamma^k \rightarrow 0$ as $k \rightarrow \infty$ hold.
- Here $||\cdot||$ can be any vector norm
Intuitively, the distance between $f(x_1)$ and $f(x_2)$ is smaller than the distance between $x_1$ and $x_2$, so repeatedly applying $f$ pulls points closer together, which is what drives convergence.
==Proof== && ==Contraction Mapping Theorem==
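A minimal sketch of the fixed-point iteration that this theorem guarantees, using a simple made-up contraction $f(x) = 0.5x + 1$ (contraction factor $\gamma = 0.5$, unique fixed point $x^* = 2$):

```python
# Fixed-point iteration x_{k+1} = f(x_k) for a contraction mapping.
# Here f(x) = 0.5 * x + 1: |f(x1) - f(x2)| = 0.5 * |x1 - x2|, so gamma = 0.5 < 1,
# and the unique fixed point is x* = 2 (solve x = 0.5x + 1).

def f(x):
    return 0.5 * x + 1

x = -10.0                      # any initial guess works
for k in range(50):
    x = f(x)

print(x)                       # -> 2.0 (up to floating point), the unique fixed point
print(abs(x - 2.0) < 1e-12)    # True: error shrinks by a factor gamma each step (~ 12 * 0.5**k)
```

The BOE iteration in the next subsection is this same scheme with $f(\boldsymbol{v}) = \max_\pi(\boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v})$.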
3.6 BOE: Solution
==Proof==: the BOE map $f$ is a contraction mapping: $||f(v_1) - f(v_2)|| \le \gamma || v_1 - v_2 ||$
Therefore, by the contraction mapping theorem, we obtain the following conclusion:
Theorem (Existence, Uniqueness, and Algorithm)
There always **exists** a solution $v^*$ and the solution is **unique**. The solution can be obtained iteratively by
$$v_{k+1} = f(v_k) = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right).$$
This sequence $\left\{ v_k \right\}$ converges to $v^*$ **exponentially fast** given any initial guess $v_0$.
The convergence rate is determined by $\gamma$.
Iterative algorithm
- Matrix-vector form: $v_{k+1} = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right)$
- Elementwise form: $v_{k+1}(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v_k(s^\prime) \right) = \max_a q_k(s, a)$
Procedure summary
- For any $s$, given the current estimated value $v_k(s)$:
- For any $a \in \mathcal{A}(s)$, calculate $q_k(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)\, v_k(s^\prime)$
- Calculate the greedy policy $\pi_{k + 1}$ for $s$: $\pi_{k+1}(a \mid s) = 1$ if $a = a^*_k(s)$ and $0$ otherwise, where $a^*_k(s) = \arg\max_a q_k(s, a)$
- Calculate $v_{k + 1}(s) = \max_a q_k(s, a)$
This is actually the ==value iteration algorithm==
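Below is a minimal sketch of this value iteration procedure in Python on a hypothetical 2-state, 2-action MDP; the transition probabilities, rewards, and $\gamma$ are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical MDP (illustrative numbers only):
# P[s, a, s'] = p(s' | s, a),  R[s, a] = sum_r p(r | s, a) * r  (expected immediate reward)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.0, 1.0], [0.7, 0.3]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [ 1.0, 0.0],                # expected rewards in state 0 for actions 0, 1
    [-1.0, 2.0],                # expected rewards in state 1 for actions 0, 1
])
gamma = 0.9

v = np.zeros(2)                 # initial guess v_0
for k in range(1000):
    q = R + gamma * P @ v       # q_k(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
    v_new = q.max(axis=1)       # v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

q = R + gamma * P @ v           # action values at the (near-)fixed point
pi_star = q.argmax(axis=1)      # greedy policy: a*(s) = argmax_a q(s, a)
print("v* ≈", v)
print("greedy actions:", pi_star)
```

The loop body is exactly the elementwise update summarized above; the stopping threshold and iteration cap are arbitrary choices.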
3.7 BOE: Optimality
Suppose $v^*$ is the solution of the BOE, i.e., it satisfies
$$v^* = \max_\pi \left( r_\pi + \gamma P_\pi v^* \right)$$
Fixing $v^*$, we get $\pi^* = \arg\max_\pi \left( r_\pi + \gamma P_\pi v^* \right)$.
Then the equation can be written as (the $\max$ is removed):
$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$
That is: the BOE is a special Bellman equation (the one associated with $\pi^*$).
How do we prove that $\pi^*$ is optimal, i.e., that $v_{\pi^*}$ is the largest state value?
==Proof==: Policy Optimality
Theorem (Greedy Optimal Policy)
For any $s \in \mathcal{S}$, the deterministic && greedy policy
$$\pi^*(a \mid s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \ne a^*(s) \end{cases}$$
is an optimal policy solving the BOE. Here
$$a^*(s) = \arg\max_a q^*(s, a),$$
where
$$q^*(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v^*(s^\prime)$$
3.8 Analyzing optimal policies
What factors determine the optimal policy?
- Discount rate: $\gamma$
- System model: $p(r\mid s, a)$ && $p(s^\prime \mid s, a)$, which are determined by the system and generally can't be changed
- Reward design: $r$
These parameters determine the optimal policy, and it is under the assumption that they are known that the optimal policy is solved for.
The key question is therefore how to design $\gamma$ && $r$.
Some ideas:
- Optimal policy dares to take risks.
- $\gamma \uparrow$: long-sighted, focuses on future rewards; $\gamma \downarrow$: short-sighted, focuses on the near future; $\gamma = 0$: extremely short-sighted, only focuses on the immediate reward.
- What matters is not the absolute reward values but their relative values.
- ==Proof==: Optimal Policy Invariance
- Meaningless detours: $\gamma$ already acts as a punishment for taking detours; when designing rewards and deciding where penalties are needed, this effect of $\gamma$ can be relied on.
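As a small numerical illustration of the relative-reward point and the invariance referenced above, the sketch below reruns value iteration on the same made-up MDP as before after shifting every reward by a constant; the optimal values shift by $\frac{5}{1-\gamma}$ while the greedy optimal policy stays the same.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=2000):
    """Iterate v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s') ]."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        q = R + gamma * P @ v
        v = q.max(axis=1)
    return v, q.argmax(axis=1)   # optimal values and greedy policy

# Same made-up 2-state, 2-action MDP as in the value-iteration sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[ 1.0, 0.0],
              [-1.0, 2.0]])
gamma = 0.9

v1, pi1 = value_iteration(P, R, gamma)
v2, pi2 = value_iteration(P, R + 5.0, gamma)   # add a constant 5 to every reward

print(pi1, pi2)   # identical greedy policies: relative reward values are what matter
print(v2 - v1)    # every state value shifts by 5 / (1 - gamma) = 50
```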
3.9 Summary
- Bellman optimality equation:
  - Elementwise form
  - Matrix-vector form
- Existence: does this equation have a solution?
  - Yes, by the contraction mapping theorem
- Uniqueness: is the solution unique?
  - Yes, by the contraction mapping theorem
  - $v^*$ is unique, but $\pi^*$ is not necessarily unique
- Algorithm: how to solve it?
  - The iterative algorithm suggested by the contraction mapping theorem
- Optimality: why do we study this equation?
  - Its solution corresponds to the optimal state value and the optimal policy