3. Bellman optimality equation
3.1 Definition of optimal policy
- Definition:
A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \ge v_\pi(s)$ for all $s$ and for any other policy $\pi$.
Key point: the state value $v$ is the largest at every state.
Several questions:
- Does the optimal policy exist?
- Is the optimal policy unique?
- Is the optimal policy stochastic or deterministic?
- How to obtain the optimal policy?
3.2 BOE: Introduction
- Bellman optimality equation (elementwise form):
$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v(s^\prime) \right) = \max_\pi \sum_a \pi(a \mid s)\, q(s, a), \quad \forall s \in \mathcal{S}$$
| Bellman equation | Bellman optimality equation |
| --- | --- |
| $v_\pi(s)$ | $v(s)$ |
| without $\max_\pi$ | with $\max_\pi$ |
| for a given $\pi$ | to solve/get the optimal $\pi$ |
- Remarks:
- $p(r\mid s, a), p(s^\prime \mid s, a)$ are ==known==
- $v(s), v(s^\prime)$ are ==unknown== and to be calculated
- Bellman optimality equation (matrix-vector form):
$$\boldsymbol{v} = \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right)$$
- where the elements corresponding to $s$ or $s^\prime$ are $[\boldsymbol{r}_\pi]_s = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$ and $[\boldsymbol{P}_\pi]_{s, s^\prime} = \sum_a \pi(a \mid s)\, p(s^\prime \mid s, a)$
Here $\max_\pi$ is performed elementwise, i.e., $\max_\pi$ can be moved inside the matrix-vector expression and applied to each element separately.
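Concretely, this elementwise reading can be sketched as follows (same notation as above; the $s$-th component depends on $\pi$ only through $\pi(\cdot \mid s)$, so the policy at each state can be optimized independently):
$$\left[ \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right) \right]_s = \max_{\pi(\cdot \mid s)} \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v(s^\prime) \right), \quad \forall s \in \mathcal{S}$$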
==Tricky==
A maximization problem is embedded on the right-hand side, so it is not immediately obvious how to compute it.
==Elegant==
It describes the optimal policy and the optimal state value in an elegant way: the form is concise and characterizes both directly.
Several questions:
- Algorithm: how to solve it?
- Existence: does it have a solution?
- Uniqueness: is the solution unique?
- Optimality: how is it related to the optimal policy?
3.3 BOE: Maximization on the right-hand side
- For the BOE, a single equation contains two unknowns, $\pi$ and $v$, so they have to be solved together by some other means.
Consider the toy equation $x = \max_a (2x - 1 - a^2)$:
- First, ignoring $x$, the maximum over $a$ is attained at $a = 0$;
- Setting $a = 0$, the equation becomes $x = 2x - 1$, so $x = 1$.
Therefore $x = 1, a = 0$ is the final solution.
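A minimal numerical check of this two-step solution (plain Python; the grid over $a$ is just for illustration):

```python
# Two-step check of x = max_a (2x - 1 - a^2):
# step 1: for a fixed x, the inner maximum over a is attained at a = 0;
# step 2: with a = 0 the equation reduces to x = 2x - 1, i.e. x = 1.

a_grid = [i / 100 for i in range(-200, 201)]   # candidate values of a in [-2, 2]

x = 1.0                                        # solution of x = 2x - 1
best_a = max(a_grid, key=lambda a: 2 * x - 1 - a ** 2)
rhs = max(2 * x - 1 - a ** 2 for a in a_grid)

print(best_a)    # 0.0  -> the maximizer of the right-hand side
print(rhs == x)  # True -> x = 1 satisfies the equation
```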
In the same spirit, when solving the BOE we can fix $v(s^\prime)$ first and solve for $\pi$:
begin with an initial guess of $v(s^\prime)$, such as $0$,
from which the action values $q(s, a)$ are obtained (see the elementwise BOE above).
How is this maximization solved? Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given, and we look for $c^{*}_1, c^{*}_2, c^{*}_3$ maximizing $c_1 q_1 + c_2 q_2 + c_3 q_3$,
where $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \ge 0$ (the defining properties of a probability distribution).
Without loss of generality, suppose $q_3 > q_1, q_2$; then the optimal solution is $c^{*}_3 = 1$ and $c^{*}_1 = c^{*}_2 = 0$, since $c_1 q_1 + c_2 q_2 + c_3 q_3 \le (c_1 + c_2 + c_3)\, q_3 = q_3$.
Returning to the elementwise BOE,
- considering $\sum_a \pi(a \mid s) = 1$, we have
$$\max_\pi \sum_a \pi(a \mid s)\, q(s, a) = \max_{a \in \mathcal{A}(s)} q(s, a),$$
and the maximum is attained when $\pi(a^* \mid s) = 1$ and $\pi(a \mid s) = 0$ for $a \ne a^*$,
where $a^{*} = \arg \max_a q(s, a)$.
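A small numerical illustration of this step, using hypothetical action values $q(s,\cdot)$ for a single state and comparing random stochastic policies against the greedy one:

```python
import random

# Hypothetical action values q(s, a) for a single state s with 4 actions.
q = [0.3, -1.2, 2.5, 0.7]

def expected_q(pi):
    """Expected action value sum_a pi(a|s) * q(s, a) under a policy over this state."""
    return sum(p * qa for p, qa in zip(pi, q))

# Greedy (deterministic) policy: all probability on a* = argmax_a q(s, a).
a_star = max(range(len(q)), key=lambda a: q[a])
greedy = [1.0 if a == a_star else 0.0 for a in range(len(q))]

# No random stochastic policy does better than the greedy one.
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in q]
    pi = [x / sum(w) for x in w]          # normalize to a probability distribution
    assert expected_q(pi) <= expected_q(greedy) + 1e-12

print(a_star, expected_q(greedy))         # 2 2.5 == max_a q(s, a)
```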
3.4 Rewrite as $v = f(v)$
Considering the matrix-vector form of the BOE, let:
$$f(\boldsymbol{v}) := \max_\pi \left( \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v} \right)$$
Then the BOE becomes:
$$\boldsymbol{v} = f(\boldsymbol{v})$$
Here $f(\boldsymbol{v})$ is a vector whose $s$-th element is $[f(\boldsymbol{v})]_s = \max_\pi \sum_a \pi(a \mid s)\, q(s, a)$.
3.5 Contraction mapping theorem
Concepts
- ==Fixed point==: $x \in X$ is a fixed point of $f: X \rightarrow X$ if $f(x) = x$
- ==Contraction mapping== / contractive function: $f$ is a contraction mapping if $||f(x_1) - f(x_2)|| \le \gamma ||x_1 - x_2||$ for all $x_1, x_2$,
where $\gamma \in (0,1)$.
- $\gamma$ must be strictly less than $1$, so that limits such as $\gamma^k \rightarrow 0$ as $k \rightarrow \infty$ hold.
- Here $||\cdot||$ can be any vector norm
Intuitively, the distance between $f(x_1)$ and $f(x_2)$ is smaller than the distance between $x_1$ and $x_2$, so repeatedly applying $f$ pulls points closer together, which is what drives convergence.
==Proof== && ==Contraction Mapping Theorem==
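A minimal sketch of the fixed-point iteration that this theorem guarantees, using a simple made-up contraction $f(x) = 0.5x + 1$ (contraction factor $\gamma = 0.5$, unique fixed point $x^* = 2$):

```python
# Fixed-point iteration x_{k+1} = f(x_k) for a contraction mapping.
# Here f(x) = 0.5 * x + 1: |f(x1) - f(x2)| = 0.5 * |x1 - x2|, so gamma = 0.5 < 1,
# and the unique fixed point is x* = 2 (solve x = 0.5x + 1).

def f(x):
    return 0.5 * x + 1

x = -10.0                      # any initial guess works
for k in range(50):
    x = f(x)

print(x)                       # -> 2.0 (up to floating point), the unique fixed point
print(abs(x - 2.0) < 1e-12)    # True: error shrinks by a factor gamma each step (~ 12 * 0.5**k)
```

The BOE iteration in the next subsection is this same scheme with $f(\boldsymbol{v}) = \max_\pi(\boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v})$.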
3.6 BOE: Solution
==Proof==: the BOE map $f$ is a contraction mapping: $||f(v_1) - f(v_2)|| \le \gamma || v_1 - v_2 ||$
Therefore, by the contraction mapping theorem, we obtain the following conclusion:
Theorem (Existence, Uniqueness, and Algorithm)
There always **exists** a solution $v^*$ and the solution is **unique**. The solution can be obtained iteratively by
$$v_{k+1} = f(v_k) = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right).$$
This sequence $\left\{ v_k \right\}$ converges to $v^*$ **exponentially fast** given any initial guess $v_0$.
The convergence rate is determined by $\gamma$.
Iterative algorithm
- Matrix-vector form: $v_{k+1} = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right)$
- Elementwise form: $v_{k+1}(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v_k(s^\prime) \right) = \max_a q_k(s, a)$
Procedure summary
- For any $s$, given the current estimated value $v_k(s)$:
- For any $a \in \mathcal{A}(s)$, calculate $q_k(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)\, v_k(s^\prime)$
- Calculate the greedy policy $\pi_{k + 1}$ for $s$: $\pi_{k+1}(a \mid s) = 1$ if $a = a^*_k(s)$ and $0$ otherwise, where $a^*_k(s) = \arg\max_a q_k(s, a)$
- Calculate $v_{k + 1}(s) = \max_a q_k(s, a)$
This is actually the ==value iteration algorithm==
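Below is a minimal sketch of this value iteration procedure in Python on a hypothetical 2-state, 2-action MDP; the transition probabilities, rewards, and $\gamma$ are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical MDP (illustrative numbers only):
# P[s, a, s'] = p(s' | s, a),  R[s, a] = sum_r p(r | s, a) * r  (expected immediate reward)
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.0, 1.0], [0.7, 0.3]],   # transitions from state 1 under actions 0, 1
])
R = np.array([
    [ 1.0, 0.0],                # expected rewards in state 0 for actions 0, 1
    [-1.0, 2.0],                # expected rewards in state 1 for actions 0, 1
])
gamma = 0.9

v = np.zeros(2)                 # initial guess v_0
for k in range(1000):
    q = R + gamma * P @ v       # q_k(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
    v_new = q.max(axis=1)       # v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

q = R + gamma * P @ v           # action values at the (near-)fixed point
pi_star = q.argmax(axis=1)      # greedy policy: a*(s) = argmax_a q(s, a)
print("v* ≈", v)
print("greedy actions:", pi_star)
```

The loop body is exactly the elementwise update summarized above; the stopping threshold and iteration cap are arbitrary choices.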
3.7 BOE: Optimality
Suppose $v^*$ is the solution of the BOE, i.e., it satisfies
$$v^* = \max_\pi \left( r_\pi + \gamma P_\pi v^* \right)$$
Fixing $v^*$, we get $\pi^* = \arg\max_\pi \left( r_\pi + \gamma P_\pi v^* \right)$.
Then the equation can be written as (the $\max$ is removed):
$$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$$
That is: the BOE is a special Bellman equation (the one associated with $\pi^*$).
How do we prove that $\pi^*$ is optimal, i.e., that $v_{\pi^*}$ is the largest state value?
==Proof==: Policy Optimality
Theorem (Greedy Optimal Policy)
For any $s \in \mathcal{S}$, the deterministic && greedy policy
$$\pi^*(a \mid s) = \begin{cases} 1, & a = a^*(s) \\ 0, & a \ne a^*(s) \end{cases}$$
is an optimal policy solving the BOE. Here
$$a^*(s) = \arg\max_a q^*(s, a),$$
where
$$q^*(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v^*(s^\prime)$$
3.8 Analyzing optimal policies
What factors determine the optimal policy?
- Discount rate: $\gamma$
- System model: $p(r\mid s, a)$ && $p(s^\prime \mid s, a)$, which are determined by the system and generally can't be changed
- Reward design: $r$
These parameters determine the optimal policy, and it is under the assumption that they are known that the optimal policy is solved for.
The key question is therefore how to design $\gamma$ && $r$.
Some ideas:
- Optimal policy dares to take risks.
- $\gamma \uparrow$: long-sighted, focuses on future rewards; $\gamma \downarrow$: short-sighted, focuses on the near future; $\gamma = 0$: extremely short-sighted, only focuses on the immediate reward.
- What matters is not the absolute reward values but their relative values.
- ==Proof==: Optimal Policy Invariance
- Meaningless detours: $\gamma$ already acts as a punishment for taking detours; when designing rewards and deciding where penalties are needed, this effect of $\gamma$ can be relied on.
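As a small numerical illustration of the relative-reward point and the invariance referenced above, the sketch below reruns value iteration on the same made-up MDP as before after shifting every reward by a constant; the optimal values shift by $\frac{5}{1-\gamma}$ while the greedy optimal policy stays the same.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=2000):
    """Iterate v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) v_k(s') ]."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        q = R + gamma * P @ v
        v = q.max(axis=1)
    return v, q.argmax(axis=1)   # optimal values and greedy policy

# Same made-up 2-state, 2-action MDP as in the value-iteration sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[ 1.0, 0.0],
              [-1.0, 2.0]])
gamma = 0.9

v1, pi1 = value_iteration(P, R, gamma)
v2, pi2 = value_iteration(P, R + 5.0, gamma)   # add a constant 5 to every reward

print(pi1, pi2)   # identical greedy policies: relative reward values are what matter
print(v2 - v1)    # every state value shifts by 5 / (1 - gamma) = 50
```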
3.9 Summary
- Bellman optimality equation:
  - Elementwise form
  - Matrix-vector form
- Existence: does this equation have a solution?
  - Yes, by the contraction mapping theorem
- Uniqueness: is the solution unique?
  - Yes, by the contraction mapping theorem
  - $v^*$ is unique, but $\pi^*$ is not necessarily unique
- Algorithm: how to solve it?
  - The iterative algorithm suggested by the contraction mapping theorem
- Optimality: why do we study this equation?
  - Its solution corresponds to the optimal state value and the optimal policy