3. Bellman optimality equation

3.1 Definition of optimal policy

  • Definition:

A policy $\pi^{*}$ is optimal if $v_{\pi^{*}}(s) \ge v_\pi(s)$ for all $s$ and for any other policy $\pi$.

Key point: under the optimal policy, the state value $v$ of every state is the largest.

Several questions:

  1. Does the optimal policy exist?
  2. Is the optimal policy unique?
  3. Is the optimal policy stochastic or deterministic?
  4. How do we obtain the optimal policy?

3.2 BOE: Introduction

  • Bellman optimality equation (elementwise form):

$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v(s^\prime) \right), \quad \forall s \tag{1}$$

  • Comparison with the Bellman equation:

| Bellman equation | Bellman optimality equation |
| --- | --- |
| $v_\pi(s)$ | $v(s)$ |
| without $\max_\pi$ | with $\max_\pi$ |
| for a given $\pi$ | to solve/get the optimal $\pi$ |
  • Remarks:
    1. $p(r\mid s, a), p(s^\prime \mid s, a)$ are ==known==
    2. $v(s), v(s^\prime)$ are ==unknown== and to be calculated
  • Bellman optimality equation (matrix-vector form):

$$\boldsymbol{v} = \max_\pi \left( \boldsymbol{r}_\pi + \gamma P_\pi \boldsymbol{v} \right) \tag{2}$$

  • where the elements corresponding to $s$ or $s^\prime$ are

$$[\boldsymbol{r}_\pi]_s = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad [P_\pi]_{s, s^\prime} = \sum_a \pi(a \mid s)\, p(s^\prime \mid s, a)$$

Here $\max_\pi$ is performed elementwise.

That is, $\max_\pi$ can be moved inside the matrix-vector expression and applied to each element, taking the maximum over $\pi$ element by element.

==Tricky==

The right-hand side embeds a maximization problem, so it is not immediately obvious how to compute it.

==Elegant==

It describes the optimal policy and the optimal state value in a concise, elegant way.

Several questions:

  • Algorithm: how to solve?
  • Existence: does it have a solution?
  • Uniqueness: is the solution unique?
  • Optimality: how is it related to the optimal policy?

3.3 BOE: Maximazation on the right-hand side

  • In the BOE, one equation contains two unknowns, $\pi$ and $v$, so it cannot be solved directly; additional techniques are needed.

Consider the example $x = \max_a (2x - 1 - a^2)$:

  • First, ignoring $x$, the maximum over $a$ is attained at $a = 0$;
  • Setting $a = 0$, the equation becomes $x = 2x - 1$, so $x = 1$.

Therefore $x = 1, a = 0$ is the final solution.
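
Below is a minimal Python sketch of this two-step reasoning (fix $x$, maximize over $a$, then solve the remaining scalar equation); the grid of candidate $a$ values is an illustrative assumption, not part of the lecture.

```python
import numpy as np

# Step 1: for any fixed x, maximize 2x - 1 - a^2 over a. The x-part does not
# depend on a, so the maximizer is a = 0 (checked here on a coarse grid).
a_grid = np.linspace(-2.0, 2.0, 401)
a_star = a_grid[np.argmax(-(a_grid ** 2))]
print(a_star)                                # ~ 0.0 (up to floating-point error)

# Step 2: with a = 0 the equation reduces to x = 2x - 1, so x = 1.
# Check that (x, a) = (1, 0) satisfies x = max_a (2x - 1 - a^2):
x_star = 1.0
print(np.max(2 * x_star - 1 - a_grid ** 2))  # ~ 1.0, equal to x_star
```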

  • Therefore, when solving the BOE, we can fix $v(s^\prime)$ first and then solve for $\pi$:

    • begin with an initial guess of $v(s^\prime)$, such as $0$

    • then compute the values $q(s, a)$ (see formula (1))

How do we solve the inner maximization? Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given; find $c^{*}_1, c^{*}_2, c^{*}_3$ solving

$$\max_{c_1, c_2, c_3} \; c_1 q_1 + c_2 q_2 + c_3 q_3,$$

where $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \ge 0$ (the defining properties of a probability distribution).

Without loss of generality, suppose $q_3 > q_1, q_2$; then the optimal solution is $c^{*}_3 = 1$ and $c^{*}_1 = c^{*}_2 = 0$, because

$$c_1 q_1 + c_2 q_2 + c_3 q_3 \le c_1 q_3 + c_2 q_3 + c_3 q_3 = q_3.$$
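
As a quick numeric illustration (the values of $q_1, q_2, q_3$ below are made up), no probability weighting $(c_1, c_2, c_3)$ can beat putting all mass on the largest $q$:

```python
import numpy as np

q = np.array([1.0, 2.0, 5.0])                # arbitrary example values
rng = np.random.default_rng(0)
c = rng.dirichlet(np.ones(3), size=10_000)   # random probability vectors (rows sum to 1)
print((c @ q).max(), q.max())                # every weighted sum is <= max(q) = 5.0
```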

  • Now, return to formula (1): for each $s$, the inner sum $\sum_a \pi(a \mid s) q(s, a)$ is a weighted average of the values $q(s, a)$.

    • Considering $\sum_a \pi(a \mid s) = 1$, we have

$$\max_\pi \sum_a \pi(a \mid s) q(s, a) = \max_a q(s, a)$$

  • The maximum is attained when

$$\pi(a \mid s) = \begin{cases} 1, & a = a^{*} \\ 0, & a \neq a^{*} \end{cases}$$

where $a^{*} = \arg \max_a q(s, a)$.

3.4 Rewrite as $v = f(v)$

Consider formula (2) and let

$$f(\boldsymbol{v}) := \max_\pi \left( \boldsymbol{r}_\pi + \gamma P_\pi \boldsymbol{v} \right)$$

Then the BOE becomes

$$\boldsymbol{v} = f(\boldsymbol{v})$$

Here $f(\boldsymbol{v})$ is a vector whose $s$-th element is

$$[f(\boldsymbol{v})]_s = \max_\pi \sum_a \pi(a \mid s) q(s, a), \quad s \in \mathcal{S}$$

3.5 Contraction mapping theorem

Concepts

  • ==Fixed point==: $x \in X$ is a fixed point of $f: X \rightarrow X$ if $f(x) = x$
  • Contraction mapping / contractive function: $f$ is a contraction mapping if

$$||f(x_1) - f(x_2)|| \le \gamma\, ||x_1 - x_2||,$$

where $\gamma \in (0,1)$.

  • $\gamma$ must be strictly less than $1$, so that limits such as $\gamma^k \rightarrow 0$ as $k \rightarrow \infty$ hold.
  • Here $||\cdot||$ can be any vector norm

Intuitively, the distance between $f(x_1)$ and $f(x_2)$ is smaller than the distance between $x_1$ and $x_2$, so repeatedly applying $f$ pulls points closer together and the iteration converges.
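
A tiny sketch of this intuition in Python (the map $f$ below is invented for illustration): $f(x) = 0.5x + 3$ is a contraction with $\gamma = 0.5$, its unique fixed point is $x^{*} = 6$, and iterating $x_{k+1} = f(x_k)$ converges to it from any starting point.

```python
def f(x: float) -> float:
    # |f(x1) - f(x2)| = 0.5 * |x1 - x2|, so f is a contraction with gamma = 0.5
    return 0.5 * x + 3.0

x = 100.0              # arbitrary initial guess
for _ in range(50):
    x = f(x)           # fixed-point iteration x_{k+1} = f(x_k)
print(x)               # ~ 6.0, the unique fixed point f(6) = 6
```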

==Proof== && ==Contraction Mapping Theorem==: for any contraction mapping $f$, a fixed point $x^{*}$ satisfying $f(x^{*}) = x^{*}$ exists, it is unique, and the iteration $x_{k+1} = f(x_k)$ converges to $x^{*}$ exponentially fast from any initial value $x_0$.

3.6 BOE: Solution

==Proof==: the right-hand side of the BOE, $f(\boldsymbol{v})$, is a contraction mapping: $||f(v_1) - f(v_2)|| \le \gamma || v_1 - v_2 ||$, where $\gamma$ is the discount rate.

Therefore, by the contraction mapping theorem, the following conclusion holds:

Theorem (Existence, Uniqueness, and Algorithm)

There always EXISTS a solution $v^{*}$ and the solution is UNIQUE. The solution can be computed iteratively by

$$v_{k+1} = f(v_k) = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right), \quad k = 0, 1, 2, \ldots$$

  • This sequence $\left\{ v_k \right\}$ converges to $v^{*}$ exponentially fast given any initial guess $v_0$.

  • The convergence rate is determined by $\gamma$; a standard bound is given below.
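
To make "exponentially fast" concrete, the contraction property alone already gives the following bound (nothing beyond $||f(v_1) - f(v_2)|| \le \gamma ||v_1 - v_2||$ is used):

$$||v_k - v^{*}|| = ||f(v_{k-1}) - f(v^{*})|| \le \gamma\, ||v_{k-1} - v^{*}|| \le \cdots \le \gamma^k\, ||v_0 - v^{*}||,$$

so the error decays geometrically at rate $\gamma$; a smaller $\gamma$ means faster convergence.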

Iterative algorithm

  • Matrix-vector form: $v_{k+1} = \max_\pi \left( r_\pi + \gamma P_\pi v_k \right)$
  • Elementwise form: $v_{k+1}(s) = \max_a \left( \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v_k(s^\prime) \right), \quad \forall s$

Procedure summary

  1. For any $s$, start from the current estimate $v_k(s)$
  2. For any $a \in \mathcal{A}(s)$, calculate $q_k(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a) v_k(s^\prime)$
  3. Calculate the greedy policy $\pi_{k + 1}$ for $s$ as in formula (7), i.e. $\pi_{k+1}(a \mid s) = 1$ if $a = \arg\max_{a^\prime} q_k(s, a^\prime)$ and $0$ otherwise
  4. Calculate $v_{k + 1}(s) = \max_a q_k(s, a)$

This is actually the ==value iteration algorithm==
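
Below is a self-contained sketch of this value iteration procedure in Python. The tiny MDP (its $P$, $R$, and $\gamma$) is invented purely for illustration; only the update rule mirrors the procedure above, and the `break` condition is a standard stopping rule rather than part of the lecture's pseudocode.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a, s'] = p(s' | s, a); R[s, a] = sum_r p(r | s, a) * r (expected immediate reward)
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 0] = 1.0; P[0, 1, 1] = 1.0        # state 0: action 0 stays, action 1 moves to 1
P[1, 0, 0] = 1.0; P[1, 1, 2] = 1.0        # state 1: action 0 goes back to 0, action 1 to 2
P[2, :, 2] = 1.0                          # state 2 is absorbing
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

v = np.zeros(n_states)                    # any initial guess v_0 works
for _ in range(1000):
    q = R + gamma * P @ v                 # q_k(s, a), shape (n_states, n_actions)
    v_new = q.max(axis=1)                 # v_{k+1}(s) = max_a q_k(s, a)
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop once the update is tiny
        v = v_new
        break
    v = v_new

pi = q.argmax(axis=1)                     # greedy policy: pi_{k+1}(s) = argmax_a q_k(s, a)
print(v)                                  # approximately the optimal state values v*
print(pi)                                 # the corresponding greedy policy
```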

3.7 BOE: Optimality

Suppose $v^{*}$ is the solution of the BOE, i.e. it satisfies

$$v^{*} = \max_\pi \left( r_\pi + \gamma P_\pi v^{*} \right)$$

Fixing $v^{*}$, we obtain the corresponding policy

$$\pi^{*} = \arg\max_\pi \left( r_\pi + \gamma P_\pi v^{*} \right)$$

Then the BOE can be written without the $\max$ as

$$v^{*} = r_{\pi^{*}} + \gamma P_{\pi^{*}} v^{*}$$

That is, the BOE is a special Bellman equation, namely the Bellman equation of the policy $\pi^{*}$.

How do we prove that $\pi^{*}$ is optimal, i.e. that $v_{\pi^{*}}$ is the largest state value?

==Proof==: Policy Optimality

Theorem (Greedy Optimal Policy)

For any $s \in \mathcal{S}$, the deterministic && greedy policy

$$\pi^{*}(a \mid s) = \begin{cases} 1, & a = a^{*}(s) \\ 0, & a \neq a^{*}(s) \end{cases}$$

is an optimal policy solving the BOE. Here

$$a^{*}(s) = \arg\max_a q^{*}(s, a),$$

where

$$q^{*}(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a)\, v^{*}(s^\prime)$$
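
Continuing the value-iteration sketch from Section 3.6 (reusing its illustrative `P`, `R`, `gamma`, and the converged `v` as an approximation of $v^{*}$), the theorem's greedy policy is obtained with a single argmax over $q^{*}(s, a)$:

```python
# q*(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v*(s')
q_star = R + gamma * P @ v
pi_star = q_star.argmax(axis=1)                         # a*(s) = argmax_a q*(s, a)
assert np.allclose(q_star.max(axis=1), v, atol=1e-6)    # v* = max_a q*(s, a): the BOE holds
print(pi_star)
```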

3.8 Analyzing optimal policies

What factors determine the optimal policy?

  • Discount rate: $\gamma$
  • System model: $p(r\mid s, a)$ && $p(s^\prime \mid s, a)$, which are determined by the system and generally cannot be changed

  • Reward design: $r$

These quantities determine the optimal policy, and it is under the assumption that they are known that we solve for the optimal policy.

The key question is how to design $\gamma$ && $r$.

Some ideas:

  • Optimal policy dares to take risks.
  • $\gamma \uparrow$: long-sighted, focuses on future rewards; $\gamma \downarrow$: short-sighted, focuses on the near future; $\gamma = 0$: extremely short-sighted, focuses only on the immediate reward.

  • What matters is not the absolute reward values but their relative values (see the statement below).

    • ==Proof==: Optimal Policy Invariance
  • Meaningless detour: $\gamma$ itself already punishes taking detours; when designing rewards and deciding where to impose penalties, $\gamma$ can be leveraged for this.
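
One common way to state the invariance result referenced above (written from memory of the standard result, so treat the exact constants as an assumption rather than a quote from the lecture): if every immediate reward is transformed affinely, the optimal policy does not change and the optimal values shift accordingly,

$$r \;\rightarrow\; a r + b \;\; (a > 0) \quad\Longrightarrow\quad v^{*} \;\rightarrow\; a\, v^{*} + \frac{b}{1-\gamma}\, \mathbf{1}, \qquad \pi^{*} \text{ unchanged},$$

which is exactly why only relative reward values matter.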

3.9 Summary

  • Bellman Optimality equation:
    • Elementwise form
    • Matrix-vector form
  • Existence: does this equation have a solution?
    • Yes, by the contraction mapping theorem
    • $v^{*}$ is unique, but $\pi^{*}$ is not necessarily unique
  • Uniqueness
    • Yes, by the contraction mapping theorem
  • Algorithm: How to solve?
    • Iterative algorithm suggested by the contraction mapping theorem
  • Optimality: why we study this equation
    • Its solution corresponds to the optimal state value and optimal policy