4. Value Iteration and Policy Iteration

4.0 Preface

In the BOE, the contraction mapping theorem suggests an iterative algorithm:

  • which can eventually find the optimal state value and an optimal policy
  • which is called value iteration

4.1 Value iteration algorithm

The algorithm can be decomposed into two steps:

  • Step 1: policy update, which solves $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k)$, where $v_k$ is given
  • Step 2: value update, $v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_k$

‼️: $v_k$ is not a state value, because $v_k$ is not guaranteed to satisfy a Bellman equation.

  • Matrix-vector form is useful for theoretical analysis
  • Elementwise form is useful for implementation

The elementwise form is used below to show how to implement the algorithm.

Step 1: Policy update

  • Matrix-vector form: $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k)$
  • Elementwise form: $\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k(s^\prime) \right] = \arg \max_\pi \sum_a \pi(a \mid s) q_k(s, a)$

where the model quantities $\sum_r p(r \mid s, a) r$ and $p(s^\prime \mid s, a)$ are known, so we only need to substitute $v_k(s^\prime)$ to obtain $q_k(s, a)$.

  • Optimal policy:

$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise, where $a^*_k(s) = \arg \max_a q_k(s, a)$

Since only the action with the largest $q$-value is selected, $\pi_{k + 1}$ is a greedy policy.

Step 2: Value update

  • Matrix-vector form: $v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_k$
  • Elementwise form: $v_{k + 1}(s) = \sum_a \pi_{k + 1}(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k(s^\prime) \right] = \sum_a \pi_{k + 1}(a \mid s) q_k(s, a)$

where $\pi_{k + 1}(a \mid s)$ was obtained in Step 1 and is a greedy policy, so the expression above simplifies to $v_{k + 1}(s) = \max_a q_k(s, a)$.
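To make the elementwise computation concrete, here is a minimal numpy sketch of one sweep for a single state, using a small hypothetical model (two actions, two possible rewards, two successor states; all numbers are made-up illustration values):

```python
import numpy as np

gamma = 0.9
rewards = np.array([0.0, 1.0])          # possible rewards r (hypothetical)
p_r = np.array([[0.9, 0.1],             # p(r | s, a): one row per action a
                [0.2, 0.8]])
p_s = np.array([[0.7, 0.3],             # p(s' | s, a): one row per action a
                [0.1, 0.9]])
v_k = np.array([0.5, 2.0])              # current estimate v_k(s') for all states

# q_k(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_k(s')
q_k = p_r @ rewards + gamma * p_s @ v_k

a_star = np.argmax(q_k)                 # greedy action a*_k(s)
v_next = q_k[a_star]                    # value update: v_{k+1}(s) = max_a q_k(s, a)
print(q_k, a_star, v_next)
```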

Procedure summary

  • Begin with an initial guess $v_0(s)$
  • In Step 1 (policy update), obtain the greedy policy $\pi_{k + 1}$
  • In Step 2 (value update), obtain the new value $v_{k + 1}(s) = \max_a q_k(s, a)$

Pseudocode

Pseudocode: Value iteration algorithm
Initialization: 1. Initial guess $v_0$. 2. The probabilistic model $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ is known for every pair $(s, a)$.
Aim: Search for the optimal state value and an optimal policy satisfying the Bellman optimality equation.
While $v_k$ has not converged (i.e., while $\| v_k - v_{k - 1} \|$ is still greater than a predefined threshold), for the $k$th iteration, do
For every state $s \in \mathcal{S}$, do
For every action $a \in \mathcal{A}(s)$, do
Q-value: $q_k(s,a) = \sum_r p(r \mid s, a)r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)v_k(s^\prime)$
Maximizing action: $a^*_k(s) = \arg \max_a q_k(s, a)$
Policy update: $\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise
Value update: $v_{k + 1}(s) = \max_a q_k(s, a)$
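As a rough illustration of the pseudocode (a sketch, not a definitive implementation), here is a minimal Python version of value iteration for a tabular MDP. It assumes the model has already been summarized into hypothetical arrays `P[s, a, s'] = p(s' | s, a)` and `R[s, a]`, the expected immediate reward $\sum_r p(r \mid s, a) r$:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration on a tabular MDP.

    P: (S, A, S) array, P[s, a, s2] = p(s2 | s, a)
    R: (S, A) array,    R[s, a]     = expected immediate reward sum_r p(r|s,a) r
    Returns the converged value vector and the greedy (deterministic) policy.
    """
    S, A = R.shape
    v = np.zeros(S)                          # initial guess v_0
    while True:
        q = R + gamma * P @ v                # q_k(s, a), shape (S, A)
        v_new = q.max(axis=1)                # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:  # ||v_{k+1} - v_k|| below the threshold
            return v_new, q.argmax(axis=1)   # policy update: greedy action a*_k(s)
        v = v_new
```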

4.2 Policy iteration algorithm

  • Given a random initial policy $\pi_0$
  • Step 1: policy evaluation (PE)

$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$

to calculate the state value of the given policy $\pi_k$.

Note that $v_{\pi_k}$ is a state value function.

  • Step 2: policy improvement (PI)

$\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$

The maximization is componentwise (performed state by state).

Sequence

$\pi_0 \stackrel{P E}{\longrightarrow} v_{\pi_0} \stackrel{P I}{\longrightarrow} \pi_1 \stackrel{P E}{\longrightarrow} v_{\pi_1} \stackrel{P I}{\longrightarrow} \pi_2 \stackrel{P E}{\longrightarrow} v_{\pi_2} \stackrel{P I}{\longrightarrow} \cdots$

Question

  1. In PE, how do we get the state value $v_{\pi_k}$ by solving the Bellman equation?
  2. In PI, why is the new policy $\pi_{k + 1}$ better than $\pi_k$?
  3. Why can such an iterative algorithm finally reach an optimal policy?
  4. What is the relationship between policy iteration and value iteration?

Q1

Solve the Bellman equation with either the closed-form solution or the iterative solution (finding the fixed point); a small sketch of both is given below.

  • Policy iteration is an iterative algorithm with another iterative algorithm embedded in the policy evaluation step
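A minimal sketch of the two options, assuming $r_\pi$ and $P_\pi$ for the given policy are available as hypothetical numpy arrays `r_pi` and `P_pi`:

```python
import numpy as np

def evaluate_policy_closed_form(r_pi, P_pi, gamma=0.9):
    """Closed-form solution of v_pi = r_pi + gamma * P_pi v_pi,
    i.e., v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def evaluate_policy_iterative(r_pi, P_pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration v^{(j+1)} = r_pi + gamma * P_pi v^{(j)}."""
    v = np.zeros_like(r_pi)
    while True:
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```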

Q2

Lemma (Policy improvement) ==Proof.==

If $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$, then $v_{\pi_{k + 1}} \ge v_{\pi_k}$ for any $k$.

Q3

Every iteration improves the policy, so we have

$v_{\pi_0} \le v_{\pi_1} \le v_{\pi_2} \le \cdots \le v^*$

That is, $v_{\pi_k}$ keeps increasing and eventually converges.

It remains to show that it converges to $v^*$.

Theorem (Convergence of Policy Iteration) ==Proof.==

Step 1: Policy evaluation

  • Matrix-vector form: $v^{(j + 1)}_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v^{(j)}_{\pi_k}$, $j = 0, 1, 2, \dots$
  • Elementwise form: $v^{(j + 1)}_{\pi_k}(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v^{(j)}_{\pi_k} (s^\prime) \right]$

Stop the iteration when one of the following conditions is met:

  1. $j \to \infty$
  2. $j$ is sufficiently large
  3. $\| v^{(j + 1)}_{\pi_k} - v^{(j)}_{\pi_k} \|$ is sufficiently small

Step 2: Policy improvement

  • Matrix-vector form: $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$
  • Elementwise form: $\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$

where $q_{\pi_k}(s, a)$ is the action value under policy $\pi_k$. Let

$a^*_k(s) = \arg \max_a q_{\pi_k}(s, a)$

then the greedy policy is

$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise

Pseudocode

Pseudocode: Policy iteration algorithm
Initialization: 1. Initial guess $\pi_0$. 2. The probabilistic model $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ is known for every pair $(s, a)$.
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the $k$th iteration, do
1. Policy evaluation:
Initialization: an arbitrary initial guess $v^{(0)}_{\pi_k}$
While $v^{(j)}_{\pi_k}$ has not converged (i.e., while $\| v^{(j + 1)}_{\pi_k} - v^{(j)}_{\pi_k} \|$ is greater than a threshold), for the $j$th iteration, do
For every state $s \in \mathcal{S}$, do
$v^{(j + 1)}_{\pi_k}(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v^{(j)}_{\pi_k} (s^\prime) \right]$
2. Policy improvement: (similar to Step 1 of value iteration)
For every state $s \in \mathcal{S}$, do
For every action $a \in \mathcal{A}(s)$, do
$q_{\pi_k}(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_{\pi_k} (s^\prime)$
$a^*_k(s) = \arg \max_a q_{\pi_k}(s, a)$
$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise
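As with value iteration, here is a hedged Python sketch of the pseudocode, again assuming the model is summarized by hypothetical arrays `P[s, a, s'] = p(s' | s, a)` and `R[s, a]` (expected immediate reward):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    """Policy iteration on a tabular MDP with model arrays P (S, A, S) and R (S, A)."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial policy pi_0
    while True:
        # Step 1: policy evaluation -- iteratively solve v_pi = r_pi + gamma * P_pi v_pi
        r_pi = R[np.arange(S), policy]           # r_pi(s) under the current policy
        P_pi = P[np.arange(S), policy]           # P_pi[s, s'] = p(s' | s, pi(s))
        v = np.zeros(S)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                v = v_new
                break
            v = v_new
        # Step 2: policy improvement -- greedy with respect to q_{pi_k}(s, a)
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # the policy no longer changes
            return v, policy
        policy = new_policy
```

The sketch uses a deterministic (greedy) policy representation, which matches the pseudocode's $\pi_{k + 1}(a \mid s) \in \{0, 1\}$ form.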

4.3 Truncated policy iteration algorithm

Comparison

  • Value iteration

    • Initial $v_0$
    • Policy update(PU): $\pi_{k + 1} = \arg \max_\pi(r_\pi + \gamma P_\pi v_k)$
    • Value update(VU): $v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}}v_k$
  • Policy iteration

    • Initial $\pi_0$
    • Policy evaluation(PE): $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$
    • Policy improvement(PI): $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$

Sequence

Value iteration: $u_0 \stackrel{PU}{\longrightarrow} \pi^\prime_1 \stackrel{VU}{\longrightarrow} u_1 \stackrel{PU}{\longrightarrow} \pi^\prime_2 \stackrel{VU}{\longrightarrow} u_2 \stackrel{PU}{\longrightarrow} \cdots$

Policy iteration: $\pi_0 \stackrel{P E}{\longrightarrow} v_{\pi_0} \stackrel{P I}{\longrightarrow} \pi_1 \stackrel{P E}{\longrightarrow} v_{\pi_1} \stackrel{P I}{\longrightarrow} \pi_2 \stackrel{P E}{\longrightarrow} v_{\pi_2} \stackrel{P I}{\longrightarrow} \cdots$

Tabular

| Step | Value Iteration | Policy Iteration | Comments |
| --- | --- | --- | --- |
| 1. Policy | N/A | $\pi_0$ | |
| 2. Value | $v_0 := v_{\pi_0}$ | $v_{\pi_0} = r_{\pi_0} + \gamma P_{\pi_0} v_{\pi_0}$ | |
| 3. Policy | $\pi_1 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_0)$ | $\pi_1 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_{\pi_0})$ | The two algorithms yield the same policy here |
| 4. Value | $v_1 = r_{\pi_1} + \gamma P_{\pi_1} v_0$ | $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ | $v_{\pi_1} \ge v_1$ since $v_{\pi_1} \ge v_{\pi_0}$ |
| 5. Policy | $\pi_2 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_1)$ | $\pi^\prime_2 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_{\pi_1})$ | |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
  • Note the different subscripts used by the two algorithms
  • The first three steps are identical
  • In step 4 (value), value iteration computes only one iteration (a single update), whereas policy iteration actually solves a Bellman equation (in theory, infinitely many iterations); the value-iteration step is in fact the first iteration of the corresponding policy-iteration step

Iteration

  • For $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ in policy iteration, the iterative solution process is as follows
  • Value iteration computes only one iteration (too coarse per outer step)
  • Policy iteration computes $\infty$ iterations (impossible in practice)
  • Truncated policy iteration computes a finite number of iterations ($j$); the remaining iterations from $j$ to $\infty$ are truncated

Pseudocode

Pseudocode: Truncated policy iteration algorithm
Initialization: 1. Initial guess $\pi_0$. 2. The probabilistic model $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ is known for every pair $(s, a)$.
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the $k$th iteration, do
1. Policy evaluation:
Initialization: select the initial guess as $v^{(0)}_k = v_{k - 1}$. Set the truncation length to $j_{\text{truncate}}$
While $j < j_{\text{truncate}}$, do
For every state $s \in \mathcal{S}$, do
$v^{(j + 1)}_k(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s, a)r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)v^{(j)}_k (s^\prime) \right]$
Set $v_k = v^{(j_{\text{truncate}})}_k$
2. Policy improvement: (the same as Step 1 of value iteration / Step 2 of policy iteration)
For every state $s \in \mathcal{S}$, do
For every action $a \in \mathcal{A}(s)$, do
$q_k(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k (s^\prime)$
$a^*_k(s) = \arg \max_a q_k(s, a)$
$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise
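A minimal sketch of truncated policy iteration under the same hypothetical model arrays `P` and `R`; the only new ingredients are the `j_truncate` parameter and the warm start $v^{(0)}_k = v_{k - 1}$:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, tol=1e-6):
    """Truncated policy iteration: policy evaluation runs only j_truncate sweeps,
    warm-started from the previous outer iteration's value vector."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)              # initial guess pi_0
    v = np.zeros(S)                              # v_0, reused as the warm start v_{k-1}
    while True:
        v_old = v.copy()
        # Truncated policy evaluation: j_truncate sweeps of v <- r_pi + gamma * P_pi v
        r_pi = R[np.arange(S), policy]
        P_pi = P[np.arange(S), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to q_k(s, a)
        q = R + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_old)) < tol:      # stop when v_k stops changing
            return v, policy
```

Setting `j_truncate = 1` makes each outer step behave like a value-iteration step, while letting the inner loop run until convergence recovers policy iteration, matching the comparison above.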

Convergence

Proposition (Value Improvement) ==Proof.==

The proofs proceed in the order $VI \to PI \to \text{truncated } PI$; all of them build on the convergence of $VI$.

Selecting $j_{\text{truncate}}$

  • The larger $j$ is, the faster the value estimate converges
  • As $j$ grows, the benefit of further increasing $j$ diminishes
  • In practice, $j$ is usually chosen to be fairly small

4.4 Summary

  • Value iteration
  • Policy iteration
  • Truncated policy iteration

Both value iteration and policy iteration are special cases of truncated policy iteration (value iteration corresponds to $j_{\text{truncate}} = 1$, and policy iteration to $j_{\text{truncate}} \to \infty$).