4. Value Iteration and Policy Iteration
4.0 Preface
In the chapter on the Bellman optimality equation (BOE), the contraction mapping theorem suggested an iterative algorithm
- which can eventually find the optimal state value and an optimal policy,
- and which is called value iteration.
4.1 Value iteration algorithm
The algorithm can be decomposed into two steps:
- Step 1: policy update, which solves
  $$\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k),$$
  where $v_k$ is given.
- Step 2: value update:
  $$v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_k.$$
- ‼️ Note: $v_k$ is not a state value, so $v_k$ is not guaranteed to satisfy a Bellman equation.
- The matrix-vector form is useful for theoretical analysis.
- The elementwise form is useful for implementation.
The elementwise form is used below to show how to implement the algorithm.
Step 1: Policy update
- Matrix-vector form:
  $$\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_k)$$
- Elementwise form:
  $$\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) \underbrace{\left( \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k(s^\prime) \right)}_{q_k(s, a)}, \quad s \in \mathcal{S}$$
  where $\sum_r p(r \mid s, a) r$ and $\gamma \sum_{s^\prime} p(s^\prime \mid s, a)$ are known, so substituting the given $v_k(s^\prime)$ yields $q_k(s, a)$.
- Optimal policy:
  $$\pi_{k + 1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s) \\ 0, & a \ne a^*_k(s) \end{cases}$$
  where $a^*_k(s) = \arg \max_a q_k(s, a)$.
  Since only the action with the greatest $q$-value is selected, $\pi_{k + 1}$ is a greedy policy. A minimal code sketch of this step is given after this list.
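Below is a minimal NumPy sketch of the elementwise policy update for a single state, under an assumed (hypothetical) model layout that is not from the notes: `P[s, a, s_next]` holds $p(s^\prime \mid s, a)$ and `R[s, a]` holds the expected immediate reward $\sum_r p(r \mid s, a)\, r$.

```python
import numpy as np

def policy_update_for_state(s, v_k, P, R, gamma=0.9):
    """Compute q_k(s, a) for all actions and the greedy action a*_k(s).

    Assumed (hypothetical) model layout: P[s, a, s'] = p(s'|s, a),
    R[s, a] = expected immediate reward sum_r p(r|s, a) * r.
    """
    q_k = R[s] + gamma * P[s] @ v_k   # q_k(s, a) = E[r] + gamma * sum_{s'} p(s'|s, a) v_k(s')
    a_star = int(np.argmax(q_k))      # a*_k(s) = argmax_a q_k(s, a)
    return a_star, q_k                # pi_{k+1}(a|s) puts probability 1 on a_star
```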
Step 2: Value update
- Matrix-vector form:
  $$v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_k$$
- Elementwise form:
  $$v_{k + 1}(s) = \sum_a \pi_{k + 1}(a \mid s) \left( \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k(s^\prime) \right), \quad s \in \mathcal{S}$$
  where $\pi_{k + 1}(a \mid s)$ was obtained in Step 1 and is a greedy policy, so the expression simplifies to
  $$v_{k + 1}(s) = \max_a q_k(s, a).$$
Procedure summary
- Begin with an initial guess $v_0(s)$.
- In Step 1 (policy update), compute $q_k(s, a)$ and obtain the greedy policy $\pi_{k + 1}$.
- In Step 2 (value update), obtain the new value $v_{k + 1}(s) = \max_a q_k(s, a)$.
Pseudocode
Pseudocode: Value iteration algorithm |
---|
Initialization: 1. An initial guess $v_0$. 2. The probability models $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ are known for all pairs $(s, a)$. |
Aim: Search for the optimal state value and an optimal policy satisfying the Bellman optimality equation. |
While $v_k$ has not converged (i.e., while $\|v_k - v_{k - 1}\|$ is still greater than a predefined threshold), for the $k$th iteration, do |
For every state $s \in \mathcal{S}$, do |
For every action $a \in \mathcal{A}(s)$, do |
Q-value: $q_k(s,a) = \sum_r p(r \mid s, a)r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)v_k(s^\prime)$ |
Maximum action value: $a^*_k(s) = \arg \max_a q_k(s, a)$ |
Policy update: $\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise |
Value update: $v_{k + 1}(s) = \max_a q_k(s, a)$ |
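A minimal NumPy sketch of the whole algorithm, under the same hypothetical model layout as above (`P[s, a, s']` for $p(s^\prime \mid s, a)$ and `R[s, a]` for the expected immediate reward):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration on a known model.

    Assumed (hypothetical) layout: P[s, a, s'] = p(s'|s, a), R[s, a] = sum_r p(r|s, a) * r.
    Returns the converged value vector and the greedy (deterministic) policy.
    """
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)                   # initial guess v_0
    while True:
        q = R + gamma * P @ v                # q_k(s, a) for all (s, a), shape (|S|, |A|)
        policy = q.argmax(axis=1)            # policy update: greedy policy pi_{k+1}
        v_new = q.max(axis=1)                # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:  # stop once ||v_{k+1} - v_k|| is below the threshold
            return v_new, policy
        v = v_new
```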
4.2 Policy iteration algorithm
- Given a random initial policy $\pi_0$
- Step 1: policy evaluation (PE), which calculates the state value of the given $\pi_k$:
  $$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$
  Note that $v_{\pi_k}$ is a state value function.
- Step 2: policy improvement (PI):
  $$\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$$
  The maximization is componentwise (performed state by state).
Sequence
$$\pi_0 \stackrel{PE}{\longrightarrow} v_{\pi_0} \stackrel{PI}{\longrightarrow} \pi_1 \stackrel{PE}{\longrightarrow} v_{\pi_1} \stackrel{PI}{\longrightarrow} \pi_2 \stackrel{PE}{\longrightarrow} v_{\pi_2} \stackrel{PI}{\longrightarrow} \cdots$$
Questions
- In PE, how do we get the state value $v_{\pi_k}$ by solving the BE?
- In PI, why is the new policy $\pi_{k + 1}$ better than $\pi_k$?
- Why can such an iterative algorithm finally reach an optimal policy?
- What is the relationship between policy iteration and value iteration?
Q1
Solve the BE using either the closed-form solution or an iterative solution (finding the fixed point); a minimal sketch of the closed-form option is given below, and the iterative option is sketched under Step 1 further down.
- Policy iteration is an iterative algorithm with another iterative algorithm embedded in the policy evaluation step.
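A minimal NumPy sketch of the closed-form solution, assuming (hypothetically) that the policy-induced model is given as the matrix `P_pi` ($P_{\pi_k}$) and the vector `r_pi` ($r_{\pi_k}$):

```python
import numpy as np

def policy_evaluation_closed_form(P_pi, r_pi, gamma=0.9):
    """Closed-form solution of the Bellman equation v = r_pi + gamma * P_pi v,
    i.e. v_pi = (I - gamma * P_pi)^{-1} r_pi, computed via a linear solve."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```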
Q2
Lemma (Policy improvement) ==Proof.==
If $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$, then $v_{\pi_{k + 1}} \ge v_{\pi_k}$ for any $k$.
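A sketch of the standard argument behind the lemma, using the Bellman equations of $\pi_k$ and $\pi_{k + 1}$:

$$
\begin{aligned}
v_{\pi_k} &= r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k} \le \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k}) = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_{\pi_k}, \\
v_{\pi_{k + 1}} &= r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}} v_{\pi_{k + 1}},
\end{aligned}
$$

so subtracting gives $v_{\pi_k} - v_{\pi_{k + 1}} \le \gamma P_{\pi_{k + 1}} (v_{\pi_k} - v_{\pi_{k + 1}})$. Applying this inequality repeatedly yields $v_{\pi_k} - v_{\pi_{k + 1}} \le \gamma^n P_{\pi_{k + 1}}^n (v_{\pi_k} - v_{\pi_{k + 1}}) \to 0$ as $n \to \infty$ (since $\gamma < 1$ and $P_{\pi_{k + 1}}^n$ is a stochastic matrix), hence $v_{\pi_k} \le v_{\pi_{k + 1}}$.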
Q3
Every iteration improves the policy, so
$$v_{\pi_0} \le v_{\pi_1} \le v_{\pi_2} \le \cdots \le v^*,$$
i.e., $v_{\pi_k}$ keeps increasing and eventually converges.
It remains to show that the limit is $v^*$.
Theorem (Convergence of Policy Iteration) ==Proof.==
The state value sequence $\{v_{\pi_k}\}$ generated by policy iteration converges to the optimal state value $v^*$.
Step 1: Policy evaluation
- Matrix-vector form:
  $$v^{(j + 1)}_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v^{(j)}_{\pi_k}, \quad j = 0, 1, 2, \ldots$$
- Elementwise form:
  $$v^{(j + 1)}_{\pi_k}(s) = \sum_a \pi_k(a \mid s) \left( \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v^{(j)}_{\pi_k}(s^\prime) \right), \quad s \in \mathcal{S}$$
Stop the iteration when one of the following holds (see the sketch after this list):
- $j \to \infty$,
- $j$ is sufficiently large, or
- $\|v^{(j + 1)}_{\pi_k} - v^{(j)}_{\pi_k}\|$ is sufficiently small.
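A minimal NumPy sketch of the iterative solution with the norm-based stopping rule, under the same hypothetical `P_pi` / `r_pi` layout as the closed-form sketch above:

```python
import numpy as np

def policy_evaluation_iterative(P_pi, r_pi, gamma=0.9, tol=1e-8, max_sweeps=10_000):
    """Fixed-point iteration v^{(j+1)} = r_pi + gamma * P_pi @ v^{(j)}.

    Stops when ||v^{(j+1)} - v^{(j)}|| is sufficiently small (or after max_sweeps,
    a practical stand-in for "j sufficiently large").
    """
    v = np.zeros_like(r_pi, dtype=float)   # arbitrary initial guess v^{(0)}
    for _ in range(max_sweeps):
        v_new = r_pi + gamma * P_pi @ v
        if np.linalg.norm(v_new - v) < tol:
            break
        v = v_new
    return v_new
```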
Step 2: Policy improvement
- Matrix-vector form:
  $$\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$$
- Elementwise form:
  $$\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) \underbrace{\left( \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_{\pi_k}(s^\prime) \right)}_{q_{\pi_k}(s, a)}, \quad s \in \mathcal{S}$$
where $q_{\pi_k}(s, a)$ is the action value under policy $\pi_k$. Let
$$a^*_k(s) = \arg \max_a q_{\pi_k}(s, a),$$
then the greedy policy is
$$\pi_{k + 1}(a \mid s) = \begin{cases} 1, & a = a^*_k(s) \\ 0, & a \ne a^*_k(s). \end{cases}$$
Pseudocode
Pseudocode: Policy iteration algorithm |
---|
Initialization: 1. An initial guess $\pi_0$. 2. The probability models $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ are known for all pairs $(s, a)$. |
Aim: Search for the optimal state value and an optimal policy. |
While the policy has not converged, for the $k$th iteration, do |
1. Policy evaluation: |
Initialization: an arbitrary initial guess $v^{(0)}_{\pi_k}$ |
While $\|v^{(j + 1)}_{\pi_k} - v^{(j)}_{\pi_k}\|$ is still greater than a threshold, for the $j$th iteration, do |
For every state $s \in \mathcal{S}$, do |
$v^{(j + 1)}_{\pi_k}(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v^{(j)}_{\pi_k} (s^\prime) \right]$ |
2. Policy improvement (the same as Step 1 of value iteration): |
For every state $s \in \mathcal{S}$, do |
For every action $a \in \mathcal{A}(s)$, do |
$q_{\pi_k}(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_{\pi_k} (s^\prime)$ |
$a^*_k(s) = \arg \max_a q_{\pi_k}(s, a)$ |
$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise |
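A minimal NumPy sketch of the full algorithm, again under the hypothetical `P[s, a, s']` / `R[s, a]` model layout used above; deterministic policies are stored as an integer array `policy[s]`:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    """Policy iteration on a known model (P[s, a, s'] = p(s'|s, a), R[s, a] = expected reward)."""
    n_states, n_actions, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy pi_0
    while True:
        # 1. Policy evaluation: iteratively solve v_pi = r_pi + gamma * P_pi v_pi
        v = np.zeros(n_states)
        while True:
            v_new = R[states, policy] + gamma * P[states, policy] @ v
            if np.linalg.norm(v_new - v) < eval_tol:
                break
            v = v_new
        # 2. Policy improvement: greedy policy w.r.t. q_pi(s, a)
        q = R + gamma * P @ v_new
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # the policy has converged
            return v_new, policy
        policy = new_policy
```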
4.3 Truncated policy iteration algorithm
Comparison
Value iteration
- Initial $v_0$
- Policy update (PU): $\pi_{k + 1} = \arg \max_\pi(r_\pi + \gamma P_\pi v_k)$
- Value update (VU): $v_{k + 1} = r_{\pi_{k + 1}} + \gamma P_{\pi_{k + 1}}v_k$
Policy iteration
- Initial $\pi_0$
- Policy evaluation (PE): $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$
- Policy improvement (PI): $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$
Sequence
Value iteration: $u_0 \stackrel{PU}{\longrightarrow} \pi^\prime_1 \stackrel{VU}{\longrightarrow} u_1 \stackrel{PU}{\longrightarrow} \pi^\prime_2 \stackrel{VU}{\longrightarrow} u_2 \stackrel{PU}{\longrightarrow} \cdots$
Policy iteration: $\pi_0 \stackrel{P E}{\longrightarrow} v_{\pi_0} \stackrel{P I}{\longrightarrow} \pi_1 \stackrel{P E}{\longrightarrow} v_{\pi_1} \stackrel{P I}{\longrightarrow} \pi_2 \stackrel{P E}{\longrightarrow} v_{\pi_2} \stackrel{P I}{\longrightarrow} \cdots$
Tabular
Step | Value Iteration | Policy Iteration | Comments |
---|---|---|---|
1. Policy | N/A | $\pi_0$ | |
2. Value | $v_0 := v_{\pi_0}$ | $v_{\pi_0} = r_{\pi_0} + \gamma P_{\pi_0} v_{\pi_0}$ | |
3. Policy | $\pi_1 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_0)$ | $\pi_1 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_{\pi_0})$ | The two algorithms produce the same policy here |
4. Value | $v_1 = r_{\pi_1} + \gamma P_{\pi_1} v_0$ | $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ | $v_{\pi_1} \ge v_1$ since $v_{\pi_1} \ge v_{\pi_0}$ |
5. Policy | $\pi_2 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_1)$ | $\pi^\prime_2 = \arg \max_\pi (r_\pi + \gamma P_{\pi} v_{\pi_1})$ | |
$\vdots$ | $\vdots$ | $\vdots$ | |
- Note the difference in the subscripts between the two algorithms.
- The first three steps are identical.
- In step 4 (value), value iteration computes only a single iteration (one sweep), whereas policy iteration solves the Bellman equation exactly (in theory, $\infty$ iterations); the value-iteration step is in fact the first iteration of the corresponding policy-iteration step.
Iteration
- For $v_{\pi_1} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}$ in policy iteration, the iterative process is
$$
\begin{aligned}
v_{\pi_1}^{(0)} &= v_0, \\
v_{\pi_1}^{(1)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)} \quad (= v_1 \text{ in value iteration}), \\
v_{\pi_1}^{(2)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)}, \\
&\;\;\vdots \\
v_{\pi_1}^{(j)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j - 1)} \quad (= \bar v_1 \text{ in truncated policy iteration}), \\
&\;\;\vdots \\
v_{\pi_1}^{(\infty)} &= r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)} \quad (= v_{\pi_1} \text{ in policy iteration}).
\end{aligned}
$$
- Value iteration computes only one sweep (inefficient).
- Policy iteration computes $\infty$ sweeps (impossible to carry out in practice).
- Truncated policy iteration computes a finite number of sweeps ($j$); the remaining iterations from $j$ to $\infty$ are truncated.
Pseudocode
Pseudocode: Truncated policy iteration algorithm |
---|
Initialization: 1. An initial guess $\pi_0$. 2. The probability models $p(r \mid s, a)$ and $p(s^\prime \mid s, a)$ are known for all pairs $(s, a)$. |
Aim: Search for the optimal state value and an optimal policy. |
While the policy has not converged, for the $k$th iteration, do |
1. Policy evaluation (truncated): |
Initialization: select the initial guess $v^{(0)}_k = v_{k - 1}$. Set the truncation length $j_{\text{truncate}}$. |
While $j < j_{\text{truncate}}$, do |
For every state $s \in \mathcal{S}$, do |
$v^{(j + 1)}_k(s) = \sum_a \pi_k(a \mid s) \left[ \sum_r p(r \mid s, a)r + \gamma \sum_{s^\prime}p(s^\prime \mid s, a)v^{(j)}_k (s^\prime) \right]$ |
Set $v_k = v^{(j_{\text{truncate}})}_k$ |
2. Policy improvement (the same as Step 1 of value iteration / Step 2 of policy iteration): |
For every state $s \in \mathcal{S}$, do |
For every action $a \in \mathcal{A}(s)$, do |
$q_k(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s^\prime} p(s^\prime \mid s, a) v_k (s^\prime)$ |
$a^*_k(s) = \arg \max_a q_k(s, a)$ |
$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise |
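A minimal NumPy sketch, under the same hypothetical `P[s, a, s']` / `R[s, a]` model layout as the earlier sketches; the value vector is warm-started with the previous iteration's estimate ($v_k^{(0)} = v_{k - 1}$):

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=10, tol=1e-6):
    """Truncated policy iteration: policy evaluation runs only j_truncate sweeps."""
    n_states, n_actions, _ = P.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)   # initial policy pi_0
    v = np.zeros(n_states)                   # v_0; reused as the warm start v_k^{(0)} = v_{k-1}
    while True:
        v_prev = v.copy()
        # 1. Truncated policy evaluation: only j_truncate fixed-point sweeps under pi_k
        for _ in range(j_truncate):
            v = R[states, policy] + gamma * P[states, policy] @ v
        # 2. Policy improvement: greedy policy w.r.t. q_k(s, a) = R(s, a) + gamma * sum_{s'} p(s'|s, a) v_k(s')
        q = R + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_prev)) < tol:  # stop once the value estimate stabilizes
            return v, policy
```

With `j_truncate = 1` this behaves like value iteration, and letting the inner loop run until convergence recovers policy iteration, which matches the remark in the summary below.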
Convergence
Proposition (Value Improvement) ==Proof.==
The convergence proofs proceed as $VI \to PI \to \text{truncated } PI$, and all of them build on the convergence of $VI$.
Selecting $j_{\text{truncate}}$
- The larger $j$ is, the faster the value estimate converges.
- As $j$ grows, the benefit of increasing $j$ further diminishes.
- In practice, $j$ is usually chosen to be fairly small.
4.4 Summary
- Value iteration
- Policy iteration
- Truncated policy iteration
Both VI and PI are special cases of TPI.