2. Bellman Equation

OUTLINE

2.0 Why is the return important?

Calculating the return is important for evaluating a policy.

  • Matrix form
  • $\mathbf{P}$: policy and state transition

==Bootstrapping==

2.1 State value

$S_t$: the state at time $t$

$R_t$ and $R_{t + 1}$: the immediate reward obtained after taking action $A_t$ in state $S_t$

  • In math it is written as $R_{t + 1}$
  • In code it is often written as $R_t$
  • Note that $S_t, A_t, R_{t + 1}$ are random variables, so expectations $\mathbb{E}[\,\cdot\,]$ can be taken over them

$G_t$

  • $G_t$ is the discounted return of a trajectory: $G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \cdots$ (see the sketch below)
  • Note that $G_t$ is also a random variable, since $R_{t + 1}, R_{t + 2}, \dots$ are random variables
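
A minimal Python sketch of this definition, computing $G_t$ for one sampled reward sequence by accumulating backwards with $G_t = R_{t + 1} + \gamma G_{t + 1}$; the reward list and $\gamma$ below are made-up illustration values.

```python
# A minimal sketch (made-up reward values): compute the discounted return G_t
# for one sampled reward sequence, accumulating backwards with
#   G_t = R_{t+1} + gamma * G_{t+1}.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # walk the trajectory from the end to the start
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0 + 0.9*0 + 0.9**2 * 1 = 0.81
```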

$v_{\pi}(s)$

  • The state-value function
  • Note that:
    • It is a function of $s$: a conditional expectation, conditioned on the state starting from $s$, i.e. $v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s]$ (conditioning additionally on an action gives the action value $q_{\pi}(s, a)$; see 2.5)
    • It is based on the policy $\pi$

==Difference between== the return and the state value

  • The return is defined for a single trajectory
  • The state value is the mean of all possible returns $G_t$ that can be obtained starting from a state (see the sketch below)
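
A minimal sketch of "state value = mean of all possible returns": sample many trajectories starting from the same state and average their discounted returns. The two-step coin-flip reward below is a made-up toy, purely for illustration.

```python
import random

# A minimal sketch: estimate a state value by averaging the returns of many
# sampled trajectories that all start from the same state. The dynamics
# (reward 1 with probability 0.5 at each of two steps) are made up.
GAMMA = 0.9

def sample_return(steps=2):
    g, discount = 0.0, 1.0
    for _ in range(steps):
        r = 1.0 if random.random() < 0.5 else 0.0   # reward 1 with probability 0.5
        g += discount * r
        discount *= GAMMA
    return g                                        # the return of one trajectory

returns = [sample_return() for _ in range(100_000)]
print(sum(returns) / len(returns))   # approaches 0.5 + 0.9 * 0.5 = 0.95
```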

2.2 Bellman Equation: Derivation

The Bellman equation describes the relationship among the values of all states. Starting from $G_t = R_{t + 1} + \gamma G_{t + 1}$,

$$v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t + 1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t + 1} \mid S_t = s]$$

  • Immediate reward: $\mathbb{E}[R_{t + 1} \mid S_t = s]$, the mean of the immediate rewards

  • Future return: $\gamma\, \mathbb{E}[G_{t + 1} \mid S_t = s]$, the mean of the future rewards

For the immediate reward:

$$\mathbb{E}[R_{t + 1} \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$$

For the future return:

$$\mathbb{E}[G_{t + 1} \mid S_t = s] = \sum_{s^{\prime}} \mathbb{E}[G_{t + 1} \mid S_t = s, S_{t + 1} = s^{\prime}]\, p(s^{\prime} \mid s) = \sum_{s^{\prime}} v_{\pi}(s^{\prime}) \sum_{a} p(s^{\prime} \mid s, a)\, \pi(a \mid s)$$

  • $\mathbb{E}[G_{t + 1} \mid S_t = s, S_{t + 1} = s^{\prime}] = \mathbb{E}[G_{t + 1} \mid S_{t + 1} = s^{\prime}]$: by the memoryless Markov property

Bellman equation (elementwise form):

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p(s^{\prime} \mid s, a)\, v_{\pi}(s^{\prime}) \right], \quad \text{for every state } s$$

  1. It describes the relationship between the state-value functions of different states
  2. It consists of two parts: the immediate reward and the future return
  3. Notations:
    • $v_{\pi}(s)$ and $v_{\pi}(s^{\prime})$ are the unknowns to be computed, via ==Bootstrapping==
    • $\pi(a \mid s)$ is the given policy; solving the Bellman equation for it is called policy evaluation
    • $p(r \mid s, a)$ and $p(s^{\prime} \mid s, a)$ represent the model, which may be known or unknown

The larger $v_{\pi}(s)$ is, the better the state (the closer it is to the target).
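
A quick worked example (a made-up two-state setup, not from the original notes): suppose the policy deterministically moves from $s_1$ to $s_2$ with reward $0$ and stays at $s_2$ with reward $1$. The elementwise Bellman equation then reads

$$v_{\pi}(s_2) = 1 + \gamma\, v_{\pi}(s_2) \;\Rightarrow\; v_{\pi}(s_2) = \frac{1}{1 - \gamma}, \qquad v_{\pi}(s_1) = 0 + \gamma\, v_{\pi}(s_2) = \frac{\gamma}{1 - \gamma}$$

so $v_{\pi}(s_2) > v_{\pi}(s_1)$: the state closer to the reward has the larger state value.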

2.3 Bellman Equation: Matrix-vector form

How to solve the Bellman Equation?

The difficulty: one unknown ($v_{\pi}(s)$) depends on another unknown ($v_{\pi}(s^{\prime})$).

Rewrite the Bellman equation as

$$v_{\pi}(s) = r_{\pi}(s) + \gamma \sum_{s^{\prime}} p_{\pi}(s^{\prime} \mid s)\, v_{\pi}(s^{\prime})$$

where $r_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$ and $p_{\pi}(s^{\prime} \mid s) = \sum_{a} \pi(a \mid s)\, p(s^{\prime} \mid s, a)$.

Then, writing this equation for every state $s_i$ ($i = 1, \dots, n$):

$$v_{\pi}(s_i) = r_{\pi}(s_i) + \gamma \sum_{j} p_{\pi}(s_j \mid s_i)\, v_{\pi}(s_j)$$

Stacking the $n$ equations gives the matrix-vector form:

$$\boldsymbol{v}_{\pi} = \boldsymbol{r}_{\pi} + \gamma \boldsymbol{P}_{\pi} \boldsymbol{v}_{\pi}$$

Where:

  • $\boldsymbol{v}_{\pi} = [v_{\pi}(s_1), \dots, v_{\pi}(s_n)]^T \in \mathbb{R}^n$
  • $\boldsymbol{r}_{\pi} = [r_{\pi}(s_1), \dots, r_{\pi}(s_n)]^T \in \mathbb{R}^n$
  • $\boldsymbol{P}_{\pi} \in \mathbb{R}^{n \times n}$, where $[\boldsymbol{P}_{\pi}]_{ij} = p_{\pi}(s_j \mid s_i)$, is the state transition matrix
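
A minimal NumPy sketch of how $\boldsymbol{r}_{\pi}$ and $\boldsymbol{P}_{\pi}$ can be assembled from a policy and a model; all array names, shapes, and values below (`pi`, `P`, `R`, `n_states`, `n_actions`) are assumptions made up for illustration.

```python
import numpy as np

# A minimal sketch (made-up shapes/values): build r_pi and P_pi from a policy
# pi[s, a] = pi(a|s), a transition model P[s, a, s'] = p(s'|s, a), and the
# expected immediate reward R[s, a] = sum_r p(r|s, a) * r.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

pi = np.full((n_states, n_actions), 1.0 / n_actions)               # uniform policy
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # valid p(s'|s,a)
R = rng.standard_normal((n_states, n_actions))                     # made-up rewards

r_pi = np.einsum("sa,sa->s", pi, R)     # r_pi(s)     = sum_a pi(a|s) * r(s, a)
P_pi = np.einsum("sa,sat->st", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)

assert np.allclose(P_pi.sum(axis=1), 1.0)   # each row of P_pi is a distribution
```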

2.4 Bellman Equation: Solve the state values

Policy evaluation: given a policy, compute the corresponding state values.

  • Closed-form solution of the matrix-vector form: $\boldsymbol{v}_{\pi} = (\boldsymbol{I} - \gamma \boldsymbol{P}_{\pi})^{-1} \boldsymbol{r}_{\pi}$

This requires a matrix inverse, so it is rarely used in practice.

  • Iterative solution: $\boldsymbol{v}_{k + 1} = \boldsymbol{r}_{\pi} + \gamma \boldsymbol{P}_{\pi} \boldsymbol{v}_{k}$, for $k = 0, 1, 2, \dots$

which gives $\boldsymbol{v}_{k} \rightarrow \boldsymbol{v}_{\pi} = (\boldsymbol{I} - \gamma \boldsymbol{P}_{\pi})^{-1} \boldsymbol{r}_{\pi}$ as $k \rightarrow \infty$.

==Proof== (sketch): define the error $\boldsymbol{\delta}_k = \boldsymbol{v}_k - \boldsymbol{v}_{\pi}$. Subtracting the two equations above gives $\boldsymbol{\delta}_{k + 1} = \gamma \boldsymbol{P}_{\pi} \boldsymbol{\delta}_k$, hence $\boldsymbol{\delta}_k = \gamma^k \boldsymbol{P}_{\pi}^k \boldsymbol{\delta}_0$. Every entry of $\boldsymbol{P}_{\pi}^k$ lies in $[0, 1]$ and $\gamma < 1$, so $\boldsymbol{\delta}_k \rightarrow 0$ as $k \rightarrow \infty$.
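
A minimal NumPy sketch comparing the two solutions, reusing the made-up two-state example sketched earlier (reward 0 going from $s_1$ to $s_2$, reward 1 staying at $s_2$); the numbers are assumptions for illustration only.

```python
import numpy as np

# A minimal sketch: solve v_pi = r_pi + gamma * P_pi @ v_pi both in closed form
# and iteratively, on a made-up two-state example.
gamma = 0.9
r_pi = np.array([0.0, 1.0])
P_pi = np.array([[0.0, 1.0],    # from s1 the policy moves to s2
                 [0.0, 1.0]])   # from s2 the policy stays at s2

# closed form: v_pi = (I - gamma * P_pi)^{-1} r_pi  (matrix inverse, rarely used at scale)
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# iterative: v_{k+1} = r_pi + gamma * P_pi @ v_k, converges to v_pi as k -> infinity
v = np.zeros(2)
for _ in range(200):
    v = r_pi + gamma * P_pi @ v

print(v_closed)   # [ 9. 10.]  since v(s2) = 1/(1-0.9) = 10 and v(s1) = 0.9 * 10 = 9
print(v)          # approximately the same
```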

2.5 Action value

From the state value to the action value:

  • State value: the average return the agent can get starting from a state
  • Action value: the average return the agent can get starting from a state and taking an action

One is a way to evaluate a state; the other is a way to evaluate an action.

  • Definition: $q_{\pi}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$

Where:

  • $q_{\pi}(s, a)$ is a function of the state-action pair $(s, a)$
  • $q_{\pi}(s, a)$ depends on $\pi$

Hence, by the law of total expectation, the state value is the policy-weighted average of the action values:

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)$$

Comparing this with the elementwise Bellman equation

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p(s^{\prime} \mid s, a)\, v_{\pi}(s^{\prime}) \right]$$

gives

$$q_{\pi}(s, a) = \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p(s^{\prime} \mid s, a)\, v_{\pi}(s^{\prime})$$

==Note that== in the expression for $q_{\pi}(s, a)$ above, $v_{\pi}(s^{\prime})$ appears again.

So, to ==summarize==: the two equations above are two sides of the same coin:

  • one obtains the action value from the state values
  • the other obtains the state value from the action values
  • when computing, one can therefore first compute all the state values and then compute the action values (see the sketch after this list)

  • Action values can also be computed directly, using model-based or model-free methods
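
A minimal NumPy sketch of the first route (compute all state values, then the action values); the random model below (`pi`, `P`, `R`) is an assumption made up for illustration.

```python
import numpy as np

# A minimal sketch (made-up random model): obtain action values from state values via
#   q_pi(s, a) = sum_r p(r|s, a) * r + gamma * sum_{s'} p(s'|s, a) * v_pi(s'),
# then recover state values via v_pi(s) = sum_a pi(a|s) * q_pi(s, a).
gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi(a|s)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p(s'|s, a)
R = rng.standard_normal((n_states, n_actions))                     # expected r(s, a)

# first compute all state values (closed form from Section 2.4)
r_pi = np.einsum("sa,sa->s", pi, R)
P_pi = np.einsum("sa,sat->st", pi, P)
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

q_pi = R + gamma * np.einsum("sat,t->sa", P, v_pi)          # action value from state values
assert np.allclose(np.einsum("sa,sa->s", pi, q_pi), v_pi)   # state value from action values
```

The final assertion checks the other side of the coin: averaging the action values under the policy recovers the state values.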

2.6 Summary

  • State value
  • Action value
  • Bellman Equation
    • elementwise form
    • matrix-vector form
  • How to solve Bellman Equation
    • closed form
    • iterative form