2. Bellman Equation
OUTLINE

2.0 Why is the return important?
Calculating the return is important for evaluating a policy.
- Matrix form
- $\mathbf{P}$: determined by both the policy && the state transition probabilities
==Bootstrapping==
2.1 State value
$S_t$
$R_t$ && $R_{t + 1}$:
- In math: the reward that follows $(S_t, A_t)$ is written $R_{t + 1}$
- In code: the same reward is often indexed as $R_t$
- Note that: $S_t, A_t, R_{t + 1}$ are random variables, so we can take expectations $\mathbb{E}[\cdot]$ of them
$G_t$
- $G_t$ is the (discounted) return of a trajectory: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ (see the sketch below)
- Note that: $G_t$ is also a random variable since $R_{t + 1}, R_{t + 2}, \dots$ are random variables
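A minimal Python sketch (not from the original notes) of computing $G_t$ for one sampled trajectory; the reward sequence and `gamma` below are made-up example values.

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# computed for a single sampled trajectory (rewards and gamma are invented).

def discounted_return(rewards, gamma):
    """Sum rewards backwards so that each step satisfies G_t = r + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0, 1.0]                 # R_{t+1}, R_{t+2}, ...
print(discounted_return(rewards, gamma=0.9))   # 0 + 0.9*0 + 0.81*1 + 0.729*1 = 1.539
```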
$v_{\pi}(s)$
- State-value (function)
- Note that:
- A function of $s$: the conditional expectation of $G_t$ given that the trajectory starts from state $s$, i.e. $v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s]$ (conditioning additionally on an action gives the action value in 2.5)
- Based on policy $\pi$
==Difference between return && state value==
- return: obtained from a single trajectory
- state value: the mean of all possible returns that can be obtained starting from a state (the average of all possible $G_t$ starting from that state)
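To illustrate the difference, here is a hedged sketch that estimates a state value as the mean of many sampled returns; the 2-state transition model, rewards, and discount factor are all invented for the demo.

```python
import random

# Monte Carlo illustration of "state value = mean of returns": sample many
# trajectories from a made-up 2-state Markov reward process and average G_t.
P = {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.2), (1, 0.8)]}   # s -> [(s', prob), ...]
R = {0: 0.0, 1: 1.0}                                     # reward received when leaving s
gamma = 0.9

def sample_return(s, horizon=200):
    """Sample one trajectory from state s and return its (truncated) discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]
        discount *= gamma
        (s,) = random.choices([sp for sp, _ in P[s]],
                              weights=[p for _, p in P[s]])
    return g

returns = [sample_return(0) for _ in range(5000)]
print(sum(returns) / len(returns))                       # estimate of v(s=0)
```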
2.2 Bellman Equation: Derivation
Describes the relationship among the values of all states. Start from $v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$, which splits into two terms:
immediate reward: $\mathbb{E}[R_{t + 1} | S_t = s]$, mean of immediate rewards
future return: $\gamma \mathbb{E}[G_{t + 1} | S_t = s]$, mean of future rewards
for the immediate reward: $\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$
for the future return: $\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s^{\prime}} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s^{\prime}]\, p(s^{\prime} \mid s) = \sum_{s^{\prime}} v_{\pi}(s^{\prime}) \sum_{a} p(s^{\prime} \mid s, a)\, \pi(a \mid s)$
- $\sum_{s^{\prime}} \mathbb{E}[G_{t + 1} \mid S_t = s, S_{t + 1} = s^{\prime}] \rightarrow \mathbb{E}[G_{t + 1} \mid S_{t + 1} = s^{\prime}]$: memoryless Markov property
Bellman Equation: $v_{\pi}(s) = \sum_{a} \pi(a \mid s)\left[\sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p(s^{\prime} \mid s, a)\, v_{\pi}(s^{\prime})\right]$, for all $s$
- Describes the relationship among the state-value functions of different states
- Consists of two parts: immediate reward && future return
- Notations:
- $v_{\pi}(s)$ and $v_{\pi}(s^{\prime})$ are the state values to be computed, via ==Bootstrapping==
- $\pi(a \mid s)$ is the given policy; the process of solving the Bellman equation is called policy evaluation
- $p(r \mid s, a)$ and $p(s^{\prime} \mid s, a)$ represent the model; the two cases are model known and model unknown
The larger $v_{\pi}(s)$ is, the better the state (the closer it is to the target)
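A small sketch of the elementwise Bellman equation above: it evaluates the right-hand side $\sum_a \pi(a \mid s)\left[\,\cdot\,\right]$ for a toy MDP. The states, actions, $\pi$, $p(r \mid s, a)$, $p(s^{\prime} \mid s, a)$, and $\gamma$ are all assumed toy values, not anything from the notes.

```python
# Elementwise Bellman right-hand side for one state:
#   sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]
# The tiny 2-state / 2-action model below is invented purely to show the structure.
gamma = 0.9
states, actions = [0, 1], ["left", "right"]
pi = {0: {"left": 0.5, "right": 0.5}, 1: {"left": 0.0, "right": 1.0}}   # pi(a|s)
p_r = {(0, "left"): [(0.0, 1.0)], (0, "right"): [(1.0, 1.0)],           # (reward, prob)
       (1, "left"): [(0.0, 1.0)], (1, "right"): [(1.0, 1.0)]}
p_s = {(0, "left"): [(0, 1.0)], (0, "right"): [(1, 1.0)],               # (next state, prob)
       (1, "left"): [(0, 1.0)], (1, "right"): [(1, 1.0)]}

def bellman_rhs(s, v):
    """Evaluate the right-hand side of the elementwise Bellman equation at state s."""
    total = 0.0
    for a in actions:
        imm = sum(prob * r for r, prob in p_r[(s, a)])        # expected immediate reward
        fut = sum(prob * v[sp] for sp, prob in p_s[(s, a)])   # expected future value
        total += pi[s][a] * (imm + gamma * fut)
    return total

v_guess = {0: 0.0, 1: 0.0}
print(bellman_rhs(0, v_guess), bellman_rhs(1, v_guess))
```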
2.3 Bellman Equation: Matrix-vector form
How to solve the Bellman Equation?
The difficulty: one unknown ($v_{\pi}(s)$) relies on another unknown ($v_{\pi}(s^{\prime})$), so the equations for all states must be solved together.
Rewrite the Bellman equation as $v_{\pi}(s) = r_{\pi}(s) + \gamma \sum_{s^{\prime}} p_{\pi}(s^{\prime} \mid s)\, v_{\pi}(s^{\prime})$,
where $r_{\pi}(s) \triangleq \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$ and $p_{\pi}(s^{\prime} \mid s) \triangleq \sum_{a} \pi(a \mid s)\, p(s^{\prime} \mid s, a)$, then stack the equations for all states to get the
Matrix-vector form: $\boldsymbol{v}_{\pi} = \boldsymbol{r}_{\pi} + \gamma \boldsymbol{P}_{\pi} \boldsymbol{v}_{\pi}$
Where:
- $\boldsymbol{v}_{\pi} = [v_{\pi}(s_1), \dots, v_{\pi}(s_n)]^T \in \mathbb{R}^n$
- $\boldsymbol{r}_{\pi} = [r_{\pi}(s_1), \dots, r_{\pi}(s_n)]^T \in \mathbb{R}^n$
- $\boldsymbol{P}_{\pi} \in \mathbb{R}^{n \times n}$, where $[\boldsymbol{P}_{\pi}]_{ij} = p_{\pi}(s_j \mid s_i)$, is the state transition matrix
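A sketch of how $\boldsymbol{r}_{\pi}$ and $\boldsymbol{P}_{\pi}$ could be assembled from $\pi(a \mid s)$, $p(r \mid s, a)$, and $p(s^{\prime} \mid s, a)$; it assumes the same toy dictionaries defined in the previous sketch.

```python
import numpy as np

# Build r_pi and P_pi from pi(a|s), p(r|s,a), p(s'|s,a) as defined above:
#   r_pi(s)     = sum_a pi(a|s) sum_r p(r|s,a) r
#   P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a)
def build_matrix_form(states, actions, pi, p_r, p_s):
    n = len(states)
    r_pi = np.zeros(n)
    P_pi = np.zeros((n, n))
    for i, s in enumerate(states):
        for a in actions:
            r_pi[i] += pi[s][a] * sum(prob * r for r, prob in p_r[(s, a)])
            for sp, prob in p_s[(s, a)]:
                P_pi[i, states.index(sp)] += pi[s][a] * prob
    return r_pi, P_pi

# Usage with the toy model from the previous sketch:
# r_pi, P_pi = build_matrix_form(states, actions, pi, p_r, p_s)
```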
2.4 Bellman Equation: Solve the state values
policy evaluation
: given a policy, solve for the corresponding state values
- Closed-form solution of the matrix-vector Bellman equation: $\boldsymbol{v}_{\pi} = (\boldsymbol{I} - \gamma \boldsymbol{P}_{\pi})^{-1} \boldsymbol{r}_{\pi}$
Requires a matrix inverse, so it is rarely used in practice
- Iterative solution: $\boldsymbol{v}_{k+1} = \boldsymbol{r}_{\pi} + \gamma \boldsymbol{P}_{\pi} \boldsymbol{v}_{k}, \quad k = 0, 1, 2, \dots$
We have $\boldsymbol{v}_{k} \rightarrow \boldsymbol{v}_{\pi} = (\boldsymbol{I} - \gamma \boldsymbol{P}_{\pi})^{-1} \boldsymbol{r}_{\pi}$ as $k \rightarrow \infty$ (see the sketch after the proof)
==Proof==: let $\boldsymbol{\delta}_k = \boldsymbol{v}_k - \boldsymbol{v}_{\pi}$; substituting into the iteration gives $\boldsymbol{\delta}_{k+1} = \gamma \boldsymbol{P}_{\pi} \boldsymbol{\delta}_k$, so $\|\boldsymbol{\delta}_k\|_{\infty} \le \gamma^{k} \|\boldsymbol{\delta}_0\|_{\infty} \rightarrow 0$ because $\gamma < 1$ and $\boldsymbol{P}_{\pi}$ is row-stochastic
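A sketch comparing the closed-form and iterative solutions in matrix-vector form; $\boldsymbol{r}_{\pi}$, $\boldsymbol{P}_{\pi}$, and $\gamma$ below are invented placeholders (any reward vector, row-stochastic matrix, and discount in $[0, 1)$ would work).

```python
import numpy as np

# Closed-form vs. iterative policy evaluation in matrix-vector form.
gamma = 0.9
r_pi = np.array([0.5, 1.0])            # invented reward vector
P_pi = np.array([[0.5, 0.5],           # invented row-stochastic transition matrix
                 [0.2, 0.8]])

# Closed form: v = (I - gamma * P_pi)^{-1} r_pi  (matrix inverse, rarely used at scale)
v_closed = np.linalg.inv(np.eye(2) - gamma * P_pi) @ r_pi

# Iterative form: v_{k+1} = r_pi + gamma * P_pi v_k, converges to the same v
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)                     # the two solutions agree up to numerical precision
```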
2.5 Action value
From state value to action value:
- State value: the average return the agent can get starting from a state
- Action value: the average return the agent can get starting from a state && taking an action
One is a way to evaluate a state; the other is a way to evaluate an action.
- Definition: $q_{\pi}(s, a) \triangleq \mathbb{E}[G_t \mid S_t = s, A_t = a]$
Where:
- $q_{\pi}(s, a)$ is a function of the state-action pair $(s, a)$
- $q_{\pi}(s, a)$ depends on $\pi$
Hence: $v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s, a)$
From equation (8):
That is: $q_{\pi}(s, a) = \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s^{\prime}} p(s^{\prime} \mid s, a)\, v_{\pi}(s^{\prime})$
==Note that:== $v_{\pi}(s^{\prime})$ appears again in equation (19) above
So we can ==summarize==: (17) && (19) are two sides of the same coin
- (17) obtains the state value from the action values
- (19) obtains the action value from the state values
When computing, we can first compute all state values and then compute the action values,
or we can compute the action values directly via model-based/model-free methods.
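A sketch of the two relations above, assuming the toy dictionaries (`pi`, `p_r`, `p_s`, `actions`) and `gamma` from the earlier sketches: `q_from_v` obtains action values from state values, and `v_from_q` goes back the other way.

```python
# Two sides of the same coin:
#   q_pi(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s')
#   v_pi(s)   = sum_a pi(a|s) q_pi(s,a)
# Relies on the invented toy dictionaries pi, p_r, p_s, actions defined earlier.

def q_from_v(s, a, v, gamma=0.9):
    """Action value from state values."""
    imm = sum(prob * r for r, prob in p_r[(s, a)])
    fut = sum(prob * v[sp] for sp, prob in p_s[(s, a)])
    return imm + gamma * fut

def v_from_q(s, q):
    """State value from action values."""
    return sum(pi[s][a] * q[(s, a)] for a in actions)

# Usage: given state values v (e.g. from policy evaluation),
# q = {(s, a): q_from_v(s, a, v) for s in states for a in actions}
```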
2.6 Summary
- State value
- Action value
- Bellman Equation
- elementwise form
- matrix-vector form
- How to solve Bellman Equation
- closed form
- iterative form