5. MC
5.0 Preface
DP / Value iteration / Policy iteration: Model-Based
MC: Model-Free
- How can we estimate something without a model?
- Monte Carlo Estimation
Example: Flip a coin
The outcome (head / tail) is determined by a random variable $X$:
- Head: $X = +1$
- Tail: $X = -1$
The goal is to compute $\mathbb{E}[X]$
Method 1: Model-Based
- Assume the probability model, i.e., $p(X = +1)$ and $p(X = -1)$, is known
- Then, by definition, $\mathbb{E}[X] = p(X = +1) \cdot (+1) + p(X = -1) \cdot (-1)$
However, knowing the exact probability model in advance is usually impossible.
Method 2: Model-Free
- Main idea: flip the coin many times and average the outcomes
- Suppose we obtain a sample sequence $\{ x_1, x_2, \dots, x_N \}$. The expectation can then be estimated as $\mathbb{E}[X] \approx \bar{x} = \frac{1}{N} \sum_{j = 1}^{N} x_j$
This is the idea of Monte Carlo estimation.
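A minimal sketch of this sample-average estimate, assuming a fair coin encoded as $+1$ (head) / $-1$ (tail); the function name `mc_estimate` is ours for illustration:

```python
import random

def mc_estimate(num_samples: int) -> float:
    """Estimate E[X] for the coin flip by averaging i.i.d. samples."""
    samples = [random.choice((+1, -1)) for _ in range(num_samples)]  # fair coin assumed
    return sum(samples) / num_samples

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(f"N = {n:>6}: estimate of E[X] = {mc_estimate(n):+.4f}")  # true value is 0
```

As $N$ grows, the estimate approaches the true value $0$, which is exactly the point discussed next.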
Question:
Is Monte Carlo estimation accurate?
- When $N$ is small, the estimate is not accurate
- As $N$ increases, the estimate becomes more and more accurate (==Law of Large Numbers==)
- MC stands for a broad class of methods that solve estimation problems by repeated sampling
- With MC, no model is needed
- Why do we care about mean estimation? Because state values and action values are both defined as expectations of random variables, so estimating them is, in essence, mean estimation.
5.1 MC Basic
==Key point:== understand how to convert policy iteration into a model-free algorithm, which combines
- policy iteration
- Monte Carlo mean estimation
5.1.1 Converting policy iteration to a model-free algorithm
- Policy iteration has two steps in each iteration:
	- Policy evaluation: solve $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$ for $v_{\pi_k}$
	- Policy improvement: $\pi_{k + 1} = \arg \max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$
Elementwise form of the policy improvement step:
$$\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_{\pi_k}(s') \right] = \arg \max_\pi \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$$
The ==key== quantity is $q_{\pi_k}(s, a)$
Action value $q_{\pi_k}(s, a)$ can be computed in two ways:
- Expression that requires the model: $q_{\pi_k}(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_{\pi_k}(s')$
- Expression that does not require the model: $q_{\pi_k}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
Based on data / samples / experiences, equation (7), the expectation form, can be used to obtain $q_{\pi_k}(s, a)$ in a model-free way.
Therefore, the MC estimation of the action value works as follows:
- Starting from $(s, a)$, follow policy $\pi_k$ and sample an episode
- The return of this episode is denoted $g(s, a)$
- $g(s, a)$ is a sample of $G_t$
- Suppose we have a set of episodes and hence the returns $\{ g^{(j)}(s, a) \}$; then
$$q_{\pi_k}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{j = 1}^{N} g^{(j)}(s, a)$$
==Fundamental idea: When model is unavailable, we use data / sample / experience==
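A minimal sketch of this return-averaging estimate, assuming a hypothetical `sample_episode(policy, s, a)` that returns the list of rewards obtained by starting from $(s, a)$ and following $\pi_k$:

```python
def discounted_return(rewards, gamma):
    """g = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode's reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = gamma * g + r
    return g

def mc_action_value(sample_episode, policy, s, a, gamma=0.9, num_episodes=100):
    """Approximate q_pi(s, a) by the average return over sampled episodes."""
    returns = [discounted_return(sample_episode(policy, s, a), gamma)
               for _ in range(num_episodes)]
    return sum(returns) / len(returns)
```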
5.1.2 MC Basic algorithm
Given an initial policy $\pi_0$ (used mainly to generate samples for the subsequent policy iteration), the $k$th iteration has two steps:
- Step 1: policy evaluation
The key of this step is to obtain $q_{\pi_k}(s, a)$ for all $(s, a)$.
In practice, for each $(s, a)$, sample a finite but sufficiently large number of episodes and use the average of their returns to estimate $q_{\pi_k}(s, a)$.
- Step 2: policy improvement
This step is exactly the same as the usual policy improvement step.
The only difference from policy iteration is that $q_{\pi_k}(s, a)$ is estimated directly by averaging returns, rather than obtained by solving the Bellman equation for $v_{\pi_k}(s)$.
5.1.3 Pseudocode
Pseudocode: MC Basic algorithm (a model-free variant of policy iteration) |
---|
Initialization: Initial guess $\pi_0$ |
Aim: Search for optimal policy |
While the value estimate has not converged, for the $k$th iteration, do |
For every state $s \in \mathcal{S}$, do |
For every action $a \in \mathcal{A}(s)$, do |
Collect sufficiently many episodes starting from $(s, a)$ following $\pi_k$ |
MC-based policy evaluation step: |
$q_{\pi_k}(s, a) =$ average return of all the episodes starting from $(s, a)$ |
Policy improvement step: |
$a^*_k(s) = \arg \max_a q_{\pi_k}(s, a)$ |
$\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise |
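A sketch of the loop above under assumed interfaces: `states` is a list of states, `actions(s)` returns the available actions of a state, and `sample_episode(policy, s, a)` is a placeholder for environment interaction that returns one episode's reward list; none of these names come from the notes.

```python
def mc_basic(states, actions, sample_episode, gamma=0.9,
             num_iterations=10, episodes_per_pair=30):
    """Model-free policy iteration: evaluate q by averaging sampled returns,
    then improve the policy greedily."""
    # Start from an arbitrary deterministic policy: pick the first action of each state.
    policy = {s: actions(s)[0] for s in states}
    for _ in range(num_iterations):
        q = {}
        for s in states:                      # MC-based policy evaluation
            for a in actions(s):
                returns = []
                for _ in range(episodes_per_pair):
                    rewards = sample_episode(policy, s, a)
                    g = 0.0
                    for r in reversed(rewards):
                        g = gamma * g + r
                    returns.append(g)
                q[(s, a)] = sum(returns) / len(returns)
        for s in states:                      # greedy policy improvement
            policy[s] = max(actions(s), key=lambda a: q[(s, a)])
    return policy, q
```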
5.1.4 Summary
- MC Basic is a variant of policy iteration
- The model-free algorithm is built on top of the model-based one
- MC Basic reveals the core idea of MC-based RL, but it is too inefficient to be practical
- Why does MC Basic estimate $q_{\pi_k}$ rather than $v_{\pi_k}$?
Because $v_{\pi_k}$ cannot be used directly for policy improvement (the $\arg \max_a$ step); when the model is unavailable, we should estimate action values directly
- Convergence: since policy iteration converges, MC Basic is also guaranteed to converge given sufficiently many and sufficiently long episodes (episodes that are too short lead to short-sighted estimates; as the length approaches $\infty$ the estimates approach the true values, but infinitely long episodes are not required)
5.2 MC Exploring Starts
5.2.0 Preface
MC Basic:
Advantage: it clearly reveals the core idea
Disadvantage: it is not practical
Therefore, we need to make MC Basic more efficient.
5.2.1 Visits
Following policy $\pi$, suppose we sample an episode such as
$$s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots$$
Visit: each time a state-action pair appears in the episode counts as one visit of that pair
Method: MC Basic uses the initial-visit strategy
- In the example episode above (equation (9)), only the pair $(s_1, a_2)$ is considered, and the rest of the episode is used to compute $q_\pi(s_1, a_2)$
The problem with initial-visit is that it does not make full use of the data
- A more efficient way:
the same episode can also be used to estimate $q_\pi(s_1, a_2), q_\pi(s_2, a_4), q_\pi(s_2, a_3), q_\pi(s_5, a_1)$
This leads to two data-efficient methods:
- Method 1: first-visit method
In the policy evaluation step, collect all the episodes starting from a state-action pair, then use the average return to approximate the action value (this is the approach taken by MC Basic)
==Drawback:== the agent has to wait until all the episodes have been collected
- Method 2: every-visit method
Approximate the action value using a single episode
==Advantage:== the policy can be improved episode-by-episode
- Is Method 2 feasible? The estimate obtained from a single episode is clearly inaccurate.
In value iteration and truncated policy iteration, the policy evaluation step likewise does not iterate for $\infty$ steps to solve the Bellman equation, so the resulting $v_\pi$ is also inaccurate, yet convergence is still guaranteed in the end. Hence every-visit is a workable method. (A small sketch contrasting how first-visit and every-visit extract returns from a single episode follows below.)
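A small sketch of the first-visit / every-visit distinction as it is used in the pseudocode of Sections 5.2.2 and 5.3.3 (only the first occurrence of a pair contributes a return vs. every occurrence does); the episode is assumed to be a list of `(state, action, reward)` tuples, and the function name is ours:

```python
from collections import defaultdict

def returns_from_episode(episode, gamma=0.9, first_visit=True):
    """Collect the return g contributed by each (s, a) pair in one episode.

    episode: list of (state, action, reward) tuples in time order, where the
    reward is the one received after taking the action.  With first_visit=True
    only the first occurrence of a pair contributes; with first_visit=False
    every occurrence contributes (every-visit).
    """
    contributions = defaultdict(list)
    g = 0.0
    for t in reversed(range(len(episode))):   # walk backwards so g accumulates cheaply
        s, a, r = episode[t]
        g = gamma * g + r
        if first_visit and any((s, a) == (si, ai) for si, ai, _ in episode[:t]):
            continue                          # a later occurrence: skip under first-visit
        contributions[(s, a)].append(g)
    return contributions
```

For a pair that appears twice in one episode, every-visit records two returns while first-visit keeps only the earlier one.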
- This family of methods is collectively called
Generalized policy iteration (GPI)
- GPI is not one specific algorithm but a class of algorithms
- It refers to the general idea / framework of switching between the policy evaluation and policy improvement processes
- Most model-free and model-based algorithms are built on top of this framework
5.2.2 Pseudocode
Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic) |
---|
Initialization: Initial guess $\pi_0$ |
Aim: Search for an optimal policy |
For each episode, do |
Episode generation: Randomly select a starting state-action pair $(s_0, a_0)$ and ensure that all pairs can be possibly selected. Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \dots, s_{T - 1}, a_{T - 1}, r_T$ |
Policy evaluation and Policy improvement: |
Initialization: $g \leftarrow 0$ |
For each step of the episode, $t = T - 1, T - 2, \dots, 0$, do (note the ==descending order== here, a simple coding trick that improves computational efficiency) |
$g \leftarrow \gamma g + r_{t + 1}$ |
Use the first-visit method: |
if $(s_t, a_t)$ does not appear in $(s_0, a_0, s_1, a_1, \dots, s_{t - 1}, a_{t - 1})$, then |
$Returns(s_t, a_t) \leftarrow Returns(s_t, a_t) + g$ |
$q(s_t, a_t) = \text{average}(Returns(s_t, a_t))$ |
$\pi(a \mid s_t) = 1$ if $a = \arg \max_a q(s_t, a)$, and $\pi(a \mid s_t) = 0$ otherwise |
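A sketch of this episode-by-episode loop, assuming `states` is a list of states, `actions(s)` returns the available actions, and `generate_episode(policy, s0, a0)` is a placeholder that returns a list of `(s_t, a_t, r_{t+1})` tuples; these interfaces are ours, not from the notes.

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode, num_episodes=1000, gamma=0.9):
    """MC Exploring Starts: each episode begins at a randomly chosen (s0, a0)."""
    policy = {s: actions(s)[0] for s in states}
    returns = defaultdict(list)
    q = defaultdict(float)
    for _ in range(num_episodes):
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))             # exploring start: any pair can be chosen
        episode = generate_episode(policy, s0, a0)  # list of (s_t, a_t, r_{t+1})
        g = 0.0
        for t in reversed(range(len(episode))):     # descending order: g accumulates cheaply
            s, a, r = episode[t]
            g = gamma * g + r
            if any((s, a) == (si, ai) for si, ai, _ in episode[:t]):
                continue                            # first-visit check
            returns[(s, a)].append(g)
            q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(actions(s), key=lambda act: q[(s, act)])
    return policy, q
```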
5.2.3 MC Exploring Starts
What it means: sufficiently many episodes must be generated starting from every state-action pair
Both MC Basic and MC Exploring Starts rely on this assumption
Why?
- In theory, only when every state-action pair has been well explored can we correctly select the optimal action; conversely, if an action is never explored, that action may happen to be the optimal one and would be missed
- Although an episode that merely passes by an $(s, a)$ pair also provides information about it,
- there is no guarantee that every $(s, a)$ will be visited in this way,
- so we must ensure that every $(s, a)$ is used as a starting point
- In practice, exploring starts is hard to realize. In environments that involve physical interaction with the real world, it is difficult to collect episodes starting from every state-action pair
Facing this gap between theory and practice, how can we remove the exploring-starts requirement?
Emphasis: Exploring Starts means that every state-action pair must serve as a starting point; visits alone are not relied on in the MC Basic and MC Exploring Starts settings above, which means the episodes are still not fully exploited
5.3 MC without exploring starts: MC $\varepsilon$-Greedy
5.3.1 Soft policy
Policies can be divided into two types:
- Deterministic: e.g., greedy policies
- Stochastic: e.g., soft policies, $\varepsilon$-greedy policies
Because of this gap, how can we get rid of the exploring-starts requirement?
- A policy is called SOFT if the probability of taking any action is positive, i.e., any action may be selected
- Why soft?
	- With a soft policy, a few episodes that are sufficiently long are enough to visit every state-action pair sufficiently many times
	- Therefore, a large number of episodes starting from every state-action pair is no longer needed
	- In this way we can get rid of the exploring-starts requirement
5.3.2 $\varepsilon$-greedy policies
The $\varepsilon$-greedy policy is defined as
$$\pi(a \mid s) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|} \varepsilon, & a = a^* \\ \dfrac{1}{|\mathcal{A}(s)|} \varepsilon, & a \ne a^* \end{cases}$$
where $\varepsilon \in [0, 1]$, $a^*$ is the greedy action, and $|\mathcal{A}(s)|$ is the number of actions available in state $s$
The probability of choosing the greedy action is therefore always no smaller than that of any other action, because
$$1 - \frac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|} \varepsilon = 1 - \varepsilon + \frac{1}{|\mathcal{A}(s)|} \varepsilon \ge \frac{1}{|\mathcal{A}(s)|} \varepsilon$$
Why?
- $\varepsilon$-greedy can balance exploitation and exploration
- $\varepsilon = 0$: Greedy,less exploration, more exploitation
- $\varepsilon = 1$: uniform distribution, more exploration, less exploitation
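A minimal sketch of sampling an action from such a policy; `epsilon_greedy_action` is our name, and `q` is assumed to map `(state, action)` pairs to value estimates:

```python
import random

def epsilon_greedy_action(q, s, actions, epsilon):
    """Sample an action from the epsilon-greedy policy derived from q at state s."""
    acts = actions(s)
    greedy = max(acts, key=lambda a: q.get((s, a), 0.0))
    if random.random() < epsilon:
        return random.choice(acts)   # with prob eps: uniform over all actions (greedy included)
    return greedy                    # with prob 1 - eps: take the greedy action
# Resulting probabilities: greedy action 1 - eps * (|A(s)| - 1) / |A(s)|, others eps / |A(s)| each.
```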
How do we embed $\varepsilon$-greedy policies into the MC-based RL algorithms?
Recall the policy improvement step in MC Basic and MC Exploring Starts:
$$\pi_{k + 1}(s) = \arg \max_{\pi \in \Pi} \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$$
where $\Pi$ denotes the set of all possible policies. The optimal policy here is
$$\pi_{k + 1}(a \mid s) = \begin{cases} 1, & a = a^*_k \\ 0, & a \ne a^*_k \end{cases}$$
where $a^*_k = \arg \max_a q_{\pi_k}(s, a)$
In MC $\varepsilon$-greedy, the policy improvement step is changed to
$$\pi_{k + 1}(s) = \arg \max_{\pi \in \Pi_\varepsilon} \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$$
- (i.e., $\Pi$ is replaced by $\Pi_\varepsilon$)
where $\Pi_\varepsilon$ denotes the set of all $\varepsilon$-greedy policies with a fixed value of $\varepsilon$
The optimal policy now becomes
$$\pi_{k + 1}(a \mid s) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s)| - 1}{|\mathcal{A}(s)|} \varepsilon, & a = a^*_k \\ \dfrac{1}{|\mathcal{A}(s)|} \varepsilon, & a \ne a^*_k \end{cases}$$
- From equation (47) it can be seen that MC $\varepsilon$-greedy is almost identical to MC Exploring Starts, except that the policy improvement step uses $\varepsilon$-greedy policies
- As a result, the exploring-starts condition is no longer needed, but it must still be ensured, in a different way, that all state-action pairs are visited
5.3.3 Pseudocode
Pseudocode: MC $\varepsilon$-Greedy (a variant of MC Exploring Starts) |
---|
Initialization: Initial guess $\pi_0$ and the value of $\varepsilon \in [0, 1]$ |
Aim: Search for the optimal policy |
For each episode, do |
Episode generation: Randomly select a starting state-action pair $(s_0, a_0)$ and ensure that all pairs can be possibly selected. Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \dots, s_{T - 1}, a_{T - 1}, r_T$ |
Policy evaluation and Policy improvement: |
Initialization: $g \leftarrow 0$ |
For each step of the episode, $t = T - 1, T - 2, \dots, 0$, do |
$g \leftarrow \gamma g + r_{t + 1}$ |
Use the ==every-visit== method (difference 1: every-visit instead of first-visit): |
(difference 2: there is no first-visit if-condition) |
$Returns(s_t, a_t) \leftarrow Returns(s_t, a_t) + g$ |
$q(s_t, a_t) = \text{average}(Returns(s_t, a_t))$ |
Let $a^* = \arg \max_a q(s_t, a)$ and (difference 3: $\varepsilon$-greedy policy improvement) |
$\pi_{k + 1}(a \mid s_t) = \begin{cases} 1 - \dfrac{\lvert \mathcal{A}(s_t) \rvert - 1}{\lvert \mathcal{A}(s_t) \rvert} \varepsilon, & a = a^* \\ \dfrac{1}{\lvert \mathcal{A}(s_t) \rvert} \varepsilon, & a \ne a^* \end{cases}$ |
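A sketch of this per-episode update under our assumed conventions: the episode is a list of `(s_t, a_t, r_{t+1})` tuples, `returns` behaves like `collections.defaultdict(list)`, `q` is a dict, and `policy` maps `(state, action)` to a probability, all persisting across episodes.

```python
def mc_epsilon_greedy_update(episode, returns, q, policy, actions,
                             gamma=0.9, epsilon=0.1):
    """Every-visit evaluation plus epsilon-greedy improvement for one episode."""
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = gamma * g + r
        returns[(s, a)].append(g)            # every-visit: no first-visit check
        q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        acts = actions(s)
        a_star = max(acts, key=lambda act: q.get((s, act), 0.0))
        for act in acts:                     # epsilon-greedy policy improvement
            if act == a_star:
                policy[(s, act)] = 1 - (len(acts) - 1) / len(acts) * epsilon
            else:
                policy[(s, act)] = epsilon / len(acts)
```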
5.3.4 Optimality $vs.$ exploration
Comparison between $\varepsilon$-greedy and greedy policies:
Advantage: $\varepsilon$-greedy has stronger exploration ability, so exploring starts is no longer required
Disadvantage: $\varepsilon$-greedy is not optimal in general
The final policy produced by MC $\varepsilon$-greedy is optimal only within $\Pi_{\varepsilon}$, the set of all $\varepsilon$-greedy policies
(I have not fully understood this point yet)
$\varepsilon$ should not be too large
In practice, a smaller $\varepsilon$ keeps the resulting policy consistent with the greedy policy. Moreover, during training one can let $\varepsilon$ decay toward $0$, which preserves exploration in the early stage of training while ensuring that the final policy is optimal.
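A tiny sketch of one possible decay schedule (linear decay to a floor); the notes only state that $\varepsilon$ should decrease toward $0$, so the specific schedule and parameter values here are illustrative assumptions.

```python
def epsilon_schedule(episode_idx, eps_start=0.5, eps_end=0.0, decay_episodes=10_000):
    """Linearly decay epsilon from eps_start to eps_end over decay_episodes episodes."""
    frac = min(episode_idx / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```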
5.4 Summary
- Mean estimation by the Monte Carlo methods
- Three algorithms:
- MC Basic
- MC Exploring Starts
- MC $\varepsilon$-greedy
- Relationship among the 3 algorithms
- Optimality $vs.$ exploration of $\varepsilon$-greedy policies