5. MC

5.0 Preface

DP / Value iteration / Policy iteration: Model-Based

MC: Model-Free

  • How can we estimate something without a model?
    • Monte Carlo estimation

Example: Flip a coin

The outcome (head or tail) is determined by a random variable $X$:

  • Head: $X = +1$
  • Tail: $X = -1$

The goal is to compute $\mathbb{E}[X]$.

Method 1: Model-Based

  • Assume the probability model is known, i.e., $p(X = +1)$ and $p(X = -1)$ are given.
  • Then, by definition,
    $\mathbb{E}[X] = \sum_x x \, p(x) = (+1) \, p(X = +1) + (-1) \, p(X = -1)$

However, it is usually impossible to know the exact probability model in advance.

Method 2: Model-Free

  • Main idea: flip the coin many times and average the outcomes.
  • Suppose we obtain a sample sequence $\{ x_1, x_2, \dots, x_N \}$. The expectation can then be estimated as
    $\mathbb{E}[X] \approx \bar{x} = \frac{1}{N} \sum_{j = 1}^{N} x_j$

This is the idea of Monte Carlo estimation.
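
A minimal Python sketch of this idea (the coin simulator and the `p_head` parameter are assumptions used only to generate samples; the estimator itself never uses the probability model, it just averages the observed samples):

```python
import random

def mc_estimate_coin(num_flips: int, p_head: float = 0.5) -> float:
    """Estimate E[X] by averaging samples: X = +1 for head, -1 for tail."""
    samples = [1 if random.random() < p_head else -1 for _ in range(num_flips)]
    return sum(samples) / num_flips

# The estimate approaches the true value E[X] = 2 * p_head - 1 as N grows
# (law of large numbers).
for n in (10, 1_000, 100_000):
    print(n, mc_estimate_coin(n))
```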

Question: Is Monte Carlo estimation accurate?

  • When $N$ is small, the estimate is not accurate.
  • As $N$ increases, the estimate becomes increasingly accurate (==Law of Large Numbers==).
  1. MC refers to a broad class of methods that solve estimation problems by repeated random sampling.
  2. With MC, no model is required.
  3. Why do we care about mean estimation? Because state values and action values are both defined as expectations of random variables; estimating them is, in essence, mean estimation.

5.1 MC Basic

==Key point:== understand how to convert policy iteration into a model-free algorithm, based on two ingredients:

  • Policy iteration
  • Monte Carlo mean estimation

5.1.1 Convert policy iteration to be model-free

  • Policy iteration has two steps:
    • Policy evaluation: solve $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$
    • Policy improvement: $\pi_{k + 1} = \arg \max_\pi \left( r_\pi + \gamma P_\pi v_{\pi_k} \right)$

Elementwise form of the policy improvement step:

$\pi_{k + 1}(s) = \arg \max_\pi \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_{\pi_k}(s') \right] = \arg \max_\pi \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$

==The key== is $q_{\pi_k}(s, a)$.


  • The action value $q_{\pi_k}(s, a)$ can be computed in two ways:

  • Expression 1 requires the model:
    $q_{\pi_k}(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v_{\pi_k}(s')$

  • Expression 2 does not require the model:
    $q_{\pi_k}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$

Based on data / samples / experiences, the second (model-free) expression is what enables a model-free algorithm.


Therefore, the MC estimation of an action value proceeds as follows:

  1. Starting from $(s, a)$ and following policy $\pi_k$, sample an episode.
  2. The return of this episode is denoted $g(s, a)$.
  3. $g(s, a)$ is a sample of $G_t$.
  4. Suppose we have a set of such episodes, i.e., $\{ g^{(j)}(s, a) \}$. Then
    $q_{\pi_k}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{j = 1}^{N} g^{(j)}(s, a)$

==Fundamental idea: when the model is unavailable, we use data / samples / experience.==
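
A small Python sketch of this sampling-and-averaging procedure. The interfaces `env_step(state, action) -> (reward, next_state)` and `policy(state) -> action`, as well as the finite horizon, are assumptions made only for illustration:

```python
def sample_return(env_step, policy, s, a, gamma=0.9, horizon=50):
    """Sample one episode starting from (s, a) under `policy` and return
    its discounted return g(s, a)."""
    g, discount = 0.0, 1.0
    state, action = s, a
    for _ in range(horizon):
        reward, next_state = env_step(state, action)
        g += discount * reward
        discount *= gamma
        state, action = next_state, policy(next_state)
    return g

def mc_action_value(env_step, policy, s, a, gamma=0.9, horizon=50, num_episodes=100):
    """q_pi(s, a) is approximated by the average of sampled returns g^(j)(s, a)."""
    return sum(sample_return(env_step, policy, s, a, gamma, horizon)
               for _ in range(num_episodes)) / num_episodes
```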

5.1.2 MC Basic algorithm

Given an initial policy $\pi_0$ (used mainly for sampling and for the subsequent policy iteration), the $k$th iteration consists of two steps:

  • Step 1: policy evaluation

The key of this step is to obtain $q_{\pi_k}(s, a)$ for all $(s, a)$.

In practice, for every $(s, a)$, we sample a finite but sufficiently large number of episodes, and the average of the sampled returns is used to estimate $q_{\pi_k}(s, a)$.

  • Step 2: policy improvement

This step is no different from the usual policy improvement step; it is exactly the same.

The only difference is that $q_{\pi_k}(s, a)$ is estimated directly by averaging sampled returns, rather than obtained by solving the Bellman equation for $v_{\pi_k}(s)$.

5.1.3 Pseudocode

Pseudocode: MC Basic algorithm (a model-free variant of policy iteration)
Initialization: Initial guess $\pi_0$
Aim: Search for optimal policy
While the value estimate has not converged, for the $k$th iteration, do
 For every state $s \in \mathcal{S}$, do
  For every action $a \in \mathcal{A}(s)$, do
   Collect sufficiently many episodes starting from $(s, a)$ following $\pi_k$
   MC-based policy evaluation step:
   $q_{\pi_k}(s, a) = $ the average return of all the episodes starting from $(s, a)$
  Policy improvement step:
  $a^*_k(s) = \arg \max_a q_{\pi_k}(s, a)$
  $\pi_{k + 1}(a \mid s) = 1$ if $a = a^*_k(s)$, and $\pi_{k + 1}(a \mid s) = 0$ otherwise
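
A minimal Python sketch following this pseudocode (a fixed number of iterations replaces the explicit convergence check; the interface `env_step(state, action) -> (reward, next_state)` and the finite episode length are assumptions):

```python
def mc_basic(env_step, states, actions, gamma=0.9, num_iterations=50,
             num_episodes=30, horizon=50):
    """MC Basic: a model-free variant of policy iteration."""
    policy = {s: actions[0] for s in states}   # initial guess pi_0 (deterministic)
    q = {}
    for _ in range(num_iterations):
        # MC-based policy evaluation: estimate q_{pi_k}(s, a) for every (s, a).
        for s in states:
            for a in actions:
                total = 0.0
                for _ in range(num_episodes):
                    g, discount, state, action = 0.0, 1.0, s, a
                    for _ in range(horizon):
                        reward, next_state = env_step(state, action)
                        g += discount * reward
                        discount *= gamma
                        state, action = next_state, policy[next_state]
                    total += g
                q[(s, a)] = total / num_episodes
        # Policy improvement: greedy w.r.t. the estimated action values.
        for s in states:
            policy[s] = max(actions, key=lambda a: q[(s, a)])
    return policy, q
```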

5.1.4 Summary

  • MC Basic is a (model-free) variant of policy iteration.
  • Model-free algorithms are built on top of model-based ones.
  • MC Basic reveals the core idea of MC-based algorithms, but it is too inefficient to be practical.
  • Why does MC Basic estimate $q_{\pi_k}$ rather than $v_{\pi_k}$?

Because $v_{\pi_k}$ cannot be used directly for policy improvement (the $\arg \max_a$ step); when the model is unavailable, the action values should be estimated directly.

  • Convergence: since policy iteration converges, MC Basic is also guaranteed to converge given sufficiently many episodes for each state-action pair (episodes that are too short lead to short-sighted policies; as the episode length tends to $\infty$ the estimates approach the true values, but the episodes do not need to be infinitely long).

5.2 MC Exploring Starts

5.2.0 Preface

MC Basic:

  • Advantage: it clearly reveals the core idea.

  • Disadvantage: it is not practical.

Therefore, we need to make MC Basic more efficient.

5.2.1 Visits

Following a policy $\pi$, we sample an episode such as

$s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_4} s_1 \xrightarrow{a_2} s_2 \xrightarrow{a_3} s_5 \xrightarrow{a_1} \cdots$

  • Visit: every time a state-action pair appears in an episode, it is called a visit of that state-action pair.

  • Method: MC Basic uses the initial-visit method.

    • In the episode above, only the pair $(s_1, a_2)$ is considered, and the rest of the episode is used to compute $q_\pi(s_1, a_2)$.

The problem with the initial-visit method is that it does not make full use of the data.


  • A more data-efficient way:

From this single episode, we can also estimate $q_\pi(s_1, a_2), q_\pi(s_2, a_4), q_\pi(s_2, a_3), q_\pi(s_5, a_1)$.

This leads to two data-efficient methods:

  • Method 1: first-visit method

In the policy evaluation stage, collect all the episodes starting from a given state-action pair, then use the average return as an approximate estimate of the action value (this is the method used by MC Basic).

==Drawback:== the agent has to wait until all the episodes have been collected.

  • Method 2: every-visit method

Use a single episode to approximately estimate the action value.

==Advantage:== the policy can be improved episode-by-episode.


  • Discussion of Method 2: is this method feasible? The estimate obtained from a single episode is obviously inaccurate.

In value iteration and truncated policy iteration, the policy evaluation step does not iterate infinitely many times to solve the Bellman equation either, so the resulting $v_\pi$ is also inaccurate, yet convergence is still guaranteed in the end. Therefore, the every-visit method is feasible.

  • This class of methods is collectively called Generalized Policy Iteration (GPI):
    • It is not one specific algorithm but a class of algorithms.
    • It represents a general idea / framework of switching between policy evaluation and policy improvement.
    • Most model-free and model-based algorithms are built on this framework.

5.2.2 Pseudocode

Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)
Initialization: Initial guess $\pi_0$
Aim: Search for an optimal policy
For each episode, do
Episode generation: Randomly select a starting state-action pair $(s_0, a_0)$ and ensure that all pairs can possibly be selected. Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \dots, s_{T - 1}, a_{T - 1}, r_T$
Policy evaluation and Policy improvement:
Initialization: $g \leftarrow 0$
 For each step of the episode, $t = T - 1, T - 2, \dots, 0$, do (note the ==descending order== here: a simple coding trick that improves computational efficiency)
  $g \leftarrow \gamma g + r_{t + 1}$
  Use the first-visit method:
  if $(s_t, a_t)$ does not appear in $(s_0, a_0, s_1, a_1, \dots, s_{t - 1}, a_{t - 1})$, then
   $Returns(s_t, a_t) \leftarrow Returns(s_t, a_t) + g$
   $q(s_t, a_t) = \text{average}(Returns(s_t, a_t))$
    $\pi(a \mid s_t) = 1$ if $a = \arg \max_a q(s_t, a)$, and $\pi(a \mid s_t) = 0$ otherwise
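
A Python sketch of the per-episode update above, including the descending-order return computation and the first-visit check (the episode representation as `(s, a, r)` triples and the `returns` / `q` / `policy` dictionaries are assumptions):

```python
def exploring_starts_update(episode, q, returns, policy, actions, gamma=0.9):
    """Process one episode in descending order of t, as in the pseudocode."""
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)      # earliest index of each pair
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = gamma * g + r                      # discounted return from step t
        if first_visit[(s, a)] == t:           # (s_t, a_t) not seen earlier
            returns.setdefault((s, a), []).append(g)
            q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            # Greedy policy improvement for the visited state.
            policy[s] = max(actions, key=lambda act: q.get((s, act), 0.0))
```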

5.2.3 MC Exploring Starts

  • Concrete meaning: sufficiently many episodes must be sampled starting from every state-action pair.

  • Both MC Basic and MC Exploring Starts rely on this assumption.

Why?

  1. In theory, only when every state-action pair is well explored can we correctly select the optimal action; conversely, if an action is never explored, the action we miss may happen to be the optimal one.
    • Although it is also acceptable for an episode to merely pass by a certain $(s, a)$ pair,
    • we cannot guarantee that every $(s, a)$ will be visited in this way,
    • so we must ensure that episodes start from every $(s, a)$.
  2. In practice, exploring starts is hard to realize. In environments that require physical interaction with the real world, it is difficult to collect episodes starting from every state-action pair.

Facing this gap between theory and practice, how can we remove the exploring-starts requirement?

Emphasis: exploring starts means that episodes start from every state-action pair; relying on visits to cover the pairs is not used in the MC Basic and MC Exploring Starts algorithms above, which means the episodes are still not fully utilized.

5.3 MC without exploring starts: MC $\varepsilon$-Greedy

5.3.1 Soft policy

Policies can be divided into two kinds:

  1. Deterministic: greedy
  2. Stochastic: soft, $\varepsilon$-greedy

Because of the gap above, how do we get rid of exploring starts?

  • A policy is called soft if the probability of taking every action is positive, i.e., any action may be selected.
  • Why soft?
    • With a soft policy, a few sufficiently long episodes are enough to visit every state-action pair sufficiently many times.
    • Therefore, a large number of episodes starting from every state-action pair is no longer needed.
    • In this way, the exploring-starts requirement can be removed.

5.3.2 $\varepsilon$-greedy policies

An $\varepsilon$-greedy policy has the form

$\pi(a \mid s) = \begin{cases} 1 - \frac{ |\mathcal{A}(s)| - 1 }{ |\mathcal{A}(s)| } \varepsilon, & a = a^* \\ \frac{ \varepsilon }{ |\mathcal{A}(s)| }, & a \ne a^* \end{cases}$

where $\varepsilon \in [0, 1]$, $|\mathcal{A}(s)|$ is the number of actions of state $s$, and $a^*$ is the greedy action.

Hence, the probability of choosing the greedy action is always greater than that of any other action, because

$1 - \frac{ |\mathcal{A}(s)| - 1 }{ |\mathcal{A}(s)| } \varepsilon = 1 - \varepsilon + \frac{ \varepsilon }{ |\mathcal{A}(s)| } \ge \frac{ \varepsilon }{ |\mathcal{A}(s)| }$

Why?

  • $\varepsilon$-greedy balances exploitation and exploration:
    • $\varepsilon = 0$: greedy; less exploration, more exploitation.
    • $\varepsilon = 1$: uniform distribution; more exploration, less exploitation.
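
Below is a minimal Python sketch of sampling an action from such an $\varepsilon$-greedy policy; the `q` table, `actions` list, and dictionary-based interface are assumptions made for illustration:

```python
import random

def epsilon_greedy_action(q, s, actions, epsilon):
    """Sample an action: the greedy action gets probability
    1 - (|A(s)| - 1) / |A(s)| * epsilon, every other action gets epsilon / |A(s)|."""
    greedy = max(actions, key=lambda a: q.get((s, a), 0.0))
    n = len(actions)
    weights = [1 - (n - 1) / n * epsilon if a == greedy else epsilon / n
               for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```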

How? (How do we embed $\varepsilon$-greedy policies into the MC-based RL algorithms?)

Recall the policy improvement step of MC Basic and MC Exploring Starts:

$\pi_{k + 1}(s) = \arg \max_{\pi \in \Pi} \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$

where $\Pi$ denotes the set of all possible policies. The optimal policy here is

$\pi_{k + 1}(a \mid s) = \begin{cases} 1, & a = a^*_k \\ 0, & a \ne a^*_k \end{cases}$

where $a^*_k = \arg \max_a q_{\pi_k}(s, a)$.

The policy improvement step is now changed to

  • (replacing $\Pi$ with $\Pi_\varepsilon$)

$\pi_{k + 1}(s) = \arg \max_{\pi \in \Pi_\varepsilon} \sum_a \pi(a \mid s) q_{\pi_k}(s, a)$

where $\Pi_\varepsilon$ denotes the set of all $\varepsilon$-greedy policies with a fixed value of $\varepsilon$.

The optimal policy now becomes

$\pi_{k + 1}(a \mid s) = \begin{cases} 1 - \frac{ |\mathcal{A}(s)| - 1 }{ |\mathcal{A}(s)| } \varepsilon, & a = a^*_k \\ \frac{ \varepsilon }{ |\mathcal{A}(s)| }, & a \ne a^*_k \end{cases}$

  • From the expression above, MC $\varepsilon$-Greedy is almost the same as MC Exploring Starts, except that the policy improvement step uses $\varepsilon$-greedy policies.
  • As a result, the exploring-starts condition is no longer needed, but it must still be ensured, in a different way, that all state-action pairs are visited.

5.3.3 Pseudocode

Pseudocode: MC $\varepsilon$-Greedy (a variant of MC Exploring Starts)
Initialization: Initial guess $\pi_0$ and the value of $\varepsilon \in [0, 1]$
Aim: Search for the optimal policy
For each episode, do
Episode generation: Randomly select a starting state-action pair $(s_0, a_0)$ and ensure that all pairs can possibly be selected. Following the current policy, generate an episode of length $T$: $s_0, a_0, r_1, \dots, s_{T - 1}, a_{T - 1}, r_T$
Policy evaluation and Policy improvement:
Initialization: $g \leftarrow 0$
 For each step of the episode, $t = T - 1, T - 2, \dots, 0$, do
  $g \leftarrow \gamma g + r_{t + 1}$
  Use the ==every-visit== method (Difference 1: every-visit instead of first-visit):
  (Difference 2: no first-visit if-condition)
   $Returns(s_t, a_t) \leftarrow Returns(s_t, a_t) + g$
   $q(s_t, a_t) = \text{average}(Returns(s_t, a_t))$
   Let $a^* = \arg \max_a q(s_t, a)$ and (Difference 3: $\varepsilon$-greedy policy improvement)
   $\pi(a \mid s_t) = \begin{cases} 1 - \frac{ |\mathcal{A}(s_t)| - 1 }{ |\mathcal{A}(s_t)| } \varepsilon, & a = a^* \\ \frac{ \varepsilon }{ |\mathcal{A}(s_t)| } \varepsilon^0, & a \ne a^* \end{cases}$
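
A Python sketch of one episode-by-episode update, reflecting the three differences marked above (assumptions: episodes stored as `(s, a, r)` triples, `policy[s]` holding a dict of action probabilities):

```python
def mc_epsilon_greedy_update(episode, q, returns, policy, actions,
                             gamma=0.9, epsilon=0.1):
    """Every-visit update with epsilon-greedy policy improvement."""
    n = len(actions)
    g = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        g = gamma * g + r
        returns.setdefault((s, a), []).append(g)          # every-visit: no if-check
        q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
        a_star = max(actions, key=lambda act: q.get((s, act), 0.0))
        # Epsilon-greedy improvement instead of a deterministic greedy policy.
        policy[s] = {act: 1 - (n - 1) / n * epsilon if act == a_star
                     else epsilon / n for act in actions}
```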

5.3.4 Optimality $vs.$ exploration

  • Comparison between $\varepsilon$-greedy and greedy policies:

    • Advantage: $\varepsilon$-greedy has a stronger exploration ability, so exploring starts is not required.

    • Disadvantage: $\varepsilon$-greedy is in fact not optimal.

      • The final policy given by MC $\varepsilon$-Greedy is optimal only within the set $\Pi_{\varepsilon}$ of all $\varepsilon$-greedy policies.

        I have not fully understood this point yet.

      • $\varepsilon$ must not be too large.

  • In practice:

A smaller $\varepsilon$ keeps the policy consistent with the greedy policy. Moreover, during training, $\varepsilon$ can be designed to decay toward $0$; this preserves exploration during training while ensuring that the final policy is optimal.
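
A hedged illustration of such a decaying schedule (the particular exponential form and the constants are assumptions, not from the notes):

```python
def epsilon_schedule(episode_index, eps_start=1.0, eps_end=0.01, decay=0.999):
    """Epsilon shrinks exponentially from eps_start toward eps_end, so
    exploration dominates early and the policy approaches the greedy
    (and hence optimal) one late in training."""
    return max(eps_end, eps_start * decay ** episode_index)
```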

5.4 Summary

  • Mean estimation by the Monte Carlo methods
  • Three algorithms:
    • MC Basic
    • MC Exploring Starts
    • MC $\varepsilon$-greedy
  • Relationship among the 3 algorithms
  • Optimality $vs.$ exploration of $\varepsilon$-greedy policies