Preface

Review of RL, based on Mathematical Foundations of Reinforcement Learning

1. Basic Concepts

1.1 State

  • state: the status of the agent with respect to the environment
  • state space: the set of all states $\mathcal{S} = \{ s_i \}^{N}_{i = 1}$

1.2 Action

  • action space of a state: the set of all possible actions at a state $\mathcal{A}(s_i) = \{a_i\}$
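To make this concrete, here is a minimal sketch (my own illustration, not from the book) of the state space and per-state action spaces for a hypothetical 2x2 grid world; the names `STATES` and `ACTIONS` are assumptions made for this example:

```python
# Hypothetical 2x2 grid world, labels s1..s4 (illustration only).
STATES = ["s1", "s2", "s3", "s4"]  # state space S = {s_1, ..., s_4}

# Action space A(s): the set of actions available at each state.
# Here every state happens to share the same five actions, but in
# general A(s) may differ from state to state.
ACTIONS = {s: ["up", "down", "left", "right", "stay"] for s in STATES}

print(ACTIONS["s1"])  # actions available at state s1
```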

1.3 State transition && state transition probability $p(s'|s,a)$
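As a sketch of what $p(s'|s,a)$ looks like in practice (continuing the hypothetical grid world above; the dict `P` and the helper `sample_next_state` are my own illustrative names):

```python
import random

# p(s'|s, a) as a nested dict: (state, action) -> {next_state: probability}.
# Most moves are deterministic; (s1, "right") is made stochastic just to
# show that a transition can be a distribution over next states.
P = {
    ("s1", "right"): {"s2": 0.9, "s1": 0.1},  # slips back with prob 0.1
    ("s1", "down"):  {"s3": 1.0},
    ("s1", "stay"):  {"s1": 1.0},
    # ... entries for the remaining (state, action) pairs
}

def sample_next_state(s, a):
    """Draw s' ~ p(.|s, a)."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_next_state("s1", "right"))
```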

1.4 Reward && Reward probability $p(r|s,a)$

  • Reward is one of the most distinctive concepts in RL: it is a scalar signal received after taking an action, through which we guide the agent to behave as we expect
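The reward model is the same kind of object as the transition model: a conditional distribution, here over scalar rewards. A small sketch (the reward values are made up for illustration):

```python
import random

# p(r|s, a) as a dict: (state, action) -> {reward: probability}.
# Reward design here is arbitrary: -1 for bumping into the boundary,
# 0 otherwise, +1 (sometimes) for a move toward the goal.
R = {
    ("s1", "up"):    {-1: 1.0},
    ("s1", "right"): {0: 1.0},
    ("s2", "down"):  {1: 0.8, 0: 0.2},  # rewards may themselves be stochastic
}

def sample_reward(s, a):
    """Draw r ~ p(.|s, a)."""
    dist = R[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_reward("s2", "down"))
```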

1.5 Trajectory, episode, return, discounted return

  • trajectory: state-action-reward chain
  • return: sum of all the rewards collected along the trajectory

Different policies give different trajectories.

  • discounted return: the sum of rewards weighted by powers of the discount rate $\gamma \in [0, 1)$, i.e. $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$ (see the sketch after this list)

Roles:

  1. Makes the sum finite even for an infinite trajectory (for bounded rewards, $\sum_{t=0}^{\infty} \gamma^{t} r_{\max} = \frac{r_{\max}}{1-\gamma}$)
  2. Balances the near && far future rewards:
    1. $\gamma \rightarrow 0$: discounted return dominated by near-future rewards (more short-sighted)
    2. $\gamma \rightarrow 1$: discounted return dominated by far-future rewards (more far-sighted)
  • episode: a trial, usually assumed to be a finite trajectory.
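To see both roles of $\gamma$ at once, here is a small worked computation (the reward sequence is arbitrary, chosen so that all the nonzero rewards arrive late):

```python
# Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ...
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 1, 1, 1]  # nonzero rewards only in the far future

# gamma close to 0: near-future rewards dominate, so the return is tiny here.
print(discounted_return(rewards, 0.1))   # ~0.00111
# gamma close to 1: far-future rewards are barely discounted.
print(discounted_return(rewards, 0.99))  # ~2.88
```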

How to treat episodes? A common way is to convert episodic tasks into continuing tasks by treating the terminal state as a special absorbing state that transitions only to itself with zero reward; then both cases can be described within the same framework.

1.6 Markov decision process

  • Sets
    • State: $\mathcal{S}$
    • Action: $\mathcal{A}(s)$ is associated with state $s \in \mathcal{S}$
    • Reward: $\mathcal{R}(s, a)$
  • Probability distribution:
    • State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s'|s,a)$
    • Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r|s,a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a|s)$
  • Markov property: the memoryless property, i.e. the next state and reward depend only on the current state and action, not on the earlier history: $p(s_{t+1} \mid s_t, a_t, \dots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$ (and similarly for the reward)
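Putting the pieces together, here is a minimal rollout sketch for a tiny made-up two-state MDP (not an example from the book; all names and numbers are assumptions). Each step samples $a \sim \pi(\cdot|s)$, $r \sim p(\cdot|s,a)$, $s' \sim p(\cdot|s,a)$ using only the current state, which is exactly the memoryless property:

```python
import random

STATES = ["s1", "s2"]                          # state space
ACTIONS = {s: ["stay", "go"] for s in STATES}  # action spaces A(s)

# State transition probabilities p(s'|s, a).
P = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s2": 0.9, "s1": 0.1},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "go"):   {"s1": 1.0},
}

# Reward probabilities p(r|s, a).
R = {
    ("s1", "stay"): {0: 1.0},
    ("s1", "go"):   {1: 0.5, 0: 0.5},
    ("s2", "stay"): {0: 1.0},
    ("s2", "go"):   {-1: 1.0},
}

# Policy pi(a|s): a distribution over actions at each state.
PI = {
    "s1": {"stay": 0.2, "go": 0.8},
    "s2": {"stay": 0.5, "go": 0.5},
}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def rollout(s, steps=5):
    """Sample a state-action-reward trajectory; each step uses only the
    current state (Markov property)."""
    trajectory = []
    for _ in range(steps):
        a = sample(PI[s])
        r = sample(R[(s, a)])
        s_next = sample(P[(s, a)])
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

print(rollout("s1"))
```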