Abstract

This document contains my notes on selected topics from the book "Reinforcement Learning - An Introduction" by Richard Sutton and Andrew Barto.

Introduction

We want our agent to sense and act on its environment in order to optimize a goal. To do so, it will have to trade-off between eploration and exploitation.

Elementary Concepts

The main 4 elements of a reinforcement learning system are the policy, the reward function, the value function, and (optionally) the model of the environment.

Policy: it defines the behaviour of the agent. (eg, a mapping from states to actions)
Reward Function: it's the goal to optimize in the reinforcement learning problem.It can not be alterable by the agent, and maps each state to a reward.
Value Function: it specifies what is good in the long run, taking into account the states that are likely to follow.
Model: it provides an estimation of the environment. Models can be used for planning. (ie, evaluating future courses of actions, and deciding upon them)

Historical Review

Turing, 1950: First idea of learning by trial and error
Shannon, 1950: Computing a function that evolves through time to play chess
Minsky, 1954: (PhD Thesis) Computational models of RL, with neural nets (SNARCs), secondary reinforcers
Farley & CLarke, 1954: Trial and error with neural nets
Samuel, 1959: Checkers with temporal difference implementation
Minsky, 1961: 'Steps Towards AI', discussion of credit assignment problem
Michie, 1961, 63: Tic Tac Toe with trial and error (MENCACE)
Klopf, 1972, 75, 82: Differences between supervised and reinforcement learning
Witten, 1977: TD(0)
Barto, Sutton, & Anderson, 1983: Actor-Critic and solving Pole Balancing
Barto, 1984: (PhD Thesis) Extensive study of Actor-Critic
Sutton, 1988: TD(\(\lambda\))
Watkins, 1989: Q-Learning
Tesauro, 1992: TD-Gammon

Evaluative Feedback

\(n\)-armed bandit problem: given a state, you have can use any of the \(n\) levers. Optimize your sequence of actions.

The greedy method of selecting an action is to choose the one that maximizes \(Q_t(a)\) for action \(a\) at time \(t\). Instead, \(\epsilon\)-greedy uniformly samples over the set of actions with a small probability \(\epsilon\), and otherwise behaves like greedy. Generally speaking, the higher the variance in the reward the larger \(\epsilon\). We can also slowly decrease \(\epsilon\) while training, to speed it's convergence rate up. Softmax action selection passes the \(Q\) values in a Boltzmann distribution and samples accordingly. For action \(a\) at time \(t\) the probability is

\[ \frac{e^{Q_t(a) / \tau}}{\sum_{b \in A} e^{Q_t(b) / \tau}} \]

where \(\tau\) is the temperature parameter. (\(\tau = 0 \rightarrow\) greedy, \(\tau = \infty \rightarrow\) random choice)

Reinforcement learning receives feedback by evaluation as opposed to by instruction (= supervised). In the latter, instead of receiving a score you would receive what action would have been best, which is much more informative in the learning procedure. This can be seen as function values (rl) vs gradient methods (sup.) for optimization. With the gradient you know where to go within a single evaluation, whereas with value functions you need several points in order to obtain an estimate of the direction.

In a non-stationary problem (environment evolves through time), you want to use a constant step-size parameter (or learning rate) so that recent results are more heavily weighted than old ones. Doing so you obtain your update rule for timestep \(k\):

\[ Q_k = Q_{k - 1} + \alpha (r_k - Q_{k - 1}) = (1 - \alpha)^k Q_0 + \sum_{i = 1}^k \alpha (1 - \alpha)^{k - i} r_i \]

where \(r_k\) is the reward received at that timestep.

Reinforcement comparison methods track an average of the reward (called the reference reward) to assess the size of future rewards. This information is then used to modify the preference \(p_t\) of an action based on the size of the reward with \(p_{t+1}(a) = p_t(a) + \beta(r_t - \tilde{r}_t)\). The reference reward is updated with \(\tilde{r}_{t+1} = \tilde(r)_t + \alpha (r_t - \tilde{r}_t)\). Note that \(\alpha\) and \(\beta\) are step-size parameters.

Pursuit methods keep track of both action-values (\(Q_t(a)\)) and action preferences (\(p_t(a)\)). The action preference keeps pursuing the optimal action that is estimated by the action-value function. The update of the optimla action is given by \(p_t(a_t^*) = p_{t-1}(a_t^*) + \beta (1 - p_{t-1}(a_t^*))\), where \(a_t^* = argmax_a Q_t(a)\). The other actions are updated with \(p_t(a) = p_{t-1}(a) + \beta (0 - p_{t-1}(a))\) \(\forall a \neq a_t^*\)* *

The Reinforcement Learning Problem

Here we define the problem of reinforcement learning and develop its notation.

Reinforcement Leanring Setup

Basic Notation

The following is the notation that we will use to represent a reinforcement learning problem.

\(t \in [0, T]\): the current learning timestep, \(T\) the final one.
\(s_t \in S\): the current state from the set of possible states (\(S^+\) includes terminal states).
\(a_t \in A(s_t)\): the current action from the set of possible actions at the current state.
\(r_{t+1} \in R\): numerical reward following \(a_t\) in \(s_t\).
\(\pi_t(s, a)\): the policy, mapping from state and actions to probability of taking action \(a\) when in state \(s\).

Rewards should be a materialization of the objective to accomplish, not of a way to do that. (ie, make your rewards as pure as possible with respect to your goal.)

The goal of our agent will be to maximize the expected discounted reward

\[R_t = \sum_{i=0}^\infty \gamma^i r_{t + k + 1}\]

where \(\gamma \in [0; 1]\) is the discount rate. (\(\gamma = 0 \rightarrow\) maximizes immediate rewards, \(\gamma = 1 \rightarrow\) maximizes future rewards) Discounting is useful for continuing tasks, ones that do not have terminal states. (ie, can't be broken into episodes) When dealing with episodic tasks, it might be useful to introduce an additional subscript to specify the episode. (eg, \(s_{t, i}\), \(\pi_{t, i}\)) However, the distinction between episodes is rarely important in practice and that notation might be dropped. To further unify the view between continuiing and episodic tasks, we introduce an absorbing state at the end of episodic tasks, which loops to itself and always yields a reward of 0.

Markov Decision Processes

We assume our states to be Markov, that is there is as much information in the current state as in the whole history:

\[ P(s=s_{t+1}, a=a_{t+1} | s_t, a_t, r_t, s_{t-1}, a_{t-1}, r_{t-1}, \dots, s_0, a_0) = P(s=s_{t+1}, a=a_{t+1} | s_t, a_t)\]

A RL task that satisfies the Markov property is called a Markov Decision Process (MDP). A finite MDP has finite state and action spaces. The transition probabilities of an MDP from a state to antoher ())) = given an action) are given by \(P_{ss'}^a = P(s{t+1}=s'| s_t, a_t)\). Given the current action and state, as well as the next state, the expected reward is \(R_{ss'}^a = E[r_{t+1} | s_t=s, a_t=a, s_{t+1}=s']\). These quantities specify the most important aspects of the MDP. (only the distribution of reward around expected value is lost)

Value Functions

Reinforcement learning is about estimating the value of states of state-action pairs. We introduce some value functions.

The state-value function for policy \(\pi\) - the expected return when starting in \(s\) and following \(\pi\) - is given as

\[ V^\pi(s) = E_\pi[R_t | s_t = s] = E_\pi[\sum_{i=0}^\infty \gamma^i r_{t+i+1} | s_t=s] \]

The action-value function for policy \(\pi\) - the expected return starting from \(s\), taking \(a\), and then following \(\pi\) - is given as

\[ Q^\pi(s, a) = E_\pi[R_t | s_t=s, a_t=a] = E_\pi[\sum_{i=0}^\infty \gamma^i r_{t+i+1} | s_t=s, a_t=a] \]

One important aspect of the value functions is that they satisfy a recursive relationship, such that they express a relationship between the value of a state and the value of its successors. In particular, the Bellman equation for \(V^\pi\) is given as

\[ V^\pi(s) = E_\pi[R_t | s_t = s] = E_\pi[\sum_{i=0}^\infty \gamma^i r_{t+i+1} | s_t=s] \]

\[ = E_\pi[r_{t+1} + \gamma \cdot \sum_{i=0}^\infty \gamma^i r_{t+i+2} | s_t=s] \]

\[ = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma \cdot E_\pi[\sum_{i=0}^\infty \gamma^i r_{t+i+2} | s_t=s']) \]

\[ = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma \cdot V^\pi(s')) \]

The value function \(V^\pi\) is the unique solution to its Bellman equation.

Value functions define a partial ordering over policies, and thus there is always at least one policy that is better or equal to all the others for all states. We denote this optimal policy \(\pi^*\), and it has the optimal state-value function denoted \(V^*\) defined as: \(V^*(s) = \max_\pi V^\pi(s) \forall s \in S\). Optimal policies also share the same optimal action-value function \(Q^* (s, a) = \max_\pi Q^\pi(s, a) \forall s \in S, a \in A(s)\). Thus we have the following relationships

\[Q^*(s, a) = E[r_{t+1} + \gamma V^*(s_{t+1}) | s_t=s, a_t=a]\]

and

\[ V^*(s) = \max_{a \in A(s)} Q^*(s, a) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} (R^a_{ss'} + \gamma V^*(s')) \]

For finite MDPs, the Bellman optimality equation (right hand side of above equation) has a unique solution, independently of the policy. It can be represented as a system of equations with one varible and one equation for each state.

The beauty of \(V^*\) and \(Q^*\) is that since they recursively operate over a Markov Decision Process, greedily selecting the next action will lead to the optimal policy. Of course, since for interesting problems the state space is gigantic, we usually can't determine them and have to settle for an approximation.

Dynamic Programming

Dynamic programming provides a nice theoretical framework to think about more complicated algorithms. However, it only works on small, finite state and action spaces. Policy evaluation deals with learning the value function. This is done by iteratively updating a table of values for each state \(s\), using

\[V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V_k(s')) \text{ } \forall s \in S\]

This is also known as value iteration.

This can be implemented in-place, where for some states new values of neighboring states are used instead of the old ones.

Policy iteration consists of interleaving steps of evaluation and improvement until convergence. In the evaluation step, you check that your current policy is as good as the expectated state value.

The updates of the value function can be done in an asynchronous manner. That is, convergence properties are guaranteed by re-ordering the updates, so that some states have been updated more than others at a current time step. (we could updated them in parallel) Following this idea, it is then possible that some states need less value updates than others to obtain the ideal policy. A similar concept comes from generalized policy iteration, where partial policy improvements and policy evaluation steps (ie, only a subset of the states) follow each other.

Dynmic programming is relatively efficient considering that it only a polynomial amount of time to solve the reinforcement learning problem, where as a brute force approach would require polynomial time. (\(m^n\) for \(m\) actions and \(n\) states) In addition, this can be computationally improved using asynchronous methods.

Monte-Carlo Methods

State-Value Function

Go through each episode and average the expected reward from each state (\(V(s)\)).

First-visit: Average only the first bisit of the state.
All-visit: Include all visits of the state in the average.

Action-Value Function

Same as above, but consider action-state pairs instead of just states. Advantage is that it doesn't require a model of the environment. The disadvantage is that it requires more samples since there are more (a, s)-pairs to consider. In addition, some pairs might never be visited. Solution: Initially start at a random (a, s)-pair every episode (aka, exploring starts), or use stochastic policies.

\(\epsilon\)-soft policies: \(\forall s \in S \forall a \in A: \pi > 0\)

\(\epsilon\)-greedy policies: Assign to non-optimal actions \(a\) \(\pi(s, a) = \frac{\epsilon}{|A|}\) and for the current optimal one \(\pi (s, a) = 1 - \epsilon + \frac{\epsilon}{|A|}\)

On-policy MC Control: Use \(\epsilon\)-greedy as policy.

\begin{algorithm}
                \caption{$\epsilon$-soft on-policy Monte Carlo control algorithm}
                \begin{algorithmic}
                    \State Initialize for all $s in S, a \in A$
                    \State $Q(s, a) = $ arbitrary
                    \State $\pi$ an arbitrary $\epsilon$-soft policy
                    \While{True}
                        \State Generate an episode using $\pi$.
                        \For{each pair $(s, a)$ in the episode}
                            \State $R =$ the return following the first occurrance of $(s, a)$.
                            \State Append $R$ to Returns$[s][a]$ the return followng the first occurrance of $(s, a)$.
                            \State $Q(s, a) =$ Average(Returns$[s][a]$)
                        \EndFor
                        \For{each $s$ in the episode}
                            \State $a^*$ = arg $\max_a Q(s, a)$
                            \For{all $a \in A(s)$}
                                \If{$a = a^*$}
                                    \State $\pi (s, a) = 1 - \epsilon + \frac{\epsilon}{|A|}$ 
                                \Else
                                    \State $\pi(s, a) = \frac{\epsilon}{|A|}$
                                \EndIf
                            \EndFor
                        \EndFor
                    \EndWhile
                \end{algorithmic}
            \end{algorithm}

Reinforcement Learning

by Seb Arnold