RL Notes Chapter 2

Written on August 26, 2016

N-Armed Bandits

Definitions:

Value of an action is the mean expected reward obtained by performing that action.
Action-Value Methods: The value of taking the action \(a\) at time \(t\) is given by: \begin{align} Q_t(a) = \frac{r_1 + r_2 + \ldots + r_k}{k} \end{align} Here \(k\) is the number of times action \(a\) has already been selected i.e. in previous plays. This is known as sample-average method of estimating the value of an action.
Selecting the best action using sample-average method is a perfectly reasonable. This is called greedy section.
A tweak on greedy selection is to select a non-best action in some small (\epsilon) portion of plays. This is called \(\epsilon\)-greedy.
The logical extension is obviously selecting the action using the values of all the actions \(\mathcal{A}\) at time \(t\) and applying a softmax over these values. This will result is a nice distribution over the actions. It’s now perfectly reasonable to select or sample an action from this distribution. This is known as softmax-action selction. \begin{align} P(a) = \frac{\exp(Q_t(a)/\tau)}{\sum_{b\in\mathcal{A}} \exp(Q_t(b)/\tau)} \end{align} Here \(\tau\) is a hyperparameter called the temperature. As \(\tau\) is increased the selection is more uniform and as tau is reduced the selection is more greedy.
The averaging based methods of selection assume stationarity. If we want to constantly track changing values section 2.5 details how averaging is essentially an incremental update with changing (reducing) step sizes. If the step size is kept constant (or changed by some other heuristic) we no longer implicitly assume stationarity.
Initial values: optimistic initial values will make action-value methods explore more. This is because the reward for time \(0\) was high (initial value) and the reward at time \(1,2,\ldots\) is going to be lower (real rewards). This will force the learner to try more actions since the recent actions did not perform as well as the initial actions.

Some useful distinctions I found:
Difference between n-armed-bandit testing and A/B testing