# RL Notes Chapter 2

### N-Armed Bandits

Definitions:

- *Value* of an *action* is the expected (mean) reward obtained by performing that action.
- Action-value methods: the value of taking action \(a\) at time \(t\) is estimated by:
  \begin{align} Q_t(a) = \frac{r_1 + r_2 + \ldots + r_k}{k} \end{align}
  Here \(k\) is the number of times action \(a\) has already been selected, i.e. in previous plays, and \(r_1, \ldots, r_k\) are the rewards those plays returned. This is known as the *sample-average* method of estimating the value of an action.
- Selecting the action with the highest *sample-average* estimate is perfectly reasonable. This is called *greedy* selection.
- A tweak on *greedy* selection is to select a random action, rather than the current best one, in some small fraction \(\epsilon\) of plays. This is called *\(\epsilon\)-greedy* selection.
- A natural extension is to select the action using the estimated values of all the actions at time \(t\), applying a *softmax* over these values. This results in a probability distribution over the actions, from which an action can be sampled. This is known as *softmax action selection*.
  \begin{align} P(a) = \frac{\exp(Q_t(a)/\tau)}{\sum_{b\in\mathcal{A}} \exp(Q_t(b)/\tau)} \end{align}
  Here \(\tau\) is a hyperparameter called the *temperature*. As \(\tau\) is increased the selection becomes more uniform, and as \(\tau\) is reduced the selection becomes more greedy.
- The averaging-based methods of selection assume *stationarity*. Section 2.5 details how averaging is essentially an incremental update with a changing (shrinking) step size of \(1/k\). If we want to constantly *track* changing values, we can keep the step size constant (or change it by some other heuristic), in which case we no longer implicitly assume stationarity.
- Initial values: optimistic initial values will make action-value methods explore more. This is because each estimate starts out high (the initial value) while the rewards actually received are lower (real rewards), so an action's estimate drops as soon as it is tried. This forces the learner to try more actions, since the actions it just tried did not perform as well as the initial values of the untried ones suggest.
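The selection rules and the update described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names (`greedy`, `epsilon_greedy`, `softmax_probs`, `softmax_select`, `update`, `run`) are made up for these notes, and the toy bandit in `run` assumes Gaussian rewards with unit variance.

```python
import math
import random


def greedy(q):
    """Index of the highest estimate (ties go to the lowest index)."""
    return max(range(len(q)), key=lambda a: q[a])


def epsilon_greedy(q, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return greedy(q)


def softmax_probs(q, tau):
    """P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    m = max(q)  # subtracting the max keeps the ratios and avoids overflow
    weights = [math.exp((v - m) / tau) for v in q]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(q, tau, rng=random):
    """Sample an action from the softmax distribution."""
    return rng.choices(range(len(q)), weights=softmax_probs(q, tau))[0]


def update(q, counts, action, reward, alpha=None):
    """Move Q(action) toward the reward, in place.

    alpha=None uses step size 1/k (sample average, assumes stationarity);
    a numeric alpha uses a constant step size (tracks changing values).
    """
    counts[action] += 1
    step = 1.0 / counts[action] if alpha is None else alpha
    q[action] += step * (reward - q[action])


def run(true_means, steps=2000, epsilon=0.1, q0=0.0, seed=0):
    """Toy bandit (assumed: Gaussian rewards, unit variance)."""
    rng = random.Random(seed)
    q = [q0] * len(true_means)  # q0 > typical reward gives optimistic init
    counts = [0] * len(true_means)
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        r = rng.gauss(true_means[a], 1.0)
        update(q, counts, a, r)
    return q, counts


if __name__ == "__main__":
    estimates, pulls = run([0.2, 0.8, 0.5])
    print(estimates, pulls)
```

Setting `epsilon=0.0` gives plain greedy selection. Starting from a high `q0` shows the optimistic-initial-values effect: after one real (lower) reward an arm's estimate falls below the untouched arms, so even a greedy learner moves on to them. Note that with the `1/k` step the first reward fully replaces the initial value; a constant `alpha` lets the optimism fade more gradually.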
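To make the step-size remark concrete, here is a short worked derivation. The superscript notation \(Q^{(k)}\) (the estimate for one fixed action after its \(k\)-th reward) and the constant step size \(\alpha\) are introduced here only for illustration; they are not notation taken from these notes. The sample average can be rewritten as an incremental update with step size \(1/k\):

\begin{align}
Q^{(k)} &= \frac{1}{k}\sum_{i=1}^{k} r_i \\
        &= \frac{1}{k}\left(r_k + (k-1)\,Q^{(k-1)}\right) \\
        &= Q^{(k-1)} + \frac{1}{k}\left(r_k - Q^{(k-1)}\right)
\end{align}

Replacing \(1/k\) with a constant \(\alpha \in (0, 1]\) and unrolling the recursion gives an exponentially weighted average in which recent rewards count the most, which is why a constant step size can follow values that drift:

\begin{align}
Q^{(k)} &= Q^{(k-1)} + \alpha\left(r_k - Q^{(k-1)}\right) \\
        &= (1-\alpha)^{k}\,Q^{(0)} + \sum_{i=1}^{k} \alpha\,(1-\alpha)^{k-i}\,r_i
\end{align}

The \((1-\alpha)^{k} Q^{(0)}\) term also shows how the initial value keeps influencing the estimate under a constant step size, with a weight that decays geometrically.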

Some useful distinctions I found:

Difference between n-armed-bandit testing and A/B testing