The Center for Brains, Minds and Machines (CBMM) is a premier institute and NSF Science and Technology Center dedicated to the study of intelligence. The institute's website and YouTube channel contain a good number of tutorials and educational materials for anyone interested in topics such as neuroscience and machine intelligence. One of the good talks from CBMM is an RL tutorial.

This post is a review of the RL tutorial available on their platform and YouTube channel. The tutorial gives a quick and crisp introduction to reinforcement learning and is a good resource for anyone starting to learn about RL. The tutorial video is given below.

## RL Tutorial

### Summary

In this post I am going to summarize the tutorial and brainstorm some questions that came to my mind while watching the lecture.

Xavier Boix starts by explaining reinforcement learning, in which the agent learns by interacting with the environment. One question I have here is why only the agent learns; in reality, the environment could also be another agent with its own interests, which might also be learning. The problem formulation assumes the agent learns while the environment responds to the agent's actions.

One criticism RL often faces concerns its application area: it is said to mostly solve problems in board and computer games. What are some of the cool applications of RL in industry? I remember listening to a podcast where researchers from Microsoft Research explained a successful use case of RL on websites in place of A/B testing, which has a large business incentive compared to other traditional methods. Another case is automated electronic circuit design.

The problem definition talks about discrete time steps (the presenter uses the term discrete timing). Are there continuous counterparts of this concept? Another question that came to me is the order in which the action, reward, and observation happen. Is the agent taking an action first and then receiving the reward and observation? He explains it the other way: the environment receives an action at time $t$ ($a_t$), then emits an observation ($O_{t+1}$) and reward ($r_{t+1}$) in the next step. Is this strictly the convention, or can it be changed?
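The agent-environment loop described here can be sketched in a few lines. This is a minimal sketch with a made-up one-observation environment (`ToyEnv` and its rewards are my own invention for illustration, not from the talk):

```python
import random

class ToyEnv:
    """A made-up environment: action 1 pays reward 1, action 0 pays nothing."""
    def step(self, action):
        # The environment receives a_t, then emits o_{t+1} and r_{t+1}.
        observation = 0  # a single dummy observation
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

env = ToyEnv()
observation, total_reward = 0, 0.0
for t in range(10):  # discrete time steps t = 0, 1, 2, ...
    action = random.choice([0, 1])          # the agent picks a_t
    observation, reward = env.step(action)  # env returns o_{t+1}, r_{t+1}
    total_reward += reward
```

Note how the ordering question is answered by the loop structure itself: the action goes in, and the next observation and reward come back.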

Next he explains the term experience. According to him, it is the series of observation, reward, and action triplets. Then he explains the state: the current state is a summary of the experience.

In video games the observation is the set of screen images and the reward is a score. How does reinforcement learning relate to supervised learning? In supervised learning, observations are emitted independently of the agent's previous actions and the sequence length is one. The feedback is immediate and there is no delayed reward.

Policy: A policy is a way of deciding an action in accordance with the current situation. The policy is a map from the state to the action.

$a_t= \pi(s_t)$ (deterministic)

$\pi(a_t|s_t) = P(a_t|s_t)$ (stochastic)

Here the chance of selecting an action for a given state is not fixed and is subject to probabilistic conditions. How is the reward function designed? Does it involve a human judging how rewarding an action was, or is it estimated through some formula?
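The deterministic and stochastic cases can be sketched in code (the two-action setup and the 0.8 probability are invented numbers, just to make the distinction concrete):

```python
import random

def deterministic_policy(state):
    # a_t = pi(s_t): the same state always yields the same action
    return 1 if state > 0 else 0

def stochastic_policy(state):
    # pi(a_t | s_t) = P(a_t | s_t): the action is sampled from a
    # state-dependent probability distribution
    p_action_1 = 0.8 if state > 0 else 0.2
    return 1 if random.random() < p_action_1 else 0
```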

Value Function: This essentially tells how much reward I will get starting from state $S_t$.

$V^\pi(S_t)=E_\pi[r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+…|S_t]$

Here $0 \leq \gamma < 1$ is known as the discount factor.

I noticed here that the value function depends on the policy being followed. It uses the expectation operator. Unlike in signal processing, the time steps run forward $(t, t+1, t+2, \dots)$. If it runs from the present into the future, how can it guess future rewards? I suppose the expectation is what handles this: it averages the returns over possible future trajectories under the policy rather than predicting any one of them exactly.
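For a single sampled trajectory, the discounted sum inside the expectation is easy to compute (the reward values below are arbitrary); the value function is then the average of such returns over many trajectories drawn under the policy:

```python
def discounted_return(rewards, gamma=0.9):
    # r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# One sampled trajectory of rewards from time t onward:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```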

Next, he mentions the Markovian assumption: $P(s_{t+1}|s_t, \pi(s_t))$. The next state can be modeled from the current state and the action alone.

The Q-function: He explains it very similarly to the earlier formula. The Q function takes both the state and the action as input. The expectation operator disappears. Maybe this is because conditioning on the action removes the uncertainty. Just to recap, RL finds a policy that maximizes rewards.

The next session of the talk is handled by Yen-Ling Kuo, who gives the big landscape of RL. **Value-based RL**: Estimate the expected returns to choose the optimal policy. **Policy-based RL**: Directly optimize the policy to get a good return. **Model-based RL**: Build a model of the environment and choose an action based on the model. When to use each of these types of RL is a question.
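To make the value-based category concrete, here is a tiny tabular Q-learning loop on a made-up two-state chain (the environment, learning rate, and episode count are all my own assumptions, not an example from the talk):

```python
import random

random.seed(0)

# Made-up 2-state chain: action 1 moves 0 -> 1, then 1 -> goal (reward 1);
# action 0 goes back to the start. step() returns (next_state, reward, done).
def step(state, action):
    if action == 1:
        return (1, 0.0, False) if state == 0 else (1, 1.0, True)
    return (0, 0.0, False)

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy over the current Q estimates
        if random.random() < eps:
            action = random.choice([0, 1])
        else:
            action = max((0, 1), key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in (0, 1)))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

# The greedy policy read off from Q now picks action 1 in both states.
```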

Then she shows a toy example in which, by optimizing the Q function, the policy is updated. Next, the working of policy-based RL is given a shot. We know that in policy-based RL we directly optimize the policy to get a good return. This is done by parameterizing the policy with $\theta$. Then the goal becomes finding the best $\theta$.

$\pi(a_t|S_t) = P(a_t|S_t) \rightarrow \pi_{\theta}(a_t|S_t) = P(a_t|S_t;\theta)$

How can $\theta$ be updated? She shows an objective function to evaluate $\theta$, which again uses an expectation operator on the sum of rewards.

$J(\theta) = E_{\tau \sim \pi_{\theta}}\left[\sum_t r(a_t, S_t)\right]$

Having both an expectation and a summation here seemed strange to me at first; the summation is over the time steps within a single trajectory $\tau$, while the expectation is over trajectories sampled from the policy. Next, by computing the gradient of the objective function, $\theta$ can be updated.

$\theta \leftarrow \theta + \eta \nabla_{\theta} J(\theta)$

Since we want to maximize the objective function, gradient ascent is used here. (The slide writes $\delta_{\theta}$; the symbol should be the gradient $\nabla_{\theta}$, i.e. the delta turned upside down.)
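A minimal policy-gradient sketch, assuming a softmax policy over two bandit arms (REINFORCE-style; the reward values and step size $\eta$ are invented for illustration):

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]  # one parameter per action; pi_theta = softmax(theta)
eta = 0.1           # step size for gradient ascent

def policy_probs(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reward(action):
    # Arm 1 pays more than arm 0 (made-up numbers).
    return 1.0 if action == 1 else 0.2

for _ in range(500):
    probs = policy_probs(theta)
    action = 0 if random.random() < probs[0] else 1
    r = reward(action)
    # For a softmax policy, grad of log pi_theta(action) wrt theta[a]
    # is one_hot(action) - probs[a]; ascend along r * grad.
    for a in range(2):
        grad_log = (1.0 if a == action else 0.0) - probs[a]
        theta[a] += eta * r * grad_log

# The learned policy now strongly prefers the better arm.
```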

**Model-based RL:** When the rule is known or easy to learn, we can learn the model. This in essence becomes a supervised learning problem. How do we select a policy using the model? Two ways are shown. One is to iteratively compute the value and optimal action for each state. The other is to sample experience from the model.
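The first option, iteratively computing values per state when the model is known, is classic value iteration. A sketch on a hypothetical three-state chain (the transition table and rewards are made up):

```python
# Known model: P[s][a] = (next_state, reward); state 2 is terminal.
P = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (2, 1.0)},
}
gamma = 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0}

# Sweep the Bellman optimality update until the values settle.
for _ in range(50):
    for s in P:
        V[s] = max(r + gamma * V[s2] for s2, r in P[s].values())

print(V[1])  # 1.0  (take action 1, reach the goal)
print(V[0])  # 0.9  (= gamma * V[1])
```

Since the model is known, no environment interaction is needed at all; this is what makes the model-based route so sample-efficient.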

**Exploration vs. Exploitation:** In exploration one may need to sacrifice short-term returns for future rewards. Exploration: doing things you haven't done before to collect more information. Exploitation: taking the best action given the current information. Common approaches are epsilon-greedy and probability matching. Epsilon-greedy adds noise to the greedy policy. Probability matching selects actions according to the probability that they are best.
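Epsilon-greedy, "adding noise to the greedy policy", is only a few lines (the Q-values here are placeholders):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore uniformly at random;
    # otherwise exploit the action with the highest estimated value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.5, 0.3]                    # placeholder action-value estimates
best = epsilon_greedy(q, epsilon=0.0)  # pure exploitation picks action 1
```

With `epsilon=0.0` this always exploits; raising `epsilon` trades some immediate return for information about the other actions.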

**Sample Efficiency:** How many samples do we need to get a good policy? On-policy methods need to generate new samples every time the policy is changed. Off-policy methods can improve the policy without generating new samples from that policy.

In terms of sample efficiency, model-based RL is more efficient and policy-gradient RL is less efficient. Next, she explains different types of RL algorithms. One is the actor-critic algorithm: the critic estimates the value of the current policy, and the actor updates the policy in a direction that improves the value function.

**Deep Q-Network:** approximates the Q function using deep networks. Again the instructor changes, and the talk goes into the inner workings of the Deep Q-Network. It was introduced by the DeepMind team in the 2015 Nature paper "Human-level control through deep reinforcement learning". He quickly recaps the value function and the Q function. The optimal value function is the maximum achievable value.

$Q^*(S_t, a_t) = \max_\pi Q^\pi(S_t, a_t)$

The agent can act optimally with the optimal value function:

$\pi^*(s) = \arg\max_{a} Q^*(s, a)$
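Acting optimally with $Q^*$ is then just an argmax over actions. With a hypothetical tabular $Q^*$ (the states, actions, and values are invented):

```python
# Hypothetical optimal action-value table: Q_star[state][action]
Q_star = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.9, "right": 0.4},
}

def optimal_policy(state):
    # pi*(s) = argmax_a Q*(s, a)
    return max(Q_star[state], key=Q_star[state].get)

print(optimal_policy("s0"))  # right
print(optimal_policy("s1"))  # left
```

In DQN the table is replaced by a network, but the greedy readout over actions stays the same.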

In a Deep Q-Network the value function is represented using a neural network with weights $w$. Then he explains how the DNN is blended with RL, what the inputs and outputs are, etc.

The nature of possible use cases is discussed in the concluding remarks: deep RL works very well in domains governed by simple or known rules, and it can learn simple skills from raw sensory input given enough experience.

The challenges in RL: humans can learn quickly, whereas deep RL methods are usually slow and need much more data. Humans can reuse past knowledge, which motivates transfer learning in deep RL (transfer across problem instances and from simulation to the real world). Defining reward functions is a challenge. Composition of tasks and the safety of learned policies are also challenges.

You can find more articles on reinforcement learning here.