This article is a tutorial style discussion of the research paper titled “Distributional RL with Quantile Regression”. Here I try to reflect and convey the ideas of the paper with an intuitive style.

### Introduction

Supervised machine learning deals with learning from examples. This is in contrast to real-life where we often don’t have a proper set of examples to learn from. Reinforcement learning discipline is analogous to the learning mechanism of humans and animals where they learn from experience. The framework of reinforcement learning is designed in such a way that an agent starts with trying some actions, getting some rewards (positive or negative) and observations depending on the action. This is continued and in due course, eventually learns from trial and error to get higher rewards.

Whereas in supervised learning, the algorithm is provided with a diverse and/or balanced set of examples. But in reinforcement learning the agent is given rewards depending on whether the actions it took were desirable or not. However this brings up additional difficulties to the learning process. The rewards are often not instantaneous. Say for example your grades for midterm exams not only depends on what you did the day before but depends on what you did for many weeks before. Another one is analogous to opportunity cost. The restaurant you choose for the weekend might be good but at the same time you might be losing the chance to taste great food at many nearby restaurants at the same time.

Despite these challenges reinforcement learning algorithms have shown remarkable performance in autonomous driving, finance, healthcare etc. Moreover RL seems the best tool humans have towards the Artificial General Intelligence (AGI) endeavour. In benchmark tasks, distributional RL and its variants often beat other algorithms.

Folks from Deepmind introduced the concept of distributional reinforcement learning in their paper in 2017. Following this, a series of updates have been made in this domain. This paper is one in this series of publications that has quite a different take on a major fundamental idea in reinforcement learning -the reward.

In addition to that, the reward the agent receives is not an instantaneous one. The agent might be receiving a negative reward only after it has proceeded a certain number of steps from a wrong turning point. Despite these limitations, the field of reinforcement learning has achieved human-level performance in many tasks. On the other hand, these are exactly the points that make the problem formulation interesting and close to real-life tasks. This brings up the capabilities like decision making under uncertainty, handling delayed rewards, learning from experience, and no need for huge training data. All these make the field so promising towards the journey to Artificial General Intelligence (AGI).

### The Notion of Reward Distribution

With that being said, the research in this paper takes a different approach concerning discounted cumulative rewards (also called returns). When it comes to the problem formulation regarding rewards, it is opposite to that of classical reinforcement learning. In classical reinforcement learning the reward is observed as a series of values (mostly sparse) each time the agent takes an action. In distributional RL, instead of seeing it as a streaming value, the distribution of the return is made use of.

Although the expectation of discounted cumulative reward gives a single number, during the process it loses information like bimodality of the values. When it comes to distributional RL, it can make use of such information. Say, for example, a bimodal distribution might be telling a possibly high and low reward from a given state (depending upon what action you take further) and action. Although this might sound reasonable at first sight, there are some challenges to overcome to make it happen.

### Contraction Mapping: A Slight Detour to Topology

A key assumption in classical reinforcement learning algorithms are the value functions (both state or action) will be converging to its optimal values. In more mathematical terms, it is known as a contraction mapping theorem. This originates from the mathematical discipline of Topology where the distance between mathematical objects is measured using metric space. It can be thought of as, if after a mathematical operation, the metric distance between two points is reduced, compared to the scaled-down metric distance of the points before the operation then it is a contraction. In reinforcement learning the operator is the Bellman operator, the distance is the difference between state/action values, and the scaling is done by the discount factor. Since the Bellman operation was a contraction mapping all the traditional RL algorithms have the convergence property.

Now when the value function changes to a distribution, it has to satisfy the same properties to ensure convergence. In the seminal paper, the authors have used a metric called the Wasserstein metric with some additional modifications. These additional steps were required because of the difficulty in reducing Wasserstein metric loss using Stochastic Gradient Descent (SGD) methods. Hence they had to use a projection followed by KL divergence.

In this follow-up paper, the authors have used quantile regression to deal with these metric issues. This fills up the drawbacks like disjoint support, difficulty to do stochastic optimization. The results show the state of the art performance in most of the benchmark tasks.

### Summary

For me, the most interesting fact about these ideas is that these were helpful to analyze the learning mechanisms of the human brain. The recent works (A distributional code for value in dopamine-based reinforcement learning) have proven neural realizations of distributional learning in the human brain. In short, humans’ quest to create intelligence leads us to understand intelligence itself.