### Introduction

Probability distributions are cornerstones in the theory of machine learning. The entire field of machine learning is often theoretically viewed or explained from a probabilistic perspective. Compared to the linear algebraic point of view, the probabilistic perspective gives formulations of machine learning algorithms that are more impressive.

Understanding probability distributions and their behavior are essential for anyone who wants to learn about the field. Of the many different types of probability distributions, the binomial is one of the simplest which forms a stepping stone for many other advanced concepts. In this post, we will understand the intuition behind the** binomial distribution** and formulate its equation form.

### Probability Distributions

Machine learning deals with forming approximate solutions to problems where the patterns in data are leveraged. The Probability distribution can be thought of as a graph/function representing how often a certain value can occur for a variable. If we take the marks of students in a class, we know there are few toppers in the class as well as few students who are poor in marks. A majority of students will be in the average range. When represented in the form of a graph we have the various marks possible (e.g. 0-100) on the x-axis and the number of students/percentage of students who got that mark in the y-axis. In this case, we know that the pattern in the data is like an inverted ‘U’ shape. This pattern is particular to the exam mark dataset.

Now let us take a hypothetical situation. **What if there was no pattern in the data? **How would the probability distribution look like, if there is no pattern among the students when it comes to the mark they score ?. As you guessed it will look like a straight line parallel to the x-axis. This is because any mark between zero to one hundred is equally likely. Although this is quite possible when it comes to a selective group of highly competitive schools or students, much of machine learning thinks it as a less interesting pattern to consider.

### Bernoulli Distribution: A Precursor to Binomial Distribution

Perhaps the easiest probability distribution to understand would be the Bernoulli distribution. The binomial distribution can be thought of as an extended version of the binomial distribution. To understand this let us take situations where only two possible outcomes are valid.

We have a lot of such examples every day. Raining or not raining, passing or failing an exam, tossing a coin, etc. We can assign a value between 0 to 1 for one outcome and one minus that value to the other outcome. For example in the rain example, if I gave 0.7 for raining then automatically (1-0.7) =0.3 will be the value for not raining.

**How do we get these values?**. We can estimate these by measuring the number of rainy days and non-rainy days in a pretty enough long period (e.g. years). We are essentially representing the pattern of rain in this probability distribution. This data can be specific to a region and can vary depending upon the period. Howewer the essential idea is that we have a mathematical representation of the pattern of raining or phenomenon under consideration.

### Binomial Distribution

Here the number of possible outcomes are two only and this was a simple example. What if there are more than two outcomes are possible?. One such example is the number of heads possible when throwing a coin 4 times. Since we are repeating the tossing 4 times, the number of heads we can get are 0,1,2,3,4. Therefore a total of 5 outcomes are possible. The question now is what are the probability values associated with each of these outcomes. What will be an equation that maps the possible outcomes to its probability values. Let’s try to derive it using an example.

### Figuring out the Binomial Equation

Consider tossing a coin three times. We can list out all the possible outcomes which are HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. A total of eight outcomes are possible. If we count the number of heads in each outcome we get 3,2,2,1,2,1,1,0. Without repetition we have four values possible (0,1,2,3).

For ease of clarity, let’s assume that the coin is not fair. Therefore let’s assume that the probability of getting heads is 0.4 and tails is 0.6. We are doing this because If we assume a fair coin the probability will be the same (0.5) for both heads and tails and it won’t be obvious to explain the pattern and connect it to the binomial equation.

Now computing the total probability of a specific event will be simply multiplying the probability values of respective heads/tails. Let us write it down in table form.

Probability | Pattern |

0.4*0.4*0.4 | HHH |

0.4*0.4*0.6 | HHT |

0.4*0.6*0.4 | HTH |

0.4*0.6*0.6 | HTT |

0.6*0.4*0.4 | THH |

0.6*0.4*0.6 | THT |

0.6*0.6*0.4 | TTH |

0.6*0.6*0.6 | TTT |

If we group the heads probabilities and tails probabilities in each row, a pattern emerges which helps us to form an equation. For example in rows 2, 3, 5 it can be $H^2\times T$. In row 1 it its $H^3$. Similarly in row 8 it is $T^3$. In rows 4, 6, 7 it is $H \times T^2$. If we know a way to compute how many times a particular pattern (e.g. $H^2\times T$) occurs then the problem is half solved.

This can be answered using the old-school combinations equation. In this example (rows 2, 3, and 5) we are asking how many different ways we can get 2 heads out 3 different coint tosses. The combination equation gives this answer with ‘n’ choose ‘k’. The equation for the same is $\frac{n!}{k!(n-k)!}$.

Since we already know the heads and tail probabilities, we can complete the equation as:

$P(X=k) = \frac{n!}{k!(n-k)!} \times p^k \times (1-p)^{n-k}$

This forms the probability mass function equation for the binomial distribution. In discrete cases, the probability distribution is technically called the probability mass function. By plugging in the values we can compute the probability values for each possible outcome. Some requirements for binomial distribution are: n needs to be fixed, each outcome is of success/failure i.e. binary type, the probability of heads and tails remains the same from trial to trial, and each n trials are independent of each other (one does not affect the other).