The binomial and normal (or Gaussian) distributions are two of the most common distributions in statistics. They are used everywhere from predicting movements in stock prices to grading SAT tests. As an introduction to data visualisation in Python, we will plot a binomial distribution, then plot a normal approximation to it.
(If you've only come for the visualisation part, skip ahead to the plotting section.)
Even if you didn’t take a high school statistics class, you’ve probably still encountered the Gaussian (or normal) distribution in a math or science class.
If you created a histogram to represent the heights of people in a class, it would likely create a bell curve, with the more common heights in the middle, and the less common ones on the sides.
If you increased the number of subjects in your sample, the graph would more and more closely resemble the shape of a normal distribution:
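The probability density function of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$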
Besides the heights of people in a class, many other random variables follow the normal distribution, at least approximately, including measurement error, SAT scores, and shoe sizes. The area under the curve between two points gives the probability of a value falling between those two points (the total area under the curve is 1, so no further scaling is needed).
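To make the area idea concrete, here is a minimal sketch that computes the probability of a normally distributed value falling between two points. It relies on the standard identity linking the normal cumulative distribution function to the error function (math.erf). The helper names normal_cdf and probability_between are illustrative and not part of the walkthrough that follows.

```python
import math


def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def probability_between(a, b, mu, sigma):
    """Area under the normal curve between a and b (with a <= b)."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


# Standard normal (mu = 0, sigma = 1): probability of landing within
# one standard deviation of the mean, roughly 0.68.
print(probability_between(-1, 1, 0, 1))
```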
Closely related is the binomial distribution. Even if you haven’t heard of it, it is very intuitive to understand.
Consider a coin toss: the probability of getting either heads or tails is $0.5$. If there is a fixed number of trials, the binomial distribution is used to calculate the probability of, e.g., getting $x$ heads in $n$ tries. It is represented by the probability mass function:
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
meaning that in $n$ trials, each with a probability $p$ of success, the probability of getting exactly $x$ successes is $P(X = x)$. In the coin toss example, $n = 10$, $p = 0.5$, and success means getting heads. While it is possible that there will be no heads in all 10 trials, it is unlikely, hence $P(X = 0)$ will be extremely low. Likewise, it is unlikely that all 10 tosses will be heads, hence $P(X = 10)$ will be just as low. The more likely numbers of successes are in the middle, with the most likely being $x = 5$.
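The coin toss numbers can be checked directly. The sketch below evaluates the binomial formula for ten fair tosses, using a small helper (here called binomial) built on math.comb, which needs Python 3.8 or later.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


n, p = 10, 0.5  # ten fair coin tosses

for x in (0, 5, 10):
    print(f"P(X = {x}) = {binomial(x, n, p):.4f}")
# P(X = 0)  is 1/1024   (about 0.0010)
# P(X = 5)  is 252/1024 (about 0.2461)
# P(X = 10) is 1/1024   (about 0.0010)
```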
If you increase the number of trials, the shape of the graph starts to look very familiar...
If the number of trials is large enough, the binomial distribution resembles a normal distribution. This means a normal distribution can be used to approximate the binomial by using $\mu = np$ and $\sigma = \sqrt{np(1 - p)}$. This is very useful when, for instance, you're trying to find the probability of getting more than 30 successes when $n = 100$ (which would normally require you to find $P(X = x)$ for every $x$ from 31 to 100).
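As a rough illustration of the shortcut, the sketch below computes the probability of more than 30 successes in 100 trials both ways: by summing the binomial probabilities from 31 to 100, and by taking the area under the matching normal curve. The success probability p = 0.3 is an assumed value chosen only for this example, and normal_cdf is an illustrative helper built on math.erf.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


n = 100
p = 0.3  # assumed value, chosen only for illustration

mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Exact answer: add up P(X = x) for every x from 31 to 100.
exact = sum(binomial(x, n, p) for x in range(31, n + 1))

# Normal approximation: area under the curve to the right of 30.
approx = 1 - normal_cdf(30, mu, sigma)

print(f"exact:  {exact:.4f}")
print(f"approx: {approx:.4f}")
```

The two numbers should come out close but not identical.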
However, there's a catch: the binomial distribution deals with discrete variables (i.e. the number of successes is never fractional, only a whole number), whereas the normal distribution deals with continuous variables. This means the normal distribution is only an approximation to the binomial.
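A common way to narrow this gap is a continuity correction: treat each whole number k as the interval from k - 0.5 to k + 0.5, so that "more than 30" starts at 30.5 on the continuous scale. This technique is not part of the walkthrough itself. The sketch below, again with assumed values n = 100 and p = 0.3, simply compares the approximation with and without it.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


n, p = 100, 0.3  # assumed values, for illustration only
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Exact probability of more than 30 successes.
exact = sum(binomial(x, n, p) for x in range(31, n + 1))

# Without correction: area to the right of 30.
uncorrected = 1 - normal_cdf(30, mu, sigma)

# With continuity correction: area to the right of 30.5.
corrected = 1 - normal_cdf(30.5, mu, sigma)

print(f"exact:       {exact:.4f}")
print(f"uncorrected: {uncorrected:.4f}")
print(f"corrected:   {corrected:.4f}")
```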
We can plot a graph of the probabilities associated with each number of successes. In this example, I’ll be using Python and matplotlib.
If you do not have matplotlib installed for Python yet, install it by typing:
    pip install matplotlib
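Begin by importing math and matplotlib. We'll need math to get the value of $\pi$ and the comb function, which computes the binomial coefficient (math.comb requires Python 3.8 or later).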
    import math
    import matplotlib.pyplot as pyplot
First, we need to define the probability mass function of the binomial distribution. To recall:
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
    def binomial(x, n, p):
        # Probability of exactly x successes in n trials, each with success probability p
        return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))
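Before plotting, it can be worth a quick sanity check that the function behaves like a probability distribution. The sketch below is an optional aside: it confirms that the probabilities over every possible outcome add up to 1, and that the values are symmetric around the middle when p is 0.5.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


n, p = 50, 0.5

# The probabilities of every possible outcome (0 to n successes) should sum to 1.
total = sum(binomial(x, n, p) for x in range(n + 1))
print(total)

# With p = 0.5, outcomes equally far from n / 2 are equally likely.
print(binomial(20, n, p), binomial(30, n, p))
```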
In this case, to make the binomial distribution more clearly resemble a normal distribution, I will make $n$ equal to $50$. $p$ will be $0.5$. We also need to initialise a list to store the 50 values we get from the binomial p.m.f. for each value of $x$, and another list to store keys (which simply number each value of $x$ in the previous list).
    n = 50
    p = 0.5
    binomial_list = []
    keys = []
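Now that we have initialised all the variables we need, we can iterate over each number $x$ from $0$ to $49$ to get the value of $P(X = x)$ for each, and fill the keys list with the same numbers. This leaves out the last possible outcome, $x = 50$, but its probability of $0.5^{50}$ is far too small to show up on the plot: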
    for x in range(50):
        binomial_list.append(binomial(x, n, p))
    for y in range(50):
        keys.append(y)
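The same two lists can also be built without explicit loops. This is just an equivalent, more compact sketch using range and a list comprehension. It produces the same keys and the same probabilities as the loops above.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


n = 50
p = 0.5

# One key per value of x, from 0 to 49 ...
keys = list(range(50))
# ... and the matching probability for each key.
binomial_list = [binomial(x, n, p) for x in keys]

print(len(keys), len(binomial_list))
```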
All that's left is to plot the graph!
    pyplot.bar(x=keys, height=binomial_list)
    pyplot.show()
As you can see, it clearly resembles a normal distribution. To check whether it actually has the shape of a normal distribution, we can superimpose the normal curve by inserting this code (before the pyplot.show() line):
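    def normal_distribution(x, mu, sigma):
        # Normal probability density at x, for mean mu and standard deviation sigma
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)) / (sigma * math.sqrt(2 * math.pi))

    # Draw the normal curve with mu = n*p and sigma = sqrt(n*p*(1-p)) over the bars
    pyplot.plot(keys, [normal_distribution(y, n*p, math.sqrt(n*p*(1-p))) for y in range(50)])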
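To put a number on how closely the curve follows the bars, the sketch below measures the largest gap between a bar's height and the curve at the same value of x. It reuses the two functions from the walkthrough and skips the plotting so that it runs on its own. The variable name largest_gap is illustrative.

```python
import math


def binomial(x, n, p):
    """Probability of exactly x successes in n trials."""
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))


def normal_distribution(x, mu, sigma):
    """Normal probability density at x."""
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)) / (sigma * math.sqrt(2 * math.pi))


n = 50
p = 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Largest gap between a bar's height and the curve at the same x.
largest_gap = max(
    abs(binomial(x, n, p) - normal_distribution(x, mu, sigma))
    for x in range(50)
)
print(largest_gap)
```

For comparison, the tallest bar (at x = 25) has a height of about 0.11, so a small gap here means the curve tracks the bars closely.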