The binomial and normal (or Gaussian) distributions are two of the most common distributions in statistics. They are used for everything from modelling movements in stock prices to grading SAT tests. As an introduction to data visualisation in Python, we will plot a binomial distribution, then plot a normal approximation to it.

(If you've only come for the visualisation part, skip to "Plotting the binomial distribution" below.)

What are the Normal and Binomial Distributions

Even if you didn’t take a high school statistics class, you’ve probably still encountered the Gaussian (or normal) distribution in a math or science class.

If you created a histogram to represent the heights of people in a class, it would likely create a bell curve, with the more common heights in the middle, and the less common ones on the sides.

Histogram of heights of people in a class

If you increased the number of subjects in your sample, the graph would more and more closely resemble the shape of a normal distribution:

Graph of 14 year olds' heights

The probability density function of a normal distribution is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

The shape of the graph is determined by its mean and standard deviation. The graph is centred around the mean, and the standard deviation determines how stretched it is in the x-direction.

Besides the heights of people in a class, many other random variables approximately follow a normal distribution, including measurement error, SAT scores, and shoe sizes. The area under the curve between two points, divided by the area under the whole curve (which is 1 for a probability density function), gives the probability of a value falling between those two points.
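To make the area idea concrete, here is a minimal sketch that uses only Python's math module. The helper names normal_cdf and prob_between are my own, chosen for illustration, and the mean and standard deviation (165 cm and 7 cm) are made-up values rather than data from a real class.

```python
import math

def normal_cdf(x, mu, sigma):
    # Area under the normal curve to the left of x, via the error function.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def prob_between(a, b, mu, sigma):
    # Area under the curve between a and b (the total area is 1).
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Hypothetical heights: mean 165 cm, standard deviation 7 cm.
within_one_sd = prob_between(158, 172, 165, 7)
print(round(within_one_sd, 4))  # 0.6827
```

Whatever mean and standard deviation you pick, the area within one standard deviation of the mean comes out at about 68%.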

Closely related is the binomial distribution. Even if you haven’t heard of it, it is very intuitive to understand.

Consider a coin toss: the probability of landing on heads is 0.5, and so is the probability of landing on tails. If there is a fixed number of trials, the binomial distribution gives the probability of, e.g., getting 2 heads in 10 tries. It is described by the probability mass function:

$$P_{x} = \binom{n}{x} p^{x} (1-p)^{n-x}$$

meaning that in $n$ trials, each with a probability $p$ of success, the probability of getting exactly $x$ successes is $P_{x}$. In the coin toss example, $n=10$, $p=0.5$, and success means getting heads. While it is possible that there will be no heads in all 10 trials, it is unlikely, hence $P_{0}$ will be extremely low. Likewise, it is unlikely that all 10 tosses will be heads, hence $P_{10}$ will be just as low. The more likely numbers of successes are in the middle, with the most likely being $n \times p$ $(=5)$.
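As a quick check of the coin toss numbers, the formula above can be written out with math.comb (which needs Python 3.8 or newer). This is only a sketch: the function name binomial and the list name probabilities are choices made here for illustration.

```python
import math

def binomial(x, n, p):
    # Probability of exactly x successes in n trials.
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

n, p = 10, 0.5
probabilities = [binomial(x, n, p) for x in range(n + 1)]

print(probabilities[2])   # 45/1024, roughly 0.044
print(probabilities[0])   # 1/1024, roughly 0.001

# The number of heads with the highest probability.
most_likely = max(range(n + 1), key=lambda x: probabilities[x])
print(most_likely)        # 5
```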

If you increase the number of trials, the shape of the graph starts to look very familiar...

image of binomial distribution

If the number of trials is large enough, the binomial distribution resembles a normal distribution. This means a normal distribution can be used to approximate the binomial by using $\mu=np$ and $\sigma=\sqrt{np(1-p)}$. This is very useful when, for instance, you're trying to find the probability of more than 30 successes when $n=100$ (which would otherwise require you to find $P_{x}$ for every number from 31 to 100).

However, there's a catch: the binomial distribution deals with discrete variables (i.e. there is no $P_{8.443}$ or $P_{3.142}$, only whole numbers), whereas the normal distribution deals with continuous variables. This means the normal distribution is only an approximation to the binomial.
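To see how close the approximation gets, the sketch below works out the "more than 30 successes in 100 trials" example both ways. The text only fixes $n=100$, so $p=0.5$ is an assumption made here, and the helper names are illustrative. Because the binomial is discrete and the normal is continuous, the normal area is taken from 30.5 rather than 30, which is the usual continuity correction.

```python
import math

def binomial(x, n, p):
    # Probability of exactly x successes in n trials.
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

def normal_cdf(x, mu, sigma):
    # Area under the normal curve to the left of x.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

n, p = 100, 0.5          # p = 0.5 is assumed for this example
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Exact: add up P_x for x = 31, ..., 100.
exact = sum(binomial(x, n, p) for x in range(31, n + 1))

# Approximation: area under the normal curve to the right of 30.5.
approx = 1 - normal_cdf(30.5, mu, sigma)

print(exact, approx)
```

The two printed values should agree closely, and the approximation replaces a sum of 70 terms with a single function call.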

I realise this is by no means an exhaustive introduction. If you're still confused, this and this are more in-depth explanations.

Plotting the binomial distribution

We can plot a graph of the probability associated with each number of successes. In this example, I’ll be using Python and matplotlib.

If you do not have matplotlib installed on Python yet, install it by typing pip install matplotlib.

Begin by importing matplotlib and math. We'll need math for the exponential function and the $\binom{n}{x}$ function (math.comb, which needs Python 3.8 or newer).

import math
import matplotlib.pyplot as pyplot

Firstly, we need to define the probability mass function of the binomial distribution. To recall:

$$P_{x} = \binom{n}{x} p^{x} (1-p)^{n-x}$$

In Python:

def binomial(x, n, p):
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))
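One easy sanity check on a function like this is that the probabilities over every possible number of successes add up to 1. A minimal sketch, repeating the definition so that it runs on its own (the second set of parameters, 20 and 0.3, is arbitrary):

```python
import math

def binomial(x, n, p):
    # Probability of exactly x successes in n trials.
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

# x can be any whole number from 0 to n inclusive, hence range(n + 1).
total = sum(binomial(x, 50, 0.5) for x in range(51))
print(math.isclose(total, 1.0))  # True

total_other = sum(binomial(x, 20, 0.3) for x in range(21))
print(math.isclose(total_other, 1.0))  # True
```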

In this case, to make the binomial distribution more clearly resemble a normal distribution, I will set $n$ to 50 and $p$ to 0.5. We also need to initialise a list to store the values we get from the binomial formula for each value of $x$, and another list to store keys (the value of $x$ that each entry in the previous list belongs to).

n = 50
p = 0.5
binomial_list = []
keys = []

Now that we have initialised all the variables we need, we can iterate over each whole number from 0 to 50 to get the value of $P_{x}$ for each, and fill the keys list with the same numbers from 0 to 50:

for x in range(n + 1):
    binomial_list.append(binomial(x, n, p))

for y in range(n + 1):
    keys.append(y)

All that's left is to plot the graph!

pyplot.bar(x=keys, height=binomial_list)
pyplot.show()

And voila!

normal distribution

As you can see, it clearly resembles a normal distribution. To see if it actually has the shape of a normal distribution, we can superimpose the normal curve by inserting this code before the pyplot.show() call:

def normal_distribution(x, mu, sigma):
    # Normal p.d.f. with mean mu and standard deviation sigma
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Evaluate the curve at the same x values as the bars
pyplot.plot(keys, [normal_distribution(y, n * p, math.sqrt(n * p * (1 - p))) for y in keys])

binomial and normal
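If you'd rather check the fit with numbers than by eye, one option is to compare the two functions at every whole number of successes. This is a sketch that repeats both definitions so it runs on its own, with no plotting involved. The name gaps is my own.

```python
import math

def binomial(x, n, p):
    # Probability of exactly x successes in n trials.
    return math.comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

def normal_distribution(x, mu, sigma):
    # Normal p.d.f. with mean mu and standard deviation sigma.
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)) / (sigma * math.sqrt(2 * math.pi))

n, p = 50, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Absolute difference between each bar's height and the curve at the same x.
gaps = [abs(binomial(x, n, p) - normal_distribution(x, mu, sigma))
        for x in range(n + 1)]

print(max(gaps))
```

The largest gap should be small compared with the height of the tallest bar, which is roughly 0.11 for these parameters.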




Copyright © 2022 Darren Dube. All rights reserved.