A probability distribution is a mathematical object that describes how likely different events are. In general, these events are restricted somehow to a set of possible events, such as for a die (we exclude undefined). A common and useful conceptualization in statistics is to think data is generated from some true probability distribution with unknown parameters. Then, inference is the process of finding out the values of those parameters using just a sample (also known as a dataset) from the true probability distribution. In general, we do not have access to the true probability distribution and thus we must resign ourselves and create a model in an attempt to somehow approximate that distribution. Probabilistic models are built by properly combining probability distributions.
Notice that, in general, we cannot be sure of whether our model is correct and thus we need to evaluate and criticize the models in order to gain confidence and convince others about our models being suitable for the problem we want to explore or solve.
If a variable, , can be described by a probability distribution, then we call a random variable. As a general rule, people use capital letters, such as , to indicate the objects random variable, and , to indicate an instance of that random variable. can be a vector and thus contains many elements or individual values . Let's see an example using Python; our true probability distribution will be a Normal (or Gaussian) distribution with means of and ; these two parameters completely and unambiguously define a Normal distribution. Using SciPy, we can define the random variable, , by writing stats.norm(μ, σ) and we can get an instance, , from it using the rvs (random variates) method. In the following example, we ask for three values:
μ = 0.
σ = 1.
X = stats.norm(μ, σ)
x = X.rvs(3)
You will notice that each time you execute this code (in stats-speak: every trial), you will get a different random result. Notice that once the values of the parameters of a distributions are known, the probability of each value is also known; what is random is the exact values we will get at each trial. A common misconception with the meaning of random is that you can get any possible value from a random variable or that all values are equally likely. The allowed values of a random variable and their probabilities are strictly controlled by a probability distribution, and the randomness arises only from the fact that we can not predict the exact values we will get in each trial. Every time we execute the preceding code, we will get three different numbers, but if we iterate the code thousands of times, we will be able to empirically check that the mean of the sampled values is around zero and that 95% of the samples will have values within the [-1.96, +1.96] range. Please do not trust me, use your Pythonic powers and verify it for yourself. We can arrive at this same conclusion if we study the mathematical properties of the Normal distribution.
A common notation used in statistics to indicate that a variable is distributed as a normal distribution with parameters and is:
In this context, when you find the (tilde) symbol, say is distributed as.
In many texts, it is common to see the normal distribution expressed in terms of the variance and not the standard deviation. Thus, people write
. In this book, we are going to parameterize the Normal distribution using the standard deviation, first because it is easier to interpret, and second because this is how PyMC3 works.
We will meet several probability distributions in this book; every time we discover one, we will take a moment to describe it. We started with the normal distribution because it is like the Zeus of the probability distributions. A variable, , follows a Gaussian distribution if its values are dictated by the following expression:
This is the probability density function for the Normal distribution; you do not need to memorize expression 1.3, I just like to show it to you so you can see where the numbers come from. As we have already mentioned, and are the parameters of the distribution, and by specifying those values we totally define the distribution; you can see this from expression 1.3 as the other terms are all constants. can take any real value, that is, , and dictates the mean of the distribution (and also the median and mode, which are all equal). is the standard deviation, which can only be positive and dictates the spread of the distribution, the larger the value of , the more spread out the distribution. Since there are an infinite number of possible combinations of and , there is an infinite number of instances of the Gaussian distribution and all of them belong to the same Gaussian family.
Mathematical formulas are concise and unambiguous and some people say even beautiful, but we must admit that meeting them for the first time can be intimidating, specially for those not very mathematically inclined; a good way to break the ice is to use Python to explore them:
mu_params = [-1, 0, 1]
sd_params = [0.5, 1, 1.5]
x = np.linspace(-7, 7, 200)
_, ax = plt.subplots(len(mu_params), len(sd_params), sharex=True,
sharey=True,
figsize=(9, 7), constrained_layout=True)
for i in range(3):
for j in range(3):
mu = mu_params[i]
sd = sd_params[j]
y = stats.norm(mu, sd).pdf(x)
ax[i,j].plot(x, y)
ax[i,j].plot([], label="μ = {:3.2f}\nσ = {:3.2f}".format(mu,
sd), alpha=0)
ax[i,j].legend(loc=1)
ax[2,1].set_xlabel('x')
ax[1,0].set_ylabel('p(x)', rotation=0, labelpad=20)
ax[1,0].set_yticks([])
Most of the preceding code is for plotting, the probabilistic part is performed by the y = stats.norm(mu, sd).pdf(x) line. With this line, we are evaluating the probability density function of the normal distribution, given the mu and sd parameters for a set of x values. The preceding code generates Figure 1.1. In each subplot we have a blue (dark gray) curve representing a Gaussian distribution with specific parameters, and , included in each subplot's legend:
Figure 1.1
Most of the figures in this book are generated directly from the code preceding them, many times even without a leading phrase connecting the code to the figure. This pattern should be familiar to those of you working with Jupyter Notebooks or Jupyter Lab.
There are two types of random variables: continuous and discrete. Continuous variables can take any value from some interval (we can use Python floats to represent them), and discrete variables can take only certain values (we can use Python integers to represent them). The Normal distribution is a continuous distribution.
Notice that in the preceding plot, the yticks are omitted; this is a feature not a bug. The reason to omit them is that, in my experience, those values do not add too much relevant information and could be potentially confusing to some people. Let me explain myself: the actual numbers on the y axis do not really matter—what is relevant is their relative values. If you take two values from , let's say and , and you find that (two times higher in the plot), you can safely say the probability of the value of is twice the probability of . This is something that most people will understand intuitively, and fortunately is the correct interpretation. The only tricky part when dealing with continuous distributions is that the values plotted on the y axis are not probabilities, but probability densities. To get a proper probability, you have to integrate between a given interval, that is, you have to compute the area below the curve (for that interval). While probabilities cannot be greater than one, probability densities can, and is the total area under the probability density curve the one restricted to be 1. Understanding the difference between probabilities and probability densities is crucial from a mathematical point of view. For the practical approach used in this book, we can be a little bit more sloppy, since that difference is not that important as long as you get the correct intuition on how to interpret the previous plot in terms of relative values.