Probability theory#
What you need to know
Sample space $\Omega$ is a set of elementary outcomes or events.
Events can contain more than one elementary event and can be constructed by forming subsets ($A$, $B$, etc.) of $\Omega$.
Probability function $P(A)$ assigns a numeric value to each event $A$, quantifying the certainty of the event happening on a 0-1 scale.
Venn diagrams visualize $P(A)$ as a “volume fraction” of our confidence in the event expressed on a 0-1 scale.
Probability axioms define a set of logical rules for creating composite events from trivial ones.
Bayesian approach: in the physical sciences and modeling one often deals with situations where counting is impossible. Hence, probability is interpreted as a degree of belief.
Probabilistic world of complex many particle systems#

Fig. 1 Characterizing complex many-particle systems is best done via a probabilistic approach.#
What is the probability of finding a gas atom in the upper right corner cell?
What is the probability that all gas atoms will be on the left side of the box?
What is the probability distribution of velocities of the gas?

Fig. 2 Simulations of a Random Walk in 1D#
What is the probability of finding the molecule n steps away from the center after N steps?
How do we obtain the probability distribution after N steps given the probability for a single step?
Why is there a tendency for probability distributions to evolve towards a Gaussian?
Sample space#
The sample space, often signified by $\Omega$, is the set of all possible elementary events. Elementary means the events cannot be further broken down into simpler events. For instance, rolling a die can result in any one of six elementary events.
States of $\Omega$ are sampled during a system trial, which could be done via an experiment or simulation.
If our trial is a single roll of a six-sided die, then the sample space has size $|\Omega| = 6$: $\Omega = \{1, 2, 3, 4, 5, 6\}$.
A fair coin tossed once has a sample space of size $|\Omega| = 2$: $\Omega = \{H, T\}$.
If a fair coin is tossed three times in a row, we will have a sample space of size $|\Omega| = 2^3 = 8$.
The position of an atom in a container of size $L$ along x can take a practically continuous range of values: here $|\Omega|$ is a huge number. We will need some special tools to quantify it.
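For small discrete sample spaces we can simply enumerate $\Omega$ explicitly. Here is a quick sketch using Python's itertools for the three-coin-toss example mentioned above:

from itertools import product

# Enumerate the sample space for three tosses of a fair coin
Omega = list(product('HT', repeat=3))

print(Omega)        # [('H','H','H'), ('H','H','T'), ...]
print(len(Omega))   # 8 elementary events, i.e. 2**3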
Events, microstates, and macrostates#
An event in probability theory refers to an outcome of an experiment. An event can contain one or more elementary events from $\Omega$. An event can be getting a 5 on a die or getting any number less than 5.
In the context of statistical mechanics, we are going to call elementary events in $\Omega$ microstates and events containing multiple microstates macrostates.
If we roll a single die there are six microstates. We can define a macrostate $A$ as the event of getting any number less than 4: $A = \{1, 2, 3\}$.
Or we can create a macrostate $B$ containing only even numbers: $B = \{2, 4, 6\}$.
If we toss two coins, the microstates are HT, TH, HH, TT. We can define a macrostate of having 50% H and 50% T: $\{HT, TH\}$.
A microstate of a gas atom in a 1D container could be its position x. A macrostate could be finding the atom anywhere in the second half of the container.
Computing probabilities through counting#
We compute probabilities of events as fractions of the sample space:

$$P(A) = \frac{|A|}{|\Omega|}$$

For example, take the probability of the event of rolling an even number. The size of the event space is $|A| = 3$, while the size of the sample space for a single die roll is $|\Omega| = 6$, so $P(A) = 3/6 = 1/2$.
Visualizing events as Venn diagrams#
# Colab has this, but in a local notebook you may want to install it
#!pip install matplotlib-venn  # install if running locally
import matplotlib_venn as venn
import matplotlib.pyplot as plt
Omega = {1,2,3,4,5,6}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6}
venn.venn2([A, B], set_labels=('A','B'))
print(len(A)/len(Omega))
print(len(B)/len(Omega))
print(len(A & B)/len(Omega))
print(len(A | B)/len(Omega))
0.8333333333333334
0.5
0.3333333333333333
1.0

Probability Axioms#
Positivity and Normalization

$$P(A) \geq 0, \qquad P(\Omega) = 1$$

The probability of rolling each number is 1/6, and the probability of rolling any number is 1.
Addition rule

For any sequence of mutually exclusive events $A_1, A_2, \ldots$:

$$P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$$

The probability of a die rolling an even number is:

$$P(\{2, 4, 6\}) = P(2) + P(4) + P(6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
Product rule

When events $A$ and $B$ are independent:

$$P(A \cap B) = P(A)\,P(B)$$

The probability of rolling twice and getting a 3 and then a 5 is:

$$P(3)\,P(5) = \frac{1}{6}\cdot\frac{1}{6} = \frac{1}{36}$$
Complement
Given that
the probability of not rolling a number:
Conditional probability and Bayes' theorem

Knowledge of past events may change the probability of future events:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The probability of getting a 4 given that we have rolled an even number is:

$$P(4 \mid \text{even}) = \frac{P(4 \cap \text{even})}{P(\text{even})} = \frac{1/6}{1/2} = \frac{1}{3}$$
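We can verify this conditional probability by direct counting of microstates; a minimal sketch:

Omega = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}          # conditioning event B
four = {4}                # event A

# P(A|B) = P(A and B) / P(B), computed by counting microstates
P_B = len(even) / len(Omega)
P_A_and_B = len(four & even) / len(Omega)
print(P_A_and_B / P_B)    # 0.333... = 1/3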
Bayes' Theorem#

Fig. 3 Joint Probability
Example of Using Bayes' Formula to Test a Hypothesis
A test for cancer is known to be 90% accurate either in detecting cancer if present or in giving an all-clear if cancer is absent.
The prevalence of cancer in the population is 1%. How worried should you be if you test positive? Try answering this question using Bayes’ theorem.
Solution
The accuracy of the test (how often positives show up when cancer is certain) gives the likelihoods:

$$P(+ \mid C) = 0.9, \qquad P(- \mid \bar{C}) = 0.9 \;\Rightarrow\; P(+ \mid \bar{C}) = 0.1$$

Only 1% of the population has cancer; hence, we get the probability of an individual having (not having) cancer as:

$$P(C) = 0.01, \qquad P(\bar{C}) = 0.99$$

Now we have all terms to compute $P(C \mid +)$, the probability of disease given a positive test:

$$P(C \mid +) = \frac{P(+ \mid C)\,P(C)}{P(+ \mid C)\,P(C) + P(+ \mid \bar{C})\,P(\bar{C})} = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99} \approx 0.083$$

So even after a positive test, the probability of actually having cancer is only about 8%.
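The same arithmetic can be checked in a few lines of Python:

# Bayes' theorem for the cancer-test example above
P_pos_given_C = 0.9      # test detects cancer when present
P_pos_given_noC = 0.1    # false-positive rate (1 - accuracy)
P_C = 0.01               # prevalence
P_noC = 1 - P_C

# Evidence: total probability of testing positive
P_pos = P_pos_given_C * P_C + P_pos_given_noC * P_noC

# Posterior probability of cancer given a positive test
print(P_pos_given_C * P_C / P_pos)   # ~0.083, i.e. only about 8%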
Prior, Posterior, and Likelihood
Bayes’ theorem provides a powerful framework for testing hypotheses or learning model parameters from data. While the mathematical formulation remains the same, the terminology used in Bayesian inference differs from the standard probability notation:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$

where:

Prior: $P(\theta)$ represents our initial belief about the hypothesis or parameter before observing the data. For example, if we are tossing a coin with unknown bias $\theta$, a reasonable prior might be a Gaussian centered at $\theta = 0.5$, or a uniform distribution in the absence of information.

Evidence: $P(D)$ is the probability of the observed data, also known as the marginal likelihood. It accounts for all possible parameter values and normalizes the posterior. For example, it is the probability of obtaining a specific sequence, such as HHHT, averaged over all possible biases of the coin.

Likelihood: $P(D \mid \theta)$ describes how probable the observed data is for a given parameter $\theta$. E.g., for the sequence HHHT it is $\theta^3(1-\theta)$, the probability of landing three H and one T.

Posterior: $P(\theta \mid D)$ is the updated probability of the hypothesis after incorporating the observed data. This is the key quantity in Bayesian inference, as it represents our revised belief about $\theta$ given the data. We can take the value of $\theta$ corresponding to the maximum of the posterior to be the most likely value of our parameter. For our case of a uniform prior and likelihood $\theta^3(1-\theta)$, the maximum will be at $\theta = 3/4$, as we may expect.
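A short numerical sketch of this Bayesian updating for the coin example (uniform prior, HHHT data as above):

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 500)          # possible values of the coin bias
prior = np.ones_like(theta)             # uniform prior
likelihood = theta**3 * (1 - theta)     # P(D|theta) for three H and one T

posterior = likelihood * prior
posterior /= np.trapz(posterior, theta) # normalize (this integral is the evidence P(D))

plt.plot(theta, posterior)
plt.axvline(0.75, linestyle='--', color='k', label=r'maximum at $\theta=3/4$')
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta \mid D)$')
plt.legend()
plt.show()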
Computing the number of microstates via combinatorics#
Binomial Distribution (Two-State Systems)

When molecules can be in two states (e.g., adsorbed vs. free, spin-up vs. spin-down), the number of ways to arrange $n$ of the $N$ molecules in one state and $N - n$ in the other is given by the binomial coefficient:

$$\Omega(n, N) = \binom{N}{n} = \frac{N!}{n!\,(N - n)!}$$

For example, if $N$ gas molecules distribute between two parts of the box, or spins occupy two energy levels, this formula gives the number of microstates for a given occupation $n$.
Multinomial Distribution (Multiple States)

For systems with more than two states, such as molecules distributed among $m$ regions with occupations $n_1, n_2, \ldots, n_m$ (where $\sum_i n_i = N$), the number of microstates is:

$$\Omega(n_1, \ldots, n_m, N) = \frac{N!}{n_1!\,n_2!\cdots n_m!}$$
Example: partitioning gas particles

Consider a container filled with 1000 atoms of Ar.

What is the probability that the left half has 400 atoms?

$$P(400) = \binom{1000}{400}\left(\frac{1}{2}\right)^{1000}$$

What is the probability that the left half has 500 atoms?

$$P(500) = \binom{1000}{500}\left(\frac{1}{2}\right)^{1000}$$

What is the probability that the first 1/3 has 100 atoms, the next 1/3 has 200, and the last 1/3 has 700?

$$P(100, 200, 700) = \frac{1000!}{100!\,200!\,700!}\left(\frac{1}{3}\right)^{1000}$$

What is the total number of all possible partitionings, or states, of the gas atoms in the container?

Each of the $N$ lattice sites in the container can be vacant or filled with an atom, so the total number of states is $2^N$.
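These counts are astronomically large, so it is convenient to work in log space. A sketch of the three probabilities above using scipy.special.gammaln (the particular helper is an implementation choice, not part of the original notes):

import numpy as np
from scipy.special import gammaln

def log_binom(N, k):
    """log of the binomial coefficient C(N, k)."""
    return gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

N = 1000
# P(left half has k atoms) = C(N, k) (1/2)^N
for k in (400, 500):
    logP = log_binom(N, k) + N * np.log(0.5)
    print(f"P(left half has {k} atoms) = {np.exp(logP):.3e}")

# Multinomial: P(100, 200, 700 in three equal thirds) = N!/(100! 200! 700!) (1/3)^N
log_multi = gammaln(N + 1) - sum(gammaln(k + 1) for k in (100, 200, 700))
print(f"P(100, 200, 700) = {np.exp(log_multi + N * np.log(1 / 3)):.3e}")

# Total number of ways to place each atom in either half
print(f"Total number of two-sided arrangements: {2.0**N:.3e}")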
Example: spins
Solid metal has 100 atoms. Magnetic measurements show that there are 10 atoms with spin down. If ten atoms are chosen at random, what is the probability that they all have spin up?
Solution
The total number of ways to choose any 10 atoms out of 100, regardless of spin, is:

$$\binom{100}{10}$$

The number of ways to choose 10 atoms out of the 90 with spin up is:

$$\binom{90}{10}$$

The probability of picking 10 up spins is therefore:

$$P = \frac{\binom{90}{10}}{\binom{100}{10}} \approx 0.33$$
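A quick numerical check of this ratio with scipy.special.comb:

from scipy.special import comb

# Ways to choose 10 spin-up atoms out of the 90 available,
# divided by ways to choose any 10 atoms out of 100
p_all_up = comb(90, 10) / comb(100, 10)
print(p_all_up)   # ~0.33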
def gas_partition(k1=30, k2=30, k3=30):
    '''partitioning N gas molecules into regions k1, k2 and k3'''
    from scipy.special import factorial
    N = k1 + k2 + k3
    return factorial(N) / (factorial(k1) * factorial(k2) * factorial(k3))
print( gas_partition(k1=50, k2=50, k3=0) )
print( gas_partition(k1=50, k2=49, k3=1) )
print( gas_partition(k1=50, k2=25, k3=25) )
print( gas_partition(k1=34, k2=33, k3=33) )
1.0089134454556417e+29
5.0445672272782094e+30
1.2753736048324953e+43
4.19244792425583e+45
Stirling approximation of factorials and binomials#
Stirling approximation of N!:

$$\log N! \approx N\log N - N$$

This is the crude version of the Stirling approximation, which works well for large $N$.
A more accurate version is:

$$\log N! \approx N\log N - N + \frac{1}{2}\log(2\pi N)$$
import numpy as np
import matplotlib.pyplot as plt
import scipy.special as sp
# Define range for N
N_values = np.arange(1, 100, 1)
# Exact factorial using log(N!)
log_fact_exact = np.log(sp.factorial(N_values))
# Crude Stirling approximation
log_fact_crude = N_values * np.log(N_values) - N_values
# More accurate Stirling approximation
log_fact_accurate = N_values * np.log(N_values) - N_values + 0.5 * np.log(2 * np.pi * N_values)
# Plot comparisons
plt.figure(figsize=(10, 6))
plt.plot(N_values, log_fact_exact, label=r"Exact $\log N!$", color="black", linewidth=2)
plt.plot(N_values, log_fact_crude, label="Crude Stirling Approximation", linestyle="--", color="red", linewidth=2)
plt.plot(N_values, log_fact_accurate, label="Accurate Stirling Approximation", linestyle=":", color="blue", linewidth=2)
plt.xlabel("N")
plt.ylabel(r"$\log N!$")
plt.title(r"Comparison of Stirling Approximations for $\log N!$")
plt.legend()
plt.grid(True)
plt.show()

Random Walk#
Consider a problem with a binary outcome, with fixed probabilities $p_+ + p_- = 1$.
A classic example is a random walk of $N$ steps where a molecule jumps right ($+$) or left ($-$) with fixed probabilities.
Other examples are tossing $N$ coins or counting non-interacting molecules in the left vs right hand side of a container.
Each experiment generates a sequence, e.g., $++-+-\ldots$ for a random walk or $HHTHT\ldots$ for coin flips.
Such a sequence represents a single microstate in the sample space of all possible sequences, which has size $2^N$.
For an unbiased random walk, $p_+ = p_- = 1/2$, all microstates are equally probable, each with probability $1/2^N$.
For a biased random walk, $p_+ \neq p_-$, the probability of a microstate (sequence) is determined by the product of step probabilities (because the steps are independent).
Probability of a sequence (microstate)

$$P(\text{sequence}) = p_+^{n_+}\,p_-^{n_-}$$

where $n_+$ and $n_-$ are the numbers of right and left steps in the sequence.
A more interesting question is the probability of taking $n_+$ steps to the right, or having a net displacement of $x = n_+ - n_-$, regardless of the sequence of events.

Probability of net number of steps or displacements (macrostate)

Probability of $n_+$ steps to the right:

$$P(n_+ \mid N) = \binom{N}{n_+}\, p_+^{n_+}\, p_-^{N - n_+}$$

Probability of a net displacement $x$ from the origin (using $n_+ = (N + x)/2$):

$$P(x \mid N) = \binom{N}{\frac{N+x}{2}}\, p_+^{\frac{N+x}{2}}\, p_-^{\frac{N-x}{2}}$$
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb
def P_x(N, x, p_plus):
    """Computes the binomial probability P(x|N, p_+)"""
    if (N + x) % 2 != 0:  # Ensure x is valid (must be even relative to N)
        return 0
    k = (N + x) // 2
    return comb(N, k) * (p_plus ** k) * ((1 - p_plus) ** (N - k))
# Define parameters
p_plus_unbiased = 0.5 # Unbiased case
p_plus_biased = 0.7 # Biased case
N_values = [10, 50, 100, 200] # Different N values
# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
# Colors for different N values
colors = ["blue", "green", "red", "purple"]
# Loop through different N values
for i, N in enumerate(N_values):
    x_vals = np.arange(-N, N + 1, 2)  # x values must be -N to N, in steps of 2
    x_norm = x_vals / N  # Normalize x by N
    # Compute probabilities using vectorized function
    P_unbiased = np.array([P_x(N, x, p_plus_unbiased) for x in x_vals])
    P_biased = np.array([P_x(N, x, p_plus_biased) for x in x_vals])
    # Plot lines
    axes[0].plot(x_norm, P_unbiased, marker="o", linestyle="-", label=f"N={N}", color=colors[i])
    axes[1].plot(x_norm, P_biased, marker="o", linestyle="-", label=f"N={N}", color=colors[i])
# Titles and labels
axes[0].set_title(f"Unbiased Coin ($p_+$ = {p_plus_unbiased})")
axes[0].set_xlabel("$x/N$")
axes[0].set_ylabel("$P(x)$")
axes[0].legend()
axes[1].set_title(f"Biased Coin ($p_+$ = {p_plus_biased})")
axes[1].set_xlabel("$x/N$")
axes[1].legend()
plt.tight_layout()
plt.show()

Log of Macrostate Probability, Entropy, and Fluctuations#
The logarithm of the probability of a given macrostate, $P(n_+ \mid N)$, can be written as:

$$\log P(n_+ \mid N) = \log\binom{N}{n_+} + n_+\log p_+ + n_-\log p_-$$

In simulations, we measure the fractions of steps $f_+ = n_+/N$ and $f_- = n_-/N$.
Fractions will fluctuate but, in the limit of an infinitely long simulation, should converge to the true probabilities, $f_\pm \rightarrow p_\pm$.
Since $n_+ = N f_+$ and $n_- = N f_-$, we introduce per-step quantities to simplify subsequent expressions:

$$\frac{1}{N}\log P_N(f_+) \approx s(f_+) - \epsilon(f_+)$$

where:
$s(f_+)$ represents the entropy term, related to the number of ways to distribute steps, and $\epsilon(f_+)$ is an energy-like function governing the bias in step distribution.
Energy as a Measure of Bias#
Taking the logarithm of the probability factor $p_+^{n_+} p_-^{n_-}$ results in an energy-like function $\epsilon(f_+)$, which introduces a bias that shifts the distribution left or right depending on the probabilities of left/right steps, $p_\pm$, determined by the microscopic details of the random walk:

$$\epsilon(f_+) = -\big(f_+\log p_+ + f_-\log p_-\big)$$

where $\epsilon(f_+)$ is the energy term per step of the random walk.
When $p_+ = p_- = 1/2$, there is no bias, and the energy simplifies to $\epsilon = \log 2$.
Entropy as the Logarithm of the Number of Microstates in a Macrostate#
The entropy term is related to the number of ways to distribute the steps. Using the Stirling approximation for the log of the binomial coefficient, we obtain:

$$\frac{1}{N}\log\binom{N}{n_+} \approx s(f_+) = -\big(f_+\log f_+ + f_-\log f_-\big)$$

where $s(f_+)$ is the entropy per step of the random walk.

Entropy of a macrostate in terms of step fractions

The entropy of a macrostate defined by a fraction of steps $f_+$ is:

$$S(f_+) = N s(f_+) = -N\big(f_+\log f_+ + f_-\log f_-\big)$$

Alternatively, in terms of the net displacement fraction $x/N$, with $f_\pm = \frac{1}{2}\left(1 \pm \frac{x}{N}\right)$:

$$S(x) = -N\left[\frac{1}{2}\left(1+\frac{x}{N}\right)\log\frac{1}{2}\left(1+\frac{x}{N}\right) + \frac{1}{2}\left(1-\frac{x}{N}\right)\log\frac{1}{2}\left(1-\frac{x}{N}\right)\right]$$
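As a quick numerical sanity check, one can compare the exact $\frac{1}{N}\log\binom{N}{n_+}$ with the Stirling-based entropy per step $s(f_+)$; a small sketch with $N = 1000$ chosen purely for illustration:

import numpy as np
from scipy.special import gammaln

N = 1000
n_plus = np.arange(1, N)          # avoid the endpoints 0 and N
f = n_plus / N

# Exact (1/N) log of the binomial coefficient, computed in log space
log_binom = gammaln(N + 1) - gammaln(n_plus + 1) - gammaln(N - n_plus + 1)

# Stirling-based entropy per step
s = -(f * np.log(f) + (1 - f) * np.log(1 - f))

print(np.max(np.abs(log_binom / N - s)))   # small for large N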
Large Deviation Theory#
By taking a log of macrostate probability
we find that it scales linearly with and is proportional to a which is independent of NWhen there are
steps, molecules, or components, the probability distribution over the fraction of steps tends to concentrate near the minima of caled Large deviation functionLarge deviation function
dictates both the shape and decay of probability distributions in the large- limit.This general result applicable is known under the name of Large Deviation Theorem (LDT).
Large Deviation Theorem (LDT)

For a macrostate probability, where $f$ is some empirical fraction or mean quantity over $N$ components:

$$P_N(f) \sim e^{-N I(f)}$$

$I(f)$ is the rate function, which quantifies the likelihood of fluctuations away from the most probable value.

Example of $I(f)$ for the random walk:

$$I(f_+) = f_+\log\frac{f_+}{p_+} + f_-\log\frac{f_-}{p_-}$$

It determines how deviations of $f_\pm$ (empirical fractions) from $p_\pm$ (true or exact probabilities) are exponentially suppressed as $N$ increases.
import numpy as np
import matplotlib.pyplot as plt
# Define parameters
p_plus = 0.7 # Biased probability
p_minus = 1 - p_plus # Complementary probability
# Define range for f_+
f_plus_values = np.linspace(0.01, 0.99, 200) # Avoid log(0) issues
f_minus_values = 1 - f_plus_values # f_- = 1 - f_+
# Compute entropy component s(f_+)
s_values = -(f_plus_values * np.log(f_plus_values) + f_minus_values * np.log(f_minus_values))
# Compute energy component ε(f_+)
epsilon_values = - (f_plus_values * np.log(p_plus) + f_minus_values * np.log(p_minus))
# Compute large deviation rate function I(f_+)
I_values = f_plus_values * np.log(f_plus_values / p_plus) + f_minus_values * np.log(f_minus_values / p_minus)
# Compute probability P_N(f_+) using large deviation approximation
N = 50 # Arbitrary large N
P_x_values = np.exp(-N * I_values) # Exponential suppression
P_x_values /= np.trapz(P_x_values, f_plus_values) # Normalize for probability density
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# First subplot: Entropy and Energy Components
axes[0].plot(f_plus_values, s_values, label=r"$s(f)$ (Entropy)", color="blue")
axes[0].plot(f_plus_values, epsilon_values, label=r"$\epsilon(f_+)$ (Energy)", color="green")
axes[0].plot(f_plus_values, I_values, label=r"$I(f)$ (Rate Function)", color="red")
axes[0].axvline(p_plus, linestyle="--", color="black", label=r"$f_+ = p_+$")
axes[0].set_xlabel(r"$f_+$")
axes[0].set_ylabel("Value")
axes[0].set_title("Entropy, Energy, and Rate Function")
axes[0].legend()
axes[0].grid()
# Second subplot: Probability Distribution P_N(f_+)
axes[1].plot(f_plus_values, P_x_values, label=r"$P_N(f_+)$", color="purple")
axes[1].axvline(p_plus, linestyle="--", color="black", label=r"$f_+ = p_+$")
axes[1].set_xlabel(r"$f_+$")
axes[1].set_ylabel(r"$P_N(f_+)$")
axes[1].set_title("Probability Distribution")
axes[1].legend()
axes[1].grid()
plt.tight_layout()
plt.show()

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb
def plot_large_deviation(N, theta, color):
    """Plots the large deviation approximation for given N and bias theta."""
    f = np.linspace(0.01, 0.99, 200)  # Avoid log(0) issues
    # Compute the rate function I(f)
    I = f * np.log(f / theta) + (1 - f) * np.log((1 - f) / (1 - theta))
    # Compute normalized probability P_LDT(f) ∼ exp(-N I(f))
    p_ldt = np.exp(-N * I)
    p_ldt /= np.trapz(p_ldt, f)  # Normalize using trapezoidal rule for integration
    plt.plot(f, p_ldt, color=color, linestyle="-", linewidth=2, label=f"LDT Approx. (N={N})")

def plot_binomial(N, theta, color):
    """Plots the exact binomial distribution for given N and bias theta."""
    n = np.arange(N + 1)
    f = n / N  # Convert discrete counts to fractions
    # Compute binomial probability mass function
    prob = comb(N, n) * theta**n * (1 - theta)**(N - n)
    # Normalize probability for direct comparison with LDT curve
    prob /= np.trapz(prob, f)
    plt.plot(f, prob, 'o', color=color, markersize=5, label=f"Binomial (N={N})")
# Parameters
theta = 0.5 # Fair coin
Ns = [5, 10, 20, 50, 100] # Different values of N
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(Ns))) # Use colormap for better distinction
# Create the plot
plt.figure(figsize=(8, 6))
for i, N in enumerate(Ns):
    plot_large_deviation(N, theta, colors[i])
    plot_binomial(N, theta, colors[i])
# Labels and formatting
plt.xlabel(r"$f_+$", fontsize=14)
plt.ylabel(r"Probability Density", fontsize=14)
plt.title("Comparison of Binomial Distribution (Points) and Large Deviation Approximation (Lines)", fontsize=12)
plt.legend(loc="upper left", fontsize=10)
plt.grid()
plt.show()

Gaussian Nature of Fluctuations#
The Taylor expansion of the large deviation function around its minimum leads to a Gaussian distribution in the limit of small fluctuations. This fundamental result in large deviation theory explains why fluctuations in equilibrium statistical mechanics are often Gaussian.
Let $I(f)$ be the large deviation function, which attains a minimum at $f = p$. Expanding around $f = p$:

$$I(f) \approx I(p) + I'(p)(f - p) + \frac{1}{2}I''(p)(f - p)^2$$

Since $I(p) = 0$ and $I'(p) = 0$ (by definition, the probability is maximal at $f = p$), we obtain:

$$I(f) \approx \frac{1}{2}I''(p)(f - p)^2$$

Substituting this into the large deviation form $P_N(f) \sim e^{-N I(f)}$:

$$P_N(f) \approx e^{-\frac{(f - p)^2}{2\sigma^2}}$$

where the variance is:

$$\sigma^2 = \frac{1}{N I''(p)}$$

This is a Gaussian distribution centered at $f = p$, with variance $\sigma^2$. Here, $f$ represents the fractional quantity $n/N$, though one may also express the distribution in terms of absolute particle numbers, $n = N f$.
Thus, the Taylor expansion of $I(f)$ near its minimum shows that, for large $N$, fluctuations become Gaussian, explaining why equilibrium statistical physics often exhibits Gaussian distributions.
Appendix: Explicit derivations#
Appendix A. Gaussian, or large-$N$, limit of the binomial distribution
For large values of $N$, we approximate the binomial distribution $P(n) = \binom{N}{n} p^n q^{N-n}$ (with $q = 1 - p$) by Taylor expanding its logarithm around the peak $\bar{n}$:

$$\log P(n) \approx \log P(\bar{n}) + \frac{d\log P}{dn}\bigg|_{\bar{n}}(n - \bar{n}) + \frac{1}{2}\frac{d^2\log P}{dn^2}\bigg|_{\bar{n}}(n - \bar{n})^2$$

Thus, from the outset, we aim for a Gaussian distribution. The task is to find the coefficients and justify that the third term in the Taylor expansion is negligible compared to the second.
We evaluate the derivative of $\log P(n)$ with respect to $n$, treating $n$ as continuous (for large $n$, $\frac{d}{dn}\log n! \approx \log n$):

$$\frac{d\log P}{dn} \approx -\log n + \log(N - n) + \log p - \log q = \log\frac{(N - n)\,p}{n\,q}$$

We could also arrive at the same result by using Stirling's approximation, $\log n! \approx n\log n - n$.

Setting the first derivative to zero, we find the peak of the distribution around which we expand:

$$\log\frac{(N - \bar{n})\,p}{\bar{n}\,q} = 0 \quad\Rightarrow\quad \bar{n} = Np$$

We recall that $Np$ is also the mean of the binomial distribution!

Having found the peak of the distribution and knowing the first derivative, we now proceed to compute the second derivative:

$$\frac{d^2\log P}{dn^2}\bigg|_{n = \bar{n}} = -\frac{1}{\bar{n}} - \frac{1}{N - \bar{n}} = -\frac{1}{Npq}$$
While the first derivative gave us the mean of the binomial distribution, we notice that the second derivative produces the variance $\sigma^2 = Npq$:

$$\log P(n) \approx \log P(\bar{n}) - \frac{(n - Np)^2}{2Npq}$$

Now, all that remains is to plug the coefficients into our approximated probability distribution and normalize it. Why normalize? The binomial was already properly normalized, but since we made an approximation by neglecting higher-order terms, we must re-normalize.

Normalizing the Gaussian distribution is done via the following integral:

$$\int_{-\infty}^{+\infty} e^{-\frac{(n - \bar{n})^2}{2\sigma^2}}\,dn = \sqrt{2\pi\sigma^2}$$

Finally, we obtain the normalized Gaussian approximation to the binomial distribution:

$$P(n) \approx \frac{1}{\sqrt{2\pi Npq}}\,e^{-\frac{(n - Np)^2}{2Npq}}$$
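A short plot comparing the exact binomial with this Gaussian approximation; the values $N = 100$, $p = 0.3$ are illustrative choices:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

N, p = 100, 0.3
n = np.arange(N + 1)

# Exact binomial pmf vs Gaussian with mean Np and variance Np(1-p)
plt.plot(n, binom.pmf(n, N, p), 'o', label='binomial')
plt.plot(n, norm.pdf(n, loc=N * p, scale=np.sqrt(N * p * (1 - p))), '-',
         label='Gaussian approximation')
plt.xlabel('n')
plt.ylabel('P(n)')
plt.legend()
plt.show()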
Appendix B. Poisson limit, or the limit of large $N$ and small $p$

This is a situation of rare events, like rain in a forest or radioactive decay of uranium, where each individual event has a small chance of happening, $p \rightarrow 0$, yet there is a large number of samples, $N \rightarrow \infty$, such that one has a constant average rate of events $\lambda = Np$.
In this limit the distribution is no longer well described by the Gaussian, as the shape of the distribution is heavily skewed due to the tiny values of $p$.

Writing the factorial ratio $\frac{N!}{(N - n)!} = N(N-1)\cdots(N - n + 1)$ explicitly, we realize that it is dominated by $N^n$ for $n \ll N$, and also $(1 - p)^{N - n} \approx (1 - p)^N$:

$$P(n) = \binom{N}{n}\, p^n (1 - p)^{N - n} \approx \frac{N^n}{n!}\, p^n\, (1 - p)^N$$

Next, let us plug in $p = \lambda/N$ and recall the definition of the exponential, $\lim_{N\to\infty}\left(1 - \frac{\lambda}{N}\right)^N = e^{-\lambda}$:

$$P(n) \approx \frac{\lambda^n}{n!}\, e^{-\lambda}$$
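A quick numerical comparison of the binomial with its Poisson limit; the values $N = 1000$, $p = 0.005$ (so $\lambda = 5$) are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, poisson

N, p = 1000, 0.005      # many trials, rare events
lam = N * p             # constant average rate, lambda = 5
n = np.arange(0, 21)

plt.plot(n, binom.pmf(n, N, p), 'o', label=f'binomial (N={N}, p={p})')
plt.plot(n, poisson.pmf(n, lam), '-', label=fr'Poisson ($\lambda$={lam:.0f})')
plt.xlabel('n')
plt.ylabel('P(n)')
plt.legend()
plt.show()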
Example: Gaussian limit of the large deviation function for a random walk

The large deviation rate function for a simple random walk is given by:

$$I(f) = f\log\frac{f}{p} + (1 - f)\log\frac{1 - f}{1 - p}$$

where $f$ is the empirical fraction of right steps and $p$ is the true probability of a right step.

Expansion Around the Minimum

The function $I(f)$ attains its minimum at $f = p$, where $I(p) = 0$. Let $f = p + \delta$ with small $\delta$.

Expanding the logarithms:

$$\log\frac{f}{p} = \log\left(1 + \frac{\delta}{p}\right) \approx \frac{\delta}{p} - \frac{\delta^2}{2p^2}, \qquad \log\frac{1 - f}{1 - p} = \log\left(1 - \frac{\delta}{1 - p}\right) \approx -\frac{\delta}{1 - p} - \frac{\delta^2}{2(1 - p)^2}$$

Substituting into $I(f)$ and keeping terms up to second order:

$$I(f) \approx \frac{\delta^2}{2p(1 - p)} = \frac{(f - p)^2}{2p(1 - p)}$$

Gaussian Limit

By the large deviation principle:

$$P_N(f) \sim e^{-N I(f)} \approx \exp\left(-\frac{N(f - p)^2}{2p(1 - p)}\right)$$

This is a Gaussian with variance:

$$\sigma^2 = \frac{p(1 - p)}{N}$$

Thus, the empirical frequency $f$ fluctuates around $p$ with Gaussian fluctuations of width $\sim\sqrt{p(1 - p)/N}$, which vanish as $N \rightarrow \infty$.
Problems#
Problem 1: Counting dice and coins#
You flip a coin 10 times and record the data in the form of head/tails or 1s and 0s
What would be the probability of landing 4 H’s?
What would be the probability of landing HHHTTTHHHT sequence?
In how many ways can we have 2 heads and 8 tails in this experiment?
Okay, now you got tired of flipping coins and decided to play some dice. You throw a die 10 times; what is the probability of never landing a 6?
You throw a die 3 times; what is the probability of obtaining a combined sum of 7?
Problem 2: Counting gas molecules#
A container of volume $V$ contains $N$ molecules of a gas and is divided into two parts with volumes $V_1$ and $V_2 = V - V_1$. Assume each molecule is equally likely to be found anywhere in the container.
What is the probability $p$ that a particular molecule is in each part?
What is the probability that $N_1$ molecules are in $V_1$ and $N_2 = N - N_1$ molecules are in $V_2$?
What is the average number of molecules in each part?
What are the relative fluctuations of the number of particles in each part?
Project: Porosity of materials#
A simple model of a porous rock can be imagined by placing a series of overlapping spheres at random into a container of fixed volume $V$.
For simplicity, consider a 2D system (e.g., overlapping disks placed at random into a box of fixed area).
You will need np.random.uniform() to randomly place N disks of volume v into volume V. Check out this Python library for porosity evaluation of materials: R. Shkarin et al., PLoS Comput Biol 2019.
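A minimal Monte Carlo sketch of how one might estimate porosity in 2D, assuming overlapping disks of radius r placed uniformly at random into a unit square; the disk count, radius, and number of probe points are illustrative choices, not values from the project statement:

import numpy as np

np.random.seed(0)

L = 1.0           # box side (2D container)
N_disks = 50      # number of randomly placed disks (illustrative)
r = 0.05          # disk radius (illustrative)
N_probe = 20_000  # Monte Carlo probe points

# Random disk centers inside the box (placed with np.random.uniform, as suggested)
centers = np.random.uniform(0, L, size=(N_disks, 2))

# Random probe points; a point is "solid" if it falls inside any disk
probes = np.random.uniform(0, L, size=(N_probe, 2))
d2 = ((probes[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
solid = (d2 < r**2).any(axis=1)

porosity = 1 - solid.mean()   # fraction of the box not covered by disks
print(f"Estimated porosity: {porosity:.3f}")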