
Markov Chains in Chemistry and Biology

import numpy as np
import matplotlib.pyplot as plt

Warm-up: a 3-state weather chain

A simple Markov chain on three states — sunny, cloudy, rainy — illustrates how sampling from the chain converges to the stationary distribution.

state_space = ("sunny", "cloudy", "rainy")

# Transition matrix: T[i,j] = P(next = j | current = i)
T_weather = np.array([[0.6, 0.3, 0.1],
                       [0.3, 0.4, 0.3],
                       [0.2, 0.3, 0.5]])

# Simulate the chain
n_steps = 20000
states = [0]
for _ in range(n_steps):
    states.append(np.random.choice(3, p=T_weather[states[-1]]))
states = np.array(states)

# Plot convergence of empirical frequencies to stationary distribution
fig, ax = plt.subplots(figsize=(8, 4))
offsets = np.arange(1, n_steps, 5)
colors = ['gold', 'silver', 'steelblue']

for i, (label, c) in enumerate(zip(state_space, colors)):
    freq = [np.sum(states[:n] == i) / n for n in offsets]
    ax.plot(offsets, freq, label=label, color=c, lw=1.5)

ax.set_xlabel("Number of steps")
ax.set_ylabel("Empirical frequency")
ax.set_title("Convergence to stationary distribution")
ax.legend()
ax.grid(True, ls=':', alpha=0.5)
plt.tight_layout()

Stationary distribution from eigenvalues

The stationary distribution $\boldsymbol{\pi}$ is the left eigenvector of $T$ with eigenvalue 1:

$$\boldsymbol{\pi} T = \boldsymbol{\pi}$$

The second-largest eigenvalue $\lambda_2$ controls the relaxation time $\tau = -1/\ln|\lambda_2|$: how many steps it takes to forget the initial condition.

# Eigenvalue analysis of the weather transition matrix
eigenvalues, eigvecs_left = np.linalg.eig(T_weather.T)

# Sort by magnitude (largest first)
idx = np.argsort(-np.abs(eigenvalues))
eigenvalues = eigenvalues[idx]
eigvecs_left = eigvecs_left[:, idx]

# Stationary distribution = normalized left eigenvector for eigenvalue 1
pi = np.real(eigvecs_left[:, 0])
pi = pi / pi.sum()

print("Eigenvalues:", np.round(eigenvalues, 4))
print("Stationary distribution:", np.round(pi, 4))
print(f"Relaxation time: tau = {-1/np.log(np.abs(eigenvalues[1])):.1f} steps")

# Verify: compare with long-run simulation frequencies
sim_freq = [np.mean(states == i) for i in range(3)]
print(f"Simulation frequencies: {np.round(sim_freq, 4)}")
Eigenvalues: [1.  0.4 0.1]
Stationary distribution: [0.3889 0.3333 0.2778]
Relaxation time: tau = 1.1 steps
Simulation frequencies: [0.3929 0.3375 0.2696]
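Convergence can also be seen directly from matrix powers: for an ergodic chain, $T^n$ approaches a rank-one matrix whose rows all equal $\boldsymbol{\pi}$. A quick sketch (not part of the original analysis) using the same weather matrix:

```python
import numpy as np

# Illustrative check: every row of T^n converges to pi for an ergodic chain
T_weather = np.array([[0.6, 0.3, 0.1],
                      [0.3, 0.4, 0.3],
                      [0.2, 0.3, 0.5]])

# Stationary distribution from the left eigenvector with eigenvalue 1
vals, vecs = np.linalg.eig(T_weather.T)
pi = np.real(vecs[:, np.argmax(np.abs(vals))])
pi /= pi.sum()

# 50 steps is ~45 relaxation times (tau ~ 1.1), so T^50 is effectively rank one
Tn = np.linalg.matrix_power(T_weather, 50)
print(np.round(Tn, 4))   # each row approx [0.3889 0.3333 0.2778]
print(np.round(pi, 4))
```

Since $\lambda_2 = 0.4$, the correction decays like $0.4^n$ and is negligible long before $n = 50$.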

Chemical isomerization: A ⇌ B

A reversible reaction $A \xrightarrow{p} B$, $B \xrightarrow{q} A$ is the simplest chemical Markov chain.

The transition matrix is:

$$T = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}$$

Detailed balance gives the equilibrium: $p \, \pi_A = q \, \pi_B$, so $\pi_A = q/(p+q)$ and $\pi_B = p/(p+q)$.
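Combining detailed balance with normalization $\pi_A + \pi_B = 1$ pins down the equilibrium completely, which is easy to verify numerically. A minimal sketch (not part of the original notebook) using the same rates as below:

```python
import numpy as np

# Check that the detailed-balance solution pi = (q, p)/(p+q)
# is a fixed point of the isomerization chain: pi T = pi
p, q = 0.3, 0.7
T = np.array([[1 - p, p],
              [q, 1 - q]])
pi = np.array([q, p]) / (p + q)

print(np.allclose(pi @ T, pi))   # True
print(pi)                        # [0.7 0.3]
```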

p = 0.3  # A -> B
q = 0.7  # B -> A

T_iso = np.array([[1-p, p],
                   [q, 1-q]])

# Simulate
n_steps = 2000
current = 0
history = np.zeros((n_steps, 2))
history[0, current] = 1

for step in range(1, n_steps):
    current = np.random.choice(2, p=T_iso[current])
    history[step, current] = 1

cum_prob = np.cumsum(history, axis=0) / np.arange(1, n_steps+1).reshape(-1, 1)

# Exact equilibrium
pi_A = q / (p + q)
pi_B = p / (p + q)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(cum_prob[:, 0], label='$P(A)$', color='steelblue', lw=1.5)
ax1.plot(cum_prob[:, 1], label='$P(B)$', color='coral', lw=1.5)
ax1.axhline(pi_A, color='steelblue', ls='--', alpha=0.5, label=rf'$\pi_A = {pi_A:.3f}$')
ax1.axhline(pi_B, color='coral', ls='--', alpha=0.5, label=rf'$\pi_B = {pi_B:.3f}$')
ax1.set(xlabel='Steps', ylabel='Cumulative frequency', title='Approach to equilibrium')
ax1.legend()
ax1.grid(True, ls=':', alpha=0.5)

# Equilibrium bar chart
sim_freq = np.sum(history, axis=0) / n_steps
ax2.bar(['A', 'B'], sim_freq, color=['steelblue', 'coral'], alpha=0.7, label='Simulation')
ax2.bar(['A', 'B'], [pi_A, pi_B], color='none', edgecolor='black', lw=2, label='Exact')
ax2.set(ylabel='Probability', title='Equilibrium distribution')
ax2.legend()

fig.tight_layout()
  • As $n \to \infty$, the fraction of time spent in each state converges to the theoretical equilibrium from detailed balance: $\pi_A = q/(p+q)$, $\pi_B = p/(p+q)$.

  • The relaxation eigenvalue is $\lambda_2 = 1 - p - q$, giving $\tau = -1/\ln|1-p-q|$ steps.
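The eigenvalue identity is easy to check numerically. A minimal sketch with illustrative parameters $p = 0.2$, $q = 0.3$ (chosen so that $\lambda_2 \neq 0$; these are not the values used above):

```python
import numpy as np

# The 2x2 chain has trace 2 - p - q and determinant 1 - p - q,
# so its eigenvalues are exactly 1 and 1 - p - q
p, q = 0.2, 0.3
T = np.array([[1 - p, p],
              [q, 1 - q]])
eigs = np.sort(np.real(np.linalg.eigvals(T)))[::-1]
print(np.round(eigs, 6))         # [1.  0.5]

tau = -1 / np.log(abs(1 - p - q))
print(f"tau = {tau:.2f} steps")  # tau = 1.44 steps
```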

Ion channel gating

Ion channels in cell membranes switch stochastically between closed and open conformations. A classic model is a 3-state channel:

$$\text{Closed} \;\xrightleftharpoons[\beta_1]{\alpha_1}\; \text{Open} \;\xrightleftharpoons[\beta_2]{\alpha_2}\; \text{Inactivated}$$

  • Closed → Open: the channel activates (rate $\alpha_1$)

  • Open → Inactivated: the channel inactivates (rate $\alpha_2$)

  • Reverse transitions have rates $\beta_1$ and $\beta_2$

This is the basis of Hodgkin-Huxley-type models in neuroscience. Let’s simulate single-channel dynamics and see how the dwell-time distributions emerge.

# Ion channel: Closed (0) <-> Open (1) <-> Inactivated (2)
# Transition probabilities per time step (discretized rates)
a1, b1 = 0.08, 0.04   # Closed <-> Open
a2, b2 = 0.03, 0.01   # Open <-> Inactivated

T_channel = np.array([
    [1-a1,   a1,    0   ],   # Closed
    [b1,   1-b1-a2, a2  ],   # Open
    [0,      b2,    1-b2],   # Inactivated
])

# Simulate single-channel trace
n_steps = 50000
state = 0
trace = np.zeros(n_steps, dtype=int)
for t in range(1, n_steps):
    state = np.random.choice(3, p=T_channel[state])
    trace[t] = state

# Extract dwell times in the Open state
is_open = (trace == 1).astype(int)
transitions = np.diff(is_open)
open_starts = np.where(transitions == 1)[0] + 1
open_ends = np.where(transitions == -1)[0] + 1
n_dwells = min(len(open_starts), len(open_ends))
dwell_times = open_ends[:n_dwells] - open_starts[:n_dwells]

# Stationary distribution
eigenvalues, eigvecs = np.linalg.eig(T_channel.T)
idx = np.argmax(np.abs(eigenvalues))
pi_ch = np.real(eigvecs[:, idx])
pi_ch /= pi_ch.sum()

# Plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
labels_ch = ['Closed', 'Open', 'Inactivated']
colors_ch = ['#4477AA', '#EE6677', '#CCBB44']

# Single-channel trace (first 2000 steps)
ax = axes[0]
window = slice(0, 2000)
ax.step(np.arange(2000), trace[window], where='post', lw=0.6, color='k')
ax.set_yticks([0, 1, 2])
ax.set_yticklabels(labels_ch)
ax.set(xlabel='Time step', title='Single-channel trace')
# Shade open state
ax.fill_between(np.arange(2000), 0, 2, where=(trace[window]==1), 
                alpha=0.2, color=colors_ch[1], step='post')

# Dwell time distribution (Open state)
ax = axes[1]
ax.hist(dwell_times, bins=30, density=True, alpha=0.7, color=colors_ch[1], edgecolor='white')
# Expected: geometric distribution with rate (b1 + a2) per step
t_range = np.arange(1, max(dwell_times))
p_exit = b1 + a2
ax.plot(t_range, p_exit * (1-p_exit)**(t_range-1), 'k-', lw=2, label=f'Geometric($p={p_exit:.2f}$)')
ax.set(xlabel='Dwell time (steps)', ylabel='Density', title='Open-state dwell times')
ax.legend()

# Stationary distribution
ax = axes[2]
sim_freq = [np.mean(trace == i) for i in range(3)]
x = np.arange(3)
ax.bar(x - 0.15, sim_freq, 0.3, color=colors_ch, alpha=0.7, label='Simulation')
ax.bar(x + 0.15, pi_ch, 0.3, color='none', edgecolor='black', lw=2, label='Eigenvalue')
ax.set_xticks(x)
ax.set_xticklabels(labels_ch)
ax.set(ylabel='Probability', title='Stationary distribution')
ax.legend()

fig.tight_layout()
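The geometric fit implies a mean open dwell time of $1/(b_1 + a_2) \approx 14.3$ steps, the inverse of the per-step exit probability. A self-contained sketch (fixed seed, not part of the original notebook) confirming this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Re-simulate the 3-state channel and compare the mean open dwell
# time against the theoretical value 1 / (b1 + a2)
a1, b1 = 0.08, 0.04
a2, b2 = 0.03, 0.01
T = np.array([[1 - a1, a1, 0.0],
              [b1, 1 - b1 - a2, a2],
              [0.0, b2, 1 - b2]])

n_steps = 100000
trace = np.zeros(n_steps, dtype=int)
s = 0
for t in range(1, n_steps):
    s = rng.choice(3, p=T[s])
    trace[t] = s

# Complete open dwells: paired entries into and exits from state 1
is_open = (trace == 1).astype(int)
d = np.diff(is_open)
starts = np.where(d == 1)[0] + 1
ends = np.where(d == -1)[0] + 1
n = min(len(starts), len(ends))
dwells = ends[:n] - starts[:n]

print(f"Mean open dwell (simulation): {dwells.mean():.1f} steps")
print(f"Mean open dwell (theory):     {1 / (b1 + a2):.1f} steps")
```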

Protein conformational dynamics

Many proteins fluctuate between distinct conformational states. A coarse-grained Markov State Model (MSM) captures the essential kinetics:

$$\text{Unfolded (U)} \;\xrightleftharpoons{}\; \text{Intermediate (I)} \;\xrightleftharpoons{}\; \text{Folded (F)}$$

Key observables:

  • Relaxation timescales from eigenvalues: how long does it take to fold?

  • Flux at steady state: under non-equilibrium conditions, is there a net current through the cycle?

Let’s simulate a 3-state protein with an added rare pathway U → F (misfolding or direct folding).

# Protein MSM: U(0) <-> I(1) <-> F(2), plus rare U <-> F shortcut
T_protein = np.array([
    [0.90, 0.08, 0.02],  # Unfolded
    [0.05, 0.85, 0.10],  # Intermediate
    [0.01, 0.04, 0.95],  # Folded
])

states_label = ['Unfolded', 'Intermediate', 'Folded']

# Eigenvalue analysis
eigvals, eigvecs = np.linalg.eig(T_protein.T)
idx = np.argsort(-np.abs(eigvals))
eigvals = eigvals[idx].real
eigvecs = eigvecs[:, idx].real

pi_prot = eigvecs[:, 0] / eigvecs[:, 0].sum()
tau_slow = -1 / np.log(np.abs(eigvals[1]))
tau_fast = -1 / np.log(np.abs(eigvals[2]))

print(f"Stationary distribution: U={pi_prot[0]:.3f}, I={pi_prot[1]:.3f}, F={pi_prot[2]:.3f}")
print(f"Eigenvalues: {np.round(eigvals, 4)}")
print(f"Slow relaxation time:  tau_1 = {tau_slow:.1f} steps  (folding timescale)")
print(f"Fast relaxation time:  tau_2 = {tau_fast:.1f} steps  (U <-> I equilibration)")
Stationary distribution: U=0.181, I=0.249, F=0.570
Eigenvalues: [1.     0.9066 0.7934]
Slow relaxation time:  tau_1 = 10.2 steps  (folding timescale)
Fast relaxation time:  tau_2 = 4.3 steps  (U <-> I equilibration)
# Simulate starting from Unfolded and watch relaxation
n_steps = 5000
n_traj = 500

# Run many independent trajectories starting from U
all_states = np.zeros((n_traj, n_steps), dtype=int)
for traj in range(n_traj):
    s = 0  # start unfolded
    all_states[traj, 0] = s
    for t in range(1, n_steps):
        s = np.random.choice(3, p=T_protein[s])
        all_states[traj, t] = s

# Compute time-dependent populations P(state, t) averaged over trajectories
P_t = np.zeros((n_steps, 3))
for i in range(3):
    P_t[:, i] = np.mean(all_states == i, axis=0)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
colors_prot = ['#CC6677', '#DDCC77', '#88CCEE']

# Left: relaxation of populations
ax = axes[0]
for i, (lbl, c) in enumerate(zip(states_label, colors_prot)):
    ax.plot(P_t[:, i], label=lbl, color=c, lw=2)
    ax.axhline(pi_prot[i], color=c, ls='--', alpha=0.4)
ax.axvline(tau_slow, color='gray', ls=':', alpha=0.6, label=f'$\\tau_{{slow}}={tau_slow:.0f}$')
ax.set(xlabel='Time step', ylabel=r'$P(\mathrm{state}, t)$',
       title='Relaxation from Unfolded state')
ax.legend(fontsize=9)
ax.grid(True, ls=':', alpha=0.4)

# Right: implied timescales (demonstrate lag-time dependence)
lag_times = [1, 2, 5, 10, 20, 50]
implied_ts = []
for lag in lag_times:
    # Build count matrix at this lag
    C = np.zeros((3, 3))
    for traj in range(n_traj):
        for t in range(0, n_steps - lag, lag):
            C[all_states[traj, t], all_states[traj, t+lag]] += 1
    # Row-normalize to get transition matrix at lag
    T_lag = C / C.sum(axis=1, keepdims=True)
    eigs = np.sort(np.abs(np.linalg.eigvals(T_lag)))[::-1]
    # Implied timescale = -lag / ln(lambda)
    its = [-lag / np.log(e) if e > 0 and e < 1 else np.nan for e in eigs[1:]]
    implied_ts.append(its)

implied_ts = np.array(implied_ts)
ax = axes[1]
ax.plot(lag_times, implied_ts[:, 0], 'o-', color=colors_prot[0], lw=2, label='$\\tau_1$ (slow)')
ax.plot(lag_times, implied_ts[:, 1], 's-', color=colors_prot[1], lw=2, label='$\\tau_2$ (fast)')
ax.axhline(tau_slow, color=colors_prot[0], ls='--', alpha=0.4)
ax.axhline(tau_fast, color=colors_prot[1], ls='--', alpha=0.4)
ax.set(xlabel='Lag time', ylabel='Implied timescale (steps)',
       title='Implied timescales vs lag time')
ax.legend()
ax.grid(True, ls=':', alpha=0.4)

fig.tight_layout()
  • Left panel: Starting from 100% Unfolded, the populations relax toward equilibrium on a timescale set by $\tau_{\text{slow}}$ (the folding time).

  • Right panel: The implied timescale test is a standard validation for Markov State Models. If the model is Markovian at lag time $\tau$, the implied timescales plateau, becoming independent of lag time. This is how researchers validate MSMs built from molecular dynamics simulations.
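The plateau is exact for this toy model, since the dynamics are Markovian by construction: $T(\text{lag}) = T^{\text{lag}}$, so $\lambda_i(\text{lag}) = \lambda_i^{\text{lag}}$ and $-\text{lag}/\ln\lambda_i(\text{lag})$ is constant. A short sketch (redefining `T_protein` as `T` so the cell is self-contained):

```python
import numpy as np

# For an exactly Markovian chain, implied timescales computed from
# matrix powers are identical at every lag
T = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.85, 0.10],
              [0.01, 0.04, 0.95]])

its_by_lag = []
for lag in (1, 2, 5, 10):
    T_lag = np.linalg.matrix_power(T, lag)
    lam = np.sort(np.abs(np.linalg.eigvals(T_lag)))[::-1]
    its = -lag / np.log(lam[1:])   # skip lambda_1 = 1
    its_by_lag.append(its)
    print(f"lag={lag:2d}  implied timescales = {np.round(its, 2)}")
```

In the simulation above, the estimates fluctuate around these exact values because the count matrices are built from finite trajectory data.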

Problems

Problem 1: Varying forward and reverse rates

For the isomerization reaction A ⇌ B:

  • Systematically vary $p$ and $q$ (e.g., $p = 0.1, 0.3, 0.5$ with a fixed $q$, say $q = 0.2$; note that choosing $q = 1 - p$ makes both rows of $T$ identical, so $\lambda_2 = 0$ and the chain equilibrates in a single step).

  • For each pair, compute the relaxation time $\tau = -1/\ln|1 - p - q|$ and verify it against the simulation.

  • Plot $\tau$ vs $p$ and explain the trend physically.

Problem 2: Ion channel pharmacology

Modify the ion channel model to simulate a drug that stabilizes the inactivated state (reduce $\beta_2$ from 0.01 to 0.001).

  • How does the stationary probability of the Open state change?

  • How does the mean open dwell time change?

  • Plot the single-channel trace and compare with the untreated case.

Problem 3: First passage time to folding

Using the protein MSM:

  • Run 1000 trajectories starting from the Unfolded state.

  • Record the first passage time to the Folded state for each trajectory.

  • Plot the distribution of first passage times. Is it exponential?

  • Compare the mean first passage time to $\tau_{\text{slow}}$ from eigenvalue analysis. Are they the same? Why or why not?