2 Normal Models

2.1 Normal Modeling

2.2 The Bayesian Toolkit

2.2.1 Posterior Distribution

Given an iid (independent and identically distributed) sample \(\mathcal{D}_n = (x_1, ..., x_n)\) from a density \(f(x|\theta)\) conditional on an unknown parameter \(\theta \in \Theta\), the likelihood function is

\[\ell(\theta|\mathcal{D}_n)=\prod_{i=1}^{n}f(x_i|\theta)\]

For example, for a sample of size \(n\) from a normal \(\mathcal{N}(\mu,\sigma^2)\) distribution, so that the unknown parameter is \(\theta=(\mu,\sigma^2)\), the likelihood function is

\[\ell(\theta|\mathcal{D}_n)=\prod_{i=1}^{n}\frac{\exp[-(x_i-\mu)^2/2\sigma^2]}{\sqrt{2\pi}\,\sigma}\]

\[\propto \frac{\exp[-\sum_{i=1}^{n}(x_i-\mu)^2/2\sigma^2]}{\sigma^{n}}\]

\[\propto \frac{\exp[-(n(\mu-\bar{x})^2+s^2)/2\sigma^2]}{\sigma^n}\]

where

  • \(\propto\): means “proportional to”
  • \(\bar{x}\): the empirical mean \(\frac{1}{n}\sum_{i=1}^{n}x_i\)
  • \(s^2\): the sum of squares \(\sum_{i=1}^{n}(x_i-\bar{x})^2\)

The last line uses the identity \(\sum_{i=1}^{n}(x_i-\mu)^2 = n(\mu-\bar{x})^2 + s^2\), so the likelihood depends on the data only through \(\bar{x}\) and \(s^2\); that is, \(\bar{x}\) and \(s^2\) are sufficient statistics. The sketch below checks this reduction numerically.
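As a quick numerical check (a sketch, not from the source; the simulated data and parameter values are arbitrary), the log-likelihood computed from the raw observations coincides with the form written in terms of \(\bar{x}\) and \(s^2\) alone:

```python
import numpy as np

# Simulated sample (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=50)
n, x_bar = len(x), x.mean()
s2 = ((x - x_bar) ** 2).sum()

def loglik_full(mu, sigma2):
    """Normal log-likelihood summed over the individual observations."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - ((x - mu) ** 2).sum() / (2 * sigma2)

def loglik_sufficient(mu, sigma2):
    """Same log-likelihood, written only in terms of (x_bar, s2)."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (n * (mu - x_bar) ** 2 + s2) / (2 * sigma2)

# The two expressions agree for any (mu, sigma2): the data enter only via x_bar and s2.
for mu, sigma2 in [(0.0, 1.0), (1.5, 4.0), (-2.0, 0.5)]:
    assert np.isclose(loglik_full(mu, sigma2), loglik_sufficient(mu, sigma2))
```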

The Bayesian approach turns the likelihood function into a posterior distribution via the classical Bayes’ formula (or theorem)

\[\pi(\theta|\mathcal{D}_n)=\frac{\ell(\theta|\mathcal{D}_n)\pi(\theta)}{\int \ell(\theta|\mathcal{D}_n)\pi(\theta)\, d\theta}\]

where the factor \(\pi(\theta)\) is called the prior distribution; a grid-based sketch of this normalization is given after the list below. Two motivations for introducing the prior:

  • first motivation: the prior distribution summarizes the available prior information (knowledge) about \(\theta\). In practice, however, the choice of \(\pi(\theta)\) is often made on practical grounds rather than from strong subjective beliefs or overwhelming prior information
  • second motivation: it provides a fully probabilistic framework for the inferential analysis
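For concreteness, here is a minimal sketch (not from the source; the data, prior, and grid are assumed for illustration) of Bayes’ formula applied numerically: the unnormalized product likelihood \(\times\) prior is evaluated on a grid of \(\mu\) values, with \(\sigma\) treated as known, and divided by its integral.

```python
import numpy as np
from scipy import stats

# Simulated data with a known sigma (assumed values, for illustration only).
rng = np.random.default_rng(1)
sigma = 1.0
x = rng.normal(loc=0.8, scale=sigma, size=20)

# Grid approximation of pi(mu | D) = l(mu | D) * pi(mu) / integral(...).
mu_grid = np.linspace(-3.0, 3.0, 2001)
prior = stats.norm.pdf(mu_grid, loc=0.0, scale=sigma)          # pi(mu): N(0, sigma^2)
lik = np.array([stats.norm.pdf(x, loc=m, scale=sigma).prod() for m in mu_grid])
unnormalized = lik * prior
posterior = unnormalized / np.trapz(unnormalized, mu_grid)     # normalize by the integral
```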

Consider the simplest case, a normal distribution \(\mathcal{N}(\mu, \sigma^2)\) with known variance \(\sigma^2\). If the prior distribution on \(\mu\) is the normal \(\mathcal{N}(0,\sigma^2)\), the posterior distribution is

\[\pi(\mu|\mathcal{D}_n) \propto \pi(\mu)\,\ell(\mu|\mathcal{D}_n)\]

\[\propto \exp[-\mu^2/2\sigma^2]\,\exp[-n(\bar{x}-\mu)^2/2\sigma^2]\]

\[\propto \exp[-(n+1)(\mu-n\bar{x}/(n+1))^2/2\sigma^2]\]

where the last step completes the square in \(\mu\) and drops the terms that do not depend on \(\mu\). This means that the posterior distribution of \(\mu\) is a normal distribution with mean \(n\bar{x}/(n+1)\) and variance \(\sigma^2/(n+1)\).

The posterior mean differs from the classical estimator \(\bar{x}\) because the prior information that \(\mu\) is close to zero is taken into account by the posterior distribution, which thus shrinks the original estimate towards zero.
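A short sketch of the resulting shrinkage (assumed data and values, not from the source): the posterior mean \(n\bar{x}/(n+1)\) pulls the classical estimate \(\bar{x}\) towards the prior mean zero, more strongly for small \(n\).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                                   # known standard deviation (assumed value)
x = rng.normal(loc=3.0, scale=sigma, size=10)
n, x_bar = len(x), x.mean()

posterior_mean = n * x_bar / (n + 1)          # shrunk towards the prior mean 0
posterior_var = sigma ** 2 / (n + 1)
print(f"x_bar = {x_bar:.3f}, posterior mean = {posterior_mean:.3f}, "
      f"posterior variance = {posterior_var:.3f}")
```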

Now assume that \(\sigma^2\) is unknown, so we have to introduce a further prior distribution on \(\sigma^2\). Assume that \(1/\sigma^2\) follows an exponential distribution \(\mathcal{E}(1)\). The prior distribution on \(\sigma^2\) is then

\[\pi(\sigma^2) = \exp(-\sigma^{-2})\left|\frac{d\sigma^{-2}}{d\sigma^2}\right|=\exp(-\sigma^{-2})(\sigma^2)^{-2}\]

(Note: see Theorem 2.1.5, the change-of-variable formula, in Casella and Berger.) This is a special case of an inverse gamma distribution, namely \(\mathcal{IG}(1,1)\).
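As a sanity check (a simulation sketch, not from the source; the sample size is arbitrary), drawing \(1/\sigma^2\) from \(\mathcal{E}(1)\) and inverting should reproduce the \(\mathcal{IG}(1,1)\) distribution, which SciPy parameterizes as `invgamma(a=1, scale=1)`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
precision = rng.exponential(scale=1.0, size=200_000)   # 1/sigma^2 ~ E(1)
sigma2 = 1.0 / precision                               # change of variable

# Kolmogorov-Smirnov comparison of the draws against the IG(1, 1) distribution.
print(stats.kstest(sigma2, stats.invgamma(a=1, scale=1).cdf))
```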

The corresponding posterior density on \(\theta\) is then given by

\[\pi((\mu, \sigma^2)|\mathcal{D}_n) \propto \pi(\sigma^2)\,\pi(\mu|\sigma^2)\,\ell((\mu,\sigma^2)|\mathcal{D}_n)\]

\[\propto (\sigma^2)^{-2}e^{-1/\sigma^2} \times (\sigma^2)^{-1/2}e^{-\mu^2/2\sigma^2} \times (\sigma^2)^{-n/2}e^{-[n(\mu-\bar{x})^2+s^2]/2\sigma^2}\]

\[\propto (\sigma^2)^{-1/2}e^{-(n+1)[\mu-n\bar{x}/(n+1)]^2/2\sigma^2} \times (\sigma^2)^{-(n+2)/2-1}e^{-[2+s^2+n\bar{x}^2/(n+1)]/2\sigma^2}\]

So the posterior on \(\theta\) is the product of \(\mathcal{IG}((n+2)/2,\,[2+s^2+n\bar{x}^2/(n+1)]/2)\) (an inverse gamma distribution on \(\sigma^2\)) and \(\mathcal{N}(n\bar{x}/(n+1),\,\sigma^2/(n+1))\) (a normal distribution on \(\mu\), conditionally on \(\sigma^2\)).

The marginal posterior in \(\mu\) is then a Student’s t distribution

\[\mu|\mathcal{D}_n \sim \mathcal{T}\left(n+2,\ \frac{n\bar{x}}{n+1},\ \frac{2+s^2+n\bar{x}^2/(n+1)}{(n+1)(n+2)}\right)\]

with \(n+2\) degrees of freedom, a location parameter proportional to \(\bar{x}\), and a scale parameter (almost) proportional to \(s\).
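The following sketch (assumed data and values, not from the source) draws from the joint posterior by composition, first \(\sigma^2\) from the inverse gamma and then \(\mu|\sigma^2\) from the normal, and compares the resulting \(\mu\) draws with the Student’s t marginal above:

```python
import numpy as np
from scipy import stats

# Simulated data (illustrative values only).
rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=1.5, size=30)
n, x_bar = len(x), x.mean()
s2 = ((x - x_bar) ** 2).sum()

# Posterior: sigma^2 ~ IG((n+2)/2, b) and mu | sigma^2 ~ N(n*x_bar/(n+1), sigma^2/(n+1)).
a = (n + 2) / 2
b = (2 + s2 + n * x_bar ** 2 / (n + 1)) / 2
sigma2 = stats.invgamma(a=a, scale=b).rvs(size=100_000, random_state=rng)
mu = rng.normal(loc=n * x_bar / (n + 1), scale=np.sqrt(sigma2 / (n + 1)))

# Marginal of mu: Student's t with n+2 df, location n*x_bar/(n+1),
# squared scale (2 + s2 + n*x_bar^2/(n+1)) / ((n+1)(n+2)) = 2b / ((n+1)(n+2)).
t_marginal = stats.t(df=n + 2, loc=n * x_bar / (n + 1),
                     scale=np.sqrt(2 * b / ((n + 1) * (n + 2))))
print(stats.kstest(mu, t_marginal.cdf))
```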

2.2.2 Bayesian Estimates