# Maximum Likelihood Estimation

## Intro

This blog post is supplementary material for my students in the
*Introduction to Mathematical Statistics (STAT 3445)* class. The main goal
is to give a brief, but clear, explanation of how Maximum Likelihood
Estimation (MLE for short) works.

This post works through an example based on the Normal distribution.

One of the key concepts for the MLE is the Likelihood Function. The Likelihood
Function can be seen as the *joint pdf* (or pmf) evaluated at the observed data
and viewed as a function of the parameter, i.e., keeping the data fixed. When
working with only one random variable, let's say \(X\), the Likelihood function
is the pdf (or pmf) itself. For example, consider a problem where
we are told that \(X\) follows a *Binomial* distribution with \(n = 10\) and \(p\)
unknown. We know that
\[
P(X = x) = {{10}\choose{x}} p^x(1 - p)^{10 - x}.
\]
Suppose now that, besides knowing that \(X\) follows a \(Binomial(10, p)\), we also
observe the realization \(x = 4\). What we are now interested in inferring is the
value of \(p\) that maximizes \(P(X = 4)\). Can you notice how different it is
from finding a probability? We are still working with probability distributions,
but with a different goal.
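
To make this concrete, here is a minimal Python sketch (standard library only) that evaluates the Binomial likelihood for the observed \(x = 4\) over a grid of candidate values of \(p\) and picks the largest one. The function name `binomial_likelihood` and the grid resolution are my own choices for illustration, not part of any library.

```python
from math import comb


def binomial_likelihood(p, x=4, n=10):
    """Likelihood of p for one Binomial(n, p) observation x (the data is held fixed)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Candidate values of p strictly inside (0, 1); the step of 0.001 is an arbitrary choice.
grid = [k / 1000 for k in range(1, 1000)]

# The grid point with the largest likelihood approximates the maximizer.
p_hat = max(grid, key=binomial_likelihood)
print(p_hat)
```

The grid search is only a crude numerical stand-in for the calculus-based approach; it is here to show that we are searching over values of \(p\), not over values of \(x\).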

### Example 1 - Normal Distribution

Suppose that \(X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)\). How do we find the MLE for \(\mu\)? First of all, note that, since the random variables are identically distributed, each of them has the same pdf, \[ f_i(x_i) = f(x_i) = \frac{1}{\sqrt{2\pi}} \exp \left \{ - \frac{(x_i - \mu)^2}{2} \right \}. \]

Moreover, since the random variables are independent, their joint pdf is equal to the product of the marginals, i.e., \[\begin{align*} f(x_1, \cdots, x_n) & = \prod_{i = 1}^n f(x_i) \\ & = f(x_1) f(x_2) \cdots f(x_n) \\ & = \frac{1}{(2\pi)^{n/2}} \exp \left \{ - \frac{\sum_{i = 1}^n (x_i - \mu)^2}{2} \right \}. \end{align*}\]

Finally, as stated before, the Likelihood function is given by \[\begin{align*} L(\mu|X_1 = x_1, \cdots, X_n = x_n) & = f(x_1, \cdots, x_n) \\ & = \frac{1}{(2\pi)^{n/2}} \exp \left \{ - \frac{\sum_{i = 1}^n (x_i - \mu)^2}{2} \right \}. \end{align*}\] Note that the data is fixed and the Likelihood is a function of \(\mu\). Maximizing this function with respect to \(\mu\) gives us the value of this parameter under which the observed data is most likely to have been generated. The idea is that there is an infinite number of Normal distributions with \(\sigma^2 = 1\), each one associated with one value of \(\mu\). For each value of \(\mu\), the joint pdf assigns a density to the observed data. Finding the Maximum Likelihood estimator of \(\mu\) gives us the value of this parameter that maximizes this density.

Usually, maximizing the Likelihood function directly is quite complicated. An alternative
is to maximize the *log-Likelihood* function instead. The *log-Likelihood* function,
as the name suggests, is the natural *log* of the Likelihood function and,
since the *log* is a strictly increasing function, the value that maximizes the
log-likelihood also maximizes the Likelihood function.
For this problem, the log-likelihood, denoted by \(\ell(\mu|\mathbf{X})\), where
\(\mathbf{X} = \{ X_1, \ldots, X_n \}\), is given by
\[\begin{align*}
\ell(\mu|\mathbf{X}) &= \log\left( L(\mu|\mathbf{X}) \right) \\
& = -\frac{n}{2} \log(2\pi) - \frac{\sum_{i = 1}^n (x_i^2 - 2\mu x_i + \mu^2)}{2} \\
& = -\frac{n}{2} \log(2\pi) - \frac{\sum_{i = 1}^n x_i^2}{2} + \mu \sum_{i = 1}^n x_i - \frac{n\mu^2}{2}.
\end{align*}\]
Now that we have \(\ell(\mu|\mathbf{X})\) in a simple form, we can maximize
it. To do so, we take the first derivative with respect to \(\mu\), set it
equal to 0, and solve for \(\mu\); call the solution \(\hat{\mu}\). Then, we take
the second derivative and verify that it is less than zero at \(\mu = \hat{\mu}\).
The first derivative of the log-likelihood is given by:
\[\begin{align*}
\ell'(\mu|\mathbf{X}) &= \frac{\partial \ell(\mu|\mathbf{X})}{\partial \mu} \\
& = \sum_{i = 1}^n x_i - n\mu.
\end{align*}\]
Setting it equal to zero and solving for \(\mu\) gives us the Maximum Likelihood
estimator of \(\mu\), which is the sample mean
\[
\hat{\mu} = \frac{1}{n} \sum_{i = 1}^n x_i = \bar{x}.
\]
Now, we only have to take the second derivative of the *log-Likelihood* function
to be sure that \(\hat{\mu}\) is a point of maximum.
\[\begin{align*}
\ell''(\mu|\mathbf{X}) &= \frac{\partial \ell'(\mu|\mathbf{X})}{\partial \mu} \\
& = -n.
\end{align*}\]
Note that \(-n < 0\) for all possible values of \(\mu\), thus the log-likelihood is concave
and \(\hat{\mu} = \bar{x}\) is indeed the Maximum Likelihood Estimator for \(\mu\).
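
If you would like to check the result \(\hat{\mu} = \bar{x}\) numerically, the sketch below simulates data from a \(N(\mu, 1)\) and compares the sample mean with the value of \(\mu\) that maximizes the log-likelihood over a grid. It uses only the Python standard library. The true mean of 2.0, the sample size, the seed, and the grid are all arbitrary assumptions made for illustration.

```python
import math
import random
import statistics


def log_likelihood(mu, xs):
    """Log-likelihood of mu for iid N(mu, 1) observations xs."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in xs)


random.seed(0)  # arbitrary seed, only for reproducibility
xs = [random.gauss(2.0, 1.0) for _ in range(200)]  # assumed true mean: 2.0

# Closed-form MLE derived above: the sample mean.
x_bar = statistics.fmean(xs)

# Crude grid search around the sample mean (step 0.001, half-width 0.5).
grid = [x_bar + k / 1000 for k in range(-500, 501)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, xs))

print(x_bar, mu_grid)
```

The two printed values should agree up to the grid resolution. A grid search is used here instead of a proper optimizer just to keep the example dependency-free.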

## Recap

- Find the Likelihood function by finding the joint pdf (or pmf) and treating it
as a function of the parameter (or parameters) that you want to estimate;

- Take the log of the Likelihood function;

- Take the first derivative of the
*log-Likelihood* function, set it equal to zero, and solve for the parameter that you want to estimate;

- Find the second derivative of the
*log-Likelihood* and analyze it to be sure that the value found for the parameter is in fact a point of maximum.
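
As a short illustration of these four steps, here is how they would play out for the Binomial example from the introduction, assuming a single observation \(x\) from a \(Binomial(10, p)\) with \(0 < x < 10\) and \(0 < p < 1\):
\[\begin{align*}
L(p|x) & = {{10}\choose{x}} p^x (1 - p)^{10 - x}, \\
\ell(p|x) & = \log {{10}\choose{x}} + x \log(p) + (10 - x) \log(1 - p), \\
\ell'(p|x) & = \frac{x}{p} - \frac{10 - x}{1 - p}, \\
\ell''(p|x) & = -\frac{x}{p^2} - \frac{10 - x}{(1 - p)^2} < 0.
\end{align*}\]
Setting \(\ell'(p|x) = 0\) gives \(x(1 - p) = (10 - x)p\), that is, \(\hat{p} = x/10\). Since the second derivative is negative for every \(p\) in \((0, 1)\), this is a point of maximum. For the observed \(x = 4\), we get \(\hat{p} = 0.4\).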