Maximum Likelihood Estimation


This blog post is supplementary material for my students in the Introduction to Mathematical Statistics (STAT 3445) class. The main goal is to give a brief but clear explanation of how Maximum Likelihood Estimation (MLE for short) works.

This post covers one example, based on the Normal distribution.

One of the key concepts for the MLE is the Likelihood function. The Likelihood function can be seen as the joint pdf (or pmf) viewed as a function of the parameter, i.e., keeping the data fixed. When working with only one random variable, let's say \(X\), the Likelihood function is the pdf (or pmf) of \(X\) itself, evaluated at the observed value. For example, consider a problem where we are told that \(X\) follows a Binomial distribution with \(n = 10\) and \(p\) unknown. We know that \[ P(X = x) = {{10}\choose{x}} p^x(1 - p)^{10 - x}. \] Suppose now that, besides knowing that \(X\) follows a \(Binomial(10, p)\), we also observe the realization \(x = 4\). What we are now interested in inferring is the value of \(p\) that maximizes \(P(X = 4)\). Can you notice how different this is from finding a probability? We are still working with probability distributions, but with a different goal.
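To make the Binomial illustration concrete, here is a minimal Python sketch that evaluates the likelihood for the observed value \(x = 4\) over a grid of candidate values of \(p\). The function name `binomial_likelihood` and the grid resolution are choices made for this illustration only.

```python
from math import comb


def binomial_likelihood(p, x=4, n=10):
    """Likelihood of p for a single Binomial(n, p) observation x."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Candidate values of p: 0.00, 0.01, ..., 1.00 (an arbitrary grid).
grid = [i / 100 for i in range(101)]

# The data (x = 4) stay fixed; only p varies.
p_best = max(grid, key=binomial_likelihood)

print(p_best)                       # 0.4
print(binomial_likelihood(p_best))  # likelihood at the grid maximizer
```

The grid maximizer coincides with the observed proportion \(4/10\), which is what one would expect for this model.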

Example 1 - Normal Distribution

Suppose that \(X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)\). How do we find the MLE for \(\mu\)? First of all, note that, since the random variables are identically distributed, we have that \[ f_i(x_i) = f(x_i) = \frac{1}{\sqrt{2\pi}} \exp \left \{ - \frac{(x_i - \mu)^2}{2} \right \}. \]

Moreover, since the random variables are independent, their joint distribution is equal to the product of the marginals, i.e., \[\begin{align*} f(x_1, \cdots, x_n) & = \prod_{i = 1}^n f(x_i) \\ & = f(x_1) f(x_2) \cdots f(x_n) \\ & = \frac{1}{(2\pi)^{n/2}} \exp \left \{ - \frac{\sum_{i = 1}^n (x_i - \mu)^2}{2} \right \}. \end{align*}\]
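As a quick sanity check of the factorization above, the following sketch compares the product of the \(N(\mu, 1)\) marginal densities with the closed-form joint density. The sample `xs`, the value of `mu`, and the function names are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mu):
    """Density of N(mu, 1) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / 2) / math.sqrt(2 * math.pi)


def joint_pdf_product(xs, mu):
    """Joint density computed as the product of the marginal densities."""
    return math.prod(normal_pdf(x, mu) for x in xs)


def joint_pdf_closed_form(xs, mu):
    """Joint density computed from the closed-form expression."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return math.exp(-sum_sq / 2) / (2 * math.pi) ** (n / 2)


xs = [1.2, -0.3, 0.8, 2.1, 0.5]  # hypothetical sample
mu = 1.0                         # arbitrary value of the parameter

# Both numbers should agree up to floating-point error.
print(joint_pdf_product(xs, mu), joint_pdf_closed_form(xs, mu))
```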

Finally, as stated before, the Likelihood function is given by \[\begin{align*} L(\mu|X_1 = x_1, \cdots, X_n = x_n) & = f(x_1, \cdots, x_n) \\ & = \frac{1}{(2\pi)^{n/2}} \exp \left \{ - \frac{\sum_{i = 1}^n (x_i - \mu)^2}{2} \right \}. \end{align*}\] Note that the data are fixed and the Likelihood is a function of \(\mu\). Maximizing this function with respect to \(\mu\) gives us the value of this parameter under which the observed data are most likely to have been generated. The idea is that there is an infinite number of Normal probability distributions with \(\sigma^2 = 1\), each one associated with one value of \(\mu\). For each value of \(\mu\), the joint pdf evaluated at the observed data takes a certain value. Finding the Maximum Likelihood estimator of \(\mu\) gives us the value of this parameter that maximizes this joint density.
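The idea of holding the data fixed and letting \(\mu\) vary can be sketched in a few lines of Python. The data `xs` and the list of candidate values are hypothetical; the function `likelihood` simply evaluates the \(N(\mu, 1)\) joint density at the fixed data.

```python
import math


def likelihood(mu, xs):
    """Likelihood of mu for an iid N(mu, 1) sample xs (the data are fixed)."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return math.exp(-sum_sq / 2) / (2 * math.pi) ** (n / 2)


xs = [1.2, -0.3, 0.8, 2.1, 0.5]           # hypothetical, fixed data
candidates = [-1.0, 0.0, 0.5, 1.0, 2.0]   # a few candidate values of mu

# One likelihood value per candidate: same data, different Normal models.
values = {mu: likelihood(mu, xs) for mu in candidates}
for mu, value in values.items():
    print(f"mu = {mu:5.2f}   L(mu) = {value:.3e}")

# Candidate with the largest likelihood.
best = max(values, key=values.get)
print(best)
```

Among these candidates, the one closest to the sample mean of `xs` (which is 0.86) gives the largest likelihood.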

Usually, maximizing the Likelihood function directly is quite complicated. An alternative is to maximize the log-Likelihood function instead. The log-Likelihood function, as the name suggests, is the natural log of the Likelihood function and, since the log is a strictly increasing function, the value that maximizes the log-Likelihood also maximizes the Likelihood function. For this problem, the log-Likelihood, denoted by \(\ell(\mu|\mathbf{X})\), where \(\mathbf{X} = \{ X_1, \ldots, X_n \}\), is defined as \[\begin{align*} \ell(\mu|\mathbf{X}) &= \log\left( L(\mu|\mathbf{X}) \right) \\ & = -\frac{n}{2} \log(2\pi) - \frac{\sum_{i = 1}^n (x_i^2 - 2\mu x_i + \mu^2)}{2} \\ & = -\frac{n}{2} \log(2\pi) - \frac{\sum_{i = 1}^n x_i^2}{2} + \mu \sum_{i = 1}^n x_i - \frac{n\mu^2}{2}. \end{align*}\]

Now that we have written \(\ell(\mu|\mathbf{X})\) in a simple form, we can maximize it. To do so, we take the first derivative with respect to \(\mu\), set it equal to 0 and, finally, isolate \(\mu\). Then, we take the second derivative and verify that it is less than zero at \(\mu = \hat{\mu}\). The first derivative of the log-Likelihood is given by: \[\begin{align*} \ell'(\mu|\mathbf{X}) &= \frac{\partial \ell(\mu|\mathbf{X})}{\partial \mu} \\ & = \sum_{i = 1}^n x_i - n\mu. \end{align*}\] Setting it equal to zero and isolating \(\mu\) gives us the candidate for the Maximum Likelihood estimator of \(\mu\), which is \[ \hat{\mu} = \frac{1}{n} \sum_{i = 1}^n x_i = \bar{x}. \]

Now, we only have to take the second derivative of the log-Likelihood function to be sure that \(\hat{\mu}\) is a point of maximum. \[\begin{align*} \ell''(\mu|\mathbf{X}) &= \frac{\partial \ell'(\mu|\mathbf{X})}{\partial \mu} \\ & = -n. \end{align*}\] Note that \(-n < 0\) for all possible values of \(\mu\), thus the log-Likelihood is concave and \(\hat{\mu} = \bar{x}\) is the Maximum Likelihood Estimator for \(\mu\).
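The closed-form result can also be checked numerically. The sketch below simulates a sample (the seed, the sample size, and the true mean of 2 are assumptions made only for this illustration), then verifies that the first derivative vanishes at \(\bar{x}\), that a brute-force grid search over \(\mu\) lands next to \(\bar{x}\), and that a finite-difference second derivative is close to \(-n\).

```python
import math
import random

random.seed(3445)                                  # arbitrary seed
xs = [random.gauss(2.0, 1.0) for _ in range(50)]   # simulated N(2, 1) data (assumed)
n = len(xs)
x_bar = sum(xs) / n


def log_likelihood(mu):
    """Log-likelihood of mu for the fixed N(mu, 1) sample xs."""
    return -n / 2 * math.log(2 * math.pi) - sum((x - mu) ** 2 for x in xs) / 2


def score(mu):
    """First derivative of the log-likelihood with respect to mu."""
    return sum(xs) - n * mu


# 1) The first derivative is (numerically) zero at the sample mean.
print(score(x_bar))

# 2) A grid search over mu in [0, 4] with step 0.001 lands next to x_bar.
grid = [i / 1000 for i in range(4001)]
mu_grid = max(grid, key=log_likelihood)
print(mu_grid, x_bar)

# 3) A central finite difference approximates the second derivative, -n.
h = 1e-3
second = (log_likelihood(x_bar + h) - 2 * log_likelihood(x_bar)
          + log_likelihood(x_bar - h)) / h**2
print(second, -n)
```

The grid search is only a check; for this model the derivative-based solution is exact and far cheaper.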


To summarize, the steps to find a Maximum Likelihood estimator are:

  • Find the Likelihood function by finding the joint pdf (or pmf) and treating it as a function of the parameter (or parameters) that you want to estimate;
  • Take the log of the Likelihood function;
  • Take the first derivative of the log-Likelihood function, make it equal to zero and isolate the parameter that you want to estimate;
  • Find the second derivative of the log-likelihood and analyze it to be sure that the value found for the parameter is in fact a point of maximum.
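
As an illustration of these four steps, here is how they would play out for the Binomial example from the beginning of the post, where \(X \sim Binomial(10, p)\) and we observe \(x = 4\). This is a sketch for \(0 < p < 1\), intended only as a worked instance of the steps listed above. \[\begin{align*} L(p|X = 4) & = {{10}\choose{4}} p^4 (1 - p)^6, \\ \ell(p|X = 4) & = \log {{10}\choose{4}} + 4 \log(p) + 6 \log(1 - p), \\ \ell'(p|X = 4) & = \frac{4}{p} - \frac{6}{1 - p}, \\ \ell''(p|X = 4) & = -\frac{4}{p^2} - \frac{6}{(1 - p)^2}. \end{align*}\] Setting \(\ell'(p|X = 4) = 0\) gives \(4(1 - p) = 6p\), that is, \(\hat{p} = 4/10 = 0.4\). Since \(\ell''(p|X = 4) < 0\) for every \(p\) in \((0, 1)\), the value \(\hat{p} = 0.4\) is indeed a point of maximum.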
Lucas Godoy
PhD Candidate / TA /GA

I’m a PhD Candidate in Stats interested in R, Open Data, and the most diverse applications of statistics.
