
FDSI-4 Numerical MLE

Intro

In some situations it is not possible to derive the MLE using an analytical approach, so numerical optimization methods are required.
When taking a numerical approach, it is still a good idea to work with the log likelihood, as this function is usually better behaved. (The product in the likelihood function will often lead to a likelihood with either very small or very large values.)
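As a small illustration (a sketch using simulated standard normal data), the product of density values can underflow to zero even for a moderate sample size, while the sum of the log densities remains a perfectly ordinary number:

import numpy as np
from scipy.stats import norm

# hypothetical data: 2000 draws from a standard normal
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)

print(np.prod(norm.pdf(x)))    # likelihood: underflows to 0.0
print(np.sum(norm.logpdf(x)))  # log likelihood: on the order of -2800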

Methods and Examples

Python includes general-purpose optimization capability; here we will focus on minimize(), part of scipy.optimize, which minimizes a user-specified function.
from scipy.optimize import minimize
For example, suppose $X$ is $\text{binomial}(n,p)$ and $p$ is unknown. We have seen that the sample proportion $\hat{p}=X/n$ is unbiased and has standard error $\sqrt{p(1-p)/n}$. We will find the value of $p$ that maximizes the standard error ($0\le p\le 1$).
import numpy as np

def neg_se_of_phat(p, n=100):
    # negative of the standard error of p-hat, so minimizing it maximizes the SE
    return (-1) * np.sqrt(p * (1 - p) / n)

n = 100
res = minimize(neg_se_of_phat, 0.2, method='Nelder-Mead',
               options={'disp': True}, args=(n,))
  • The first argument is the function to be minimized.
  • The second argument is the starting value for the parameters over which the function will be minimized. This should be in the interior (not on the boundary) of the space of possible parameter values. If there is more than one parameter, this argument should be an array.
  • The method argument tells minimize which minimization technique to utilize.
  • The options argument can set various properties, including the maximum number of iterations, the tolerance, and, as in the example, the amount of information displayed.
res.x # array([0.5])

Numerical MLE for Gamma Dist

Suppose $X_1,X_2,\ldots,X_n$ are i.i.d. $\text{Gamma}(\alpha,\beta)$. The function below minimizes the negative log-likelihood numerically, using the method-of-moments estimates of $\alpha$ and $\beta$ as starting values.
def gamma_mle(x):
    from scipy.stats import gamma

    def neg_log_likelihood(pars, x):
        # pars[0] = alpha (shape), pars[1] = beta (rate), so scale = 1/beta
        return (-1) * sum(gamma.logpdf(x, pars[0], scale=1/pars[1]))

    # method-of-moments estimates, used as starting values
    beta_hat_mom = np.mean(x) / (np.mean(x**2) - np.mean(x)**2)
    alpha_hat_mom = beta_hat_mom * np.mean(x)

    mle_out = minimize(neg_log_likelihood, [alpha_hat_mom, beta_hat_mom],
                       args=(x,), method='Nelder-Mead')
    return mle_out
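As a quick sanity check (a sketch; the sample size and the true values alpha = 3, beta = 2 are made up for illustration), we can simulate gamma data with known parameters and confirm that the fitted values land nearby:

from scipy.stats import gamma

# hypothetical data: 500 draws from Gamma(alpha=3, beta=2), i.e. shape 3 and scale 1/2
x_sim = gamma.rvs(3, scale=1/2, size=500, random_state=1)

fit = gamma_mle(x_sim)
print(fit.x)   # roughly [3, 2] = [alpha_hat, beta_hat], up to sampling error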

Regression with t-distributed Errors

Consider the following regression model:
$$Y=\beta_0+\beta_1 x+\sigma\epsilon$$
where the $\epsilon$ are assumed to have the t-distribution with $\nu$ degrees of freedom. We will assume that $\nu$ is set by the user (not estimated).
There are three unknown parameters in this model: $\beta_0$, $\beta_1$, and $\sigma$. We want to estimate them via maximum likelihood.
We assume that we will observe $n$ pairs of $(x,Y)$ values, and that the $\epsilon$ values are independent. We begin by deriving the density for a single $Y$ observation.
Since the $\epsilon_i$ are independent, the $Y_i$ are also independent. But the $Y_i$ are not i.i.d., since they have different $x_i$ values, and hence different means.
$$\begin{aligned} F_{Y_i}(y) &= P(Y_i\le y)\\ &= P(\beta_0+\beta_1 x_i+\sigma\epsilon_i\le y)\\ &= P\left(\epsilon_i\le\frac{y-(\beta_0+\beta_1 x_i)}{\sigma}\right)\\ &= F_T\left(\frac{y-(\beta_0+\beta_1 x_i)}{\sigma}\right) \end{aligned}$$
where $T\sim t_{\nu}$, and thus
$$f_{Y_i}(y)=\frac{\partial F_{Y_i}(y)}{\partial y}=f_T\left(\frac{y-(\beta_0+\beta_1 x_i)}{\sigma}\right)\cdot\frac{1}{\sigma}$$
therefore
$$L(\beta_0,\beta_1,\sigma)=\prod_{i=1}^n f_{Y_i}(y_i)=\prod_{i=1}^n f_T\left(\frac{y_i-(\beta_0+\beta_1 x_i)}{\sigma}\right)\cdot\frac{1}{\sigma}$$
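Taking logs and negating (the form we will hand to minimize()) gives
$$-\log L(\beta_0,\beta_1,\sigma)=-\sum_{i=1}^n \log f_T\left(\frac{y_i-(\beta_0+\beta_1 x_i)}{\sigma}\right)+n\log\sigma$$
where the $n\log\sigma$ term comes from the factor $1/\sigma$ in each term of the product.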
In preparation for using minimize(), here is the function that returns the negative log-likelihood as a function of the three parameters.
def neglogliketreg(pars, x, y, nu):
    from scipy.stats import t
    # pars = [beta_0, beta_1, sigma]
    resids = y - (x * pars[1] + pars[0])
    return (-1) * sum(t.logpdf(resids / pars[2], nu)) + len(x) * np.log(pars[2])
The following function fits the model with the assumed t-distributed errors, using the normal-errors (least-squares) estimates as starting values.
def treg(x, y, nu):
    from scipy.stats import t

    def neglogliketreg(pars, x, y, nu):
        resids = y - (x * pars[1] + pars[0])
        return (-1) * sum(t.logpdf(resids / pars[2], nu)) + len(x) * np.log(pars[2])

    # MLE for beta_1 & beta_0, under assumption of epsilon ~ N(0, 1)
    holdcov = np.cov(x, y)
    beta_1_init = holdcov[0, 1] / holdcov[0, 0]
    beta_0_init = np.mean(y) - np.mean(x) * beta_1_init

    # MLE for sigma, under assumption of epsilon ~ N(0, 1)
    resids = y - (beta_0_init + beta_1_init * x)
    sigma_init = np.sqrt(sum(resids**2) / len(x))

    output = minimize(neglogliketreg,
                      [beta_0_init, beta_1_init, sigma_init],
                      args=(x, y, nu),
                      method='Nelder-Mead')
    return output
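As a rough check (a sketch; the true values $\beta_0=1$, $\beta_1=2$, $\sigma=0.5$, and $\nu=5$ are made up for illustration), we can simulate data from the model and compare the fit to the truth:

from scipy.stats import t

# hypothetical simulation from Y = 1 + 2x + 0.5*eps, eps ~ t_5
rng = np.random.default_rng(2)
n = 200
x_sim = rng.uniform(0, 10, size=n)
y_sim = 1 + 2 * x_sim + 0.5 * t.rvs(5, size=n, random_state=2)

fit = treg(x_sim, y_sim, nu=5)
print(fit.x)   # roughly [1, 2, 0.5] = [beta_0, beta_1, sigma], up to sampling error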

Fisher Information

In cases where the likelihood is maximized numerically, it is typically not possible to derive the Fisher Information analytically. Instead, we can use an approximation based on the result
$$nI(\theta)=E_{\theta}\left[-\frac{\partial^2}{\partial \theta^2}\log L(\theta)\right]$$
This result suggests that we can approximate $nI(\theta_0)$ by using the curvature of the observed log-likelihood function at its peak.
Example:
Assume $X_1,X_2,\ldots,X_n$ are i.i.d. $\text{Gamma}(\alpha,1)$, that is
$$f_X(x;\alpha)=\frac{1}{\Gamma(\alpha)}x^{\alpha-1}e^{-x},\quad x>0$$
def one_d_gamma_mle(x):
    from scipy.stats import gamma
    import numpy as np
    from scipy.optimize import minimize
    import numdifftools as nd

    def neg_log_likelihood(pars, x):
        # one-parameter gamma: shape alpha = pars[0], scale fixed at 1
        return (-1) * sum(gamma.logpdf(x, pars[0], scale=1))

    # method-of-moments estimate (the mean of Gamma(alpha, 1) is alpha), used as the starting value
    alpha_hat_mom = np.mean(x)

    mle_out = minimize(neg_log_likelihood, alpha_hat_mom,
                       args=(x,), method='Nelder-Mead')

    hessfunc = nd.Hessian(neg_log_likelihood)   # hessfunc is a function
    mle_var = 1 / hessfunc(mle_out.x, x)        # inverse curvature at the MLE (mle_out.x is alpha hat)
    return [mle_out.x, mle_var]
hessfunc(mle_out.x, x) returns the observed Fisher Information, i.e. the approximation of $nI(\theta_0)$.
Note that the Hessian is being evaluated on the negative log likelihood.
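For example (a sketch; the sample size and the true value $\alpha=4$ are made up for illustration), the returned variance can be turned into an approximate 95% Wald confidence interval for $\alpha$:

from scipy.stats import gamma

# hypothetical data: 300 draws from Gamma(alpha=4, 1)
x_sim = gamma.rvs(4, scale=1, size=300, random_state=3)

alpha_hat, alpha_var = one_d_gamma_mle(x_sim)
se = np.sqrt(alpha_var).item()          # alpha_var is a 1x1 array
ci = (alpha_hat.item() - 1.96 * se, alpha_hat.item() + 1.96 * se)
print(ci)   # should cover the true alpha = 4 in roughly 95% of simulations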
Multiparameter Case
In a case where the MLE is found numerically, we can use the inverse of the observed Fisher Information in place of $\mathbf{I}^{-1}(\hat{\theta})/n$.
In a multidimensional optimization using minimize(), the procedure minimizes the negative of the log likelihood. Thus, the Hessian is the matrix of second derivatives evaluated at the peak
$$\frac{\partial^2}{\partial \theta\,\partial \theta^{T}}\left(-\log L(\theta)\right)\Big|_{\theta=\hat{\theta}}$$
We do not need to scale this by $n$ or multiply by $-1$.
To get the approximate covariance matrix, simply invert this Hessian.
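As a sketch of this (reusing the two-parameter gamma example; the simulated data and true values are made up, and numdifftools is assumed to be available):

import numdifftools as nd
from scipy.stats import gamma

# hypothetical data: 400 draws from Gamma(alpha=3, beta=2), i.e. shape 3 and scale 1/2
x_sim = gamma.rvs(3, scale=1/2, size=400, random_state=4)

def neg_log_likelihood(pars, x):
    # same negative log-likelihood used inside gamma_mle
    return (-1) * sum(gamma.logpdf(x, pars[0], scale=1/pars[1]))

fit = gamma_mle(x_sim)                                # two-parameter MLE from the earlier example
hess = nd.Hessian(neg_log_likelihood)(fit.x, x_sim)   # 2x2 observed Fisher Information
cov_hat = np.linalg.inv(hess)                         # approximate covariance of (alpha_hat, beta_hat)
print(np.sqrt(np.diag(cov_hat)))                      # approximate standard errors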