Here are some transformations that aim to maximize the quality of the fit of a normal distribution to the sample data.
The motivations for such a transformation include
- to make a skewed distribution more symmetric
- to allow for the use of the normal model in a case where it otherwise may not be clear how to model the variable
- to transform the response of a regression model so that the normal approximation is a better fit for the model error
Box-Cox Transformation
This is only useful when the variable is strictly positive.
Define the Box-Cox power transformation function:

$$
y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}
$$
This defines a wide range of possible transformations, including (up to affine rescaling) the logarithm ($\lambda = 0$), the square root ($\lambda = 1/2$), the reciprocal ($\lambda = -1$), the identity ($\lambda = 1$), etc.
Note that the function is written in this form so that the transformation is continuous in $\lambda$: since $y^{\lambda} = e^{\lambda \log y} = 1 + \lambda \log y + O(\lambda^2)$, we have $(y^{\lambda} - 1)/\lambda \to \log y$ as $\lambda \to 0$, matching the $\lambda = 0$ case.
In Python, Box-Cox is implemented via
scipy.stats.boxcox
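A minimal usage sketch (the lognormal sample below is just an illustrative assumption, chosen so that the estimated $\lambda$ should come out near 0):

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strictly positive, right-skewed sample

# With lmbda omitted, scipy estimates lambda by maximum likelihood and returns
# the transformed data together with the estimate.
yt, lmbda_hat = boxcox(y)
print(lmbda_hat)                 # expected to be close to 0 for lognormal data

# A fixed lambda can also be supplied; then only the transformed data is returned.
yt_log = boxcox(y, lmbda=0.0)    # identical to np.log(y)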
Box-Cox for Regression
For the model

$$
y^{(\lambda)} = X\beta + \varepsilon,
$$

the question is: what choice of $\lambda$ yields the best-fitting model, if we want to assume the errors $\varepsilon$ are i.i.d. Normal?
Note that we cannot directly apply the Box-Cox transformation to the responses, because Box-Cox tries to make the transformed values i.i.d. normal, whereas in a regression the responses have different means depending on the covariates; it is the errors, not the responses, that should look i.i.d. normal. Instead, we can try different values of $\lambda$, fit the regression for each, and compute the corresponding log-likelihood under the assumption of normal errors. We then choose the $\lambda$ that maximizes this (profile) likelihood.
Implementation Code
def BoxCoxRegression(X, y, ax=None, alpha=0.05):
    # X is the design matrix. Include a column of ones if you require the intercept.
    # y is the vector of response values
    import numpy as np
    from scipy.stats import chi2

    # Calculate the log likelihood for a fixed value of lambda
    def loglike(lmbda, hatmatrix, y):
        from scipy.stats import boxcox
        # Transform using proposed lambda value
        yt = boxcox(y, lmbda=lmbda)
        # Residual sum of squares
        RSS = np.sum((np.matmul(hatmatrix, yt) - yt)**2.0)
        # Compare this with Equations (8) and (9) in Box and Cox (1964).
        # Note that they multiply the last term by (lmbda-1), but I am
        # changing to just lmbda to make it consistent with the R function.
        # Since it adds a constant, it is irrelevant.
        return -len(y)/2.0*np.log(RSS) + lmbda*np.sum(np.log(y))

    # Construct the hat matrix. This is the same for all lambdas
    XtXinv = np.linalg.inv(np.matmul(np.transpose(X), X))
    hatmatrix = np.matmul(np.matmul(X, XtXinv), np.transpose(X))

    # Calculate the loglikelihoods for a range of lambda values
    lambdarange = np.linspace(-2, 2, 100)
    loglikes = np.array([loglike(lmbda, hatmatrix, y) for lmbda in lambdarange])

    # Find the best choice of lambda
    lambdaopt = lambdarange[np.argmax(loglikes)]

    # Find the confidence limits for lambda
    conflim = np.max(loglikes) - chi2.ppf(1-alpha, 1)/2
    ininterval = lambdarange[loglikes >= conflim]
    confint = [np.min(ininterval), np.max(ininterval)]

    # Create a plot
    if ax is not None:
        ax.plot(lambdarange, loglikes, lw=1)
        ax.set(xlabel=r'$\lambda$', ylabel="log likelihood")
        ax.plot([lambdarange[0], lambdarange[-1]], [conflim, conflim], "--", color="red", lw=0.5)
        ax.plot([confint[0], confint[0]], [np.min(loglikes), conflim], "--", color="red", lw=0.5)
        ax.plot([confint[1], confint[1]], [np.min(loglikes), conflim], "--", color="red", lw=0.5)
        ax.plot([lambdaopt, lambdaopt], [np.min(loglikes), np.max(loglikes)], "--", color="red", lw=0.5)
        ax.text(confint[0], np.min(loglikes), round(confint[0], 1), ha="center", va="top", size=8)
        ax.text(confint[1], np.min(loglikes), round(confint[1], 1), ha="center", va="top", size=8)
        ax.text(lambdaopt, np.min(loglikes), round(lambdaopt, 1), ha="center", va="top", size=8)

    return lambdaopt, confint, lambdarange, loglikes
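As an illustration, here is a hypothetical way the function above might be called; the simulated data, seed, and true $\lambda = 0.5$ below are assumptions for the sketch, not part of the original example.

import numpy as np
import matplotlib.pyplot as plt

# Simulate data whose transformed response is linear in x with normal errors,
# using a true lambda of 0.5, so the estimated lambda should land near 0.5.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])                 # design matrix with intercept column
yt = 2.0 + 0.8 * x + rng.normal(scale=0.3, size=n)   # y^(lambda) = X beta + eps
y = (0.5 * yt + 1.0) ** 2                            # invert the Box-Cox transform at lambda = 0.5

fig, ax = plt.subplots()
lambdaopt, confint, lambdarange, loglikes = BoxCoxRegression(X, y, ax=ax)
print("lambda estimate:", lambdaopt, "confidence interval:", confint)
plt.show()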
The Yeo-Johnson Transformation
The Box-Cox Transformation suffers from the limitation that it can only be applied to strictly positive variables. A generalization that works for any variable is the Yeo-Johnson Transformation,

$$
y^{(\lambda)} = \begin{cases}
\dfrac{(y+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0 \\
\log(y+1), & \lambda = 0,\ y \geq 0 \\
-\dfrac{(1-y)^{2-\lambda} - 1}{2-\lambda}, & \lambda \neq 2,\ y < 0 \\
-\log(1-y), & \lambda = 2,\ y < 0.
\end{cases}
$$
In Python, this is implemented in
scipy.stats.yeojohnson
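A minimal usage sketch along the same lines (the cubed-normal sample is an illustrative assumption, giving skewed data with both positive and negative values):

import numpy as np
from scipy.stats import yeojohnson

rng = np.random.default_rng(2)
y = rng.normal(size=500) ** 3            # skewed sample containing both signs

# As with boxcox, omitting lmbda makes scipy estimate it by maximum likelihood.
yt, lmbda_hat = yeojohnson(y)
print(lmbda_hat)

# A fixed lambda can also be supplied.
yt_identity = yeojohnson(y, lmbda=1.0)   # lambda = 1 is the identity transformation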