Consider the following: we wish to constrain the parameters of a linear model to have certain properties, for example, that the last coefficient is equal to 1, or that the coefficients sum to one. How can we compute the least squares estimates under such constraints?
This is constrained least squares, and this article explores the theoretical basis for choosing the coefficients that minimize the sum of squared errors subject to linear constraints on the coefficients.
Constrained Least Squares -- Coefficient Calculation and Simulation
We want to maximize the likelihood as a function of $\vec{\beta}$. We know that $\vec{y}$ in this context is normally distributed with mean $\mathbf{X}\vec{\beta}$ and variance $\sigma^2$. This means that the distribution of $y_i$ given a vector of covariates $\vec{x}_i$ is as follows:
$$f_{Y}\left( y_i \middle|\ \vec{x}_i \right) = {(2\pi\sigma^{2})}^{- 1/2}e^{- \frac{{(y_i-\vec{x}_i \vec{\beta})}^{2}}{2\sigma^{2}}}$$So the likelihood of $\vec{\beta}$ is therefore:
$$ L(\vec{\beta}, \sigma^{2} \mid \vec{y}, \mathbf{X}) = ( 2\pi\sigma^{2})^{- \frac{N}{2}}e^{- \frac{\sum_{i}^{N}{(y_{i}-\vec{x}_{i} \vec{\beta})}^{2}}{2\sigma^{2}}} $$Taking the natural logarithm gives us the log likelihood:
$$l\left( \vec{\beta},\sigma^{2} \middle| \vec{y},\mathbf{X} \right) = - \frac{1}{2\sigma^{2}}\sum_{i}^{N}{(y_{i}-\vec{x}_{i}\vec{\beta})}^{2}-\frac{N}{2}\ln\left( \sigma^{2} \right)+ C = - \frac{1}{2\sigma^{2}}\left( \vec{y} - \mathbf{X}\vec{\beta} \right)^{\intercal}\left( \vec{y} - \mathbf{X}\vec{\beta} \right) - \frac{N}{2}\ln\left( \sigma^{2} \right)+ C$$Here, I have kept only the terms involving $\vec{\beta}$ and $\sigma^{2}$, collecting everything else into the constant $C$, since we are only interested in maximizing with respect to $\vec{\beta}$. Note that $\sigma^{2}$ is treated as a constant because we assume the errors are iid with constant variance $\sigma^{2}$. Maximizing this expression with respect to $\vec{\beta}$ is equivalent to minimizing the sum of squared errors, so the MLE and the least squares estimator coincide. If we were unconstrained, we would simply set the partial derivative with respect to $\vec{\beta}$ to zero and solve to obtain the maximum likelihood estimate.
Since $\sigma^{2}$ is unconstrained, we can find its maximum likelihood estimator directly and substitute it back before maximizing with respect to $\vec{\beta}$. To do this, we differentiate the above equation with respect to $\sigma^2$ and set the result to zero, which yields the usual estimate for $\sigma^{2}$:
$${\widehat{\sigma}}^{2} = \frac{1}{N}\sum_{i}^{N}{(y_{i}-\vec{x}_{i}\vec{\beta}^{c})}^{2}=\frac{1}{N}\left(\vec{y}-\mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left(\vec{y}-\mathbf{X}\vec{\beta}^{c} \right)$$where $\vec{\beta}^{c}$ is our new constrained estimate. Plugging this back into the log likelihood yields:
$$ l(\vec{\beta}^{c} \mid \vec{y},\mathbf{X}) = - \frac{N}{2}\left\lbrack \left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right) \right\rbrack^{-1}\left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right) - \frac{N}{2}\ln\left( \frac{1}{N}\left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left( \vec{y} - \mathbf{X}\vec{\beta}^{c} \right) \right)+ C $$$$l\left( \vec{\beta}^{c} \middle| \vec{y},\mathbf{X} \right) = - \frac{N}{2} - \frac{N}{2}\ln\left( \frac{1}{N}\left( \vec{y} -\mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left( \vec{y} -\mathbf{X}\vec{\beta}^{c} \right) \right)+ C$$However, we want to maximize the above equation subject to the constraint $M\vec{\beta} = \vec{d}$, where $\vec{d}$ is a known vector and $M$ is an $r \times p$ matrix of rank $r$, with $r < p$. This is an equality constraint, and we can use the method of Lagrange multipliers to solve the constrained maximization problem.
We can see that, since the natural logarithm is monotonically increasing, the $\frac{1}{N}$ inside the logarithm is a constant, and the remaining terms do not involve $\vec{\beta}^{c}$, maximizing the above equation with respect to $\vec{\beta}^{c}$ is equivalent to minimizing:
$$f = \left(\vec{y}-\mathbf{X}\vec{\beta}^{c} \right)^{\intercal}\left(\vec{y}-\mathbf{X}\vec{\beta}^{c} \right) - \vec{\lambda}^{\intercal}\left( M\vec{\beta}^{c} - \vec{d} \right)$$where $\vec{\lambda}$ is the vector of Lagrange multipliers (and so $f$ is the Lagrangian). Lagrange multipliers are especially useful here, as we are trying to minimize the squared error subject to a constraint on the parameters.
Expanding the quadratic form gives:
$$f =\vec{y}^{\intercal}\vec{y} -\vec{y}^{\intercal}\mathbf{X}\vec{\beta}^{c}-{{\vec{\beta}^{c}}}^{\intercal}\mathbf{X}^{\intercal}\vec{y} + {{\vec{\beta}^{c}}}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\vec{\beta}^{c}-\vec{\lambda}^{\intercal}\left( M\vec{\beta}^{c} - \vec{d} \right) =\vec{y}^{\intercal}\vec{y} -2{{\vec{\beta}^{c}}}^{\intercal}\mathbf{X}^{\intercal}\vec{y} + {{\vec{\beta}^{c}}}^{\intercal}\mathbf{X}^{\intercal}\mathbf{X}\vec{\beta}^{c}-\vec{\lambda}^{\intercal}\left( M\vec{\beta}^{c} - \vec{d} \right)$$Now we take derivatives with respect to $\vec{\beta}^{c}$ and $\vec{\lambda}$ to obtain:
$$\frac{\partial{f}}{\partial{\vec{\beta}^{c}}}= - 2\mathbf{X}^{\intercal}\vec{y}+2\mathbf{X}^{\intercal}\mathbf{X}\vec{\beta}^{c}-M^{\intercal}\vec{\lambda}$$$$\frac{\partial{f}}{\partial{\vec{\lambda}}}= -\left( M\vec{\beta}^{c} - \vec{d} \right)$$We can now set both of these equal to zero and solve the resulting system of equations to find the MLE of $\vec{\beta}^{c}$. However, there are a few simplifying steps that make the result more usable. The first is to express it in terms of the original least squares estimator. If we set the first equation to zero and multiply through by $M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}$, we obtain:
$$- \ 2M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\mathbf{X}^{\intercal}\vec{y}+\ 2M{\widehat{\beta}}^{c}-\ M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}\vec{\lambda} = 0$$The quantity $({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\mathbf{X}^{\intercal}\vec{y}$ in the first term is the familiar unconstrained OLS estimator, which I will denote ${\widehat{\beta}}^{u}$. Substituting, we get:
$$- \ 2M{\widehat{\beta}}^{u}+\ 2M{\widehat{\beta}}^{c}-\ M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}\vec{\lambda} = 0$$$$M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}\vec{\lambda} = \ 2M{\widehat{\beta}}^{c} - \ 2M{\widehat{\beta}}^{u}$$Solving for $\vec{\lambda}$:
$$\vec{\lambda} = {(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\left\lbrack 2M{\widehat{\beta}}^{c} - \ 2M{\widehat{\beta}}^{u} \right\rbrack= 2{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$where the last step uses the constraint $M{\widehat{\beta}}^{c} = \vec{d}$. Plugging this back into the original equation and solving for ${\widehat{\beta}}^{c}$:
$$0= - 2\mathbf{X}^{\intercal}\vec{y}+2\mathbf{X}^{\intercal}\mathbf{X}{\widehat{\beta}}^{c}-M^{\intercal}\vec{\lambda}$$$$0= - 2\mathbf{X}^{\intercal}\vec{y}+2\mathbf{X}^{\intercal}\mathbf{X}{\widehat{\beta}}^{c}-2M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$ $$\mathbf{X}^{\intercal}\mathbf{X}{\widehat{\beta}}^{c}=\mathbf{X}^{\intercal}\vec{y}+M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$ $${\widehat{\beta}}^{c}= ({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\mathbf{X}^{\intercal}\vec{y}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$ $${\widehat{\beta}}^{c}={\widehat{\beta}}^{u}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$This is the desired result: the maximum likelihood estimator given the constraint $M\vec{\beta} = \vec{d}$.
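To make the formula concrete, here is a minimal R sketch of this estimator (the helper name `constrained_ls` and its arguments are my own, and it assumes that $\mathbf{X}^{\intercal}\mathbf{X}$ and $M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}$ are invertible):

constrained_ls <- function(X, y, M, d) {
  # unconstrained OLS estimate
  XtX_inv <- solve(t(X) %*% X)
  beta_u <- XtX_inv %*% t(X) %*% y
  # correction that moves beta_u onto the constraint set M beta = d
  adjust <- XtX_inv %*% t(M) %*% solve(M %*% XtX_inv %*% t(M)) %*% (d - M %*% beta_u)
  beta_u + adjust
}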
Now, let's look at an example.
Suppose we want the last coefficient to be zero: $$\beta_{p} = 0$$This is equivalent to the constraint $\vec{\alpha}^{\intercal}\vec{\beta} = 0$, where $\vec{\alpha}$ is a vector of zeros in all but the last row, which is a 1; that is, $M = \vec{\alpha}^{\intercal}$ and $\vec{d} = 0$. So,
$$\vec{\alpha} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$Plugging it into the derived result, we get:
$${\widehat{\beta}}^{c}={\widehat{\beta}}^{u}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha}\,{(\vec{\alpha}^{\intercal}({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha})}^{- 1}\lbrack - \ \vec{\alpha}^{\intercal}{\widehat{\beta}}^{u}\rbrack={\widehat{\beta}}^{u}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha}\,{(\vec{\alpha}^{\intercal}({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha})}^{- 1}\lbrack - \ {\widehat{\beta}}^{u}_{p}\rbrack$$Because of the form of $\vec{\alpha}$, the scalar $\vec{\alpha}^{\intercal}({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha} = \lbrack({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\rbrack_{p,p}$ is simply the bottom-right element of $({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}$, and $({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha}$ is its last column. So we can simplify this to:
$${\widehat{\beta}}^{c}={\widehat{\beta}}^{u}-\frac{{\widehat{\beta}}^{u}_{p}}{\lbrack({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\rbrack_{p,p}}\,{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\alpha}$$So, if we compute the unconstrained ${\widehat{\beta}}^{u}$ and ${(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}$, we can solve directly for the constrained estimate: subtract the scalar ${\widehat{\beta}}^{u}_{p}/\lbrack({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\rbrack_{p,p}$ times the last column of $({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}$ from ${\widehat{\beta}}^{u}$.
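A quick numerical sanity check of this shortcut in R, using the `constrained_ls` sketch above (the simulated data and object names here are mine, purely for illustration):

set.seed(1)
X <- cbind(1, matrix(rnorm(200), ncol = 2))  # n = 100, p = 3 (intercept plus two covariates)
y <- X %*% c(2, 1, 0.5) + rnorm(100)
XtX_inv <- solve(t(X) %*% X)
beta_u <- XtX_inv %*% t(X) %*% y
p <- ncol(X)
# shortcut: subtract (beta_u_p / [(X'X)^-1]_{p,p}) times the last column of (X'X)^-1
beta_c_short <- beta_u - (beta_u[p] / XtX_inv[p, p]) * XtX_inv[, p]
# general formula with M = alpha' and d = 0
alpha <- c(rep(0, p - 1), 1)
beta_c_gen <- constrained_ls(X, y, M = matrix(alpha, nrow = 1), d = 0)
all.equal(as.numeric(beta_c_short), as.numeric(beta_c_gen))  # TRUE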
Now consider the second constraint from the introduction: that the coefficients sum to one. This is equivalent to the constraint matrix $M$ being a row vector of 1s; call its transpose $\vec{\phi}$. Then $d = 1$ and
$$ \vec{\phi} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$Plugging it into the derived result, we get:
$${\widehat{\beta}}^{c}={\widehat{\beta}}^{u}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\phi}\,{(\vec{\phi}^{\intercal}({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}\vec{\phi})}^{- 1}\lbrack 1 - \vec{\phi}^{\intercal}{\widehat{\beta}}^{u}\rbrack$$which forces the constrained coefficients to sum to one.
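A small R check of this case, again using the hypothetical `constrained_ls` sketch (simulated data, names of my choosing):

set.seed(2)
X <- cbind(1, rnorm(100), rnorm(100))
y <- rnorm(100)
phi_row <- matrix(1, nrow = 1, ncol = ncol(X))  # constraint matrix: a single row of 1s
beta_c <- constrained_ls(X, y, M = phi_row, d = 1)
sum(beta_c)  # equals 1 up to floating-point error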
Next, we compare the fit of the constrained and unconstrained models. The residual sum of squares is defined as:
$$\text{RSS} = \ {(\vec{y} -\widehat{y})}^{\intercal}(\vec{y} -\widehat{y})$$In the constrained case, $$ \widehat{y} =\mathbf{X}{\widehat{\beta}}^{c} = \mathbf{X}\left( {\widehat{\beta}}^{u}+{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack \right) =\mathbf{X}{\widehat{\beta}}^{u}+\mathbf{X}{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack $$
So it follows that $$ \left(\vec{y}-{\widehat{y}}_{c} \right) =\vec{y}- \mathbf{X}{\widehat{\beta}}^{u}-\mathbf{X}{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack =\vec{\varepsilon} -\mathbf{X}{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack $$
where $\vec{\varepsilon} = \vec{y} - \mathbf{X}{\widehat{\beta}}^{u}$ is the residual vector of the unconstrained fit. Expanding the quadratic form and noting that the cross term vanishes because $\vec{\varepsilon}^{\intercal}\mathbf{X} = 0$, we get
$$\text{RSS}_c - \text{RSS} =\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$$
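As a quick numerical check of this identity, the following R sketch (simulated data and object names are mine) compares the direct RSS difference with the quadratic form above, using the sum-to-one constraint:

set.seed(3)
X <- cbind(1, rnorm(100), rnorm(100))
y <- rnorm(100)
M <- matrix(1, nrow = 1, ncol = ncol(X))
d <- 1
XtX_inv <- solve(t(X) %*% X)
beta_u <- XtX_inv %*% t(X) %*% y
beta_c <- beta_u + XtX_inv %*% t(M) %*% solve(M %*% XtX_inv %*% t(M)) %*% (d - M %*% beta_u)
rss_u <- sum((y - X %*% beta_u)^2)
rss_c <- sum((y - X %*% beta_c)^2)
# quadratic-form expression for the difference
qf <- t(d - M %*% beta_u) %*% solve(M %*% XtX_inv %*% t(M)) %*% (d - M %*% beta_u)
all.equal(rss_c - rss_u, as.numeric(qf))  # TRUE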
Each error is normally distributed with mean 0 and variance $\sigma^{2}$. This implies that:
$$ \frac{1}{\sigma}\left(\vec{y} - \widehat{y}\right) \sim\text{MVN}(0,I) $$where $\widehat{y}$ here is the fitted value from the constrained model. We can write the sum of squares of the above as:
$$\left(\vec{y}- \widehat{y} \right)^{\intercal}\left(\frac{1}{\sigma^{2}}I\right)\left(\vec{y}- \widehat{y} \right) = \left(\vec{y}- \check{y} \right)^{\intercal}\left(\frac{1}{\sigma^{2}}I\right)\left(\vec{y}- \check{y} \right) + \ \left( \check{y}-\widehat{y} \right)^{\intercal}\left( \frac{1}{\sigma^{2}}I \right)\left( \check{y}-\widehat{y} \right)+ 2\left(\check{y}-\widehat{y}\right)^{\intercal}\left( \frac{1}{\sigma^{2}}I \right)\left(\vec{y} -\check{y}\right)$$where $\check{y}$ is the fitted value from the unconstrained model. To show this, we factor out the common $\frac{1}{\sigma^{2}}I$ and see that:
$$\left(\vec{y}- \widehat{y} \right)^{\intercal}\left(\vec{y}- \widehat{y} \right) = \left(\vec{y}- \check{y} + \check{y} - \widehat{y} \right)^{\intercal}\left(\vec{y}- \check{y} + \check{y} - \widehat{y} \right) = \left(\vec{y}- \check{y} \right)^{\intercal}\left(\vec{y}- \check{y} \right) + \ \left( \check{y} - \widehat{y} \right)^{\intercal}\left( \check{y} - \widehat{y} \right) + 2\left(\vec{y} - \check{y}\right)^{\intercal}\left( \check{y} - \widehat{y} \right)$$Now,
$$\left( \check{y} - \widehat{y} \right) = \left(\vec{y}- \widehat{y}\right) - \left(\vec{y}- \check{y} \right) = \left(\vec{y}- \widehat{y} \right) - \vec{\varepsilon}$$which, from the expression above, equals $-\mathbf{X}{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\lbrack \vec{d} - M{\widehat{\beta}}^{u}\rbrack$.
Rewriting, we have:
$$\left(\vec{y}- \check{y} \right)^{\intercal}\left( \check{y} - \widehat{y} \right) = -\vec{\varepsilon}^{\intercal}\mathbf{X}{(\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal}{(M({\mathbf{X}^{\intercal}\mathbf{X})}^{- 1}M^{\intercal})}^{- 1}\left\lbrack \vec{d} - M{\widehat{\beta}}^{u} \right\rbrack$$Using the fact that $\vec{\varepsilon}^{\intercal}\mathbf{X} = 0$ (the unconstrained residuals are orthogonal to the columns of $\mathbf{X}$), this cross term vanishes, and we obtain:
$$\left(\vec{y}- \widehat{y} \right)^{\intercal}\left(\vec{y}- \widehat{y} \right) = \left(\vec{y}- \check{y} \right)^{\intercal}\left(\vec{y}- \check{y} \right) + \ \left( \check{y} - \widehat{y} \right)^{\intercal}\left( \check{y} - \widehat{y} \right)$$Thus the original quadratic form can be represented as the sum of $Q_1$ and $Q_2$, where:
$$Q_1 = \left(\vec{y}-\check{y} \right)^{\intercal}\left(\vec{y}-\check{y} \right)$$$$Q_2 = \left( \check{y}-\widehat{y} \right)^{\intercal}\left( \check{y}-\widehat{y} \right) $$$Q_1$ is the sum of squared errors for the unconstrained model, and so its quadratic form has rank $n - p$. $Q_2$ has rank $p - r$, as it arises from the difference between the two fitted values. Thus rank($Q_1$) + rank($Q_2$) = $n - r$, which is the rank of the quadratic form for the constrained model's errors, since $r < p$. Thus, by Cochran's theorem, the two are independently distributed.
I have also run simulations to illustrate this result. I create a response and two covariates that are linearly and randomly related to it, compute the unconstrained coefficients, and use them to form $Q_2 = \text{RSS}_c - \text{RSS}$ under the sum-to-one constraint, along with the unconstrained residual sum of squares $Q_1$. I then check the correlation between the two across 100 batches of 100 simulated data sets each. The mean correlation is very close to zero, which is the expected result under independence. The code is included below.
# Simulation: check that Q1 (the unconstrained RSS) and Q2 (RSS_c - RSS)
# are uncorrelated across simulated data sets.
const <- vector()    # Q2 for each simulated data set
unconst <- vector()  # Q1 for each simulated data set
correl <- vector()   # correlation between Q1 and Q2 within each batch
for (j in 1:100) {
  for (i in 1:100) {
    # simulate a response and two covariates related to it
    y <- rnorm(100)
    x1 <- y + rnorm(100, 0, .2)
    x2 <- rnorm(100, 0, .01) - .3 * y
    Xmat <- matrix(c(rep(1, length(x1)), x1, x2), ncol = 3)
    data2 <- data.frame(y, x1, x2)
    # unconstrained fit
    mod.sim.1 <- lm(y ~ x1 + x2, data = data2)
    BetaUnconstr <- summary(mod.sim.1)$coef[, 1]
    # sum-to-one constraint: t(M) is the 1 x 3 constraint matrix, d = 1
    d <- 1
    M <- matrix(rep(1, 3), ncol = 1)
    # Q2 = (d - M'beta_u)' (M'(X'X)^-1 M)^-1 (d - M'beta_u) = RSS_c - RSS
    b <- d - t(M) %*% BetaUnconstr
    a <- t(Xmat) %*% Xmat
    qinv <- solve(t(M) %*% solve(a) %*% M)
    const[i] <- b * qinv * b
    # Q1 = unconstrained residual sum of squares
    unconst[i] <- sum(resid(mod.sim.1)^2)
  }
  correl[j] <- cor(const, unconst)
}
correl
mean(correl)