by Frank Luis Enrique
Consider a linear model y = Xß + e which satisfies the so-called classical assumptions of independence and normality of e. In order to estimate ß, Euler provides the function fit(X,y), which returns the least squares estimate b. For example, defining
>X:=ones(15,1)|intrandom(15,3,100);
>y:=X.[2,0.5,0.35,0.15]'+normal(15,1)*3;
So,
>shortestformat; X,
 1  66  21  90
 1  29  53  32
 1  45  30  29
 1  89  28  71
 1  22  45  31
 1  92  20  47
 1  10  60  44
 1  73  47  33
 1  53  51  17
 1  27  87  54
 1  50  61  66
 1  97  20  94
 1   8  99   2
 1  36  53  43
 1  17  19  29
>y,
60.8 39.3 35.5 63.5 29.3 61.3 30.6 61.3 53 52.9 58 74.4 37.9 49.7 20.5
>longformat; b:=fit(X,y)
-1.06363240097 0.531594491398 0.35618719465 0.172007517529
is the OLS estimate of ß. The function fit(.) is computationally efficient and its use is highly recommended. However, in econometric practice, besides b one may also need to report the standard deviations s(b), the residual standard deviation, and other statistics such as t, F, R^2 and the adjusted R^2, which are not provided by fit(.). Moreover, one may want to check common violations of the classical assumptions, such as multicollinearity, heteroskedasticity or autocorrelation, or even estimate ß by another criterion than LS. These needs motivated the following scripts, which I gathered in econometrics.e encouraged by Dr. René Grothmann. To use these scripts simply write:
>load econometrics.e
In the following I describe each function. Please report any bugs to the author. Contributions to this toolbox are welcome.
The following estimators are available in econometrics.e. X is always a full column rank matrix whose first column is a column of ones, and y is a response vector.
>{table,R2,s2,F}:=fitOLS(X,y); table,
-1.0636   3.8866    -0.27367  0.39245
 0.53159  0.042929  12.383    0
 0.35619  0.046846   7.6033   2.8558e-005
 0.17201  0.044422   3.8721   0.0014339
simply returns the OLS estimated parameters b, s(b), and the corresponding t statistics with their p-values. It also returns the R^2 and the adjusted R^2, as well as the F statistic and its p-value.
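For reference, these numbers follow from the standard OLS formulas. The sketch below (plain Python with numpy/scipy, not the Euler code of fitOLS) shows how b, s(b), the t and F statistics and the (adjusted) R^2 can be computed; the function name and return layout are mine.

import numpy as np
from scipy import stats

def ols_table(X, y):
    # OLS estimates with s(b), t statistics, p-values, R^2, adjusted R^2 and F
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                       # least squares estimate
    e = y - X @ b                               # residuals
    s2 = e @ e / (n - k)                        # residual variance
    sb = np.sqrt(s2 * np.diag(XtX_inv))         # standard deviations of b
    t = b / sb
    pt = 2 * stats.t.sf(np.abs(t), n - k)       # two-sided p-values
    R2 = 1 - (e @ e) / np.sum((y - y.mean())**2)
    R2adj = 1 - (1 - R2) * (n - 1) / (n - k)
    F = (R2 / (k - 1)) / ((1 - R2) / (n - k))
    pF = stats.f.sf(F, k - 1, n - k)
    return np.column_stack([b, sb, t, pt]), R2, R2adj, F, pF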
>{b,sb,R2}:=fitLAD(X,y); b|sb,
0.13974  10.335
0.52986  0.11415
0.33522  0.12457
0.17219  0.11812
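As explained below, fitLAD obtains b by linear programming with Euler's simplex routine. Purely as an illustration of the standard LP formulation of least absolute deviations (not the toolbox code; the function name is mine), a scipy sketch is:

import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    # min sum(u+ + u-)  s.t.  X b + u+ - u- = y,  u+, u- >= 0,  b free
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]                            # the LAD estimate b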
fitLAD(X,y) returns the Least Absolute Deviation estimates b, their asymptotic standard deviations and an analog of the R^2. The vector b is estimated by linear programming using the Euler function simplex; b is asymptotically normally distributed. I also provide the experimental estimator {b,m}:=iterLAD(X,y), which returns b by an iterative procedure, m being the number of iterations until convergence. This alternative is useful for big matrices X, but remember that it is still an experimental estimator with unknown statistical properties. fitRANK(X,y) is another robust estimator included in econometrics.e. This function returns the forecasted y after a rank-based transformation of X and y; the transformed data are fitted by OLS. In our example,
>{table,yest}:=fitRANK(X,y);
>yest,
60.268 44.206 44.524 61.322 31.966 61.314 34.044 61.113 53.182 51.568 60.939 67.422 29.431 52.328 20.51
Although fitRANK(X,y) also returns b, recall that this is the estimator associated with the rank-based transformation. A third robust estimator is given by the function fitTHEIL(X,y). This function computes the Theil-Sen estimator of ß in the linear model y=Xß+e, provided x(ij) is different from x(i'j) for all columns except the first one. It calls fitSimpleTHEIL(X,y), which is the original Theil-Sen estimator for simple linear regression. fitTHEIL(X,y) is an adapted version for multiple regression and should be considered experimental. Then, for the simple linear regression of X[:,1:2] on y,
>{b,m}:=fitTHEIL(X[:,1:2],y); b,
30.441 0.42282
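For reference, the classical Theil-Sen estimator for simple regression takes the median of all pairwise slopes. A minimal numpy sketch of that idea (an illustration, not the code of fitSimpleTHEIL; the intercept convention may differ) is:

import numpy as np

def theil_sen_simple(x, y):
    # slope = median of all pairwise slopes, intercept = median of y - slope*x
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n)
              if x[j] != x[i]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return intercept, slope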
Note that the fitTHEIL estimates above may depart conspicuously from the true parameters, possibly due to the small sample size. Therefore, this estimator should be used with caution if n is not large enough. There are also restricted estimators available, that is, estimators where the parameters are subject to a set of linear constraints. We write this specification as y=Xß+v s.t. Rß=r, with E(v)=0, var(v)=W and cov(v(i),v(i'))=0. In our example, let R=[0 1 1 1] and r=1. Then,
>R=[0, 1, 1, 1]; r=1;
>{b,sb,s2}:=fitRLS(X,y,R,r); b|sb,
1.7157   0.88368
0.51462  0.037024
0.32538  0.021234
0.16001  0.042296
The reader may check that b(2)+b(3)+b(4)=1. fitRLS(X,y,R,r) assumes a homoskedastic error structure. If [] is introduced in place of R and r, you simply get fit(X,y).
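The restricted least squares estimate has a well-known closed form, b_R = b - (X'X)^(-1) R' [R (X'X)^(-1) R']^(-1) (Rb - r). A numpy sketch of this textbook formula (fitRLS may be implemented differently; the function name is mine) is:

import numpy as np

def restricted_ls(X, y, R, r):
    # adjust the OLS estimate so that the restriction R b = r holds exactly
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                       # unrestricted OLS
    M = R @ XtX_inv @ R.T
    correction = XtX_inv @ R.T @ np.linalg.solve(M, R @ b - r)
    return b - correction                       # satisfies R b = r

For the example above, R would be the 1x4 matrix [[0,1,1,1]] and r the vector [1].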
>{b,sb,s2}:= fitRGLS(X,y,R,r); b|sb,
1.8459   0.23361
0.50981  0.013058
0.322    0.00481
0.1682   0.012429
returns the feasible GLS estimate of ß when v is heteroskedastic. The estimation is performed in two steps: in the first step the residuals are computed assuming homoskedasticity, and in the second these residuals are used as estimators of the true errors.
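To convey the idea of such a two-step scheme, here is a numpy sketch of a plain (unrestricted) feasible GLS under heteroskedasticity, in which the squared first-step residuals stand in for the unknown error variances. This is only one textbook variant; the exact weighting used by fitRGLS, and its handling of the restrictions R and r, may differ.

import numpy as np

def fgls_hetero(X, y):
    # step 1: OLS residuals; step 2: weighted LS with weights 1/e_i^2
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b_ols                           # first-step residuals
    w = 1.0 / np.maximum(e**2, 1e-12)           # guard against zero residuals
    Xw = X * w[:, None]                         # rows of X scaled by the weights
    b_fgls = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return b_fgls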
>{b,sb,s2}:=fitWHITE(X,y); b|sb,
-1.0636   2.0699
 0.53159  0.033107
 0.35619  0.019989
 0.17201  0.038002
returns White's consistent heteroskedastic estimation of ß. This estimator is suitable for an ill-conditioned W, as W is not inverted. The constrained version of this estimator is
>{b,sb,s2}:=fitRWHITE(X,y,R,r); b|sb,
1.9481   0.72992
0.51147  0.030295
0.33381  0.013921
0.15471  0.036693
The restricted White-type estimator was recently proposed by Frank at the 2007 JSM. This is also a two-step estimator, where the residuals are first computed with fitRLS(.). If instead of exact linear constraints we have stochastic constraints, say r=Rß+u, then ß may be estimated with the function fitSRLS(X,y,R,r,V,w). The arguments of fitSRLS(.) are the same as in the previous restricted estimators, plus an error covariance matrix V and a weighting factor w, which is the quotient var(u)/var(e). If V is replaced by [], fitSRLS(.) uses the identity matrix instead, which is the same as assuming homoskedasticity. If w is unknown, simply use w=1 to weight the dataset X and the linear constraints R equally.

If the errors are correlated, the Cochrane-Orcutt estimator is suitable. Therefore, the Econometrics toolbox includes fitCO(X,y,p), which returns the iterated estimate b following the Cochrane-Orcutt procedure when the error term is autoregressive of order p. The iteration stops when the maximum difference is smaller than 0.001 or after 10 iterations, whichever happens first. Usually, b converges in the fourth or fifth iteration. Although our example has independent errors, just for didactic purposes we compute the estimator b with fitCO(X,y,p) for p=2 after the following sketch of the iteration; rho is the vector of autocorrelation coefficients.
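The numpy sketch below shows the Cochrane-Orcutt iteration for the simplest AR(1) case, using the stopping rule described above (tolerance 0.001, at most 10 iterations). It is only schematic; fitCO handles a general order p, and its internal details may differ.

import numpy as np

def cochrane_orcutt_ar1(X, y, tol=0.001, max_iter=10):
    # iterate: OLS -> estimate rho from residuals -> quasi-difference -> re-estimate
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        e = y - X @ b
        rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])   # AR(1) coefficient of the residuals
        ys = y[1:] - rho * y[:-1]                    # quasi-differenced data
        Xs = X[1:] - rho * X[:-1]
        b_new = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < tol:
            b = b_new
            break
        b = b_new
    return b, rho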
>{b,sb,s2,rho}:=fitCO(X,y,2); b|sb,
-0.15654  4.0151
 0.53877  0.034426
 0.32738  0.04859
 0.16459  0.04435
>rho,
0.61644 -0.35226
As expected, the autocorrelation coefficients are near zero. The standard deviations sb and s2 are the ones obtained with the rhos and the residuals of the last iteration. Let us now redefine our dataset to test probit models.
>X:=[-48.5,24.4,82.8,-24.6,-31.6,91.0,52.1,-87.7, ...
 -17.0,-51.5,-90.7,65.5,-44.0,-7.0,51.6,32.4, ...
 -61.8,34.0,27.9,-72.9,49.9]';
>X:=ones(21,1)|X;
>y:=[0,0,1,0,0,1,1,0,0,0,0,1,1,0,1,1,0,1,1,0,1]';
>{table,lnL,m}:=fitPROBIT(X,y);
>table
-0.064434  0.39924
 0.029999  0.010286
returns the estimated parameters and the asymptotic standard deviations of a probit model. The estimates are computed by maximum likelihood following the Newton-Raphson algorithm. "table" is just a two-column matrix with the estimates and the s.d., lnL is the value of the log-likelihood when the stopping criterion is reached, and m is the number of iterations performed. In the same fashion, fitLOGIT(X,y) returns the logit counterparts.
>{table,lnL,m}:=fitLOGIT(X,y);
>table
-0.23757  0.74992
 0.05311  0.020617
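To illustrate the kind of iteration involved, here is a numpy/scipy sketch of probit maximum likelihood by Fisher scoring, a close relative of the Newton-Raphson scheme mentioned above. It is an illustration only, not the toolbox code, and all names in it are mine.

import numpy as np
from scipy import stats

def probit_ml(X, y, tol=1e-8, max_iter=50):
    # returns estimates, asymptotic s.d., log-likelihood and iteration count
    n, k = X.shape
    b = np.zeros(k)
    for m in range(1, max_iter + 1):
        p = np.clip(stats.norm.cdf(X @ b), 1e-10, 1 - 1e-10)
        d = stats.norm.pdf(X @ b)
        score = X.T @ ((y - p) * d / (p * (1 - p)))            # gradient of log-likelihood
        info = X.T @ (X * (d**2 / (p * (1 - p)))[:, None])     # expected information
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    p = np.clip(stats.norm.cdf(X @ b), 1e-10, 1 - 1e-10)
    d = stats.norm.pdf(X @ b)
    info = X.T @ (X * (d**2 / (p * (1 - p)))[:, None])
    lnL = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    sd = np.sqrt(np.diag(np.linalg.inv(info)))                 # asymptotic standard deviations
    return b, sd, lnL, m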
The Econometrics toolbox includes several diagnostic tools for imperfect collinearity among the columns of X. Throughout this section we assume that X has a first column of ones. In the following I present the available functions. corrcoef(X) returns the correlation coefficient matrix of X. As a rule of thumb, if at least one R[i,j]>0.9 there is evidence of linear association between the columns of X. Remember to omit the column of ones when calling this function. The correlation matrix of X[:,2:4] from our example is obtained as follows. If
>X:=ones(15,1)|intrandom(15,3,100);
>y:=X.[2,0.5,0.35,0.15]'+normal(15,1)*3;
Then
>corrcoef(X[:,2:4]),
1         0.050123  0.16434
0.050123  1         0.21334
0.16434   0.21334   1
The function partialcorr(X,y) returns the partial correlation coefficient between y and each X[:,j]. Recall that a high R^2 combined with low partial correlations is evidence of multicollinearity, although this is a qualitative criterion. The first column of the output is j. A widespread criterion to detect multicollinearity is the Variance Inflation Factor or VIF. The function vif(X) computes the R^2 of the auxiliary regression xj = X(j)ß and the associated Variance Inflation Factor, defined as 1/(1-R^2), where R^2 is the coefficient of determination of the auxiliary regression. The function returns the matrix j|R2|VIF. As a rule of thumb, multicollinearity is serious when VIF>10.
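A numpy sketch of the VIF computation just described (regress each explanatory column on the remaining ones, keep the auxiliary R^2, report 1/(1-R^2)) is given below; it is an illustration, not the code of vif(X). The toolbox call for our example follows.

import numpy as np

def vif_table(X):
    # X contains the column of ones in position 0; output rows are [j, R2, VIF]
    out = []
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)        # all columns except x_j
        b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        e = X[:, j] - others @ b
        R2 = 1 - (e @ e) / np.sum((X[:, j] - X[:, j].mean())**2)
        out.append([j + 1, R2, 1 / (1 - R2)])   # j + 1 gives the 1-based column index
    return np.array(out)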
>vif(X)
2  0.027247  1.028
3  0.045746  1.0479
4  0.069182  1.0743
In our example all the VIFs are lower than 10 because the sample was drawn at random. FGtest(X) returns the Farrar-Glauber F statistic and its p-value to test for linear dependency between xj and X(j). Under the null, ß=0 in the model xj = X(j)ß + v. This test relies heavily on an assumption of normality of X.
>FGtest(X)
2  0.15903  0.92128
3  0.26192  0.85207
4  0.38637  0.76713
A widespread measure of multicollinearity is the condition number. The function cond(X) returns the square root of the ratio between the maximum and minimum eigenvalues of X'X, with the columns of X scaled to unit length but not centered. According to Judge et al. (1985), 5<=cond(X)<=10 reveals low to moderate multicollinearity, whereas 30<=cond(X)<=100 reveals moderate to high linear dependence.
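That definition translates into a few lines of numpy, sketched below with the columns scaled to unit length and not centered. This illustrates the formula only and is not necessarily the exact code of cond(X); the toolbox call for our example follows.

import numpy as np

def condition_number(X):
    # sqrt of the ratio of the largest to the smallest eigenvalue of X'X,
    # after scaling each column of X to unit length
    Xs = X / np.sqrt(np.sum(X**2, axis=0))      # diag(Xs'Xs) = 1
    ev = np.linalg.eigvalsh(Xs.T @ Xs)          # eigenvalues in ascending order
    return np.sqrt(ev[-1] / ev[0])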
>cond(X)
5.8687
I include in the toolbox the function rankcond(X), which returns the condition number of a rank-transformed matrix X, transformed in such a way that the diagonal elements of X'X equal 1. This is an experimental measure whose tolerance values are roughly the same as the ones given for the ordinary condition number. The function rankcond(X) invokes the auxiliary function numerate(v), which returns the ranks of a row vector v. If ties appear, numerate(.) returns the delta that should be subtracted from the ranks. Another test for collinearity is Bartlett's T test. The function bartlett(X) returns Bartlett's T statistic and its p-value to test for collinearity among the columns of X. Under the null hypothesis T is distributed chi-square with k*(k-1)/2 degrees of freedom. Recall that Ho assumes that X is a normal multivariate sample with correlation matrix equal to id(k). Omit the first column of ones in X to perform this test.
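For reference, Bartlett's statistic has the familiar form T = -(n-1-(2k+5)/6)*ln|R|, where R is the sample correlation matrix of the k columns. A numpy/scipy sketch of that formula (an illustration, not necessarily the exact code of bartlett) is shown below; the toolbox call for our example follows.

import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    # test that the correlation matrix of the columns of X is the identity
    n, k = X.shape
    R = np.corrcoef(X, rowvar=False)
    T = -(n - 1 - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
    df = k * (k - 1) / 2
    return T, stats.chi2.sf(T, df)              # statistic and p-value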
>bartlett(X[:,2:4])
0.90285
The following are some statistical tests to check for heteroskedasticity in the residuals. GQtest(X,y,j) returns the p-value of the Goldfeld-Quandt heteroskedasticity test, where the error variance is supposed to be a function of the j-th column. The number of deleted central observations is computed as c:=max(4,floor(n/10)).
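As a rough illustration of the procedure (order the sample by the j-th column, drop the c central observations, compare the residual variances of the two sub-regressions), here is a numpy/scipy sketch; details such as the direction of the variance ratio may differ from GQtest. The toolbox call for our example follows.

import numpy as np
from scipy import stats

def goldfeld_quandt(X, y, j):
    # j is the zero-based index of the column suspected to drive the variance
    n, k = X.shape
    c = max(4, n // 10)                         # number of deleted central observations
    order = np.argsort(X[:, j])
    Xs, ys = X[order], y[order]
    m = (n - c) // 2
    def ssr(Xp, yp):
        b = np.linalg.lstsq(Xp, yp, rcond=None)[0]
        e = yp - Xp @ b
        return e @ e
    F = ssr(Xs[-m:], ys[-m:]) / ssr(Xs[:m], ys[:m])   # "high" group over "low" group
    df = m - k
    return F, stats.f.sf(F, df, df)             # one-sided p-value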
>GQtest(X,y,3),
0.00019688
white(X,y) returns the p-value of White's statistic nR^2 to test for heteroskedasticity. This is perhaps the most general test for heteroskedasticity. Under the null hypothesis nR^2 is distributed chi-squared with p degrees of freedom, where R^2 is the determination coefficient of the auxiliary regression e^2 = ß1 + ß2 x1 + ... + ß(k+1) x1^2 + ... + ß(2k+1) x1x2 + ß(2k+2) x1x3 + ... + v and p is the number of parameters in this regression excluding the intercept.

ARCHtest(X,y,p) returns the p-value of Engle's test for AutoRegressive Conditional Heteroskedasticity (ARCH). Recall that some autocorrelation tests are sensitive to conditional heteroskedasticity, so I include this general ARCH test in the toolbox. The test statistic is nR^2, which is distributed chi-squared with p degrees of freedom; R^2 and p are the coefficient of determination and the number of parameters excluding the intercept in the auxiliary regression e(i)^2 = r0 + r1 e(i-1)^2 + r2 e(i-2)^2 + ... + v.

spearman(X,y) returns Spearman's correlation coefficient between the residuals in absolute value, |e|, and each column of X. This is a nonparametric test. Under the null, the test statistic rho is distributed approximately t with n-2 degrees of freedom. The test requires that n>8.
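As an illustration of the ARCH test just described (regress the squared OLS residuals on their own first p lags and refer the resulting LM statistic to a chi-square with p degrees of freedom), a numpy/scipy sketch follows; it is not the toolbox code, and it uses the n-p usable observations of the auxiliary regression.

import numpy as np
from scipy import stats

def arch_test(X, y, p):
    # auxiliary regression: e(t)^2 on a constant and e(t-1)^2, ..., e(t-p)^2
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e2 = (y - X @ b)**2
    z = e2[p:]
    Z = np.column_stack([np.ones(len(z))] +
                        [e2[p - i - 1:len(e2) - i - 1] for i in range(p)])
    g = np.linalg.lstsq(Z, z, rcond=None)[0]
    u = z - Z @ g
    R2 = 1 - (u @ u) / np.sum((z - z.mean())**2)
    stat = len(z) * R2                          # LM statistic, here with n - p observations
    return stat, stats.chi2.sf(stat, p)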
Consider the following dataset
>X:=ones(7,1)|[-1:5]';
>y:=[10.4,-4.2,10.7,26.1,18,58.6,83.6]';
The true model underlying these data is y = -1 + 0.5*x1 + 3*x1^2 + e, but we shall fit a simple linear regression and test for autocorrelation. Several tests have been proposed for this purpose, some of which are included in the toolbox. gearytest(x) returns the p-value of the runs (streak) test statistic under the null "the signs of x appear at random"; x is a row vector, which in our context is the vector of residuals. The test is nonparametric.
>b:=fit(X,y);
>e:=y-X.b;
>{pval,n1,n2}:=gearytest(e'); pval,
0.23975
DWtest(X,y) returns the Durbin-Watson statistic to test for first order autocorrelation in the residuals and the estimated rho. Before running this test it is suggested to check for ARCH.
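The d statistic itself is simply d = sum((e(t)-e(t-1))^2) / sum(e(t)^2) computed from the OLS residuals, with rho roughly equal to 1 - d/2; a minimal numpy sketch (not the toolbox code) is:

import numpy as np

def durbin_watson(X, y):
    # Durbin-Watson d and the implied first-order autocorrelation estimate
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    d = np.sum(np.diff(e)**2) / np.sum(e**2)
    return d, 1 - d / 2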
>{rho,d}:=DWtest(X,y); d,
0.20457
The d statistic lies in the no-rejection interval 1.36<d<2.64 with a type I probability of 0.05. wallistest(X,y,p) returns the Wallis statistic to test for fourth-order autocorrelation in the residuals. This test is used to test for autocorrelation in quarterly time series. We skip it here, as it refers to a very specific autocorrelation structure. BGtest(X,y,p) returns the p-value of Breusch-Godfrey's statistic to test for autocorrelation of order p. The procedure reproduced by this script is as follows: 1) compute the OLS residuals e = y-Xb; 2) set the auxiliary model y = Xß + Er + v, where E is a matrix of lagged residuals, compute the regression coefficients by OLS, and compute the R^2 coefficient; 3) compute the test statistic (n-p)R^2, which under the null hypothesis is distributed chi-squared with p degrees of freedom. This distribution is asymptotic, so the test should only be performed if n is sufficiently large. A sketch of the textbook formulation is given below, followed by the p-value of Breusch-Godfrey's statistic for first-order autocorrelation in our example.
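The sketch below follows the textbook formulation of the Breusch-Godfrey test, in which the OLS residuals e (rather than y itself) are regressed on X and their own first p lags, and (n-p)R^2 is referred to a chi-square with p degrees of freedom. It is only an illustration, not the code of BGtest.

import numpy as np
from scipy import stats

def breusch_godfrey(X, y, p):
    # step 1: OLS residuals; step 2: auxiliary regression; step 3: (n-p) R^2 ~ chi2(p)
    n, k = X.shape
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    E = np.zeros((n, p))                        # lagged residuals, zeros for missing lags
    for i in range(1, p + 1):
        E[i:, i - 1] = e[:-i]
    Z = np.hstack([X, E])
    g = np.linalg.lstsq(Z, e, rcond=None)[0]
    v = e - Z @ g
    R2 = 1 - (v @ v) / np.sum((e - e.mean())**2)
    stat = (n - p) * R2
    return stats.chi2.sf(stat, p), R2           # p-value and auxiliary R^2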
>{pvalue,R2}:=BGtest(X,y,1); pvalue,
0.17749
leading to non-rejection of Ho at the usual significance levels. However, this is an unreliable conclusion, since n is too small to invoke the asymptotic theory. BPtest(X,y,p) returns the p-value of the Box-Pierce Q statistic. This statistic tests for autocorrelation on the first p OLS residuals. The distribution of Q is derived under the assumption that the residuals are generated by an AR or ARMA process. Under the null hypothesis of no autocorrelation the Q statistic has asymptotically a chi-square distribution with degrees of freedom equal to p.
>{pvalue,Q}:=BPtest(X,y,1); pvalue,
0.91801
LBtest(X,y,p) returns the p-value of the Ljung-Box Q statistic. This statistic is a refined version of the Box-Pierce test. Under the null hypothesis of no autocorrelation the Q statistic has asymptotically a chi^2 distribution with degrees of freedom equal to p. This test is suitable for ARIMA residuals.
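Both Q statistics are simple functions of the residual autocorrelations r(k): Box-Pierce uses Q = n*sum(r(k)^2), while Ljung-Box uses Q = n(n+2)*sum(r(k)^2/(n-k)). A numpy/scipy sketch computing both from the OLS residuals (an illustration only):

import numpy as np
from scipy import stats

def q_tests(X, y, p):
    # Box-Pierce and Ljung-Box Q statistics with chi-square(p) p-values
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    n = len(e)
    e = e - e.mean()
    r = np.array([(e[k:] @ e[:-k]) / (e @ e) for k in range(1, p + 1)])
    q_bp = n * np.sum(r**2)
    q_lb = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, p + 1)))
    return (q_bp, stats.chi2.sf(q_bp, p)), (q_lb, stats.chi2.sf(q_lb, p))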
>{pvalue,Q}:=LBtest(X,y,1); pvalue,
0.89967
The literature indicates that this test performs better than the Box-Pierce test for all sample sizes. BWtest(X,y) returns the g statistic of the Berenblutt-Webb test for the null rho=1; the g statistic is distributed as Durbin-Watson's d.
>BWtest(X,y),
1.5902
Berenblutt-Webb's statistic lies in the positive autocorrelation interval, as 0<d<0.7. vNtest(X,y) returns the p-value of the von Neumann autocorrelation test. This is a large-sample test for first-order autocorrelation.
>{pvalue,z}:=vNtest(X,y); pvalue,
0.26309