If you are not familiar with Bivariate Regression, then I strongly recommend
returning to the previous tutorial and reviewing it before proceeding with this one.
Multiple Linear Regression in SPSS.
Multiple regression simply refers to a regression
model with multiple predictor variables. Multiple regression, like any
regression analysis, can have a couple of different purposes.
Regression can be used for prediction or for determining variable
importance, that is, how two or more variables are related in the
context of a model. There are a vast number of types of regression and ways to
conduct it. This tutorial will focus exclusively on ordinary
least squares (OLS) linear regression. As with many of the tutorials on
this website, this page should not be considered a replacement for a
good textbook, such as:
Pedhazur, E. J. (1997). Multiple
regression in behavioral research: Explanation and prediction
(3rd ed.). New York: Harcourt Brace.
For the duration of this tutorial, we will be
using
RegData001.sav.
Standard Multiple Regression. Standard
multiple regression is perhaps one of the most popular
statistical analyses. It is extremely flexible and allows the
researcher to investigate multiple variable relationships in a single
analysis context. The general interpretation of
multiple regression involves (1) determining whether or not the regression model
is meaningful, and (2) determining which variables contribute meaningfully to the
model. The first part is concerned with model summary statistics (given
the assumptions are met), and the second part is concerned with
evaluating the predictor variables (e.g. their coefficients).
Assumptions: Please notice the
mention of assumptions above. Regression also likely has the
distinction of being the most frequently abused statistical analysis,
meaning it is often used incorrectly. There are many assumptions of
multiple regression analysis; it is strongly urged that one consult a
good textbook, such as Pedhazur (1997), to review all of them.
However, some of the more frequently violated
assumptions will be reviewed here briefly. First, multiple regression
works best under the condition of proper model specification;
essentially, you should have all the important variables in the model
and no unimportant variables in the model. Literature reviews on the
theory and variables of interest pay big dividends when conducting
regression. Second, regression works best when there is a lack of
multicollinearity. Multicollinearity is a fancy word meaning that your
predictor variables are too strongly related to one another, which degrades the
regression's ability to discern which variables are important to the
model. Third, regression is designed to work best with linear
relationships. There are types of regression specifically designed to
deal with non-linear relationships (e.g. exponential, cubic, quadratic,
etc.); but standard multiple regression using ordinary least squares
works best with linear relationships. Fourth, regression is designed to
work with continuous or nearly continuous data. This one causes a great
deal of confusion, because 'nearly continuous' is a subjective
judgment. A 9-point Likert response scale item is NOT a continuous, or
even nearly continuous, variable. Again, there are special types of
regression to deal with different types of data, for example, ordinal
regression for dealing with an ordinal outcome variable, logistic
regression for dealing with a binary (dichotomous) outcome, multinomial
logistic regression for dealing with a polytomous outcome variable,
and so on. Furthermore, if you have one or more categorical predictor
variables, you cannot simply enter them into the model. Categorical
predictors need to be coded using special strategies in order to be
included in a regression model and produce meaningful interpretive
output. The use of dummy coding, effects coding, orthogonal coding, or
criterion coding is appropriate for entering a categorical predictor
variable into a standard regression model (a brief dummy-coding sketch
appears after this paragraph). Again, a good textbook will review each
of these strategies, as each one lends itself to particular
purposes. Fifth, regression works best when outliers are not present.
Outliers can be very influential on correlation and, therefore, on
regression. Thorough initial data analysis should be used to review the
data, identify outliers (both univariate and multivariate), and take
appropriate action. A single, severe outlier can wreak havoc on a
multiple regression analysis; as an esteemed colleague is fond of
saying: know thy data!
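To illustrate dummy coding, suppose a hypothetical three-level categorical predictor named group, coded 1, 2, and 3 (this variable is not in RegData001.sav; it is used here only as an illustration). One way to create two dummy variables in SPSS, with group 1 serving as the reference category, would be:
* Group 2 versus the reference category (group 1).
RECODE group (2=1) (ELSE=0) INTO grp2.
* Group 3 versus the reference category (group 1).
RECODE group (3=1) (ELSE=0) INTO grp3.
EXECUTE.
The new variables grp2 and grp3 would then be entered into the regression in place of group, and each coefficient would be interpreted as the difference from group 1.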
To conduct a standard multiple regression using
ordinary least squares (OLS), start by clicking on Analyze, Regression,
Linear...
Next,
highlight the y variable and use the top arrow button to move it to the
Dependent: box. Then, highlight the x1 and x2 variables and use the
second arrow to move them to the Independent(s): box.
Next, click on the Statistics... button. Select
Confidence intervals, Covariance matrix, Descriptives, and Part and
partial correlations. Then, click on the Continue button.
Next, click on Plots... Then, highlight ZRESID and
use the top arrow button to move it to the Y: box. Then, highlight
ZPRED and use the bottom arrow button to move it to the X: box. Then
click on the Next
button (marked with a red ellipse here). Then, select Histogram and
Normal probability plot. Then, click the Continue button.
Next, click on the Save... button. Notice here you
can have SPSS save a variety of values into the data file. By selecting
these options, SPSS will fill in subsequent columns to the right of
your data file with the values you select here. It is recommended that one
save some type of distance measure; here we used Mahalanobis
distance, which can be used to check for multivariate outliers. Then
click the Continue button and then click the OK button.
The output should be very similar to that
displayed below, with the exception of the new variable called MAH_1
which was created in the data set and includes the values of
Mahalanobis distance for each case.
The output begins with the syntax generated by
all of the pointing and clicking we did to run the analysis.
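For reference, the generated syntax should resemble the sketch below (based on the options selected above; the exact defaults SPSS writes out may differ slightly by version):
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI(95) BCOV R ANOVA ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE MAHAL.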
Then, we have the Descriptive Statistics table, which
includes the mean, standard deviation, and number of observations for
each variable selected for the model.
Then, we have a correlation matrix table, which
includes the correlation, p-value, and number of observations for each
pair of variables in the model. Note that if you have an unequal number of
observations for each pair, SPSS will remove cases from the regression
analysis which do not have complete data on all variables selected for
the model. This table should not be terribly useful, as a good researcher
will have already taken a look at the correlations during initial data
analysis (i.e. before running the regression). One thing to notice here
is the lack of multicollinearity: the two predictors are not strongly
related (r = -.039, p = .350).
This is good, as it indicates adherence to one of the assumptions of
regression.
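If you would like formal collinearity diagnostics (tolerance and VIF) as well, they can be requested from the same Statistics... dialog by selecting Collinearity diagnostics, or by adding the COLLIN and TOL keywords to the /STATISTICS subcommand; a sketch, assuming the same model:
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA ZPP COLLIN TOL
  /DEPENDENT y
  /METHOD=ENTER x1 x2.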
Next, we have the Variables Entered/Removed
table, which as the name implies, reports which variables were entered
into the model.
Then, we have the Model Summary table. This table
provides the Multiple Correlation (R = .784), the
Multiple Correlation squared (R² = .614), the
adjusted Multiple Correlation squared (adj. R² = .606), and the
Standard Error of the Estimate. The multiple correlation
refers to the correlation between the outcome and the combination of
predictors (i.e. the predicted values).
The multiple correlation squared represents the amount of variance in
the outcome which is accounted for by the predictors; here, 61.4% of
the variance in y is accounted for by x1 and x2 together. However, as
mentioned in a previous tutorial, the multiple correlation squared is a
bit optimistic, and therefore, the adjusted R² is
more appropriate. More appropriate still for model comparison and model
fit would be the Akaike Information Criterion
(AIC; Akaike, 1974) or Bayesian Information Criterion (BIC; Schwarz,
1978), neither of which is available in SPSS, but both can be computed
very easily (see the references at the bottom of the page).
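As a sketch (omitting constant terms that cancel when comparing models fitted to the same data), a common form for an OLS model with n observations, k predictors, and the residual sum of squares taken from the ANOVA table is:
AIC = n * ln(SSresidual / n) + 2(k + 1)
BIC = n * ln(SSresidual / n) + (k + 1) * ln(n)
where the + 1 counts the intercept; smaller values indicate better fit relative to model complexity, and only differences between models are meaningful.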
Next,
we have the ANOVA summary table, which indicates that our model's R²
is significantly different from zero, F(2, 97) =
77.286, p < .001.
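As a check on the arithmetic, the F statistic can be recovered from R² and the degrees of freedom (k = 2 predictors, n = 100 cases):
F = (R² / k) / ((1 - R²) / (n - k - 1)) = (.614 / 2) / (.386 / 97) ≈ 77.2
The small discrepancy from 77.286 is simply due to rounding R² to three decimal places.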
Next we have the very informative Coefficients
table. It is often preferred to read this table by column from left to
right, recognizing that each row of information corresponds to an
element of the regression model. The first two columns contain
unstandardized (or raw score) coefficients and their standard errors.
The Constant coefficient is simply the y-intercept term for the linear
best fit line representing our fitted model. The x1 and x2
unstandardized coefficients represent the weight applied to each score
(for each variable) to produce new y scores along the best fit line. If
predicting new scores is the goal for your regression analysis, then
here is one of the places where you will be focusing your attention.
The unstandardized coefficients are used to build the linear regression
equation one might use to predict new scores of y using available
scores of x1 and x2. The equation for the current example is below:
y = .810(x1) + .912(x2) + 221.314
or, equivalently,
y = 221.314 + .810(x1) + .912(x2)
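For example, to generate predicted y scores in SPSS using this equation (the variable name y_pred is our own choice):
* Predicted y values from the fitted OLS equation.
COMPUTE y_pred = 221.314 + .810*x1 + .912*x2.
EXECUTE.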
Next, we have the Standardized Coefficients, which
are typically reported in social science journals (rather than the
unstandardized coefficients) as a way of interpreting variable
importance because they can be directly compared (they are in the same
metric). They are sometimes referred to as standardized slopes, and SPSS
labels them Beta (the Greek letter β). With a single predictor, the
standardized coefficient equals the correlation between that predictor and
the outcome; with multiple predictors, each beta reflects a predictor's
unique contribution while controlling for the other predictors. There is no
constant or y-intercept term when referring to standardized scores
(sometimes called Z-scores) because the y-intercept when graphing them
is always zero. The standardization transformation results in a mean of
0 and a standard deviation of 1 for all variables so transformed. Next,
we have the calculated t-score for each unstandardized coefficient
(coefficient divided by standard error) and their associated p-value.
Next, we have the confidence intervals for each unstandardized
coefficient as specified in the point and click options. Then, we have
the correlations for each predictor (as specified in the options). SPSS
labels the semi-partial correlation as the Part correlation.
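If you want to verify the standardized coefficients yourself, one approach is to standardize the variables and re-run the regression on the z-scores; the unstandardized coefficients from that model should match the Beta column above (within rounding). A sketch:
* Save z-scores; SPSS names them Zy, Zx1, and Zx2 by default.
DESCRIPTIVES VARIABLES=y x1 x2 /SAVE.
* Re-run the regression on the standardized variables.
REGRESSION
  /DEPENDENT Zy
  /METHOD=ENTER Zx1 Zx2.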
Next, we have the Coefficient Correlations
table, which as the name implies displays the correlations and
covariances among our predictors.
Next, we have the Residuals Statistics table which
displays descriptive statistics for predicted values, adjusted
predicted values, and residual values. Residuals are the differences
between the actual values of our outcome y and the predicted values of
our outcome y based on the model we have specified. The table also
produces descriptive summary statistics for measures of multivariate
distance and leverage, which allow us to get an idea of whether or not
we have outliers or influential data points.
Finally, we have the Normal P-P Plot of
Regression Standardized Residual values. We expect the values to be
very close to (or on top of) the reference line, which would indicate
very little deviation of the expected values from the observed values.
Next, we have a histogram of the standardized
residual values, which we expect to be close to normally distributed
around a mean of zero.
Now, we can return to the data view and evaluate
our Mahalanobis distances (MAH_1) to investigate the presence of
outliers. Click on Analyze, Descriptive Statistics, Explore...
Next, highlight the Mahalanobis Distance variable
and use the top arrow button to move it to the Dependent List: box.
Then click on the Statistics... button.
Next, select Descriptives, M-estimators, Outliers,
and Percentiles; then click the Continue button. Then click on the
Plots... button and select Stem-and-leaf, Histogram, and Normality
plots with tests. Then click the Continue button, then click the OK
button.
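Again for reference, the generated syntax should resemble something like the sketch below (the M-estimator tuning constants shown are typical SPSS defaults and may vary by version):
EXAMINE VARIABLES=MAH_1
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES EXTREME
  /MESTIMATORS HUBER(1.339) ANDREW(1.34) HAMPEL(1.7,3.4,8.5) TUKEY(4.685)
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.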
The output should be similar to what is displayed
below.
These first few tables are fairly intuitively
named. Case Processing Summary provides information on the number of
cases used for the Explore function.
The Descriptives table provides the usual suspects
in terms of descriptive statistics for the Mahalanobis distances.
Remember, you should not be alarmed by the skewness and kurtosis, because
Mahalanobis distance will always be non-normally distributed (it is bounded
at zero and positively skewed).
If there are values less than zero, you have a problem (Mahalanobis distance cannot be negative).
The M-Estimators are robust estimates of central tendency (a
generalization of maximum likelihood estimation) which can be used when
outliers are present to overcome their undue influence on ordinary least
squares estimates.
The Percentiles table simply reports the
percentile ranks for the Mahalanobis distances.
The Extreme Values table is very helpful and
reports the highest and lowest five cases for the variable specified;
here, Mahalanobis distance. This allows us to see just how extreme the
most outlying cases are, because Mahalanobis distance is a multivariate
measure of distance from the centroid (the means of all the variables).
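To go a step further, Tabachnick and Fidell (2001) suggest evaluating Mahalanobis distance against a chi-square distribution with degrees of freedom equal to the number of predictors (here, 2), using a conservative criterion such as p < .001. A sketch in SPSS (the variable name MAH_P is our own choice):
* Upper-tail chi-square probability for each saved Mahalanobis distance (df = 2 predictors).
COMPUTE MAH_P = 1 - CDF.CHISQ(MAH_1, 2).
EXECUTE.
Cases with MAH_P less than .001 would be candidates for multivariate outliers and should be examined closely.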
The Tests of Normality table reports two tests of
normality, meaning they test whether or not the distribution of the
specified variable is significantly different from the normal
curve. Here, the table is not terribly useful because we know Mahalanobis
distance is not normally distributed (i.e. it is always
positively skewed).
The next four graphical displays simply show the
distribution of Mahalanobis distances. Of note is the bottom of the
Stem & Leaf plot, which shows that 3 values are extreme; these can
also be seen in the Extreme Values table and the Normal Q-Q plots
on the second row below.
Finally, we have the wonderful box plot which
displays the distribution of Mahalanobis distances intuitively and
identifies extreme values with either a circle (as is the case here) or
an asterisk (which is the case when values are well beyond the whiskers
of the box plot).
This concludes the standard multiple regression
section. The next
section focuses on multiple regression while investigating
the influence of a covariate.
REFERENCES &
RESOURCES
Achen, C. H. (1982). Interpreting and
using regression. Series: Quantitative Applications in the
Social Sciences, No. 29. Thousand Oaks, CA: Sage Publications.
Akaike, H. (1974). A new look at the statistical model identification.
IEEE Transactions on Automatic Control, AC-19, 716-723.
Allison, P. D. (1999). Multiple regression.
Thousand Oaks, CA: Pine Forge Press.
Cohen, J. (1968). Multiple regression as a general
data-analytic system.
Psychological Bulletin, 70(6), 426-443.
Hardy, M. A. (1993). Regression with
dummy variables. Series: Quantitative Applications in the
Social Sciences, No. 93. Thousand Oaks, CA: Sage Publications.
Harrell, F. E., Lee, K. L., & Mark, D. B.
(1996). Multivariate prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15, 361-387.
Kass, R. E., & Raftery, A. E. (1995).
Bayes factors. Journal of the American Statistical
Association, 90, 773-795.
Pedhazur, E. J. (1997). Multiple
regression in behavioral research: Explanation and prediction
(3rd ed.). New York: Harcourt Brace.
Schwarz, G. (1978). Estimating the dimension of a
model. Annals of Statistics, 6, 461-464.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics
(4th ed.). Boston: Allyn and Bacon.