If you are not familiar with Bivariate Regression, then I strongly recommend
returning to the previous tutorial and reviewing it before proceeding with this one.
Multiple Linear Regression in SPSS.
Multiple regression simply refers to a regression
model with multiple predictor variables. Multiple regression, like any
regression analysis, can have a couple of different purposes.
Regression can be used for prediction or for determining variable
importance, that is, how two or more variables are related in the
context of a model. There are a vast number of types of regression and ways to
conduct it. This tutorial will focus exclusively on ordinary
least squares (OLS) linear regression. As with many of the tutorials on
this website, this page should not be considered a replacement for a
good textbook, such as:
Pedhazur, E. J. (1997). Multiple
regression in behavioral research: Explanation and prediction
(3rd ed.). New York: Harcourt Brace.
For the duration of this tutorial, we will be
using
RegData001.sav.
Standard Multiple Regression. Standard
multiple regression is perhaps one of the most popular
statistical analyses. It is extremely flexible and allows the
researcher to investigate multiple variable relationships in a single
analysis context. The general interpretation of
multiple regression involves (1) determining whether or not the regression model
is meaningful, and (2) determining which variables contribute meaningfully to the
model. The first part is concerned with model summary statistics (given
the assumptions are met), and the second part is concerned with
evaluating the predictor variables (e.g. their coefficients).
Assumptions: Please notice the
mention of assumptions above. Regression also likely has the
distinction of being the most frequently abused statistical analysis,
meaning it is often used incorrectly. There are many assumptions of
multiple regression analysis; it is strongly urged that one consult a
good textbook, such as Pedhazur (1997), to review all of them.
However, some of the more frequently violated
assumptions will be reviewed here briefly. First, multiple regression
works best under the condition of proper model specification;
essentially, you should have all the important variables in the model
and no unimportant variables in the model. Literature reviews on the
theory and variables of interest pay big dividends when conducting
regression. Second, regression works best when there is a lack of
multicollinearity. Multicollinearity is a fancy word meaning that your
predictor variables are too strongly related to one another, which degrades the
regression's ability to discern which variables are important to the
model. Third, regression is designed to work best with linear
relationships. There are types of regression specifically designed to
deal with non-linear relationships (e.g. exponential, cubic, quadratic,
etc.); but standard multiple regression using ordinary least squares
works best with linear relationships. Fourth, regression is designed to
work with continuous or nearly continuous data. This one causes a great
deal of confusion, because 'nearly continuous' is a subjective
judgment. A 9-point Likert response scale item is NOT a continuous, or
even nearly continuous, variable. Again, there are special types of
regression to deal with different types of data, for example, ordinal
regression for dealing with an ordinal outcome variable, logistic
regression for dealing with a binary (dichotomous) outcome, multinomial
logistic regression for dealing with a polytomous outcome variable,
and so on. Furthermore, if you have one or more categorical predictor
variables, you cannot simply enter them into the model. Categorical
predictors need to be coded using special strategies in order to be
included in a regression model and produce meaningful interpretive
output. The use of dummy coding, effects coding, orthogonal coding, or
criterion coding is appropriate for entering a categorical predictor
variable into a standard regression model (a brief dummy-coding sketch
appears after this paragraph). Again, a good textbook will review each
of these strategies, as each one lends itself to particular
purposes. Fifth, regression works best when outliers are not present.
Outliers can be very influential on correlation and, therefore, on
regression. Thorough initial data analysis should be used to review the
data, identify outliers (both univariate and multivariate), and take
appropriate action. A single, severe outlier can wreak havoc on a
multiple regression analysis; as an esteemed colleague is fond of
saying: know thy data!
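To illustrate dummy coding, suppose a hypothetical three-level categorical predictor named group, coded 1, 2, and 3 (this variable is not in RegData001.sav; it is used here only as an illustration). One way to create two dummy variables in SPSS, with group 1 serving as the reference category, would be:
* Group 2 versus the reference category (group 1).
RECODE group (2=1) (ELSE=0) INTO grp2.
* Group 3 versus the reference category (group 1).
RECODE group (3=1) (ELSE=0) INTO grp3.
EXECUTE.
The new variables grp2 and grp3 would then be entered into the regression in place of group, and each coefficient would be interpreted as the difference from group 1.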
To conduct a standard multiple regression using
ordinary least squares (OLS), start by clicking on Analyze, Regression,
Linear...
Next,
highlight the y variable and use the top arrow button to move it to the
Dependent: box. Then, highlight the x1 and x2 variables and use the
second arrow to move them to the Independent(s): box.
Next, click on the Statistics... button. Select
Confidence intervals, Covariance matrix, Descriptives, and Part and
partial correlations. Then, click on the Continue button.
Next, click on Plots... Then, highlight ZRESID and
use the top arrow button to move it to the Y: box. Then, highlight
ZPRED and use the bottom arrow button to move it to the X: box. Then
click on the Next
button (marked with a red ellipse here). Then, select Histogram and
Normal probability plot. Then, click the Continue button.
Next, click on the Save... button. Notice here you
can have SPSS save a variety of values into the data file. By selecting
these options, SPSS will fill in subsequent columns to the right of
your data file with the values you select here. It is recommended that one
save some type of distance measure; here we used Mahalanobis
distance, which can be used to check for multivariate outliers. Then
click the Continue button and then click the OK button.
The output should be very similar to that
displayed below, with the exception of the new variable called MAH_1
which was created in the data set and includes the values of
Mahalanobis distance for each case.
The output begins with the syntax generated by
all of the pointing and clicking we did to run the analysis.
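For reference, the generated syntax should resemble the sketch below (based on the options selected above; the exact defaults SPSS writes out may differ slightly by version):
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI(95) BCOV R ANOVA ZPP
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x1 x2
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SAVE MAHAL.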
Then, we have the Descriptive Statistics table, which
includes the mean, standard deviation, and number of observations for
each variable selected for the model.
Then, we have a correlation matrix table, which
includes the correlation, p-value, and number of observations for each
pair of variables in the model. Note that if you have an unequal number of
observations for each pair, SPSS will remove cases from the regression
analysis which do not have complete data on all variables selected for
the model. This table should not be terribly useful, as a good researcher
will have already taken a look at the correlations during initial data
analysis (i.e. before running the regression). One thing to notice here
is the lack of multicollinearity: the two predictors are not strongly
related (r = -.039, p = .350).
This is good, as it indicates adherence to one of the assumptions of
regression.
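If you would like formal collinearity diagnostics (tolerance and VIF) as well, they can be requested from the same Statistics... dialog by selecting Collinearity diagnostics, or by adding the COLLIN and TOL keywords to the /STATISTICS subcommand; a sketch, assuming the same model:
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA ZPP COLLIN TOL
  /DEPENDENT y
  /METHOD=ENTER x1 x2.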
Next, we have the Variables Entered/Removed
table, which as the name implies, reports which variables were entered
into the model.
Then, we have the Model Summary table. This table
provides the Multiple Correlation (R = .784), the
Multiple Correlation squared (R² = .614), the
adjusted Multiple Correlation squared (adj. R² = .606), and the
Standard Error of the Estimate. The multiple correlation
refers to the correlation between the outcome and the combination of
predictors (i.e. the predicted values).
The multiple correlation squared represents the amount of variance in
the outcome which is accounted for by the predictors; here, 61.4% of
the variance in y is accounted for by x1 and x2 together. However, as
mentioned in a previous tutorial, the multiple correlation squared is a
bit optimistic, and therefore, the adjusted R² is
more appropriate. More appropriate still for model comparison and model
fit would be the Akaike Information Criterion
(AIC; Akaike, 1974) or Bayesian Information Criterion (BIC; Schwarz,
1978), neither of which is available in SPSS, but both can be computed
very easily (see the references at the bottom of the page).
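As a sketch (omitting constant terms that cancel when comparing models fitted to the same data), a common form for an OLS model with n observations, k predictors, and the residual sum of squares taken from the ANOVA table is:
AIC = n * ln(SSresidual / n) + 2(k + 1)
BIC = n * ln(SSresidual / n) + (k + 1) * ln(n)
where the + 1 counts the intercept; smaller values indicate better fit relative to model complexity, and only differences between models are meaningful.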
Next,
we have the ANOVA summary table, which indicates that our model's R²
is significantly different from zero, F(2, 97) =
77.286, p < .001.
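As a check on the arithmetic, the F statistic can be recovered from R² and the degrees of freedom (k = 2 predictors, n = 100 cases):
F = (R² / k) / ((1 - R²) / (n - k - 1)) = (.614 / 2) / (.386 / 97) ≈ 77.2
The small discrepancy from 77.286 is simply due to rounding R² to three decimal places.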
Next we have the very informative Coefficients
table. It is often preferred to read this table by column from left to
right, recognizing that each row of information corresponds to an
element of the regression model. The first two columns contain
unstandardized (or raw score) coefficients and their standard errors.
The Constant coefficient is simply the y-intercept term for the linear
best fit line representing our fitted model. The x1 and x2
unstandardized coefficients represent the weight applied to each score
(for each variable) to produce new y scores along the best fit line. If
predicting new scores is the goal for your regression analysis, then
here is one of the places where you will be focusing your attention.
The unstandardized coefficients are used to build the linear regression
equation one might use to predict new scores of y using available
scores of x1 and x2. The equation for the current example is below:
y = .810(x1) + .912(x2) + 221.314
or, equivalently,
y = 221.314 + .810(x1) + .912(x2)
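For example, to generate predicted y scores in SPSS using this equation (the variable name y_pred is our own choice):
* Predicted y values from the fitted OLS equation.
COMPUTE y_pred = 221.314 + .810*x1 + .912*x2.
EXECUTE.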
Next, we have the Standardized Coefficients, which
are typically reported in social science journals (rather than the
unstandardized coefficients) as a way of interpreting variable
importance because they can be directly compared (they are in the same
metric). They are sometimes referred to as standardized slopes, and SPSS
labels them Beta (the Greek letter β). With a single predictor, the
standardized coefficient equals the correlation between that predictor and
the outcome; with multiple predictors, each beta reflects a predictor's
unique contribution while controlling for the other predictors. There is no
constant or y-intercept term when referring to standardized scores
(sometimes called Z-scores) because the y-intercept when graphing them
is always zero. The standardization transformation results in a mean of
0 and a standard deviation of 1 for all variables so transformed. Next,
we have the calculated t-score for each unstandardized coefficient
(coefficient divided by standard error) and their associated p-value.
Next, we have the confidence intervals for each unstandardized
coefficient as specified in the point and click options. Then, we have
the correlations for each predictor (as specified in the options). SPSS
labels the semi-partial correlation as the Part correlation.
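If you want to verify the standardized coefficients yourself, one approach is to standardize the variables and re-run the regression on the z-scores; the unstandardized coefficients from that model should match the Beta column above (within rounding). A sketch:
* Save z-scores; SPSS names them Zy, Zx1, and Zx2 by default.
DESCRIPTIVES VARIABLES=y x1 x2 /SAVE.
* Re-run the regression on the standardized variables.
REGRESSION
  /DEPENDENT Zy
  /METHOD=ENTER Zx1 Zx2.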
Next, we have the Coefficient Correlations
table, which as the name implies displays the correlations and
covariances among our predictors.
Next, we have the Residuals Statistics table which
displays descriptive statistics for predicted values, adjusted
predicted values, and residual values. Residuals are the differences
between the actual values of our outcome y and the predicted values of
our outcome y based on the model we have specified. The table also
produces descriptive summary statistics for measures of multivariate
distance and leverage, which allow us to get an idea of whether or not
we have outliers or influential data points.
Finally, we have the Normal P-P Plot of
Regression Standardized Residual values. We expect the values to be
very close to (or on top of) the reference line, which would indicate
very little deviation of the expected values from the observed values.
Next, we have a histogram of the standardized
residual values, which we expect to be close to normally distributed
around a mean of zero.
Now, we can return to the data view and evaluate
our Mahalanobis distances (MAH_1) to investigate the presence of
outliers. Click on Analyze, Descriptive Statistics, Explore...
Next, highlight the Mahalanobis Distance variable
and use the top arrow button to move it to the Dependent List: box.
Then click on the Statistics... button.
Next, select Descriptives, M-estimators, Outliers,
and Percentiles; then click the Continue button. Then click on the
Plots... button and select Stem-and-leaf, Histogram, and Normality
plots with tests. Then click the Continue button, then click the OK
button.
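Again for reference, the generated syntax should resemble something like the sketch below (the M-estimator tuning constants shown are typical SPSS defaults and may vary by version):
EXAMINE VARIABLES=MAH_1
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES EXTREME
  /MESTIMATORS HUBER(1.339) ANDREW(1.34) HAMPEL(1.7,3.4,8.5) TUKEY(4.685)
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.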
The output should be similar to what is displayed
below.
These first few tables are fairly intuitively
named. Case Processing Summary provides information on the number of
cases used for the Explore function.
The Descriptives table provides the usual suspects
in terms of descriptive statistics for the Mahalanobis distances.
Remember, you should not be alarmed by the skewness and kurtosis, because
Mahalanobis distance will always be non-normally distributed (it is bounded
at zero and positively skewed).
If there are values less than zero, you have a problem (Mahalanobis distance cannot be negative).
The M-Estimators are robust estimates of central tendency (a
generalization of maximum likelihood estimation) which can be used when
outliers are present to overcome their undue influence on ordinary least
squares estimates.
The Percentiles table simply reports the
percentile ranks for the Mahalanobis distances.
The Extreme Values table is very helpful and
reports the highest and lowest five cases for the variable specified;
here, Mahalanobis distance. This allows us to see just how extreme the
most outlying cases are, because Mahalanobis distance is a multivariate
measure of distance from the centroid (the means of all the variables).
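To go a step further, Tabachnick and Fidell (2001) suggest evaluating Mahalanobis distance against a chi-square distribution with degrees of freedom equal to the number of predictors (here, 2), using a conservative criterion such as p < .001. A sketch in SPSS (the variable name MAH_P is our own choice):
* Upper-tail chi-square probability for each saved Mahalanobis distance (df = 2 predictors).
COMPUTE MAH_P = 1 - CDF.CHISQ(MAH_1, 2).
EXECUTE.
Cases with MAH_P less than .001 would be candidates for multivariate outliers and should be examined closely.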
The Tests of Normality table reports two tests of
normality, meaning they test whether or not the distribution of the
specified variable is significantly different from the normal
curve. Here, the table is not terribly useful because we know Mahalanobis
distance is not normally distributed (i.e. it is always
positively skewed).
The next four graphical displays simply show the
distribution of Mahalanobis distances. Of note is the bottom of the
Stem & Leaf plot, which shows that 3 values are extreme; these can
also be seen in the Extreme Values table and the Normal Q-Q plots
on the second row below.
Finally, we have the wonderful box plot which
displays the distribution of Mahalanobis distances intuitively and
identifies extreme values with either a circle (as is the case here) or
an asterisk (which is the case when values are well beyond the whiskers
of the box plot).
This concludes the standard multiple regression
section. The next
section focuses on multiple regression while investigating
the influence of a covariate.
REFERENCES &
RESOURCES
Achen, C. H. (1982). Interpreting and
using regression. Series: Quantitative Applications in the
Social Sciences, No. 29. Thousand Oaks, CA: Sage Publications.
Akaike, H. (1974). A new look at the statistical model identification.
IEEE Transactions on Automatic Control, AC-19, 716-723.
Allison, P. D. (1999). Multiple regression.
Thousand Oaks, CA: Pine Forge Press.
Cohen, J. (1968). Multiple regression as a general
data-analytic system.
Psychological Bulletin, 70(6), 426-443.
Hardy, M. A. (1993). Regression with
dummy variables. Series: Quantitative Applications in the
Social Sciences, No. 93. Thousand Oaks, CA: Sage Publications.
Harrell, F. E., Lee, K. L., & Mark, D. B.
(1996). Multivariate prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors.
Statistics in Medicine, 15, 361-387.
Kass, R. E., & Raftery, A. E. (1995).
Bayes factors. Journal of the American Statistical
Association, 90, 773-795.
Pedhazur, E. J. (1997). Multiple
regression in behavioral research: Explanation and prediction
(3rd ed.). New York: Harcourt Brace.
Schwarz, G. (1978). Estimating the dimension of a
model. Annals of Statistics, 6, 461-464.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics
(4th ed.). Boston: Allyn and Bacon.