Bivariate Linear Regression in SPSS.
Regression analysis can have a couple of different
purposes. Generally regression is used as a means to predict values or
scores on the outcome variable using one or more predictor variables.
However, regression is also often used as a means of determining
variable importance, that is, how two or more variables are related in
the context of a model. There are a vast number of types of regression
and ways to conduct it. This tutorial will focus exclusively on ordinary
least squares (OLS) linear regression. As with many of the tutorials on
this web site, this page should not be considered a replacement for a
good textbook, such as:
Pedhazur, E. J. (1997). Multiple
regression in behavioral research: Explanation and prediction
(3rd ed.). New York: Harcourt Brace.
For the duration of this tutorial, we will be using the data file
RegData001.sav.
Regression Assumptions: Regression is perhaps the most popular form of
statistical analysis. Unfortunately, regression also likely has the
distinction of being the most frequently abused statistical analysis,
meaning it is often used incorrectly. There are many assumptions of
regression analysis, and it is strongly urged that one consult a good
textbook (such as Pedhazur, 1997) to review all of them. However, some
of the more frequently violated assumptions will be reviewed here briefly.

First, regression works best under the condition of proper model
specification; essentially, you should have all the important variables
in the model and no unimportant variables in the model. Literature
reviews on the theory and variables of interest pay big dividends when
conducting regression.

Second, regression works best when there is a lack of multicollinearity.
Multicollinearity is a big fancy word meaning the predictor variables are
too strongly related to one another, which degrades regression's ability
to discern which variables are important to the model.

Third, regression is designed to work best with linear relationships.
There are types of regression specifically designed to deal with
non-linear relationships (e.g., exponential, cubic, quadratic, etc.), but
standard multiple regression using ordinary least squares works best with
linear relationships.

Fourth, regression is designed to work with continuous or nearly
continuous data. This one causes a great deal of confusion, because
'nearly continuous' is a subjective judgment. A 9-point Likert response
scale item is NOT a continuous, or even nearly continuous, variable.
Again, there are special types of regression to deal with different types
of data; for example, ordinal regression for an ordinal outcome variable,
logistic regression for a binary (dichotomous) outcome, multinomial
logistic regression for a polytomous outcome variable, and so on.
Furthermore, if you have one or more categorical predictor variables, you
cannot simply enter them into the model. Categorical predictors need to
be coded using special strategies in order to be included in a regression
model and produce meaningful interpretive output. The use of dummy
coding, effects coding, orthogonal coding, or criterion coding is
appropriate for entering a categorical predictor variable into a standard
regression model (a brief dummy-coding sketch appears at the end of this
section). Again, a good textbook will review each of these strategies, as
each one lends itself to a particular purpose.

Fifth, regression works best when outliers are not present. Outliers can
be very influential on correlation and, therefore, regression. Thorough
initial data analysis should be used to review the data, identify
outliers (both univariate and multivariate), and take appropriate action.
A single, severe outlier can wreak havoc in a multiple regression
analysis; as an esteemed colleague is fond of saying...know thy data!
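To make the coding of categorical predictors concrete, here is a minimal,
hypothetical sketch of dummy coding in SPSS syntax. It assumes a
three-category predictor named group (values 1, 2, and 3; this variable
is not part of RegData001.sav), with category 3 chosen as the reference
group:

* Hypothetical example: dummy code 'group' into two indicator variables.
RECODE group (1=1) (ELSE=0) INTO group_d1.
RECODE group (2=1) (ELSE=0) INTO group_d2.
EXECUTE.

The two indicators, group_d1 and group_d2, would then be entered as
predictors in place of group.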
Bivariate Regression. The
simplest form of regression is bivariate regression, in which one
variable is the outcome and one is the predictor. Very little
information can be extracted from this type of analysis. The most
meaningful statistic is likely to be the correlation coefficient
squared (R²), which refers to
the amount of variance in one variable accounted for by the other.
Start by clicking on Analyze, Regression, Linear...
Next, highlight the y variable and use the top
arrow button to move it to the Dependent: box. Then, highlight the x1
variable and use the second arrow button to move it to the
Independent(s): box.
Next, click on the Statistics... button and select
Confidence intervals and Covariance matrix (Estimates & Model
fit should be selected by default). Then, click the Continue button.
Next, click on the Plots... button. Then,
highlight *ZPRED and use the second arrow button to move it to the X:
box. Then highlight *ZRESID and use the top arrow button to move it to
the Y: box. Then, select Histogram and Normal probability plot. Next,
click the Next button (marked with a
red ellipse
in the figure to the right). Finally, click the Continue button.
Next, click the OK button to complete the
regression analysis and produce output similar to that displayed below.
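As an aside, clicking the Paste button instead of OK places equivalent
command syntax in a syntax window, which is handy for documenting and
re-running an analysis. A sketch of what that syntax looks like for this
analysis (using the y and x1 variables; exact keywords may vary slightly
by SPSS version) is:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS CI(95) BCOV R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT y
  /METHOD=ENTER x1
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).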
Interpretation of the Model Summary table shows us
that the multiple correlation coefficient (R; i.e.,
the correlation between the predictor and the outcome variable) is
.547, which, when squared, gives us .300; this can be thought of as the
amount of variance in the outcome variable that is accounted for by the
predictor variable. This is generally expressed as: 30% of the variance
in y was accounted for by x1. However, R² tends to be slightly
optimistic and, therefore, a more appropriate metric is the adjusted
multiple correlation coefficient squared (adj. R² =
.293). Next, we see the ubiquitous ANOVA table, which simply tests
whether or not our model is significantly better than just using the
mean of y to predict new values of y. Here, our model is significantly
better. Another way of thinking about this ANOVA table concerns whether
or not the R² is significantly different from zero.
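For reference, the adjustment applied to R² follows the standard
shrinkage formula, where n is the sample size and k is the number of
predictors (here k = 1):

adj. R² = 1 - (1 - R²)(n - 1) / (n - k - 1)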
Next, we see our Coefficients table which gives the unstandardized and
standardized coefficients (for building a regression equation) as well
as a
t test for each. So, if we wanted to
predict new raw scores on the outcome variable (y), we would use the
following equation:
(1)
y = .779*x1 + 366.485
where
the .779 is the unstandardized coefficient for the predictor (often
called the b-weight) and the 366.485 is the y-intercept term (often
called a). The t test for the constant or
y-intercept has virtually no meaning. The t test for the predictor
coefficient is testing whether or not the coefficient is significantly
different from zero. The standardized coefficient (often called Beta
and given the symbol β) represents the correlation between the predictor
and the outcome. As you can see, in the case of only one predictor, it
is the same as the multiple correlation (R). If we
were interested in predicting new standardized scores of the outcome
(y) then we would use the following regression equation:
(2)
Zy = .547*Zx1
where
the .547 represents the standardized coefficient for the predictor.
There is no intercept term when dealing with standardized scores because
the regression line always passes through the point where both
standardized variables are zero (their means), a point commonly called
the centroid.
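As a quick worked example of equation (1), a hypothetical case with a
score of 100 on x1 would have a predicted raw score of
.779(100) + 366.485 = 444.385 on y. If you wanted SPSS to generate a
predicted score for every case, a minimal syntax sketch (the variable
name yhat is purely illustrative) would be:

COMPUTE yhat = .779*x1 + 366.485.
EXECUTE.

Alternatively, the REGRESSION procedure can save unstandardized predicted
values itself via its /SAVE PRED subcommand.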
The
next table gives us the correlation and covariance matrix for our
coefficient(s). Then, we have the residual descriptive statistics table
which displays descriptive summary statistics for the residuals, also
called errors of prediction (y - yhat). This table is followed by a
histogram of the residuals, which we expect to be normally distributed
and finally a diagnostic plot showing the expected versus observed
probability values.
As
shown in a previous tutorial, we can get an informative scatterplot to
represent our bivariate regression by clicking on Graphs, Legacy
dialogs, Scatter/Dot...
Next,
click the Define button (the default Simple Scatter is appropriate).
Then highlight the y variable and use the top arrow button to move it
to the Y Axis: box. Then, highlight the x1 variable and use the second
arrow button to move it to the X Axis: box.
Next,
click the OK button to create the simple scatterplot.
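If you prefer syntax, roughly equivalent legacy syntax for this
scatterplot (again using the y and x1 variables) is:

GRAPH
  /SCATTERPLOT(BIVAR)=x1 WITH y
  /MISSING=LISTWISE.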
Next,
right click on the scatterplot in the output and select Edit Content,
In Separate Window to bring up the chart editor.
Using
the chart editor, right click on the actual data points in the
scatterplot, at which point they should turn a yellow
color. Then, select Add Fit Line at Total.
Next,
simply left click somewhere in the white space of the output (outside
the chart editor). You should now see something similar to what is
displayed below. Note that the y-intercept does not seem to match the
value in the table above because the scale of the x-axis begins at
approximately 80 rather than zero.
This concludes the bivariate regression section.
The next section focuses on
Multiple
Linear Regression.