
MODULE 3

Initial Data Analysis (IDA), continued from the previous module. What are descriptive statistics? Descriptive statistics allow us to describe a set of scores or multiple sets of scores. There are typically four categories of descriptive statistics: central tendency, dispersion, distribution, and relation.

Central Tendency: There are three general measures of central tendency. (1). Mean. The mean is the measure most frequently used to describe the center of a distribution of scores; it is the arithmetic average of a series of scores. The mean is very sensitive to outliers, and for this reason it is often preferable to use the trimmed mean, which discards some percentage of the extreme scores (e.g., 20%) before averaging. (2). Median. The median is the point that divides an ordered series of scores into two equal halves. (3). Mode. The mode is the most frequently occurring score in a series.
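In symbols, with x_{(1)} \le \dots \le x_{(n)} denoting the ordered scores, the mean and trimmed mean are:

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{x}_{trim} = \frac{1}{n-2g}\sum_{i=g+1}^{n-g} x_{(i)}, \quad g = \lfloor 0.20\,n \rfloor

(the trimmed mean written here drops 20% of the scores from each tail, which is one common convention). For example, for the scores 1, 2, 3, 4, 100, the mean is 22 but the 20% trimmed mean is (2 + 3 + 4)/3 = 3; a single outlier moves the mean drastically while barely touching the trimmed mean.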

Dispersion: There are five general measures of dispersion. (1). Variance. Variance is the sum of the squared deviations from the mean divided by the degrees of freedom. In lay terms, variance is the average squared deviation of the scores around the mean. (2). Standard Deviation. The standard deviation is the square root of the variance. It is a standardized measure of dispersion (the most frequently used) that allows us to compare the distributions of different variables. Notice that the sum of squares is crucial to both. (3). Z-scores (also called Standard Scores). Z-scores are a transformation applied to each score that allows us to compare scores from different distributions. (4). Range. The range is simply the highest score minus the lowest score and gives an idea of the spread, or distance, of the scores. (5). Minimum & Maximum. Simply the smallest and largest scores. All measures of dispersion provide an idea of distance or spread.
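In symbols, writing \bar{x} for the sample mean and n - 1 for the degrees of freedom, the first three measures are:

    s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}, \qquad s = \sqrt{s^2}, \qquad z_i = \frac{x_i - \bar{x}}{s}

The shared numerator \sum_{i=1}^{n}(x_i - \bar{x})^2 is the sum of squares mentioned above. The z-score transformation rescales every score so the transformed scores have mean 0 and standard deviation 1, which is what makes scores from different distributions comparable.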

Distribution: There are two measures of distribution; both offer a description of the shape of a distribution of scores. Skewness refers to the amount of non-symmetry a distribution of scores contains. Negative skew means the tail points toward the smaller values and most scores are located at the higher values; positive skew means the tail points toward the larger values and most scores are located at the smaller values. Zero skew indicates symmetry. Kurtosis measures the magnitude of a distribution's tails, commonly described as the peakedness or flatness of the distribution, and is often used as a rough indicator of normality. A distribution with a large, positive kurtosis has heavy tails and looks peaked in the center; this is known as leptokurtic. A distribution with a large, negative kurtosis has thin tails and looks flat; this is known as platykurtic (like a plateau).
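For reference, one common moment-based definition of skewness and (excess) kurtosis is the following; note that SPSS applies small-sample corrections, so the values it reports will differ slightly from these:

    g_1 = \frac{m_3}{m_2^{3/2}}, \qquad g_2 = \frac{m_4}{m_2^{2}} - 3, \qquad m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k

The subtracted 3 sets the kurtosis of a normal distribution to zero, so positive values indicate heavier-than-normal tails (leptokurtic) and negative values indicate lighter-than-normal tails (platykurtic).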

Relation: There are two measures of relation; both refer to the amount of variance two variables share. Measures of relation are unique among descriptive statistics in that they can also be used inferentially, for example by testing whether an observed relationship differs from zero. Covariance is an unstandardized measure of relation. Correlation is a standardized measure of relation, meaning it can be used to compare the relationships among multiple pairs of variables.
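In symbols, for paired scores (x_i, y_i):

    \mathrm{cov}_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}, \qquad r_{xy} = \frac{\mathrm{cov}_{xy}}{s_x s_y}

Dividing the covariance by the two standard deviations is the standardization step: it bounds r between -1 and +1 regardless of the variables' scales, which is why correlations, unlike covariances, can be compared across variable pairs.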

Getting descriptive statistics in SPSS.

Open the data file named "Cars.sav", which is provided with this course.

Method 1:

With the Cars data file open in the Data window, go to Analyze, Descriptive Statistics, and then Descriptives...

Now a smaller window should open. Highlight/select "Time to Accelerate from 0 to 60 (sec) [accel]" and use the arrow to move it into the variables box.

Next, click on "Options..." and select the descriptive statistics you want (typically mean, standard deviation, variance, range, standard error (S.E.) of the mean, minimum and maximum, as well as kurtosis and skewness). Then click "Continue".

You may also need the Z-scores for a variable; if so, check the box in the lower left corner, "Save standardized values as variables". This function will create a new variable in your data sheet (in the right-most column of Data View) containing the Z-score corresponding to each individual score on that variable (accel).

Next, click "OK". The output should contain a single, very wide table with all the descriptive statistics specified (except the Z-scores, which are saved in the data file).
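If you prefer syntax, clicking the Paste button (instead of OK) in the Descriptives dialog writes the equivalent commands to a syntax window. A sketch of what the pasted syntax should look like for the options chosen above (the variable name follows the bracketed label shown in the dialog):

    DESCRIPTIVES VARIABLES=accel
      /SAVE
      /STATISTICS=MEAN SEMEAN STDDEV VARIANCE RANGE MIN MAX KURTOSIS SKEWNESS.

The /SAVE subcommand is what saves the standardized values; it creates the new Z-score variable (typically named Zaccel) in the data file.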

Method 2:

With the Cars data file open in the Data window, go to Analyze, Descriptive Statistics, and then Frequencies...

Now a smaller window should open. Highlight/select "Time to Accelerate from 0 to 60 (sec) [accel]" and use the arrow to move it into the variables box.

Next, click on "Statistics..." and select all the statistics specified earlier, as well as quartiles; then click "Continue".

Next, click on "Charts..." and select Histograms and Show normal curve on histogram. Then click "Continue" and then click "OK".
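Again, the Paste button will generate the corresponding syntax; a sketch for the options chosen above (NTILES=4 is what requests the quartiles):

    FREQUENCIES VARIABLES=accel
      /NTILES=4
      /STATISTICS=MEAN SEMEAN STDDEV VARIANCE RANGE MINIMUM MAXIMUM KURTOSIS SKEWNESS
      /HISTOGRAM NORMAL
      /ORDER=ANALYSIS.

The /HISTOGRAM NORMAL subcommand produces the histogram with the normal curve superimposed.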

You should now see some output similar to that below. You'll notice the output table containing all the descriptive statistics is smaller and easier to read than the one provided by the Descriptive Statistics function above.

There are four benefits to using the Frequencies function for gathering descriptive statistics. First, you can get additional descriptive statistics (quartiles). Second, you can get a graphical display of the variable (a histogram for continuous variables or a bar graph for categorical variables). Third, you get a frequencies table. Fourth, the descriptive statistics table is smaller and easier to read than the one produced by the Descriptives function. However, you can only get the standardized scores (Z-scores) by using the Descriptives function.

Method 3: The Explore Function for getting descriptive statistics by group

With the Explore Example data file open in the Data window, go to Analyze, Descriptive Statistics, and then Explore...

Next, pick your dependent variable; in this example we'll use the variable "total score on blame scale [bt]". Highlight it and move it to the Dependent List: box. Then pick your independent variable; in this example we'll use the grouping variable "GENDER [sex]". Highlight it and move it to the Factor List: box. Then click on the Statistics... button.

Now we can specify what we want to get. Check Descriptives, M-estimators, Outliers, and Percentiles. Then click the Continue button. Next, click on the Plots button and select Histogram and Normality plots with tests. Then click the Continue button. Then click the OK button.
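As with the other dialogs, Paste will produce the equivalent EXAMINE syntax; a sketch for the choices above (the M-estimator constants shown are the dialog's defaults):

    EXAMINE VARIABLES=bt BY sex
      /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
      /STATISTICS DESCRIPTIVES EXTREME
      /MESTIMATORS HUBER(1.339) ANDREW(1.34) HAMPEL(1.7,3.4,8.5) TUKEY(4.685)
      /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
      /MISSING LISTWISE
      /NOTOTAL.

The NPPLOT keyword is what requests the normality plots with tests (the Kolmogorov-Smirnov and Shapiro-Wilk tests in the output).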


You should see some output similar to that displayed below.

You'll notice you get the Case Processing Summary, which simply reports the number of participants/cases, the percentages, and the number of missing cases for each group of your independent or grouping variable. Then you get the descriptive statistics for each group, the percentiles, and then the table of Extreme Values. This last table is very handy for helping to detect and/or evaluate outliers. Likewise, the Tests of Normality are helpful for evaluating the assumptions of some common inferential (parametric) analyses. Finally, you're given the plots for each group: histogram, stem-and-leaf, and box plot. The box plot is also very handy for evaluating normality and outliers within the groups. Notice that within the box plot, extreme values are marked with the case number and a star symbol, while less extreme (but likely influential) points are marked with the case number and a circle symbol.

Obviously, SPSS is capable of more complex graphing. If one is so inclined, one can simply go to Graphs in the tool bar and practice making different types of graphs with the current data. Like most functions of SPSS, it is often easy enough to point and click one's way through a short trial-and-error session to get what one wants. Recall that the strength of SPSS, and what it takes pride in, is its user-friendliness. SPSS is extremely easy to use, and figuring out how to get what one wants out of it often takes less time than working through a tutorial (such as this one).

Method 4: Correlation

With the Explore Example data file open in the Data window, go to Analyze, Correlate, Bivariate...

Now you can move 'total score on blame scale' and 'total score on reasons for assigned prison time' to the Variables: box. Notice that you can request any or all of the three types of correlation coefficient, one- or two-tailed significance tests, and flagging of significant correlations. Next, click on the Options... button and specify Means and standard deviations as well as Cross-product deviations and covariances. Then click the Continue button, then click the OK button.
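The pasted syntax for this analysis should look roughly like the sketch below; bt is the blame-scale total shown earlier, while rt stands in here for whatever name the reasons-for-assigned-prison-time variable actually has in the data file (check the bracketed label in the variable list):

    CORRELATIONS
      /VARIABLES=bt rt
      /PRINT=TWOTAIL NOSIG
      /STATISTICS DESCRIPTIVES XPROD
      /MISSING=PAIRWISE.

The XPROD keyword requests the cross-product deviations and covariances; DESCRIPTIVES requests the means and standard deviations.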


You should see output similar to that provided below. Notice that, as is the case with most analyses in SPSS, we specified and received the descriptive statistics for the variables we analyzed (mean, standard deviation, number of observations).

So, we see the correlation between these two variables is -.050, with a p-value of .159. We could also say that only about 0.25% of the variance in one variable is accounted for by the other variable: the squared correlation gives the proportion of variance in one variable accounted for by the other, a form of effect size measure (-.050 * -.050 = .0025 = 0.25%). Clearly, there is a very weak (and not statistically significant) relationship between these two variables. The covariance is -3.431, and 793 cases were used to compute the correlation/covariance. Notice that only cases with complete data on both variables were used.


Contact Information

Jon Starkweather, PhD

Jonathan.Starkweather@unt.edu

940-565-4066

Richard Herrington, PhD

Richard.Herrington@unt.edu

940-565-2140


Last updated: 2018.11.12 by Jon Starkweather.
