Correspondence Analysis
Correspondence analysis is appropriate when
attempting to determine the proximal relationships among two or more
categorical variables. Using correspondence analysis with categorical
variables is analogous to using correlation analysis and principal
components analysis for continuous or nearly continuous variables. Both
provide the researcher with insight into the relationships among
variables and the dimensions or eigenvectors underlying them. A key
part of correspondence analysis is the multi-dimensional map produced
as part of the output. The correspondence map allows researchers to
visualize the relationships among categories spatially on dimensional
axes; in other words, which categories are close to other categories on
empirically derived dimensions.
Unlike correlation, correspondence analysis is nonparametric and is not
based on a distribution (or distributional assumption), so it does not
offer a statistical significance test suited to comparing models.
Comparison of different models (e.g., different variables
entered/removed) should be done with
categorical or logistic regression. Again, correspondence analysis
requires categorical variables only. Correspondence analysis accepts
nominal variables, ordinal variables, and/or discretized
interval/ratio variables (e.g., quartiles), although creating discrete categories
from a continuous variable is generally discouraged.
For the duration of this tutorial we will be using
the IntroPsych_Fall2009.sav file, which is fictitious and contains
1500 participants' responses on
the following variables: code (sequential numbers which identify each
participant); sex; age; family_income (four income brackets); HS_GPA
(high school grade point average brackets); IQ (intelligence as
measured by the Wechsler Adult Intelligence Scale, Fourth Edition);
class_standing (freshman, sophomore, junior, senior); drinks_week
(number of alcoholic drinks consumed in a typical week); confidence
(self rating of how much confidence the student has in their ability to
achieve desired grades in college courses [possible range: 0-20]);
hardworker (self rating of how much effort the student puts toward
their college classes [possible range: 0-20]); number_grade (numeric
course grade for the Introduction to Psychology course); final_grade
(course letter grade for the Introduction to Psychology course).
1.) The first example will
explore a two-way relationship between the 4 categories of family_income
and the 4 categories of class_standing. We would expect a weak
relationship between the two; for example, family income should have
no relation to whether a student is a freshman, sophomore, junior, or
senior.
Begin by clicking on Analyze, Data Reduction,
Correspondence Analysis...
Next, highlight / select the family_income
variable and use the top arrow button to move it into the Row: box.
Then, click the top Define Range... button and type a 1 for the minimum
value and type a 4 for the maximum value. Then click the Update button;
then click the Continue button.
Next, highlight / select the class_standing
variable and use the bottom arrow button to move it to the Column: box.
Then, click the Define Range... button. Next, type a 1 in the minimum
value: box and type a 4 in the Maximum value: box, then click the
Update button. Then, click the Continue button.
Next, click on the Statistics... button. By
default the following should be selected: Correspondence table,
Overview of row points, and Overview of column points. Also select Row
profiles and Column profiles, as well as Confidence Statistics for Row
points and Column points. Then, click the Continue button.
Next, click on the Plots... button and select: Row
points, Column points, Transformed row categories, and Transformed
column categories. By default, the Biplot should be selected already.
Next, click the Continue button, then click the OK button.
The output should be similar to what is displayed
below.
The Correspondence Table displays the frequency
for each category of each variable; it is essentially a
cross-tabulation frequency table.
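As a rough illustration of what a cross-tabulation frequency table is, the sketch below counts how many records fall into each combination of two coded variables. The records are hypothetical (they are not taken from IntroPsych_Fall2009.sav), and the helper name crosstab is an assumption made for this example only.

```python
from collections import Counter

# Hypothetical records: (family_income code, class_standing code), both coded 1-4.
# These are NOT the tutorial's data; they only illustrate the counting.
records = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 2), (3, 3), (4, 4), (4, 3)]

def crosstab(pairs, row_levels, col_levels):
    """Count how many records fall in each (row, column) cell."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]

levels = [1, 2, 3, 4]
table = crosstab(records, levels, levels)
for row in table:
    print(row)
```

Each inner list is one row category (here, an income bracket) and each position within it is one column category (here, a class standing).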
The Row Profiles table displays the proportions of
each column value across each row. For instance, there are 23 freshmen
out of all 207 students whose family income is 00000 - 25000; 23 is
11.1% of 207. The Mass values across the bottom refer to the column's
proportion of the total sample size. For instance, 213 freshmen
represent 14.2% of the 1500 student total sample.
The Column Profiles table displays the proportions
of each row value down each column. For instance, 23 students' family
income is 00000 - 25000 out of all 213 students who are freshmen; 23 is
10.8% of 213. The Mass values down the right-most column represent each
row's proportion of the total sample size. For instance, 207 students
whose family income is 00000 - 25000 represent 13.8% of the 1500
student total sample.
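The profile and mass calculations described above are simple proportions, and can be sketched as follows. The small table of counts is hypothetical and the function names are assumptions for illustration; only the four figures in the last two print statements come from the text.

```python
def row_profiles(table):
    """Divide each cell by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]

def column_profiles(table):
    """Divide each cell by its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / col_totals[j] for j, cell in enumerate(row)] for row in table]

def masses(table):
    """Return (row masses, column masses): each margin over the grand total."""
    n = sum(sum(row) for row in table)
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    return row_mass, col_mass

# Hypothetical 2 x 3 table of counts (not the tutorial's data).
toy = [[10, 20, 10],
       [30, 20, 10]]
print(row_profiles(toy))
print(column_profiles(toy))
print(masses(toy))

# The percentages quoted in the text, recomputed from the quoted counts.
print(round(100 * 23 / 207, 1), round(100 * 213 / 1500, 1))
print(round(100 * 23 / 213, 1), round(100 * 207 / 1500, 1))
```

Each row profile sums to 1 across its row and each column profile sums to 1 down its column, which is why the two tables generally differ even though they start from the same counts.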
The Summary table displays a variety of useful
information. First, we see that 3 dimensions were derived, but only two
are interpretable (i.e. only two dimensions account for a supposedly
meaningful proportion of the total inertia value). The Singular Value
column displays the canonical correlation between the two variables for
each dimension. The Inertia column displays the inertia value for each
dimension and the total inertia value. The total inertia value
represents the amount of variance accounted for in the original
correspondence table by the total model. Each dimension's inertia
value thus refers to the amount of that total variance which is
accounted for by that dimension. So, for instance, we could say that
our model accounts for only 0.9% of the variance in the original
correspondence table (total inertia = .009) and that, of this small
amount, dimension 1 accounts for 0.8 percentage points (inertia =
.008). The chi-square test tests the hypothesis that the total inertia
value is different from zero. Here, our sig. or p-value is greater
than 0.05 (a common cutoff value), which indicates that our total
inertia value is not significantly different from zero. Keep in mind
that this chi-square is not a model fit statistic; unlike the
chi-square used elsewhere, it does not lend itself to comparing models
with different variables. It only tests the inertia value against
zero. The Proportion of Inertia columns represent the proportion of
total inertia accounted for by each dimension; for example, dimension
1 (.008) accounts for 86.6% of total inertia (.009). The displayed
inertia values are rounded, so the percentage does not follow exactly
from them. The Standard Deviation column refers to the standard
deviation of the Singular Value(s), and the Correlation column refers
to the correlation between dimensions.
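To make the relationship among the chi-square, total inertia, per-dimension inertia, and singular values more concrete, here is a minimal sketch in plain Python. It relies on the standard result that total inertia equals the chi-square statistic divided by the sample size. The table of counts is hypothetical, the function names are assumptions, and the power iteration is only a stand-in for the singular value decomposition a statistics package would use; it is not a description of how SPSS computes its output.

```python
import math

def standardized_residuals(table):
    """S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j), with p = counts / n."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]
    c = [sum(col) / n for col in zip(*table)]
    return [[(cell / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j, cell in enumerate(row)] for i, row in enumerate(table)]

def total_inertia(table):
    """Sum of squared standardized residuals (chi-square divided by n)."""
    return sum(x * x for row in standardized_residuals(table) for x in row)

def principal_inertias(table, iterations=2000):
    """Eigenvalues of S^T S via power iteration with deflation, largest first."""
    s = standardized_residuals(table)
    k = len(s[0])
    a = [[sum(row[i] * row[j] for row in s) for j in range(k)] for i in range(k)]
    inertias = []
    for _dim in range(min(len(s), k) - 1):      # at most min(rows, cols) - 1 dimensions
        v = [1.0 / (i + 1) for i in range(k)]   # arbitrary starting vector
        for _step in range(iterations):
            w = [sum(a[i][j] * v[j] for j in range(k)) for i in range(k)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:                    # nothing left to extract
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        lam = sum(v[i] * a[i][j] * v[j] for i in range(k) for j in range(k)) / vv
        inertias.append(lam)
        # Deflate: remove the dimension just found before looking for the next.
        a = [[a[i][j] - lam * v[i] * v[j] / vv for j in range(k)] for i in range(k)]
    return inertias

# Hypothetical 4 x 4 table of counts (not the tutorial's data).
toy = [[30, 20, 10, 5],
       [20, 30, 20, 10],
       [10, 20, 30, 20],
       [5, 10, 20, 30]]
n = sum(sum(row) for row in toy)
total = total_inertia(toy)
chi_square = n * total
inertias = principal_inertias(toy)
singular_values = [math.sqrt(max(x, 0.0)) for x in inertias]
proportions = [x / total for x in inertias]
print(chi_square, total)
print(inertias, singular_values, proportions)
```

A 4 x 4 table yields at most three dimensions, matching the three dimensions reported in the Summary table. With the rounded figures quoted in the text (total inertia .009, N = 1500), the chi-square would be roughly 13.5 on (4 - 1)(4 - 1) = 9 degrees of freedom, below the .05 critical value of about 16.92, which is consistent with the non-significant result described above.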
The Overview Row Points table displays values
which allow the researcher to evaluate how each row contributes to the
dimensions and how each dimension contributes to the rows. The Mass (as
mentioned above) is simply each row's proportion of the total
(1500). The Score in Dimension columns display each row's score on
dimension 1 and dimension 2. The scores are derived from the proportions
(mass) for each cell, column, and row relative to the total sample; the
scores represent dimensional distance and are used in the
graphs below. The Inertia column shows how much of the total inertia
value each row accounts for. The Contribution Of Point to
Inertia of Dimension columns show the role each row plays in each
dimension; these are analogous to factor or component loadings. The
Contribution Of Dimension to Inertia of Point columns show the role
each dimension plays in each row; these are not the inverse or
opposite of the previous two columns because each dimension is composed
of multiple points. The Total column represents the sum of each
dimension's role in the row.
The Overview Column Points table displays values
which allow the researcher to evaluate how each column contributes to the
dimensions and how each dimension contributes to the columns. The Mass
(as mentioned above) is simply each column's proportion of the
total (1500). The Score in Dimension columns display each column's score
on dimension 1 and dimension 2. The scores are derived from the
proportions (mass) for each cell, column, and row relative to the
total sample; the scores represent dimensional distance and
are used in the graphs below. The Inertia column shows how much of the
total inertia value each column accounts for. The
Contribution Of Point to Inertia of Dimension columns show the role
each column plays in each dimension; these are analogous to factor or
component loadings. The Contribution Of Dimension to Inertia of Point
columns show the role each dimension plays in each column; these are
not the inverse or opposite of the previous two columns because each
dimension is composed of multiple points. The Total column represents
the sum of each dimension's role in the column.
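The Inertia column in the two overview tables can be read as each point's share of the total inertia. The sketch below shows one way to compute that share directly from a table of counts, assuming the usual chi-square distance measure. The counts are hypothetical and the function name point_inertias is an assumption for illustration; the per-dimension contribution columns are not reproduced here because they also depend on the normalization chosen.

```python
def point_inertias(table):
    """Return (row inertias, column inertias); each list sums to total inertia."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]          # row masses
    c = [sum(col) / n for col in zip(*table)]    # column masses
    # Each cell's share of inertia: (p_ij - r_i * c_j)^2 / (r_i * c_j)
    cell = [[(x / n - r[i] * c[j]) ** 2 / (r[i] * c[j])
             for j, x in enumerate(row)] for i, row in enumerate(table)]
    row_inertia = [sum(row) for row in cell]
    col_inertia = [sum(col) for col in zip(*cell)]
    return row_inertia, col_inertia

# Hypothetical 4 x 4 table of counts (not the tutorial's data).
toy_counts = [[30, 20, 10, 5],
              [20, 30, 20, 10],
              [10, 20, 30, 20],
              [5, 10, 20, 30]]
row_inertia, col_inertia = point_inertias(toy_counts)
print(row_inertia, sum(row_inertia))
print(col_inertia, sum(col_inertia))
```

Summing the row inertias or the column inertias gives the same total, which is why both overview tables report the same total inertia in their bottom row.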
The confidence points tables display the standard
deviation of each point's dimension score, as well as the correlation
between each point's dimension scores. Recall, the scores themselves
are displayed in previous tables (above).
The first two graphs show the score for each
category of Family Income on dimension 1 and dimension 2.
The next two graphs show the score for each
category of Class Standing on dimension 1 and dimension 2.
The next two graphs show the scores for each
category on both dimensions (at once) for Family Income and Class
Standing.
Finally, the correspondence map shows each
category score on both dimensions (at once) for both family income and
class standing (at once). Now we can see the usefulness of scores as
measures of distance on the two interpreted dimensions of our model.
The scores allow us to compare categories across variables in
two-dimensional space (in this case). Remember, correlation is a
standardized measure of the relationship between two (typically)
continuous variables. Correspondence is a standardized measure of the
relationship (in space/distance) between categories of multiple
variables (in this case, two). It is important to note that the
dimensions are empirically derived axes or eigenvectors and not simply
the variables entered into the analysis. So, we could say that juniors
appear to have family incomes between 50 and 75 thousand dollars.
However, because our total inertia value of 0.009 is not significantly
different from zero, we really cannot have confidence in this data's
ability to support conclusions about the general population. The model
is not good at all, with only 0.9% of the variance in the original
correspondence table accounted for by the total model (all three
dimensions, only two of which were interpreted).
As with most of the tutorials / pages within this
site, this page should not be considered an exhaustive review of the
topic covered and it should not be considered a substitute for a good
textbook.