The following covers a few of the SAS procedures for conducting
component and factor analysis. Use the Import Wizard to import the
Example Data 4 file using the SPSS File (*.sav) source option and the
member name example4. There should be 750 cases or observations with
no missing values and 16 variables. Make sure the entire data set was
successfully imported to SAS using the following syntax:
PROC MEANS DATA=example4;
RUN;
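PROC CONTENTS offers a complementary check from the metadata side,
confirming the number of observations and variables without printing
any statistics:
PROC CONTENTS DATA=example4;
RUN;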
Before we begin with the analysis syntax, let's take a moment to
address, and hopefully clarify, one of the most confusing and
misarticulated issues in statistical teaching and practice
literature. An ambitious goal, to be sure.
First, Principal Components Analysis (PCA)
is a variable reduction technique which maximizes the amount of
variance accounted for in the observed variables by a smaller group of
variables called COMPONENTS. As an example, consider the following
situation. Let's say we have 500 questions on a survey we designed to
measure persistence. We want to reduce the number of questions so that
it does not take someone 3 hours to complete the survey. It would be
appropriate to use PCA to reduce the number of questions by identifying
and removing redundant questions. For instance, if question 122 and
question 356 are virtually identical (i.e. they ask the exact same
thing but in different ways), then one of them is not necessary. The
PCA process allows us to reduce the number of questions or variables
down to their PRINCIPAL COMPONENTS.
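In SAS terms, spotting such a redundant pair could be as simple as
correlating the two items. A minimal sketch, assuming a data set
named survey containing the hypothetical items q122 and q356 from the
example above (a correlation near 1.0 would suggest redundancy):
PROC CORR DATA=survey;
VAR q122 q356;
RUN;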
PCA is commonly, but very confusingly, called
exploratory factor analysis (EFA). The use of the word factor
in EFA is inappropriate and confusing because we are really interested
in COMPONENTS, not factors. This issue is made more confusing by some
software packages (e.g. PASW / SPSS) which list or use PCA under the
heading factor analysis.
Second, Factor Analysis (FA) is typically used to confirm the latent
factor structure for a group of measured variables. Latent factors
are unobserved variables which typically cannot be directly measured,
but they are assumed to cause the scores we observe on the measured
or indicator variables. FA is a model-based technique. It is
concerned with modeling the relationships between measured variables,
latent factors, and error.
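In equation form, this is the standard common factor model, stated
here for reference: each observed variable is a weighted combination
of the m latent factors plus a unique error term,

$$ x_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jm}F_m + e_j $$

where x_j is the j-th observed variable, the lambdas are the factor
loadings, the F's are the latent factors, and e_j is the error unique
to that variable.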
As stated in O'Rourke, Hatcher, and Stepanski
(2005): "Both (PCA & FA) are methods that can be used to
identify groups of observed variables that tend to hang together
empirically. Both procedures can also be performed with the SAS FACTOR
procedure and they generally tend to provide similar results.
Nonetheless, there are some important conceptual differences between
principal component analysis and factor analysis that should be
understood at the outset. Perhaps the most important deals with the
assumption of an underlying causal structure. Factor analysis assumes
that the covariation in the observed variables is due to the presence
of one or more latent variables (factors) that exert causal influence
on these observed variables" (p. 436).
Final thoughts. Both PCA and FA can be used as exploratory analyses.
But PCA is predominantly used in an exploratory fashion and almost
never used in a confirmatory fashion. FA can be used
in an exploratory fashion, but most of the time it is used in a
confirmatory fashion because it is concerned with modeling factor
structure. The choice of which is used should be driven by the goals of
the analyst. If you are interested in reducing the observed variables
down to their principal components while maximizing the variance
accounted for in the variables by the components, then you should be
using PCA. If you are concerned with modeling the latent factors (and
their relationships) which cause the scores on your observed variables,
then you should be using FA.
### REFERENCE ###
O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). A step-by-step
approach to using SAS for univariate and multivariate statistics
(2nd ed.). Cary, NC: SAS Institute Inc.
##################
IX. Principal Components Analysis
So, here we go with the syntax. The generic syntax
for Principal Components Analysis with options is displayed below.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=PRIN
PRIORS=ONE
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
PROC FACTOR, as stated earlier, can be used for either principal
components analysis or factor analysis (you see why this can be
confusing). The DATA= option should be familiar by now. The SIMPLE
option provides simple descriptive statistics for each of the
variables in the analysis (i.e. number of cases/observations, means,
standard deviations). METHOD=PRIN specifies the extraction method as
principal components. PRIORS=ONE specifies the prior communality
estimates; when conducting principal components analysis, you should
always use ONE. The NFACT option allows you to specify the number of
retained components (again, the use of fact or factor makes this
confusing). MINEIGEN=1 specifies the minimum acceptable (or critical)
eigenvalue a component must display in order to be retained. SCREE
simply specifies that we want a scree plot to be displayed with the
output. ROTATE= specifies a rotation strategy. When components are
correlated, we would choose an oblique rotation strategy (e.g.
PROMAX), and when components are not correlated, we would choose an
orthogonal rotation strategy (e.g. VARIMAX). FLAG=.32 specifies that
we want the output to flag (with an *) all loadings greater in
absolute value than the number we specify. Here, 0.32 is specified
because, when squared, it represents roughly 10% of the variance in
the variable accounted for by the component. The OUT= option
specifies a name for a new data set which will include the original
variables and the retained component scores for each observation. The
OUT= option can only be used when the input data is raw data (as
opposed to a correlation or covariance matrix) and the number of
components (NFACT) has been specified. The OUT= option can be useful
for determining whether or not the components are correlated (e.g. by
running PROC CORR on the newly created data set, which includes the
component scores). The VAR statement is used to specify all the
variables being subjected to the component analysis. It is important
to notice the semicolon that ends the list of options (after
OUT=newdata) and the semicolon that ends the VAR statement itself.
If the OUT= option is used in the principal components analysis,
then you will likely want to explore the relationships between the
components (named factor1 factor2...factorN by default) and the
variables; in that case, the syntax below provides the generic format
for doing so. Again, the use of the term factor when referring to
components makes this stuff confusing.
PROC CORR DATA=newdata;
VAR factor1 factor2...factorN;
WITH variable1 variable2...variableN factor1 factor2...factorN;
RUN;
(1) Now we can move on to a practical example. The current example
uses Example Data 4 (example4), which contains 15 items or variables
and 750 cases or observations. For an initial components analysis, we
do not specify the number of components to be retained, we apply no
rotation strategy, and we are not interested in creating a new data
file.
PROC FACTOR DATA=example4
SIMPLE
METHOD=PRIN
PRIORS=ONE
MINEIGEN=1
SCREE
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
We notice from the output that we have two items (y14 & y15) which
do not load on the first component (always the strongest component
without rotation) but instead form their own retained component (also
with an eigenvalue greater than 1). We know a component should have,
at a minimum, 3 items/variables; but let's reserve deletion of items
until we can discover whether or not our components are related.
(2) Next, we re-run the PCA
specifying NFACT = 5, which really means we are specifying 5 components
to be retained. We also specify the creation of a new data set
(ex4comp2) which will contain all the variables used in the PCA
and component scores for each observation. Also note, we
removed the SIMPLE option because the descriptive statistics were given
with the previous PCA.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
FLAG=.32
OUT=ex4comp2;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
The creation of the new data set allows us to
determine if our components are correlated.
PROC CORR DATA=ex4comp2;
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2
factor3 factor4 factor5;
RUN;
We see in this output that our components are not
correlated, which indicates we should use an orthogonal rotation.
(3) Now we can re-run the PCA
with a VARIMAX rotation applied.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp3;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
Here we see that the varimax rotation cleaned up
the interpretation by eliminating the global first component (see the
Rotated Factor Pattern table). And, because we created a new data file,
we can verify the complete lack of correlations between the components
using the syntax below.
PROC CORR DATA=ex4comp3;
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2
factor3 factor4 factor5;
RUN;
(4) Finally, we can eliminate the two items which (1) by themselves
create a component (components should have more than 2 items or
variables) and (2) do not load (at all) on the unrotated or initial
component 1.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=4
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp4;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC CORR DATA=ex4comp4;
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3
factor4;
RUN;
To help clarify the purpose of PCA, consider reviewing the output for
PCA (3), with particular attention to the first page of that output
(the page above the scree plot). You will find there a table with the
title "Eigenvalues of the Correlation Matrix: Total = 15, Average =
1". The fourth column in that table is called "Cumulative" and refers
to the cumulative variance accounted for by the components. Now focus
on the fifth value from the top in that fourth column. That value of
.5503 tells us 55.03% of the variance in the items (specifically, in
the items' correlation matrix, since the variables are standardized)
is accounted for by all 5 components. As a comparison, and to
highlight the purpose of PCA, look at the same table for PCA (4),
which has the title "Eigenvalues of the Correlation Matrix: Total =
13, Average = 1". Pay particular attention to the fourth value in the
fourth (cumulative) column. This value of .5517 tells us 55.17% of
the variance in the items is accounted for by all 4 components. So,
we have reduced the number of items from 15 to 13, reduced the number
of components, and yet have improved the amount of variance accounted
for in the items by our principal components.
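If you would rather work with these cumulative proportions directly
than read them off the listing, the eigenvalue table can be captured
as a data set with ODS OUTPUT. A sketch, based on the 13-item run
from PCA (4); ev4 is just an arbitrary data set name:
ODS OUTPUT Eigenvalues=ev4;
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=4
ROTATE=VARIMAX;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC PRINT DATA=ev4;
RUN;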
X. Factor Analysis
The generic syntax for Factor Analysis (FA) with options is displayed
below; however, the only real changes are the extraction method and
the priors. For PCA we used METHOD=PRIN; here with FA we will be
using METHOD=ML, which refers to maximum likelihood extraction. Some
suggest using ULS, which refers to unweighted least squares
extraction. The other change is the use of SMC, or squared multiple
correlations, in the PRIORS option.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=ML or ULS
PRIORS=SMC
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
Continuing with the same data as was used above, we will submit our
15 initial items to the maximum likelihood FA with VARIMAX rotation
and SMC priors. We leave out the SIMPLE option because we have
already seen the descriptive statistics for each item above. We will
also leave out the OUT= statement because we do not need the factor
scores to assess the relationships between the factors (we know from
above they are not related). However, it is often useful to save the
factor scores for use in another analysis (e.g. SEM). We will leave
out the MINEIGEN criterion so that we ensure all 5 factors are
retained (often it is the case that only one common factor is
retained, because only one factor displays an eigenvalue greater
than 1).
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=5
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
Looking at the sixth page of the output, you will see a table titled
"Rotated Factor Pattern" in the middle of that page. This table
displays the rotated factor loadings for each item/variable on each
factor retained. Notice that Factor 5 has no items loading greater
than 0.32 (an * indicates loadings greater than 0.32). Also notice
that items y14 and y15 do not load greater than 0.32 on any factor.
In fact, the greatest loading for y14 is with Factor 5 and is only
0.20, which when squared (0.04) represents only 4% of the variance in
that item accounted for by Factor 5. Furthermore, Factor 5 is only
supported by two items (y14 & y15), which themselves are not very
good (as indicated by the communalities). For instance, if we look at
the seventh page of the output, we find the majority of a table
titled "Final Communality Estimates and Variable Weights", which
displays the communalities for each item/variable. Communalities
represent the sum of the squared loadings for an item. They are
interpreted as the amount of variance in an item which is explained
by all the retained factors after rotation. So, we can see that both
y14 and y15 display very low communalities, which indicates their
variance is not explained by the combined factors. To be more
specific, y14 displays a communality of 0.042, which means only 4.2%
of the variance of item y14 is explained by all five factors
combined. The bottom line interpretation here is that Factor 5 and
items y14 and y15 can be removed.
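For reference, a communality is computed as

$$ h_j^2 = \sum_{k=1}^{m} \lambda_{jk}^2 $$

the sum of item j's squared loadings across the m retained factors;
for y14, the five squared loadings sum to 0.042, i.e. 4.2% of its
variance. With Factor 5 and items y14 and y15 removed, we re-run the
FA specifying NFACT=4: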
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
Reviewing the last two pages of the most recent output, we see the
"Rotated Factor Pattern" table and the "Final Communality Estimates
and Variable Weights" table (which starts at the bottom of one page
and continues on the last page of the output). In the Rotated Factor
Pattern table we see a clear factor structure; meaning, each item
loads predominantly on one factor. For instance, the first four items
load virtually exclusively on Factor 1. Furthermore, if we look at
the communalities, we see that all the items displayed a communality
of 0.32 or greater, with one exception. The exception is y4, which is
a little lower than we would like; given that Factor 1 has three
other items which load substantially on it, we may choose to remove
item y4 from further analysis or measurement in the future.
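If one did choose to drop y4, the re-run would simply omit it from
the VAR statement. A sketch only; this is not a step taken in the
analysis here:
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;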
Finally, as an additional example, we can take a look at the same
analysis but with an oblique (PROMAX) rotation strategy.
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=PROMAX
FLAG=.40
OUT=ex4comp5;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC CORR DATA=ex4comp5;
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3
factor4;
RUN;
When interpreting the output of a run with oblique rotation, remember
that the oblique process is a two-stage process. During the first
stage, an orthogonal rotation solution is produced. The current
example provides output (on pages 6 & 7) which is identical to the
previous VARIMAX rotated 4 factor, 13 item solution from above.
During the second stage, the factors are allowed to correlate and the
PROMAX rotation is then applied. Interpretation of the oblique
(PROMAX) solution begins on page 8 of the current output. The top of
page 10 begins with the table named "Inter-Factor Correlations";
directly below that table one can find the "Rotated Factor Pattern
(Standardized Regression Coefficients)" table, which is where the
rotated loadings for the PROMAX rotation are displayed. At the bottom
of page 12, and continuing on to page 13, one will find the
communality estimates associated with the PROMAX solution.
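Since the PROMAX solution reports the inter-factor correlations
directly, they can also be captured with ODS OUTPUT rather than
correlating saved factor scores. A sketch; ifc4 is an arbitrary data
set name:
ODS OUTPUT InterFactorCorr=ifc4;
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
ROTATE=PROMAX;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC PRINT DATA=ifc4;
RUN;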
XI. Internal Consistency Analysis (Cronbach's Alpha Coefficient)
Often when one is conducting principal components
analysis or factor analysis, one will want to conduct an internal
consistency analysis. Traditionally, the term reliability analysis
was used synonymously with internal consistency and/or Cronbach's
alpha (also called coefficient alpha). However, Cronbach's alpha is
not a statistical measure of reliability; it is a measure of internal
consistency. Reliability generally refers to whether or not a
measurement device provides consistent data across multiple
administrations. Reliability can be assessed by correlating multiple
administrations of the measurement device given to the same
population at different times -- this is known as test-retest
reliability. Internal consistency can be thought of as the
relationship between each item and every other item, and as the
relationship of each item to the collection of items or total score.
Internal consistency is assessed using (1) the item-to-total score
correlation and (2) Cronbach's alpha coefficient. In SAS, both are
provided by the PROC CORR procedure when the ALPHA option is
specified.
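For reference, Cronbach's alpha for k items is defined as

$$ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_{total}^2}\right) $$

where sigma_i^2 is the variance of item i and sigma_total^2 is the
variance of the total score; PROC CORR computes this automatically.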
The generic form of the PROC CORR procedure for
producing Cronbach's alpha follows:
PROC CORR DATA=dataname ALPHA NOMISS;
VAR variable1 variable2 variable3...variableN;
RUN;
The PROC CORR for obtaining Cronbach's alpha uses the ALPHA option
and can use the NOMISS option to perform listwise deletion of
observations with missing values.
As a practical example, consider the following
syntax which will provide Cronbach's alpha coefficient for the first
four items of our data set (i.e. Factor 1):
PROC CORR DATA=example4 ALPHA;
VAR y1 y2 y3 y4;
RUN;
The output of the ALPHA procedure contains four tables. The first
table simply reports the descriptive statistics for each
item/variable entered in the procedure. The second table reports the
raw and standardized versions of Cronbach's alpha coefficient. The
third table (often critically important) reports the item-to-total
score correlations and the alpha-if-item-deleted for each item, in
both raw and standardized forms. Keep in mind, we would expect a good
item to display a high correlation with the total score and a low
alpha-if-item-deleted (i.e. if alpha drops when an item is deleted,
then clearly that item was important).
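As with the factor tables, the alpha results can be captured as data
sets via ODS OUTPUT, which is handy when computing alpha for several
groups of items. A sketch; the output data set names are arbitrary:
ODS OUTPUT CronbachAlpha=alpha_f1 CronbachAlphaDel=alphadel_f1;
PROC CORR DATA=example4 ALPHA NOMISS;
VAR y1 y2 y3 y4;
RUN;
PROC PRINT DATA=alpha_f1;
RUN;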