The following covers a few of the SAS procedures for conducting
component and factor analysis. Use the Import Wizard to import the
Example Data 4 file using the SPSS File (*.sav) source option and the
member name example4. There should be 750 cases or observations with
no missing values and 16 variables. Make sure the entire data set was
successfully imported to SAS using the following syntax:
PROC MEANS DATA=example4;
RUN;
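PROC CONTENTS offers a complementary check from the metadata side,
confirming the number of observations and variables without printing
any statistics:
PROC CONTENTS DATA=example4;
RUN;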
Before we begin with the analysis syntax, let's take a moment to
address, and hopefully clarify, one of the most confusing and
misarticulated issues in statistical teaching and practice
literature. An ambitious goal, to be sure.
First, Principal Components Analysis (PCA)
is a variable reduction technique which maximizes the amount of
variance accounted for in the observed variables by a smaller group of
variables called COMPONENTS. As an example, consider the following
situation. Let's say we have 500 questions on a survey we designed to
measure persistence. We want to reduce the number of questions so that
it does not take someone 3 hours to complete the survey. It would be
appropriate to use PCA to reduce the number of questions by identifying
and removing redundant questions. For instance, if question 122 and
question 356 are virtually identical (i.e. they ask the exact same
thing but in different ways), then one of them is not necessary. The
PCA process allows us to reduce the number of questions or variables
down to their PRINCIPAL COMPONENTS.
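In SAS terms, spotting such a redundant pair could be as simple as
correlating the two items. A minimal sketch, assuming a data set
named survey containing the hypothetical items q122 and q356 from the
example above (a correlation near 1.0 would suggest redundancy):
PROC CORR DATA=survey;
VAR q122 q356;
RUN;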
PCA is commonly, but very confusingly, called
exploratory factor analysis (EFA). The use of the word factor
in EFA is inappropriate and confusing because we are really interested
in COMPONENTS, not factors. This issue is made more confusing by some
software packages (e.g. PASW / SPSS) which list or use PCA under the
heading factor analysis.
Second, Factor Analysis (FA) is typically used to confirm the latent
factor structure for a group of measured variables. Latent factors
are unobserved variables which typically cannot be directly measured,
but they are assumed to cause the scores we observe on the measured
or indicator variables. FA is a model-based technique. It is
concerned with modeling the relationships between measured variables,
latent factors, and error.
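In equation form, this is the standard common factor model, stated
here for reference: each observed variable is a weighted combination
of the m latent factors plus a unique error term,

$$ x_j = \lambda_{j1}F_1 + \lambda_{j2}F_2 + \cdots + \lambda_{jm}F_m + e_j $$

where x_j is the j-th observed variable, the lambdas are the factor
loadings, the F's are the latent factors, and e_j is the error unique
to that variable.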
As stated in O'Rourke, Hatcher, and Stepanski
(2005): "Both (PCA & FA) are methods that can be used to
identify groups of observed variables that tend to hang together
empirically. Both procedures can also be performed with the SAS FACTOR
procedure and they generally tend to provide similar results.
Nonetheless, there are some important conceptual differences between
principal component analysis and factor analysis that should be
understood at the outset. Perhaps the most important deals with the
assumption of an underlying causal structure. Factor analysis assumes
that the covariation in the observed variables is due to the presence
of one or more latent variables (factors) that exert causal influence
on these observed variables" (p. 436).
Final thoughts. Both PCA and FA can be used as exploratory analyses.
But PCA is predominantly used in an exploratory fashion and almost
never used in a confirmatory fashion. FA can be used
in an exploratory fashion, but most of the time it is used in a
confirmatory fashion because it is concerned with modeling factor
structure. The choice of which is used should be driven by the goals of
the analyst. If you are interested in reducing the observed variables
down to their principal components while maximizing the variance
accounted for in the variables by the components, then you should be
using PCA. If you are concerned with modeling the latent factors (and
their relationships) which cause the scores on your observed variables,
then you should be using FA.
### REFERENCE ###
O'Rourke, N., Hatcher, L., & Stepanski, E. J. (2005). A step-by-step
approach to using SAS for univariate and multivariate statistics
(2nd ed.). Cary, NC: SAS Institute Inc.
##################
IX. Principal Components Analysis
So, here we go with the syntax. The generic syntax
for Principal Components Analysis with options is displayed below.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=PRIN
PRIORS=ONE
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
PROC FACTOR, as stated earlier, can be used for either principal
components analysis or factor analysis (you see why this can be
confusing). The DATA= option should be familiar by now. The SIMPLE
option provides simple descriptive statistics for each of the
variables in the analysis (i.e. number of cases/observations, means,
standard deviations). METHOD=PRIN specifies the extraction method as
principal components. PRIORS=ONE specifies the prior communality
estimates; when conducting principal components analysis, you should
always use ONE. The NFACT option allows you to specify the number of
retained components (again, the use of fact or factor makes this
confusing). MINEIGEN=1 specifies the minimum acceptable (or critical)
eigenvalue a component must display in order to be retained. SCREE
simply specifies that we want a scree plot to be displayed with the
output. ROTATE= specifies a rotation strategy. When components are
correlated, we would choose an oblique rotation strategy (e.g.
PROMAX), and when components are not correlated, we would choose an
orthogonal rotation strategy (e.g. VARIMAX). FLAG=.32 specifies that
we want the output to flag (with an *) all loadings greater in
absolute value than the number we specify. Here, 0.32 is specified
because, when squared, it represents roughly 10% of the variance in
the variable accounted for by the component. The OUT= option
specifies a name for a new data set which will include the original
variables and the retained component scores for each observation. The
OUT= option can only be used when the input data is raw data (as
opposed to a correlation or covariance matrix) and the number of
components (NFACT) has been specified. The OUT= option can be useful
for determining whether or not the components are correlated (e.g. by
running PROC CORR on the newly created data set, which includes the
component scores). The VAR statement is used to specify all the
variables being subjected to the component analysis. It is important
to notice the semicolon that ends the list of options (after
OUT=newdata) and the semicolon that ends the VAR statement itself.
If the OUT= option is used in the principal components analysis,
then you will likely want to explore the relationships between the
components (named factor1 factor2...factorN by default) and the
variables; in that case, the syntax below provides the generic format
for doing so. Again, the use of the term factor when referring to
components makes this stuff confusing.
PROC CORR DATA=newdata;
VAR factor1 factor2...factorN;
WITH variable1 variable2...variableN factor1 factor2...factorN;
RUN;
(1) Now we can move on to a practical example. The current example
uses Example Data 4 (example4), which contains 15 items or variables
and 750 cases or observations. For an initial components analysis, we
do not specify the number of components to be retained, we apply no
rotation strategy, and we are not interested in creating a new data
file.
PROC FACTOR DATA=example4
SIMPLE
METHOD=PRIN
PRIORS=ONE
MINEIGEN=1
SCREE
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
We notice from the output that we have two items (y14 & y15) which
do not load on the first component (always the strongest component
without rotation) but instead form their own retained component (also
with an eigenvalue greater than 1). We know a component should have,
at a minimum, 3 items/variables; but let's reserve deletion of items
until we can discover whether or not our components are related.
(2) Next, we re-run the PCA
specifying NFACT = 5, which really means we are specifying 5 components
to be retained. We also specify the creation of a new data set
(ex4comp2) which will contain all the variables used in the PCA
and component scores for each observation. Also note, we
removed the SIMPLE option because the descriptive statistics were given
with the previous PCA.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
FLAG=.32
OUT=ex4comp2;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
The creation of the new data set allows us to
determine if our components are correlated.
PROC CORR DATA=ex4comp2;
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2
factor3 factor4 factor5;
RUN;
We see in this output that our components are not
correlated, which indicates we should use an orthogonal rotation.
(3) Now we can re-run the PCA
with a VARIMAX rotation applied.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=5
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp3;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
Here we see that the varimax rotation cleaned up
the interpretation by eliminating the global first component (see the
Rotated Factor Pattern table). And, because we created a new data file,
we can verify the complete lack of correlations between the components
using the syntax below.
PROC CORR DATA=ex4comp3;
VAR factor1 factor2 factor3 factor4 factor5;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15 factor1 factor2
factor3 factor4 factor5;
RUN;
(4) Finally, we can eliminate the two items which (1) by themselves
create a component (components should have more than 2 items or
variables) and (2) do not load (at all) on the unrotated or initial
component 1.
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=4
MINEIGEN=1
SCREE
ROTATE=VARIMAX
FLAG=.32
OUT=ex4comp4;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC CORR DATA=ex4comp4;
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3
factor4;
RUN;
To help clarify the purpose of PCA, consider reviewing the output for
PCA (3), with particular attention to the first page of that output
(the page above the scree plot). You will find there a table with the
title "Eigenvalues of the Correlation Matrix: Total = 15, Average =
1". The fourth column in that table is called "Cumulative" and refers
to the cumulative variance accounted for by the components. Now focus
on the fifth value from the top in that fourth column. That value of
.5503 tells us 55.03% of the variance in the items (specifically, in
the items' correlation matrix, since the variables are standardized)
is accounted for by all 5 components. As a comparison, and to
highlight the purpose of PCA, look at the same table for PCA (4),
which has the title "Eigenvalues of the Correlation Matrix: Total =
13, Average = 1". Pay particular attention to the fourth value in the
fourth (cumulative) column. This value of .5517 tells us 55.17% of
the variance in the items is accounted for by all 4 components. So,
we have reduced the number of items from 15 to 13, reduced the number
of components, and yet have improved the amount of variance accounted
for in the items by our principal components.
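If you would rather work with these cumulative proportions directly
than read them off the listing, the eigenvalue table can be captured
as a data set with ODS OUTPUT. A sketch, based on the 13-item run
from PCA (4); ev4 is just an arbitrary data set name:
ODS OUTPUT Eigenvalues=ev4;
PROC FACTOR DATA=example4
METHOD=PRIN
PRIORS=ONE
NFACT=4
ROTATE=VARIMAX;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC PRINT DATA=ev4;
RUN;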
X. Factor Analysis
The generic syntax for Factor Analysis (FA) with options is displayed
below; however, the only real changes are the extraction method and
the priors. For PCA we used METHOD=PRIN; here with FA we will be
using METHOD=ML, which refers to maximum likelihood extraction. Some
suggest using ULS, which refers to unweighted least squares
extraction. The other change is the use of SMC, or squared multiple
correlations, in the PRIORS option.
PROC FACTOR DATA=datasetname
SIMPLE
METHOD=ML or ULS
PRIORS=SMC
NFACT=
MINEIGEN=1
SCREE
ROTATE=
FLAG=.32
OUT=newdata;
VAR variable1 variable2 variable3...variableN;
RUN;
Continuing with the same data as was used above, we will submit our
15 initial items to the maximum likelihood FA with VARIMAX rotation
and SMC priors. We leave out the SIMPLE option because we have
already seen the descriptive statistics for each item above. We will
also leave out the OUT= statement because we do not need the factor
scores to assess the relationships between the factors (we know from
above they are not related). However, it is often useful to save the
factor scores for use in another analysis (e.g. SEM). We will leave
out the MINEIGEN criterion so that we ensure all 5 factors are
retained (often it is the case that only one common factor is
retained, because only one factor displays an eigenvalue greater
than 1).
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=5
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 y14 y15;
RUN;
Looking at the sixth page of the output, you will see a table titled
"Rotated Factor Pattern" in the middle of that page. This table
displays the rotated factor loadings for each item/variable on each
factor retained. Notice that Factor 5 has no items loading greater
than 0.32 (an * indicates loadings greater than 0.32). Also notice
that items y14 and y15 do not load greater than 0.32 on any factor.
In fact, the greatest loading for y14 is with Factor 5 and is only
0.20, which when squared (0.04) represents only 4% of the variance in
that item accounted for by Factor 5. Furthermore, Factor 5 is only
supported by two items (y14 & y15), which themselves are not very
good (as indicated by the communalities). For instance, if we look at
the seventh page of the output, we find the majority of a table
titled "Final Communality Estimates and Variable Weights", which
displays the communalities for each item/variable. Communalities
represent the sum of the squared loadings for an item. They are
interpreted as the amount of variance in an item which is explained
by all the retained factors after rotation. So, we can see that both
y14 and y15 display very low communalities, which indicates their
variance is not explained by the combined factors. To be more
specific, y14 displays a communality of 0.042, which means only 4.2%
of the variance of item y14 is explained by all five factors
combined. The bottom line interpretation here is that Factor 5 and
items y14 and y15 can be removed.
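For reference, a communality is computed as

$$ h_j^2 = \sum_{k=1}^{m} \lambda_{jk}^2 $$

the sum of item j's squared loadings across the m retained factors;
for y14, the five squared loadings sum to 0.042, i.e. 4.2% of its
variance. With Factor 5 and items y14 and y15 removed, we re-run the
FA specifying NFACT=4: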
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
Reviewing the last two pages of the most recent output, we see the
"Rotated Factor Pattern" table and the "Final Communality Estimates
and Variable Weights" table (which starts at the bottom of one page
and continues on the last page of the output). In the Rotated Factor
Pattern table we see a clear factor structure; meaning, each item
loads predominantly on one factor. For instance, the first four items
load virtually exclusively on Factor 1. Furthermore, if we look at
the communalities, we see that all the items displayed a communality
of 0.32 or greater, with one exception. The exception is y4, which is
a little lower than we would like; given that Factor 1 has three
other items which load substantially on it, we may choose to remove
item y4 from further analysis or measurement in the future.
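If one did choose to drop y4, the re-run would simply omit it from
the VAR statement. A sketch only; this is not a step taken in the
analysis here:
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=VARIMAX
FLAG=.32;
VAR y1 y2 y3 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;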
Finally, as an additional example, we can take a look at the same
analysis but with an oblique (PROMAX) rotation strategy.
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
SCREE
ROTATE=PROMAX
FLAG=.40
OUT=ex4comp5;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC CORR DATA=ex4comp5;
VAR factor1 factor2 factor3 factor4;
WITH y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13 factor1 factor2 factor3
factor4;
RUN;
When interpreting the output of a run with oblique rotation, remember
that the oblique process is a two-stage process. During the first
stage, an orthogonal rotation solution is produced. The current
example provides output (on pages 6 & 7) which is identical to the
previous VARIMAX rotated 4 factor, 13 item solution from above.
During the second stage, the factors are allowed to correlate and the
PROMAX rotation is then applied. Interpretation of the oblique
(PROMAX) solution begins on page 8 of the current output. The top of
page 10 begins with the table named "Inter-Factor Correlations";
directly below that table one can find the "Rotated Factor Pattern
(Standardized Regression Coefficients)" table, which is where the
rotated loadings for the PROMAX rotation are displayed. At the bottom
of page 12, and continuing on to page 13, one will find the
communality estimates associated with the PROMAX solution.
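Since the PROMAX solution reports the inter-factor correlations
directly, they can also be captured with ODS OUTPUT rather than
correlating saved factor scores. A sketch; ifc4 is an arbitrary data
set name:
ODS OUTPUT InterFactorCorr=ifc4;
PROC FACTOR DATA=example4
METHOD=ML
PRIORS=SMC
NFACT=4
ROTATE=PROMAX;
VAR y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 y13;
RUN;
PROC PRINT DATA=ifc4;
RUN;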
XI. Internal Consistency Analysis (Cronbach's Alpha Coefficient)
Often when one is conducting principal components
analysis or factor analysis, one will want to conduct an internal
consistency analysis. Traditionally, the term reliability analysis
was used synonymously with internal consistency and/or Cronbach's
alpha (also called coefficient alpha). However, Cronbach's alpha is
not a statistical measure of reliability; it is a measure of internal
consistency. Reliability generally refers to whether or not a
measurement device provides consistent data across multiple
administrations. Reliability can be assessed by correlating multiple
administrations of the measurement device given to the same
population at different times -- this is known as test-retest
reliability. Internal consistency can be thought of as the
relationship between each item and every other item, and as the
relationship of each item to the collection of items or total score.
Internal consistency is assessed using (1) the item-to-total score
correlation and (2) Cronbach's alpha coefficient. In SAS, both are
provided by the PROC CORR procedure when the ALPHA option is
specified.
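For reference, Cronbach's alpha for k items is defined as

$$ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_{total}^2}\right) $$

where sigma_i^2 is the variance of item i and sigma_total^2 is the
variance of the total score; PROC CORR computes this automatically.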
The generic form of the PROC CORR procedure for
producing Cronbach's alpha follows:
PROC CORR DATA=dataname ALPHA NOMISS;
VAR variable1 variable2 variable3...variableN;
RUN;
The PROC CORR for obtaining Cronbach's alpha uses the ALPHA option
and can use the NOMISS option to perform listwise deletion of
observations with missing values.
As a practical example, consider the following
syntax which will provide Cronbach's alpha coefficient for the first
four items of our data set (i.e. Factor 1):
PROC CORR DATA=example4 ALPHA;
VAR y1 y2 y3 y4;
RUN;
The output of the ALPHA procedure contains four tables. The first
table simply reports the descriptive statistics for each
item/variable entered in the procedure. The second table reports the
raw and standardized versions of Cronbach's alpha coefficient. The
third table (often critically important) reports the item-to-total
score correlations and the alpha-if-item-deleted for each item, in
both raw and standardized forms. Keep in mind, we would expect a good
item to display a high correlation with the total score and a low
alpha-if-item-deleted (i.e. if alpha drops when an item is deleted,
then clearly that item was important).
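As with the factor tables, the alpha results can be captured as data
sets via ODS OUTPUT, which is handy when computing alpha for several
groups of items. A sketch; the output data set names are arbitrary:
ODS OUTPUT CronbachAlpha=alpha_f1 CronbachAlphaDel=alphadel_f1;
PROC CORR DATA=example4 ALPHA NOMISS;
VAR y1 y2 y3 y4;
RUN;
PROC PRINT DATA=alpha_f1;
RUN;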