DSA SAS Short Course: Module 8.1

Data Science and Analytics

MODULE 8
XII. Path Analysis with Manifest Variables

First, let's take a moment to describe our fictional model. Our model consists of seven directly measured variables, or manifest variables: Education, Responsibility, Ambition, Wealth, Suggestibility, (Ethical) Flexibility, and Political Success. The model reflects hypothesized causal relationships among characteristics of American politicians. It hypothesizes three key causal variables (Wealth, Suggestibility, and [Ethical] Flexibility) for political success. We further expect politicians who exhibit high levels of education, responsibility, and ambition to also exhibit greater wealth. Again, this is a fictional example and is not meant to be taken seriously as a research finding supported by empirical evidence; it is used here purely for instructional purposes.

If you are unfamiliar with standard path and structural equation models, there are a few conventions in our path diagram that you will also see in published materials displaying such models. First, squares or rectangles denote observed or measured variables (often referred to as manifest variables). Second, straight, single-headed arrows denote hypothesized causal relationships (often referred to as paths). Third, curved, double-headed arrows denote bi-directional relationships (often referred to as correlations or covariances). Specific hypotheses should be used to clarify what the researcher expects to find (e.g. a very strong positive relationship between Wealth and Education).

One of the key issues with path analysis and SEM is overidentification. A model is said to be overidentified if it contains more unique inputs (sometimes called informations) than parameters being estimated. In our example, we have seven measured variables.
We can apply the following formula to calculate the number of unique inputs:

(1) number of unique inputs = p(p + 1) / 2

where p = the number of manifest or measured variables. With our 7 manifest variables, this gives 7(8) / 2 = 28 unique inputs, or informations. Looking at the diagram, we see 10 covariances (C?), 6 paths (P?), 5 variable variances, and 2 error variances (VAR?), for a total of 23 parameters to be estimated. Because 28 is greater than 23, the model is overidentified, with 28 - 23 = 5 degrees of freedom.

Remember too that path analysis and SEM require large sample sizes. Several general rules have been put forth as lowest reasonable sample sizes: at least 200 cases, at least 5 cases per manifest or measured variable, at least 400 cases, at least 25 cases per measured variable, etc. The bottom line is this: path analysis and SEM are powerful when done with adequately large samples, and the larger the better.

The procedure for conducting path analysis and/or SEM in SAS is PROC CALIS; however, PROC CALIS needs to have the data fed to it. There are three ways to 'feed' PROC CALIS the data: (1) a correlation matrix with the number of observations and standard deviations for each variable, (2) a covariance matrix, and (3) the raw data. Here we will use the correlation matrix with number of observations and standard deviations. You can import the raw data into SAS using the Import Wizard: import the Example Data 5c file using the SPSS File (.sav) source option and the member name ex5c. Once imported, you can get the descriptive statistics and correlations which you will need to run the path analysis.

    PROC CORR DATA=ex5c;
    RUN;

Using the number of observations (n = 750), the standard deviations, and the correlation matrix, you can proceed to the path analysis. The syntax for fitting our path model is displayed below. Note that the top half of the syntax simply enters the data for the path analysis.
The bottom half (PROC CALIS) is used to fit the path model.

    /* Top half: enter N, standard deviations, and correlations as a TYPE=CORR data set */
    DATA path1(TYPE=CORR);
       INPUT _TYPE_ $ _NAME_ $ V1-V7;
       LABEL V1 = 'education'
             V2 = 'responsibility'
             V3 = 'ambition'
             V4 = 'wealth'
             V5 = 'suggestibility'
             V6 = 'moral flexibility'
             V7 = 'political success';
       CARDS;
    N    .   750     750     750     750     750     750     750
    STD  .   0.9709  1.0218  0.9873  0.9999  0.9666  1.0072  1.0001
    CORR V1  1.0000  .       .       .       .       .       .
    CORR V2  .3546   1.0000  .       .       .       .       .
    CORR V3  .3377   .3198   1.0000  .       .       .       .
    CORR V4  .5912   .6581   .5319   1.0000  .       .       .
    CORR V5  .0203   .0131   .0422   .0138   1.0000  .       .
    CORR V6  .0225   -.0034  .0591   .0349   .5249   1.0000  .
    CORR V7  -.0047  .0016   .0046   -.0236  .7047   .7185   1.0000
    ;

    /* Bottom half: fit the path model */
    PROC CALIS DATA=path1 COVARIANCE CORR RESIDUAL MODIFICATION;
       /* One equation per endogenous variable: paths plus an error term */
       LINEQS
          V7 = PV7V4 V4 + PV7V5 V5 + PV7V6 V6 + E1,
          V4 = PV4V1 V1 + PV4V2 V2 + PV4V3 V3 + E2;
       /* Variances to estimate: 2 error terms and 5 exogenous variables */
       STD
          E1 = VARE1,
          E2 = VARE2,
          V1 = VARV1,
          V2 = VARV2,
          V3 = VARV3,
          V5 = VARV5,
          V6 = VARV6;
       /* Covariances among the 5 exogenous variables */
       COV
          V1 V2 = CV1V2,
          V1 V3 = CV1V3,
          V1 V5 = CV1V5,
          V1 V6 = CV1V6,
          V2 V3 = CV2V3,
          V2 V5 = CV2V5,
          V2 V6 = CV2V6,
          V3 V5 = CV3V5,
          V3 V6 = CV3V6,
          V5 V6 = CV5V6;
       VAR V1 V2 V3 V4 V5 V6 V7;
    RUN;

The PROC CALIS statement is followed by options. DATA=path1 names the input data set created in the top half; if it is omitted, SAS uses the most recently created data set, which here is the same one. COVARIANCE tells SAS to perform the analysis on the covariance matrix. Even though we supply a correlation matrix as our data input, SAS uses the standard deviations to calculate the covariance matrix for PROC CALIS. The CORR option requests that the output include the correlation or covariance matrix on which the analysis is run. The RESIDUAL option prints the absolute and standardized residuals in the output. The MODIFICATION option tells SAS to print the modification indices (e.g. Lagrange Multiplier Test).

The next part of the syntax, LINEQS, provides SAS with the linear equations which specify the paths we want estimated.
The first equation can be read as: variable 7 is causally affected by variable 4 (path PV7V4), variable 5 (path PV7V5), and variable 6 (path PV7V6), plus the error term associated with variable 7 (E1). Next, the STD statement specifies which variances we want estimated (their parameter names begin with VAR here and in the diagram above). The COV statement then specifies all the covariances which need to be estimated. Finally, the VAR statement simply lists the variables to be used in the analysis.

Please note that the first page of output was produced by the PROC CORR run directly after importing the data (above). The PROC CALIS output therefore begins on the second page (p. 2) of the total output file (e.g. page 1 of the PROC CALIS output actually has the number 2 in the top right corner). The page number discrepancy is noted here because PROC CALIS tends to produce several pages of output, and the page references below count from the first PROC CALIS page.

The first page of the PROC CALIS output consists of general information, including the number of endogenous variables (any variable with a straight single-headed arrow pointing at it) and the number of exogenous variables (any variable without any straight single-headed arrows pointing to it).

The second page consists of a listing of the parameters to be estimated; it is essentially a review of the model specified in the CALIS syntax.

The third page shows the general components of the model (e.g. number of variables, number of informations, number of parameters, etc.), as well as the descriptive statistics and covariance matrix for the variables entered in the model.

The fourth page provides the initial parameter estimates.

The fifth page includes the iteration history. It is important to check the last line of the Optimization results (left side of the middle of the page), which states whether or not the convergence criterion was satisfied.
Also of importance on this page is the beginning of the predicted covariance matrix, which is compared to the matrix of association (the original covariance matrix) to produce residual values. The sixth page continues the predicted covariance matrix.

The seventh page displays fit indices. As you can see, a fairly comprehensive list is provided. Please note that although chi-square is displayed, it should not be relied on as a measure of goodness-of-fit: the large sample sizes necessary for path analysis and SEM inflate the chi-square statistic to the point of meaninglessness. Some of the more commonly reported fit indices are the RMSEA (root mean square error of approximation), which indicates good fit when below .05; Schwarz's Bayesian Criterion (also called BIC, the Bayesian Information Criterion), where the smaller the value (i.e. below zero) the better the fit; and Bentler & Bonett's non-normed fit index (NNFI) and normed fit index (NFI), both of which should be greater than .90 to indicate good fit.

Page 8 provides the raw residual matrix and the ranking of the 9 largest raw residuals. The 9th page shows the standardized residual matrix and the 9 largest standardized residuals; values close to zero indicate good fit. Any value with an absolute value greater than 2.00 indicates lack of fit and should be investigated. The 10th page displays a sideways histogram of the distribution of the standardized residuals. Generally we expect to see a normal distribution of residuals with no absolute values greater than 2.00.

The 11th page displays our path coefficients in raw form, together with the standard error and t-value associated with each. Further down on the 11th page, we see the estimated variance parameters and estimated covariances, each with its standard error and t-value.
Remember that t-values for coefficients are statistically significant (p < .05, two-tailed) if their absolute value is greater than 1.96, meaning the coefficients are significantly different from zero. It is also recommended that you review the standard errors, as extremely small standard errors (those very close to zero) may indicate a problem with fit associated with one variable being linearly dependent upon one or more other variables.

The 12th page provides standardized path coefficients and squared multiple correlations for the endogenous variables (often considered the dependent variables in such a model). The R-square column under 'Squared Multiple Correlations' gives us an idea of how well our model fits, because these values are interpreted as the percentage of variance in each endogenous variable accounted for by its predictors. As an example, we could interpret V7 (Political Success) as having 66.66% of its variance accounted for by the combination of V4 (Wealth), V5 (Suggestibility), and V6 (Ethical Flexibility).

The 13th page begins the listing of the modification indices, which continues through the 14th, 15th, and 16th pages to the end of the output. One should be careful when interpreting modification indices and should do so only after carefully interpreting all the previous output. Modification indices generally take two forms: those which recommend the exclusion of a parameter from the specified model and those which recommend the inclusion of a parameter in the model. Both types attempt to estimate the change in chi-square that would result from implementing the recommendation. However, as mentioned above, chi-square is generally not an acceptable measure of goodness-of-fit, and therefore modification indices should be treated with caution.

Below you will find our completed path diagram with standardized path coefficients.
Generally speaking, the output for any PROC CALIS run will follow the same format seen here for path analysis; for example, the output is presented in the same order for the SEM example in the next tutorial.

Please realize this tutorial is not meant to be an exhaustive review; it is merely an introduction, and it is not meant to replace one or several good textbooks.

That concludes the tutorial on path analysis with manifest variables. The basics of Structural Equation Modeling (SEM) are covered in the next tutorial.
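
As a supplement to the data-entry step earlier in this tutorial: the correlation matrix there was typed in by hand from the PROC CORR output. A possible shortcut is to have PROC CORR write a TYPE=CORR data set directly with its OUTP= option. The sketch below is only an illustration; it assumes the imported ex5c data set stores the seven measures under the names V1-V7, and the output data set name path1_auto is made up for this example.

```sas
/* Sketch only: assumes ex5c holds the seven measures as V1-V7
   (hypothetical names; adjust to the actual variable names). */
PROC CORR DATA=ex5c OUTP=path1_auto NOPRINT;
   VAR V1-V7;
RUN;

/* Inspect the result: rows with _TYPE_ = MEAN, STD, N, and CORR */
PROC PRINT DATA=path1_auto;
RUN;
```

If the variable names match those used in the LINEQS, STD, and COV statements, path1_auto could then be supplied to PROC CALIS through its DATA= option in place of the hand-entered path1 data set, which avoids transcription errors in the correlations.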
