VI. SAS Procedures
The following covers some of the most commonly used SAS procedures for
running basic statistical analyses. Go to File, Import Data... to
import the Example Data 1 file using the Import Wizard, with SPSS File
(*.sav) as the source and example1 as the member name, as was done
previously.
Before we begin, consider using the OPTIONS statement when submitting
any program (i.e., syntax). The OPTIONS statement can be added to just
about any program or procedure. It lets you control the number of
characters per line and lines per page of the output generated by the
program or procedure in which it is included. Because OPTIONS is a
global statement, its settings stay in effect until you change them.
The generic form of the OPTIONS statement follows:
OPTIONS LINESIZE=x PAGESIZE=y;
The x refers to the number of characters per line and the y refers to
the number of lines per page. The OPTIONS statement is mentioned here
because SAS can generate a large amount of output, which becomes costly
when printing it or copying and pasting it into a word processing
program. For instance, the sixth edition of the Publication Manual of
the American Psychological Association (APA) generally recommends using
Times New Roman 12 point font on a page with 1 inch margins at top,
bottom, left, and right. This configuration in Microsoft Word results
in a page that contains approximately 78 characters per line and 46
lines per page. Therefore, if you are accustomed to using the APA
Publication Manual guidelines for formatting documents, you may want to
use an OPTIONS statement to configure each SAS output so that it fits
neatly on a pre-formatted document page. An example of the use of the
OPTIONS statement is provided in the syntax for the PROC PRINT example
below. Like all usable syntax on these web pages, it is shown in bold
Courier New 10 point font.
1. PROC PRINT
PROC PRINT is frequently used to check the data
being read by SAS. It prints out the observations in a SAS
data set, using any or some of the variables. The complete syntax for
PROC PRINT is as follows:
PROC PRINT DATA= SAS-data-set
DOUBLE
NOOBS
UNIFORM
LABEL
SPLIT= 'split-character'
N
ROUND
HEADING= direction
ROWS= page-format
WIDTH= column-width;
VAR variable-list;
ID variable-list;
BY variable-list;
PAGEBY BY-variable;
SUMBY BY-variable;
SUM variable-list;
The most common use is to place PROC PRINT right after the DATA step to
verify the data. For the current example with ExampleData1.sav (using
member name example1 in SAS), use the following syntax (with the
optional OPTIONS statement included):
PROC PRINT DATA=example1;
OPTIONS LINESIZE=78 PAGESIZE=46;
RUN;
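When a data set is large, printing every observation is rarely
necessary. The following is a minimal sketch, assuming the variables
id, Sex, recall1, and recall2 used elsewhere on this page are present
in example1. The OBS= data set option limits the listing to the first
ten observations, the VAR statement selects the columns, and NOOBS
suppresses the observation-number column.

```sas
/* Set the page dimensions first; OPTIONS is a global statement. */
OPTIONS LINESIZE=78 PAGESIZE=46;

/* Print only the first 10 observations of a few selected variables. */
PROC PRINT DATA=example1 (OBS=10) NOOBS;
VAR id Sex recall1 recall2;
RUN;
```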
2. PROC CONTENTS
PROC CONTENTS prints descriptions of the contents of one or more files
from a SAS data library. It is another common way to verify a data set
that has been read into a SAS library, especially a sizeable one. It is
crucial, for example, to check that all observations and variables were
read in correctly. PROC CONTENTS is also useful for documenting
permanent SAS data sets (library members of DATA type).
Specific information pertaining to the physical characteristics of a
member depends on whether the file is a SAS data set or another type of
SAS file.
Syntax:
PROC CONTENTS <DATA= <libref.>member>
<DIRECTORY>
<FMTLEN>
<MEMTYPE= (mtype-list)>
<NODS>
<NOPRINT>
<OUT= SAS-data-set>
<POSITION>
<SHORT>
<DETAILS|NODETAILS>;
For the current example:
PROC CONTENTS DATA=example1;
RUN;
A DATA step combined with a LABEL statement is often used when first
looking at data to assign labels to variables. For the current example,
we use a DATA step to create a new data set (example1a) containing our
data, but with labels assigned to some of the variables.
DATA example1a;
SET example1;
LABEL Sex ="Gender"
recall1 ="Recall at time 1"
recall2 ="Recall at time 2";
RUN;
PROC CONTENTS DATA=example1a;
RUN;
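Two of the options from the syntax listing above are often handy at
this stage. The sketch below assumes example1a was created as shown.
POSITION adds a listing of the variables in their stored order, and
NOPRINT with OUT= writes the variable descriptions to a data set
instead of the output window. The data set name example1a_vars is
chosen here only for illustration.

```sas
/* List variables in stored order as well as alphabetically. */
PROC CONTENTS DATA=example1a POSITION;
RUN;

/* Save the variable descriptions to a data set (hypothetical name). */
PROC CONTENTS DATA=example1a NOPRINT OUT=example1a_vars;
RUN;

/* Confirm that the labels were attached. */
PROC PRINT DATA=example1a_vars;
VAR name type length label;
RUN;
```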
3. PROC MEANS
PROC MEANS computes statistics for an entire SAS
data set or for groups of observations in the data set. If you use a BY
statement, PROC MEANS calculates descriptive statistics separately for
groups of observations. Each group is composed of observations having
the same values of the variables used in the BY statement. The groups
can be further subdivided by the use of the CLASS statement. PROC MEANS
can optionally create one or more SAS data sets containing the
statistics calculated.
The full syntax for PROC MEANS is as follows:
PROC MEANS <option-list> <statistic-keyword-list>;
VAR variable-list;
BY variable-list;
CLASS variable-list;
FREQ variable;
WEIGHT variable;
ID variable-list;
OUTPUT <OUT= SAS-data-set> <output-statistic-list>
<MINID|MAXID <(var-1<(id-list-1)>
<...var-n<(id-list-n)>>)>=name-list>;
We can get descriptive statistics for all of the
variables using proc means as shown
below.
PROC MEANS DATA=example1;
RUN;
We can get descriptive statistics separately by
gender (i.e., broken down by SEX) as shown below.
PROC MEANS DATA=example1;
CLASS Sex;
RUN;
We can get descriptive statistics on the outcome
or dependent variable recall at time 1 (recall1) separately by gender
(i.e., broken down by SEX) as shown below.
PROC MEANS DATA=example1;
CLASS Sex;
VAR recall1;
RUN;
We can get descriptive statistics on recall1
separated by gender (i.e., broken down by SEX) and class standing
(cl_st) as shown below.
PROC MEANS DATA=example1;
CLASS Sex cl_st;
VAR recall1;
RUN;
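The statistic keywords and the OUTPUT statement from the full syntax
above can be combined with CLASS. The following is a sketch only: the
output data set name (recall_stats) and the names given to the saved
statistics are made up for illustration.

```sas
/* Request specific statistics, rounded to two decimals, by gender. */
PROC MEANS DATA=example1 N MEAN STD MIN MAX MAXDEC=2;
CLASS Sex;
VAR recall1 recall2;
/* Save the means and standard deviations to a new data set. */
OUTPUT OUT=recall_stats MEAN=mean_r1 mean_r2 STD=sd_r1 sd_r2;
RUN;

/* The saved data set holds one overall row plus one row per gender. */
PROC PRINT DATA=recall_stats;
RUN;
```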
We can also subset the data to get very specific descriptive
statistics. For instance, if we review the output or know the numeric
codes for each value of our variables, we can request that a subset of
the data (example1fj) be generated from the original data (example1).
The subset contains only persons with sex = 1 and cl_st = 3, which
corresponds to females whose class standing is Junior.
DATA example1fj;
SET example1;
IF Sex = 1 AND cl_st = 3;
RUN;
PROC MEANS DATA=example1fj;
VAR recall1;
RUN;
We can verify we have gotten what we wanted by referring to the
previous output showing descriptive statistics for males and females
across all four levels of class standing. In both the current output
and the previous output, we notice there were 27 females who were
Juniors.
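If the subset is needed only for one analysis, a WHERE statement inside
the procedure avoids creating a separate data set. This is a sketch
that assumes Sex and cl_st are stored as numeric codes, as the value
formats later on this page suggest.

```sas
/* Restrict the analysis to female Juniors without a new data set. */
PROC MEANS DATA=example1;
WHERE Sex = 1 AND cl_st = 3;
VAR recall1;
RUN;
```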
4. PROC UNIVARIATE
This procedure is useful for basic descriptive
statistics of the variables. It provides detail on the
distribution of a variable. Features include:
- detail on the extreme values of a variable
- quartiles, such as the median
- several plots to picture the distribution
- frequency tables
- a test to determine whether the data are normally
distributed.
If a BY statement is used, descriptive statistics
are calculated separately for groups of observations.
Syntax:
PROC UNIVARIATE DATA= SASdataset
NOPRINT
PLOT
FREQ
NORMAL
PCTLDEF= value
VARDEF= DF|WEIGHT|WGT|N|WDF
ROUND= roundoff unit...;
VAR variables;
BY variables;
FREQ variable;
WEIGHT variable;
ID variables;
OUTPUT OUT= SASdataset keyword= names...;
We can get detailed descriptive statistics for family
income using proc univariate
as shown below.
PROC UNIVARIATE DATA=example1;
VAR fam_income;
RUN;
We can also use PROC UNIVARIATE to get conditional univariate summaries
using the BY statement; but first, we need to sort the data set by the
BY variable.
PROC SORT DATA=example1;
BY Sex;
RUN;
PROC UNIVARIATE DATA=example1;
BY Sex;
VAR recall1;
RUN;
Another very handy task which can be performed with PROC UNIVARIATE is
identification of outliers. To accomplish this, we add two options,
NORMAL and PLOT, to the PROC UNIVARIATE statement and include an ID
statement.
PROC UNIVARIATE DATA=example1 NORMAL PLOT;
VAR recall1;
ID id;
RUN;
In the preceding syntax, we ran PROC UNIVARIATE on recall at time 1
(recall1) and used the values of the participant identification
variable (id) to identify (ID) outlying values of recall1. In the next
syntax we perform the same basic procedure, but separately for each
gender (this produces 7 pages of output). As before, the data must
already be sorted by Sex.
PROC UNIVARIATE DATA=example1 NORMAL PLOT;
BY Sex;
VAR recall1;
ID id;
RUN;
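The OUTPUT statement listed in the syntax above can save selected
statistics rather than printing everything. A minimal sketch follows;
the data set name recall1_summary and the names given to the saved
statistics are illustrative only.

```sas
/* Suppress the printed output and save selected statistics instead. */
PROC UNIVARIATE DATA=example1 NOPRINT;
VAR recall1;
OUTPUT OUT=recall1_summary N=n_r1 MEAN=mean_r1 MEDIAN=med_r1
       Q1=q1_r1 Q3=q3_r1;
RUN;

/* View the single row of saved statistics. */
PROC PRINT DATA=recall1_summary;
RUN;
```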
5. PROC FREQ
This procedure produces one-way to n-way frequency and crosstabulation
tables. It shows the distribution of variable values and
crosstabulation tables with combined frequency distributions for two or
more variables. For one-way tables, PROC FREQ can compute chi-square
tests for equal or specified proportions. For two-way tables, PROC FREQ
computes tests and measures of association. For n-way tables, PROC FREQ
does stratified analysis, computing statistics within as well as across
strata.
Syntax:
PROC FREQ options;
OUTPUT <OUT= SAS-data-set> <output-statistic-list>;
TABLES requests / options;
WEIGHT variable;
EXACT statistic-keywords;
BY variable-list;
We can get a frequency distribution of age using PROC FREQ as shown
below.
PROC FREQ DATA=example1;
TABLES age;
RUN;
We can make a two-way table showing the frequencies for class standing
by sex as shown below.
PROC FREQ DATA=example1;
TABLES cl_st * Sex;
RUN;
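As noted above, PROC FREQ can compute tests of association for two-way
tables. A short sketch: adding the CHISQ option after the slash on the
TABLES statement requests chi-square tests, and EXPECTED adds the
expected cell frequencies to the table.

```sas
/* Two-way table with chi-square tests and expected frequencies. */
PROC FREQ DATA=example1;
TABLES cl_st * Sex / CHISQ EXPECTED;
RUN;
```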
Labeling values is a two-step process. First, we must create the label
formats with PROC FORMAT using a VALUE statement. Next, we attach the
label format to the variable with a FORMAT statement. This FORMAT
statement can be used in either PROC or DATA steps. An example of the
PROC FORMAT step for creating the value formats on class standing
(cl_st) follows.
PROC FORMAT;
VALUE cl_stf 1="Fre"
2="Sop"
3="Jun"
4="Sen";
RUN;
Now that the format for class standing (cl_st) has been created, it
must be linked to the variable class standing. This is accomplished by
including a FORMAT statement in either a PROC or a DATA step. In the
program below the FORMAT statement is used in a PROC FREQ step so that
the values of cl_st are displayed with their labels.
PROC FREQ DATA=example1;
FORMAT cl_st cl_stf.;
TABLES cl_st;
RUN;
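A FORMAT statement in a procedure applies only to that step. To keep
the labels attached for later procedures, the same statement can go in
a DATA step. The sketch below assumes the cl_stf format has already
been created with the PROC FORMAT step above; the data set name
example1b is made up for illustration.

```sas
/* Attach the cl_stf format to cl_st in a new data set. */
DATA example1b;
SET example1;
FORMAT cl_st cl_stf.;
RUN;

/* Later procedures now show the labels without a FORMAT statement. */
PROC FREQ DATA=example1b;
TABLES cl_st * Sex;
RUN;
```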
6. PROC TABULATE
PROC TABULATE constructs tables of descriptive
statistics using class variables, analysis variables, and keywords for
statistics. Tables can have one to three dimensions: column; row and
column; or page, row, and column.
The statistics that PROC TABULATE computes are many of the same
statistics computed by other descriptive procedures such as MEANS,
FREQ, and SUMMARY. In order for PROC TABULATE to execute, you need
either a CLASS or VAR statement, and a TABLE statement. There are no
default variables chosen for the procedure.
Syntax:
PROC TABULATE <option-list>;
CLASS class-variable-list;
VAR analysis-variable-list;
FREQ variable;
WEIGHT variable;
FORMAT variable-list-1 format-1 <...variable-list-n
format-n>;
LABEL variable-1='label-1' <...variable-n='label-n'>;
BY <NOTSORTED> <DESCENDING> variable-1
<...<DESCENDING> VARIABLE-N>;
TABLE <<page_expression,> row_expression,>
column_expression
</ table-option-list>;
KEYLABEL keyword-1 ='description-1'
<...keyword-n='description-n'>;
We can create a basic table of individuals' recall
at time 2 (recall2) by gender (sex).
PROC TABULATE DATA=example1;
CLASS sex;
VAR recall2;
TABLE (recall2)*mean, sex;
RUN;
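The same approach extends to more than one class variable and more than
one statistic. In this sketch, class standing defines the rows, and
within each gender column the table reports N, mean, and standard
deviation for recall2.

```sas
/* Rows: class standing. Columns: gender by statistic for recall2. */
PROC TABULATE DATA=example1;
CLASS Sex cl_st;
VAR recall2;
TABLE cl_st, Sex*recall2*(N MEAN STD);
RUN;
```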
7. PROC GCHART & PROC GPLOT
Making a simple graph in SAS.
We can make a simple vertical bar chart of recall at time 1. Because
recall1 is a continuous variable, SAS automatically assigns five bins.
TITLE 'Simple Vertical Bar Chart ';
PROC GCHART DATA=example1;
VBAR recall1;
RUN;
You can control the number of bins for a continuous variable with the
LEVELS= option on the VBAR statement. The program below creates a
vertical bar chart with nine bins for recall1.
TITLE 'Bar Chart - Control Number of Bins';
PROC GCHART DATA=example1;
VBAR recall1/LEVELS=9;
RUN;
On the other hand, cl_st has only four categories, and SAS's tendency
to bin into five categories and use midpoints would not do justice to
the data. So when you want to use the actual values of the variable to
label each bar, use the DISCRETE option on the VBAR statement.
We can make a bar chart showing the frequencies of class standing
(cl_st) as shown below.
TITLE 'Bar Chart with Discrete Option';
PROC GCHART DATA=example1;
VBAR cl_st/DISCRETE;
RUN;
Simply changing 'VBAR' to 'HBAR' will produce the same graph
horizontally rather than vertically.
TITLE 'Bar Chart with Discrete Option';
PROC GCHART DATA=example1;
HBAR cl_st/DISCRETE;
RUN;
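Bars do not have to show frequencies. As a sketch, the SUMVAR= and
TYPE= options on the VBAR statement make each bar show a summary of
another variable, here the mean of recall1 within each class standing.
The title text is illustrative only.

```sas
/* Bar height is the mean of recall1 within each class standing. */
TITLE 'Mean Recall at Time 1 by Class Standing';
PROC GCHART DATA=example1;
VBAR cl_st / DISCRETE SUMVAR=recall1 TYPE=MEAN;
RUN;
QUIT;
```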
We can create a variety of scatter plots using PROC GPLOT. It allows us
to see the relationship between two continuous variables. The program
below creates a scatter plot for recall2*recall1. This means that
recall2 will be plotted on the vertical axis, and recall1 will be
plotted on the horizontal axis.
TITLE 'Scatterplot - Two Variables';
PROC GPLOT DATA=example1;
PLOT recall2*recall1;
RUN;
You may want to examine the relationship between two continuous
variables and see which points fall into one or another category of a
third variable. The program below creates a scatter plot for
recall2*recall1 with each gender (Sex) marked. You specify
recall2*recall1=Sex on the PLOT statement to have each level of Sex
identified on the plot.
TITLE 'Scatterplot - Male/Female Marked';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
The program below creates a scatter plot for recall2*recall1 with each
level of Sex marked. The PROC GPLOT step is specified exactly as in the
previous example. The only difference is the inclusion of SYMBOL
statements to control the look of the graph through the use of the
operands V=, I=, and C=.
SYMBOL1 V=circle C=black I=none;
SYMBOL2 V=star C=red I=none;
TITLE 'Scatterplot - Different Symbols';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
QUIT;
SYMBOL1 is used for the lowest value of Sex and SYMBOL2 is used for the
next lowest value.
V= controls the type of point to be plotted. We requested a circle to
be plotted for females, and a star (asterisk) for males.
I=none causes SAS not to plot a line joining the points.
C= controls the color of the plotted symbols. We requested black for
females, and red for males. (Sometimes the C= option is needed for any
options to take effect.)
To plot a regression line along with the points we use the I= operand
of the SYMBOL statement. The program below creates a scatter plot for
recall2*recall1 with an OLS regression line. The regression line is
produced with the I=R operand on the SYMBOL statement.
SYMBOL1 V=circle C=blue I=r;
TITLE 'Scatterplot - With Regression Line ';
PROC GPLOT DATA=example1;
PLOT recall2*recall1;
RUN;
QUIT;
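The two ideas can be combined. In this sketch, both SYMBOL statements
use I=r, so plotting recall2*recall1=Sex draws a separate regression
line for each gender. The title text is illustrative only.

```sas
/* One symbol definition per level of Sex, each with its own line. */
SYMBOL1 V=circle C=black I=r;
SYMBOL2 V=star C=red I=r;
TITLE 'Scatterplot - Regression Line per Gender';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
QUIT;
```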