DSA SAS Short Course: Module 4

Data Science and Analytics

Please participate in the DSA Client Feedback Survey.

MODULE 4

VI. SAS Procedures

The following covers some of the most commonly used SAS procedures with which you can run some basic statistical analyses. Go to File, Import Data... to import the Example Data 1 file using the Import Wizard with SPSS File (*.sav) source and member name example1 as was done previously.

Before we begin, consider using the OPTIONS statement when submitting a program (i.e., syntax). OPTIONS is a global statement, so it can be added to just about any program or placed before any procedure. Among other things, it lets you control the number of characters per line and the number of lines per page of the output generated by the program or procedure it accompanies. The generic form of the OPTIONS statement follows:

OPTIONS LINESIZE=x PAGESIZE=y;

The x refers to the number of characters per line and the y refers to the number of lines per page. The OPTIONS statement is mentioned here because SAS can generate a large amount of output, which becomes costly when you print it or copy and paste it into a word processing program. For instance, the sixth edition of the Publication Manual of the American Psychological Association (APA) generally recommends using Times New Roman 12 point font on a page with 1 inch margins at top, bottom, left, and right. This configuration in Microsoft Word results in a page that contains approximately 78 characters per line and 46 lines per page. Therefore, if you are accustomed to using the APA Publication Manual guidelines for formatting documents, you may want to use an OPTIONS statement to configure each SAS output so that it fits neatly on a pre-formatted document page. An example of the OPTIONS statement is provided in the syntax for the PROC PRINT example below; like all usable syntax on these web pages, it is shown in bold Courier New 10 point font on the web page.

1. PROC PRINT

PROC PRINT is frequently used to check the data being read by SAS. It prints the observations in a SAS data set, using all or some of the variables. The complete syntax for PROC PRINT is as follows:

PROC PRINT DATA= SAS-data-set DOUBLE NOOBS UNIFORM LABEL SPLIT= 'split-character' N ROUND HEADING= direction ROWS= page-format WIDTH= column-width;
VAR variable-list;
ID variable-list;
BY variable-list;
PAGEBY BY-variable;
SUMBY BY-variable;
SUM variable-list;

The most common use is to place a PROC PRINT step right after the DATA step (or data import) to verify the data. For the current example with ExampleData1.sav (member name example1 in SAS), use the following syntax, which includes the optional OPTIONS statement:

OPTIONS LINESIZE=78 PAGESIZE=46;

PROC PRINT DATA=example1;

RUN;
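
When a data set is large, printing every observation is rarely useful. The following is a minimal sketch, not part of the original course syntax, that prints only the first 10 observations of a few variables. It assumes the variables id, Sex, recall1, and recall2 (all used elsewhere on this page) are present in example1. The OBS= data set option limits the number of observations read, and the NOOBS option suppresses the observation-number column.

```sas
/* Print only the first 10 observations of selected variables */
PROC PRINT DATA=example1 (OBS=10) NOOBS;
 VAR id Sex recall1 recall2;   /* assumed to exist in example1 */
RUN;
```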

2. PROC CONTENTS

PROC CONTENTS prints descriptions of the contents of one or more files from a SAS data library. It is another common way to verify a data set after it has been read into a SAS library, especially a sizeable one; it is crucial, for example, to check that all observations and variables were read in correctly. It is also useful for documenting permanent SAS data sets (library members of DATA type). The specific information reported about the physical characteristics of a member depends on whether the file is a SAS data set or another type of SAS file.

Syntax:

PROC CONTENTS <DATA= <libref.>member> <DIRECTORY> <FMTLEN> <MEMTYPE= (mtype-list)> <NODS> <NOPRINT> <OUT= SAS-data-set> <POSITION> <SHORT> <DETAILS|NODETAILS>;

For the current example:

PROC CONTENTS DATA=example1;
RUN;

When first looking at data, it is common to use a DATA step with a LABEL statement to assign descriptive labels to variables. For the current example, we create a new data set (example1a) that contains our data with labels assigned to some of the variables, and then run PROC CONTENTS on it to confirm the labels.

DATA example1a;

 SET example1;

 LABEL Sex ="Gender"

 recall1 ="Recall at time 1"

 recall2 ="Recall at time 2";

RUN;

PROC CONTENTS DATA=example1a;

RUN;
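
The OUT= and NOPRINT options listed in the syntax above can be combined to save the variable descriptions to a data set instead of printing them. The following is a sketch under stated assumptions: the output data set name example1a_vars is hypothetical, and only four of the columns that PROC CONTENTS writes (NAME, TYPE, LENGTH, LABEL) are kept.

```sas
/* Save the variable descriptions to a data set instead of printing them */
PROC CONTENTS DATA=example1a
              OUT=example1a_vars (KEEP=name type length label)  /* hypothetical name */
              NOPRINT;
RUN;

/* In the OUT= data set, TYPE is 1 for numeric and 2 for character variables */
PROC PRINT DATA=example1a_vars;
RUN;
```

Having the descriptions in a data set makes it easy to check, for example, which variables still lack labels.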

3. PROC MEANS

PROC MEANS computes statistics for an entire SAS data set or for groups of observations in the data set. If you use a BY statement, PROC MEANS calculates descriptive statistics separately for groups of observations. Each group is composed of observations having the same values of the variables used in the BY statement. The groups can be further subdivided by the use of the CLASS statement. PROC MEANS can optionally create one or more SAS data sets containing the statistics calculated.

The full syntax for PROC MEANS is as follows:

PROC MEANS <option-list> <statistic-keyword-list>;

VAR variable-list;

BY variable-list;

CLASS variable-list;

FREQ variable;

WEIGHT variable;

ID variable-list;

OUTPUT <OUT= SAS-data-set> <output-statistic-list>

<MINID|MAXID <(var-1<(id-list-1)>

<...var-n<(id-list-n)>>)>=name-list>;

We can get descriptive statistics for all of the numeric variables using PROC MEANS as shown below.

PROC MEANS DATA=example1;
RUN;

We can get descriptive statistics separately by gender (i.e., broken down by SEX) as shown below.

PROC MEANS DATA=example1;

 CLASS Sex;

RUN;

We can get descriptive statistics on the outcome or dependent variable recall at time 1 (recall1) separately by gender (i.e., broken down by SEX) as shown below.

PROC MEANS DATA=example1;

 CLASS Sex;

 VAR recall1;

RUN;

We can get descriptive statistics on recall1 separated by gender (i.e., broken down by SEX) and class standing (cl_st) as shown below.

PROC MEANS DATA=example1;
CLASS Sex cl_st;
VAR recall1;
RUN;
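
As noted at the start of this section, PROC MEANS can also request specific statistics and save them to a data set. The following is a minimal sketch of both features; the output data set name (recall_stats) and the names given to the saved statistics are hypothetical choices for illustration.

```sas
/* Request specific statistics, rounded to 2 decimals in the printed output */
PROC MEANS DATA=example1 N MEAN STD MIN MAX MAXDEC=2;
 CLASS Sex;
 VAR recall1 recall2;
 /* Save means and standard deviations; names are matched to VAR in order */
 OUTPUT OUT=recall_stats
        MEAN=mean_recall1 mean_recall2
        STD=sd_recall1 sd_recall2;
RUN;

/* _TYPE_=0 is the overall row; _TYPE_=1 rows are the levels of Sex */
PROC PRINT DATA=recall_stats;
RUN;
```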

We can also subset the data to get very specific descriptive statistics. For instance, if we review the output or know the numeric codes for each value of our variables, we can request that a subset of the data (example1fj) be generated from the original data (example1) containing only persons with sex = 1 and cl_st = 3, which corresponds to females whose class standing is Junior.

DATA example1fj;
SET example1;
IF Sex = 1 AND cl_st = 3; /* subsetting IF: keep only female Juniors (numeric codes) */
RUN;
PROC MEANS DATA=example1fj;
VAR recall1;
RUN;

We can verify that we have gotten what we wanted by referring to the previous output showing descriptive statistics for males and females across all four levels of class standing. In both the current output and the previous output, we notice there were 27 females who were Juniors.
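
If the subset is needed only for a single analysis, a WHERE statement inside the procedure avoids creating a new data set. This is an alternative sketch, not the approach used above. It assumes Sex and cl_st are stored as numeric variables; a WHERE statement does not convert between numeric and character values, so if either variable is character its value would need to be quoted.

```sas
/* Same subset as example1fj, without creating a new data set */
PROC MEANS DATA=example1;
 WHERE Sex = 1 AND cl_st = 3;   /* assumes numeric codes: female Juniors */
 VAR recall1;
RUN;
```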

4. PROC UNIVARIATE

This procedure is useful for basic descriptive statistics of the variables. It provides detail on the distribution of a variable. Features include:

detail on the extreme values of a variable
quartiles, such as the median
several plots to picture the distribution
frequency tables
tests of whether the data are normally distributed.

If a BY statement is used, descriptive statistics are calculated separately for groups of observations.

Syntax:

PROC UNIVARIATE DATA= SASdataset NOPRINT PLOT FREQ NORMAL PCTLDEF= value VARDEF= DF|WEIGHT|WGT|N|WDF ROUND= roundoff unit...;
VAR variables;
BY variables;
FREQ variable;
WEIGHT variable;
ID variables;
OUTPUT OUT= SASdataset keyword= names...;

We can get detailed descriptive statistics for family income using proc univariate as shown below.

PROC UNIVARIATE DATA=example1;

 VAR fam_income;

RUN;

We can also use PROC UNIVARIATE to get conditional univariate summaries using the BY statement; but first, the data must be sorted by the BY variable.

PROC SORT DATA=example1;

 BY Sex;

RUN;

PROC UNIVARIATE DATA=example1;
BY Sex;
VAR recall1;
RUN;
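
PROC SORT without an OUT= option replaces example1 with its sorted version. If you would rather leave the original data set in its original order, a sketch such as the following writes the sorted copy to a separate data set; the name example1_bysex is hypothetical.

```sas
/* Write the sorted observations to a new data set; example1 is unchanged */
PROC SORT DATA=example1 OUT=example1_bysex;
 BY Sex;
RUN;

/* BY-group processing then uses the sorted copy */
PROC UNIVARIATE DATA=example1_bysex;
 BY Sex;
 VAR recall1;
RUN;
```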

Another very handy use of PROC UNIVARIATE is the identification of outliers. To accomplish this, we add two options, NORMAL and PLOT, to the PROC UNIVARIATE statement, and we add an ID statement so that extreme observations are labeled with a meaningful identifier.

PROC UNIVARIATE DATA=example1 NORMAL PLOT;
VAR recall1;
ID id;
RUN;

In the preceding syntax, we ran PROC UNIVARIATE on recall at time 1 (recall1) and used values of the participant identification variable (id) to identify (ID) outlying values of recall1. In the next syntax we perform the same procedure, but separately for each gender (this produces 7 pages of output). Because a BY statement is used, the data must be sorted by Sex, as was done above.

PROC UNIVARIATE DATA=example1 NORMAL PLOT;
BY Sex;
VAR recall1;
ID id;
RUN;
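
The OUTPUT statement shown in the syntax above can save selected statistics to a data set, which is convenient when only a few numbers are needed rather than several pages of output. The following is a minimal sketch; the data set name recall1_summary and the statistic names prefixed with r1_ are hypothetical.

```sas
/* Suppress printed output and save selected statistics for recall1 */
PROC UNIVARIATE DATA=example1 NOPRINT;
 VAR recall1;
 OUTPUT OUT=recall1_summary
        N=r1_n MEAN=r1_mean STD=r1_sd
        Q1=r1_q1 MEDIAN=r1_median Q3=r1_q3
        PCTLPTS=5 95 PCTLPRE=r1_p;   /* creates r1_p5 and r1_p95 */
RUN;

PROC PRINT DATA=recall1_summary;
RUN;
```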

5. PROC FREQ

The procedure produces one-way to n-way frequency and crosstabulation tables. It shows the distribution of variable values and crosstabulation tables with combined frequency distributions for two or more variables. For one-way tables, PROC FREQ can compute chi-square tests for equal or specified proportions. For two-way tables, PROC FREQ computes tests and measures of association. For n-way tables, PROC FREQ does stratified analysis, computing statistics within as well as across strata.

Syntax:

PROC FREQ options;
OUTPUT <OUT= SAS-data-set> <output-statistic-list>;
TABLES requests / options;
WEIGHT variable;
EXACT statistic-keywords;
BY variable-list;

We can get a frequency distribution of age using proc freq as shown below.

PROC FREQ DATA=example1;

 TABLES age;

RUN;

We can make a two-way table showing the frequencies for class standing by sex as shown below.

PROC FREQ DATA=example1;

 TABLES cl_st * Sex;

RUN;
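
As mentioned above, PROC FREQ can compute tests of association for two-way tables. The following sketch adds a few TABLES statement options to the same crosstabulation: CHISQ requests chi-square tests, EXPECTED adds the expected cell frequencies, and NOROW and NOCOL suppress the row and column percentages to keep the table compact. Whether a chi-square test is appropriate for these data is left to the analyst.

```sas
/* Two-way table with chi-square tests and expected cell counts */
PROC FREQ DATA=example1;
 TABLES cl_st * Sex / CHISQ EXPECTED NOROW NOCOL;
RUN;
```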

Labeling values is a two-step process. First, we must create the label formats with PROC FORMAT using a VALUE statement. Next, we attach the label format to the variable with a FORMAT statement. This FORMAT statement can be used in either PROC or DATA steps. An example of the PROC FORMAT step for creating the value formats on class standing (cl_st) follows.

PROC FORMAT;

 VALUE cl_stf 1="Fre"

 2="Sop"

 3="Jun"

 4="Sen";

RUN;

Now that the format for class standing (cl_st) has been created, it must be linked to the variable. This is accomplished by including a FORMAT statement in either a PROC or a DATA step. In the program below, the FORMAT statement is used in a PROC FREQ step so that the values of cl_st are displayed with their labels.

PROC FREQ DATA=example1;

 FORMAT cl_st cl_stf.;

 TABLES cl_st;

RUN;
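
A FORMAT statement in a PROC step applies only to that step. To attach the format to the variable permanently, the same statement can be placed in a DATA step, as sketched below. The data set name example1b is hypothetical, and the sketch assumes the PROC FORMAT step above has already been run in the current SAS session.

```sas
/* Attach the cl_stf. format to cl_st in a new data set */
DATA example1b;
 SET example1;
 FORMAT cl_st cl_stf.;
RUN;

/* No FORMAT statement is needed here; the labels travel with the data set */
PROC FREQ DATA=example1b;
 TABLES cl_st;
RUN;
```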

6. PROC TABULATE

PROC TABULATE constructs tables of descriptive statistics using class variables, analysis variables, and keywords for statistics. Tables can have one to three dimensions: column; row and column; or page, row, and column.
The statistics that PROC TABULATE computes are many of the same statistics computed by other descriptive procedures such as MEANS, FREQ, and SUMMARY. In order for PROC TABULATE to execute, you need either a CLASS or VAR statement, and a TABLE statement. There are no default variables chosen for the procedure.

Syntax:

PROC TABULATE <option-list>;
CLASS class-variable-list;
VAR analysis-variable-list;
FREQ variable;
WEIGHT variable;
FORMAT variable-list-1 format-1 <...variable-list-n format-n>;
LABEL variable-1='label-1' <...variable-n='label-n'>;
BY <NOTSORTED> <DESCENDING> variable-1 <...<DESCENDING> variable-n>;
TABLE <<page_expression,> row_expression,> column_expression </ table-option-list>;
KEYLABEL keyword-1='description-1' <...keyword-n='description-n'>;

We can create a basic table of individuals' recall at time 2 (recall2) by gender (sex).

PROC TABULATE DATA=example1;

 CLASS sex;

 VAR recall2;

 TABLE (recall2)*mean, sex;

RUN;
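
The table expression can combine several class variables and statistics. The following sketch, offered as one possible layout rather than a recommended one, puts class standing in the rows (with an ALL row for the total) and, within each level of sex, reports the N, mean, and standard deviation of recall2 in the columns.

```sas
/* Rows: class standing plus a total row; columns: statistics within sex */
PROC TABULATE DATA=example1;
 CLASS Sex cl_st;
 VAR recall2;
 TABLE cl_st ALL, Sex*recall2*(N MEAN STD);
RUN;
```

In a TABLE statement, a comma separates dimensions (rows, then columns), an asterisk nests one element within another, and a blank places elements side by side.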

7. PROC GCHART & PROC GPLOT

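This section shows how to make simple graphs in SAS.

We can make a simple vertical bar chart of recall at time 1 (recall1). Because recall1 is a continuous variable, SAS automatically groups its values into bins (five in this example) and labels each bar with the bin midpoint.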

TITLE 'Simple Vertical Bar Chart ';

PROC GCHART DATA=example1;

 VBAR recall1;

RUN;

You can control the number of bins for a continuous variable with the LEVELS= option on the VBAR statement. The program below creates a vertical bar chart with nine bins for recall1.

TITLE 'Bar Chart - Control Number of Bins';

PROC GCHART DATA=example1;

 VBAR recall1/LEVELS=9;

RUN;

On the other hand, cl_st has only four categories and SAS's tendency to bin into five categories and use midpoints would not do justice to the data. So when you want to use the actual values of the variable to label each bar you will want to use the discrete option on the vbar statement.

We can make a bar chart showing the frequencies of class standing (cl_st) as shown below.

TITLE 'Bar Chart with Discrete Option';

PROC GCHART DATA=example1;

 VBAR cl_st/DISCRETE;

RUN;

Simply changing 'VBAR' to 'HBAR' will produce the same graph horizontally rather than vertically.

TITLE 'Bar Chart with Discrete Option';
PROC GCHART DATA=example1;
HBAR cl_st/DISCRETE;
RUN;
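
Bar charts are not limited to frequencies. The following sketch uses the SUMVAR= and TYPE= options of the VBAR statement so that the height of each bar is the mean of recall1 within each class standing; the title text is only illustrative.

```sas
TITLE 'Mean Recall at Time 1 by Class Standing';

/* Bar height = mean of recall1 for each value of cl_st */
PROC GCHART DATA=example1;
 VBAR cl_st / DISCRETE SUMVAR=recall1 TYPE=MEAN;
RUN;
QUIT;
```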

We can create a variety of scatter plots using the PROC GPLOT procedure. It allows us to see the relationship between two continuous variables. The program below creates a scatter plot for recall2 * recall1. This means that recall2 will be plotted on the vertical axis, and recall1 will be plotted on the horizontal axis.

TITLE 'Scatterplot - Two Variables';

PROC GPLOT DATA=example1;

 PLOT recall2*recall1;

RUN;

You may want to examine the relationship between two continuous variables and see which points fall into one or another category of a third variable. The program below creates a scatter plot for recall2*recall1 with each gender (Sex) marked. You specify recall2*recall1=Sex on the plot statement to have each level of sex identified on the plot.

TITLE 'Scatterplot - Male/Female Marked';

PROC GPLOT DATA=example1;

 PLOT recall2*recall1=Sex;

RUN;

The program below creates a scatter plot for recall2*recall1 with each level of Sex marked. The proc gplot is specified exactly the same as in the previous example. The only difference is the inclusion of symbol statements to control the look of the graph through the use of the operands V=, I=, and C=.

SYMBOL1 V=circle C=black I=none;
SYMBOL2 V=star C=red I=none;
TITLE 'Scatterplot - Different Symbols';
PROC GPLOT DATA=example1;
PLOT recall2*recall1=Sex;
RUN;
QUIT;

Symbol1 is used for the lowest value of Sex and symbol2 is used for the next lowest value.

V= controls the type of point to be plotted. We requested a circle to be plotted for females, and a star (asterisk) for males.
I= none causes SAS not to plot a line joining the points.
C= controls the color of the plot. We requested black for females, and red for males. (Sometimes the C= option is needed for any options to take effect.)

To plot a regression line along with the points we use the I operand of the symbol statement. The program below creates a scatter plot for recall2*recall1 with such an OLS regression line. The regression line is produced with the I=R operand on the symbol statement.

SYMBOL1 V=circle C=blue I=r;
TITLE 'Scatterplot - With Regression Line ';
PROC GPLOT DATA=example1;
PLOT recall2*recall1;
RUN;
QUIT;
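
The I=R operand can be combined with the grouped plot request shown earlier to draw a separate regression line for each gender. The following is a sketch that assumes Sex has exactly two values, so two SYMBOL statements are enough. SYMBOL definitions stay in effect until they are changed, so GOPTIONS RESET=SYMBOL is used first to clear any left over from earlier programs.

```sas
GOPTIONS RESET=SYMBOL;            /* clear earlier SYMBOL definitions */
SYMBOL1 V=circle C=black I=r;     /* lowest value of Sex */
SYMBOL2 V=star   C=red   I=r;     /* next lowest value of Sex */
TITLE 'Scatterplot - Regression Line for Each Gender';
PROC GPLOT DATA=example1;
 PLOT recall2*recall1=Sex;
RUN;
QUIT;
```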

Appendix: SAS function keys

Recommended mapping of SAS function keys (Windows, Mac, UNIX):

F1	help
F2	lib
F3	end
F4	recall
F5	pgm
F6	log
F7	output
F8	zoom off
F9	keys
F11	command bar
F12	next
Ctrl-F12	viewtable work._last_
Ctrl-F1	output;clear;log;clear;pgm;recall


Contact Information
Jon Starkweather, PhD	Jonathan.Starkweather@unt.edu	940-565-4066
Richard Herrington, PhD	Richard.Herrington@unt.edu	940-565-2140


Last updated: 2018.11.15 by Jon Starkweather.