Professional Documents
Culture Documents
- DRAFT -
I. ARRAY STATEMENT................................................................................................. 1
V. DATES ........................................................................................................................ 14
VII. REFERENCES........................................................................................................ 17
1
I. ARRAY STATEMENT
Purpose
Arrays are often used in conjunction with DO loops when performing actions for a
series of variables. The following example illustrates the same action being
performed on two separate diagnostic field variables. The study diagnosis of 820.0
can occur in either of these fields, and the statements are identical except for the
name of the diagnostic field. The intent of the following statements is to flag all
occurrences of the study diagnosis by creating a new variable - "HIPFRAC" - where ’1’
indicates the presence of the desired diagnosis.
Syntax
Arrays are set up using an ARRAY statement. It can appear anywhere in the DATA
step as long as it occurs prior to any reference to it. The variables that make up the
array are called elements. Individual elements are identified by subscripts: numbers
that identify an element’s position in the array.
Variable-n...variable-n lists the names of the variables (the variable list does not have
to begin at 1 - e.g., DX5-DX16)
3
Example
a) create a group or array name DIAG for the duration of the DATA step.
Note that DX01-DX16 are character variables and thus must be preceded by a "$".
You can refer to the entire array or just one of its elements when performing logical
comparisons or arithmetic calculations. All variables listed in the ARRAY statement
are assigned extra names with the form array-name{position}, where position is the
position of the variable in the list (1,2,3,...,16 in the example). The additional name is
called an array reference, and the position is often called the subscript.
In the above ARRAY statement, DX01 is assigned the array reference DIAG{1}; DX02
the array reference DIAG{2}; etc. From that point in the data step, you can refer to
the variable by either its original name or by its array reference; for example, the
names DX01 and DIAG{1} are equivalent.
Caution
II. DO LOOP
Purpose
Syntax
Index-variable is a name you choose (e.g., "I"). Its value changes with each iteration
of the loop, or each time the loop is processed. By default, the value of the index-
variable (I) is increased by 1 before each new iteration of the loop, consecutively
representing the values 1 to n (number of variables in array).
Three examples follow, 1) the basic DO loop, 2) DO UNTIL, and 3) DO WHILE. The
most efficient method is the last - DO WHILE - and is highly recommended over the
others.
5
Example 1
(1) Create an array called "diag" to represent the group of diagnostic fields dx01-dx16
for the duration of the data step.
(2) Create a new variable named "hipdiag" and set it equal to zero. It will remain at
0 if none of the 16 diagnostic fields for that record have a hip diagnosis present.
(3) Perform the actions in the loop sixteen times for each record in the data set.
When the value of "i" is 1, SAS reads the array reference as DIAG{1} and processes
the following statements on DIAG{1}, that is, DX01. In each iteration of the loop, the
subscript associated with DIAG is replaced with the index variable’s (i) current value.
(4) When a diagnosis within the specified range is encountered, the variable
"hipdiag" will be assigned a value of ’1’.
Example 1: Output
OBS DX01 DX02 DX03 DX04 DX05 DX06 DX07 DX08...DX16 HIPDIAG
1 650 V270 0
2 71783 0
3 4549 0
4 V664 82021 E8809 4538 3569 36250 7213 2859 1
5 V301 7746 0
6 8208 E888 4019 3310 1
7 4111 4140 4280 4011 42731 586 5990 V668 0
8 486 0
9 V72 0
10 650 V270 0
Observation 4 has hipdiag set to ’1’ because DX02 has a value of ’82021’, thus falling
within the ICD-9-CM range 820-820.99. Observation 6 similarly has hipdiag set to ’1’
because DX01 has a value of ’8208 ’.
Conditional Loops
The DO loop in the above example is not very efficient because, for each observation,
it will process all 16 diagnostic fields regardless of whether a hip diagnosis is found.
Even though the condition hipdiag=1 is found within dx02 in observation 4, for
example, the DO loop will continue to process the remaining diagnostic fields.
Using DO WHILE, the statement in the DO loop for observation 4 in the above
example will be executed twice (i=2) because the condition hipdiag=’1’ was met with
the second diagnostic field (dx02). Using DO UNTIL, however, the statement is
processed three times (i=3) in order to meet the condition hipdiag^=’0’ after setting
hipdiag to ’1’ with dx02.
7
(1) Perform the actions within the DO LOOP up to a maximum of 16 times. The loop
will continue to process the statements for each consecutive diagnostic field until it
encounters one that is missing (DIAG{i}=’ ’) OR until a field containing hip diagnoses
is encountered (hipdiag^=0). If either of these conditions are met, hipdiag will be
assigned a value of ’1’.
(2) When a diagnosis within the specified range is encountered for the record,
hipdiag will be assigned a value of ’1’ and the remaining diagnostic fields will not be
processed. For example, if hipdiag is set to "1" because dx01 falls within the specified
range, then the DO loop would only be processed once.
(1) Perform the actions in the DO LOOP up to a maximum of 16 times. The loop will
continue to process the statements for each consecutive diagnostic field only as long
as it is not missing (DIAG{i}^=’ ’) AND there are no hip diagnoses (hipdiag=0).
Caution
1. When coding many DO loops in a DATA step or when enclosing DO loops within
other DO loops, take care that each DO has a corresponding END. It is very easy to
lose an END in a long program!
2. If you want to flag the position of each ICD-9-CM code, use DO UNTIL to get
accurate positions (DO WHILE begins the position count at "2" rather than "1").
8
Purpose
Syntax
BY varname1 varname2...;
For the first record in a BY group, the value of the FIRST.varname1 is set to 1, with
all other records in the BY group set to 0. For the last record in a BY group, the value
of the LAST.varname1 is set to 1, with all other records set to 0. If the data are
sorted by more than one BY variable, the FIRST.varname for each variable is set to 1
at the first occurrence of a new value for the variable. FIRST. and LAST. variables
are not kept in the newly created data set since they are usually just used for
subsequent processing. Their values can be assigned, however, to other variables.
Following are examples of single and multiple by-group processing.
9
DATA dup;
SET hosp;
BY phin; (1)
firstfl=FIRST.phin; (2)
lastfl=LAST.phin; (3)
RUN;
(1) Set the data by PHIN (already previously sorted by this variable) in order to
create FIRST.PHIN and LAST.PHIN.
(2) Create a new variable called FIRSTFL and assign it a value of 1 for every
FIRST.PHIN=1 encountered.
(3) Create a new variable called LASTFL and assign it a value of 1 for every
LAST.PHIN=1 encountered.
Example 1: Output
1 562737 1 1
2 563850 1 1
3 563961 1 1
4 565858 1 1
5 566739 1 1
6 568729 1 0
7 568729 0 0
8 568729 0 1
9 569961 1 1
10 660861 1 1
In the above example, the person with PHIN 568729 has 3 records (observations 6-8).
For the first record (#6), FIRSTFL is set to 1, indicating that it is the first record for
that person, and all other records for that PHIN show FIRSTFL values set to 0. For
the third and last record, LASTFL is set to 1, indicating that it is the last record for
that person, and all other records show LASTFL values set to 0.
10
DATA dup;
SET hosp;
BY phin dateadm; (1)
firstph=FIRST.phin; (2)
lastph=LAST.phin; (3)
firstdt=FIRST.dateadm; (4)
lastdt=LAST.dateadm; (5)
RUN;
(1) Set the data by PHIN and DATEADM (already previously sorted by these
variables) in order to create FIRST.PHIN and LAST.PHIN and FIRST.DATEADM
and LAST.DATEADM.
(2) Create a new variable called FIRSTPH and assign it a value of 1 for every
FIRST.PHIN encountered.
(3) Create a new variable called LASTPH and assign it a value of 1 for every
LAST.PHIN encountered.
(4) Create a new variable called FIRSTDT and assign it a value of 1 for every
FIRST.DATEADM encountered.
(5) Create a new variable called LASTDT and assign it a value of 1 for every
LAST.DATEADM encountered.
11
Example 2 Output
1 599748 901003 1 1 1 1
2 690876 910109 1 1 1 1
3 695729 900607 1 0 1 1
4 695729 900829 0 0 1 1
5 695729 901004 0 1 1 1
6 794727 910304 1 1 1 1
7 794872 910326 1 1 1 1
8 795729 910112 1 0 1 0
9 795729 910112 0 1 0 1
10 795745 901211 1 1 1 1
When sorted and set by PHIN and DATEADM the FIRST.PHIN and LAST.PHIN
variables will have different values if there are multiple records per PHIN. If there is
more than one record with the same date of admission, the FIRST.DATEADM and
LAST.DATEADM variables will also have different values. If DATEADM is not the
same, however, then the first record for that date is the last record for that date and
both FIRST.DATEADM and LAST.DATEADM will be flagged as 1.
In the above example, the person with PHIN 795729 has 2 records. FIRSTPH is set
to 1 for the first record (#8) and 0 for the second, or last, record, with LASTPH set to 1
for the last record (#9) and 0 for the first record. DATEADM is identical as well on
both records, so FIRSTDT is set to 1 on the first record and 0 for the other, with
LASTDT set to 1 for the last record and 0 for the first record.
The person with PHIN 695729 has 3 records. For the first record (#3), FIRSTPH is
set to 1, representing the first record for that person, and all other records have
FIRSTPH set to 0. For the last record (#8), LASTPH is set to 1, representing the last
record for that person, and all other records show this variable set to 0. Unlike PHIN
795729, however, DATEADM is unique for each record, and FIRSTDT and LASTDT
are each set to 1 for each record.
Caution
Purpose
Syntax
The RETAIN statement can be used to specify initial values for variable(s) or
elements of an array. All elements or variables will be initialized to the specified
value.
Varlist: specifies the names of the variables, lists or arrays whose values you wish to
retain.
For example, the statement RETAIN pop 1; within a DATA step will assign a value
of 1 to each observation for the variable POP.
13
Example
The RETAIN statement is often used to carry over values of a specified variable from
one observation to the next. In the following example, a count is accumulated of the
number of admissions for each person (COUNT) and a flag is created to denote
whether the person has been panelled for a personal care home (PANELLED).
(1) Begin a DATA step, creating a SAS data set called "check2".
(2) Read data from the existing SAS data set "check", which has been sorted by
PHIN DATEADM DATESEP.
(3) Set the data set "CHECK2" by PHIN DATEADM DATESEP in order to create the
variable FIRST.PHIN. (Each PHIN has also been ordered by date - from earliest-
occurring dates of admission and separation to most recent admission and
separation.)
(4) Keep a value of 0 for the new variables, COUNT and PANELLED, for each
observation.
(5) Assign a value of 0 to both COUNT and PANELLED to the first record
encountered for each person.
(6) For each person, cumulate the value of COUNT, increasing it by 1 for each
subsequent admission.
(7) For each person, cumulate the value of PANELLED, flagging the first-occurring
record with "1", and increasing by 1 each subsequent record that satisfies the
definition of PANELLED.
14
Example Output
The last PHIN, 1009, shows four records (four hospital stays). For the first record,
COUNT is flagged as "1" and, for each subsequent record, it increases by 1. The last
record is the fourth record in this case, so COUNT=4 for that record.
The last PHIN also shows that PRIMPSVC ends with "99" for the third record
(observation #9), and it is the first time this code is encountered for that person so
PANELLED is set to "1". The code is encountered again with the next record, so
PANELLED increases by 1 to "2". PHIN 0005, on the other hand, has its first record
set to "1" for PANELLED, and there are no other subsequent records with this code,
so PANELLED remains at "1".
Keeping only the last record per person would provide a summary record of how many
admissions each person had and how many times each was panelled for a personal
care home.
15
V. DATES
Dates can be represented in a number of different ways within a data set; they can be
character or numeric; for example:
Character: 7/26/89
26JUL89
26JUL1989
July 26, 1989
Note that the only way to tell whether values such as 892607 in a SAS data set have
been input as numeric or character is to do a PROC CONTENTS on the data set.
892607 can refer to July 26, 1989 or it can be handled as the number 892,607.
A SAS date value is calculated as the number of days between January 1, 1960 and
any specified date (dates prior to January 1, 1960 are represented with negative
numbers). The date August 8, 1990, for example, would be converted to the number
11178 as a SAS date value. This number indicates that the day it represents is the
11,178th day after January 1, 1960.
Date Manipulation
Several examples follow of how to convert character and numeric dates to SAS dates
and back again, as well as of how to select data by date. In the examples, missing
values are represented by ’000000’ for character values, 0 for numeric values, and . for
SAS date values.
16
Example of manipulation
DATA check;
SET hosp;
* DATE CONVERSIONS *;
IF datesep>’000000’ THEN
sdtsep=INPUT(datesep,yymmdd6.); * Character to SAS date;
IF sdate>0 THEN
sdtsrg=INPUT(PUT(sdate,z6.),yymmdd6.); * Numeric to SAS date ;
IF sdtsep>. THEN
datesepx=PUT(sdtsep,yymmdd6.); * SAS date to character date ;
* DATE CALCULATIONS *;
IF deathdt>’000000’ THEN
sdthdt=INPUT(deathdt,yymmdd6.); * Character to SAS date
days2dth=(sdthdt-sdtsep); * Calculate days to death
using SAS
dates;
RUN;
When converting character dates that are 6-digits long and comprise 2-digit year, 2-
digit month, and 2-digit day to SAS dates, an example of an appropriate format is
"yymmdd6."; it indicates that each of the 3 fields is 2 digits wide in order of year,
month, and day, and the "6" indicates that the output should be 6 digits long, e.g.,
930401 (yymmdd8. would add slashes between the 2-digit components, i.e., 93/04/01).
A SAS date value, for example, 11059, can be referred to as ’12APR90’d in your
program. It is a SAS date constant, and "d" tells SAS to convert the calendar date to
a SAS date value.
In addition to using informats to store dates as SAS dates, SAS date formats are
available to transform SAS date values into recognizable calendar dates for the
17
output. Each format name ends with a period, and contains a width specification to
indicate how many columns to use when displaying the date value. Several examples
are presented below.
18
Example of display
In the above output, the variable DATEADM is displayed in 5 different ways - the
original character date value and the 4 formatted values (formatted using the SAS
date variables).
Warnings
* Missing date values can be represented in several ways, e.g., a blank space, a
period, 000000. Check how they are represented before performing data
manipulation.
* Check for illegal values which may cause SAS dates to return missing values. Also,
watch the century of your date values, e.g., 18xx and 20xx.
19
* Watch for dates stored as numeric values; make sure that they are converted to
SAS dates before using them in calculations because SAS will not provide warnings
that your formulas are inappropriate/inaccurate.
20
VI. ACKNOWLEDGEMENTS
All four modules were prepared by Murray Werbeniuk, with some additional
development and editing by Ruth Bond and Charles Burchill. Input was also
provided by Shelley Derksen, Pat Nicol, Andre Wajda, and Randy Walld.
VII. REFERENCES
SAS Institute Inc: SAS Language and Procedures: Usage, Version 6, First Edition,
Cary. NC: SAS Institute Inc., 1989. 638 pp. (arrays, do loops, by-group
processing, dates)
SAS Institute Inc: SAS Language: Reference, Version 6, First edition Cary, NC: SAS
Institute Inc., 1990. 1042 pp. (arrays)
SAS Institute Inc: SAS Programming Tips: A Guide to Efficient SAS Processing,
Cary, NC: SAS Institute Inc., 1990, 155 pp. (dates)
Spector, Paul E., SAS Programming For Researchers and Social Scientists. Sage
Publications Ltd. 166 pp. (do loops, retain)