DATA Step: Iterative DO and ARRAY Statements

Iterative DO Statements

In a previous blog post, the DO-END block was introduced as a way to group multiple actions for an IF-THEN statement. However, its usage is not limited to IF-THEN statement. In fact, the DO-END blocks can also be employed to perform iterative operations in a DATA step.

DO-TO Statements

The DO-TO statements specifies a variable that iterates over a range of numeric values. It has the following general form:

DO Variable = start TO end BY n;
<Some instructions you want to execute for each iteration>
END;

Where:

  • Variable: This is a numeric variable that you define, which can be any valid SAS variable name.
  • start and end: Numeric values or expressions that define the range of values for the variable. The DO loop iterates from the start to the end, incrementing n for each iteration.
  • n: This specifies the increment by which a variable increases with each iteration. If BY clause is omitted, n defaults to 1.

For example, let's consider the following DATA steps:

DATA MySequence;
DO i = 1 TO 5;
x = i * 10;
OUTPUT;
END;
RUN;

In the DATA step above, DO i = 1 TO 5; initializes a loop where i ranges from 1 to 5. Next, the following line calculates a variable x by multiplying the current value of i by 10. Then OUTPUT statement writes each iteration's result to the MySequence data set. After each iteration, the value of i increases by 1 until it reaches to 6, making sure we have 5 iterations from i = 1 to i = 5

You can also nest DO-TO iterations. For example:

DATA MyPrimes (DROP=divisor IsPrime);
DO num = 2 TO 100;
IsPrime = 1;

/* Check if num is divisible by any number other than 1 and itself */
DO divisor = 2 TO SQRT(num);
IF MOD(num, divisor) = 0 THEN DO;
IsPrime = 0;
LEAVE; /* Exit the inner loop early if not prime */
END;
END;

/* Output num if it is prime */
IF IsPrime = 1 THEN OUTPUT;
END;
RUN;

In this example, the outer DO loop iterates through num from 2 to 100, and initiates a variable IsPrime to 1. Subsequently, the inner DO loop iterates through potential divisors (divisor) from 2 to the square root of num. Here, for each divisor, if MOD(num, divisor) equals to 0, then change IsPrime to 0, and exit the inner loop using LEAVE statement. On the other hand, if IsPrime value remains as 1, then outputs the num to output data set MyPrimes. Lastly, the suboption (DROP=divisor IsPrime) drops the two temporary variables divisor and IsPrime.

In the DO-TO statement, specifying a BY clause allows you to set increment amount for each iteration. For example:

DATA MySample;
DO i = 2 TO N BY 10;
SET MyData.Boston NOBS=N POINT=i;
IF _ERROR_ THEN ABORT;
OUTPUT;
END;
STOP;
RUN;

This loop iterates a variable i from a starting value of 2 to a final value of N. Here, the loop increments i by 10 in each iteration (i.e., 2, 12, 22, ..., N). For each iteration, the SET statement brings all observations from MyData.Boston, assigns the number of observations to N, and retain observations whose automatic variable _N_ equals to i. Consequently, MySample will extract every 10th observations from MyData.Boston, sampling 51 out of 506 observations. This is the concept of systematic sampling, that will be discussed in a later post.

DO-WHILE and DO-UNTIL Statements

At the time of writing code, sometimes, you don't know exactly how many times you would like to iterate over. Instead, you only know the condition of when the iteration should stop. In such cases, you can use DO-WHILE statements. At the beginning of each iteration, it evaluates if the specified condition is true. If it is, then the DO-WHILE statements executes the block of statements. The iteration continues until the condition evaluates to false, at which point the loop terminates. For example:

DATA MyData;

x = 1;

/* Increment x while it is less than or equal to 10 */
DO WHILE(x <= 10);
y = x ** 2;
OUTPUT;
x + 1; /* Increment x by 1 */
END;
RUN;

In this DATA step, x = 1; initializes the variable x to 1. Then DO WHILE(x <= 10); begins a loop that continues while x is less than or equal to 10. For each iteration, SAS first checks if the given condition x <= 10 is true. If it is, then it performs the statements under the DO-WHILE-END block. Then, at the end of each iteration, x + 1; increases x by 1. This makes sure that the loop terminates at some point.

DO-UNTIL statements work similar to DO-WHILE blocks. However, unlike DO-WHILE statement, DO-UNTIL checks the condition is false, at the end of each iteration. When it is submitted for execution, DO UNTIL statement first executes statements under the block. Then at the end of the first iteration, it evaluates if the given condition is still false. If it is, then SAS moves to the next iteration. After executing the statements, if the condition evaluates to true, then SAS terminates the loop. For example:

DATA MyData;

x = 1;

/* Increment x while it is less than or equal to 10 */
DO UNTIL(x > 10);
y = x ** 2;
OUTPUT;
x + 1; /* Increment x by 1 */
END;
RUN;


ARRAY Statements

In computer programming, an array refers to an ordered collection of items with the same data type. Each item in an array is uniquely identified by an index, so that you can access and manipulate specific items based on their indices. 

The ARRAY statement in SAS DATA step groups multiple data set variables, as long as they are either all numeric or character. By grouping variables under a single name, you can perform repetitive tasks without having to write statements for each variable. The ARRAY statement follows this general form:

ARRAY Name (n) $ Variable-list;

Where:

  • Name: Name you give to the array.
  • n: Number of variables in the array.
  • $: Indicates if the array consists of all character variables.
  • Variable-list: list of all variables you want to include in the array

For example, let's consider the following data set:

The wine quality data set, created by Paulo Cortez and his team at the University of Minho, Portugal, has 13 variables and 6497 observations on wine quality metrics. It measures 11 physicochemical test scores, such as acidity, alcohol, pH, etc. on wines produced in the Vinho Verde region of Portugal, as well as 0 to 10 quality scores.

Now, for each observation, suppose that you want to check if any of the acidity measurement is missing, and if so, delete the observation.

DATA WineQualityCleaned;
SET WineQuality;
ARRAY WineAttributes (11) FixedAcidity VolatileAcidity CriticAcid ResidualSugar Chlorides FreeSulfur TotalSulfur Density pH Sulphates Alcohol;
DO i = 1 TO 11;
IF WineAttributes[i] = . THEN DELETE;
END;
RUN;

After executing the DATA step above, the WineQualityCleaned has 6463 observations: 34 observations with missing values were excluded in the new data set creation.

In the DATA step, an array WineAttributes is defined with 11 variables. Subsequently, there is an iterative DO statement. All statements in the DO and END statement are executed, in this example, eleven times, once for each variable in the array.

The variable i act as an index. Starting from the value of 1, each time through the DO loop, i is incremented by 1, and is referencing the i-th variable in the array. For example, WineAttributes[1] refers to FixedAcidity, WineAttributes[2] refers to VolatileAcidity, and so forth. This continues through all eleven variables in the array, leaving the value of 12 in i

Shortcuts for Variable Lists

When there are too many variables, instead of listing them all, you can use shortcuts for list of variable names. For example:

DATA WineQualityCleaned2;
SET WineQuality;
ARRAY WineAttributes (11) FixedAcidity -- Alcohol;
DO i = 1 TO 11;
IF WineAttributes[i] = . THEN DELETE;
END;
RUN;

In this example DATA step, FixedAcidity -- Alcohol refers to all variables from FixedAcidity to Alcohol. Thus, this DATA step will make the exactly the same output:

In addition to this, SAS also reserves some special name lists:

  • _NUMERIC_: All numeric variables in a dataset.
  • _CHARACTER_: All character variables in a dataset.
  • _ALL_: All variables in a dataset.

Using the special name lists, you can also shortcut variables like:

  • FixedAcidity _NUMERIC_ Alcohol: All numeric variables from FixedAcidity to Alcohol.
  • FixedAcidity _CHARACTER_ Alcohol: All character variables from FixedAcidity to Alcohol.
  • FixedAcidity _ALL_ Alcohol: All variables from FixedAcidity to Alcohol.

Note that the name abbreviations are not limited to the ARRAY statement. In fact, you can also employ this strategy anywhere in a DATA step or even in a PROC step to make your code more maintainable.

Post a Comment

0 Comments