DATA Step: Concept of PDV and RETAIN Statement

When creating a new data set, for each data line from the reference, SAS first generates an empty list known as PDV (Program Data Vector). This is a temporal space where SAS populates values by executing provided DATA step statements. After SAS finishes executing all statements in the current DATA step, the populated PDV is added to the new data set as an observation. This process will be repeated up until for the last row in the data source.

PDV Automatic Variables

During the process of populating PDV, SAS automatically generates temporary variables to store some information. These variables are known as automatic variables. Below is the list of these variables:

  • _N_: This variable is an integer that holds the current observation number within the DATA step. It starts at 1 and increments by 1 for each observation processed.
  • _ERROR_: This is a numeric variable that indicates whether an error occurred during the current observation processing. A value of 1 signifies an error has been encountered, while 0 indicates no error.
  • FIRST.BY and LAST.BY: These two automatic variables are created for each BY variable you specify in the BY statement. They are Boolean (1 or 0) and indicate the position within a BY group:
    • 1: The variable is set to 1 for the first observation in a BY group.
    • 0: The variable is set to 0 for subsequent observations within the same BY group.
    • 1 again: It changes back to 1 when a new BY group starts.

For example:

DATA One;
INPUT VarA;
DATALINES;
1
2
3
;

DATA Two;
INPUT VarB;
DATALINES;
11
12
13
14
;

DATA Three;
INPUT VarC;
DATALINES;
31
32
33
34
35
;

DATA Combine;
SET One;
IF _N_ = 2 THEN SET Two;
IF _N_ = 3 THEN SET Three;
RUN;

In this example, _N_ is initialized to 1 during the first iteration. Then, SAS processes the second statement SET One;, which reads the first observation from the data set One. Since _N_ = 1 for the current iteration, the remaining SET statements will not be executed.

Next, _N_ is incremented by 1, so now _N_ = 2. This time, after executing SET One;, SAS also executes SET Two, as _N_ = 2 evaluates to true. Here, it is important to note that _N_ values are based on the data source. Thus, for the current observation, VarA would be the second observation from the data set One. On the other hand, SET Two reads the first observation from the data set Two, concatenating it horizontally.

Then, _N_ is incremented again by 1, so not _N_ = 3. So, it reads the third observation from the data set One. This time, since _N_ = 3, SET Two is not executed; instead, SET Three is executed and reads the first observation from the data set Three. However, VarB of the third observation will be 11, not missing. This is because PDV for the current observation is created by copying that of the previous observation. So, in this example, SAS copies the second observation and then executes the three statements in the DATA step, starting from SET One;.

Generally, automatic variables exist temporarily only while reading objects and are not stored in the output data set. So, if you want to see the variables in your output, you should explicitly specify it as follows:

DATA Combine2;
SET One;
IF _N_ = 2 THEN SET Two;
IF _N_ = 3 THEN SET Three;
Nobs = _N_;
Error = _ERROR_;
RUN;

In tasks like report writing and data management, a common requirement is to identify the beginning and end of each group of observations. Two automatic variables, FIRST.BY and LAST.BY, store the start and end of observation groups defined by the BY variables. 

One useful applications of FIRST.BY and LAST.BY is checking whether the two consecutive observations for a variable are different. For example:

DATA One;
INPUT VarA VarB;
DATALINES;
A 1
B 2
D 3
C 5
C 4
C 6
D 7
;

DATA Two;
SET One;
BY VarA NOTSORTED;
First = FIRST.VarA;
Last = LAST.VarA;
RUN;

In this example, the data set Two maintains the same order as One. The NOTSORTED option allows you to use the BY statement without sorting it first. Thus, in this example, the automatic variable FIRST.VarA acts as a marker, checking if the VarA value in the current observation is different from its preceding one. Similarly, LAST.VarA also acts as a marker, checking if VarA in the next observation is different from the current one.

RETAIN Statement

At the moment it is created for the first variable, the PDV is an empty vector. Then, SAS reads data from the specified data source, and populates PDV by the instructions. In the process, the RETAIN statement provides a specific value for this initial state of variable and preserves it to the next iteration. The RETAIN statement can appear anywhere in the DATA step and has the following form:

RETAIN Variable-1 [Initial-value-1] ... ;

Where:

  • Variable-1: Name of the variable where you want to retain the initial value.
  • Initial-value-1 (Optional): The initial value of the variables specified in the statement. If omitted, it defaults to a missing value.

You can have as many variable-value pairs as needed. For example, let's consider the following data file:

This data file logs the active energy consumption in Watt-hours by a household on a per-minutes basis, calculated as Global_active_power * 1000 / 60 - Sub_metering_1 - Sub_metering_2 - Sub_metering_3. Now, let's consider the following DATA step:

DATA EnergyConsumption;
INFILE '/home/u63368964/source/household-power-consumption.txt' DLM=';' FIRSTOBS=2;
INPUT Date :ANYDTDTE10. Time TIME8. GlobalActivePower GlobalReactivePower Voltage GlobalIntensity SubMetering1 SubMetering2 SubMetering3; /* Calculate active energy consumption */ ActiveEnergyConsumption = GlobalActivePower * 1000 / 60 - SUM(SubMetering1, SubMetering2, SubMetering3); /* Keep record high of the active energy consumption */ RETAIN RecordHigh; RecordHigh = MAX(RecordHigh, ActiveEnergyConsumption); RUN;

In the DATA step, RETAIN RecordHigh; declares a variable with its default missing value. Subsequently, RecordHigh = MAX(RecordHigh, ActiveEnergyConsumption); updates the RecordHigh variable to store the maximum value encountered thus far, comparing the ActiveEnergyConsumption in the current PDV with the previously stored value in the variable. The updated RecordHigh value will be preserved until it encounters larger value in the iterations.

Calculating Running Totals

One of the most common places where you can leverage the RETAIN statement would be calculating running totals. A running total is the cumulative sum of a sequence of numbers. It gets updated each time a new number is added to the sequence by adding the value of the new number to the previous running total. For example, let's consider the following DATA step:

DATA EnergyConsumption2;
SET EnergyConsumption;
RETAIN RunningTotal -10;
RunningTotal = SUM(RunningTotal, ActiveEnergyConsumption); RUN;

This DATA step builds upon the previous one and calculates a running total for the ActiveEnergyConsumption. To be more specific, RETAIN RunningTotal -10; declares a variable named RunningTotal that's retained across observations within the DATA step. It is initialized to -10, representing the starting point for the running total.

Subsequently, RunningTotal = SUM(RunningTotal, ActiveEnergyConsumption); calculates and updates the RunningTotal for each observation. It uses the SUM function to combine the RunningTotal from the previous iteration with the current ActiveEnergyConsumption

An alternative way to achieve the same result is the sum statement; it doesn't require any keyword or equal sign, and retains values from the previous iteration of the DATA step. Essentially, this statement cumulatively adds the values of an expression to a variable. 

variable + expression;

For example:

DATA EnergyConsumption3;
SET EnergyConsumption;
RunningTotal + ActiveEnergyConsumption;
RUN;

Carry-over Calculations

For basic calculations that require a value from the previous observation, RETAIN can be a quick solution. Let's consider the following DATA step:

DATA EnergyConsumption4;
SET EnergyConsumption;
RETAIN PreviousConsumption;
CurrentDifference = ActiveEnergyConsumption - PreviousConsumption;
PreviousConsumption = ActiveEnergyConsumption;
RUN;

In this DATA step, the PreviousConsumption is initiated with a missing value for the first observation. Then the CurrentDifference is calculated subtracting this from the current ActiveEnergyConsumption. After that, PreviousConsumption is updated by the current ActiveEnergyConsumption, so that the two variables have the same value before moving onto the next observation.

Post a Comment

0 Comments