DATA Step: Crafting New Variables

Crafting new variables from raw data is often essential. In many cases, this practice can significantly improve the performance of statistical models and machine learning algorithms. Depending on the data type and analysis purpose, various techniques can be employed to derive new variables. When working with tabular data, commonly employed techniques include:

  • Transformations: Combining multiple variables through mathematical operations performed on existing variables (e.g., calculating body mass index using height and weight.)
  • Binning: Grouping continuous variables into categories (e.g., creating age ranges from a continuous age values.)
  • Encoding: Converting categorical variables into numerical representations (e.g., one-hot encoding for colors.)
  • Standardizing: Scaling variables to a common range, often between 0 and 1 or with a mean of 0 and standard deviation of 1. 

In a SAS DATA step, you can easily define a new variable as follows:

NewVariable = Expression;

On the left side of the equal sign is the variable name, which can be either a new one, to append one more variable to the data set, or existing one, to re-define it by the expression. On the right side of the equal sign can be a constant, an existing variable, or mathematical expressions combining them. Here are available mathematical expressions: 

OperationDescriptionExample
Numeric constantAssigns the value 10 to the variable x.x = 10;
Character constantAssigns the string "John" to the character variable name.name = "John";
New variableAssigns the sum of num1 and num2 to the new variable total.total = num1 + num2;
AdditionAdds the values of num1 and num2 and assigns the result to the variable totaltotal = num1 + num2;
SubtractionSubtracts the value of num2 from num1 and assigns the result to the variable difference.difference = num1 - num2;
MultiplicationMultiplies the values of num1 and num2 and assigns the result to the variable product.product = num1 * num2;
DivisionDivides the value of num1 by num2 and assigns the result to the variable quotient.quotient = num1 / num2;
ExponentiationRaises the value of base to the power of exponent and assigns the result to the variable result.result = base ** exponent;

SAS follows the standard mathematical rules of precedence: exponentiation takes precedence, followed by multiplication and division, lastly addition and subtraction. Parentheses can be used to override this order. 

Let's consider an example scenario. According to fitness experts, the one-repetition maximum (one-rep max or 1RM) in weight training refers to the maximum weight an individual can possibly lift for a single repetition. Based on the past workout logs, there are several different ways to calculate 1RM:

The following DATA steps perform different 1RM calculations using Reps and Weights

DATA WorkoutLogs;
INPUT Timestamp :DATETIME. Excercise $ Reps Sets Weights;
DATALINES;
27MAR2023:08:00:00 Squats 10 3 100
27MAR2023:08:15:00 Push-ups 15 3 0
27MAR2023:08:30:00 Bench-press 8 4 120
27MAR2023:08:45:00 Deadlifts 12 3 150
27MAR2023:09:00:00 Lunges 10 3 80
27MAR2023:09:15:00 Pull-ups 8 4 0
27MAR2023:09:30:00 Bicep-curls 12 3 40
27MAR2023:09:45:00 Planks . . .
27MAR2023:10:00:00 Shoulder-press 10 3 50
;
RUN;

DATA OneRepMax;
SET WorkoutLogs;
Epley = Weights * (1 + 0.0333 * Reps);
Brzycki = Weights / (1.0278 - 0.0278 * Reps);
Lombardi = Weights * Reps ** 0.1;
OConner = Weights * (1 + 0.025 * Reps);
Wathan = Weights * 100 / (48.8 + 53.8 * EXP(-0.075 * Reps));
RUN;

In addition to the arithmetic operations, SAS provides some built-in mathematical functions:

FunctionDescriptionExample
ROUND(x, int)Rounds a numeric value to a specific number of decimals.
ROUND(3.141592, 2) returns 3.14.
INT(x)Returns the integer part of x, by truncating any fractional digits.INT(3.75) returns 3
INT(-2.1) returns -2.
ABS(x)Returns the absolute value of x.ABS(-5) returns 5.
CEIL(x)Returns the smallest integer that is greater than or equal to x.
CEIL(4.3) returns 5
CEIL(-1.5) returns -1.
FLOOR(x)Returns the largest integer that is less than or equal to x.FLOOR(4.8) returns 4
FLOOR(-1.5) returns -2.
LOG(x)Returns the natural logarithm of x (base-e logarithm).
LOG(10) returns 2.302585.
EXP(x)Returns the base-e raised to the power of x.EXP(1) returns 2.71828.
SQRT(x)Returns the square root of x.SQRT(9) returns 3.
MAX(arg1arg2, ...)Returns the maximum value among the arguments.MAX(5, 10, 3) returns 10.
MIN(arg1arg2, ...)Returns the minimum value among the arguments.MIN(5, 10, 3) returns 3.
SUM(arg1arg2, ...)Calculates the sum of the arguments (arg1, arg2, ...)SUM(5, 10, 3) returns 18.
MEAN(arg1arg2, ...)Calculates the sum of the arguments (arg1, arg2, ...)MEAN(5, 10, 3) returns 6.

It is worth noting that the functions listed above perform element-wise calculations across variables within an observation, not across observations within a variable. For example: 

DATA OneRepMax;
SET WorkoutLogs;
Epley = Weights * (1 + 0.0333 * Reps);
Brzycki = Weights / (1.0278 - 0.0278 * Reps);
Lombardi = Weights * Reps ** 0.1;
OConner = Weights * (1 + 0.025 * Reps);
Wathan = Weights * 100 / (48.8 + 53.8 * EXP(-0.075 * Reps));
MaxRM = MAX(Epley, Brzycki, Lombardi, OConner, Wathan);
MinRM = MIN(Epley, Brzycki, Lombardi, OConner, Wathan);
RUN;

PROC PRINT DATA=OneRepMax;
RUN;

Let's consider another example. In the following raw data file, Month, Day, and Year variables are delimited by spaces.

When importing this file into a SAS data set, we want to craft a variable of SAS date values, combining these three pieces of information. You can achieve this through the MDY(monthdateyear) function as follows: 

DATA CarSales;
INFILE '/home/u63368964/source/car-sales.dat';
INPUT SalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;
SasDate = MDY(SalesMonth, SalesDate, SalesYear);
RUN;

PROC PRINT DATA=CarSales;
RUN;

Using IF-THEN Statements to Craft New Variables

Instead of applying an expression to all observations, we often want to use it only for those that meet certain criteria. This is called conditional logic, and you do it with IF-THEN statement in SAS programming.

IF condition THEN action;

If condition evaluated to be true, then SAS executes action. Here are operators that can be used in the condition part:

Symbolic Mnemonic Description
= EQ Checks if two values are equal.
~= NE Checks if two values are not equal.
^= NE Checks if two values are not equal.
> GT Checks if the left operand is greater than the right operand.
>= GE Checks if the left operand is greater than or equal to the right operand.
< LT Checks if the left operand is less than the right operand.
<= LE Checks if the left operand is less than or equal to the right operand.
& AND All comparisons must be true.
| OR Only one comparison must be true.

For example, in the CarSales data set, let's say that we also want to mark SalesQuarter based on SalesMonth variable. In a DATA step, you can do this by applying YYQ(yearquarter) function in IF-THEN statements: 

DATA CarSales;
INFILE '/home/u63368964/source/car-sales.dat';
INPUT SalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;
IF SalesMonth LE 3 THEN SalesQuarter = YYQ(SalesYear, 1);
IF SalesMonth GT 3 AND SalesMonth LE 6 THEN SalesQuarter = YYQ(SalesYear, 2);
IF SalesMonth GT 6 AND SalesMonth LE 9 THEN SalesQuarter = YYQ(SalesYear, 3);
IF SalesMonth GT 9 THEN SalesQuarter = YYQ(SalesYear, 4);
RUN;

PROC PRINT DATA=CarSales;
RUN;

Note that the created SAS date will be the first day of the quarter.

The IN operator checks if there is any matches in the list of values. For example, within an observation, if Model matches any of the list values ('Sedan', 'SUV', 'Hatchback'), then assigns 'Passenger' to CarType

DATA CarSales;
LENGTH Model $10. CarType $10.;
INFILE '/home/u63368964/source/car-sales.dat';
INPUT SalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;
IF Model IN ('Sedan', 'SUV', 'Hatchback') THEN CarType = 'Passenger';
ELSE CarType = 'Commercial';
RUN;

PROC PRINT DATA=CarSales;
RUN;

Note that a single IF-THEN statement can only have one action. However, by incorporating DO-END block, you can execute more than one action. For example: 

DATA CarSales;
LENGTH Model $10. CarType $10.;
INFILE '/home/u63368964/source/car-sales.dat';
INPUT SalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;
IF Model NOT IN ('Sedan', 'SUV', 'Hatchback') THEN DO;
CarType = 'Commercial';
SalesTax = Price * 0.06;
ELSE DO;
CarType = 'Passenger';
SalesTax = Price * 0.1;
RUN;

PROC PRINT DATA=CarSales;
RUN;

Subsetting IFs

Sometimes, it may be necessary to remove observations from a data set based on certain conditions. In such situations, you can use the DELETE statement along with the IF-THEN/ELSE statements. For example: 

DATA PassengerVehicleSales;
INFILE '/home/u63368964/source/car-sales.dat';
INPUT SalesMonth SalesDate SalesYear Model $ Color $ Quantity Price;
IF Model = 'Truck' THEN DELETE;
RUN;

PROC PRINT DATA=PassengerVehicleSales;
RUN;

Working with SAS Dates

In SAS, a variable can hold either character or numeric values. Any date values are represented as the number of days since January 1, 1960. However, this representation can be confusing, particularly when performing calculations involving time intervals like months or years. For example, the varying number of days in different months complicates tasks such as adding or subtracting a month from a date value. Similarly, determining the number of intervals between two dates is also challenging. To address these issues, you can employee the following functions:

  • INTNX('interval'fromn): This function adds n-intervals to the dates specified at from argument.
  • INTCK('interval'fromto): This function calculates how many intervals between from and to dates.

For example, let's consider the following DATA step:

DATA Employees;
INFILE '/home/u63368964/source/employees.csv' FIRSTOBS=2 DLM=',' TRUNCOVER;
/* Specifying date value representations */ FORMAT Dob DATE10. StartDate DATE10. EndDate DATE10. TenYearsAnniversary WORDDATE18.; INPUT EmployeeID $4. GivenName :$15. SurName :$15. Dob DDMMYY10. JobTitle :$25. StartDate :DATE10. EndDate ?:DATE10.; Age = INTCK('Year', Dob, TODAY()); IF EndDate = '.' THEN DO; YearsInOffice = INTCK('Year', StartDate, TODAY()); TenYearsAnniversary = INTNX('Year', StartDate, 10); NumDaysToAnniversary = INTCK('Day', StartDate, TenYearsAnniversary); END; ELSE TenureMonths = INTCK('Month', StartDate, EndDate); DROP EmployeeID JobTitle; RUN; PROC PRINT DATA=Employees; RUN;

In the DATA steps above, the INTCK() functions are employed to calculate number of years and days between the two SAS date variables. Here, TODAY() function retrieves the current date from the system. For example, Age = INTCK('Year', Dob, TODAY()); counts the number of years between the date values stored in Dob and system date. On the other hand, the INTNX() function is used to calculate the SAS date that is 10 years after the StartDate variable.

By default, the first year of a hundred-year span is set to be 1960. However, you can change this setting with the YEARCUTOFF= option. The YEARCUTOFF= option is helpful when the raw date values are stored with two-digits. For example, if a date value is provided by 07/04/76, it is ambiguous whether it means 1976, 2076, or possibly 1776, and so on. To avoid this issue:

OPTIONS YEARCUTOFF = 1950;

This statement ensures that the two-digit dates are between 1950 and 2049, and thereby 07/04/76 represents July 4th, 1976.

Manipulating Character Variables

In SAS, characters refer to any alphanumeric values enclosed by a pair of quotation marks. They represent categorical or factor variables in SAS data sets.

When processing these character variables, the standard arithmetic operators and functions typically cannot be applied directly. Even if an operator can be applied to the variables, the results could be tricky.

Concatenation Operator

One of the most common operator to manipulate character variables is the concatenation operator (||). This operator concatenates two character variables on its left and right-hand sides into a single variable. For example, let's consider the following DATA step:

DATA Employees;
INFILE '/home/u63368964/source/employees.csv' FIRSTOBS=2 DLM=',' TRUNCOVER;
INPUT EmployeeID $4. GivenName :$15. SurName :$15. Dob DDMMYY10. JobTitle :$25. StartDate :DATE10. TerminationDate ?:DATE10.; EmployeeName = TRIM(SurName) || ', ' || GivenName; KEEP SurName GivenName EmployeeName; RUN;

In the code lines above, concatenation operators combine SurName, ',', and GivenName. Here, to remove any trailing blank spaces in SurName, the TRIM function is applied.

Applying Comparison Operators on Character Variables

The comparison operators, such as > (GT), = (EQ), or ^= (NE), is not limited to numeric values; they can also be applied to compare two character variables. When comparing two character variables, the evaluation is based on either ASCII or EBCDIC collating sequence order. While IBM mainframes employ EBCDIC, most PCs, including SAS Studio environments, use ASCII, where blank spaces come first, followed by numbers, uppercase letters, lowercase letters, and non-Latin alphabets. For example:

DATA Employees;
INFILE '/home/u63368964/source/employees.csv' FIRSTOBS=2 DLM=',' TRUNCOVER;
INPUT EmployeeID $4. GivenName :$15. SurName :$15. Dob DDMMYY10. JobTitle :$25. StartDate :DATE10. TerminationDate ?:DATE10.; EmployeeName = TRIM(SurName) || ', ' || GivenName;
IF JobTitle >= 'Area Sales Manager' THEN DELETE; KEEP EmployeeName JobTitle; RUN;

This DATA step subsets raw data by comparing character strings. In the raw data file, there are 3,000 rows in total. However, after subsetting, the returning output contains only 66 observations.

Selected Functions for Character Variables

In addition to the two operators mentioned earlier, SAS also provides some built-in functions for character manipulations:

Function NameDescriptionExampleResult
ANYALNUM(arg, start)Returns position of first occurrence of any alphabetic character or numeral at or after optional start position.a='123 E St, #2 ';
x=ANYALNUM(a);
y=ANYALNUM(a,10);
x=1
y=12
ANYALPHA(argstart)Returns position of first occurrence of any alphabetic character at or after optional start position. a='123 E St, #2 ';
x=ANYALPHA(a);
y=ANYALPHA(a,10)
x=5
y=0
ANYDIGIT(argstart)Returns position of first occurrence of any numeral at or after optional start position.a='123 E St, #2 ';
x=ANYDIGIT(a);
y=ANYDIGIT(a,10)
x=1
y=12
ANYSPACE(argstart)Returns position of first occurrence of a white space character at or after optional start position. a='123 E St, #2 ';
x=ANYSPACE(a);
y=ANYSPACE(a,10)
x=4
y=10
CAT(arg-1arg-2, ... arg-n)Concatenates two or more character strings together leaving leading and trailing blanksa=' cat';
b='dog ';
x=CAT(a,b);

y=CAT(b,a);
x=' catdog '
y='dog  cat'
CATS(arg-1arg-2, ... arg-n)Concatenates two or more character strings together stripping leading and trailing blanksa=' cat';
b='dog ';
x=CATS(a,b);

y=CATS(b,a);
x='catdog'
y='dogcat'
CATX('separator-string'arg-1arg-2, ... arg-n)Concatenates two or more character strings together stripping leading and trailing blanks and inserting a separator string between arguments.a=' cat';
b='dog ';
x=CATX('&',a,b);

x='cat&dog'
COMPRESS(arg'char')Removes spaces or optional characters from argument. a=' cat & dog ';
x=COMPRESS(a);
y=COMPRESS(a,'&');
x='cat&dog'
y=' cat  dog '
INDEX(arg'string')Returns starting position for string of characters. a='123 E St, #2 ';
x=INDEX(a,'#')
x=11
LEFT(arg)Left aligns a SAS character expression. a=' cat';
x=LEFT(a);
x='cat '
LENGTH(arg)Returns the length of an argument not counting trailing blanks (missing values have a length of 1). a='my cat';
b=' my cat ';
x=LENGTH(a);
x=LENGTH(b);
x=6
y=7
PROPCASE(arg)Converts first character in word to uppercase and remaining characters to lowercase. a='MyCat';
b='TIGER';
x=PROPCASE(a);
y=PROPCASE(b);
x='Mycat'
y='Tiger'
SUBSTR(argpositionn)Extracts a substring from an argument starting at position for n characters or until end if no na='(916)734-6281';
x=SUBSTR(a,2,3);
x='916'
TRANSLATE(sourceto-1from-1, ... to-nfrom-n)Replaces from characters in source with to characters (one to one replacement only - you can't replace one character with two, for example). a='6/16/99';
x=TRANSLATE
(a,'-','/');
x=6-16/99
TRANWRD(sourcefromto)Replaces from character string in source with to character string. a='Main Street';
x=TRANWRD
(a,'Street','St.');
x='Main St.'
TRIM(arg)Removes trailing blanks from character expression. a='My ';
b='Cat';
x=TRIM(a)||b;
x='MyCat'
UPCASE(arg)Converts all letters in argument to uppercase.a='MyCat';
x=UPCASE(a)
x='MYCAT'

Post a Comment

0 Comments