DATA Step: Creating a New SAS Data Set

From the viewpoint of a statistician, data can be defined as a collection of column vectors containing observed values. Columns are referred to as variables, reflecting the fact that their values vary based on the observation (row) they capture. Collectively, these column vectors form a tabular array known as a data set.

Being a "Statistical Analysis System," SAS has a tabular data structure called SAS data set composed of columns and rows. Reflecting its primary purpose of statistical analysis, SAS calls the columns as variables and the rows as observations. Each variable in a SAS data set must have a unique name, length, and data type (numeric or character.) These attributes instruct SAS on how to store and handle data values in each variable.

The only kind of data structure that can be processed by SAS is the SAS data set. Thus, prior to any data processing, it is necessary to first import raw data files into a SAS data set through PROC IMPORT. Or create a new data set through a DATA step.

A SAS DATA step begins with the special keyword DATA, followed by two level name indication (SAS library and data set name, separated by a period) and some subsequent statements dictating where and how to reference raw data sources.

DATA SasLibrary.MyData;
    <Some instructions for creating MyData>
;
RUN; /* Executes the DATA step */

SAS data sets must adhere to the following naming rules:

Names must be 32 characters or fewer in length.
Start with an alphabet (A-Z, a-z) or an underscore (_).
Following the first character, names can contain alphabet, numbers, or underscore.
Data set names are not case sensitive.

Note that in the two-level data set naming, if you omit the library name, the output data set will be saved in the WORK library. SAS automatically deletes all data sets in the WORK library at the end of the current session. To prevent this, please make sure that you specify the library name in front of the period.

Manual Entry of Raw Data Values

When writing a test case for your SAS program, you probably include only a small number of observations. For the purpose, it is convenient to directly input the data values into your SAS program.

To indicate SAS that the raw data values will be provided inline, after adding the INPUT statement to specify the variable attributes, you should add the keyword DATALINES or CARDS. Any following lines comes after the keyword will be considered as the data entry. For example:

DATA MyData.WorkoutLogs;
    INPUT Timestamp :DATETIME. Excercise $ Reps Sets Weights;
    DATALINES;
27MAR2023:08:00:00 Squats 10 3 100
27MAR2023:08:15:00 Push-ups 15 3 0
27MAR2023:08:30:00 Bench-press 8 4 120
27MAR2023:08:45:00 Deadlifts 12 3 150
27MAR2023:09:00:00 Lunges 10 3 80
27MAR2023:09:15:00 Pull-ups 8 4 0
27MAR2023:09:30:00 Bicep-curls 12 3 40
27MAR2023:09:45:00 Planks . . .
27MAR2023:10:00:00 Shoulder-press 10 3 50
;
RUN; /* Executes the DATA step */

Note that the keyword DATALINES and CARDS are perfectly equivalent with no functional difference between them.

In the DATALINES block, for each observation, the data entries should be delimited by a space and any missing values should be marked by a period. SAS stores each delimited value to its destined variable. For example, the first data line, 27MAR2023:08:00:00 Squats 10 3 100, consists of five data values separated by spaces. Thus, the first value, 27MAR2023:08:00:00, will be stored as the first observation of the Timestamp. Similarly, the second value, Squats, will be stored as the first observation of the next variable, Exercise, and so forth. (The :DATETIME is called an informat, telling SAS how data entries are formatted.)

Importing External Data Files into SAS Data set

Most of the time, you will create a new data set by referencing an existing data file, which could be formatted as CSV, Microsoft Excel, or any other text file. To inform SAS that the raw data values will be sourced from an external file, you should add an INFILE statement with the file directory before the INPUT statement.

If you're working in SAS Studio, you should first upload the local file and then provide the file directory of the uploaded data file. For example, let's consider data values stored in your local machine as follows:

After uploading the file through the "Upload" button on the navigation pane, you can right-click on the file and select "Properties." Then SAS Studio opens a dialog with the exact location of the uploaded file.

Using the file location, you can import the uploaded file into a SAS data set in a DATA step:

DATA MyData.PowerConsumption;
    INFILE '/home/u63368964/source/household-power-consumption.txt' DLM=';' FIRSTOBS=2;
    INPUT Date :ANYDTDTE10. Time TIME8. 
          GlobalActivePower GlobalReactivePower Voltage GlobalIntensity
          SubMetering1 SubMetering2 SubMetering3;
RUN;

Controlling Inputs with the INFILE Options

When reading data from external sources through an INFILE statement, you can include some options to control data inputs. To add an INFILE option, simply append it after file directory, separated by a space. Below are commonly used INFILE options:

FIRSTOBS: Specifies the row number from which SAS begins to read data.
OBS: Specifies the row number at which SAS finishes reading data.
ENCODING: Specifies the character encoding of the file. Common encodings include 'latin1', 'utf-8', etc.
DLM: Short for "delimiter." It specifies the character that separates values in the file. SAS defaults to using a space as the delimiter, but you can specify others like comma (',') or tab ('\t'), depending on the source data format.
DSD: The "Delimiter Sensitive Data," or DSD, is an option to handle situations where data values in a delimited file contain the delimiter within the actual data. For example, consider a data file delimited by commas have a character string: "John Doe","123, Main St",25. Without DSD, SAS interprets this as four values. To prevent this, you can specify DSD=',' to ensure that SAS recognizes the entire quoted string as a single value.
MISSOVER: By default, SAS automatically moves to the next line to read next values, if SAS reached the end of the data line while there are remaining variables to fill in. The MISSOVER option tells SAS that if a data line runs out of value, assign missing values to the remaining variables.
TRUNCOVER: When the raw data file is fixed formatted with some trailing missing values, specifying TRUNCOVER option tells SAS to read data for the variable until it reaches the end of the data line or the last column specified in the format, whichever comes first.

For example, let's consider a text file delimited by commas as shown below. In the DATA step, we would like to read from the second to the seventh data line, excluding both the headers and comments. Also observe that the third and fifth rows exhibit trailing missing values, which have been truncated.

The following DATA step imports this data file into a SAS data set:

DATA MyData.HomeAddress;
    /* Specifying variable lengths */
    LENGTH ID $3 Name $50 Age 3 Sex $1 Address $100;
    INFILE '/home/u63368964/source/home-address.dat' ENCODING='utf-8' FIRSTOBS=2 OBS=7 DLM=',' TRUNCOVER;
    INPUT ID $ Name $ Age Sex $ Address;
RUN;

Creating a Data Set Bringing Observations from Another SAS Data set

Often, you want to create a SAS data set referencing from another data set. The SET statement brings observations from the specified data set and creates a new data set. For example:

DATA JapaneseAddress;
    SET MyData.HomeAddress;
    PostalCode = SUBSTR(Address, 1, 11);
    DROP ID Age Sex;
    RENAME Name = HouseholdName;
RUN;

This example DATA step creates the JapaneseAddress, referencing from Mydata.HomeAddress. After bringing all observations from MyData.HomeAddress, the DATA step creates a new variable by applying SUBSTR function to the Address variable. It also drops unnecessary variables using the DROP statement and renames Name by the RENAME statement.

DROP: Excludes variables from the output data set.
RENAME: Rename variables in the output data set.
KEEP: Retain specific variables in the output data set.

The INPUT Statement

The INPUT statement provides SAS overall guides for importing data into a new data set. It specifies the formatting of raw data values, which column values correspond to variables, and the name of the output variables. Here's the basic syntax of the statement:

INPUT Variable1 [$] [Column-position1 - Column-position2] [Informat.] ... ;

Variable1: Name of the first variable in the output data set. You can append variables to the statement as many times as you need.
Dollar Sign: If you want the Variable1 to be entered as a character variable, you should indicate it by adding a $.
Column-position1 - Column-position2 (Optional): These optional indices define the first and last column positions in the raw data.
Informat. (Optional): This specifies how the raw data is formatted (e.g., $240,000 or Mar 19, 2024).

List Input to Read Data

In a raw data file, when the values are separated by one or more spaces and there is no rule about their column positioning, we say the file is in a free format. To create a new SAS data set with a freely formatted data source, you can simply list the desired variable names in the INPUT statement. Variable names must be:

Starting with an alphabet (A-Z, a-z) or an underscore (_) only.
Not exceeding 32 characters.
Having numbers other than the 1st position.

For example, let's consider the following data file:

Now, let's import this file using the following INPUT statement. This kind of INPUT statement is referred to as list input:

DATA StudentGrades;
    INFILE '/home/u63368964/source/student-grades.dat';
    INPUT StudentID $ Name $ Math Science English;
RUN;

In the output data set, we see that the Math is entirely missing. This happens due to the existence of spaces between the first and last name in each data line. The list input is the most straightforward approach, but it comes with some limitations:

Strict Delimiter: All values must be strictly separated by at least one space. Or you must specify the delimiter by the DLM= option in the INFILE statement.
Missing Value Representation: Any missing values must be represented as a single period.
Character Variable Limitations: Character variable values must not contain any spaces, and their maximum length is 8 characters, unless specified otherwise using the LENGTH statement.

In this example DATA step, list input instructs SAS to interpret any space between the first and last name as a delimiter, resulting in two separate values being populated for each observation. Thus, the last names are mistakenly assigned to the Math, which is numeric. Since character values cannot be converted into numeric values, all Math entries are missing.

Reading Fixed Format Data

If all data values are neatly arranged in columns, you can read them by specifying column indicators. The column indicators specify the first and last column position for each variable entered in fixed format. For example:

DATA StudentGrades;
    INFILE '/home/u63368964/source/student-grades.dat';
    INPUT StudentID $ 1-4 Name $ 6-20 Math Science English;
RUN;

In this DATA step, the INPUT statement explicitly specifies raw data column ranges for each variable assignment. For example, any values in the 1-4 columns of the raw data will be assigned to the StudentID. Similarly, any values in the 6-20 columns will be assigned to the Name. For the remaining variables, we use the list input.

Compared to the list input, column indicators take several advantages:

Character variables can contain spaces and have values of up to 32,767 characters.
It's possible to skip values for unnecessary variables.
Since the positions of columns to read are specified, they can be read in any order, and the same column can be read repeatedly.

For example, let's consider the following DATA step:

DATA StudentGrades2;
    INFILE '/home/u63368964/source/student-grades.dat';
    INPUT Name $ 6-20 Math Science StudentID $ 1-4 MathScore 21-22;
RUN;

This DATA step creates Name, reading values in 6-20 column. Then it moves to the next values to populate Math, separated by a space. Similarly, it reads the next separated values to populate Science. These two variables are created by list input. After that, SAS reads values in 1-4 columns and populates StudentID. Lastly, it reads 21-22 column values again for another variable MathScore:

Pointer Controls in the INPUT Statement

By default, particularly when reading raw data without column indicators, the INPUT statement extracts data sequentially from the source, one value after another. However, this might not be ideal if your data isn't arranged in a straightforward way. This is where pointer controls come into play. They keep track of the current location for each data record and relocates as needed. Here are some common types of pointer controls:

Column Pointer Controls:

These allow you to move the pointer to a specific column position in the current data line.
@n moves the cursor to n-th column.
Sometimes you don't know the column number, but you do know that it always comes after a particular character string. In such situations, you can use the @'character string'.
+n moves the cursor to the n-th column after the current position.

Line Pointer Controls:

These controls are used to move the pointer to another line in the data record.
/ moves the cursor to the next line.
#n moves the cursor to the n-th line.

For example, data file in the figure shown below contains information about temperatures for the month of July for Alaska, Florida, and North Carolina. Each line contains the city, state, normal high, normal low, record high, and the record low temperature in degrees Fahrenheit.

Now, let's consider the following DATA step:

DATA JulyTemperatures;
    INFILE '/home/u63368964/source/temperatures.dat';
    INPUT @'State:' State $
          @1 City $
          @'State:' +3
          AvgHigh 2. AvgLow 2.
          RecordHigh RecordLow;
RUN;

In the DATA step, @'State:' relocates current column position wherever right after the specified string, 'State:', in the raw data file, and populates State. Then, @1 moves the current position to the first column and read values for City. Similarly, @'State:' +3 locates the cursor to 3 columns after 'State:' string.

Let's consider another example. This time, each observation spans three lines of the data file: the first line contains first and last name, second line contains age, and the third line contains height and weights.

Now, let's consider the following DATA step:

DATA HeightWeights;
    INFILE '/home/u63368964/source/height-weight.dat';
    INPUT FirstName $ LastName $
          / Age
          #3 Height Weight;
RUN;

In this DATA step, / moves cursor to the next line after reading two variables. Similarly, #3 relocates the cursor to the third line of the raw data file:

Reading Data based on Conditions

The at sign (@) located at the end of the INPUT statement without any column indicator holds the cursor and executes the next SAS statement. Combining this with an IF-THEN statement, you can have different INPUT statements based on the conditional evaluations. For example:

DATA MyData;
    INPUT Year 1-4 @;
        IF Year=1988 THEN INPUT Day 5-7 Amount 8-10;
        IF Year=1989 THEN INPUT Day 6-8 Amount 10-12;
    DATALINES;
19883.2149
19885.7614
1989 7.9 764
1989 6.8 875
;
RUN;

In this DATA step example, after reading a value for Year, @ holds the cursor and proceed to the subsequent SAS statements. If the condition in the IF clause is evaluated by true, then it takes an action specified in the THEN clause.

Character and Date Informats

Sometimes, raw data values adhere to specific formatting conventions. For example, "1,000,001" is intended to be interpreted as a numeric value. However, the inclusion of commas within the value can pose some challenges when importing the value into a SAS dataset.

In the INPUT statement, informats provide instructions to SAS on how to interpret raw data values. The informats has the following forms:

$informatw. /* Character */
informatw.d /* Numeric */
informatw. /* Date */

The $ signifies that it is a character informat, w represents the total width, and d indicates the number of decimal places (applicable only to numeric informats). For example, let's consider the following raw data file:

In the file, we have dates and sales amounts with specific formats. To read this raw data properly, the following DATA step specifies informats after variable names:

DATA Sales;
    INFILE '/home/u63368964/source/sales.dat';
    INPUT
        SalesID $4. 
        Product $ 6-16 
        SalesDate MMDDYY10.
        Amount COMMA8.2 
        @37 ProductWeight 4.2;
RUN;

Here, MMDDYY10. instructs SAS to read the 10 values, such as 01/05/2023, as the number of days between January 1, 1960 and January 5, 2024. This is referred to as a SAS date value. (As for why January 1, 1960? There's no official confirmation, but the common belief is that the SAS founders chose January 1, 1960, as an easy-to-remember reference point. It might have been around the approximate birth date of a popular computer system at that time, such as the IBM 370. Regardless of the reason, this fixed date simplifies the date calculations within SAS.)

The next eight values should be interpreted as a number. The informat COMMA8.2 tells SAS that the raw data values are formatted with commas and two decimal places. Note that the total width of 8 includes the two decimal places and commas. For example, 1,250.00 occupies a total of 8 column spaces, encompassing one comma, one period, and two decimals.

All things considered, the output data set would be as follows:

Here are some informats that are commonly used:

Informat	Raw Data	INPUT Statement	Results
Character
$CHARw. Reads character data - does not trim leading or trailing blanks.	John Doe Jane Doe	INPUT Name $CHAR10.;	John Doe Jane Doe
$UPCASEw. Converts character data to uppercase.	John Doe	INPUT Name $UPCASE10.;	JOHN DOE
$w. Reads character data - trims leading blanks.	John Doe Jane Doe	INPUT Name $10.;	John Doe Jane Doe
Date, Time, and Datetime
ANYDATEw. Reads dates in various date forms	1jan1961 01/01/61	INPUT MyDate ANYDATE10.;	366 366
DATEw. Reads dates in form: ddmmmyy or ddmmmyyyy	1jan1961 1 jan 61	INPUT MyDate DATE10.;	366 366
DATETIMEw. Reads dates in form: ddmmmyy hh:mm:ss.ss	1jan1960 10:30:15 1jan1961,10:30:15	INPUT MyDT DATETIME18.;	37815 31660215
DDMMYYw. Reads dates in from: ddmmyy or ddmmyyyy	01.01.61 02/01/61	INPUT MyDate DDMMYY8.;	366 367
MMDDYYw. Reads dates in form: mmddyy or mmddyyyy	01-01-61 01/01/61	INPUT MyDate 8.;	366 366
TIMEw. Reads time in form hh:mm:ss.ss (or hh:mm)	10:30 10:30:15	INPUT MyTime TIME8.;	37800 37815
Numeric
COMMAw.d Removes embedded commas and $, converts left parentheses to minus sign	$1,000,000.99 (1,234.99)	INPUT Income COMMA13.2;	1000000.99 -1234.99
COMMAXw.d Like COMMAw.d but switches role of comma and period	$1.000.000,99 (1.234,99)	INPUT Income COMMAX13.2;	1000000.99 -1234.99
PERCENTw. Converts percentages to numbers	5% (20%)	INPUT MyNum PERCENT5.;	0.05 -0.2
w.d Reads standard numeric data	1234 -12.3	INPUT MyNum 5.1;	123.4 -12.3

Informat Modifiers

Occasionally, you might be unsure about the exact width of the values. In such cases, you would provide a width that is reasonably long enough. However, this approach could result in output variables with some unwanted trailing values. For example, let's consider the following data file:

Suppose that you're not sure how long the width of the car model and color variable should be. However, executing the following DATA step would not work as intended:

DATA CarSales;
    INFILE '/home/u63368964/source/car-sales.dat';
    INPUT Month Day Year Model $15. Color $15. NumSales Price;
RUN;

The colon modifier directs SAS to read the values up to the specified width, but stops at the first space (or other delimiter) it encounters. This ensures that the trailing values are not included in the variable. To use a colon modifier, simply put a colon (:), right in front of the informat specification.

DATA CarSales;
    INFILE '/home/u63368964/source/car-sales.dat';
    INPUT Month Day Year Model :$15. Color :$15. NumSales Price;
RUN;

Let's consider another example. Suppose that you want to import this raw data file with four variables: Breeds, AvgHeight, AvgWeight, Temper. In the file, observations can potentially include a space.

Now, let's say that you're not sure about how long the Breed going to be. So, you cannot use column indicator. In such cases, to import this data file, you should use ampersand modifier &. It directs SAS to recognize a single space as part of a character variable, and two or more consecutive space characters are recognized as delimiters. For example.

DATA DogBreeds;
    INFILE '/home/u63368964/source/dog-breeds.dat';
    INPUT Breed &$50. AvgHeight AvgWeight Temper $;
RUN;

Lastly, tilde modifier ~ instructs SAS to recognize double quotation marks ("") as a part of variable. It is typically used along with a DSD option, which make sure that SAS ignores any delimiter inside the double quotation marks. For example:

In the data file shown above, all values are delimited by commas. At the same time, the first variable also contains a comma in the double quotation mark. To import this file:

DATA HeightWeights2;
    INFILE '/home/u63368964/source/height-weight-v2.txt' DLM=',' DSD;
    INPUT Name :$15. Age Sex $ Weight Height;
RUN;