To have greater control and flexibility over the program, you may consider learning the SAS Programming language. Some might argue that SAS products mostly function as point-and-click programs and there is no need to learn the language. However, while it is possible to perform basic data analysis solely through the menu-driven interface of SAS, learning its language enables users to understand what's going on behind the scenes of data processing, automate repetitive tasks, and customize the analysis program with a greater flexibility.
Basic Building Blocks of a SAS Program
In essence, a SAS program is a collection of multiple steps (procedures), each of which is either a DATA step or a PROC step. The DATA step creates a new SAS data set referencing raw data sources. On the other hand, the PROC step is responsible for data processing and analysis.
Regardless, every SAS procedure is a block of SAS statements. The SAS statements provide detailed instructions to SAS on how to handle data values. Here's a breakdown of the basic syntax of a SAS statement:
- SAS Keywords: Every SAS statement begins with a keyword that specifies the action you want SAS to undertake. For example, the DATA keyword initiates a new DATA step, PROC PRINT prints the data set most recently created, etc.
- Options/References: Following keywords, you may also include options or references to provide additional details for the action:
- Literals: Numbers or text strings values.
- Variables: Named column vectors in your SAS data set holding data values.
- Expressions: Combinations of variables, literals, and operators.
- Options: Additional specifications modifying the behavior of SAS statements.
- Semicolon: Every SAS statement must conclude with a semicolon. It acts as a signal to SAS that the instruction is complete. Omitting a semicolon is a very common mistake that even experienced SAS programmers often make. So, please double-check to ensure you haven't forgotten a semicolon in your SAS statement.
- SAS keywords and syntax itself are not case sensitive, meaning that they can be either in upper- or lowercase and there is no difference in terms of their functionality.
In essence, a SAS program is a collection of multiple steps (procedures), each of which is either a DATA step or a PROC step. The DATA step creates a new SAS data set referencing raw data sources. On the other hand, the PROC step is responsible for data processing and analysis.
Regardless, every SAS procedure is a block of SAS statements. The SAS statements provide detailed instructions to SAS on how to handle data values. Here's a breakdown of the basic syntax of a SAS statement:
- SAS Keywords: Every SAS statement begins with a keyword that specifies the action you want SAS to undertake. For example, the DATA keyword initiates a new DATA step, PROC PRINT prints the data set most recently created, etc.
- Options/References: Following keywords, you may also include options or references to provide additional details for the action:
- Literals: Numbers or text strings values.
- Variables: Named column vectors in your SAS data set holding data values.
- Expressions: Combinations of variables, literals, and operators.
- Options: Additional specifications modifying the behavior of SAS statements.
- Semicolon: Every SAS statement must conclude with a semicolon. It acts as a signal to SAS that the instruction is complete. Omitting a semicolon is a very common mistake that even experienced SAS programmers often make. So, please double-check to ensure you haven't forgotten a semicolon in your SAS statement.
- SAS keywords and syntax itself are not case sensitive, meaning that they can be either in upper- or lowercase and there is no difference in terms of their functionality.
There really aren't any widely accepted rules about how to format your SAS program. SAS statements can start in any column, continue on the next line (as long as you don't split words in two), be on the same line as other statements. However, neatly organizing your SAS code lines is always helpful, as it makes your program more maintainable.
Supported Data Types for a SAS Data Set
Supported Data Types for a SAS Data Set
Supported Data Types for a SAS Data Set
Adding Comments
Just like with any other programming languages, you can add some comments for the code reviewers and yourself in SAS. There are two main ways to include comments in your SAS program:
- Single-line comments:
- Start the comment with an asterisk followed by a space (* ).
- Any text until encountering a semicolon (;) is considered comment and ignored by SAS.
- Multi-line comments:
- Start the comment with /* followed by a space and end the comment with */.
- Everything between /* and */ is considered a comment, even if it spans multiple lines.
Just like with any other programming languages, you can add some comments for the code reviewers and yourself in SAS. There are two main ways to include comments in your SAS program:
- Single-line comments:
- Start the comment with an asterisk followed by a space (* ).
- Any text until encountering a semicolon (;) is considered comment and ignored by SAS.
- Multi-line comments:
- Start the comment with /* followed by a space and end the comment with */.
- Everything between /* and */ is considered a comment, even if it spans multiple lines.
Programming Tips
People who have no experience in any programming language often get frustrated when their programs don't work correctly on the first try. Don't try to tackle a long complicated program all at once. By starting small, building upon what works, and consistently checking your results, you can enhance your programming efficiency.
Even if you get errors, never get frustrated. Surprisingly, experienced SAS programmers could make simple mistakes; they forget to add a semicolon, misspell a word, or place statements in an incorrect order. These small mistakes can cause a whole list of errors. Sometimes, even when programs run without throwing errors, they may still be incorrect. It is always a good practice to test your code with small cases.
People who have no experience in any programming language often get frustrated when their programs don't work correctly on the first try. Don't try to tackle a long complicated program all at once. By starting small, building upon what works, and consistently checking your results, you can enhance your programming efficiency.
Even if you get errors, never get frustrated. Surprisingly, experienced SAS programmers could make simple mistakes; they forget to add a semicolon, misspell a word, or place statements in an incorrect order. These small mistakes can cause a whole list of errors. Sometimes, even when programs run without throwing errors, they may still be incorrect. It is always a good practice to test your code with small cases.
Iterative DO Statements
In a previous blog post, the DO-END block was introduced as a way to group multiple actions for an IF-THEN statement. However, its usage is not limited to IF-THEN statement. In fact, the DO-END blocks can also be employed to perform iterative operations in a DATA step.
DO-TO Statements
The DO-TO statements specifies a variable that iterates over a range of numeric values. It has the following general form:
DO Variable = start TO end BY n;<Some instructions you want to execute for each iteration>END;
Where:
- Variable: This is a numeric variable that you define, which can be any valid SAS variable name.
- start and end: Numeric values or expressions that define the range of values for the variable. The DO loop iterates from the start to the end, incrementing n for each iteration.
- n: This specifies the increment by which a variable increases with each iteration. If BY clause is omitted, n defaults to 1.
For example, let's consider the following DATA steps:
DATA MySequence;DO i = 1 TO 5;x = i * 10;OUTPUT;END;RUN;
In the DATA step above, DO i = 1 TO 5; initializes a loop where i ranges from 1 to 5. Next, the following line calculates a variable x by multiplying the current value of i by 10. Then OUTPUT statement writes each iteration's result to the MySequence data set. After each iteration, the value of i increases by 1 until it reaches to 6, making sure we have 5 iterations from i = 1 to i = 5.
You can also nest DO-TO iterations. For example:
DATA MyPrimes (DROP=divisor IsPrime);DO num = 2 TO 100;IsPrime = 1;/* Check if num is divisible by any number other than 1 and itself */DO divisor = 2 TO SQRT(num);IF MOD(num, divisor) = 0 THEN DO;IsPrime = 0;LEAVE; /* Exit the inner loop early if not prime */END;END;/* Output num if it is prime */IF IsPrime = 1 THEN OUTPUT;END;RUN;
In this example, the outer DO loop iterates through num from 2 to 100, and initiates a variable IsPrime to 1. Subsequently, the inner DO loop iterates through potential divisors (divisor) from 2 to the square root of num. Here, for each divisor, if MOD(num, divisor) equals to 0, then change IsPrime to 0, and exit the inner loop using LEAVE statement. On the other hand, if IsPrime value remains as 1, then outputs the num to output data set MyPrimes. Lastly, the suboption (DROP=divisor IsPrime) drops the two temporary variables divisor and IsPrime.
In the DO-TO statement, specifying a BY clause allows you to set increment amount for each iteration. For example:
DATA MySample;DO i = 2 TO N BY 10;SET MyData.Boston NOBS=N POINT=i;IF _ERROR_ THEN ABORT;OUTPUT;END;STOP;RUN;
This loop iterates a variable i from a starting value of 2 to a final value of N. Here, the loop increments i by 10 in each iteration (i.e., 2, 12, 22, ..., N). For each iteration, the SET statement brings all observations from MyData.Boston, assigns the number of observations to N, and retain observations whose automatic variable _N_ equals to i. Consequently, MySample will extract every 10th observations from MyData.Boston, sampling 51 out of 506 observations. This is the concept of systematic sampling, that will be discussed in a later post.
DO-WHILE and DO-UNTIL Statements
At the time of writing code, sometimes, you don't know exactly how many times you would like to iterate over. Instead, you only know the condition of when the iteration should stop. In such cases, you can use DO-WHILE statements. At the beginning of each iteration, it evaluates if the specified condition is true. If it is, then the DO-WHILE statements executes the block of statements. The iteration continues until the condition evaluates to false, at which point the loop terminates. For example:
DATA MyData;x = 1;/* Increment x while it is less than or equal to 10 */DO WHILE(x <= 10);y = x ** 2;OUTPUT;x + 1; /* Increment x by 1 */END;RUN;
In this DATA step, x = 1; initializes the variable x to 1. Then DO WHILE(x <= 10); begins a loop that continues while x is less than or equal to 10. For each iteration, SAS first checks if the given condition x <= 10 is true. If it is, then it performs the statements under the DO-WHILE-END block. Then, at the end of each iteration, x + 1; increases x by 1. This makes sure that the loop terminates at some point.
DO-UNTIL statements work similar to DO-WHILE blocks. However, unlike DO-WHILE statement, DO-UNTIL checks the condition is false, at the end of each iteration. When it is submitted for execution, DO UNTIL statement first executes statements under the block. Then at the end of the first iteration, it evaluates if the given condition is still false. If it is, then SAS moves to the next iteration. After executing the statements, if the condition evaluates to true, then SAS terminates the loop. For example:
DATA MyData;x = 1;/* Increment x while it is less than or equal to 10 */DO UNTIL(x > 10);y = x ** 2;OUTPUT;x + 1; /* Increment x by 1 */END;RUN;
ARRAY Statements
In computer programming, an array refers to an ordered collection of items with the same data type. Each item in an array is uniquely identified by an index, so that you can access and manipulate specific items based on their indices.
The ARRAY statement in SAS DATA step groups multiple data set variables, as long as they are either all numeric or character. By grouping variables under a single name, you can perform repetitive tasks without having to write statements for each variable. The ARRAY statement follows this general form:
ARRAY Name (n) $ Variable-list;
Where:
- Name: Name you give to the array.
- n: Number of variables in the array.
- $: Indicates if the array consists of all character variables.
- Variable-list: list of all variables you want to include in the array
For example, let's consider the following data set:
The wine quality data set, created by Paulo Cortez and his team at the University of Minho, Portugal, has 13 variables and 6497 observations on wine quality metrics. It measures 11 physicochemical test scores, such as acidity, alcohol, pH, etc. on wines produced in the Vinho Verde region of Portugal, as well as 0 to 10 quality scores.
Now, for each observation, suppose that you want to check if any of the acidity measurement is missing, and if so, delete the observation.
DATA WineQualityCleaned;SET WineQuality;ARRAY WineAttributes (11) FixedAcidity VolatileAcidity CriticAcid ResidualSugar Chlorides FreeSulfur TotalSulfur Density pH Sulphates Alcohol;DO i = 1 TO 11;IF WineAttributes[i] = . THEN DELETE;END;RUN;
After executing the DATA step above, the WineQualityCleaned has 6463 observations: 34 observations with missing values were excluded in the new data set creation.
In the DATA step, an array WineAttributes is defined with 11 variables. Subsequently, there is an iterative DO statement. All statements in the DO and END statement are executed, in this example, eleven times, once for each variable in the array.
The variable i act as an index. Starting from the value of 1, each time through the DO loop, i is incremented by 1, and is referencing the i-th variable in the array. For example, WineAttributes[1] refers to FixedAcidity, WineAttributes[2] refers to VolatileAcidity, and so forth. This continues through all eleven variables in the array, leaving the value of 12 in i.
Shortcuts for Variable Lists
When there are too many variables, instead of listing them all, you can use shortcuts for list of variable names. For example:
DATA WineQualityCleaned2;SET WineQuality;ARRAY WineAttributes (11) FixedAcidity -- Alcohol;DO i = 1 TO 11;IF WineAttributes[i] = . THEN DELETE;END;RUN;
In this example DATA step, FixedAcidity -- Alcohol refers to all variables from FixedAcidity to Alcohol. Thus, this DATA step will make the exactly the same output:
In addition to this, SAS also reserves some special name lists:
- _NUMERIC_: All numeric variables in a dataset.
- _CHARACTER_: All character variables in a dataset.
- _ALL_: All variables in a dataset.
Using the special name lists, you can also shortcut variables like:
- FixedAcidity _NUMERIC_ Alcohol: All numeric variables from FixedAcidity to Alcohol.
- FixedAcidity _CHARACTER_ Alcohol: All character variables from FixedAcidity to Alcohol.
- FixedAcidity _ALL_ Alcohol: All variables from FixedAcidity to Alcohol.
Note that the name abbreviations are not limited to the ARRAY statement. In fact, you can also employ this strategy anywhere in a DATA step or even in a PROC step to make your code more maintainable.
0 Comments