SAS (Statistical Analysis System) is a comprehensive software suite for advanced analytics, data management, and business intelligence. Originally started as a project to support agricultural research at North Carolina State University, SAS quickly gained widespread adoption beyond academia. It has a long history as a leading data solution for enterprises.
In recent years, however, SAS has faced increasing challenges in the rapidly evolving data analytics market. While it remains a preferred choice in some highly regulated industries, such as finance and pharmaceuticals, its high subscription costs, often exceeding six figures annually, and steep learning curve have made it less accessible to startups and individual learners. The rise of open-source tools like Python and R, driven by collaborative innovation in the data science community, has further highlighted the limitations of SAS as a proprietary software suite. These open-source platforms boast extensive libraries, rapid development cycles, and active communities that foster constant innovation, particularly in AI and machine learning. Research and talent have become heavily concentrated around these open-source tools, making SAS appear increasingly outdated in a fast-paced industry.
In response, SAS introduced SAS OnDemand for Academics (ODA). SAS ODA provides free access to SAS Studio, a web-based platform for writing and executing SAS programs. This platform enables individual learners and small-scale users to perform statistical analysis, build predictive models, and explore data without prohibitive license fees. At the same time, SAS ODA aims to ensure a steady pipeline of professionals skilled in SAS tools and to encourage broader adoption across industries.
This tutorial provides an introduction to SAS using SAS ODA. We'll begin with an overview of the SAS Studio user interface, followed by an introduction to core concepts and terminology of the SAS language. Let's get started!
Your First Look at SAS Studio
To begin, navigate to the SAS OnDemand for Academics website. If you haven't already, create a free account. Once logged in, click the "Launch" button to start a new SAS Studio session. This gives you a web-based environment for writing and executing SAS programs.
Let's break down the SAS Studio interface. It's made up of three key parts: the top menu, the central work area, and the navigation pane on the left.
- Top Menu:
  - Search and Open Files: Provides quick search and file-opening interfaces for files stored in your SAS Studio environment. Click the magnifying glass icon to search for specific files, or the folder icon to open existing files.
  - Perspective Switching: Switch between the "SAS Programmer" and "Visual Programmer" perspectives to suit your workflow needs.
  - Access to Documentation and Support: Click the question mark icon to access the official documentation for SAS Studio and SAS products. For additional support, connect with other SAS Studio users on the official forum.
- Navigation Pane:
  - Server Files and Folders: Access and manage the files stored in your SAS Studio environment.
  - Tasks and Utilities: SAS provides a user-friendly interface for a wide range of common tasks, including statistical modeling, econometrics, data mining, and network analysis. Once your SAS dataset is ready, you can use the point-and-click interface to perform complex analyses.
  - Snippets: Provides pre-built code snippets for common data processing tasks. You can also create and save your own code snippets here for later use.
  - Libraries: Organize and store your SAS datasets for easy access.
  - File Shortcuts: Create and manage shortcuts to frequently used files.
- Work Area: This is the main area where you write, edit, run, and debug your SAS programs.
SAS Programmer Perspective vs. Visual Programmer Perspective
SAS Studio offers two "perspectives" for different workflow needs. The SAS Programmer perspective is the default mode when you first open SAS Studio. In this perspective, you can directly write, edit, run, and debug SAS code.
To execute your SAS code, click the "Run" button in the upper-left corner of the toolbar. If a portion of the code is highlighted, only that selection is executed. Otherwise, the entire program is executed.
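As a small program to try in the editor, here is a minimal sketch. It assumes the built-in sample table sashelp.class, which is not mentioned in the original text but ships with SAS installations.

```sas
/* Print the first five rows of the sample table */
PROC PRINT DATA=sashelp.class (OBS=5);
RUN;

/* Summarize two numeric columns of the same table */
PROC MEANS DATA=sashelp.class;
    VAR height weight;
RUN;
```

Highlighting only the PROC PRINT step before clicking "Run" should execute that step alone; with nothing highlighted, both steps run in order.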
You can save your SAS program by clicking the "Save" button. Program files created in the SAS Programmer perspective are saved as .sas files.
The Visual Programmer perspective, on the other hand, lets you visually construct and execute data analysis workflows using a drag-and-drop interface. Working files are saved as Process Flow files with the .cpf extension.
In this perspective, you can select items and move them onto the main work area. These items can be SAS datasets (.sas7bdat files), SAS programs (.sas files) that you created in the SAS Programmer perspective, or pre-defined tasks from the "Tasks and Utilities" section. Each added item, referred to as a "node", can then be connected to others to create a visual representation of the data analysis process. This visual workflow provides a clear overview of the entire data analysis project.
To edit an added SAS program, right-click the corresponding node and select "Open". A new tab opens for editing the SAS code. To execute the workflow up to a specific point, right-click that node and select "Run". All connected nodes preceding that point are executed sequentially.
Getting Started with SAS Language
SAS Studio provides menu-driven front ends for many tasks, as seen in the "Tasks and Utilities" section of the navigation pane. This section includes tasks for data visualization, statistical tests, and data preprocessing. At the same time, SAS supports custom programming through its own language.
Some people argue that, because SAS already provides a menu-driven interface for most of its functionality, there is no point in learning the language. However, these front ends rely on the SAS language, generating code behind the scenes. Learning to write your own programs gives you much more flexibility and control over your analysis.
A SAS program is essentially a sequence of SAS statements executed in order. Each statement provides instructions to SAS about how to perform a specified task. These instructions must be placed appropriately within the program. An everyday analogy to a SAS program is placing an order at a coffee shop. You enter your coffee shop, stand in line, and when you finally reach the counter, you say what you want:
I would like a medium latte.
Please make it with oat milk.
No sugar, and no extra foam.
Also, add a blueberry muffin.
Notice that you first express the general request--ordering a latte--and then provide additional details. The sequence of these details may vary slightly, but they all support the initial request. You wouldn't, for example, walk up to the counter and abruptly say, "Add a blueberry muffin!" without any context. That would confuse the barista and disrupt the process. Similarly, your request must stay consistent; you wouldn't say, "Add whipped cream" when you just specified no sugar and no extra foam.
A SAS program works the same way: it's an ordered set of SAS statements, just like the structured set of instructions you use when ordering at a coffee shop. Aside from miscellaneous statements, such as LIBNAME to create a SAS library or RUN to execute the preceding statements, there are two main "general requests" that you can place in SAS: the DATA step and the PROC step.
In SAS programming, the DATA step consists of a series of SAS statements used to create a new dataset. Just as you begin by specifying the kind of coffee you want (e.g., "I would like a medium latte"), the DATA step starts by defining the destination library and name for the output dataset (e.g., mydata.sample_data, where mydata is the library and sample_data is the dataset name). You then specify the raw data source and outline how to process the referenced data.
For example:
```sas
/* Create a new dataset named 'sample_data' in the 'mydata' library
   (the mydata library must already be assigned) */
DATA mydata.sample_data;
    INFILE DATALINES;
    INPUT name $ age height;
    DATALINES;
John 30 72
Jane 25 .
David . 70
Mary 35 68
;
/* Executes the current DATA step */
RUN;
```
In the example above, the DATA statement defines the name and location of the new dataset, sample_data, within the mydata library. The INFILE DATALINES; statement specifies that the raw data is included directly in the program, using the DATALINES statement[1][2]. The INPUT statement defines the variables to be read: name (a character variable, indicated by $), age (numeric), and height (numeric). The DATALINES section provides the raw data values in a structured format, where each line represents a data row, column values are separated by spaces, and missing values are marked by periods. Finally, the RUN statement executes the DATA step, creating the sample_data dataset in the specified library.
On the other hand, PROC steps are designed for specific tasks, such as data manipulations[3], data visualizations, or data reporting. All these tasks, however, are done in a way that is pre-defined and fully verified by SAS. While you can add some customized instructions and options to an existing PROC (for example, using different model specifications within PROC REG for regression analysis), you do not create new PROC steps from scratch.
You can think of it as ordering a cup of coffee with some pre-designed recipes. Just as you wouldn't try to invent a new recipe for your beverage on the spot, you rely on the standardized and tested procedures for achieving specific outcomes. You can customize your coffee with some options like using oat milk or adding extra shots, but the available customizations for the coffee are generally confined within the existing recipe. This approach ensures consistent and reliable results for your analysis.
For example:
```sas
/* Calculate summary statistics for mydata.sample_data */
PROC MEANS DATA=mydata.sample_data;
    LABEL height = "Height (inches)";
RUN;
```
The MEANS procedure above calculates summary statistics for the mydata.sample_data dataset. The LABEL statement assigns a more descriptive label to the height variable. This is one of the available instructions that you can add to the procedure. However, the specific way the data is processed when calculating summary statistics--how missing values are handled, in this example--is hidden from the end user; you must remain within the established framework of SAS[4].
Every procedure starts with the keyword PROC, followed by the name of the procedure, such as MEANS in the example above. This introductory line is called the PROC statement.
Within the PROC statement, you can optionally specify the dataset to which the procedure should be applied using the DATA= option. For example, mydata.sample_data is explicitly specified for the MEANS procedure above. Omitting the DATA= option, however, is not considered good practice. If it is left out, SAS defaults to the most recently created dataset, which may not be the one you intended. So always specify the DATA= option to keep your code clear.
After the first line, the subsequent statements depend heavily on the specific procedure being used. Each procedure has its own set of instructions and options that tailor its behavior and output to meet your analytical needs. However, some statements are commonly used across many procedures, including those for data filtering, grouping, adding labels to columns, and specifying titles and footnotes. Here's a breakdown of each:
TITLE and FOOTNOTE Statements
The TITLE and FOOTNOTE statements add titles and footnotes, respectively, to your procedure output. Both are global statements, meaning that they are technically not part of any PROC or DATA step. However, since they apply to procedure output, it generally makes sense to place them with the procedure.
The TITLE statement consists of the keyword TITLE followed by your desired title enclosed in quotation marks. Similarly, the FOOTNOTE statement follows the same syntax, with the keyword FOOTNOTE preceding your footnote text enclosed in quotation marks. Note that you can also use double quotation marks instead of single ones; there is no functional difference, and it is purely a matter of preference.
If your title or footnote text contains an apostrophe, you have two options: enclose the text in double quotation marks, or keep the single quotation marks and write the apostrophe as two consecutive single quotation marks ('').
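Both ways of handling an apostrophe can be sketched as follows. The sashelp.class sample table and the title and footnote texts are assumptions chosen for illustration, not part of the original tutorial.

```sas
/* Option 1: double quotation marks around text with an apostrophe */
TITLE "Students' Height Summary";

/* Option 2: single quotation marks, with the apostrophe written twice */
FOOTNOTE 'Source: the instructor''s sample data';

PROC MEANS DATA=sashelp.class;
    VAR height;
RUN;

/* Global statements stay in effect, so clear them when done */
TITLE;
FOOTNOTE;
```

Because TITLE and FOOTNOTE are global, the empty TITLE; and FOOTNOTE; statements at the end keep the texts from appearing on later output.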
LABEL Statements
By default, SAS uses variable names to label your output. If you need more descriptive names for your variables, you can create them with the LABEL statement. Each label can be up to 256 characters long.
Note that when a LABEL statement is used in a DATA step, the labels become part of the dataset. On the other hand, when used in a PROC step, the labels stay in effect only for the duration of that particular step.
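The difference between the two placements can be sketched as below. The sashelp.class sample table, the output dataset name work.class_labeled, and the label texts are assumptions made for this illustration.

```sas
/* LABEL in a DATA step: the label is stored with the new dataset */
DATA work.class_labeled;
    SET sashelp.class;
    LABEL height = "Height (inches)";
RUN;

/* LABEL in a PROC step: the weight label applies to this step only */
PROC MEANS DATA=work.class_labeled;
    LABEL weight = "Weight (pounds)";
    VAR height weight;
RUN;

/* The variable listing should show a label for height but not for weight */
PROC CONTENTS DATA=work.class_labeled;
RUN;
```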
BY Statement
The BY statement specifies the variable(s) by which a procedure groups or orders observations. It is required for PROC SORT, where it names the variables to sort by. For all other PROCs, the BY statement is optional.
The variables listed in the BY statement are referred to as BY variables. When used in a PROC other than PROC SORT, the BY statement instructs SAS to perform separate analyses for each unique combination of the BY variable values. For this to work, however, the SAS dataset must already be sorted by the BY variables, typically with PROC SORT. Otherwise, SAS will throw an error.
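The failure case can be reproduced with a short sketch. It assumes the sashelp.class sample table, whose rows are not ordered by the sex variable; neither the table nor the variable choice comes from the original text.

```sas
/* sashelp.class is not sorted by sex, so this step is expected
   to stop with a "not sorted" error in the log */
PROC MEANS DATA=sashelp.class;
    BY sex;
    VAR height;
RUN;
```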
Running PROC MEANS with a BY variable on data that has not been sorted by that variable produces an error in the SAS log. Once the observations are sorted by the BY variable, SAS applies the MEANS procedure separately to each unique value of the variable.
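A minimal sketch of the sort-then-analyze pattern follows, again assuming the sashelp.class sample table. The output name work.class_sorted is a hypothetical choice; OUT= is used so that the sorted copy goes to the WORK library and the sample table itself is left unchanged.

```sas
/* Sort a copy of the data by the BY variable */
PROC SORT DATA=sashelp.class OUT=work.class_sorted;
    BY sex;
RUN;

/* One set of summary statistics per value of sex */
PROC MEANS DATA=work.class_sorted;
    BY sex;
    VAR height;
RUN;
```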
Subsetting in Procedures with the WHERE Statement
One optional statement for any PROC that reads a SAS dataset is the WHERE statement. It allows you to specify a subset of the data to be used in the analysis. While you can also achieve this through a DATA step with IF statements, the WHERE statement serves as a convenient shortcut. Unlike subsetting IFs, which create a new SAS dataset after filtering, the WHERE statement in a PROC directly filters observations and applies the procedure on the current dataset. Thus, it is typically more efficient to use the WHERE statement than to first use subsetting IFs and then apply the procedure.
Here are the most frequently used operators for conditional expressions:
| Symbolic | Mnemonic | Example |
|---|---|---|
| = | EQ | WHERE Make = 'Acura'; |
| ^=, ~=, <> | NE | WHERE Make ^= 'Acura'; |
| > | GT | WHERE MSRP > 40000; |
| < | LT | WHERE MSRP < 40000; |
| >= | GE | WHERE MSRP >= 40000; |
| <= | LE | WHERE MSRP <= 40000; |
| & | AND | WHERE Make = 'Acura' AND MSRP <= 40000; |
| \|, ! | OR | WHERE Make = 'Acura' OR Make = 'Audi'; |
|  | IS NOT MISSING | WHERE MSRP IS NOT MISSING; |
|  | BETWEEN AND | WHERE MSRP BETWEEN 30000 AND 40000; |
|  | CONTAINS | WHERE Make CONTAINS 'ura'; |
|  | IN (list) | WHERE Make IN ('Acura', 'Audi', 'BMW'); |
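To show a WHERE statement inside a complete procedure, here is a minimal sketch. It assumes the Make and MSRP columns in the table refer to the sashelp.cars sample dataset, which the original text does not name explicitly.

```sas
/* Summarize MSRP for three makes, restricted to cars at or below 40,000 */
PROC MEANS DATA=sashelp.cars;
    WHERE Make IN ('Acura', 'Audi', 'BMW') AND MSRP <= 40000;
    VAR MSRP;
RUN;

/* Print only the rows whose make contains the text 'ura' */
PROC PRINT DATA=sashelp.cars;
    WHERE Make CONTAINS 'ura';
    VAR Make Model MSRP;
RUN;
```

Neither step creates a new dataset; the filter applies only while the procedure reads the data.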
Basic Syntax of SAS Language
Like any language, SAS has its own set of rules to follow when writing statements. Thankfully, the rules for writing SAS statements are simpler and fewer than those of English.
The first and foremost rule is:
Every SAS statement ends with a semicolon.
This sounds very simple. However, omitting a semicolon at the end of a statement is a very common mistake that even experienced SAS programmers often make. Keeping this simple rule in mind and habitually double-checking the ends of your SAS statements will give you a head start.
The second rule is:
SAS is not case sensitive.
This means that SAS keywords and other objects, including libraries, datasets, and table columns, can be written in either uppercase or lowercase; there is no functional difference between the two. The only case-sensitive element in SAS is the stored data values. For better readability, however, I recommend using uppercase for SAS keywords and lowercase for user-created objects, such as libraries and datasets, when writing SAS programs.
Lastly:
Statements can start in any column, regardless of the position of other statements.
SAS statements can start in any column, continue on the next line (as long as you don't split words in two), or appear on the same line as other statements. Most SAS statements start with a keyword, and every one ends with a semicolon. So there really aren't any specific rules about how to lay out your SAS statements. However, neatly organizing statements is always beneficial, as it improves the readability and maintainability of your program.
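The three rules can be illustrated with one step written in two layouts. This is a sketch that assumes the sashelp.class sample table, which is not part of the original text; both versions should produce the same listing.

```sas
/* Conventional layout: uppercase keywords, one statement per line */
PROC PRINT DATA=sashelp.class (OBS=3);
    VAR name age;
RUN;

/* Same step: lowercase keywords, uppercase names, all on one line */
proc print data=SASHELP.CLASS (obs=3); var NAME AGE; run;

/* Same step again: one statement continued across several lines */
PROC PRINT
        DATA=sashelp.class (OBS=3);
    VAR name
        age;
RUN;
```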
Adding Comments
To make your SAS program easier to understand, you can add comments. SAS ignores any text inside a comment, so you can put anything in one--such as your favorite coffee recipe. However, comments are meant to annotate the program, helping others (or yourself) understand what you've done and why.
There are two main ways to include comments in a SAS program script:
- Single-line comments:
  - Start the comment with an asterisk (*).
  - Everything up to the next semicolon (;) is treated as a comment and ignored by SAS.
- Multi-line comments:
  - Start the comment with /* and end it with */.
  - Everything between /* and */ is treated as a comment, even if it spans multiple lines.
Note that some operating environments interpret a slash-asterisk (/*) in the first column as the end of a job or script. This is typically associated with older conventions, such as mainframe environments (e.g., IBM z/OS) or outdated versions of SAS running on UNIX or Linux. When working on such systems, be careful about where you start a comment block. This is not a concern for SAS Studio users.
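Both comment styles are shown in the sketch below, which assumes the sashelp.class sample table (not part of the original text).

```sas
* Print the first five rows of the sample table;
PROC PRINT DATA=sashelp.class (OBS=5);
    /* Keep only the columns
       needed for a quick check */
    VAR name age;
RUN;
```

The asterisk-style comment ends at its semicolon, so the text of such a comment cannot itself contain a semicolon; the /* */ style has no such restriction.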
SAS Data Sets
From a statistician's perspective, data can be defined as a collection of column vectors, each containing observed values. These columns are referred to as variables, as their values vary across observations (rows). Collectively, these column vectors form a tabular structure known as a dataset.
As a "Statistical Analysis System", SAS also organizes data into SAS datasets: structured tables of data with variables (columns) and observations (rows), stored as .sas7bdat files. This is the only file format that SAS can directly handle. Data from external sources, such as CSV files or Excel spreadsheets, must be "imported"[1] into a SAS dataset before analysis.
In essence, a DATA step creates a new SAS dataset from the data sources provided to it. This involves:
- Reading data from various sources (e.g., existing SAS datasets, external files, or DATALINES).
- Transforming data through calculations, operations, and manipulations. These transformed data will fill out the new, empty SAS dataset object.
- Saving the new dataset as a .sas7bdat file under a specified SAS library.
For example, let's consider the following DATA step:
```sas
/* Create a new dataset named 'mydata' in the 'sasdata' library */
DATA sasdata.mydata;
    INFILE DATALINES;
    INPUT name $ age height;
    DATALINES;
John 30 72
Jane 25 65
David 40 70
Mary 35 68
;
/* Executes the current DATA step */
RUN;
```
This DATA step creates a new dataset named mydata in the sasdata library.
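The read, transform, and save stages listed earlier also apply when the source is an existing SAS dataset rather than DATALINES. Here is a minimal sketch, assuming the sashelp.class sample table (whose height column is in inches) and a hypothetical output name work.class_metric.

```sas
DATA work.class_metric;
    /* Read: take rows from an existing SAS dataset */
    SET sashelp.class;
    /* Transform: derive a new variable from an existing one */
    height_cm = height * 2.54;
    /* Save: keep only these variables in the output dataset */
    KEEP name age height_cm;
RUN;

/* Inspect the first rows of the result */
PROC PRINT DATA=work.class_metric (OBS=5);
RUN;
```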
Naming Rules in SAS Language
When creating a SAS dataset, adhere to the following naming rules:
- Names must be 32 characters or fewer in length.
- Names must start with a letter (A-Z, a-z) or an underscore (_).
- After the first character, names can contain letters, digits, or underscores.
- Data set names are not case sensitive.
To prevent data loss, always specify a library name for your datasets. Omitting a library name will result in the dataset being stored in the temporary WORK library, which is automatically deleted at the end of the SAS session. This practice also helps organize related data files under a common project name.
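The naming rules and the WORK default can be sketched as follows. All dataset names here are hypothetical, and the LIBNAME path inside the last comment is a placeholder rather than a real location.

```sas
/* One-level name: stored in the temporary WORK library */
DATA scores_2024;
    score = 90;
RUN;

/* Two-level name that refers to the same temporary library explicitly;
   a leading underscore is allowed */
DATA work._scores_backup;
    SET work.scores_2024;
RUN;

/* Invalid, because the name would start with a digit:
   DATA 2024_scores;                                    */

/* For a permanent copy, assign a library first, for example:
   LIBNAME myproj "/home/your-user-id/myproj";
   DATA myproj.scores_2024; SET work.scores_2024; RUN;        */
```

The first two datasets disappear when the session ends; only a dataset written to an assigned, non-WORK library persists.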
[1] When data source is directly provided by DATALINES, you can omit the INFILE statement. ↩