Creating a SAS Datasets from Raw Data

In


When starting a new analysis project, your data is typically not saved in the SAS dataset format (*.sas7bdat). It may come in the form of an Excel spreadsheet, an SPSS dataset, or a text file. However, SAS cannot directly process these file formats. So, you must first create a new SAS dataset by referencing these external data sources.


If you want to work with your own data in SAS OnDemand for Academics, you'll first need to upload it to your account. This tutorial shows how to upload and import SAS and non-SAS data files (such as Excel and CSV) into SAS ODA for analysis.

How to Get Your Own Data into SAS OnDemand for Academics

When starting a data analysis project in SAS OnDemand for Academics, your data files may or may not be in the SAS data file format (*.sas7bdat). Your data might be an Excel spreadsheet, an SPSS dataset, or a text or CSV file.

Regardless of what format your data is in, the process of getting your data into your SAS OnDemand for Academics account--that is, your data file must actually be uploaded to the cloud. Then, you must "import" the uploaded file into SAS Studio (using a process that is what you would do with "desktop" SAS). The first step is unique to those using SAS OnDemand for Academics (if you were using "desktop" SAS, you could simply import the file directly from your computer), but fortunately only needs to be done once. 

In this tutorial, we'll cover how to get data files from your computer into your SAS OnDemand for Academics account so that you can analyze them.

Uploading Data Files to your SAS OnDemand for Academics Account

If you are using SAS OnDemand for Academics and have data files you want to work with in the program, you will need to upload data to your account. This goes for ANY dataset you want to work with in SAS ODA, including "native" SAS datasets (*.sas7bdat) or other formats such as Excel and CSV.

To upload a file to SAS ODA:

  1. In the Navigation pane, click Server Files and Folders
  2. Left-click on the folder you want to upload the data file to (usually "Files (Home)" or a folder within that directory). Then click Upload. (Note that the Upload button will not be clickable until a folder location is selected, so, if it's greyed out, make you've selected a folder.)
  3. The Upload Files window will open. Click Choose Files, then choose the data file on your computer to upload
  4. SAS will show the name(s) and size(s) of the file(s) you selected. To complete the upload, click Upload.
  5. If success








Importing Raw Data Files into SAS Datasets

Once uploaded, you will find the raw data file under the "Server Files and Folders" section. Double-clicking it will open a new tab to import the file into a SAS dataset. Internally, SAS can only handle SAS datasets; you cannot directly perform any data processing on the uploaded files, and it is necessary to create a new SAS dataset based on the raw data file. We call this process as "importing data."


On the tab, navigate to the "Guessing rows" option and enter the number of rows. Based on your input, SAS determines the data types and lengths of each variable for the newly created SAS dataset. So, enter any number that is reasonably large enough but no more than the number of rows in your raw data. Then, click the "Save" and "Run." This will automatically create a file, containing PROC IMPORT code for the new SAS dataset.



Most of the time when you start a new project, your data will not be saved in a SAS dataset file format (*.sas7bdat). Your data might be in the form of a spreadsheet in Excel, an SPSS dataset, or a text file. The most common and new-user friendly method for reading a non-SAS dataset into SAS is by using the Import Wizard.







This tutorial shows how to import Excel files into SAS. 







If you want to work with your own data in SAS OnDemand for Academics, you'll first need to upload it to your account. This tutorial shows how to upload and import SAS and non-SAS data files (such as Excel and CSV) into SAS ODA for analysis. 













hjkl

How to Upload Local Files into a SAS Dataset?

SAS Studio is a cloud application. Prior to any data processing, you must first upload data files stored in your local machine. To upload one, go to the "Server Files and Folders" section on the navigation pane, and click on "Files". Next, select the destination folder under the "Files" and click "Upload" button.  

After uploading the selected file, the next step is to create a new SAS data set. SAS cannot directly process raw data files like CSV or Excel. These files must first imported into a SAS dataset. The easiest way is to use the point-and-click interface of SAS Studio. 

Right-click on the uploaded data file and select "Import Data".

By default, the imported SAS dataset will be saved in the temporal WORK library and named "Import". So, click "Change" and replace the library and dataset name. Next, fill in the row number at which data reading should start in the "Start reading data at row". In a SAS dataset, all columns must have appropriate lengths and data types. The "Guessing rows" field determines the number of rows read to determine these attributes. Select a number that is smaller than the entire number of rows, but is reasonably large enough. All things completed, click "Save" and "Run" buttons to start data imports. 

SAS Libraries

SAS libraries are essentially a storage location for SAS datasets, grouping related datasets under specific names and providing callable references. By default, all SAS datasets are temporarily stored in the WORK library and will be automatically deleted by the end of the current session. To avoid this, you should create a new library for your project and store datasets under the library.

One of the ways to create a new SAS library is the LIBNAME statement. Here's the basic syntax:

LIBNAME MyData '/path/to/your/library';

In the SAS ODA, your library paths will always begin with '/home/your-user-name/'. You can find the user name for the path at the bottom right corner on your browser.

Alternatively, you can create a library through SAS ODA's graphical interface. Navigate to the "Libraries" section on the left panel, then right click on the "My Libraries". You will find the "New Library" button.

In the "New Library" window:

  • Name: Specify the name for the new library. It should be descriptive, no longer than 8 characters, must start with a Roman alphabet, cannot contain any blanks or special characters other than underscore.
  • Path: The directory where the library is located, equivalent to the file path specified in a LIBNAME statement. You may click the Browse button to select the directory for the new library.
  • Re-create this Library at start-up: Optionally, you can set SAS to "remember" the new library every time you start a new SAS Studio. This ensures that the library exists in a new session.

kj




Importing External Data Files into SAS Data set

Most of the time, you will create a new data set by referencing an existing data file, which could be formatted as CSV, Microsoft Excel, or any other text file. To inform SAS that the raw data values will be sourced from an external file, you should add an INFILE statement with the file directory before the INPUT statement.

If you're working in SAS Studio, you should first upload the local file and then provide the file directory of the uploaded data file. For example, let's consider data values stored in your local machine as follows:

After uploading the file through the "Upload" button on the navigation pane, you can right-click on the file and select "Properties." Then SAS Studio opens a dialog with the exact location of the uploaded file. 

Using the file location, you can import the uploaded file into a SAS data set in a DATA step:

DATA MyData.PowerConsumption;
INFILE '/home/u63368964/source/household-power-consumption.txt' DLM=';' FIRSTOBS=2;
INPUT Date :ANYDTDTE10. Time TIME8.
GlobalActivePower GlobalReactivePower Voltage GlobalIntensity
SubMetering1 SubMetering2 SubMetering3;
RUN;

Controlling Inputs with the INFILE Options

When reading data from external sources through an INFILE statement, you can include some options to control data inputs. To add an INFILE option, simply append it after file directory, separated by a space. Below are commonly used INFILE options:

  • FIRSTOBS: Specifies the row number from which SAS begins to read data.
  • OBS: Specifies the row number at which SAS finishes reading data.
  • ENCODING: Specifies the character encoding of the file. Common encodings include 'latin1', 'utf-8', etc.
  • DLM: Short for "delimiter." It specifies the character that separates values in the file. SAS defaults to using a space as the delimiter, but you can specify others like comma (',') or tab ('\t'), depending on the source data format.
  • DSD: The "Delimiter Sensitive Data," or DSD, is an option to handle situations where data values in a delimited file contain the delimiter within the actual data. For example, consider a data file delimited by commas have a character string: "John Doe","123, Main St",25. Without DSD, SAS interprets this as four values. To prevent this, you can specify DSD=',' to ensure that SAS recognizes the entire quoted string as a single value.
  • MISSOVER: By default, SAS automatically moves to the next line to read next values, if SAS reached the end of the data line while there are remaining variables to fill in. The MISSOVER option tells SAS that if a data line runs out of value, assign missing values to the remaining variables.
  • TRUNCOVER: When the raw data file is fixed formatted with some trailing missing values, specifying TRUNCOVER option tells SAS to read data for the variable until it reaches the end of the data line or the last column specified in the format, whichever comes first. 

For example, let's consider a text file delimited by commas as shown below. In the DATA step, we would like to read from the second to the seventh data line, excluding both the headers and comments. Also observe that the third and fifth rows exhibit trailing missing values, which have been truncated.

The following DATA step imports this data file into a SAS data set:

DATA MyData.HomeAddress;
/* Specifying variable lengths */
LENGTH ID $3 Name $50 Age 3 Sex $1 Address $100;
INFILE '/home/u63368964/source/home-address.dat' ENCODING='utf-8' FIRSTOBS=2 OBS=7 DLM=',' TRUNCOVER;
INPUT ID $ Name $ Age Sex $ Address;
RUN;

Creating a Data Set Bringing Observations from Another SAS Data set

Often, you want to create a SAS data set referencing from another data set. The SET statement brings observations from the specified data set and creates a new data set. For example:

DATA JapaneseAddress;
SET MyData.HomeAddress;
PostalCode = SUBSTR(Address, 1, 11);
DROP ID Age Sex;
RENAME Name = HouseholdName;
RUN;

This example DATA step creates the JapaneseAddress, referencing from Mydata.HomeAddress. After bringing all observations from MyData.HomeAddress, the DATA step creates a new variable by applying SUBSTR function to the Address variable. It also drops unnecessary variables using the DROP statement and renames Name by the RENAME statement.

  • DROP: Excludes variables from the output data set.
  • RENAME: Rename variables in the output data set.
  • KEEP: Retain specific variables in the output data set.

The INPUT Statement

The INPUT statement provides SAS overall guides for importing data into a new data set. It specifies the formatting of raw data values, which column values correspond to variables, and the name of the output variables. Here's the basic syntax of the statement:

INPUT Variable1 [$] [Column-position1 - Column-position2] [Informat.] ... ;

  • Variable1: Name of the first variable in the output data set. You can append variables to the statement as many times as you need.
  • Dollar Sign: If you want the Variable1 to be entered as a character variable, you should indicate it by adding a $.
  • Column-position1 - Column-position2 (Optional): These optional indices define the first and last column positions in the raw data. 
  • Informat. (Optional): This specifies how the raw data is formatted (e.g., $240,000 or Mar 19, 2024). 

List Input to Read Data

In a raw data file, when the values are separated by one or more spaces and there is no rule about their column positioning, we say the file is in a free format. To create a new SAS data set with a freely formatted data source, you can simply list the desired variable names in the INPUT statement. Variable names must be:

  • Starting with an alphabet (A-Z, a-z) or an underscore (_) only.
  • Not exceeding 32 characters.
  • Having numbers other than the 1st position.

For example, let's consider the following data file:

Now, let's import this file using the following INPUT statement. This kind of INPUT statement is referred to as list input:

DATA StudentGrades;
INFILE '/home/u63368964/source/student-grades.dat';
INPUT StudentID $ Name $ Math Science English;
RUN;

In the output data set, we see that the Math is entirely missing. This happens due to the existence of spaces between the first and last name in each data line. The list input is the most straightforward approach, but it comes with some limitations:

  • Strict Delimiter: All values must be strictly separated by at least one space. Or you must specify the delimiter by the DLM= option in the INFILE statement.
  • Missing Value Representation: Any missing values must be represented as a single period.
  • Character Variable Limitations: Character variable values must not contain any spaces, and their maximum length is 8 characters, unless specified otherwise using the LENGTH statement. 

In this example DATA step, list input instructs SAS to interpret any space between the first and last name as a delimiter, resulting in two separate values being populated for each observation. Thus, the last names are mistakenly assigned to the Math, which is numeric. Since character values cannot be converted into numeric values, all Math entries are missing.

Reading Fixed Format Data

If all data values are neatly arranged in columns, you can read them by specifying column indicators. The column indicators specify the first and last column position for each variable entered in fixed format. For example:

DATA StudentGrades;
INFILE '/home/u63368964/source/student-grades.dat';
INPUT StudentID $ 1-4 Name $ 6-20 Math Science English;
RUN;

In this DATA step, the INPUT statement explicitly specifies raw data column ranges for each variable assignment. For example, any values in the 1-4 columns of the raw data will be assigned to the StudentID. Similarly, any values in the 6-20 columns will be assigned to the Name. For the remaining variables, we use the list input.

Compared to the list input, column indicators take several advantages:

  • Character variables can contain spaces and have values of up to 32,767 characters.
  • It's possible to skip values for unnecessary variables.
  • Since the positions of columns to read are specified, they can be read in any order, and the same column can be read repeatedly.

For example, let's consider the following DATA step:

DATA StudentGrades2;
INFILE '/home/u63368964/source/student-grades.dat';
INPUT Name $ 6-20 Math Science StudentID $ 1-4 MathScore 21-22;
RUN;

This DATA step creates Name, reading values in 6-20 column. Then it moves to the next values to populate Math, separated by a space. Similarly, it reads the next separated values to populate Science. These two variables are created by list input. After that, SAS reads values in 1-4 columns and populates StudentID. Lastly, it reads 21-22 column values again for another variable MathScore:

Pointer Controls in the INPUT Statement

By default, particularly when reading raw data without column indicators, the INPUT statement extracts data sequentially from the source, one value after another. However, this might not be ideal if your data isn't arranged in a straightforward way. This is where pointer controls come into play. They keep track of the current location for each data record and relocates as needed. Here are some common types of pointer controls:

  • Column Pointer Controls
    • These allow you to move the pointer to a specific column position in the current data line.
    • @n moves the cursor to n-th column.
    • Sometimes you don't know the column number, but you do know that it always comes after a particular character string. In such situations, you can use the @'character string'.
    • +n moves the cursor to the n-th column after the current position.
  • Line Pointer Controls:
    • These controls are used to move the pointer to another line in the data record.
    • / moves the cursor to the next line.
    • #n moves the cursor to the n-th line.

For example, data file in the figure shown below contains information about temperatures for the month of July for Alaska, Florida, and North Carolina. Each line contains the city, state, normal high, normal low, record high, and the record low temperature in degrees Fahrenheit. 

Now, let's consider the following DATA step:

DATA JulyTemperatures;
INFILE '/home/u63368964/source/temperatures.dat';
INPUT @'State:' State $
@1 City $
@'State:' +3
AvgHigh 2. AvgLow 2.
RecordHigh RecordLow;
RUN;

In the DATA step, @'State:' relocates current column position wherever right after the specified string, 'State:', in the raw data file, and populates State. Then, @1 moves the current position to the first column and read values for City. Similarly, @'State:' +3 locates the cursor to 3 columns after 'State:' string. 

Let's consider another example. This time, each observation spans three lines of the data file: the first line contains first and last name, second line contains age, and the third line contains height and weights.

Now, let's consider the following DATA step:

DATA HeightWeights;
INFILE '/home/u63368964/source/height-weight.dat';
INPUT FirstName $ LastName $
/ Age
#3 Height Weight;
RUN;

In this DATA step, / moves cursor to the next line after reading two variables. Similarly, #3 relocates the cursor to the third line of the raw data file:


Post a Comment

0 Comments