SAS Procedures for Data Manipulations

The TRANSPOSE procedure turns observations into variables or variables into observations. Typically, PROC TRANSPOSE is used to convert observations to variables. This can be achieved with the following statements:

PROC TRANSPOSE DATA=OldData OUT=NewData;
BY variable-list;
ID variable-list;
VAR variable-list;
RUN;

Where:

  • DATA: This option specifies the name of the data set you want to transpose. If omitted, the most recently created data set will be used.
  • OUT: This option defines the name of the newly transposed data set. If omitted, the transposed data will be named as WORK.OUTDATA.
  • BY: In PROC TRANSPOSE, BY variables are those that will be included in the transposed data set, but not themselves transposed. The transposed data set will have one observation for each BY level per variable transposed. Note that the BY variables must be sorted before transposing.
  • ID: This statement specifies the variable whose observations will be transposed into variables. If more than one variable is listed in the ID statement, then the values of ID variables will be concatenated to form the new variable names. The ID values must occur only once in the data set; or if a BY statement is present, then the values must be unique within BY groups
  • VAR: The VAR statement specifies the variable names which will be turned into the values in each observation. SAS creates a new variable, _NAME_, which has as values the names of the variables in the VAR statement. If there are more than one VAR variables, then _NAME_ will have more than one values.

For example, let's consider the following data set:

The data set shown in the figure above is a subset of SASHELP.BASEBALL. It has the names, teams, positions, and salaries of players from the Chicago White Sox and Detroit Tigers in 1987. 

Now, let's examine the following procedures:

PROC SORT DATA=ChicagoDetroit;
BY Position Name;
RUN;

PROC TRANSPOSE DATA=ChicagoDetroit OUT=TransposedData; BY Position Name; ID Team; VAR Salary; RUN;

PROC PRINT DATA=TransposedData;
TITLE "Transposed Data";
RUN;

After sorting observations by Position and Name, the transpose procedure is applied. In the procedure, Position and Name are employed as BY variable. These two variables will be included in the transposed data set as they currently are. That is, each observation in the TransposedData will have untransposed Position and Name as its variables. Meanwhile, ID Team; instructs SAS to make each team, Chicago and Detroit, new variables in the transposed data set. Thus, we expect TransposedData to include the four variables mentioned so far: Position, Name, Chicago, and Detroit.

The remaining question would be "How to fill out the Chicago and Detroit variable values?" This question is answered by the VAR statement. The VAR Salary; means that the variable name "Salary" will be turned into the observation for a new variable _NAME_, and each observed salaries will be used as a value for Chicago or Detroit. In this example, since Salary also has labels, there will be one more variable, _LABEL_, in the transposed data.

All things considered, the transposed data will be:

PROC SORT

As implied by its name, PROC SORT arranges observations in a SAS data set. If you intend to have a BY statement in a subsequent DATA or PROC step of your SAS program, the data set must be sorted by the BY variable through PROC SORT. Here's the basic syntax of the PROC SORT:

PROC SORT DATA=SasDataset;
BY variable-list;
RUN

The BY statement specifies the variables by which SAS should apply the procedure, and thus, in the case of PROC SORT, it tells SAS based on which variable the observations should be sorted. Unlike other procedures, the BY statement is required for PROC SORT. 

If there are more than one variable is specified in the BY statement, SAS sorts observations by the first variable, then by the second variable within the same value of the first variable, and so on. For example, let's consider the following SAS program:

DATA Employees;
INFILE '/home/u63368964/source/employees.csv' FIRSTOBS=2 DLM=',' TRUNCOVER;
FORMAT
Dob DATE10. StartDate DATE10. TerminationDate DATE10.;
INPUT
EmployeeID $4. GivenName :$15. SurName :$15. Dob DDMMYY10.
JobTitle :$25. StartDate :DATE10. TerminationDate ?:DATE10.;
Age = INTCK('Year', Dob, TODAY());
IF TerminationDate = '.' THEN MonthsInOffice = INTCK('Month', StartDate, TODAY());
ELSE MonthsInOffice = INTCK('Month', StartDate, TerminationDate);
RUN;
PROC SORT DATA=Employees;
BY JobTitle MonthsInOffice;
RUN;

PROC PRINT DATA=Employees;
TITLE1 'List of Employees';
TITLE2 'Sorted by Title and MonthsInOffice (ASC)';
RUN

In PROC SORT of the program, observe that there are two variables, JobTitle and MonthsInOffice, are specified. Thus, the data set will first be sorted by JobTitle, and then sorted by MonthsInOffice within the same value of the JobTitle.

By default, PROC SORT arranges observations in ascending order. To reverse the sorting order, you should put the keyword DESCENDING before the BY variable name. For example:

PROC SORT DATA=Employees;
BY JobTitle DESCENDING MonthsInOffice;
RUN;

PROC PRINT DATA=Employees;
TITLE1 'List of Employees';
TITLE2 'Sorted by Title (ASC) and MonthsInOffice (DESC)';
RUN

In the "RESLUTS" tab, we see that the observations are first sorted by JobTitle in ascending order, and then sorted by MonthsInOffice in descending order.

PROC SORT Options

By default, applying PROC SORT will modify the original arrangements of the observations. To avoid this, you can add the OUT= option in the PROC statement. For example:

DATA HeightWeight;
INFILE '/home/u63368964/source/height-weight-v2.txt' DSD DLM=',';
INPUT Name $ Age Sex $ Weight Height;
RUN;

PROC SORT DATA=HeightWeight OUT=SortedDataset;
BY Height;
RUN;

/* Print original data */
PROC PRINT DATA=HeightWeight;
TITLE 'Height and Weight Data (Original)';
RUN;

/* Print sorted data */
PROC PRINT DATA=SortedData; TITLE 'Height and Weight Data (Sorted)'; RUN;

In this example, the PROC SORT is applied to HeightWeight with the OUT=SortedData specification. Thus, the original data set, HeightWeight, remains unmodified and we can print out both data sets:

The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variable. For example:

PROC SORT DATA=MyData.Boston OUT=SortedBostonNoDupKey NODUPKEY;
BY ZN;
RUN;

/* Print the first 15 observations of the original data */
PROC PRINT DATA=MyData.Boston (OBS=15);
TITLE1 "Boston Housing Data";
TITLE2 "Unsorted";
RUN;

/* Print the first 15 observations of the modified data */
PROC PRINT DATA=SortedBostonNoDupKey (OBS=15); TITLE1 "Boston Housing Data";
TITLE2 "Sorted by ZN with NODUPKEY Option"; RUN;


If you specify the DUPOUT option along with the NODUPKEY, you can specify the output data set where the duplicated observations are stored. For example:

PROC SORT DATA=MyData.Boston OUT=SortedBostonNoDupKey NODUPKEY DUPOUT=DupObs;
BY ZN;
RUN;

/* Print the first 15 observations of the duplicated data */
PROC PRINT DATA=DupObs (OBS=15);
TITLE "Duplicated Data";
RUN;

When sorting observations by character variables, the default collating sequence varies depending on the operating system you're using. In the z/OS operating environment, it follows the EBCDIC sequence, where precedence is given to blanks followed by numerals, uppercase letters, and then lowercase letters. For all other operating environments, including the SAS Studio environments, the default collating sequence is ASCII, where precedence is given to blanks followed by lowercase letters, uppercase letters, and then numerals.

You can, however, explicitly specify which collating sequence you would like to use with the SORTSEQ= option.


When sorting data set by a character variable, regardless of ASCII or EBCDIC, SAS sorts observations, distinguishing upper- and lowercase letters. However, we often want to sort the data set case insensitively. In this case, the SORTSEQ=LINGUISTIC with the (STRENGTH=PRIMARY) suboption tells SAS to ignore the case. For example:

DATA DATA=MyData;
INPUT Name $;
CARDS;
eva
amanda
Zenobia
ANNA
RUN;

PROC SORT DATA=MyData SORTSEQ=LINGUISTIC (STRENGTH=PRIMARY);
BY Name;
RUN;

ROC PRINT DATA=MyData;
TITLE "Sorted by Name (Case insensitive)";
RUN;

Occasionally, numeral values are stored as character values. When applying PROC SORT to such values, it sorts data as if they are character strings. So, for example, the value "10" comes before "2". However, (NUMERIC_COLLATION=ON) suboption along with the SORTSEQ=LINGUISTIC option tells SAS to treat numerals as their numeric equivalent. This is particularly useful when the BY variable is mixed with numbers and characters. For example:

DATA DATA=MyData2;
INPUT Name $ 1-8 Competitions $ 9-20;
CARDS;
eva 1500m freestyle
amanda 200m breaststroke
Zenobia 100m backstroke
ANNA 50m freestyle
;
RUN;

/* Sorts data without suboption */
PROC SORT DATA=MyData2 OUT=SortByDefault;
BY Competitions;
RUN;

/* Sorts data with numeric collation option */
PROC SORT DATA=MyData2 OUT=SortWithSuboption SORTSEQ=LINGUISTIC (NUMERIC_COLLATION=ON);
BY Competitions;
RUN;

/* Print original data */
PROC PRINT DATA=MyData2;
TITLE "Unsorted Data";
RUN;

/* Print sorted data (default) */
PROC PRINT DATA=SortByDefault;
TITLE "Sorted by Competitions (Default)";
RUN;

/* Print sorted data (numeric collation) */
PROC PRINT DATA=SortWithSuboption;
TITLE "Sorted by Competitions (Numeric collation)";
RUN;

PROC RANK

In statistics, ranking refers to assigning position or order to data points in a variable. It is a way of transforming data by replacing the actual values with their relative standing when the data is sorted. The rank statistic involves in many statistical tests, including Friedman test, Kruskal-Wallis test, and Spearman's rank correlation coefficient.

In SAS, you can use PROC RANK to calculate the rank for one or more numeric variables. It has the following form:

PROC RANK DATA=BmwCars OUT=MsrpRanks;
BY variable;
VAR MSRP;
RANKS MsrpRanking;
RUN;

In the PROC RANK statement, you can add some more options other than DATA= and OUT=. Here's the list of available options:

  • DESCENDING: Reverse the direction of the ranks. 
  • FRACTION: Computes fractional ranks by dividing each rank by the number of non-missing observations. Note that this can be aliased as F.
  • TIES = HIGH | LOW | MEAN | DENSE: Specifies how to compute normal scores or ranks for tied data values.
    • HIGH: Assigns the largest of the corresponding ranks (or largest of the normal scores when NORMAL= option is also specified).
    • LOW: Assigns the smallest of the corresponding ranks (or smallest of the normal scores when NORMAL= option is also specified).
    • MEAN: Assigns the mean of the corresponding ranks (or mean of the normal scores when NORMAL= option is also specified).
    • DENSE: Computes scores and ranks by treating tied values as a single-order statistic. By default, ranks are consecutive integers that begin with the number one and end with the number of unique, non-missing values of the ranking variable. Tied values are assigned the same rank. Note that this specification can be aliased as CONDENSE.
  • GROUPS = number-of-groups: Assigns group values ranging from 0 to number-of-groups minus 1. Common specifications are GROPUS=100 for percentiles, GROUPS=10 for deciles, and GROUPS=4 for quartiles. The group values are calculated by: FLOOR(rank * k / (n+1)), where rank is the value's order rank, k is the value of GROUPS=, and n is the number of non-missing ranking values that are used in TIES=LOW | MEAN | HIGH (for TIES=DENSE, n is the number of unique non-missing ranking values).
  • NORMAL = BLOM | TUKEY | VW: Computes normal scores from the ranks. The resulting variables appear normally distributed. 
    • BLOM: \(y_i = \Phi_{-1}((r_i - 3/8)/(n+1/4))\). This is an approximation to the exact expected order statistics for the normal distribution (i.e., normal scores, Blom 1958).
    • TUKEY: \(y_i = \Phi_{-1}((r_i - 1/3)/(n+1/3))\). This is an approximation to the exact expected order statistics for the normal distribution (i.e., normal scores, Tukey 1962).
    • VW: \(y_i = \Phi_{-1}(r_i/(n+1))\). NORMAL=VW allows you to use the rank scores for a non-parametric location test.
    • Where \(\Phi_{-1}\) is the inverse cumulative normal (PROBIT) function, \(r_i\) is the rank of i-th observation, and n is the number of non-missing observations for the ranking variable for TIES = LOW | MEAN | HIGH (for TIES = DENSE, n is the number of unique non-missing values).
  • NPLUS1: Computes fractional ranks by dividing each rank by the denominator n+1, where n is the number of non-missing ranking values for TIES=LOW | MEAN | HIGH (n is the number of non-missing unique values for TIES=DENSE).
  • PERCENT: Divides each rank by the number of observations that have non-missing values of the variable and multiples the result by 100 to get a percentage. n is the number of observations that have non-missing ranking values for TIES = LOW | MEAN | HIGH (n is the number of non-missing unique ranking values for TIES = DENSE).
  • SAVAGE
    • Computes Savage scores from the ranks by Lehman, 1998. 
    • \(y_i = [\sum_{j=n-r_i + 1}(\frac{1}{j})]-1\). 

For example:

DATA BmwCars;
SET SASHELP.CARS;
IF Make NE 'BMW' THEN DELETE;
KEEP Make Model Type MSRP;
RUN;

PROC RANK DATA=BmwCars OUT=MsrpRanks;
VAR MSRP;
RANKS MsrpRanking;
RUN;

PROC PRINT DATA=MsrpRanks;
RUN

The BmwCars is a subset of SASHELP.CARS. It contains each car's model name, manufacturer, type, and MSRP. Then the PROC RANK creates the new ranking variable, MsrpRanking, assigning the lowest rank value to the lowest MSRP value. The VAR statement specifies which variables to rank. The RANKS statement identifies the variable that contains the rank. Omitting it will replace the values of the VAR variables.

Optionally, you can also have the BY statement. It will calculate a separate set of ranks for each BY group. For example:

DATA BmwCars;
SET SASHELP.CARS;
IF Make NE 'BMW' THEN DELETE;
KEEP Make Model Type MSRP;
RUN;

PROC SORT DATA=BmwCars;
BY Type;
RUN;

PROC RANK DATA=BmwCars OUT=MsrpRanks;
BY Type;
VAR MSRP;
RANKS MsrpRanking;
RUN;

PROC PRINT DATA=MsrpRanks;
RUN

Post a Comment

0 Comments