PROC Step: Summarizing Your Data into a Tabular Report

The term statistics has two different meanings. One is the field of statistics itself. Another is the numbers calculated from data. These numbers, refer to as descriptive statistics, summarize the observed data points and provide estimations on parameters of a population which we would like to infer.

In data analysis, one of the first things we do after collecting and cleaning data, is summarizing them into descriptive statistics and presenting them in a tabular report. Particularly, we are interested in at which point a variable is centered and at which degree it is scattered around the central point. These two properties of a variable are called the central tendency and dispersion, respectively.

Measures of Central Tendency

The central tendency of a variable refers to a single value that summarizes the entire data points for that variable. It serves as the most representative value for a variable. For example, let's consider a variable \(X\) with 10 values: \(1, 1, 2, 2, 2, 2, 4, 4, 6, 8\). How would you summarize these values into a single number?

One way that we can intuitively come up with would be taking average of the values: 

\(\frac{1+1+2+2+2+2+4+4+6+8}{10} = \frac{1 \times 2 + 2 \times 4 + 4 \times 2 + 6 \times 1 + 8 \times 1}{10}\)

This can be generalized. Let \(N\) be the number of data points and \(x_1, x_2, ... x_N\) be the observed data points. Then the average of the data would be: 

\(\frac{\sum_{i=1}^N x_i}{N}\) for \(i = 1,2,...,N\)

Now, let \(X_1, X_2, ... \) represent the list of possible values of the data points, which in this example are \(1, 2, 4, 6, 8\). Then for each possible value, we can count the number of occurrences. Let's denote these counts as \(w_1, w_2, ...\). For example, we have observed 10 values earlier: \(1, 1, 2, 2, 2, 2, 4, 4, 6, 8\). Based on these observations, we have five pairs of each possible value and its occurrences: \((X_1 = 1, w_1 = 2), (X_2 = 2, w_2 = 4), (X_3 = 4, w_3 = 2), (X_4 = 6, w_4 = 1), (X_5 = 8, w_5 = 1)\). 

Observe that \(\sum w_j = N\). This makes sense as \(w_j\) represents the number of occurrence for \(j\)-th possible data value. With these new notations, we can re-write the average formula as follows: 

\(X_1 \times \frac{w_1}{\sum w_j} + X_2 \times \frac{w_2}{\sum w_j} + ... = \frac{\sum_{j} X_j \times w_j}{\sum_{j} w_j}\)

This is what referred to as the mean of the variable \(X\). It essentially represents the average of the possible values of variable \(X\), weighted on the number of occurrences.

If you were to have an access to the entire population data (which is never possible in the practice), the mean of \(X\) would be noted as \(\mu\). On the other hand, the sample mean at hand is noted as \(\bar{X}\). 

Another way to summarize the data points is the mode, which is the value appeared most often. With the pairs of possible values and the number of occurrence, this can be easily obtained. For example, we have 10 observed values mentioned earlier: \(1, 1, 2, 2, 2, 2, 4, 4, 6, 8\), and the five pairs of the possible values and their occurrences: \((X_1 = 1, w_1 = 2), (X_2 = 2, w_2 = 4), (X_3 = 4, w_3 = 2), (X_4 = 6, w_4 = 1), (X_5 = 8, w_5 = 1)\). Mode is the value of \(X_j\) where \(w_j\) is the largest, which in this example 2.

The other way would be the median. It is the value to which 50% of data points are greater than or equals and 50% of data points are smaller than or equals. To find the median value, we should first arrange the data points from smallest to largest. If the number of data points is odd, the median is the middle data point in the list. If the number of data points is even, the median is the average of the two middle data points in the list. For example, the 10 data points mentioned earlier \(1, 1, 2, 2, 2, 2, 4, 4, 6, 8\) are sorted in ascending order. Since we have an even number of observations, the median is the average of the two middle values: \(\frac{2+2}{2} = 2\). Here, 50% of data values (\(2, 4, 4, 6, 8\)) are greater than or equals to 2. At the same time, 50% of data values (\(1, 1, 2, 2, 2\)) are smaller than or equals to 2. 

In data analysis, the choice of central tendency depends on the data. For example, imagine you are a delivery driver and you track your earnings for a week:

  • Monday: $50
  • Tuesday: $50
  • Wednesday: $60
  • Thursday: $50
  • Friday: $70
  • Saturday: $60
  • Sunday: $280 (Big weekend order)

If someone asked "On a typical day, how much money do you earn from delivery?" you are likely choose to answer by excluding days like Sunday from your considerations. This is because including atypical values like $280 could make your answer less representative. Instead, you would prefer to select a representative day for your answer, such as a day with median earnings.

  • Mean: \(\frac{$50 \times 3 + $60 \times 2 + $70 \times 1 + $280 \times 1}{7} = $88.57\)
  • Mode: $50
  • Median: $60

As we can see in this example, medians and modes are robust to outliers (extreme values)

Medians can also provide us with additional information that the means cannot. For example, let's say that you have 10 house sales data. If the median sales price is $1M, then you can notice that 5 houses were sold were below $1M and 5 houses were sold over $1M. However, knowing the fact that mean house sales price is $1M do not give you such information; technically it is possible that 9 houses were sold at $1 respectively and 1 house sales price $9,999,991.

In addition, by its nature, it doesn't make sense to calculate means for some variables. For example, let's say that a clothing store tracks what color T-shirts sell the most. In this case, calculating mean for categories doesn't make sense; you can't average colors of T-shirts. 

Nonetheless, in most cases, the most preferred measure of central tendency is the mean. This is because, unlike median or mode, which are associated with one or two values in their calculations, the mean takes into account every single data point, providing a more complete picture of how the data is distributed: each possible value contributes to the mean proportionally to its number of occurrences. In addition, the mean value does not require sorting, which can take a huge amount of computational resources when there is a large number of data points.

Measures of Dispersion

Next question is how to measure the degree of dispersions of your data. One simple way to measure this would be the range, which is the value subtracting the minimum from the maximum. For example, if your data points are \(1, 1, 2, 2, 2, 2, 4, 4, 6, 8\), then the range would be \(8\) (maximum) \(- 2\) (minimum) \(=6\). 

The range is a simple and quick way to get a basic idea of how spread out your data points are. However, it is important to remember that the range can be misleading, especially if your data has outliers (extreme values). A single outlier can significantly inflate the range and not accurately reflect the spread of the majority of the data points. For example, suppose that you have \(88\), instead of \(8\) in the example mentioned earlier. Even a single value change heavily influenced the range.

Thus, when measuring the spread of your data using range, it is recommended to provide some percentiles as well. The percentiles are the generalized version of median. For example, the 90th percentile is the value to which 10% of data is greater than or equals and 90% of data is smaller than or equals. Similarly, the 50th percentile (median) is the value to which 50% of data is greater than or equals and 50% of data is smaller than or equals. The five-number summaries of data refer to minimum (also known as 0 quartile), 25th percentile (1st quartile), median (2nd quartile), 75th percentile (3rd quartile), and maximum (4th quartile). These numbers are five most important points in your sample and offer a glimpse into the data's spread and potential skewness.

While the five-number summaries are useful, statisticians still want a single, easily digestible value to represent overall spread. To obtain such a number, we firstly subtract the mean from each data point. Based on the \(N\) number of observations, you will obtain \(N\) deviations from the mean value. However, considering how the mean value is calculated, simply summing up these deviations will always end up with 0. Instead, we calculate the sum of squared deviations

Using the sum of squared deviations, the population and sample variance is defined as follows:

  • Population Variance: \(\sigma^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2\)
  • Sample Variance: \(S^2 = \frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})^2\)

Since the variance is in squared units of your data, it is quite hard to interpret. You might then wonder, "Why didn't we just take the absolute values instead of squaring them?" The biggest reason is that the absolute values cannot be differentiated at the point of 0. The concept of differentiation becomes very important when working with statistical functions and models related to variance.

Instead, we square root the variance to address this issue:

  • Population Standard Deviation: \(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2}\)
  • Sample Standard Deviation: \(S = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (X_i - \bar{X})^2}\) 

Justification for Bessel's Correction

Suppose that we are randomly selecting \(N\) data points from a population with mean \(\mu\) and variance \(\sigma^2\). The specific outcomes of these sample data points are unknown until we actually conduct the random sampling; they will be determined by the random sampling process and can range between the minimum and maximum values of the population. Thus, we can think of the \(N\) data points as a set of independently and identically distributed random variables. That is:

\(X_1, X_2, ..., X_N \stackrel{i.i.d}{\sim} F\)

Based on these random variables, we can obtain the sample variance as: 

\(S_N^2 = \frac{1}{N-1}\sum_{i=1}^N (X_i = \bar{X})^2\)

As per this definition, the sample variance represents the sum of squared deviations from the sample mean \(\bar{X}\), divided by \(N-1\). Notably, \(N-1\) is used instead of \(N\) to compute the average of the squared deviations. This adjustment is known as Bessel's correction.

The choice of \(N-1\) instead of \(N\) as the denominator aims to make \(S^2\) an unbiased estimator for \(\sigma^2\), i.e., \(E(S_n^2) = \sigma^2 \). The key to the proof of this relies on an handy identity:

\(\begin{aligned}\sum_{i=1}^N (X_i - c)^2 & = \sum_{i=1}^N ((X_i - \bar{X}) + (\bar{X} - c))^2\\ & = \sum_{i=1}^N (X_i - \bar{X})^2 + 2\sum_{i=1}^N (X_i - \bar{X})(\bar{X} - c) + \sum_{i=1}^N (\bar{X} - c)^2\\ & = \sum_{i=1}^N (X_i - \bar{X})^2 + n \ast (\bar{X} - c)^2 \end{aligned}\)

This identity holds true for any constant \(c\). Now, let us apply the identity, choosing \(c = \mu\):

\(\begin{aligned}\sum_{i=1}^N (X_i - \mu)^2 & = \sum_{i=1}^N (X_i - \bar{X})^2 + N \ast (\bar{X} - \mu)^2\\ \Rightarrow \sum_{i=1}^N (X_i - \bar{X})^2 & = \sum_{i=1}^N (X_i - \mu)^2 - N \ast (\bar{X} - \mu)^2\end{aligned}\)

Dividing both sides by \(N-1\):

\(\frac{\sum_{i=1}^N (X_i - \bar{X})^2}{N-1} = \frac{\sum_{i=1}^n (X_i - \mu)^2}{N-1} - \frac{N \ast (\bar{X} - \mu)^2}{N-1}\)

Taking the expectation of both sides:

\(\begin{aligned}E\begin{pmatrix}\frac{\sum_{i=1}^N (X_i - \bar{X})^2}{N-1}\end{pmatrix} & = E\begin{pmatrix}\frac{\sum_{i=1}^N (X_i - \mu)^2}{N-1}\end{pmatrix} - E\begin{pmatrix}\frac{N \ast (\bar{X} - \mu)^2}{N-1}\end{pmatrix}\\ & = \frac{E(\sum_{i=1}^N (X_i - \mu)^2)}{N-1} - \frac{N}{N-1} \ast E(\bar{X} - \mu)^2\\ & = \frac{E((X_1^2 - 2X_2\mu + \mu^2) + ... + (X_N^2 - 2X_N\mu + \mu^2))}{N-1} - \frac{N}{N-1} \ast \frac{\sigma^2}{N}\\ & = \frac{E(X_1^2) - 2\mu E(X_1) + \mu^2 + ... + E(X_N^2) - 2\mu E(X_N) + \mu^2}{N-1} + \frac{\sigma^2}{N-1}\\ & = \frac{N(\sigma^2 + \mu^2)}{N-1} - \frac{2N\mu^2}{N-1} + \frac{N\mu^2}{N-1} - \frac{\sigma^2}{N-1}\\ & = \frac{N\sigma^2 + N\mu^2 - 2N\mu^2 + N\mu^2 - \sigma^2}{N-1} = \sigma^2 \end{aligned}\)

\(\Rightarrow E(S^2) = E\begin{pmatrix}\frac{\sum (X_i - \bar{X})^2}{N-1}\end{pmatrix} = \sigma^2\)

We see that the expected value sample variance equals to the population variance. Thus, we can confirm that the Bessel's correction yields unbiased estimated for \(\sigma^2\).

Data Summaries Using SAS

In SAS, the three procedures that calculate the data summaries we've discussed so far and organize them into tabular reports are PROC FREQ, PROC MEANS, and PROC TABULATE. In the following sections, let's explore how to achieve this using these three PROCs. 

PROC FREQ

Creating a frequency table is a great starting place for summarizing your data. It essentially answers the question "How often each value occurs in the data set?" In SAS, you can draw a frequency table using PROC FREQ

For example, the SASHELP.CARS data set records attributes of each car model, including make, MSRP, and MPG, across 15 variables. Now, let's suppose that you want to draw a frequency table answering how many observations for each value of the DriveTrain variable. Displaying categorical variables in a frequency table is fairly straightforward, since we already have clearly defined categories. 

PROC FREQ DATA=SASHELP.CARS;
TABLES DriveTrain;
RUN;

When the frequency table counts for a single variable, it is referred to as a one-way frequency table. To produce a one-way table, all you have to do is just list the variable name in the TABLES statement of the PROC FREQ.


When you combine two or more variables, the counts are called two-way frequencies, three-way frequencies, and so on. Tables combining two or more variables are also called cross-tabulations or contingency tables. To create a contingency table, list the variables in the TABLES statement, separated by an asterisk. For example:

PROC FREQ DATA=SASHELP.CARS;
TABLES DriveTrain * Type;
RUN;


The contingency tables count the number of occurrences of each value across the multiple variables. So, for example, the number of observations where DriveTrain is 'Hybrid' and Type is 'All' is 0, the number of observations where DriveTrain is 'Front' and Type is 'SUV' is 22, and so on.

These tables show the joint distribution of the variables. In a TABLES statement, you can list as many variables as needed. However, joint distributions of three or more variables are too complex to interpret and are generally not recommended. 

Options, if any, should appear after a slash in the TABLES statement. Here's the list of options to control the output of PROC FREQ:

  • LIST: Prints contingency tables in a list format, not grid.
  • MISSPRINT: Includes missing values in frequencies, but not in percentages.
  • MISSING: Includes missing values in frequencies and in percentages.
  • NOCOL: Suppresses printing of column percentages in the output contingency table.
  • NOROW: Suppresses printing of row percentages in the output contingency table
  • NOCOL: Suppresses printing of percentages in the output contingency table.
  • NOPERCENT: Suppresses printing of percentages.
  • OUT = data-set: Create a data set with the frequency table.

For example: 

PROC FREQ DATA=SASHELP.CARS;
TITLE "Contingency table without percentages";
TABLES DriveTrain * Type / NOROW NOCOL NOPERCENT;
RUN;

Frequency Tables with Quantitative Variables

Frequency tables are not limited to qualitative variables. By nature, certain quantitative variables should be regarded as having discrete distributions. For example, the Cylinders variable in the SASHELP.CARS data set marks the number of cylinders for each car model. Since there is no way to have a fractional number of cylinders, it makes sense to consider this variable as having discrete distribution. 



Binning Continuous Variables with User-Defined Formats

In a frequency table, the number of classes between 4 and 20 is generally considered as a reasonable. So, if you create a frequency table with a continuous variable, which can take any number in a given range, there will be too many classes with a single occurrence. This makes it hard to get a glimpse of the overall distribution of the variable.

Instead, you can consider grouping (or binning) similar values into a single class. In SAS, you can do this with PROC FORMAT. For example: 

PROC FORMAT;
VALUE Fmt_Price
LOW - 30000 = '$'
30001 - 60000 = '$$'
60001 - 90000 = '$$$'
90001 - HIGH = '$$$$';
RUN;

PROC FREQ DATA=SASHELP.CARS;
TABLES MSRP;
FORMAT MSRP Fmt_Price.;
RUN;

In this example, the PROC FREQ creates a user-defined format named Fmt_Price. This format is applied to MSRP in the subsequent PROC FREQ. So, the MSRP values are grouped into four categories using the Fmt_Price. and PROC FREQ creates a one-way frequency table based on the groupings.






While there is no strict rule for determining the number of classes, in case you can't come up with an appropriate number, here is one standardized approach: 

\(2^k \ge N\)

In this equation, let \(k\) be the number of classes you want to create and \(N\) be the total number of data points you have. Then, find the smallest value of \(k\) that makes 2 raised to the power of \(k\) greater than or equal to \(N\). This is what we called the \(2^k\) rule.

PROC MEANS

In SAS, you can obtain the descriptive statistics through PROC MEANS. By default, the MEANS procedure provides the following summary statistics for each numeric variable:

  • N: Number of non-missing values.
  • Mean: Average of the non-missing values.
  • Std Dev: Sample standard deviation calculated from the non-missing values.
  • Minimum: Maximum value among all non-missing values.
  • Maximum: Minimum value among all non-missing values.

Notice that the MEANS procedure computes summary statistics, based on the non-missing values. It ignores all missing values in its calculation. For example:

DATA MyData;
INPUT X;
CARDS;
1
2
3
.
4
5
RUN;

In this example data set, we have six observations with one missing value. Now, let's run a basic MEANS procedure as follows:

PROC MEANS DATA=MyData;
RUN;

In the results, we see that all summary statistics were computed solely based on the non-missing values. For example, when PROC MEANS calculates the mean value, it sums up all non-missing values (which is 15 in this case) and then divides it by the number of non-missing values (5). Thus, the resulting value is 15/5=3; it does not consider the missing value as 0 and return 15/6 as the mean value.

It is also important to note that every statistic the MEANS procedure calculates is the sample statistics, not population. SAS treats the observations at hand as a sample. Particularly, when calculating standard deviations, the denominator it uses is \(N-1\), not \(N\). In this example, the sum of squared deviations to the mean is 10. Thus, the sample standard deviation PROC MEANS calculated is the square root of 10/(5-1), which is approximately 1.5811388.

PROC MEANS Options

After initiating the MEANS procedure with the PROC statement, you can add some options to customize which summary statistics to calculate and how they are presented.

  • MAXDEC = n: Specifies the number of decimal places to be displayed
  • MAX: Maximum value of the variables.
  • MIN: Minimum value of the variables.
  • MEAN: Mean of the variables. 
  • MEDIAN: Median.
  • MODE: The most frequent value in the variable.
  • N: Number of non-missing values
  • NMISS: Number of missing values
  • RANGE: The value subtracting minimum value from the maximum.
  • STDDEV: Sample standard deviation of the variables. 
  • SUM: Sum of the variables.

For example:

DATA MyData2; INPUT X Y Z; CARDS; 1 1 1 2 2 1 3 3 1 . 4 4 4 5 4 5 6 9 ; RUN; PROC MEANS DATA=MyData2 MAXDEC=3 MEDIAN MODE N NMISS RANGE SUM; RUN;

Additionally, you can include the following statements:

  • BY variable-list;
    • The BY statement performs separate analyses for each level of the variables in the list. 
    • The data must first be sorted by these variables through PROC SORT.
  • CLASS variable-list;
    • The CLASS statement also performs separate analyses for each level of the variables in the list, but its output is more compact than with the BY statement.
    • Data do not have to be sorted first.
  • VAR variable-list;
    • The VAR statement specifies which numeric variables to use in the analysis.
    • If omitted, all numeric variables are used in the PROC MEANS.

For example:

DATA MyData3; INPUT X $ Y Z; CARDS; a 1 1 b 2 1 a 3 1 b 4 4 a 5 4 b 6 9 ; RUN;

PROC MEANS DATA=MyData3;
TITLE "CLASS X";
CLASS X;
RUN;

PROC SORT DATA=MyData3;
BY X;
RUN;

PROC MEANS DATA=MyData3;
TITLE "BY X";
BY X;
RUN;

PROC MEANS DATA=MyData3;
TITLE "VAR Y";
VAR Y;
RUN;

You can also employ any user-defined formats created by a preceding PROC FORMAT. The UDF will be used like BY variable in the MEANS procedure. For example: 

PROC FORMAT;
VALUE Fmt_Price
LOW - 30000 = '$'
30001 - 60000 = '$$'
60001 - 90000 = '$$$'
90001 - HIGH = '$$$$';
RUN;

PROC MEANS DATA=SASHELP.CARS;
CLASS MSRP;
FORMAT MSRP Fmt_Price.;
RUN;

PROC TABULATE

PROC TABULATE is another SAS procedure that can create a tabular reports with summary statistics. In fact, while it is possible to create tabular reports with other procedures, the PROC TABULATE stands out as the most favored option as its reports are the most visually appealing and well-organized.

The general form of PROC TABULATE is: 

PROC TABULATE;
CLASS classification-variable-list;
TABLE page-dimension, row-dimension, column-dimension;
RUN;

Where:

  • CLASS: Specifies which categorical variables to be used for dividing observations into groups.
  • TABLE: Specifies how to organize the output tables and what numbers to compute.

For each TABLE statement in PROC TABULATE, you can specify up to three dimensions. These dimensions, separated by commas, determine which variables to be used for pages, rows, and columns in the report.

If only one dimension is specified, it defaults to the column dimension. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE "Col-dimension: DriveTrain";
CLASS DriveTrain;
TABLE DriveTrain;
RUN;

If two dimensions are specified, rows and columns are defined. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE1 "Row-dimension: Type";
TITLE2 "Col-dimension: DriveTrain";
CLASS Type DriveTrain;
TABLE Type, DriveTrain;
RUN;

If all three dimensions are specified, pages, rows, and columns are included. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE1 "Page-dimension: Origin";
TITLE2 "Row-dimension: Type";
TITLE3 "Col-dimension: DriveTrain";
CLASS Origin Type DriveTrain;
TABLE Origin, Type, DriveTrain;
RUN;

So, when you write a TABLE statement, start with the column dimension, and then add rows and pages one by one, as you check the table results in between. Once you have each dimension checked, insert the next dimensional variable in front of the current one.

Missing Values

By default, PROC TABULATE excludes observations with any of the CLASS variables are missing in its table creations. For example, let's consider a data set from the following DATA step: 

DATA MyData4;
/* X: row Y: col Z: page */
INPUT X $ Y $ Z $;
CARDS;
. 121 131
211 221 231
311 321 331

112 122 132
212 . 232
312 322 332

113 123 .
213 223 233
313 323 333
RUN;

In this data set, 3 out of 9 observations have some missing value in each variable. Then, in the subsequent PROC TABULATE, the three variables are listed in the CLASS statement: 

PROC TABULATE DATA=MyData4;
/* Page, row, col */
CLASS Z X Y;
TABLE Z, X, Y;
RUN;

Excluding observations with missing values, PROC TABULATE will produce 6 pages where Z value is 231, 331, 132, 332, 233, 333, respectively. Then within each page, there will be only one observation. For example, page number where Z equals to 231 has one observation: X is 211 and Y is 221.





However, if you want to keep those observations, then simply add the MISSING option to your PROC statement. For example: 

PROC TABULATE DATA=MyData4 MISSING;
/* Page, row, col */
CLASS Z X Y;
TABLE Z, X, Y;
RUN;

This procedure will produce in total 9 tables, without excluding any observations; any missing values will be represented as a blank.


Adding Statistics

By default, what PROC TABULATE generates are frequency tables. That is, for each combination of levels in the CLASS variables, the TABLE statement counts the number of observations. Here, if you specify some other statistics in the TABLE statement, the output data cells will contain the values according to your specifications. 

Just as we listed categorical variables in the CLASS statement to group the observations, we should add the VAR statement to list continuous variables, so that PROC TABULATE can calculate statistics on the variables. Here is the general form: 

PROC TABULATE DATA=SASHELP.CARS;
VAR analysis-variable-list;
CLASS classification-variable-list;
TABLE page-dimension, row-dimension, column-dimension;
RUN;

You can include both a CLASS statement and a VAR statement, or just one of them. However, all variables listed in a TABLE statement must also be present in either a CLASS or a VAR statement. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE "Mean MSRP for each DriveTrain and Type";
VAR MSRP;
CLASS DriveTrain Type;
TABLE DriveTrain, MEAN * MSRP;
RUN;

In this example, we have DriveTrain and Type as grouping variables. Then, in the TABLE statement, we have DriveTrain to define row dimension and MEAN * MSRP to define column dimension. Thus, the output will have the average MSRP values in its data cell.

Here are some statistics that you can request in the TABLE statement:

  • ALL: Adds a row, column, or page total
  • MAX: Highest value
  • MIN: Lowest value
  • MEAN: Mean
  • MEDIAN: Median
  • MODE: Mode
  • N: Number of non-missing values
  • NMISS: Number of missing values
  • PCTN: Percentage of observations for each group
  • PCTSUM: Percentage of total represented by each group
  • STDDEV: Sample standard deviation
  • SUM: Sum

Within a dimension, variables and keywords can be concatenated, crossed, or grouped. To concatenate variables or keywords, simply list them separated by a space; to cross variables or keywords, separate then with an asterisk (*); and to group them, enclose the variables or keywords in parentheses. Typically, we concatenate the keyword ALL for a variable. To request other statistics, on the other hand, cross that keyword with the variable name. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE "Mean MSRP by DriveTrain and Type";
VAR MSRP;
CLASS DriveTrain Type;
TABLE DriveTrain ALL, MEAN * MSRP * (Type ALL);
RUN;

In this example, the row dimension concatenates the classification variable DriveTrain with ALL to produce totals. The column dimension, on the other hand, crosses MEAN with the analysis variable MSRP and with the classification variable Type (which happens to be concatenated and grouped with ALL). Here are the results:


Formatting Data Cells

To specify a specific format for all data cells in the output tables, add FORMAT= option in the PROC statement. For example: 

PROC TABULATE DATA=SASHELP.CARS FORMAT=DOLLAR9.2;
TITLE "Mean MSRP by DriveTrain and Type";
VAR MSRP;
CLASS DriveTrain Type;
TABLE DriveTrian ALL, MEAN * MSRP * (Type ALL);
RUN;

For further details on standard and custom data formats, please see this.

Occasionally, you may need to specify more than one formats for a single output. In such cases, you should put the FORMAT= option in the TABLE statement, instead of the PROC statement. For example: 

PROC TABULATE DATA=SASHELP.CARS;
TITLE "Mean MSRP and MPG (City) by DriveTrain and Type";
VAR MSRP MPG_City;
CLASS DriveTrain Type;
TABLE DriveTrian ALL, MEAN * (MSRP*FORMAT=DOLLAR9.2 MPG_City*FORMAT=4.1) * (Type ALL);
RUN;

Post a Comment

0 Comments