PROC Step: Formatting and Printing Data

In data analysis, printing observations from a data set is useful in many situations. For example, when exploring a new data set, printing the first few observations gives an initial sense of its structure and content. During data cleaning, selectively printing and reviewing observations of interest helps you spot outliers and missing values. Printed observations are also useful for documentation, supporting the reproducibility and reliability of your data reports.

In this guide, we will explore how to print observations with custom formats using PROC PRINT and PROC FORMAT with some practical examples. Let's get started!

PROC PRINT

PROC PRINT is one of the most frequently used procedures in SAS programming. As its name implies, it prints the observations stored in a data set. The basic syntax of the procedure is:

PROC PRINT DATA=MyData;
    TITLE 'Your Title';                /* Optional title text */
    FOOTNOTE 'Your footnotes';         /* Optional footnote text */
    LABEL variable1 = 'variable one'   /* Optional labels for variables */
          variable2 = 'variable two';
RUN;

For any procedure, if no data set is specified, SAS uses the most recently created one. PROC PRINT is no exception. In practice, it is best to always specify the DATA= option explicitly, because it is often hard to tell at a glance which data set was created last.
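As a minimal sketch of this default behavior, the program below creates two small data sets (First and Second are hypothetical names used only for this illustration) and prints them with and without the DATA= option:

```sas
/* Hypothetical data sets, created only for this illustration */
DATA First;
    x = 1;
RUN;

DATA Second;
    y = 2;
RUN;

/* No DATA= option: SAS prints WORK.Second, the most recently created data set */
PROC PRINT;
RUN;

/* Explicit DATA= option: prints WORK.First regardless of creation order */
PROC PRINT DATA=First;
RUN;
```

The second PROC PRINT keeps working as intended even if more DATA steps are later added above it, which is the main argument for always writing DATA=.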

In addition to DATA=, some useful options for PROC PRINT are:

  • NOOBS: By default, SAS prints observation numbers along with the variables. To suppress them, add the NOOBS option to the PROC PRINT statement.
  • LABEL: This option prints variable labels instead of variable names as column headings, which makes the output more readable, particularly for documentation.
  • (OBS=n): This data set option, written in parentheses right after the data set name, prints only the first n observations.

The following code uses all of these options together:

PROC PRINT DATA=MyData.Boston (OBS=20) NOOBS LABEL;
    TITLE1 'Boston Housing Dataset';
    TITLE2 'First 20 Obs';
    FOOTNOTE 'http://lib.stat.cmu.edu/datasets/boston';
    LABEL CRIM    = 'Crime rate'
          ZN      = '% residential area'
          INDUS   = '% non-retail business area'
          CHAS    = 'Riverside'
          NOX     = 'Nitric oxides'
          RM      = 'Rooms per dwelling'
          AGE     = '% old units'
          DIS     = 'Distance to business centers'
          RAD     = 'Radial highway accessibility'
          TAX     = 'Tax per $10,000'
          PTRATIO = 'Pupil-teacher ratio'
          LSTAT   = '% lower status of the population';
RUN;


Optionally, you can also add the following statements to PROC PRINT:

  • BY variable-list;
    • In the context of PROC PRINT, the BY statement starts a new section in the output for each new value of the BY variables and prints the values of the BY variables at the top of each section. Note that the data must be presorted by the BY variables.
  • ID variable-list;
    • When you use the ID statement, the observation numbers are not printed. Instead, the variables in the ID variable list appear on the left-hand side of the page.
  • SUM variable-list;
    • The SUM statement prints sums for the variables in the list.
  • VAR variable-list;
    • The VAR statement specifies which variables to print and the order. Without a VAR statement, all variables in the SAS dataset are printed in the order that they occur in the dataset.
  • FORMAT variable format;
    • You can change the appearance of printed values using standard data formats.
    • For numeric values, you can specify a format along with the width w and the number of decimals d (formatw.d). Note that w counts every character displayed, including the decimal point and the d decimal digits. For example, 5.3 can display values up to 9.999.
    • For character values, the format name must start with a dollar sign ($formatw.). Character formats take only the width w.
    • Internally, a SAS data set has only two data types: numeric and character. Date values are stored as the number of days since January 1, 1960, so you must apply a date format to display them as calendar dates.
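The FORMAT rules above can be tried on a tiny data set. The sketch below is only an illustration: FormatDemo and its variables are hypothetical names, and VisitRaw is a copy of the date that is left unformatted so the stored number of days stays visible.

```sas
/* Hypothetical one-row data set for trying out formats */
DATA FormatDemo;
    Amount   = 9.999;          /* numeric value */
    Pet      = 'my cat';       /* character value */
    Visit    = '19JUL1984'D;   /* date literal, stored as days since January 1, 1960 */
    VisitRaw = Visit;          /* same value, printed without a date format */
RUN;

PROC PRINT DATA=FormatDemo NOOBS;
    FORMAT Amount 5.3          /* width 5 including the decimal point, 3 decimals */
           Pet    $UPCASE6.    /* character format: leading dollar sign, width only */
           Visit  DATE9.;      /* display the stored number as a calendar date */
RUN;
```

Comparing the Visit and VisitRaw columns in the output shows the same stored value with and without a date format.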

For example, the following program combines several of these statements:

PROC SORT DATA=SASHELP.BASEBALL OUT=SortedByTeam;
    BY Team;
RUN;

PROC PRINT DATA=SortedByTeam;
    TITLE "86's MLB Players";
    BY Team;
    SUM nAtBat nHits nHome nRuns nRBI nBB nOuts nAssts nError;
    VAR Name nAtBat nHits nHome nRuns nRBI nBB nOuts nAssts nError Salary;
    FORMAT Salary DOLLAR13.2;
RUN;

This program prints observations from the SASHELP.BASEBALL data set after sorting them by Team. For each Team, PROC PRINT prints the values of the variables listed in the VAR statement and the sums of the variables listed in the SUM statement. The Salary values are displayed with the DOLLAR13.2 format.
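The ID statement described earlier does not appear in the program above. A minimal sketch of it, using the SASHELP.CLASS sample data set (chosen here only for illustration), could look like this:

```sas
/* Name replaces the observation number as the leftmost column */
PROC PRINT DATA=SASHELP.CLASS (OBS=5);
    ID Name;
    VAR Age Height Weight;
RUN;
```

Because Name is listed in the ID statement, it does not need to be repeated in the VAR statement.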

Here are some selected standard data formats that are commonly employed:

Character

| Description | Example | Format | Result |
|---|---|---|---|
| Converts character values to upper case. w ranges 1-32767, defaults to 8. | "my cat" | $UPCASE6. | "MY CAT" |
| Writes standard character data; does not trim leading blanks (same as $CHARw.). w ranges 1-32767, defaults to 1. | "my cat" / " my snake" | $8. followed by '*' | "my cat  *" / " my snak*" |

Date, Time, and Datetime

| Description | Example | Format | Result |
|---|---|---|---|
| Writes SAS date values in form ddmmmyy or ddmmmyyyy. w ranges 5-11, defaults to 7. | 8966 | DATE7. / DATE9. | 19JUL84 / 19JUL1984 |
| Writes SAS datetime values in form ddmmmyy:hh:mm:ss.ss. w ranges 7-40, defaults to 16. | 12182 | DATETIME13. / DATETIME18.1 | 01JAN60:03:23 / 01JAN60:03:23:02.0 |
| Writes SAS datetime values in form ddmmmyy or ddmmmyyyy. w ranges 5-9, defaults to 7. | 12182 | DTDATE7. / DTDATE9. | 01JAN60 / 01JAN1960 |
| Writes SAS date values in form dd.mm.yy or dd.mm.yyyy. w ranges 2-10, defaults to 8. | 8966 | EURDFDD8. / EURDFDD10. | 19.07.84 / 19.07.1984 |
| Writes SAS date values in Julian date form yyddd or yyyyddd. w ranges 5-7, defaults to 5. | 8966 | JULIAN5. / JULIAN7. | 84201 / 1984201 |
| Writes SAS date values in form mm/dd/yy or mm/dd/yyyy. w ranges 2-10, defaults to 8. | 8966 | MMDDYY8. / MMDDYY6. | 07/19/84 / 071984 |
| Writes SAS time values in form hh:mm:ss.ss. w ranges 2-20, defaults to 8. | 12182 | TIME8. / TIME11.2 | 3:23:02 / 3:23:02.00 |
| Writes SAS date values in form day-of-week, month-name dd, yy or yyyy. w ranges 3-37, defaults to 29. | 8966 | WEEKDATE15. / WEEKDATE29. | Thu, Jul 19, 84 / Thursday, July 19, 1984 |
| Writes SAS date values in form month-name dd, yyyy. w ranges 3-32, defaults to 18. | 8966 | WORDDATE12. / WORDDATE18. | Jul 19, 1984 / July 19, 1984 |

Numeric

| Description | Example | Format | Result |
|---|---|---|---|
| SAS decides best format; default format for numeric data. w ranges 1-32, defaults to 12. | 1200001 | BEST6. / BEST8. | 1.20E6 / 1200001 |
| Writes numbers with commas. w ranges 2-32, defaults to 6. | 1200001 | COMMA9. / COMMA12.2 | 1,200,001 / 1,200,001.00 |
| Writes numbers with a leading $ and commas separating every three digits. w ranges 2-32, defaults to 6. | 1200001 | DOLLAR10. / DOLLAR13.2 | $1,200,001 / $1,200,001.00 |
| Writes numbers in scientific notation. w ranges 7-32, defaults to 12. | 1200001 | E7. | 1.2E+06 |
| Writes numbers with a leading € and periods separating every three digits. w ranges 2-32, defaults to 6. | 1200001 | EUROX13.2 | €1.200.001,00 |
| Writes numeric data as percentages. w ranges 4-32, defaults to 6. | 0.05 | PERCENT9.2 | 5.00% |
| Writes standard numeric data. w ranges 1-32. | 23.635 | 6.3 / 5.2 | 23.635 / 23.64 |
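A quick way to check how a format renders a value is to write it to the SAS log with a PUT statement. The sketch below reuses example values listed above (8966, 12182, 1200001, and 0.05); the variable names are hypothetical and no data set is created.

```sas
/* DATA _NULL_ runs the step without creating a data set */
DATA _NULL_;
    d = 8966;        /* SAS date value: days since January 1, 1960 */
    t = 12182;       /* SAS time value: seconds since midnight */
    n = 1200001;
    p = 0.05;

    /* Each PUT statement writes one formatted value to the log */
    PUT d DATE9.;
    PUT d JULIAN7.;
    PUT t TIME8.;
    PUT n COMMA9.;
    PUT n DOLLAR13.2;
    PUT p PERCENT9.2;
RUN;
```

The lines written to the log can then be compared against the results listed for each format.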

PROC FORMAT

Occasionally, the standard data formats listed earlier are not enough for a particular requirement, and you need custom formats tailored to your data. For example, consider the following data set:

The data set contains 913 responses from a survey on perceptions and practices of using Wikipedia as a teaching resource conducted among faculty members from two different universities located in Barcelona, Spain. Excluding AGE and YEARSEXP, 51 variables are coded as follows:

  • GENDER: 0=Male; 1=Female
  • DOMAIN: 1=Arts & Humanities; 2=Science; 3=Health Sciences; 4=Engineering & Architecture; 5=Law & Politics
  • UNIVERSITY: 1=UOC (Universitat Oberta de Catalunya); 2=UPF (Universitat Pompeu Fabra)
  • UOC_POSITION and OTHER_POSITION: 1=Professor, 2=Associate, 3=Assistant, 4=Lecturer, 5=Instructor, 6=Adjunct
  • OTHERSTATUS, PhD, and USERWIKI: 0=No; 1=Yes
  • All remaining survey items are 5-point Likert scales: 1=Strongly disagree/Never; 2=Disagree/Rarely; 3=Neither agree or disagree/Sometimes; 4=Agree/Often; 5=Strongly agree/Always

Printing this data set with user-defined formats is very convenient, because the reader no longer needs the code book to interpret the values. In SAS, PROC FORMAT creates custom formats that are later associated with variables in a FORMAT statement. The basic syntax of PROC FORMAT is as follows:

PROC FORMAT;
    VALUE name range-1 = 'formatted-text-1'
               range-2 = 'formatted-text-2'
               range-n = 'formatted-text-n';
RUN;

Here, name is the name of the format you are creating. Note that if the format is for character data, the name must start with a dollar sign ($name). Format names must be unique, can be up to 32 characters long (including the $ for character data), must not start or end with a number, and cannot contain any special characters except underscores.

In the VALUE statement, each range is a value (or range of values) of a variable, and it is mapped to the text given in quotation marks on the right side of the equal sign. The formatted text can be up to 32,767 characters long, but some procedures print only the first 8 or 16 characters.

PROC FORMAT;
    /* -< excludes the upper endpoint, so each age falls into exactly one group */
    VALUE Fmt_AgeGroup     LOW -< 40  = "Under 40"
                           40  -< 65  = "40 to 64"
                           65  - HIGH = "65 and over";
    VALUE Fmt_BinaryAnswer 0 = "No"
                           1 = "Yes";
    VALUE Fmt_University   1 = "UOC"
                           2 = "UPF";
    VALUE Fmt_Position     1 = "Professor"
                           2 = "Associate"
                           3 = "Assistant"
                           4 = "Lecturer"
                           5 = "Instructor"
                           6 = "Adjunct";
    VALUE Fmt_Gender       0 = "Male"
                           1 = "Female";
    VALUE Fmt_Likert       1 = "Strongly disagree / Never"
                           2 = "Disagree / Rarely"
                           3 = "Neither agree or disagree / Sometimes"
                           4 = "Agree / Often"
                           5 = "Strongly agree / Always";
    VALUE Fmt_Domain       1 = "Arts & Humanities"
                           2 = "Science"
                           3 = "Health Sciences"
                           4 = "Engineering & Architecture"
                           5 = "Law & Politics"
                           OTHER = "Others";
RUN;

PROC PRINT DATA=MyData.Wiki4HE;
    TITLE "Wiki4HE";
    FORMAT AGE            Fmt_AgeGroup.
           GENDER         Fmt_Gender.
           DOMAIN         Fmt_Domain.
           PhD            Fmt_BinaryAnswer.
           USERWIKI       Fmt_BinaryAnswer.
           OTHERSTATUS    Fmt_BinaryAnswer.
           UNIVERSITY     Fmt_University.
           UOC_POSITION   Fmt_Position.
           OTHER_POSITION Fmt_Position.
           PU1 -- Exp5    Fmt_Likert.;
RUN;

In the SAS program above, PROC FORMAT defines several user-defined formats (UDFs) that assign labels to numeric codes. Each VALUE statement creates a UDF with a name and mappings between numeric values and corresponding character labels. The keywords LOW and HIGH are used in ranges to indicate the lowest and highest non-missing values, respectively. The OTHER keyword is used to assign a format to any values not listed in the VALUE statement.

Subsequently, PROC PRINT applies the UDFs created by PROC FORMAT: AGE uses Fmt_AgeGroup, GENDER uses Fmt_Gender, and so forth. Note that PU1 -- Exp5 is a name range list: it refers to all variables from PU1 through Exp5, in the order they are stored in the data set.
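All of the formats above apply to numeric codes. For character variables, the format name starts with a dollar sign, as noted earlier. The sketch below illustrates this with the SASHELP.CLASS sample data set, where Sex is stored as 'M' or 'F'; the format name $Fmt_Sex is a hypothetical name chosen for this example.

```sas
PROC FORMAT;
    /* Character format: the name starts with $ and the ranges are quoted strings */
    VALUE $Fmt_Sex 'M'   = 'Male'
                   'F'   = 'Female'
                   OTHER = 'Unknown';
RUN;

PROC PRINT DATA=SASHELP.CLASS (OBS=5) NOOBS;
    VAR Name Sex Age;
    FORMAT Sex $Fmt_Sex.;   /* the trailing period marks Fmt_Sex as a format */
RUN;
```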
