Exploring Data with One-Dimensional Graphs

In statistics, data visualization is the art of representing data in a graphical format, and it goes hand in hand with statistical analysis itself. Graphical methods allow statisticians to condense the large amounts of information in a data set into a clear and informative picture, highlighting outliers, central tendencies, and the distribution of each variable, as well as the relationships between variables. These findings guide statisticians toward the most appropriate statistical tests to run on the data. Then, after conducting their chosen tests, statisticians visualize the results to communicate their findings.

Visualizing Categorical Variables

The main purpose of visualizing a categorical variable is to reveal its frequency distribution. This helps identify which category is the most frequent and which are less common. Two common charts for visualizing a single categorical variable are the pie chart and the bar chart.

Pie Chart

One way to visualize a categorical variable is the pie chart. In a pie chart, the categories are represented by wedges of a circle, and the size of each wedge is proportional to the percentage of individuals in that category. In SAS, you can draw pie charts as follows:

PROC SGPIE DATA=SASHELP.CARS;
TITLE "Pie Chart: Origin Distribution of Cars";
PIE Origin / DATALABELDISPLAY=ALL DATALABELLOC=INSIDE;
RUN;

PROC SGPIE DATA=SASHELP.CARS;
TITLE "Donut Chart: Origin Distribution of Cars";
DONUT Origin / DATALABELDISPLAY=(CATEGORY PERCENT) DATALABELLOC=CALLOUT;
RUN;

PROC SGPIE is a procedure specifically designed to create pie charts. In the code above, the PIE statement specifies the categorical variable whose relative frequencies PROC SGPIE visualizes. Alternatively, you can use the DONUT statement if you prefer your pie chart to have a hole in the center. After specifying the categorical variable of interest in either a PIE or DONUT statement, you can include additional options following a slash (/). Here are some commonly used options:

  • DATALABELDISPLAY = ALL | NONE | (content-options):
    • ALL: Display all available information
    • NONE: Do not display slice labels
    • (content-options): a space-separated list of one or more of the following options enclosed in parentheses:
      • CATEGORY: Displays the category value
      • PERCENT: Displays the response or category percentage
  • DATALABELLOC = INSIDE | OUTSIDE | CALLOUT:
    • INSIDE: Locates the slice labels inside the pie slices
    • OUTSIDE: Locates the slice labels outside of the pie circumference
    • CALLOUT: Locates the slice labels outside of the pie circumference and draws a line from the label to its slice

However, neither a pie chart nor a donut chart is generally the most effective way to visualize a frequency distribution. One reason to avoid pie charts is that human eyes are not very good at comparing angles. In particular, when the relative frequencies of the categories are similar to each other, it is very hard to judge which category occurs more often than the others. Besides, pie charts become cluttered and hard to interpret when there are too many categories. Typically, when more than 4 categories are to be displayed, it is highly recommended to avoid drawing a pie chart.

Bar Chart

A better alternative to the pie chart is the bar chart. It visualizes a frequency distribution using rectangular bars. Each bar represents a specific category, and the height of the bar is proportional to the frequency of that category. In SAS, you can create a bar chart using the VBAR statement with PROC SGPLOT:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Origins";
VBAR Origin;
RUN;

Following the VBAR statement, you can add some options to customize your bar chart. Here are some popular options:

  • BARWIDTH = n: Sets the width of bars. Available values range from 0.1 to 1 with a default of 0.8.
  • TRANSPARENCY = n: Specifies the degree of transparency for the bars. The value of n must be between 0 (default) and 1, with 1 being completely transparent and 0 being completely opaque.
  • DATALABEL = variable-name: Displays a label for each bar. If you specify a variable name, then the variable values will be used. Otherwise, SAS will determine appropriate values.
  • MISSING: Includes a bar for missing values.
  • GROUP = variable-name: Specifies a variable used to group the data.
  • GROUPDISPLAY = type: Specifies how to display grouped bars, either STACK (default option) or CLUSTER.
  • RESPONSE = variable-name: Specifies a numeric variable to be summarized.
  • STAT = statistic: Specifies a statistic, either FREQ, MEAN, or SUM. FREQ is the default if there is no response variable. SUM is the default when you specify a response variable.

For example:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Origins";
VBAR Origin / BARWIDTH=0.5 TRANSPARENCY=0.7;
RUN;
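
The RESPONSE= and STAT= options listed above can be sketched in the same way. The following is only an illustration, assuming that the MSRP variable of SASHELP.CARS is the numeric variable you want to summarize; the title text is made up for this example:

```sas
/* RESPONSE= names the numeric variable to summarize within each category, */
/* and STAT=MEAN replaces the default SUM with the mean.                   */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Mean MSRP by Origin";
VBAR Origin / RESPONSE=MSRP STAT=MEAN DATALABEL;
RUN;
```

Here DATALABEL is given without a variable name, so SAS labels each bar with the computed statistic.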

Among the options mentioned above, I strongly argue against using GROUP and GROUPDISPLAY. Although SAS provides them, these two options tend to make your bar chart less clear and even misleading! For example, let's take a look at the bar chart generated by the following PROC step:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Bad Practice 1: Unclear Plot";
VBAR Origin / GROUP=DriveTrain;
RUN;

This bar chart attempts to describe the distributions of both the drivetrains and the origins of cars at the same time. However, while it is still possible to compare the number of cars by origin, stacking the DriveTrain groups makes it difficult to see the relative frequencies within each origin. The baseline of each stacked segment sits at a different level: for example, the red segment representing front-wheel drive Asian cars starts at a height of 34, while the red segment representing front-wheel drive American cars starts at a height of 22. As a result, it is very hard to see how the relative frequencies of the subgroups differ across categories. The underlying counts are:

            DriveTrain
Origin      All    Front    Rear    Total
Asia         34       99      25      158
Europe       36       37      50      123
USA          22       90      35      147
Total        92      226     110      428

Adding the GROUPDISPLAY=CLUSTER option does not solve the issue. For example, let's take a look at the bar chart below:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Bad Practice 2: Misleading Plot";
VBAR Origin / GROUP=DriveTrain GROUPDISPLAY=CLUSTER;
RUN;

This chart draws a separate bar for each DriveTrain group, which makes it hard to compare the total number of cars across the Origin categories. Moreover, this chart is somewhat misleading. As I mentioned earlier, each rectangular bar in a bar chart represents a distinct category or group. Thus, the bars in a bar chart must be separated from each other. However, GROUPDISPLAY=CLUSTER removes the gaps between the bars of the different DriveTrain groups. Consequently, the output looks like three histograms of some continuous variable, one for each Origin category.

One key principle of creating a chart is to convey only one piece of information per chart. Charts should be focused and clear in what they are trying to communicate. So, in this example, instead of stacking or clustering bars, it would be clearer to draw two separate charts: one depicting the relative frequency of DriveTrain within each Origin, and another showing the number of car models for each Origin.
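
A minimal sketch of the two-chart approach follows. It assumes that the OUTPCT option of PROC FREQ is used to obtain row percentages (PCT_ROW) and that PROC SGPANEL, introduced later in this article, draws one cell per Origin; the data set name OriginDrive and the titles are made up for illustration:

```sas
/* Chart 1: number of car models for each Origin */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Origin";
VBAR Origin;
RUN;

/* Row percentages: share of each DriveTrain within an Origin. */
/* OUTPCT adds PCT_ROW and PCT_COL to the output data set.     */
PROC FREQ DATA=SASHELP.CARS NOPRINT;
TABLES Origin*DriveTrain / OUT=OriginDrive OUTPCT;
RUN;

/* Chart 2: one cell per Origin; bar heights are PCT_ROW (0 to 100) */
PROC SGPANEL DATA=OriginDrive;
TITLE "DriveTrain Share within Each Origin";
PANELBY Origin / NOVARNAME COLUMNS=3;
VBAR DriveTrain / RESPONSE=PCT_ROW;
ROWAXIS LABEL="Percent within Origin";
RUN;
```

Each chart now answers a single question, and within every cell the bars share a common baseline, so they can be compared directly.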

Displaying a Large Number of Categories

When there's a large number of categories to display, using a vertical bar chart is generally not advisable. For example, let's take a look at the following bar chart:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Manufacturer";
VBAR Make;
RUN;

Observe that the vertical bar chart becomes cluttered and hard to read because there are too many categories. In such cases, a horizontal bar chart is an excellent alternative to a vertical bar chart:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Manufacturer";
HBAR Make;
RUN;
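
With this many categories, ordering the bars by frequency can make the chart easier to scan. The sketch below assumes that your SAS release supports the CATEGORYORDER= option of the HBAR statement; the title is made up for illustration:

```sas
/* CATEGORYORDER=RESPDESC sorts the bars by descending response value, */
/* which is the frequency here because no RESPONSE= is specified.      */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Number of Cars by Manufacturer (Sorted)";
HBAR Make / CATEGORYORDER=RESPDESC;
RUN;
```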

Displaying Relative Frequencies

We often want to display the relative frequencies of a categorical variable rather than the raw frequencies. Depending on your SAS release, PROC SGPLOT alone may not compute relative frequencies or percentages directly. However, we can always achieve this by combining PROC SGPLOT with PROC FREQ as follows:

PROC FREQ DATA=SASHELP.CARS NOPRINT;
TABLES DriveTrain / OUT = FreqOut;
RUN;

DATA RelativeFreq;
SET FreqOut;
LABEL Pct = 'Percent';
FORMAT Pct PERCENT.;
Pct = PERCENT / 100;
RUN;

PROC SGPLOT DATA=RelativeFreq;
VBAR DriveTrain / RESPONSE = Pct DATALABEL;
YAXIS GRID DISPLAY = (NOLABEL);
XAXIS DISPLAY = (NOLABEL);
RUN;

In this example, PROC FREQ generates a new data set, FreqOut. This data set has three variables: DriveTrain, COUNT, and PERCENT, where DriveTrain is the categorical variable of interest. The subsequent DATA step then creates a new variable, Pct, by dividing PERCENT (which is on a 0 to 100 scale) by 100, and outputs another data set, RelativeFreq. Note that I specified the format for Pct through FORMAT Pct PERCENT.; so that the values are displayed as percentages. If you want to display relative frequencies (proportions) rather than percentages, comment out this line.

Lastly, PROC SGPLOT draws a vertical bar chart of DriveTrain using the newly created RelativeFreq. Here, the GRID option in the YAXIS statement adds grid lines at the Y-axis tick marks, and DISPLAY = (NOLABEL) in the YAXIS and XAXIS statements suppresses the axis labels on the Y and X axes.
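
If your release's VBAR statement accepts STAT=PERCENT (the SAS 9.4 documentation lists it; treat this as an assumption to check against your own installation), the same kind of chart can be sketched in a single step, without PROC FREQ:

```sas
/* Assumes STAT=PERCENT is available for VBAR in your SAS release. */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Relative Frequency of Drive Trains";
VBAR DriveTrain / STAT=PERCENT DATALABEL;
RUN;
```

The PROC FREQ route shown earlier remains useful when you want full control over the computed values and their format.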

Visualizing Continuous Variables

Unlike categorical variables, continuous variables are those that can take on any value within a given range of real numbers. One commonly employed graph to visualize the distribution of a continuous variable is the histogram. To create a histogram with PROC SGPLOT, use a HISTOGRAM statement with this basic form:

PROC SGPLOT DATA=MyData;
HISTOGRAM variable-name / options;
RUN;

Possible options include: 

  • BINSTART = n: Specifies the midpoint for the first bin.
  • BINWIDTH = n: Specifies the bin width (in units of the horizontal axis, not the variable). This option is ignored if you specify the NBINS= option.
  • NBINS = n: Specifies the number of bins. Based on the specification, SAS determines the bin width.
  • SCALE = scaling-type: Specifies the scale for the vertical axis, either PERCENT (default), COUNT, or PROPORTION.
  • SHOWBINS: Places tick marks at the midpoints of the bins. 
  • TRANSPARENCY = n: The value of n must be between 0 (default) and 1, with 1 being completely transparent and 0 being completely opaque.

For example:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'Distribution of Car Weights';
HISTOGRAM Weight / SCALE=COUNT;
RUN;

In the plot, unlike in the bar charts, observe that the bars are drawn side by side without any gaps. This is because a histogram represents the distribution of a continuous variable, which can take any value in a given range.

By default, SAS automatically determines an appropriate number of bins. However, as listed earlier, you can specify it using the NBINS= option. While there is no strict rule for determining the number of bins, a general suggestion is to select the smallest number \(k\) such that \(2^k \ge n\), where \(n\) is the number of data points.
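
As a rough sketch of this rule of thumb, the step below computes the smallest such \(k\) for SASHELP.CARS and passes it to NBINS=. The macro variable name nbins and the title are made up for illustration. With the 428 observations in this data set, \(2^8 = 256 < 428 \le 512 = 2^9\), so \(k = 9\):

```sas
DATA _NULL_;
/* NOBS= is filled in at compile time, so no rows need to be read */
IF 0 THEN SET SASHELP.CARS NOBS=n;
k = CEIL(LOG2(n));            /* smallest k with 2**k >= n */
CALL SYMPUTX('nbins', k);     /* hypothetical macro variable name */
PUT n= k=;                    /* writes n and k to the log */
STOP;
RUN;

PROC SGPLOT DATA=SASHELP.CARS;
TITLE "Distribution of Car Weights (&nbins Bins)";
HISTOGRAM Weight / NBINS=&nbins;
RUN;
```

Treat the result as a starting point rather than a rule; it is usually worth trying a few nearby values of NBINS= as well.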

In addition to histograms, you can also plot density curves for your data. Instead of binning the observed data points and representing the frequencies as consecutive bars, a density curve visualizes the distribution of a continuous variable with a smooth curve (known as the probability density function \(f_X(x)\)), so that the probability that a data point falls in a given range is \(P(a \le X \le b)=\int_{a}^{b} f_X(x)\,dx\), where \(a\) is the lower bound and \(b\) is the upper bound of the range. Note that density is not equal to proportion or relative frequency! Unlike relative frequencies, the probability density itself can be greater than 1, as long as the total area under the curve equals 1. An easy example is the probability density function of a continuous uniform distribution defined on \([0, 1/2]\).
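
To make the uniform example concrete, here is a short worked calculation; the interval \([0.1, 0.3]\) is chosen purely for illustration. For \(X\) uniform on \([0, 1/2]\), the density is constant and must integrate to 1 over an interval of length \(1/2\), so:

\[
f_X(x) = \begin{cases} 2, & 0 \le x \le 1/2 \\ 0, & \text{otherwise} \end{cases}
\qquad
\int_{0}^{1/2} 2\,dx = 1,
\qquad
P(0.1 \le X \le 0.3) = \int_{0.1}^{0.3} 2\,dx = 0.4 .
\]

The density equals 2 everywhere on the interval, which is greater than 1, yet every probability computed from it stays between 0 and 1.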

You can draw a density curve with the DENSITY statement in a PROC SGPLOT as follows:

PROC SGPLOT DATA=MyData;
DENSITY variable-name / options;
RUN;

Commonly employed options are:

  • TYPE = distribution-type: Specifies the type of distribution curve, either NORMAL (the default) or KERNEL.
  • TRANSPARENCY = n: Specifies the degree of transparency for the density curve. The value of n must be between 0 (default) and 1, with 1 being completely transparent and 0 being completely opaque.

For example:

PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'Distribution of Car Weights';
HISTOGRAM Weight / SCALE=COUNT;
DENSITY Weight;
DENSITY Weight / TYPE=KERNEL TRANSPARENCY=0.5;
RUN;

Visualizing Relationships between Variables

So far, we have seen how to visualize the distribution of individual variables, but data visualization extends beyond univariate analysis. Indeed, there are many graphs that can effectively capture the relationship between two or more variables. In this section, let's explore how to visualize the relationships between variables in different cases. 

Relationships between Two Variables

Categorical Variable vs. Continuous Variable

A box plot, also called a box-and-whisker plot, is a visual tool commonly used to understand how the distribution of a continuous variable (response) differs across the levels of a categorical variable (factor).

Let's start with a univariate box plot.

PROC SGPLOT DATA=SASHELP.CARS;
VBOX Weight;
RUN;

The VBOX statement of PROC SGPLOT draws a box plot for the specified continuous variable. In the plot, the height of the box represents the interquartile range of the variable, that is, the difference between the 75th percentile and the 25th percentile. Thus, the two ends of the box indicate the 1st and 3rd quartiles of the variable. Similarly, the line inside the box represents the median.

By default, whiskers cannot be longer than 1.5 times the length of the box. Any points beyond the whiskers are marked with circles and are often considered potential outliers that need to be verified. Lastly, the diamond marker indicates the mean of the variable.
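
If you want to check the numbers behind the box, the following sketch computes the quartiles, the interquartile range, and the 1.5 x IQR limits for Weight with PROC MEANS. The data set names WeightStats and Fences and the variable names LowerFence and UpperFence are made up for illustration, and the quartile definition used by PROC MEANS is assumed to be close to, though not necessarily identical with, the one used by the plot:

```sas
/* Quartiles and interquartile range of Weight */
PROC MEANS DATA=SASHELP.CARS NOPRINT;
VAR Weight;
OUTPUT OUT=WeightStats MEAN=Mean MEDIAN=Median Q1=Q1 Q3=Q3 QRANGE=IQR;
RUN;

/* Limits beyond which points are flagged as potential outliers */
DATA Fences;
SET WeightStats;
LowerFence = Q1 - 1.5 * IQR;
UpperFence = Q3 + 1.5 * IQR;
RUN;

PROC PRINT DATA=Fences NOOBS;
VAR Mean Median Q1 Q3 IQR LowerFence UpperFence;
RUN;
```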

In a VBOX statement, the CATEGORY= option following a slash (/) specifies a categorical variable by which you want to see how the distribution of the continuous variable changes. For example: 

PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'Distribution of Car Weights by Type';
VBOX Weight / CATEGORY=Type;
RUN;

In the plot, we can visually check that the weights of hybrid cars are, on average, smaller than those of the others. To further enhance your box plot, you can also include the following options: 

  • EXTREME: Specifies that the whiskers should extend to the minimum and maximum, so that the candidate outliers will not be identified.
  • GROUP = variable-name: Specifies a second categorical variable. One box plot will be created for each value of this variable within the categorical variable.
  • MISSING: Includes a box for missing values of the group or category variable.
  • TRANSPARENCY = n: Specifies the degree of transparency for the box plot. The value of n must be between 0 (the default) and 1, with 1 being completely transparent and 0 being completely opaque.
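
As an illustration of these options, the sketch below adds a second categorical variable with GROUP= and makes the boxes slightly transparent. The choice of Origin and DriveTrain and the title are assumptions made for this example:

```sas
/* One box per DriveTrain within each Origin category */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'Car Weights by Origin and Drive Train';
VBOX Weight / CATEGORY=Origin GROUP=DriveTrain TRANSPARENCY=0.3;
RUN;
```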

Similar to bar charts, when there are too many categories to be displayed, you may consider using HBOX, instead of VBOX: 

PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'City MPG by Manufacturer';
HBOX MPG_City / CATEGORY=Make;
RUN;

Continuous Variable vs. Continuous Variable

When you are interested in the relationship between two continuous variables, scatter plots can provide an effective solution. You can create a scatter plot using the SCATTER statement in the PROC SGPLOT. For example: 

PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'MPG City vs. Highway';
SCATTER X=MPG_City Y=MPG_Highway;
RUN;

In the resulting plot, we can observe a positive linear relationship between MPG_City and MPG_Highway. Statisticians typically put the explanatory variable on the x-axis (the X= specification) and the response variable on the y-axis (the Y= specification). So, we are often tempted to go further and say that the response is expected to increase with each one-unit increase in the explanatory variable. Arguments of this kind, however, require confirmatory data analysis such as regression analysis.
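
As a first step in that direction, the following sketch overlays a fitted straight line with the REG statement of PROC SGPLOT and then fits the same simple linear model with PROC REG. It is only an illustration, not a full regression analysis, and the title is made up for this example:

```sas
/* Scatter plot with a fitted regression line */
PROC SGPLOT DATA=SASHELP.CARS;
TITLE 'MPG City vs. Highway with Fitted Line';
REG X=MPG_City Y=MPG_Highway;
RUN;

/* Estimated intercept and slope for the same model */
PROC REG DATA=SASHELP.CARS;
MODEL MPG_Highway = MPG_City;
RUN;
QUIT;
```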

Just like other plots in PROC SGPLOT, you can add some options to customize plots to your needs. Possible options include:

  • DATALABEL = variable-name: Displays a label for each data point. If you specify a variable name, the values of that variable will be used as labels. If you do not specify a variable name, then the values of the Y variable will be used. Typical specification of this option is ID variable in the data set.
  • GROUP = variable-name: Specifies a variable to be used for grouping data.
  • NOMISSINGGROUP: Specifies that observations with missing values for the group variable should not be included.
  • TRANSPARENCY = n: Specifies the degree of transparency for the markers. The value of n must be between 0 (the default) and 1, with 1 being completely transparent and 0 being completely opaque.

For example, SASHELP.BASEBALL has 24 attributes of MLB players. Let's suppose that we want to draw a scatter plot comparing the number of runs in 1986 (nRuns) with the salary in 1987 (Salary). In this case, adding the DATALABEL= option would be helpful:

PROC SGPLOT DATA=SASHELP.BASEBALL;
TITLE 'Number of Runs vs. Salary';
SCATTER X=nRuns Y=Salary / DATALABEL=Name;
RUN;
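
The GROUP= and TRANSPARENCY= options can be sketched on the same data. This example assumes that the League variable of SASHELP.BASEBALL is the grouping variable of interest; the title is made up for illustration:

```sas
/* Markers are distinguished by League; partial transparency */
/* helps when many points overlap.                           */
PROC SGPLOT DATA=SASHELP.BASEBALL;
TITLE 'Number of Runs vs. Salary by League';
SCATTER X=nRuns Y=Salary / GROUP=League TRANSPARENCY=0.4;
RUN;
```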

Relationships between More Than Two Variables

SASHELP.IRIS contains a collection of 150 iris flowers from three different species: Iris Setosa, Iris Versicolor, and Iris Virginica. Each flower in the data set is described by sepal length, sepal width, petal length, and petal width, all measured in millimeters. Now, let's consider the following scatter plot:

PROC SGPLOT DATA=SASHELP.IRIS;
TITLE 'Iris: Sepal Length vs. Sepal Width';
SCATTER X=SepalLength Y=SepalWidth / GROUP=Species;
RUN;

In the plot, we see that the three species can be distinguished by some combination of SepalLength and SepalWidth. That is, if the categorical variable, Species, is the response of your interest, you can imagine a function which takes sepal length and width as its arguments and returns different Iris species in your later modeling.
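
To look at all four measurements at once, one option is a scatter plot matrix. The sketch below uses PROC SGSCATTER, a separate procedure from PROC SGPLOT that is not otherwise covered in this article, so treat it as an optional extra; the title is made up for illustration:

```sas
/* One scatter plot for every pair of measurements, grouped by Species */
PROC SGSCATTER DATA=SASHELP.IRIS;
TITLE 'Iris: Scatter Plot Matrix by Species';
MATRIX SepalLength SepalWidth PetalLength PetalWidth / GROUP=Species;
RUN;
```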

Creating Paneled Plots

PROC SGPANEL is closely related to the SGPLOT procedure. It produces nearly all the same types of plots; the only difference is that, unlike PROC SGPLOT, PROC SGPANEL produces multi-celled plots. PROC SGPANEL produces a separate cell for each combination of values of the classification variables that you specify. Each of those cells uses the same variables on its X and Y axes.

The syntax for PROC SGPANEL is almost identical to that of PROC SGPLOT. So, it is easy to convert a PROC SGPLOT step to PROC SGPANEL by making a couple of changes to your code: replace the keyword SGPLOT with SGPANEL, and add a PANELBY statement like this:

PROC SGPANEL;
PANELBY variable-list / options;
<plot-statements>
RUN;

Note that the PANELBY statement must appear before any statements that create plots. Possible options include:

  • COLUMNS = n: Specifies the number of columns in the panel.
  • MISSING: Specifies that observations with missing values for the PANELBY variable should be included.
  • NOVARNAME: Removes the variable name from cell headings.
  • ROWS = n: Specifies the number of rows in the panel.
  • SPACING = n: Specifies the number of pixels between rows and columns in the panel. The default is 0.
  • UNISCALE = value: Specifies which axes will share the same range of values. Possible values are COLUMN, ROW, and ALL (the default).

For example:

PROC SGPANEL DATA=SASHELP.IRIS;
TITLE1 'Petal Length vs. Petal Width';
TITLE2 'By Species';
PANELBY Species / NOVARNAME COLUMNS=3 SPACING=5;
SCATTER X=PetalLength Y=PetalWidth;
RUN;

When creating paneled plots, setting COLUMNS= or ROWS= to the number of categories in the PANELBY variable makes it possible to compare the cells along the Y or X variable, respectively. This practice often shows how the relationship between the two variables differs across the levels of the categorical variable. This is what is referred to as an interaction. For example:

PROC SGPANEL DATA=SASHELP.CARS;
TITLE1 'MPG (City) vs. Drive-train';
TITLE2 'By Origin';
PANELBY Origin / NOVARNAME ROWS=3 UNISCALE=ROW SPACING=5;
HBOX MPG_City / CATEGORY = DriveTrain;
RUN;
