DATA Step: Combining Two Data Sets

In data analysis, when information about a subject is distributed across multiple data sets, it is often necessary to combine them into a single data set. For example, consider a scenario where one data set contains the purchase history of customers, while another data set includes the demographic information of each customer. In this scenario, purchase history alone wouldn't tell you much about why customers buy what they do. But by merging it with demographic data, you can see if age, location, or other factors influence purchase behavior. 

Similarly, in projects involving ongoing observations, you might have to append new observations for data management. For example, consider a project tracking daily weather data - each day's information should be appended to the previous data to create a growing set of records. 

In a SAS DATA step, there are three statements to combine two data sets: SETMERGE, and UPDATEIn this guide, we will explore different techniques to combine data sets with some examples. Let's get started!

The SET Statements

Vertical Concatenation of Two Data sets

In a DATA step, the SET statement brings all observations from another SAS data set. Taking advantage of this, you can vertically concatenate two data sets using the SET statement. For example:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ; DATA SetTwo; INPUT VarC VarD; CARDS; 9 15 10 16 11 17 12 18 13 19 14 20 ;

DATA VerticalConcat;
SET SetOne SetTwo;
RUN;

When multiple data sets are specified in a single SET statement, SAS vertically concatenates them. To be more specific, the SET statement brings all observations from the first data set, which in this example is SetOne, followed by all observations from the second data set, SetTwo. Note that you can specify as many data sets as needed.

In this example, observe that the two data sets don't share any common variable. The SET statement do not match any observation and prioritize to contain all column values. So, in this example, the output will have four variables and the variables where it cannot find any matches will be missing.

What if there is at least one common variable between the two data sets? For example:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ;
DATA SetThree; INPUT VarA VarB VarE; CARDS; 21 24 29 22 25 30 23 26 31 1 27 32 2 28 33 ;
DATA VerticalConcat2;
SET SetOne SetThree;
RUN;

In this example, SetOne and SetThree share two variables: VarA and VarB. Consequently, the resulting data set will include three variables: VarA, VarB, and VarE. Initially, the SET statement brings all observations from SetOne with missing values for VarE and includes them in the output data set. Subsequently, it appends all observations from SetThree to the output. Thus, the output data set will be a 9 by 3 table.


Horizontal Concatenation of Two Data Sets

What if we use a separate SET statement for each data set? Remember that SAS brings each row from its reference into PDV and executes DATA step statements one by one, populating PDV. The populated PDV is then appended as an observation to the new data set that the current DATA step is creating.

This also applies to a DATA step with multiple SET statements. That is, to make the first observation in the new data set, the first SET statement brings the first observation and the second SET statement brings the second observation. Combining these two observations, PDV is added to the new data set as a single observation. For example:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ; DATA SetTwo; INPUT VarC VarD; CARDS; 9 15 10 16 11 17 12 18 13 19 14 20 ;

DATA HorizontalConcat;
SET SetOne;
SET SetTwo;
RUN;

Let's consider another example. When creating the first observation for HorizontalConcat2, SET SetOne; brings two column values from the first observation of SetOne. Then in the next line, SET SetThree; brings VarA, VarB, and VarE values from SetThree. Here, the previous values 1 and 5 are replaced by 21 and 24. Then VarE is horizontally appended to the PDV. This process is repeated until the end of the last row for SetOne. Consequently, HorizontalConcat2 will have three columns and four observations.

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ;
DATA SetThree; INPUT VarA VarB VarE; CARDS; 21 24 29 22 25 30 23 26 31 1 27 32 2 28 33 ;
RUN;
DATA HorizontalConcat2;
SET SetOne;
SET SetThree;
RUN;


Interleaving Observations

In SAS, the BY statement controls how data set observations are processed. Particularly, when it is used with a SET statement that vertically concatenates two data sets, the BY statement outputs observations sorted by the BY variable values. For example, let's consider the following SAS program:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ;
DATA SetThree; INPUT VarA VarB VarE; CARDS; 21 24 29 22 25 30 23 26 31 1 27 32 2 28 33 ;
RUN;
PROC SORT DATA=SetThree;
BY SetThree;
RUN;

DATA Interleaving; SET SetOne SetThree; BY VarA; RUN;

In this example. PROC SORT first rearranged SetThree by VarA; both SetOne and SetThree are sorted by VarA. This makes sure that SAS can compare VarA values in the two observations and interleave them in the subsequent DATA step. 


MERGE Statements

One-to-one Match Merging

In a DATA step, the MERGE statement reads the n-th observations from the both data sets and concatenates them horizontally. If there is an existing value for the same variable, the first data set values are replaced by the second data set values. This iteration continues up until it reads all observations in the two data sets. For example:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ;
DATA SetTwo; INPUT VarC VarD; CARDS; 9 15 10 16 11 17 12 18 13 19
14 20 ;
RUN;

DATA OneToOneMerge; MERGE SetOne SetTwo; RUN;

One-to-Many Match Merging

In a more common scenario, you would merge the two data sets matching common variable values. For example:

DATA SetOne; INPUT VarA VarB; CARDS; 1 5 2 6 3 7 4 8 ;
DATA SetThree; INPUT VarA VarB VarE; CARDS; 21 24 29 22 25 30 23 26 31 1 27 32 2 28 33 ;
RUN;

DATA MatchMerge; SET SetOne SetThree; BY VarA; RUN;

In the output data set, observe that the two data sets are merged by VarA. Particularly, when the BY variable values are matching, the first data set is replaced by the second data set. You can think of it as overlapping. On the other hand, when there is no matching BY variable values, the observations are added without any overlap.

UPDATE Statement

When combining two data sets using the MERGE statement, any common variables will take the values from the latter data setThis happens even when the common variables in the latter data set contains missing values. However, in a more common data management scenarios, we would like to keep the former data values when the associated latter data values are missing. 

In such scenarios, you might consider using the UPDATE statement. The UPDATE statement updates existing observations in the master (former) data set with corresponding observations from the transaction (latter) data set. In the process, if any values are missing in the transaction, unlike MERGE statement, the UPDATE statement will keep the values in the master. For example:

DATA MasterSet; INPUT VarA VarB VarC $; CARDS; 1 11 C1 2 12 C2 3 13 C3 4 14 C4 ; DATA TransactSet; INPUT VarA VarB VarE VarC $; CARDS; 5 15 21 C5
6 . 22 C6
7 17 23 C7
8 18 . C8
1 . 24 .
2 19 25 .
3 20 26 C3
;

PROC SORT DATA=TransactSet;
BY VarA;
RUN;

DATA UpdateMaster;
UPDATE MasterSet TransactSet;
BY VarA;
RUN;

Post a Comment

0 Comments