Random Sampling

The term "statistics" comes from the Latin word, "status," which means "political state" or "government." This connection reflects the field's historical roots in Renaissance Italy, where bureaucrats faced challenges in conducting full-scale censuses with limited resources. Back in the era, there were no computers and only a limited amount of administrative power was available.

To get around these limitations, they collected representative samples from the population of interest and drew inferences about the whole. Collecting data from a representative portion of the population was a feasible way to obtain the desired demographic information. In Statistics, the unknown (but fixed) values we aim to estimate for the whole population are called parameters. For example, say you want to figure out the average annual income in a country for the last year. The true value calculated from the entire population (everyone who lived in the country last year) is the parameter of interest. Due to time and resource constraints, suppose you decide to collect a sample of the population. The corresponding value calculated from your collected sample is called a statistic.

However, this approach raises another issue: how do you collect samples that represent the population? If your sample does not represent the entire population, it will introduce bias, which refers to systematic error where the statistic deviates from the parameter. For example, to figure out a country's annual income for the last year, surveying only the people living today would not produce a representative sample: some individuals who recently passed away earned income last year, but your survey systematically excludes them.

To avoid bias as much as possible, you need to adhere to the key principle of these sampling methods: samples should be selected at random from the population. That is, each sampling unit has an equal chance of being selected, so that no sampling unit is systematically favored over another. Indeed, the sampling methods introduced in this blog post, simple random sampling, stratified sampling, cluster sampling, and systematic sampling, all adhere to this principle. Let's see how they work and verify whether they give every sampling unit an equal chance of being selected!

Simple Random Sampling

Simple random sampling may be one of the most misunderstood sampling methods. People often incorrectly say, "It can be achieved by simply giving each member of a population an equal chance of being selected." That's not quite true! The correct definition of simple random sampling is that each possible sample (not each member or sampling unit) has an equal chance of being selected! In fact, as mentioned earlier, giving each sampling unit the same chance of selection is required for any sampling method to be considered unbiased.

Simple random sampling goes one step further and gives each possible sample the same chance of being selected. Ensuring this guarantees the basic principle of random sampling. For example, consider a population of 10 members: \(A, B, C, D, E, F, G, H, I, J\). From this population, you want to draw a sample of 2. In this scenario, there are \(_{10}C_{2} = 45\) possible samples. Using SAS, you can list every possible sample as follows:

%LET population_size = 10; /* Number of population members */
%LET sample_size = 2; /* Number of samples */

PROC FORMAT;
VALUE SampleCode
1 = "A" 2 = "B" 3 = "C" 4 = "D" 5 = "E"
6 = "F" 7 = "G" 8 = "H" 9 = "I" 10 = "J";
RUN;

/* List of all possible samples */
DATA SampleList_&population_size.C&sample_size(DROP=i j);
DO i = 1 TO (&population_size - 1);
   DO j = (i + 1) TO &population_size;
      Select1 = PUT(i, SampleCode.);
      Select2 = PUT(j, SampleCode.);
      Sample = CATX("", Select1, Select2);
      OUTPUT;
   END;
END;
RUN;

PROC PRINT DATA=SampleList_&population_size.C&sample_size;
TITLE "List of All Possible Samples";
RUN;

In the DATA step above, observe that the loop over i iterates only up to (&population_size - 1), not &population_size. Also notice that the loop over j starts from i + 1, not i. These two adjustments reflect two key facts about combinations: the order of the sampling units in a sample doesn't matter, and we are sampling without replacement. That is, a sample \([A,B]\) is considered equivalent to \([B,A]\), and you do not put a sampling unit back once it has been selected.

To make every possible sample have an equal chance of being selected, we should not return any sampling unit to the pool once it is selected. In this example, we are sampling 2 members from a population of 10; think of it as 2 available spots for 10 candidates. Say \(A\) is selected. Since \(A\) can fill either of the two spots, its chance of entering the sample is \(2/10\). If it is not returned to the candidate pool for the second draw, the remaining 9 letters each have an equal chance of \(1/9\) of filling the other spot. Say \(B\) is drawn next. Multiplying the two probabilities (the second being conditional on the first), the sample \([A,B]\) is selected with probability \(2/10 \times 1/9 = 1/45\). Thus, all 45 possible samples are equally likely to be selected, each with probability \(1/45\).

On the other hand, sampling with replacement is not appropriate here. If you return the selected candidate \(A\) to the selection pool for the second draw, the probability that a sample such as \([A,B]\) is selected becomes \(2/10 \times 1/10 = 2/100\), which is smaller than it should be. In most practical scenarios, whether you sample with or without replacement doesn't pose a significant issue, particularly when the population size, conventionally denoted \(N\), is large enough. However, it is good to know that simple random sampling technically should be conducted without replacement.
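As a quick numerical check of the arithmetic above, the following short DATA step (a minimal sketch; the data set name ProbCheck is arbitrary) uses the COMB function to count the possible samples and compares the two selection probabilities, writing the results to the log:

DATA ProbCheck;
NumSamples = COMB(10, 2);     /* 45 possible samples of size 2 from 10 */
ProbWithout = (2/10) * (1/9); /* without replacement: 1/45 */
ProbWith = (2/10) * (1/10);   /* with replacement: 2/100 */
PUT NumSamples= ProbWithout= ProbWith=;
RUN;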

Now, from the list of all possible samples, we randomly select one. Using SAS, you can achieve this as follows:

%LET seed = 36830;
%LET ntrial = 1;

DATA Sampling(DROP=i);
SelectedSample = INT(RANUNI(&seed) * n) + 1;
SET SampleList_&population_size.C&sample_size POINT = SelectedSample NOBS = n;
i + 1;
Trial = i;
IF i > &ntrial THEN STOP;
RUN;

PROC PRINT DATA=Sampling;
TITLE "Selected Samples";
RUN;

In the DATA step above, the RANUNI function takes a seed and generates a random number from a uniform distribution, \(Unif[0,1]\). A few additional operations then convert the generated random number into an integer ranging from 1 to the number of possible samples (n). This converted random number is temporarily stored in SelectedSample. Subsequently, the SET statement assigns the number of observations (45) to n via the NOBS= option and reads the row of the list of all possible samples whose row number equals SelectedSample.

So, the newly created data set Sampling will have the selected sample from each trial as its observations. The DATA step loop will terminate once the specified number of trials is reached.

This time, because we first constructed the list of all possible samples and then select one sample from the list on each trial, the DATA step is conducting sampling with replacement across trials. This means you can repeat the sampling from the list of possible samples as many times as you want. For example:

%LET seed = 36830;
%LET ntrial = 100;

DATA Sampling(DROP=i);
SelectedSample = INT(RANUNI(&seed) * n) + 1;
SET SampleList_&population_size.C&sample_size POINT = SelectedSample NOBS = n;
i + 1;
Trial = i;
IF i > &ntrial THEN STOP;
RUN;

PROC PRINT DATA=Sampling;
TITLE "Selected Samples";
RUN;

Observe that the same sample, \([C,J]\), was selected in both the 13th and 27th trials.

Now, let's confirm that simple random sampling indeed gives each sampling unit an equal chance of selection. In this example, we are sampling two units from a population of ten. Thus, to meet the condition, every letter should have a selection probability of \(2/10 = 1/5\).

This can be easily verified by counting the number of rows containing a specific letter in the list of all possible samples. For example:

%LET sampling_unit = A;

DATA CountSamplingUnit;
SET SampleList_&population_size.C&sample_size;
ARRAY k[*] Select1-Select2;
DO i = 1 TO 2;
   IF k[i] = "&sampling_unit" THEN OUTPUT;
END;
DROP i;
RUN;

If you assign different letters to the macro variable sampling_unit, you will observe that each letter appears equally often among the possible samples: 9 times. Since each possible sample is equally likely to be selected, each letter's selection probability is \(9/45\), which equals \(1/5\).
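Alternatively, instead of checking one letter at a time, a short DATA step followed by PROC FREQ can count every letter's appearances across all 45 possible samples in a single pass. This is just a sketch reusing the SampleList data set created earlier:

/* Stack Select1 and Select2 into one column, then count each letter */
DATA LetterLong(KEEP=Letter);
SET SampleList_&population_size.C&sample_size;
Letter = Select1; OUTPUT;
Letter = Select2; OUTPUT;
RUN;

PROC FREQ DATA=LetterLong;
TITLE "Number of Possible Samples Containing Each Letter";
TABLES Letter;
RUN;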

In summary, we can confirm that simple random sampling is a type of random sampling, as it guarantees that every member of the population has an equal chance of being selected. This equal probability is achieved by making each possible sample equally likely. If you have the full list of the population, you can implement SRS more conveniently by drawing n distinct units from the N members without replacement: even without pre-constructing the full list of possible samples, each sample is selected with the same probability, \(\frac{n}{N} \times \frac{n-1}{N-1} \times \dots \times \frac{1}{N-n+1}\).

In fact, most statistical packages and libraries implement SRS in this way: they first generate n distinct random numbers and then gather the associated observations from the population of N units. In SAS, you can perform SRS using PROC SURVEYSELECT:

PROC SURVEYSELECT DATA=MyData.WineQuality METHOD=SRS N=50 SEED=1123 OUT=SampleSrs;
TITLE1 "Randomly selected 50 observations";
TITLE2 "Systematic Sampling"; RUN;
PROC PRINT DATA=SampleSrs; RUN;
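If you would like to see the mechanics spelled out, here is a minimal sketch of the same idea done by hand; the data set Work.Population is a hypothetical stand-in for your full population list. Each unit gets a uniform random key, the list is shuffled by that key, and the first 50 rows are kept, which yields 50 distinct units, i.e., sampling without replacement:

/* Assign each population unit a uniform random key */
DATA RandomKey;
SET Work.Population;
SortKey = RANUNI(1123);
RUN;

/* Shuffle the population by the random key */
PROC SORT DATA=RandomKey OUT=Shuffled;
BY SortKey;
RUN;

/* Keep the first 50 rows: 50 distinct units drawn without replacement */
DATA SampleManual(DROP=SortKey);
SET Shuffled(OBS=50);
RUN;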

Simple random sampling is fairly straightforward. However, it comes with a major drawback: SRS requires the complete list of population members, which is impractical in most statistical studies. And if you already have access to the entire population, there is little point in sampling only a subset of it. Nonetheless, simple random sampling is still useful, as it can be combined with other sampling methods. And in the field of machine learning, where probabilistic inference is not employed for model evaluation, SRS is a widely adopted option for splitting data into training and test sets.
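As a rough sketch of such a split (the 80/20 ratio and the reuse of the WineQuality data set are my own assumptions, not a fixed recipe), the OUTALL option of PROC SURVEYSELECT keeps every observation and adds a Selected flag, which can serve as the train/test indicator:

/* SRS-based 80/20 split; OUTALL keeps all rows and adds a Selected indicator */
PROC SURVEYSELECT DATA=MyData.WineQuality METHOD=SRS SAMPRATE=0.8 SEED=1123
OUT=SplitFlag OUTALL;
RUN;

DATA Train Test;
SET SplitFlag;
IF Selected = 1 THEN OUTPUT Train;
ELSE OUTPUT Test;
RUN;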

Stratified Sampling

If you have a categorical variable whose population distribution is known, it is advantageous to use this information. Stratified sampling first partitions the entire sampling pool based on such grouping variables, so that sampling units within each group are homogeneous with respect to the grouping variable. These groups are called strata. Then, within each stratum, you implement SRS (or any other random sampling you like) according to the known proportional distribution of the population.

For example, let's say that you select 5 letters from a population of 20: \(A,B,C,D,E,F,G,H,I,J,K,L,m,n,o,p,q,r,s,t\). This population can be divided into two homogeneous groups: capital and small letters. From the capital letter group, you implement SRS with a sample size proportional to the population distribution: \(5 \times 12/20 = 3\) letters. Similarly, from the small letter group, you sample \(5 \times 8/20 = 2\) letters through SRS. As we discussed earlier, SRS gives each sampling unit an equal chance of being selected, so within each group every sampling unit is equally likely to be chosen: each capital letter, such as \(A\), has a selection probability of \(3/12\), and each small letter, such as \(m\), has a selection probability of \(2/8\). Then, compose your sample using the units selected from each group.

In the process, each capital letter has a chance of being selected into your sample of \(3/12 = 1/4\), and each small letter likewise has a chance of \(2/8 = 1/4\); both equal the overall sampling fraction, \(5/20 = 1/4\). So we can confirm that every member of the population is equally likely to be selected, with a probability of \(1/4\).
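To make this arithmetic concrete, here is a minimal sketch of the letter example in SAS; the data set, variable names, and seed are made up for illustration, and the SURVEYSELECT options used here are covered in more detail in the next example and in the footnotes:

/* Build the population of 12 capital and 8 small letters */
DATA Letters;
LENGTH Letter $1 Group $7;
DO i = 1 TO 20;
   IF i <= 12 THEN DO;
      Letter = BYTE(64 + i); /* A through L */
      Group = "Capital";
   END;
   ELSE DO;
      Letter = BYTE(96 + i); /* m through t */
      Group = "Small";
   END;
   OUTPUT;
END;
DROP i;
RUN;

PROC SORT DATA=Letters;
BY Group;
RUN;

/* Proportional allocation: 3 from the Capital stratum, 2 from the Small stratum */
PROC SURVEYSELECT DATA=Letters METHOD=SRS SAMPSIZE=(3, 2) SEED=1123 OUT=SampleLetters;
STRATA Group;
RUN;

PROC PRINT DATA=SampleLetters;
TITLE "Stratified Sample of Letters";
RUN;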

You can implement stratified sampling with PROC SURVEYSELECT on real data as well. For example, in the MyData.WineQuality data set, the variable Type indicates whether a wine is red or white. Another variable, Quality, records the 0-to-10 quality score of each wine as rated by sommeliers. Now, suppose that this data set is the population of interest, and examine its distribution across these two variables:

PROC SORT DATA=MyData.WineQuality OUT=SortByTypeQuality;
BY Type Quality;
RUN;

PROC FREQ DATA=SortByTypeQuality;
TITLE1 "Wine Quality Data set";
TITLE2 "Strata of Wines";
TABLES Type * Quality;
RUN;

The following PROC SURVEYSELECT statements select a probability sample of wines from the WineQuality data set according to the stratified sample design:

/* Within each stratum, SRS of 10% */
PROC SURVEYSELECT DATA=SortByTypeQuality METHOD=SRS SAMPRATE=0.1 SEED=1123 OUT=SampleStrata;
TITLE1 "Wine Quality Data";
TITLE2 "Stratified Sampling";
STRATA Type Quality;
RUN;

After sorting the WineQuality data set, PROC SURVEYSELECT is applied. In the procedure, the STRATA statement defines the variables by which the entire sampling pool is grouped (Type and Quality in this example). Then, within each group, the SURVEYSELECT procedure implements an SRS of 10%[1][2].

One advantage of stratified sampling is that you can be more confident in the statistics calculated from the resulting sample. By stratifying the population on a group variable with known proportions, you reduce the variance of the sample statistics (i.e., the sampling error): the variability attributable to the grouping variable is already accounted for by sampling each group in its known proportion. Another benefit of stratified sampling is that you obtain sample statistics for each stratum. Conversely, stratified sampling can be impractical if there is no grouping variable with a known population proportion.

Outside the field of Statistics, stratified sampling is a commonly used option in machine learning. In particular, when dealing with data sets whose target classes are imbalanced, dividing the data into subgroups (strata) based on the target variable (the class you're trying to predict) and then sampling proportionally from each stratum ensures that the test set reflects the true class distribution of the population. This helps machine learning models be trained and evaluated fairly on the minority class, which can be crucial for tasks like fraud detection or medical diagnosis.
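As a hedged sketch of a stratified train/test split (the data set MyData.Transactions and the binary target Fraud are hypothetical names, and the 20% test rate is an assumption), stratifying on the target keeps the class proportions intact in both splits:

/* Sort by the target class, then draw a 20% test split within each class */
PROC SORT DATA=MyData.Transactions OUT=SortByFraud;
BY Fraud;
RUN;

PROC SURVEYSELECT DATA=SortByFraud METHOD=SRS SAMPRATE=0.2 SEED=1123
OUT=TestSplitFlag OUTALL;
STRATA Fraud;
RUN;

DATA TrainSet TestSet;
SET TestSplitFlag;
IF Selected = 1 THEN OUTPUT TestSet;
ELSE OUTPUT TrainSet;
RUN;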

Cluster Sampling

Imagine that you are analyzing the national average height of 6th-grade students. Instead of directly sampling individual students from the pool of every single 6th grader, you could sample some schools whose students collectively represent the students of the entire country. Then, from the selected schools, you can randomly sample students or, if preferred, collect data from all students at those schools.

Cluster sampling is a multi-stage sampling method in which subgroups (called clusters) are first sampled from the population at random, and then members inside the selected clusters are randomly sampled. Unlike in stratified sampling, each cluster must be representative of the population: sampling units in each cluster should follow a distribution similar to that of the population, and the variable of interest must not be driven by the grouping variable itself.

Just like the other sampling methods, cluster sampling gives every unit in the entire sampling pool an equal chance of selection. Suppose that you're sampling 2 letters from \(A, B, C, D, E, f, g, h, i, j\). In cluster sampling, you first select the cluster from which you will sample letters. In this case, each group, capitals and small letters, is selected with probability \(1/2\). Next, within the selected group, each letter is selected with probability \(2/5\). Combining the two stages, the probability that any given sampling unit is selected is \(1/2 \times 2/5 = 2/10\).

Let's consider a practical example. The MyData.Wiki4HE data set comprises survey responses from 913 faculty members at two different universities regarding their opinion on using Wikipedia as teaching material. One of the survey questions asks for the respondent's years of teaching experience. Now, say that you are sampling from this data set to figure out the average years of experience of the faculty members. In this case, we can reasonably assume that the grouping variable, the respondent's university, does not have any significant impact on the distribution of our variable of interest.

Using SAS, you can sample a cluster using the CLUSTER statement (or equivalently, SAMPLINGUNIT with the same syntax) as follows:

PROC SURVEYSELECT DATA=MyData.Wiki4HE METHOD=SRS N=1 SEED=6589 OUT=SampleCluster;
CLUSTER University;
RUN;

/* See which cluster is selected and how many observations are in the cluster */
PROC FREQ DATA=SampleCluster;
TITLE "Selected Cluster";
TABLES University;
RUN;

In this context, the N= option specifies not the size of the output sample, but the number of clusters to sample. As a result, the SURVEYSELECT procedure selects all observations in the selected cluster. 

We see that all observations with University=2 (which refers to Pompeu Fabra University in Barcelona, Spain) were chosen. As shown in the figure below, the variable of interest, YearsExp, has a distribution in the selected cluster that is similar to its population distribution. You can then use all observations in the selected cluster or implement further sampling from it as needed for your analysis.
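For instance, a second-stage sampling within the selected cluster could look like the following minimal sketch; the subsample size of 50 is an arbitrary assumption:

/* Second stage: SRS of 50 respondents from the selected cluster */
PROC SURVEYSELECT DATA=SampleCluster METHOD=SRS N=50 SEED=6589 OUT=SampleTwoStage;
RUN;

PROC PRINT DATA=SampleTwoStage;
TITLE "Second-Stage Sample Within the Selected Cluster";
RUN;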

Cluster sampling is a preferred choice in market research and social science for its cost-effectiveness. It is generally cheaper and less time-consuming than other methods, particularly for geographically dispersed populations. This is a major advantage for market research firms that track market shares and point-of-sale records on tight schedules.

On the other hand, the biggest drawback of cluster sampling is that we typically cannot verify that the chosen clusters represent the target population well. For example, consider a scenario where a market research firm wants to know the market shares of smartphones. To save time and cost, they decide to select a shopping mall and survey the sales volumes of different phone brands there. In this case, however, they have no way to prove that the chosen shopping mall truly reflects the entire market. Verifying this would require comparisons between the selected shopping mall and the entire population, which is not feasible. If the chosen clusters are not reflective of the whole population, the research findings will be highly biased.

Systematic Sampling

Systematic sampling involves choosing your sample at a regular interval, rather than through a fully randomized selection. Here's how it works:

  1. Calculate the Sampling Interval: Divide the population size \(N\) by the desired sample size \(n\) to determine the sampling interval \(k\): \(k = N/n\)
  2. Randomly Select the Starting Point: Choose a random number between 1 and the sampling interval \(k\). This will be the index of the first element in your sample.
  3. Systematic Selection: Starting from the randomly chosen point, select every \(k\)-th element thereafter until you reach the desired sample size.

For example, imagine you want a sample of 50 households (\(n\)) from a list of 500 (\(N\)). The sampling interval (\(k\)) would be \(500/50 = 10\). You'd then randomly pick a number between 1 and 10 (using a uniform distribution), say 3. This means the 3rd house on the list would be your first sampled unit, followed by every 10th house after that (the 13th, 23rd, and so on).
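A minimal DATA-step sketch of this household example might look as follows; Work.Households is a hypothetical data set with 500 rows:

DATA SampleSysManual(DROP=k start);
RETAIN k start;
IF _N_ = 1 THEN DO;
   k = 500 / 50;                      /* sampling interval k = 10 */
   start = INT(RANUNI(1123) * k) + 1; /* random start between 1 and k */
END;
SET Work.Households;
/* Keep the start-th row and every k-th row after it */
IF _N_ >= start AND MOD(_N_ - start, k) = 0 THEN OUTPUT;
RUN;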

In this example, you might wonder how this ensures that every sampling unit has an equal chance of being selected. Among the first 10 households, the probability that any single household is selected is \(1/10\). Once a household has been selected from the first 10 sampling units, the selections among the remaining households are predetermined: in the second group of 10 households, the probability of selection becomes 1 for the household that is already determined (the 13th) and 0 for the others. This pattern continues throughout all 500 households.

To sum up, the draw within the first interval of \(k\) units determines which sampling units are selected. If we give an equal chance of selection to the sampling units within that first interval, then every sampling unit in the whole population has a \(1/k\) chance of being selected. And since \(k = N/n\), each sampling unit is equally likely to be selected, with probability \(n/N\).

In SAS, you can implement systematic sampling through the SURVEYSELECT procedure:

PROC SURVEYSELECT DATA=MyData.Boston METHOD=SYS N=50 SEED=1123 OUT=SampleSys;
TITLE1 "Boston Housing Data"; TITLE2 "Systematic Sampling"; RUN; PROC PRINT DATA=SampleSys; RUN;

In this example, MyData.Boston has 506 observations, while the desired sample size \(n\) is 50. Systematic sampling relies on a fixed sampling interval, but when the population size \(N\) is not evenly divisible by the sample size \(n\), you need a slight adjustment. Typically, we apply the floor or ceiling function to \(N/n\) so that the interval \(k\) is an integer. Either way, however, systematic sampling then cannot give each sampling unit an equal chance of being selected. Using \(\lfloor N/n \rfloor\) excludes some units from potential selection, especially toward the end of the population; for instance, with \(N = 506\), \(n = 50\), and \(k = \lfloor 506/50 \rfloor = 10\), the selections never go beyond position 500, so the last six observations can never be chosen. Similarly, \(\lceil N/n \rceil\) can create a situation where certain starting points repeatedly pick units from a specific portion of the data, giving those units a higher chance of selection.

More significantly, because of the fixed interval \(k\), systematic sampling can be susceptible to periodicity in the data. If there is an underlying pattern or cycle in the data that repeats every \(k\) units, the sample might disproportionately pick units from a specific phase of that cycle. For example, suppose that you have monthly average temperatures for 10 years and you systematically sample from this data, with an interval of 12, to estimate the average temperature over the entire 10 years. If you randomly select the 1st month (January) as your starting point, the sample will include every 12th month after that point (January of the next year, and so on). This picks up only winter temperatures, completely missing information from the other seasons. Thus, the sample mean of the monthly temperatures could be heavily biased if there is seasonal variation in temperature.


[1] Here, you can also manually specify a different sampling rate for each group. To do this, list the sampling rates in parentheses, separated by commas, e.g., SAMPRATE=(0.1, 0.2, ...). Alternatively, you can provide a different number of desired samples for each group using SAMPSIZE=(\(n_1, n_2, ...\)).

[2] Specifying the N= option together with a STRATA statement results in sampling an equal number of units from each group. In this case, the N= value must not exceed the smallest number of observations among the groups.
