4.1. Simple Random Sampling
Simple random sampling (SRS) is the most basic form of probability sampling. In this process, all possible samples of a given size have the same probability of being selected, i.e., $P(s)$ is constant for every possible sample $s$. As a result, all the animals in the population have an equal probability of being included in the sample, i.e., the inclusion probability $\pi_k$ is the same for every animal [21]. This sampling process has been applied in many veterinary studies, including recent investigations of lumpy skin disease [22], bovine mastitis [23], and foot-and-mouth disease [24]. Despite its simplicity, in the right situation it can be a powerful sampling method, and it provides the theoretical basis for more complicated sampling methods. There are two forms of SRS: with and without replacement. In this article, we limit the discussion to SRS without replacement (the sample contains no duplicated animals), as this is by far the most common practice in veterinary research.
The quantities in which a researcher is usually interested are properties of the population, e.g., the average milk production of the herd or the prevalence of a disease within the herd. We denote the finite population mean as $\mu$ and the prevalence as $p$; prevalence is just a special case of the finite population mean in which the individual outcome value can only be 1 or 0. In the SRS setting, estimating the mean is straightforward. However, for other sampling processes this is not always the case; hence, it is easier to start by estimating the finite population total, $\tau$, before moving on to the mean (which is a linear function of the total). To keep the methodology consistent throughout this review, we will stick with the two-step process: estimating the total first and then the mean or prevalence.
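Suppose we have a herd with $M$ animals, of which a sample $S$ of $m$ animals has been obtained using SRS, and let $y_k$ be the value recorded for animal $k$. The Horvitz–Thompson (HT) estimator of the finite population total $\tau = \sum_{k=1}^{M} y_k$ is [25]:
$$\hat{\tau} = \sum_{k \in S} w_k y_k = \sum_{k=1}^{M} Z_k w_k y_k = \frac{M}{m} \sum_{k \in S} y_k \tag{1}$$
In the SRS setting, the sampling weight $w_k = 1/\pi_k$ is a constant as the inclusion probability is the same for every animal, such that $\pi_k = m/M$ and $w_k = M/m$ (see Appendix A for technical details), where $Z_k$ is the Bernoulli random variable for selection and $Z_k = 1$ if animal $k$ is selected; otherwise, $Z_k = 0$. The HT estimator is, by design, unbiased, i.e., its expected value is equal to the true value of the finite population characteristic [7]:
$$E(\hat{\tau}) = E\left(\sum_{k=1}^{M} Z_k w_k y_k\right) = \sum_{k=1}^{M} w_k y_k E(Z_k) = \sum_{k=1}^{M} \frac{1}{\pi_k} y_k \pi_k = \sum_{k=1}^{M} y_k = \tau$$
where $E(\cdot)$ is the expectation operator, which takes all the possible values generated by the random variable and returns the weighted average value, so $E(Z_k) = 1 \times \pi_k + 0 \times (1 - \pi_k) = \pi_k$.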
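The unbiased estimator for the mean $\mu$ or the prevalence $p$ is, therefore:
$$\hat{\mu} = \frac{\hat{\tau}}{M} = \frac{1}{m} \sum_{k \in S} y_k \quad \text{or} \quad \hat{p} = \frac{\hat{\tau}}{M} = \frac{1}{m} \sum_{k \in S} y_k \tag{2}$$
The proof is immediate: $M$ is a known constant, so $E(\hat{\tau}/M) = \tau/M$. By observing Equation (2), we see that, in SRS, the sample mean or sample proportion is the unbiased estimator for the population mean or prevalence. This means that, for other sampling strategies, building up the estimator from the SRS sample mean will also result in an unbiased estimator if done correctly.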
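The unbiasedness of the HT estimator under SRS can be checked by brute force on a toy herd. The following is a minimal sketch, assuming a hypothetical population of six animals with binary disease status; the data and the function name `ht_total` are illustrative and do not come from the studies cited here. It enumerates every possible sample of size $m$ and averages the resulting estimates, which is exactly the expectation under SRS without replacement.

```python
from fractions import Fraction
from itertools import combinations


def ht_total(sample_values, M, m):
    """Horvitz-Thompson total under SRS: every sampled value carries weight M/m."""
    return Fraction(M, m) * sum(sample_values)


# Hypothetical herd of M = 6 animals; y_k = 1 means diseased (illustrative data).
population = [1, 0, 0, 1, 1, 0]
M, m = len(population), 3

# Under SRS without replacement all C(M, m) subsets are equally likely,
# so the expectation is the plain average of the estimates over all subsets.
estimates = [ht_total(s, M, m) for s in combinations(population, m)]
expected_total = sum(estimates) / len(estimates)
expected_prevalence = expected_total / M

print(expected_total == sum(population))                   # exact equality
print(expected_prevalence == Fraction(sum(population), M))
```

Exact fractions are used instead of floating-point numbers so that the equality check is not affected by rounding.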
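To derive the variance of the estimator for the mean or the prevalence, it is also easier to start with the variance of the total. The detailed derivation can be seen in Appendix B; here, we only provide the formulas for the variances. First, the variance of the estimated population total is:
$$\mathrm{Var}(\hat{\tau}) = M^2 \left(1 - \frac{m}{M}\right) \frac{S^2}{m} \tag{3}$$
where $S^2 = \frac{1}{M-1} \sum_{k=1}^{M} (y_k - \mu)^2$ is the variance of the finite population. In the special case where we estimate prevalence, we can replace $S^2$ with $\frac{M}{M-1} p(1-p)$ with some algebra (see Appendix B), resulting in $\mathrm{Var}(\hat{\tau}) = M^2 \left(1 - \frac{m}{M}\right) \frac{M}{M-1} \frac{p(1-p)}{m}$. Therefore, the variances for $\hat{\mu}$ and $\hat{p}$ are given as follows:
$$\mathrm{Var}(\hat{\mu}) = \left(1 - \frac{m}{M}\right) \frac{S^2}{m} \tag{4}$$
$$\mathrm{Var}(\hat{p}) = \left(1 - \frac{m}{M}\right) \frac{M}{M-1} \frac{p(1-p)}{m} \tag{5}$$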
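However, the finite population variance depends on an unknown quantity, $\mu$ or $p$, which we are attempting to estimate; in practice, we often replace $S^2$ with $s^2 = \frac{1}{m-1} \sum_{k \in S} (y_k - \hat{\mu})^2$, which is the sample variance (or $\frac{M}{M-1} p(1-p)$ with $\frac{m}{m-1} \hat{p}(1-\hat{p})$). Therefore, the estimated variances for $\hat{\mu}$ and $\hat{p}$ are:
$$\widehat{\mathrm{Var}}(\hat{\mu}) = \left(1 - \frac{m}{M}\right) \frac{s^2}{m} \tag{6}$$
$$\widehat{\mathrm{Var}}(\hat{p}) = \left(1 - \frac{m}{M}\right) \frac{\hat{p}(1-\hat{p})}{m-1} \tag{7}$$
where $1 - m/M$ is usually referred to as the finite population correction factor [26].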
To illustrate this process, consider an investigator who wants to estimate the prevalence of digital dermatitis in lactating cows in a dairy herd. A random sample of 100 cows is obtained from a herd of 300 cows, and 35 of the sampled cows are diagnosed as diseased. These 35 cows have records $y_k = 1$ and the remaining 65 sampled cows have records $y_k = 0$. The estimated prevalence is calculated using Equation (2): $\hat{p} = 35/100 = 0.35$. The variance of this estimate is calculated using Equation (7). As the actual prevalence is unknown, we need to use the estimated prevalence to calculate the estimated variance: $\widehat{\mathrm{Var}}(\hat{p}) = (1 - 100/300) \times 0.35 \times 0.65 / 99 \approx 0.0015$.
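The digital dermatitis calculation can be reproduced in a few lines. This is a minimal sketch rather than a reference implementation; the function name `srs_prevalence` is hypothetical, and the variance uses the $m - 1$ divisor that follows from substituting the sample variance of a 0/1 variable for $S^2$.

```python
def srs_prevalence(positives, m, M):
    """Prevalence estimate and its estimated variance under SRS without replacement."""
    p_hat = positives / m
    fpc = 1 - m / M  # finite population correction factor
    var_hat = fpc * p_hat * (1 - p_hat) / (m - 1)
    return p_hat, var_hat


# Figures from the worked example: 35 positives in 100 cows sampled from a herd of 300.
p_hat, var_hat = srs_prevalence(positives=35, m=100, M=300)
std_error = var_hat ** 0.5

print(f"prevalence = {p_hat:.2f}")
print(f"estimated variance = {var_hat:.4f}")
print(f"standard error = {std_error:.3f}")
```

Taking the square root of the estimated variance gives the standard error, which is the quantity usually reported alongside a prevalence estimate.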
4.2. Stratified Random Sampling
In the stratified random sampling procedure (STRRS), the target finite population (e.g., all the animals within a herd) is partitioned into non-overlapping groups based on some pre-defined attributes, and each of the groups is referred to as a stratum. These strata constitute the entire population; therefore, each animal belongs to a specific stratum. Within each stratum, SRS is commonly used to sample animals, and the sampling processes in the different strata are independent [27]. There is no requirement to select all strata within a population. If only some strata are of interest (e.g., only those which include lactating cows), these can be selected and strata that are not of interest can be excluded. If this approach is used, it needs to be made clear that the target population is no longer the entire finite population, but rather the population represented by the selected strata.
The finite population mean or prevalence is then estimated by pooling the information from all the strata. Like SRS, STRRS is commonly used in veterinary research, for example with stratification by area. This allows the researcher to investigate prevalences and associations across a country or a region; e.g., Heayns and Baugh [28] investigated the opinions of veterinarians across the UK about serological testing to assess revaccination requirements in dogs. In this study, each county of Great Britain was considered as a stratum and 10% of the small animal veterinary practices within each stratum were randomly selected (if there were fewer than 10 practices in a county, one practice was randomly chosen to represent the county). Similarly, Atuman et al. [29] investigated dog ecology, dog bites, and rabies vaccination rates in Bauchi, Nigeria, using STRRS. They stratified Bauchi into five areas, and within each area randomly selected 10% of the streets for direct street counts and the administration of a questionnaire. However, other sources of strata are also used; e.g., as part of a randomised clinical trial of footrot treatments in Kashmir, India, Kaler et al. [30] allocated sheep with acute footrot to one of three treatments using STRRS, with the strata being based on each sheep’s maximum footrot score. Stratification is useful to ensure that the sample includes individuals which could otherwise be missed by chance in SRS due to the limited number of individuals in their stratum. For example, at a certain period a pig farm in Hong Kong may keep few finisher pigs while many piglets and sows are present on the farm. With SRS, it is quite possible that none of the finisher pigs is included in the sample; one can therefore argue that the population is misrepresented, which could diminish the accuracy of the estimate. For this reason, it is also common to sample a fixed number of individuals in each stratum. Compared to SRS, however, extra information such as the variable used for stratification (membership) must be obtained for all sampling units.
If STRRS has been used, care is required when pooling the information from the strata in order to obtain an unbiased estimator for the finite population mean or prevalence. A “natural” estimator for the mean/prevalence might involve summing up all the observed values in the sample and dividing by the sample size (equivalent to the process used in SRS). However, this estimator is unbiased in general only if the sample size in each stratum is proportional to the actual size of the stratum, i.e., there has been proportional allocation (this is demonstrated in more detail in Appendix C). The more general approach to obtaining an unbiased estimator for the finite population mean or prevalence follows the two principles we have mentioned: (1) following the actual sampling process and (2) starting with the finite population total. Consider a farm with $M$ animals. A researcher has created $J$ strata based on the ages of the animals. For the $j$th stratum, there are $M_j$ animals, and clearly $\sum_{j=1}^{J} M_j = M$. Suppose that $m_j$ animals are sampled using SRS independently from each of the strata and that the value of the variable of interest is denoted as $y_{jk}$ for the $k$th animal in the $j$th stratum.
The unbiased estimator (using weight notation) for the finite population total is:
$$\hat{\tau} = \sum_{j=1}^{J} \sum_{k \in S_j} w_{jk} y_{jk} \tag{8}$$
where $S_j$ is the set of animals sampled from the $j$th stratum and $w_{jk} = 1/\pi_{jk}$ is the sampling weight, which is the reciprocal of the inclusion probability $\pi_{jk}$. For STRRS, this is the probability of the $k$th animal in the $j$th stratum being selected. However, writing the estimator in this form is not very intuitive, and it can be rewritten to provide a more intuitive and meaningful picture for veterinary researchers. As SRS has been implemented within each of the strata, the inclusion probability $\pi_{jk}$ for the $k$th animal in the $j$th stratum is simply the sample size $m_j$ divided by the stratum size $M_j$, which leads to $w_{jk} = M_j/m_j$. Now, Equation (8) can be rewritten as:
$$\hat{\tau} = \sum_{j=1}^{J} \frac{M_j}{m_j} \sum_{k \in S_j} y_{jk} = \sum_{j=1}^{J} M_j \hat{\mu}_j \tag{9}$$
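This formula says that in order to estimate the finite population total, we first compute the mean/prevalence for each of the strata, $\hat{\mu}_j = \frac{1}{m_j} \sum_{k \in S_j} y_{jk}$ (written $\hat{p}_j$ for a prevalence), using the estimator we have seen in SRS and then multiply it by the stratum size $M_j$ to obtain the estimated total for each stratum. We then sum all these estimated stratum totals to obtain the estimated finite population total. This is consistent with, and follows, the actual sampling process, and it produces an unbiased estimator:
$$E(\hat{\tau}) = E\left(\sum_{j=1}^{J} \sum_{k=1}^{M_j} Z_{jk} \frac{M_j}{m_j} y_{jk}\right) = \sum_{j=1}^{J} \sum_{k=1}^{M_j} \frac{M_j}{m_j} y_{jk} E(Z_{jk}) = \sum_{j=1}^{J} \sum_{k=1}^{M_j} y_{jk} = \tau$$
where $Z_{jk}$ is the Bernoulli random variable for selection, representing whether the $k$th animal in the $j$th stratum is selected with an inclusion probability $\pi_{jk}$, and $E(Z_{jk}) = \pi_{jk} = m_j/M_j$ due to SRS. Once the estimated total is found, the estimated finite population mean or prevalence is just the total divided by the population size:
$$\hat{\mu} = \frac{\hat{\tau}}{M} = \sum_{j=1}^{J} \frac{M_j}{M} \hat{\mu}_j \quad \text{or} \quad \hat{p} = \frac{\hat{\tau}}{M} = \sum_{j=1}^{J} \frac{M_j}{M} \hat{p}_j \tag{10}$$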
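Since each stratum is independently sampled, building on SRS, the variances for $\hat{\mu}$ and $\hat{p}$ using STRRS are also straightforward:
$$\mathrm{Var}(\hat{\mu}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{S_j^2}{m_j} \tag{11}$$
$$\mathrm{Var}(\hat{p}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{M_j}{M_j - 1} \frac{p_j(1-p_j)}{m_j} \tag{12}$$
where both $S_j^2$ and $p_j$ are unknown quantities representing the population variance and prevalence in the $j$th stratum. As in SRS, the estimated variances are obtained by substituting estimates for the unknown quantities:
$$\widehat{\mathrm{Var}}(\hat{\mu}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{s_j^2}{m_j} \tag{13}$$
$$\widehat{\mathrm{Var}}(\hat{p}) = \sum_{j=1}^{J} \left(\frac{M_j}{M}\right)^2 \left(1 - \frac{m_j}{M_j}\right) \frac{\hat{p}_j(1-\hat{p}_j)}{m_j - 1} \tag{14}$$
where $s_j^2$ is the sample variance of the $j$th stratum, computed with the formula given in the SRS section.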
To illustrate this, consider an investigation of the seroprevalence of pseudorabies on a farm where STRRS is used. First, pigs are divided into groups based on the five production stages (strata): piglets, weaners, growers, finishers, and sows (breeding herd). The total numbers of pigs in each stratum are 30, 30, 40, 20, and 60, respectively. Within each stratum, a fixed number of pigs (10) are sampled using SRS and the numbers of infected pigs are 5, 6, 3, 2, and 7. The estimated prevalence can then be calculated using Equation (10): $\hat{p} = (30 \times 0.5 + 30 \times 0.6 + 40 \times 0.3 + 20 \times 0.2 + 60 \times 0.7)/180 = 91/180 \approx 0.506$. The variance of this prevalence estimate can then be estimated using Equation (14). This is carried out stratum by stratum; for example, for the piglets, $(30/180)^2 \times (1 - 10/30) \times 0.5 \times 0.5/9 \approx 0.0005$. This process is then repeated for all the strata, and the estimated variance is the sum of the quantities calculated for each stratum. In the example, the final estimated variance is 0.004.
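The pseudorabies example can be reproduced with a short script. This is a sketch under the assumption that the estimated variance uses the $m_j - 1$ divisor within each stratum; the function name `stratified_prevalence` is hypothetical. The unweighted ("natural") sample proportion is computed as well, to show how it departs from the stratified estimate when allocation is not proportional.

```python
def stratified_prevalence(stratum_sizes, sample_sizes, positives):
    """Stratified prevalence estimate and its estimated variance (SRS within strata)."""
    M = sum(stratum_sizes)
    p_hat, var_hat = 0.0, 0.0
    for M_j, m_j, x_j in zip(stratum_sizes, sample_sizes, positives):
        p_j = x_j / m_j          # stratum sample prevalence
        share = M_j / M          # stratum share of the population
        p_hat += share * p_j
        var_hat += share**2 * (1 - m_j / M_j) * p_j * (1 - p_j) / (m_j - 1)
    return p_hat, var_hat


# Piglets, weaners, growers, finishers, sows (figures from the worked example).
stratum_sizes = [30, 30, 40, 20, 60]
sample_sizes = [10, 10, 10, 10, 10]
positives = [5, 6, 3, 2, 7]

p_hat, var_hat = stratified_prevalence(stratum_sizes, sample_sizes, positives)
naive = sum(positives) / sum(sample_sizes)  # ignores the unequal sampling fractions

print(f"stratified prevalence = {p_hat:.3f}")
print(f"estimated variance = {var_hat:.3f}")
print(f"unweighted sample proportion = {naive:.2f}")
```

The gap between the two prevalence figures arises because sows are under-sampled (10 of 60) relative to finishers (10 of 20), so the unweighted proportion gives each sampled sow too little influence.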
4.3. Cluster Sampling
In this sampling method, the animals in a finite population (animals in a herd, region, or country) are aggregated into larger sampling units: clusters. A cluster is similar to a stratum; however, the sampling process is different. In a cluster sampling procedure, a set of $n$ clusters is sampled using SRS from a population with $N$ clusters. These clusters are usually referred to as primary sampling units, and the members within each cluster as secondary sampling units. Within the primary sampling units, all secondary sampling units may be measured or observed (one-stage cluster sampling) or the secondary sampling units may be sampled using SRS (two-stage cluster sampling). The selected individuals within the selected clusters then form a sample of the finite population [26]. In contrast, in STRRS all strata of interest must be included, and SRS is usually used to sample individuals within each stratum. These different sampling strategies mean that the sources of variability in cluster sampling are different from those in STRRS. In STRRS, the variability of the estimated population mean/prevalence arises only from individual variability within a stratum. For cluster sampling, the variability of the estimated population mean/prevalence comes from one or more sources [27]. In one-stage cluster sampling, where all individuals in a selected cluster are included, the variability of the estimated population characteristic or quantity is dependent on the variability between clusters. In two-stage cluster sampling, where only a sub-sample is collected from selected clusters, the variability of the estimated population characteristic comes from two sources: the within- and between-cluster variabilities [31]. One advantage of cluster sampling is that it overcomes some of the logistics issues associated with SRS or STRRS and therefore generally requires less spending on administration and travel expenses. However, the estimates provided by cluster sampling are usually less precise than those provided by SRS, given the same sample size [27].
Cluster sampling is possibly the most widely used approach in livestock research. Usually, a farm or a herd is regarded as a cluster and a number of farms/herds are selected. This was the approach adopted by Getahun et al. [32], who studied mastitis and antibiotic resistance patterns in dairy cows in central Ethiopia. This design treated a farm as a cluster and a number of farms were chosen using SRS; within each farm, all the dairy cows were sampled. A similar approach was later used to estimate the prevalence of bovine tuberculosis in southern Ethiopia [33]. In this study, the target population was only cows above 6 months of age, and all cows older than 6 months on the selected dairy farms (clusters) were included. For interested readers, three examples of two-stage cluster sampling in veterinary research are given in [34,35,36]. In the rest of this section, we first describe the estimation process for one-stage cluster sampling and then work through, in detail, a two-stage cluster sampling design in which STRRS instead of SRS is used at the second stage (essentially a complex sampling design).
4.3.1. One-Stage Cluster Sampling
In one-stage cluster sampling, all animals within a farm are sampled; therefore, the farm total $t_i = \sum_{k=1}^{M_i} y_{ik}$ is directly measured, where $y_{ik}$ is the value of the variable of interest measured for the $k$th animal on the $i$th farm given the herd size of $M_i$. Common research tasks might be to estimate farm-level averages (such as the average milk production or the average number of positive animals per farm) and animal-level averages (such as the average milk production per cow or the overall prevalence at the animal level). Suppose $n$ farms are sampled from $N$ farms in a region using SRS. As before, to estimate the population mean or prevalence it is recommended to start by estimating the total. Since SRS is used for sampling clusters, the unbiased estimator for the finite population total (e.g., the number of all diseased dairy cows in a region) is straightforward and is therefore given without proof:
$$\hat{\tau} = \frac{N}{n} \sum_{i \in S} t_i \tag{15}$$
where $S$ denotes the set of sampled farms.
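The variance and estimated variance for this estimator can also be determined directly by applying the theory introduced in the SRS section: $\mathrm{Var}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}$ and $\widehat{\mathrm{Var}}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$, where $S_t^2$ and $s_t^2$ are the finite population variance and sample variance (at the farm level), such that $S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{\tau}{N}\right)^2$ and $s_t^2 = \frac{1}{n-1} \sum_{i \in S} \left(t_i - \frac{\hat{\tau}}{N}\right)^2$, with $t_i$ the total for the $i$th farm and $S$ the set of sampled farms. The estimated farm-level average and its corresponding variance and estimated variance are straightforward:
$$\hat{\bar{t}} = \frac{\hat{\tau}}{N} = \frac{1}{n} \sum_{i \in S} t_i, \qquad \mathrm{Var}(\hat{\bar{t}}) = \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}, \qquad \widehat{\mathrm{Var}}(\hat{\bar{t}}) = \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$$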
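The total number of animals in the region is $M = \sum_{i=1}^{N} M_i$. Hence, the estimated average at the cow level is given by:
$$\hat{\mu} = \frac{\hat{\tau}}{M} = \frac{N}{nM} \sum_{i \in S} t_i$$
The variances and estimated variances for the cow-level average or overall prevalence are given as:
$$\mathrm{Var}(\hat{\mu}) = \mathrm{Var}(\hat{p}) = \frac{N^2}{M^2} \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}, \qquad \widehat{\mathrm{Var}}(\hat{\mu}) = \widehat{\mathrm{Var}}(\hat{p}) = \frac{N^2}{M^2} \left(1 - \frac{n}{N}\right) \frac{s_t^2}{n}$$
Note that at the farm level, we work with counts of positive animals instead of binary values even when we are estimating prevalence; therefore, the variance formulas for $\hat{\mu}$ and $\hat{p}$ are identical in form.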
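The one-stage formulas can be sketched in code as follows. All numbers here are hypothetical (a region with $N = 20$ farms and $M = 2000$ cows, of which $n = 4$ farms are fully tested), and the function name `one_stage_cluster` is an illustrative choice; the sketch only assumes that farms are selected by SRS and that every animal on a selected farm is measured.

```python
from statistics import variance


def one_stage_cluster(farm_totals, N, M):
    """Estimates for one-stage cluster sampling with SRS of farms.

    farm_totals: measured total (e.g., number of positive cows) on each sampled farm.
    N: number of farms in the region; M: number of animals in the region.
    """
    n = len(farm_totals)
    total_hat = N / n * sum(farm_totals)
    s2_t = variance(farm_totals)  # farm-level sample variance (n - 1 divisor)
    var_total = N**2 * (1 - n / N) * s2_t / n
    return {
        "total": total_hat,
        "var_total": var_total,
        "farm_mean": total_hat / N,
        "var_farm_mean": var_total / N**2,
        "animal_mean": total_hat / M,
        "var_animal_mean": var_total / M**2,
    }


# Hypothetical data: positive cows found on each of the 4 sampled farms.
result = one_stage_cluster([12, 8, 20, 10], N=20, M=2000)
for name, value in result.items():
    print(f"{name}: {value:.6g}")
```

Only the farm totals enter the variance, which reflects the point made above: with every animal on a selected farm measured, all of the sampling variability comes from differences between farms.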
4.3.2. Two-Stage Cluster Sampling
The main purpose of this section is to illustrate the estimation process for a complex survey, i.e., how to obtain the unbiased estimators and derive their corresponding variances. Suppose there are $M$ dairy cows in a region with $N$ dairy herds. The herd size for herd $i$ is $M_i$. The cows are managed separately based on a certain criterion; that is, within the $i$th herd there are $J$ groups, and within the $j$th group there are $M_{ij}$ cows. The groups can be treated as strata, as they do not overlap and together constitute the entire herd. A research team is interested in knowing the prevalence of a disease among cows in this region. Based on the demographic information, a two-stage cluster sampling design is chosen. First, $n$ herds will be selected using SRS. Within each of the sampled herds, STRRS will be used to sample cows from each of the strata. Before turning to the estimation process, we define some notation (Table 1).
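The ultimate goal of this sample survey is to estimate the prevalence $p$; however, as in the previous examples, it is best to start by estimating the total $\tau$. Additionally, the computation process needs to be consistent with the actual sampling steps. Thus, we start by estimating the total number of diseased animals in the $j$th stratum in the $i$th herd. Within each stratum, SRS is used; therefore, the estimated total can be computed based on Equation (1). The second step is to estimate the total number of diseased animals in the $i$th herd. Because we used STRRS, this can be achieved by adopting Equation (9). Finally, we can estimate the total number of diseased animals in the region by using Equation (15), as SRS is used to select herds. Hence, the unbiased estimated region total is computed in the following way:
$$\hat{\tau} = \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{J} \frac{M_{ij}}{m_{ij}} \sum_{k \in S_{ij}} y_{ijk} = \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{J} M_{ij} \hat{p}_{ij}$$
where $S$ is the set of sampled herds, $S_{ij}$ is the set of $m_{ij}$ cows sampled from the $j$th stratum of the $i$th herd, $y_{ijk}$ is the disease status (1 or 0) of the $k$th cow in that stratum, and $\hat{p}_{ij}$ is the corresponding sample prevalence.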
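To prove that the outcome of this process is unbiased, we simplify the notation, letting $\hat{t}_i = \sum_{j=1}^{J} M_{ij} \hat{p}_{ij}$ be the estimated total for herd $i$ and $t_i$ the true total. We know that $\hat{t}_i$ is unbiased (namely, $E(\hat{t}_i) = t_i$), because we have used STRRS. Secondly, we specify a binary indicator variable $Z_i = 1$ if herd $i$ is selected or $Z_i = 0$ if it is not, and write $\mathbf{Z} = (Z_1, \ldots, Z_N)$. Let $\pi_i$ denote the probability that herd $i$ is selected (inclusion probability of a herd); we then have $E(Z_i) = \pi_i = n/N$, since SRS is used for the first stage of selection (i.e., the selection of herds). Given that sampling within any herd is independent of the sampling in any other herd and that $\hat{t}_i$ is independent of $\mathbf{Z}$, we have:
$$E(\hat{\tau}) = E\left[E\left(\frac{N}{n} \sum_{i=1}^{N} Z_i \hat{t}_i \,\Big|\, \mathbf{Z}\right)\right]$$
(partition theorem for expectations)
$$= E\left[\sum_{i=1}^{N} E\left(\frac{N}{n} Z_i \hat{t}_i \,\Big|\, \mathbf{Z}\right)\right]$$
(the conditional expectation of a sum is the sum of the conditional expectations)
$$= E\left[\frac{N}{n} \sum_{i=1}^{N} E\left(Z_i \hat{t}_i \mid \mathbf{Z}\right)\right]$$
(expectation is a linear operator and $N/n$ is a constant)
$$= E\left[\frac{N}{n} \sum_{i=1}^{N} Z_i E\left(\hat{t}_i \mid \mathbf{Z}\right)\right]$$
(knowing a vector means the same as knowing every element of the vector; conditional on the selection status of every herd means knowing the selection status of any herd)
$$= E\left[\frac{N}{n} \sum_{i=1}^{N} Z_i E(\hat{t}_i)\right]$$
($\hat{t}_i$ and $\mathbf{Z}$ are independent)
$$= E\left[\frac{N}{n} \sum_{i=1}^{N} Z_i t_i\right]$$
(unbiased estimator for stratified random sampling for each herd)
$$= \frac{N}{n} \sum_{i=1}^{N} t_i E(Z_i) = \frac{N}{n} \sum_{i=1}^{N} t_i \frac{n}{N} = \sum_{i=1}^{N} t_i = \tau$$
(linear property of expectation).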
Therefore, the unbiased estimator for the overall prevalence is simply:
$$\hat{p} = \frac{\hat{\tau}}{M}$$
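In order to find the variance formula for $\hat{p}$, it is easier to start with $\hat{\tau}$. It is necessary to first identify the sources of variability. In this two-stage cluster sampling process, we have between- and within-herd variances. The variance partition formula can thus decompose the total variance into two parts: $\mathrm{Var}(\hat{\tau}) = \mathrm{Var}[E(\hat{\tau} \mid \mathbf{Z})] + E[\mathrm{Var}(\hat{\tau} \mid \mathbf{Z})]$, where $\mathbf{Z}$ is the vector of herd selection indicators, $\mathrm{Var}[E(\hat{\tau} \mid \mathbf{Z})]$ measures the variability between herds and $E[\mathrm{Var}(\hat{\tau} \mid \mathbf{Z})]$ measures the variability within a herd. Since SRS is implemented at the herd level, according to Equation (3), we have:
$$\mathrm{Var}[E(\hat{\tau} \mid \mathbf{Z})] = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}$$
where $S_t^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(t_i - \frac{\tau}{N}\right)^2$ and $t_i$ is the true total for herd $i$. This part of the variance is the same as that of one-stage cluster sampling, since the herd sampling procedures are exactly the same. The detailed derivation is essentially the same as the derivation of variance in SRS (see Appendix B).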
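For the within-herd component of the variance, $E[\mathrm{Var}(\hat{\tau} \mid \mathbf{Z})]$, the formula inside the expectation operator is $\mathrm{Var}(\hat{\tau} \mid \mathbf{Z}) = E(\hat{\tau}^2 \mid \mathbf{Z}) - [E(\hat{\tau} \mid \mathbf{Z})]^2$ according to the conditional variance formula. The detailed mathematical derivation is available in Appendix D, and Appendix E provides the statistical theorems required in this paper. Here, we only give an essential intermediate result:
$$E[\mathrm{Var}(\hat{\tau} \mid \mathbf{Z})] = \frac{N}{n} \sum_{i=1}^{N} \mathrm{Var}(\hat{t}_i)$$
where $\hat{t}_i$ is the estimated total for herd $i$.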
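Since STRRS is implemented within each herd, $\mathrm{Var}(\hat{t}_i)$ can be easily obtained from Equation (11) or Equation (12), after multiplying by $M_i^2$ to move from the herd mean to the herd total, depending on the nature of $y_{ijk}$. In our particular example, where $y_{ijk}$ takes a binary value (either 1 or 0), we have $\mathrm{Var}(\hat{t}_i) = \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{M_{ij}}{M_{ij} - 1} \frac{p_{ij}(1 - p_{ij})}{m_{ij}}$. Generally, $\mathrm{Var}(\hat{t}_i) = \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{S_{ij}^2}{m_{ij}}$.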
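Therefore, the general formula for the variance of $\hat{\tau}$ is given as:
$$\mathrm{Var}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{S_{ij}^2}{m_{ij}} \tag{24}$$
where $S_{ij}^2 = \frac{1}{M_{ij} - 1} \sum_{k=1}^{M_{ij}} (y_{ijk} - \mu_{ij})^2$, and $\mu_{ij}$ is the unknown mean of the $j$th stratum in the $i$th herd. When $y_{ijk}$ takes a binary value, the special form is given by applying the method introduced in the SRS section (see Equation (5)):
$$\mathrm{Var}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n} + \frac{N}{n} \sum_{i=1}^{N} \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{M_{ij}}{M_{ij} - 1} \frac{p_{ij}(1 - p_{ij})}{m_{ij}} \tag{25}$$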
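Again, this variance depends on unknown quantities, for which we have estimates. These estimates can be used to replace the unknown quantities, as we have done previously. Thus, the estimated variance (general form) will be:
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_{\hat{t}}^2}{n} + \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{s_{ij}^2}{m_{ij}} \tag{26}$$
where $s_{\hat{t}}^2 = \frac{1}{n-1} \sum_{i \in S} \left(\hat{t}_i - \frac{\hat{\tau}}{N}\right)^2$. Note the difference between $s_{\hat{t}}^2$ and $s_t^2$ in the one-stage cluster sampling: the former is computed from the estimated herd totals $\hat{t}_i$ rather than from directly measured totals $t_i$. Here, $s_{ij}^2 = \frac{1}{m_{ij} - 1} \sum_{k \in S_{ij}} (y_{ijk} - \hat{\mu}_{ij})^2$ is the sample variance within the $j$th stratum in the $i$th herd, with $\hat{\mu}_{ij}$ being the estimated sample mean. When $y_{ijk}$ takes a binary value, the special form is given as:
$$\widehat{\mathrm{Var}}(\hat{\tau}) = N^2 \left(1 - \frac{n}{N}\right) \frac{s_{\hat{t}}^2}{n} + \frac{N}{n} \sum_{i \in S} \sum_{j=1}^{J} M_{ij}^2 \left(1 - \frac{m_{ij}}{M_{ij}}\right) \frac{\hat{p}_{ij}(1 - \hat{p}_{ij})}{m_{ij} - 1} \tag{27}$$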
Finally, the variance and estimated variance for $\hat{p}$ are found simply by multiplying the results of Equations (25) and (27) by the constant $1/M^2$. The same process can be applied to find the variance and estimated variance for $\hat{\mu}$ when $y_{ijk}$ is not limited to binary values. A numerical illustration of this design would be tedious to present manually; we have therefore provided Python code for the computation (see the Supplementary Materials: Python code for the two-stage cluster sampling where stratification is implemented within the clusters).
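For orientation, the estimation steps of this section can be sketched as follows. This is not the supplementary code referred to above but an independent, minimal illustration; the function name `two_stage_prevalence`, the input layout, and all numbers (a region with $N = 10$ herds and $M = 1000$ cows, $n = 2$ sampled herds, two strata per herd) are hypothetical. It assumes binary outcomes, SRS of herds, and SRS within every stratum of each sampled herd.

```python
def two_stage_prevalence(herds, N, M):
    """Prevalence and estimated variance for two-stage cluster sampling
    with stratified sampling inside each sampled herd.

    herds: one entry per sampled herd; each entry is a list of
           (M_ij, m_ij, positives_ij) tuples, one tuple per stratum.
    N: number of herds in the region; M: number of cows in the region.
    """
    n = len(herds)
    herd_totals = []   # estimated number of diseased cows in each sampled herd
    within_sum = 0.0   # sum over sampled herds of the estimated Var(t_i hat)
    for strata in herds:
        t_i, v_i = 0.0, 0.0
        for M_ij, m_ij, x_ij in strata:
            p_ij = x_ij / m_ij
            t_i += M_ij * p_ij
            v_i += M_ij**2 * (1 - m_ij / M_ij) * p_ij * (1 - p_ij) / (m_ij - 1)
        herd_totals.append(t_i)
        within_sum += v_i

    total_hat = N / n * sum(herd_totals)
    mean_t = sum(herd_totals) / n
    s2_t = sum((t - mean_t) ** 2 for t in herd_totals) / (n - 1)

    between = N**2 * (1 - n / N) * s2_t / n   # between-herd component
    within = N / n * within_sum               # within-herd component
    var_total = between + within
    return total_hat / M, var_total / M**2


# Hypothetical sample: two herds, each with two management groups (strata).
sampled_herds = [
    [(60, 10, 2), (40, 10, 5)],
    [(50, 10, 3), (50, 10, 4)],
]
p_hat, var_p = two_stage_prevalence(sampled_herds, N=10, M=1000)
print(f"prevalence = {p_hat:.3f}, estimated variance = {var_p:.6f}")
```

The code mirrors the order of the sampling design in reverse: stratum totals are combined into herd totals, herd totals into the region total, and only at the end is the total divided by $M$ to obtain the prevalence.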