1. Introduction
The group testing problem, introduced by Dorfman in [
1], is the problem of identifying the infection statuses of a set of individuals by performing fewer tests than individually testing everyone. The key idea of group testing is to mix test samples of the individuals and test the mixed sample. A negative test result implies that everyone within that group is negative, thereby identifying infection statuses of an entire group with a single test. A positive test result implies that there is at least one positive individual in that group, in which case Dorfman’s original algorithm goes into a second phase of testing everyone individually.
Since Dorfman’s seminal work, various families of algorithms have been studied, such as adaptive algorithms, where one designs test pools in the
st step by using information from the test results in the first
i steps, and non-adaptive algorithms, where every test pool is predetermined and run in parallel. In addition, various forms of infection spread models have been considered as well, such as the independent and identically distributed (i.i.d.) model where each person is infected independent of others with probability
p, and the combinatorial model where
k out of
n people are infected uniformly distributed on the sample space of
elements. Under these various system models and family of algorithms, the group testing problem has been widely studied. For instance, Ref. [
2] gives a detailed study of combinatorial group testing and zero-error group testing, Ref. [
3] relates the group testing problem to a channel coding problem, and Refs. [
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25] advance the group testing literature in various directions. The advantage of group testing is known to diminish when the disease is not rare [
26,
27,
28].
Early works mainly consider two infection models: combinatorial model where, prior to designing the algorithm, the exact number of infections is assumed to be known, and the probabilistic model where each individual is assumed to be infected with probability
p identically and independently. Although there is no general result for arbitrary infection probabilities and arbitrary correlations, Refs. [
29,
30,
31,
32,
33,
34] have considered advanced probabilistic models. Our goal in this paper is to consider a realistic graph-based infection spread model, and exploit the knowledge of the infection spread model to design efficient group testing algorithms. In this paper, we expand our prior conference paper in [
35], to present a comprehensive analysis.
To that end, first, we propose a novel infection spread model, where individuals are connected via a random connection graph, whose connection probabilities are known (For instance, location data obtained from cell phones can be used to estimate connection probabilities.). A realization of the random connection graph results in different connected components, i.e., clusters and partitions the set of all individuals. The infection starts with a patient zero who is uniformly randomly chosen among n individuals. Then, any individual who is connected to at least one infected individual is also infected. For this system model, we propose a novel family of algorithms which we coin two-step sampled group testing algorithms. The algorithm consists of a sampling step, where a set of individuals are chosen to be tested, and a zero-error non-adaptive test step, where selected individuals are tested according to a zero-error non-adaptive group test matrix. In order to select individuals to test in the first step, one of the possible cluster formations that can be formed in the random connection graph is selected. Then, according to the selected cluster formation, we select exactly one individual from every cluster. After identifying the infection statuses of the selected individuals with zero-error, we assign the same infection statuses to the other individuals in the same cluster with identified individuals. Note that the actual cluster formation is not known prior to the test design and, because of that, selected cluster formation can be different from the actual cluster formation. Thus, this process is not necessarily a zero-error group testing procedure.
Our main contributions consist of proposing a novel infection spread model with random connection graph, proposing a two-step sampled group testing algorithm which is based on novel
-separable zero-error non-adaptive test matrices, characterizing the optimal design of two-step sampled group testing algorithms, and presenting explicit results on analytically tractable exponentially split cluster formation trees. For the considered two-step sampled group testing algorithms, we identify the optimal sampling function selection, calculate the required number of tests and the expected number of false classifications in terms of the system parameters, and identify the trade-off between them. Our
-separable zero-error non-adaptive test matrix construction is based on taking advantage of the known probability distribution of cluster formations. In order to present an analytically tractable case study for our proposed two-step sampled group testing algorithm, we consider exponentially split cluster formation trees as a special case, in which we explicitly calculate the required number of tests and the expected number of false classifications. For zero-error construction, we prove that the required number of tests is less than
and is of
, when there are at most
n equal-sized clusters in the system, each having
individuals. For the sake of fairness, in our comparisons, we take
to be 1, ignoring further reductions of the number of tests due to
. We show that, even when we ignore the gain by cluster size
, our non-adaptive algorithm, in the zero-error setting, outperforms any zero-error non-adaptive group test and Hwang’s generalized binary splitting algorithm [
36], which is known to be the optimal zero-error adaptive group test [
28]. Since the number of infections scale as
in exponentially split cluster formation trees with
individuals, our results show that we can use group testing to reduce the required number of tests significantly in our system model even when the infection rate is high by using our two-step sampled group testing algorithm.
3. System Model
We consider a group of n individuals. The random infection vector represents the infection status of the individuals. Here, is a Bernoulli random variable with parameter . If individual i is infected, then , otherwise . Random variables need not be independent. A patient zero random variable Z is uniformly distributed over the set of individuals, i.e., with probability for . Patient zero is the first person to be infected. Thus far, the infection model is identical to the traditional combinatorial model with infected among n individuals.
Next, we define a random connection graph which is a random graph where vertices represent the individuals, and edges represent the connections between the individuals. Let denote the probability distribution of the random graph over the support set of all possible edge realizations. For the special class of random connection graphs where the edges are realized independently, we fully characterize the statistics of the random connection graph by the random connection matrix , which is a symmetric matrix, where the th entry is the probability that there is an edge between vertices i and j for , and for by definition.
A random connection graph
is an undirected random graph with vertex set
, with each vertex representing a unique individual, and a random edge set
which represents connections between individuals that satisfy the following: (1) If
, then there is an edge between vertices
i and
j; (2) For an arbitrary edge set
, probability of
is equal to
. In the case when all
are independent, where
denotes the indicator function of the event
A, the random connection matrix
fully characterizes the statistics of edge realizations. There is a path between vertices
i and
j if there exists a set of vertices
in
such that
, i.e., two vertices are connected if there exists a path between them. We summarize the system and algorithm parameters that we use throughout the paper in
Table 1.
In our system model, if there is a path in between two individuals, then their infection statuses are equal. In other words, the infection spreads from patient zero Z to everyone that is connected to patient zero. Thus, , if there exists a path between k and l in . Here, we note that a realization of the random graph consists of clusters of individuals, where a cluster is a subset of vertices in such that all elements in a cluster are connected with each other, and none of them is connected to any vertex that is not in the cluster. More rigorously, a subset of is a cluster, if and are connected for all , but and are not connected for any and all .
Note that the set of all clusters in a realization of the random graph is a partition of . In a random connection graph structure, formation of clusters in along with patient zero Z determine the status of the infection vector. Therefore, instead of focusing on the specific structure of the graph , we focus on the cluster formations in . For a given , we can calculate the probabilities of possible cluster formations in .
To solidify ideas, we give an example in
Figure 1. For a random connection graph where the edges are realized independently, we give probabilities of the existence of edges (zero probabilities are not shown) in
Figure 1a and three different realizations of a random connection graph
, where all three realizations result in different cluster formations in
Figure 1b–d. In
Figure 1, we consider a random connection graph
that has
vertices, which represent the individuals in our group testing model. Since in this example we assume that the edges are realized independently, every edge between vertices
i and
j exists with probability
, independently. As we defined, if there is a path between two vertices (i.e., they are in the same cluster), then we say that their infection statuses are the same. One way of interpreting this is that there is a patient zero
Z, which is uniformly randomly chosen among
n individuals, and patient zero spreads the infection to everyone in its cluster. Therefore, working on the cluster formation structures, rather than the random connection graph itself, is equally informative for the sake of designing group tests. For instance, in the realization that we give in
Figure 1b, if the edge between vertices 5 and 10 did not exist that would be a different realization for the random connection graph
; however, the cluster formations would still be the same. As all infections are determined by the cluster formations and the realization of patient zero, cluster formations are sufficient statistics. Before we rigorously argue this point, we first focus on constructing a basis for random cluster formations.
The random cluster formation variable F is distributed over as , for all , where is a subset of the set of all partitions of the set . In our model, we know the set (i.e., the set of cluster formations that can occur) and the probability distribution , since we know . Let us denote by f. For a cluster formation , individuals that are in the same cluster have the same infection status. Let , i.e., there are subsets in the partition of . Without loss of generality, for , we have , i.e., cluster formations in are ordered in increasing sizes. Let be the jth subset of the partition where and . Then, for fixed i and j, for all , for all and .
To clarify the definitions, we give a simple running example which we will refer to throughout this section. Consider a population with
individuals who are connected according to the random connection matrix
and assume that the edges are realized independently,
By definition, the main diagonal of the random connection matrix is zero, since we define edges between distinct vertices only. In this example,
consists of four possible cluster formations, and thus we have
. The random cluster formation variable
F can take those four possible cluster formations with the following probabilities:
This example network and the corresponding cluster formations are shown in
Figure 2. Here, cluster formation
occurs when the edge between vertices 1 and 2 and the edge between vertices 1 and 3 are realized;
occurs when only the edge between vertices 1 and 2 is realized; and
occurs when only the edge between vertices 1 and 3 is realized. Finally,
occurs when none of the edges in
is realized. In this example, we have
,
,
, and
. Note that
as assumed without loss of generality above. Each subset that forms the partition
are denoted by
, for instance,
consists of
and
.
Next, we argue formally that cluster formations are sufficient statistics, i.e., they represent an equal amount of information as the realization of the random graph as far as the infection statuses of the individuals is concerned. When
Z and
F are realized, the infection statuses of
n individuals are also realized, i.e.,
. Then,
where in (
3) we used the fact that
F is a function of
(not necessarily invertible). In addition, from
, we also have
, which together with () imply
. Thus,
F is sufficient statistics for
relative to
U. Therefore, from this point on, we focus on the random cluster formation variable
F in our analysis.
The graph model and the resulting cluster formations we described so far are general. For tractability, in this paper, we investigate a specific class of
which satisfies the following condition: For all
i,
can only be obtained by partitioning some elements of
. This assumption results in a tree-like structure for cluster formations. Thus, we call
sets that satisfy this condition cluster formation trees. Formally,
is a cluster formation tree if
can be obtained by partitioning the elements of
for all
. Note that
in (
2) is not a cluster formation tree. However, if the probability of the edge between vertices 1 and 3 were 0, then
would not contain
and
, and
would be a cluster formation tree in this case. Note that cluster formation trees may arise in real-life clustering scenarios, for instance, if individuals belong to a hierarchical structure. An example is: an individual may belong to a professor’s lab, then to a department, then to a building, and then to a campus.
Next, we define the family of algorithms that we consider, which we coin two-step sampled group testing algorithms. In the two-step sampled group testing algorithms, two steps do not involve consecutive testing phases: the proposed algorithm family in our paper consists of non-adaptive constructions and should not be confused with semi-adaptive algorithms with two testing phases such as two stage algorithms in [
32]. Two-step sampled group testing algorithms consist of two steps in both testing phase and decoding phase. The following definitions are necessary in order to characterize the family of algorithms that we consider in this paper.
In order to design a two-step sampled group testing algorithm, we first pick one of the cluster formations in
to be the sampling cluster formation. The selection of
is a design choice, for example, recalling the running example in (
1) and (
2), one can choose
to be the sampling cluster formation.
Next, we define the sampling function,
M, to be a function of
. The sampling function selects which individuals to be tested by selecting exactly one individual from every subset that forms the partition
. Let the infected set among the sampled individuals be denoted by
. The output of the sampling function
M is the individuals that are sampled and going to be tested. In the second step, a zero-error non-adaptive group test is performed on the sampled individuals. This results in the identification of the infection statuses of the selected
individuals with zero-error probability. For example, recalling the running example in (
1) and (
2), when the sampling cluster formation is chosen as
, we may design
M as
Note that, for each selection of , M selects exactly one individual from each . As long as it satisfies this property, M can be chosen freely while designing the group testing algorithm.
The test matrix
is a non-adaptive test matrix of size
, where
T is the required number of tests. Let
denote the infection status vector of the sampled individuals. Then, we have the following test result vector
yIn the classical group testing applications, while constructing zero-error non-adaptive test matrices, the aim is to obtain unique result vectors,
y, for every unique possible infected set and, for instance, in combinatorial setting, with
d infections,
d-separable matrix construction is proposed [
39]. In the classical
d-separable matrix construction, we have
for all subsets
and
of cardinality
d. As a more general approach, we do not restrict the possible infected sets to the subsets of
of the same size, but we consider the problem of designing test matrices that satisfy (
12) for every unique
and
in a given set of possible infected sets. This approach leads to a more general basis for designing zero-error non-adaptive group testing algorithms for various scenarios, when the set of possible infected sets can be restricted by the available side information.
By using the test result vector
y, in the first decoding step, the infection statuses of the sampled individuals are identified with zero-error probability. In the second stage of decoding, depending on
and the infection statuses of the sampled individuals, other non-tested individuals are estimated by assigning the same infection status to all of the individuals that share the same cluster in the cluster formation
. In the running example, with
M given in (
10), one must design a zero-error non-adaptive test matrix
, which identifies the infection statuses of individuals 1 and 3.
Let
be the estimated infection status vector. By definition, the infection estimates are the same within each cluster, i.e., for sampling cluster formation
,
, for all
, for all
. Since
M samples exactly one individual from every subset that forms the partition
, there is exactly one identified individual at the beginning of the second step of the decoding phase and by the aforementioned rule, all
n individuals have estimated infection statuses at the end of the process. For instance, in the running example, for the sampling cluster formation
, we have
as given in (
10) and
identifies
and
with zero-error. Then,
, since individuals 1 and 2 are in the same cluster in
.
Finally, we have two metrics to measure the performance of a group testing algorithm. The first one is the required number of tests T, which is the number of rows of in the two step sampled group testing algorithm family that we defined. Having a minimum number of required tests is one of the aims of the group testing procedure. The second metric is the expected number of false classifications. Due to the second step of decoding, the overall two step sampled group testing algorithm is not a zero-error algorithm (except for the choice of ) and the expected number of false classifications is a metric to measure the error performance of the algorithm. We use to denote the expected number of false classifications, where is the Hamming weight of a binary vector.
Designing a two-step sampled group testing algorithm consists of selecting , then designing the function M, and then designing the non-adaptive test matrix for the second step of the testing and the first step of the decoding phase for zero-error identification of the infection statuses of the sampled individuals. We consider cluster formation trees and uniform patient zero assumptions for our infection spread model, and we consider two step sampled group testing algorithms for the group test design.
In the following section, we present a motivating example to demonstrate our key ideas.
4. Motivating Example
Consider the following example. There are
individuals, and a cluster formation tree with
levels. Full characterization of
F is as follows:
First, we find the optimal sampling functions, M, for all possible selections of . First of all, note that M selects exactly one individual from each subset that forms , by definition. Therefore, the number of sampled individuals is constant for a fixed choice of . Thus, in the optimal sampling function design, the only parameter that we consider is the minimum number of expected false classifications . Note that a false classification occurs only when one of the sampled individuals has a different infection status than one of the individuals in its cluster in . For instance, assume that is chosen. Then, assume that the sampling function M selects individual 1 from the set . Recall that, after the second step of the two-step group testing algorithm, by using , the infection status of individual 1 is identified with zero-error and its status is used to estimate the statuses of individuals 2 and 3, since they are in the same cluster in . However, with positive probability, individuals 1 and 3 can have distinct infection statuses, in which case, a false classification occurs. Note that this scenario occurs only when is at a higher level than the realized F in the cluster formation tree , where we refer to as the top level of the cluster formation tree and as the bottom level.
While finding the optimal sampling function
M, one must consider the possible false classifications and minimize
, the expected number of false classifications. As shown in
Figure 3, the cluster
does not become partitioned, and for all three choices of
,
M can sample either one of the individuals 4 and 5. This selection does not change the expected number of false classifications since
in all possible realizations of
F. For all sampling cluster formation selections, we have the following analysis:
If: If
M samples individual 1 or 2 from the cluster
, a false classification occurs if
and the cluster
is infected, in that case, individual 3 is falsely classified as infected. Similar false classification occurs when
and the cluster
is infected. Similarly, in these cases, if individual 3 is infected, again, individual 3 is falsely classified as non-infected. Thus, for cluster
, when either individuals 1 or 2 is sampled, the expected number of false classifications is:
Similarly, when individual 3 is sampled from the cluster
, individuals 1 and 2 are falsely classified when
or
and either the cluster
or individual 3 is infected. Thus, in that case, the expected number of false classifications is:
Thus, (
14) and (
15) imply that, for cluster
, the optimal
M should select either individuals 1 or 2 for testing. As discussed above, for cluster
, the selection of sampled individual is indifferent and results in 0 expected false classification. Finally, for cluster
, a similar analysis implies that the optimal
M should select one of the individuals in
for testing.
If: Similar combinatorial arguments follow and we conclude that selection of sampled individuals from the clusters , and are indifferent in terms of the expected number of false classifications. Only a possible false classification can happen in cluster when and the infected cluster is either or . Similar to the case , if the sampled individual is either 6 or 7, then the expected number of false classifications is 0.6 in contrast to the 0.4 when the sampled individual is one of 8, 9 and 10. Thus, the optimal M should select one of the individuals 8, 9 and 10 as the sampled individual to minimize the expected number of false classifications.
If: It is not possible to make a false classification since, for all clusters in , all individuals that are in the same cluster have the same infection status with probability 1.
Therefore, for this example, the optimal sampling function selects either individuals 1 or 2 from the set ; selects either 4 or 5 from the set ; and selects either 8, 9 or 10 from the set if , and the same sampling is optimal with an addition of individual 3, if . Let us assume that M selects the individual with the smallest index when the selection is indifferent among a set of individuals. Thus, the optimal sampling function M for this example is: , or , depending on the selection of being , , or , respectively.
Now, for these possible sets of sampled individuals, we need to design zero-error non-adaptive test matrices.
If (i.e., ): The set of all possible infected sets is . By a counting argument, we need at least two tests, since each of three possible infected sets must result in a unique result vector y, and each one of these sets has one element. We can achieve this lower bound by using the following test matrix:
If (i.e.,
): In this case, the set of all possible infected sets is now
. In the classical zero-error construction for the combinatorial group testing model, one can construct
d-separable matrices, and the rationale behind the construction is to enable the decoding of the infected set, when the infected set can be any
d-sized subset of
. However, in our model, the set of all possible infected sets, i.e.,
, is not a set of all fixed sized subsets of
, but instead consists of varying sized subsets of
that are structured, depending on the given
. As illustrated in
Figure 3, a given cluster formation tree
can be represented by a tree structure with nodes (Throughout the paper, we use the word “node” only for the possible clusters in the cluster formation tree representations, not for the vertices in the connection graphs that represent the individuals.) representing possible infected sets, i.e., clusters at each level. Then, the aim of constructing a zero-error test matrix is to have unique test result vectors for each unique possible infected set, i.e., unique nodes in the cluster formation tree. In
Figure 4, we present the subtree of
, which ends at the level
, with assigned result vectors to each node. One must assign unique binary vectors to each node, except for the nodes that do not become partitioned while moving from level to level: those nodes represent the same cluster, and thus the same vector is assigned, as seen in
Figure 4. Moreover, while merging in upper level nodes, binary OR of vectors assigned to the descendant nodes must be assigned to their ancestor node. By combinatorial arguments, one can find the minimum vector length such that such vectors can be assigned to the nodes.
In this case, the required number of tests must be at least 3 and, by assigning result vectors as in
Figure 4, we can construct the following test matrix
:
Note that, for all elements of
, the corresponding result vector is unique and satisfies the tree structure criteria, as shown in
Figure 4.
If (i.e.,
): In this case, the set of all possible infected sets is
. We give a tree structure representation with assigned result vectors of length 3 that achieves the tree structure criteria discussed above, which is shown in
Figure 5 where each unique node is assigned a unique vector except for the nodes that do not become partitioned while moving from level to level. Note that every unique node in the tree representation corresponds to a unique element of
. The corresponding test matrix
is the following
matrix:
A more structured and detailed analysis of the selection of the optimal sampling function and the minimum number of required tests is given in the next section.
We finalize our analysis of this example by calculating the expected number of false classifications where denotes the conditional expected false classifications, given :
Note that the choice of is a design choice, and one can use time sharing (Time sharing can be implemented by assigning a probability distribution to over , instead of picking one cluster formation from to be deterministically.) between different choices of m, depending on the specifications of the desired group testing algorithm. For instance, if a minimum number of tests is desired, then one can pick , which results in two tests, which is the minimum possible, but with expected 0.58 false classifications, which is the maximum possible in this example. On the other hand, if a minimum expected false classifications is desired, then one can pick , results in 0 expected false classifications, which is the minimum possible, but with 3 tests, which is the maximum possible in this example. Generally, there is a trade-off between the number of tests and the number of false classifications, and we can formulate optimization problems for specific system requirements, such as finding a time sharing distribution for that minimizes the number of tests for a desired level of false classifications, or vice versa.
In the following section, we describe the details of our proposed group testing algorithm.
5. Proposed Algorithm and Analysis
In our
-
separable matrix construction, we aim to construct binary matrices that have
n columns, and for each possible infected subset of the selected individuals, there must be a corresponding distinct result vector. A binary matrix
is
-separable if
is satisfied for all distinct subsets
and
in the set of all possible infected subsets, where
denotes the
ith column of
. In
d-separable matrix construction [
39], this condition must hold for all subsets
and
of cardinality
d; here, it must hold for all possible feasible infected subsets as defined by
. From this point of view, our
-separable test matrix construction exploits the known structure of
and thus it results in an efficient zero-error non-adaptive test design for the second step of our proposed algorithm.
We adopt a combinatorial approach to the design of the non-adaptive test matrix
. Note that, for a given
M, we have
individuals to be identified with zero-error probability. The key point of our algorithm is the fact that the infected set of individuals among those selected individuals can only be some specific subsets of those
individuals. Without any information about the cluster formation, any one of the
subsets of the selected individuals can be the infected set. However, since we are given
, we know that the infected set among the selected individuals,
, can be one of the
subsets only if there exists at least one set
that contains
, and there is no element in the difference set
such that it is an element of all sets
containing
. This fact, especially in a cluster formation tree structure, significantly reduces the total number of possible infected subsets that need to be considered. Therefore, we can focus on such subsets and design the test matrix
by requiring that the logical OR operation of the columns that correspond to the possible
sets to be distinct, in order to decode the test results with zero-error. Let
denote the set of possible infected subsets of the selected individuals, i.e., the set of possible sets that
can be. Then, matrix
must satisfy (
18) for all distinct
and
that are elements of
. Note that the decoding process is a mapping from the result vectors to the infected sets and thus we require the distinct result vector property to guarantee zero-error decoding.
Designing the matrix that satisfies the aforementioned property is the key idea of our algorithm. Before going into the design of , we first derive the expected number of false classifications in a given two step sampled group testing algorithm. Recall that false classifications occur during the second step of the decoding phase. In particular, in the second step of the decoding phase, depending on the selection of the sampling cluster formation , the infection statuses of selected individuals M are assigned to the other individuals such that the infection status estimate is the same within each cluster. For fixed sampling cluster formation and the sampling function M, the number of expected false classifications can be calculated as in the following theorem.
Theorem 1. In a two step sampled group testing algorithm with the given sampling cluster formation and the sampling function M over a cluster formation tree structure defined by and , with uniform patient zero distribution over , the expected number of false classifications given isand the expected number of false classifications iswhere is the subset in the partition which contains the ith selected individual. Next, we obtain Theorem 2 to characterize the optimal choice of the sampling function
M. First, we define
functions as follows. For
and
,
where
is the subset in partition
that contains
k.
Theorem 2. For sampling cluster formation , the optimal choice of M that minimizes the expected number of false classifications iswhere is the ith selected individual. Moreover, the number of required tests is constant and is independent of the choice of M. We present the proofs of Theorems 1 and 2 in
Appendix A.
The optimal
M analysis focuses on choosing the sampling function that results in the minimum expected number of false classifications, among the set of functions that select exactly one individual from each cluster of a given
. For some scenarios, it is possible to choose a sampling function that selects multiple individuals from some clusters of a given
that achieves expected false classifications–required number of tests points that cannot be achieved by the optimal
M in (
A6). However, for the majority of the cases, the sampling functions of interest, i.e., the sampling functions that choose exactly one individual from each
, are globally optimal. First, the sampling functions that select multiple individuals from a cluster that never becomes partitioned further in the levels below
is sub-optimal: these sampling functions select multiple individuals to identify who are guaranteed to have the same infection status. For instance, in zero expected false classifications case, i.e., the bottom level,
is chosen as the sampling cluster formation, sampling more than one individual from each cluster is sub-optimal. Second, picking the sampling cluster formation
and choosing an
M such that multiple individuals are chosen from some clusters that further become partitioned in the levels below
, is equivalent to choosing a sampling cluster formation below
and using an
M that selects exactly one individual from each cluster of the new sampling cluster formation, except for the scenarios where there exists partitioning of multiple clusters in two consecutive cluster formations in a given
, and one can consider a sampling function that selects multiple individuals from some clusters of a given
that cannot be represented as a sampling function that selects exactly one individual from each cluster of another cluster formation
. For the sake of compactness, we focus on the family of sampling functions
M that selects exactly one individual from each cluster of the chosen
.
Thus far, we have presented a method to select individuals to be tested in a way to minimize the expected number of false classifications. Now, we move on to the design of , the zero-error non-adaptive test matrix which identifies the infection statuses of the selected individuals M with a minimum number of tests. Recall that, since , there are f possible choices of , and each choice results in a different test matrix .
Based on the combinatorial viewpoint stated in (
18), we propose a family of non-adaptive group testing algorithms which satisfy the separability condition for all of the subsets in
, which is determined by
. We call such matrices
-separable matrices and non-adaptive group tests that use
-separable matrices as their test matrix as
-separable non-adaptive group tests. In the rest of the section, we present our results on the required number of tests for
-separable non-adaptive group tests.
The key idea of designing an -separable matrix is determining the set for a given set of selected individuals M and the tree structure of so that we can find binary column vectors for each selected individual where all of the corresponding possible result vectors are distinct. Note that, for a given choice of , if we consider the corresponding subtree of which starts from the first level and ends at the level , the problem of finding an -separable non-adaptive test matrix is equivalent to finding a set of length T binary column vectors for each node at level that satisfy the following criteria:
For every node at the levels that are above the level , each node must be assigned a binary column vector that is equal to the OR of all vectors that are assigned to its descendant nodes. This is because each node in the tree corresponds to a possible set of infected individuals among the selected individuals where each merging of the nodes corresponds to the union of the possible infected sets which results in taking the OR of the assigned vectors of the merged nodes.
Each assigned binary vector must be unique for each unique node, i.e., for every node that represents a unique set
. For the nodes that do not split between two levels, the assigned vector remains the same. This is because each unique node (note that when a node does not split between levels, it still represents the same set of individuals) corresponds to a unique possible infected subset of the selected individuals and they must satisfy (
18).
In other words, for a cluster formation tree with assigned result vectors to each node, a sufficient condition for achievability of -separable matrices as follows:
Let u be a node with Hamming weight . Then, the number of all descendant nodes of u with constant Hamming weights i must be less than for all i. This must hold for all nodes u. Furthermore, the number of nodes with constant Hamming weight i must be less than for all i. In addition, Hamming weights of the nodes must strictly decrease while moving from ancestor nodes to descendant nodes.
This condition is indeed sufficient because it guarantees the existence of unique set of vectors that can be assigned to each node of the subtree of that satisfies the merging/OR structure determined by the subtree.
The problem of designing an -separable non-adaptive group test can be reduced to finding the minimum number T, for which we can find binary vectors with length T, such that all vectors that are assigned to the nodes satisfy the above condition. Here, the assigned vectors are the result vectors y when the corresponding node is the infected node.
We have the following definitions that we need in Theorem 3. For a given
, we define
as the number of unique ancestor nodes of the set
. We also define
as the number of unique sets
in
at and above the level
. Note that
is the total number of sets
in
at and above the level
, and thus we have
Theorem 3. For given and for , the number of required tests for an -separable non-adaptive group test, i.e., the number of rows of the test matrix , must satisfywith the addition of 1’s removed in (24) for the special case of . We present the proof of Theorem 3 in the
Appendix A. Note that Theorem 3 is a converse argument, without a statement about the achievability of the given lower bound. In fact, the given lower bound is not always achievable.
Complexity: The time complexity of the two-step sampled group testing algorithms consists of the complexity of finding the optimal M given and , the complexity of the construction of the -separable test matrix given M and , and the complexity of the decoding of the test results given the test matrix and the result vector y. In the following lemmas, we analyze the complexity of these processes.
Lemma 1. For a given cluster formation tree and a sampling cluster formation , the complexity of finding the optimal M as in Theorem 2 iswhere . Proof. In order to find the optimal
M,
needs to be calculated as in (
21) for each
. The complexity of each of these calculations is bounded above by the number of cluster formations below
multiplied by the number of clusters at level
f that do not include the individual
k and form the cluster
, i.e., the clusters
that satisfy
. Note that this upper bound varies for each
and the total complexity is the summation of these sizes multiplied by
, i.e., the number of cluster formations below
, for each
. As an upper bound, we consider the maximum of these sizes, i.e.,
, concluding the proof. □
In the next lemma, we analyze the complexity of the construction of the -separable test matrix given M and .
Lemma 2. For a given cluster formation tree and a sampling function M, the complexity of assigning the binary result vectors to the nodes in , and thus the construction of the -separable test matrix is .
Proof. When the cluster formation tree
and the sampling function
M are given, in order to assign unique binary result vectors to each node in
that represents a unique possible infected cluster, we need to consider the subtree of
that starts with the level
and ends at the level
, as in the example in
Figure 4. Then, we need to traverse from each bottom node in the subtree, to the top node, to detect every merging of each cluster. This results in finding the numbers
for
and
and unique binary test result vectors can be assigned to each unique node in
. The traversing on the subtree of
starting from the bottom level
to the top level for each bottom level node has the complexity
. This traversing does not immediately result in the explicit construction of unique binary result vectors to be assigned, but it gives an asymptotic lower bound for the complexity of the construction of the
-separable test matrices. □
Note that the Lemma 2 is an asymptotic lower bound for the complexity of the binary result vector assignment to the unique nodes in , and thus for the construction of the -separable test result matrix . This analysis is a baseline for the proposed model and proposing explicit -separable test matrix constructions with an exact number of required tests, and complexity is an open problem.
Lemma 3. For a given -separable test matrix , with corresponding cluster formation tree with assigned binary result vectors to each node and the result vector y, the decoding complexity is .
Proof. While constructing the -separable test matrix, we consider the assignment of the unique binary result vectors to the nodes in the given cluster formation tree . For a given test matrix and the result vector y, the decoding problem is a hash table lookup, with the complexity . □
Since, during the proposed process of assignment of unique binary result vectors to each unique node in , we specifically assign the test result vectors to every unique possible infected set, the decoding process is basically a hash table lookup, resulting in fast decoding with low complexity.
Key Steps of the Proposed Algorithm: The summary of the key steps of the two-step sampled group testing algorithm is given below:
We start with the assumption that exact connections between the individuals are not known, but the probability distribution of the possible edge realizations are known.
The given edge set probability distribution results in a random cluster formation variable, F. Each possible cluster formation is a partition of the set of all individuals.
Out of all possible cluster formations (which we call this set as ), one cluster formation is selected as the sampling cluster formation, which we call .
Exactly one individual is selected from each cluster in . These individuals are then tested and identified.
The selection is carried out according to the sampling function M. For the given choice of , M selects the individuals from the clusters that minimizes the expected number of false classifications, given in Theorem 2, and this results in the expected number of false classifications given in Theorem 1.
By using the given set of possible cluster formations, , an -separable test matrix is constructed to identify the individuals selected by M. This test matrix is guaranteed to identify the selected individuals since the construction is based on assigning a unique test result vector to every possible infected set among the selected individuals.
In Theorem 3, we present a converse argument by giving a lower bound for the required number of tests, in terms of the system parameters.
After obtaining the test results and identifying the selected individuals with zero-error, for each selected individual, their infection status is assigned to the others in their cluster, in . Note that there is exactly one individual selected and identified from every cluster in . This step introduces possible false classifications.
Selecting from lower levels from the possible cluster formations tree results in lower expected false classifications while increasing the number of required tests for identification. This results in a trade-off between the number of tests and expected false classifications. By using a randomized selection of , intermediate points can also be achieved for the expected false classifications and required number of tests.
In the next section, we introduce and focus on a family of cluster formation trees which we call exponentially split cluster formation trees. For this analytically tractable family of cluster formation trees, we achieve the lower bound in Theorem 3 order-wise, and we compare our result with the results in the literature.
6. Exponentially Split Cluster Formation Trees
In this section, we consider a family of cluster formation trees, explicitly characterize the selection of optimal sampling function, and the resulting expected number of false classifications and the number of required tests. We also compare our results with Hwang’s generalized binary splitting algorithm [
36] and zero-error non-adaptive group testing algorithms in order to show the gain of utilizing the cluster formation structure as achieved in this paper.
A cluster formation tree is an exponentially split cluster formation tree if it satisfies the following criteria:
An exponentially split cluster formation tree that consists of f levels has nodes at level , for each , i.e., .
At level , every node has individuals where is a constant positive integer, i.e., .
Every node has exactly two descendant nodes in one level below in the cluster formation tree, i.e., every node is partitioned into equal sized 2 nodes when moving one level down in the cluster formation tree.
Random cluster formation variable F is uniformly distributed over , i.e., .
We analyze the expected number of false classifications and the required number of tests for exponentially split cluster formation trees, by using the general results derived in
Section 5. In
Figure 6, we give a 4-level exponentially split cluster formation tree example. In that example, there is a
node at level
and the number of nodes gets doubled at each level, since each node is split into two nodes when moving one level down in the tree. In addition, the sizes of the nodes that are at the same level are the same, with the bottom level nodes having the size
.
Being a subset of cluster formation trees, exponentially split cluster formation trees correspond to random connection graphs where edges between individuals are not independently realized in non-trivial cases. For instance, in
Figure 7, we present four different possible realizations of edges of a 4-level exponentially split cluster formation tree system, given in
Figure 6, where there are
individuals in the bottom level clusters. Here, if the edges between individuals are realized independently, then there would be possible cluster formations that do not result in an exponentially split cluster formation tree structure. The edge realizations are correlated in the sense that, if there is at least one edge realized between two bottom level neighbor clusters, then there must be at least one edge realized between other bottom level neighbor cluster pairs as well. Similarly, if there is at least one bottom level cluster pair that are not immediate neighbors but get merged in some upper level
in
, then other bottom level cluster pairs that get merged in
must be connected as well. In
Figure 7, in
realization, the only edges that are present are the edges that form bottom level clusters. In
realization, there are at least one edge realized between each bottom level neighbor cluster pair, resulting in clusters of eight individuals. Similarly, there are more distant connections that are realized in
and
. From a practical point of view, the 4-level exponential split cluster formation tree example in
Figure 6 and
Figure 7 can be used to model real-life scenarios, such as the infection spread in an apartment complex with multiple buildings. In the bottom level, there are households that are guaranteed to be connected, and, in the
level, the households that are in close contact are connected, in the
level, there is a connection building-wise and, in
, the whole community is connected. Note that the connections given in
Figure 7 are realization examples that fall under four possible cluster formations and all edge realization scenarios are possible as long as the resulting cluster formation is one of the four given cluster formations. While designing the group testing algorithm, the given information is the probability distribution over the cluster formations, and in practice, one can expect a probability distribution where bottom level cluster formations, i.e., cluster formations towards
, have higher probabilities in a community where there are strict social isolation measures, and high immunity rates for a contagious infection, whereas higher probabilities of upper level cluster formations, i.e., cluster formations toward
, can be expected for communities with high contact rate and lower immunity.
Optimal sampling function and expected number of false classifications: Due to the symmetry of the system, for any choice
, each element of
has the same
value for all
. Therefore, the sampling function selects individuals from each set arbitrarily, i.e., the selection of a particular individual does not change the expected number of false classifications. Thus, we can pick any sampling function that selects one element from each
. By Theorem 1, the expected number of false classifications, for given
, is
This expected number of false classifications takes its maximum value when
,
and it takes its minimum value when
as
. Since the choice of
is a design parameter, one can use time sharing between the possible selections of
to achieve any desired value for the expected number of false classifications between
and
in (
32).
Required number of tests: We first recall that, if we choose the sampling cluster formation level , the required number of tests for selected individuals at that level for whom we design an -separable test matrix depends on the subtree that is composed of the first m levels of the cluster formation tree . Note that the first m levels of an exponentially split cluster formation tree is also an exponentially split cluster formation tree with m levels. In Theorem 4 below, we focus on the sampling cluster formation choice at the bottom level, and characterize the exact required number of tests to be between f and . This implies that the required number of tests at level is , and thus the required number of tests at level is .
Theorem 4. For an f level exponentially split cluster formation tree, at level f, there exists an -separable test matrix, , with not more than rows, i.e., an upper (achievable) bound for the number of required tests is for n individuals. Conversely, this is also the capacity order-wise, since the number of required tests must be greater than f.
Expected number of infections: In an exponentially split cluster formation tree structure with
f levels, the expected total number of infections is
since
and if
, then there are
infections. Thus, the expected number of infections is
.
Comparison: In order to compare our results for the exponentially split cluster formation trees with other results in the literature, for fairness, we focus on the zero-error case in our system model, which happens when is chosen. The resulting sampling function selects in a total of individuals, and the resulting number of required tests is between f and , i.e., , as proved in Theorem 4. Note that, by performing at most tests to individuals, we identify the infection statuses of individuals with zero false classifications, which implies that the number of tests scales with the number of nodes at the bottom level, instead of the number of individuals in the system. This results in a gain scaled with . However, in order to fairly compare our results with the results in the literature, we ignore this gain and compare the performance of the second step of our algorithm only, i.e., the identification of infection statuses of selected individuals only. To avoid confusion, let , i.e., each cluster at the bottom level is an individual and thus .
From (
33), the expected number of infections in this system is
. When the infections scale faster than
, as proved in [
26] (see also [
28]), non-adaptive tests with zero-error criterion cannot perform better than individual testing. Since our algorithm results in
tests, it outperforms all non-adaptive algorithms in the literature. Furthermore, we compare our results with Hwang’s generalized binary splitting algorithm [
36], even though it is an adaptive algorithm and also it assumes the prior knowledge of exact number of infections. Hwang’s algorithm results in a zero-error identification of
k infections among the population of
n individuals with
tests and attains the capacity of adaptive group testing [
28,
36,
40]. Since the number of infections takes
f values in the set
uniformly randomly, the resulting mean value of the required number of tests when Hwang’s generalized binary splitting algorithm is used is
Thus, the expected number of tests when Hwang’s generalized binary splitting algorithm is used scales as which is much faster than our result of . We note that Hwang’s generalized binary splitting algorithm assumes the prior knowledge of exact number of infections, and is an adaptive algorithm, and furthermore, we have ignored the gain of our algorithm in the first step (i.e., ). Despite these advantages given to it, our algorithm still outperforms Hwang’s generalized binary splitting algorithm for exponentially split cluster formation trees.