2.2. Algorithms
The input parameters for the sampling selection problem are the pedigree file and the integer K—the intended number of individuals for the sampling (In our software, two methods of calculation are bundled together, and as a result, the chosen method must also be provided as the input parameter). We consider the pedigree as a forest of separate trees, where each tree represents one single-gender lineage that propagates from one founder dam or sire. Depending on the sizes of the distinct trees in the whole pedigree, a set of K values for each separate tree is calculated. The K value for i-th tree is denoted with . If the size of the whole pedigree is denoted with N, then the size of the i-th tree is denoted with . In actual implementation and the proportional values are calculated, and separate lineages are processed sequentially. However, for the sake of the simplicity of the exposition, in the following we shall treat a pedigree as a single tree and use notation. Then the statement of the algorithmic problem is: In a pedigree tree of size N, find the subset of K individuals such that no other subset of K individuals has a larger sum of mutual pairwise distances.
We provide two algorithmic solutions, a sub-optimal greedy, and the optimal. Their principles of operation are explained as follows.
Greedy: The greedy algorithm builds the solution subset incrementally. In the first stage, the subset is initialized to the most distant pair of individuals. Finding the pair is a classical problem that can be solved in time proportional to N. The second stage has steps. In each step we find an individual not yet in the subset with the largest cumulative distance to all the individuals in the subset. At the end of the second stage, the subset has reached size K and we are done.
Steps of the second stage can be performed efficiently if we maintain, for each individual, the cumulative distance to all the individuals in the subset. As the number of steps is proportional to K steps, the total time complexity of the second stage, and the algorithm, is proportional to .
Optimal: The optimal algorithm expresses the total distance to be maximized as a sum of values assigned to edges. Each edge is assigned a value corresponding to the number of pairs of individuals from the subset such that the individuals are on different sides of the edge. Note that the value assigned to an edge is simply the product of two numbers: counts of individuals from the subset for each of the two subtrees that the edge connects. The sum of those values over all the edges is equal to the total pairwise distance between individuals in the subset.
We observe the following: if we choose an edge and decide how many individuals from the subset will be on each side of the edge, the problem breaks into two. As the value of the chosen edge is finalized, we can remove it and continue by finding a fixed-size subset of individuals from each of two subtrees generated by the edge removal. The two subtree problems can be solved independently. This is true because for an edge of a subtree it holds that all the individuals outside the subtree are on the same side of the edge. Hence, to compute the contribution of a subtree, we do not need to know the exact choice of individuals outside the subtree, but only their count.
This problem is solved using dynamic programming. A subproblem in the dynamic programming algorithm will be a subtree and an integer . The answer to a subproblem is the sum of values assigned to the edges in the subtree after optimal choice of k individuals. It is important to note that the choice of individuals from the rest of the tree does not matter. Only their count matters, which is equal to .
By removing edges in a specific order (rooted, depth-first traversal) we can see that the number of different subproblems is proportional to . Each of them can be solved in time proportional to K as it involves removing an edge and considering all choices for the number of individuals on each side of the removed edge. The total time complexity of the algorithm is then proportional to .
The selected subsets of individuals are associated with the score values, calculated as the sum of all pairwise distances among the selected individuals. For the measured scores on real and simulated pedigrees, the difference between optimal and greedy method is observed to be very small, less than 0.01% in all cases. Incidentally, the differences are larger on trees with long sequences of descendants with only one individual per generation—an unlikely case in the real-life populations.
It should be noted that we use the term “optimal” in the context of the algorithmic solution and that this does not necessarily imply that the actual set of the chosen individuals will indeed be optimal for the largest coverage of the variability. We merely expect it to be, since the mutual distance maximisation is the natural approach to the problem of variability coverage.
2.3. Usage
2.3.1. Defining the Reference Population
The selection for sampling must be restricted to the accessible individuals, that is, the alive ones, or those with the stored tissue samples. Such individuals form the reference population, and they have to be separately annotated in a pedigree. For this purpose, we use an additional column in the pedigree file labelled live, where a non-empty live field indicates that the individual belongs to the reference population.
Therefore, the first step in using our software should be to define the reference population through assignment of the live values to individuals. If the live column is not found in the subsequent processing, the entire pedigree is treated as the reference population. The results obtained in this way may not have biological relevance.
The simplest way to designate the reference population in a pedigree is to assign the live status to all individuals born after a given year. We provide the script setrefpop.py that performs this function. If the reference population is otherwise defined, the user should employ a customized function.
Besides defining the reference population, the live attribute is used for the selection of gender lineages, as explained below.
2.3.2. Selecting the Gender Lineage
The main motive for the construction of our software was to support the analysis of mitochondrial DNA variability. As a result, the default usage is adjusted for the dam lineages. However, the algorithms are the same, regardless of the gender, and our software can be easily applied to the analysis of sire lineages and Y chromosome variability. In order to do that two actions are required.
Firstly, the reference population must include only sires. The setrefpop.py script includes this option and can assign live value only to male individuals, which can be combined with the referent year setting.
Secondly, the user must set the options for the preprocessing script to include only male individuals. The details on preprocessing are given in the following section.
2.3.3. Preprocessing and the Main Processing
Our algorithms work with a standard edge list graph representation of a pedigree tree. We provide the script preprocess.py that transforms the input pedigree file into the appropriate input for the main processing program. Along with the list of edges, the transformed file contains the explicit list of individuals that belong to the reference population. In order to process the paternal lineages in a pedigree, the user must input the --parent father option, while the --parent mother option is the default mode. This is clearly indicated in the instructions included with the software.
The final processing is performed with main.py script that requires a preprocessed input file, a value for K, and the choice between the greedy or optimal method.