A Fast Method for the Selection of Samples in Populations with Available Genealogical Data
Round 1
Reviewer 1 Report
This is a concise and pithy study presenting a couple of faster (NK^2, more exhaustive, and NK, greedy) algorithms for the sample selection problem (until now, N^3 complexity). A usable codebase accompanies the study.
There are two issues that need to be addressed by the authors:
Just how "optimal" is the NK^2 algorithm? How does it compare, performance-wise, with the established N^3 algorithm? In general, there is no comparative (inter-algorithm) benchmarking in the manuscript.
"The two subtree problems can be solved independently as the choice of individuals from one subtree does not affect the values that will be assigned to edge in the other subtree" --- this is an intuitively appealing assumption, but is it really true? Could the authors explain? Also, the pedigree is considered as a forest of separate trees. Are they independent?
Provided these two issues are addressed (especially the comparative benchmarking), this promises to be a very useful study.
Author Response
Point 1: Just how "optimal" is the NK^2 algorithm? How does it compare, performance-wise, with the established N^3 algorithm? In general, there is no comparative (inter-algorithm) benchmarking in the manuscript.
Response 1: The new heuristic (greedy) algorithm works somewhat differently from the previous one implemented in the Magellan software. Interestingly, while much faster, it also produces slightly better results, i.e., larger scores. However, the differences in scores between the two are very small, as are the differences relative to the optimal algorithm. We have added Table 2, which includes a few results to illustrate this fact, together with some additional text to that effect.
Point 2: "The two subtree problems can be solved independently as the choice of individuals from one subtree does not affect the values that will be assigned to edge in the other subtree" --- this is an intuitively appealing assumption, but is it really true? Could the authors explain?
Response 2: This follows from the way we express the total distance to be optimized: as a sum of contributions over all the edges. The contribution of an edge is the number of pairs of individuals from the subset whose path traverses the edge. It is equal to the product of the counts of chosen individuals on the two sides of the edge, i.e. K_e * (K - K_e), where K is the total number of chosen individuals and K_e is the number of chosen individuals on one side of the edge. For each edge of a subtree, all the individuals outside the subtree are on the same side of the edge. Hence, we are able to compute the contribution of the subtree edges without knowing the exact choice of individuals outside the subtree, but only their count.
We agree that this should have been explained more clearly in the paper, and we have included some additional sentences that we hope will clarify the concept a bit more.
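The edge-contribution argument in this response can be checked on a small example. The following is only a minimal sketch on a hypothetical toy tree; the names `parent`, `pairwise_total`, `edge_total` and `subtree_edge_total` are illustrative assumptions and are not taken from the paper or its codebase. It compares the brute-force sum of pairwise path lengths with the sum of K_e * (K - K_e) over the edges, and then shows that the contribution of the edges of one subtree stays the same when the individuals chosen outside it change but their count does not.

```python
from itertools import combinations

# Hypothetical toy tree stored as child -> parent (None marks the root).
parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a", "e": "b"}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def distance(u, v):
    """Number of edges on the tree path between u and v."""
    depth_u = {n: i for i, n in enumerate(path_to_root(u))}
    for j, n in enumerate(path_to_root(v)):
        if n in depth_u:  # first common ancestor found
            return depth_u[n] + j
    raise ValueError("nodes are not in the same tree")


def pairwise_total(chosen):
    """Brute force: sum of distances over all pairs of chosen individuals."""
    return sum(distance(u, v) for u, v in combinations(sorted(chosen), 2))


def edge_contribution(chosen, child):
    """K_e * (K - K_e) for the edge joining `child` to its parent."""
    k = len(chosen)
    k_e = sum(1 for x in chosen if child in path_to_root(x))
    return k_e * (k - k_e)


def edge_total(chosen):
    """Sum of the contributions of all edges of the tree."""
    return sum(edge_contribution(chosen, c)
               for c, p in parent.items() if p is not None)


def subtree_edge_total(chosen, top):
    """Contribution of the edges at or below `top`, including the edge
    that joins `top` to its parent."""
    return sum(edge_contribution(chosen, c)
               for c, p in parent.items()
               if p is not None and top in path_to_root(c))


selection_1 = {"c", "d", "b", "e"}  # two chosen inside subtree "a", two outside
selection_2 = {"c", "d", "r", "b"}  # same inside choice, different outside choice

print(pairwise_total(selection_1), edge_total(selection_1))
print(subtree_edge_total(selection_1, "a"), subtree_edge_total(selection_2, "a"))
```

In this toy case both totals agree, and the contribution of the edges of the subtree rooted at "a" is identical for the two selections, since only the number of individuals chosen outside the subtree enters the product.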
Point 3: Also, the pedigree is considered as a forest of separate trees. Are they independent?
Response 3: A pedigree is indeed a complex structure in which connections may exist within any subset of individuals. However, when we deal only with a single gender lineage, where each individual has only one ancestor, the pedigree is reduced to a forest of independent trees. The trees are considered independent because the only connections between them are through the lineages of the other gender, which we do not take into account when dealing with mitochondria or the Y chromosome.
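The reduction of a pedigree to a forest can be illustrated with a short sketch. The pedigree records, the `(father, mother)` tuple layout and the function name `lineage_forest` below are hypothetical and serve only as an illustration; they do not reflect the input format or the code of the published software. Each individual is followed through a single parental line to its founder, and individuals sharing a founder form one tree.

```python
from collections import defaultdict

FATHER, MOTHER = 0, 1

# Hypothetical pedigree: id -> (father, mother); None means unknown (founder).
pedigree = {
    "A": (None, None),  # founder female
    "B": (None, None),  # founder female
    "X": (None, None),  # founder male
    "C": ("X", "A"),    # daughter of X and A
    "D": ("X", "B"),    # son of X and B
    "E": ("D", "C"),    # offspring of D and C
}


def lineage_forest(pedigree, parent_index):
    """Group individuals by the founder reached when only one parental
    line (FATHER or MOTHER) is followed. Assumes every referenced parent
    has its own record in `pedigree`."""
    forest = defaultdict(list)
    for individual in pedigree:
        root = individual
        while pedigree[root][parent_index] is not None:
            root = pedigree[root][parent_index]
        forest[root].append(individual)
    return {root: sorted(members) for root, members in forest.items()}


maternal = lineage_forest(pedigree, MOTHER)
paternal = lineage_forest(pedigree, FATHER)
print(maternal)
print(paternal)
```

In this example C and D share the father X, yet they fall into different maternal trees (rooted at A and B), because the paternal link is ignored when only the maternal line is followed.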
Reviewer 2 Report
The authors present an interesting and potentially useful method for selection of samples in populations with genealogical data. The described method can be very useful if it works properly. Thus, this paper is worth publishing. However, some revisions are required.
Major points:
- Is Magellan really the only other software available for such analyses? If not, please, compare the new method to other tools.
- Please, provide links where the scripts you described can be seen and evaluated.
Minor points:
3. Line 94: What does the square mean (at the end of the sentence)?
4. Table 1. Please, use "min" for abbreviation of "minutes". In the SI system, which should be used, "m" stands for "meters", not "minutes".
Author Response
Point 1: Is Magellan really the only other software available for such analyses? If not, please, compare the new method to other tools.
Response 1: To the best of our knowledge, there exists no other software for this purpose. Magellan was confirmed as the first one by the reviewers at the time of publishing, and we are not aware of anything else being published in the meantime.
Point 2: Please, provide links where the scripts you described can be seen and evaluated.
Response 2: The link to the GitHub page has been provided in the abstract, and again in the introduction (line 43). We presumed that this was an appropriate position in the paper. However, it is possible that there exists another convention for link placement that we are not aware of. If that is the case, please let us know.
Point 3: Line 94: What does the square mean (at the end of the sentence)?
Response 3: We apologize for the informal use of that symbol. The intention was to separate descriptions of algorithms from the rest of the text. In the revised paper we have used vertical space instead.
Point 4: Table 1. Please, use "min" for abbreviation of "minutes". In the SI system, which should be used, "m" stands for "meters", not "minutes".
Response 4: We have corrected the text accordingly, thank you for the comment.
Round 2
Reviewer 1 Report
The revised manuscript is ready for publication. The two criticisms that I had have been addressed satisfactorily.
Reviewer 2 Report
The manuscript is acceptable now. Just a minor suggestion, which can be included at the page proof stage: it would be useful for a reader if a link to the GitHub page was provided in the Methods section. It can be transferred from the Introduction, where only previously known information should be mentioned, so a reader does not expect a description of newly prepared tools to be placed there. Keep the link in the Abstract, since the Abstract is an independent part of the paper, published separately in various databases; thus, it is important to have the link there.