Beyond the Backbone: The Next Generation of Pathwalking Utilities for Model Building in CryoEM Density Maps
Round 1
Reviewer 1 Report
In this work, Hryc et al. present an update of PathWalking, a software package for model building from EM maps. The update includes the addition of probabilistic models and a companion tool for modeling waters and ligands. A workflow for this version of PathWalking is shown. Finally, the results of model building using the probabilistic models, and of ligand/water molecule modeling, were quality-assessed and evaluated against published structures. The manuscript is well written. The workflow and the how-to-use instructions for the new functionalities are clearly explained, and the new tools are computationally light, which makes them very practical. In general, I would recommend publishing the manuscript, provided the following points are addressed.
Major points:
- A comparison of models built using the probabilistic modelling function with models built with an old version of PathWalking shall be shown. This is important to convince the potential users that the new models are indeed worth using.
Minor points:
- The authors state that more than one potential pseudoatom is produced. However, how the probabilistic models are introduced is not sufficiently elaborated. The authors seem to mention only this: “Each point on the backbone trace (N) is assigned a probability based on the connection to the next element in the trace (N+1) across all traces.” Please expand this part with some more explanation.
- In the ligand identification part: “pw_ligands.py identified 2608 potential water; the published structure contained 4194 potential waters. Of the 2608 waters, 1814 identified waters were within 1Å of waters in the published structure and only 356 of the waters were further than 5Å away from a water in the published structure”. The authors shall discuss why ~1/3 water molecules were not identified.
- A table listing the computational time/load for representative protein mass would be a worthy addition. Also, a brief description on recommended computing environment would be nice.
Author Response
R1.Q1: A comparison of models built using the probabilistic modeling function with models built with an old version of PathWalking shall be shown. This is important to convince the potential users that the new models are indeed worth using.
R1.A1: In our experience with the tested maps, as well as some additional maps from active projects, the results for automated Pathwalking with and without probabilistic models are virtually identical. However, this does not necessarily mean that the resulting paths are guaranteed to be correct. In fact, even when using probabilistic models, we have noticed that, occasionally, a pseudoatom is placed in a bulky sidechain density, which leads to small registration errors or “jumps” across density. In Apoferritin (Figure 1) and EMDB 22898 (Figure 3B), we saw occasional bad pseudoatom placements in the helices due to bulky sidechain densities, resulting in improper geometry along the helical path. For the most part, this type of error is fixed during the refinement steps. However, what the probabilistic models provide is a quick visual representation of the path quality, making it easier to spot and track potential model issues. We have clarified this point in the text (Section 4.1, page 10).
R1.Q2: The authors state that more than one potential pseudoatom is produced. However, how the probabilistic models are introduced is not sufficiently elaborated. The authors seem to mention only this: “Each point on the backbone trace (N) is assigned a probability based on the connection to the next element in the trace (N+1) across all traces.” Please expand this part with some more explanation.
R1.A2: We have added additional text describing the pseudoatom generation in probabilistic models to the manuscript (Section 2.2, page 5). Indeed, when we generate a set of decoys, multiple sets of pseudoatoms are generated at various density thresholds and noise levels, resulting in positional variance among the sets of pseudoatoms. However, the exact number of pseudoatoms generated remains constant, and all sets are in the same coordinate system as the map of interest. In this regard, it is relatively simple to calculate the optimal alignment of all sets of pseudoatoms. From all sets of pseudoatoms, an average position for each point is calculated, from which an “average” path is calculated with Pathwalking. All paths, based on the various pseudoatom sets, are then compared to the average model path. A simple percentage is calculated comparing the connectivity of pseudoatom N in the average path to the connectivity of pseudoatom N in each of the decoy paths. As the paths from Pathwalking are agnostic to direction, both the N+1 and N-1 positions are examined.
R1.Q3: In the ligand identification part: “pw_ligands.py identified 2608 potential water; the published structure contained 4194 potential waters. Of the 2608 waters, 1814 identified waters were within 1Å of waters in the published structure and only 356 of the waters were further than 5Å away from a water in the published structure”. The authors shall discuss why ~1/3 water molecules were not identified.
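The connectivity comparison described in R1.A2 can be illustrated with a minimal sketch. It assumes each path is stored as a list of pseudoatom indices that are already matched across all decoy sets; the function names `undirected_edges` and `connection_support` are hypothetical and are not taken from the Pathwalking code.

```python
from typing import Dict, FrozenSet, List, Set


def undirected_edges(path: List[int]) -> Set[FrozenSet[int]]:
    """Connections between consecutive pseudoatoms, ignoring direction."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def connection_support(
    average_path: List[int], decoy_paths: List[List[int]]
) -> Dict[int, float]:
    """Fraction of decoy paths that reproduce each connection of the average path.

    The score for pseudoatom N refers to its connection to the next pseudoatom
    in the average path. Edges are undirected, so a decoy that walks the same
    connection in the opposite direction (N-1 instead of N+1) still counts.
    """
    if not decoy_paths:
        raise ValueError("at least one decoy path is required")
    decoy_edge_sets = [undirected_edges(p) for p in decoy_paths]
    support: Dict[int, float] = {}
    for current, following in zip(average_path, average_path[1:]):
        edge = frozenset((current, following))
        hits = sum(edge in edges for edges in decoy_edge_sets)
        support[current] = hits / len(decoy_edge_sets)
    return support


if __name__ == "__main__":
    average = [0, 1, 2, 3]
    decoys = [[0, 1, 2, 3], [3, 2, 1, 0], [1, 0, 3, 2], [2, 0, 1, 3]]
    print(connection_support(average, decoys))
```

In this sketch the last pseudoatom of the average path receives no score because it has no following connection; how the actual software treats the chain ends is not stated in the response.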
R1.A3: The reviewer raises a good question: why don’t we identify all of the waters in the published structure? There are a number of possible ways to model water in the density map. Our ligand identification tool attempts to find un-modeled, non-protein density in the map. Part of this process is examining the consistency of the voxel values between the half maps computed in the reconstruction. We surmise that this step is responsible for eliminating a number of potential waters that were identified in the published structure. pw_ligands.py identified over 8000 initial sites in EMDB 7770, which were then pruned to just under 3000 sites in the final results. It is also worth noting that ligand identification in our software is based solely on observable density features and their spatial relationship to the model. No explicit chemistry is used in the identification of ligands in pw_ligands.py, and as such, regions with good chemistry but poorly resolved density will not be modeled.
For the maps and models in the CryoEM Ligand Modeling Challenge, a variety of different approaches were taken to identify the ligands and waters, ranging from purely based on chemistry to purely based on density. Interestingly, none of the submitted models were able to reproduce all the waters in the published structure. We have added a small discussion to the manuscript to clarify this. (Section 4.2, page 11)
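For reference, the counts quoted in R1.Q3 correspond to roughly 70% of the identified waters (1814 of 2608) lying within 1Å of a published water and roughly 14% (356 of 2608) lying more than 5Å away. A comparison of this kind can be sketched as below. This is a brute-force illustration under the assumption that both water sets are available as lists of (x, y, z) coordinates in Å; the names `nearest_distances` and `summarize` are hypothetical, and it is not the assessment code used in the Challenge.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(
    predicted: Sequence[Point], reference: Sequence[Point]
) -> List[float]:
    """Distance from each predicted site to its closest reference site."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return [min(math.dist(p, r) for r in reference) for p in predicted]


def summarize(
    distances: Sequence[float], near: float = 1.0, far: float = 5.0
) -> Dict[str, int]:
    """Count sites within `near` Å and sites further than `far` Å."""
    return {
        "total": len(distances),
        "within_near": sum(d <= near for d in distances),
        "beyond_far": sum(d > far for d in distances),
    }


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    published = [(0.5, 0.0, 0.0), (3.0, 4.0, 0.0)]
    print(summarize(nearest_distances(predicted, published)))
```

With a few thousand sites on each side the brute-force search is still feasible, although a spatial index such as a k-d tree would be the more usual choice.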
R1.Q4: A table listing the computational time/load for representative protein mass would be a worthy addition. Also, a brief description of the recommended computing environment would be nice.
R1.A4: We have added additional text to the manuscript regarding computing environment and timings (Section 3.5, pages 9-10). With Pathwalking, the time needed to compute the actual path through a set of points corresponding to Calpha atoms in the density map is negligible, as modern TSP solvers take a few hundred milliseconds to 2-3 seconds. In Google OR-Tools, the default TSP solver in the latest version of Pathwalking, there is an option to set a maximum time limit for the search, which we have set at 30 seconds. The two most time-consuming steps in the Pathwalking process are pseudoatom creation and real-space refinement. Depending on which pseudoatom method is specified, pseudoatom creation takes between 30 seconds and 5 minutes for medium to large (~100-1000aa) proteins. Real-space refinement is the most time-consuming step and is largely dependent on map size, but typically averages 10-15 minutes when utilizing the default options. As for the recommended computational environment, Pathwalking is relatively “light” in terms of computing and requires only a single CPU on a modern (less than 10 years old) desktop or laptop computer.
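To make the idea of a time-limited path search concrete, the sketch below traces an open path through a set of 3D points using a nearest-neighbour start followed by 2-opt improvement, stopping when no improvement is found or when a wall-clock limit is reached. This is a stand-in heuristic written with the standard library only; it is not the OR-Tools solver that Pathwalking uses, and the names `trace_path` and `path_length` are hypothetical.

```python
import math
import time
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def path_length(points: Sequence[Point], order: Sequence[int]) -> float:
    """Total length of the open path that visits `points` in `order`."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def trace_path(points: Sequence[Point], time_limit_s: float = 30.0) -> List[int]:
    """Nearest-neighbour path refined by 2-opt, bounded by a time limit."""
    if not points:
        return []
    deadline = time.monotonic() + time_limit_s

    # Greedy start: always step to the closest unvisited point.
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        last = order[-1]
        nearest = min(unvisited, key=lambda j: math.dist(points[last], points[j]))
        unvisited.remove(nearest)
        order.append(nearest)

    # 2-opt: reverse a segment whenever that shortens the path.
    improved = True
    while improved and time.monotonic() < deadline:
        improved = False
        for i in range(len(order) - 1):
            if time.monotonic() >= deadline:
                break
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_length(points, candidate) < path_length(points, order) - 1e-9:
                    order = candidate
                    improved = True
    return order


if __name__ == "__main__":
    pts = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    best = trace_path(pts, time_limit_s=1.0)
    print(best, path_length(pts, best))
```

The sketch recomputes the full path length for every candidate, which is adequate for small examples only; a dedicated solver handles the few hundred points of a typical protein far more efficiently.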
Reviewer 2 Report
The manuscript by Hryc and Baker presents an improved method for automatic model building including the identification of ligands and water molecules. It describes new developments that are based on the author's previously published Pathwalker software. The presented method has been applied and evaluated in the 2021 CryoEM Ligand Challenge organized by the PDB/EMDB. The paper is clearly written, addresses an important problem, and presents timely new developments. I therefore recommend publication. However, there are some issues that I suggest to address before publication:
Step 5), sequence assignment, all-atom modeling and refinement, is not really described in the main text. It just says in the Supplement that Pulchra and phenix.real-space-refine are used for this step. Depending on the resolution, this step can also be problematic. A few more sentences on this step would be useful; it would also be interesting to learn whether and how much manual intervention was needed. A reference for "Pulchra" is missing.
line 54: Mentioning "Nearly all current methods", some of them should be described in more detail and their papers should be included in the references. For example :
Terashi, Genki, and Daisuke Kihara. "De novo main-chain modeling for EM maps using MAINMAST." Nature communications 9.1 (2018): 1-11.
Pfab, Jonas, Nhut Minh Phan, and Dong Si. "DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes." Proceedings of the National Academy of Sciences 118.2 (2021).
line 62: the authors might want to consider using the term Traveling Sales Person Problem.
line 87: For a clearer structure, the first paragraph of the methods sections may be declared a subsection, and titled : workflow overview or similar
Figure 1: The numbering of steps 6 and 7 is confusing, since those steps are not performed at the end of the workflow, but between steps 2 and 3, maybe 2a, 2b? Depicting connections of low reliability with red circles is a bit counter-intuitive, cylinders may be an alternative. color coding connections with a reliability of 75 percent white is not easy to see on a white background.
line 153: It is not clear how many pseudo atoms should be sampled in the non protein density.
206: In which situation are the statistics not computed on the pipeline website ?
line 217: see line 87, a title for the first paragraph would clarify the structure of the text.
figure 3: see figure 1, white is not a good color
line 314. I assume the RNA was ignored for the RMSD calculation? Maybe mention this explicitly, to avoid confusion.
Figure S1: green and yellow are hard to distinguish
Author Response
R2.Q1: Step 5), sequence assignment, all-atom modeling and refinement is not really described in the main text. It just says in the Supplement that Pulchra and phenix.real-space-refine is used for this step. Depending on the resolution, this step can also be problematic. A few more sentences on this step would be useful, it would also be interesting to learn whether and how much manual intervention was needed. A reference for "Pulchra" is missing.
R2.A1: C-alpha to mainchain assignment and real-space refinement utilize well-established tools in Phenix and are not unique to Pathwalking; Pathwalking simply calls these utilities using default options. For these two steps, the process is completely automated. The only manual intervention is deciding on the directionality of the path, as the Pathwalking results are agnostic to directionality. We have clarified this in the supplement and added the appropriate references.
R2.Q2: line 54: Mentioning "Nearly all current methods", some of them should be described in more detail and their papers should be included in the references. For example : Terashi, Genki, and Daisuke Kihara. "De novo main-chain modeling for EM maps using MAINMAST." Nature communications 9.1 (2018): 1-11. Pfab, Jonas, Nhut Minh Phan, and Dong Si. "DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes." Proceedings of the National Academy of Sciences 118.2 (2021).
R2.A2: We have added a short discussion of some of the more common approaches to model building in the introduction (Section 1, page 3).
R2.Q3: line 62: the authors might want to consider using the term Traveling Salesperson Problem.
R2.A3: While we recognize the potential gender issue of the Traveling Salesman Problem, this optimization problem is a foundational problem in mathematics and dates back several decades. We respect the reviewer’s comments and have attempted to use TSP instead of the “Traveling Salesman Problem” where possible. Though “Traveling Salesman” is still common parlance in the field, Google’s OR-Tools, the default TSP solver in Pathwalking, has adopted the term Traveling Salesperson and, as such, we will refer to it in this manner (page 4).
R2.Q4: line 87: For a clearer structure, the first paragraph of the methods sections may be declared a subsection, and titled : workflow overview or similar
R2.A4: We have added the subsection header.
R2.Q5: Figure 1: The numbering of steps 6 and 7 is confusing, since those steps are not performed at the end of the workflow, but between steps 2 and 3, maybe 2a, 2b? Depicting connections of low reliability with red circles is a bit counter-intuitive, cylinders may be an alternative. color coding connections with a reliability of 75 percent white is not easy to see on a white background.
R2.A5: Thank you for the suggestion. We have renumbered the steps in Figure 1. In terms of the display, we have mapped the probability score to the B-factor column of the PDB file and use standard methods for coloring a model based on B-factor values.
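Writing a per-residue score into the B-factor field relies on the fixed-column layout of PDB ATOM/HETATM records, in which the temperature factor occupies columns 61-66. A minimal sketch follows; the helper names `with_bfactor` and `apply_scores`, and the choice to key scores by residue number, are assumptions made for illustration and are not taken from Pathwalking.

```python
from typing import Dict, Iterable, List


def with_bfactor(atom_line: str, score: float) -> str:
    """Return the record with columns 61-66 (tempFactor) set to `score`."""
    padded = atom_line.rstrip("\n").ljust(80)
    return padded[:60] + f"{score:6.2f}" + padded[66:]


def apply_scores(
    pdb_lines: Iterable[str], score_by_residue: Dict[int, float]
) -> List[str]:
    """Copy PDB lines, replacing the B-factor of residues that have a score."""
    out: List[str] = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            residue_number = int(line[22:26])  # resSeq, columns 23-26
            if residue_number in score_by_residue:
                line = with_bfactor(line, score_by_residue[residue_number])
        out.append(line.rstrip("\n"))
    return out


if __name__ == "__main__":
    record = (
        "ATOM  " + "    1" + " " + " CA " + " " + "ALA" + " " + "A" + "   1"
        + " " + "   " + "  11.104" + "   6.134" + "  -6.504"
        + "  1.00" + "  0.00" + "          " + " C"
    )
    print(apply_scores([record], {1: 75.0})[0])
```

Scores expressed as percentages (0 to 100) fit the six-character field; molecular viewers can then color the model with their standard B-factor coloring schemes.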
R2.Q6: line 153: It is not clear how many pseudo atoms should be sampled in the non protein density.
R2.A6: Unlike path identification in Pathwalking, the user does not need to provide an exact number of pseudoatoms. Instead of using K-means (the default method for pseudoatom generation in Pathwalking), the ligand identification uses Mean Shift clustering with an automatic bandwidth estimator. There is a single tunable variable in pw_ligands.py that can be used to increase or decrease the number of pseudoatoms in a map, though we have found that the default values typically generate a few thousand points in near-atomic resolution maps, which appears to be sufficient for ligand identification. We have updated the text in section 2.4, page 6.
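The difference from K-means is that Mean Shift does not need the number of clusters in advance; it only needs a bandwidth, which can be estimated from the data. The sketch below is a small standard-library illustration with a flat kernel. The bandwidth heuristic in `guess_bandwidth` (average distance to a quantile-ranked neighbour) and all names are assumptions for illustration. Libraries such as scikit-learn provide `MeanShift` and `estimate_bandwidth`, but whether pw_ligands.py uses them is not stated in the response.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def guess_bandwidth(points: Sequence[Point], quantile: float = 0.3) -> float:
    """Average distance from each point to its quantile-ranked neighbour."""
    if not points:
        raise ValueError("points must not be empty")
    k = max(1, int(len(points) * quantile))
    total = 0.0
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)  # dists[0] is p itself
        total += dists[min(k, len(points) - 1)]
    return total / len(points)


def mean_shift(
    points: Sequence[Point], bandwidth: float, max_iter: int = 100, tol: float = 1e-4
) -> List[Point]:
    """Flat-kernel mean shift; returns one centre per detected cluster."""
    centres: List[Point] = []
    for seed in points:
        centre = seed
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(centre, q) <= bandwidth]
            if not neighbours:
                break
            shifted = tuple(sum(c) / len(neighbours) for c in zip(*neighbours))
            moved = math.dist(centre, shifted)
            centre = shifted
            if moved < tol:
                break
        # Keep the centre only if it is not a duplicate of an earlier one.
        if all(math.dist(centre, existing) > bandwidth for existing in centres):
            centres.append(centre)
    return centres


if __name__ == "__main__":
    pts = [
        (0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.2, 0.0),
        (10.0, 10.0, 10.0), (10.2, 10.0, 10.0), (10.0, 10.2, 10.0),
    ]
    print(guess_bandwidth(pts))
    print(mean_shift(pts, bandwidth=1.0))
```

In this sketch a smaller bandwidth yields more, tighter clusters and a larger one yields fewer, which is the kind of behaviour a single tunable variable controlling the number of pseudoatoms would expose.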
R2.Q7: 206: In which situation are the statistics not computed on the pipeline website ?
R2.A7: This was a poor choice of words. All of the models that were part of the CryoEM Ligand Model Challenge were uploaded and assessed in a (mostly) automated pipeline by a third party. For unknown reasons, some of the resulting statistics on the assessment website were not reported for the models. In the cases relevant to our submitted models, we performed the calculations as described by the assessors. We have updated the text in section 2.4, page 7.
R2.Q8: line 217: see line 87, a title for the first paragraph would clarify the structure of the text.
R2.A8: We have fixed this in the revised manuscript.
R2.Q9: figure 3: see figure 1, white is not a good color
R2.A9: We appreciate the reviewer’s concern but we are using a standard representation method that is consistent across the figures.
R2.Q10: line 314. I assume the RNA was ignored for the RMSD calculation? Maybe mention this explicitly, to avoid confusion.
R2.A10: That is correct, RMSD does not consider the RNA. We have now specified this in the Table 2 legend.
R2.Q11: Figure S1: green and yellow are hard to distinguish
R2.A11: The colors have been fixed.
Reviewer 3 Report
The manuscript “Beyond the Backbone: The Next Generation of Pathwalking Utilities for Model Building in CryoEM Density Maps” submitted by Corey F. Hryc and Matthew L. Baker reports an updated version of Pathwalking, a software package for the de novo modeling of protein structures in cryoEM density maps. In particular, the authors report the latest technical developments, which include the addition of probabilistic models and a semi-automated approach for identifying ligands directly in near-atomic resolution maps. These improvements were further tested using the 2021 CryoEM Ligand Challenge data set. In my opinion, the manuscript is well presented and the technical developments reported in this work support its publication in Biomolecules. Only minor revisions are needed.
Minor comment
- Discussion section. Line 41. The authors mention that “It should be noted that pw_ligands did locate several potential water sites in the IPR3R1 reconstructions, though at the reported resolutions, these sites were not distinguishable from noise in the reconstructions and thus, considered in the final models.”. It would be interesting if the authors could compare the performance of the Pathwalking approach in water/ligand building with other known automated building tools. Please discuss whether the rates of false positives are comparable to those of other ligand building tools.
Author Response
R3.Q1: Discussion section. Line 41. The authors mention that “It should be noted that pw_ligands did locate several potential water sites in the IPR3R1 reconstructions, though at the reported resolutions, these sites were not distinguishable from noise in the reconstructions and thus, considered in the final models.”. It would be interesting if the authors can compare the performance of the Pathwalking approach in water/Ligand building with other known automated building tools. Please discuss if the rates of false positives are comparable to other ligand building tools.
R3.A1: We have added a small discussion regarding the performance of pw_ligands.py in the final section of the Discussion (Sections 4.1, 4.2, pages 11, 12). Interestingly, the localization and modeling of ligands, other than ions and waters, was remarkably similar amongst all of the different programs. However, the identification of waters and ions in the maps varied considerably in both numbers and location. This is likely due to how these features were identified: purely based on map features, purely based on chemistry, or based on a combination of density map and model. It is difficult to say with any level of confidence which of these methods is the most accurate given the resolution of the test cases. At the current “standard” resolutions in cryoEM, it might be more accurate to say that these programs can identify “potential waters and ions”; ultimately the users of the software are the arbiters of the final model assignments and must make the decision based on all available structural, biochemical and visual evidence. As such, it is a bit difficult to calculate the rate of false positives or missed waters/ions.