CBCR: A Curriculum Based Strategy For Chromosome Reconstruction
Abstract
1. Introduction
2. Materials and Methods
2.1. Division of Curricula
2.2. Conversion of Interaction Frequencies to Distances
2.3. Objective Function
2.4. Optimization
2.5. Early Stopping Criteria
2.6. Data
2.6.1. Simulated Hi-C Data
2.6.2. Real Hi-C Data
2.6.3. ChIA-PET and HiChIP Data
2.6.4. FISH Data
2.7. Evaluation
2.8. Hyperparameters
- Inter-curricula convergence threshold: A convergence threshold on the dSCC between curricula. If the absolute difference between the dSCC values of two consecutive sub-curricula is less than this value, all of the remaining data is merged and used for one final training. The default value for this hyperparameter is guided by the findings in [11].
- Intra-curriculum convergence threshold: A convergence threshold for the optimization within a curriculum. If the norm of the distance gradients at iteration t is less than this value times the loss at iteration t, we stop optimizing over the current curriculum and move to the next curriculum. The default value for this hyperparameter is guided by the findings in [11].
- Maximum total iterations: The maximum total number of iterations over all sub-curricula combined. This value bounds the number of iterations that CBCR may run during training. It is set by the user and should be guided by the size of the input data: the larger the input, the more iterations are needed to optimize sufficiently. Our analysis shows that a value of 4500 is sufficiently large for 100 kb resolution Hi-C data (see Section 3.3).
- Maximum final-training iterations: The maximum number of iterations for the final training of CBCR. Recall that once the inter-curricula convergence threshold is met, all of the remaining data is merged and used for one final training. This hyperparameter dictates how many iterations are allowed over this final, merged curriculum. It is kept separate from the maximum total number of iterations so that, once convergence is met, we do not iterate excessively over the remaining data. For example, if the maximum total number of iterations is set to 10,000 but the inter-curricula convergence threshold is met at iteration 1000, the remaining 9000 iterations may be far more than is needed to converge over the remaining data. Setting this hyperparameter to a lower, pre-set number ensures that the remaining data is still optimized within a reasonable bound of iterations. Like the maximum total number of iterations, this value is set by the user and should be guided by the size of the input data. Our analysis shows that a value of 500 is sufficiently large for 100 kb resolution Hi-C data (see Section 3.3).
- Learning rate: As we utilize adaptive optimizers, the convergence of CBCR is less sensitive to the initial choice of the learning rate. So long as we choose a value that is sufficiently large, the optimizers will reduce it as training progresses to ensure convergence. The default value for this hyperparameter is guided by the findings in [11].
- Conversion factor: The conversion factor that dictates the relationship between interaction frequency and wish distance given by Equation (2). This can be either a single value or a range set by the user. For our experiments, we run CBCR for each value in a set of candidate conversion factors and choose the one that maximizes dSCC as the optimal conversion factor. This choice was guided by the findings in [11,37].
- β1: The bias parameter for the first moment estimator in the Adam optimizer. The default value for this hyperparameter is 0.9, as recommended in [22].
- β2: The bias parameter for the second moment estimator in the Adam optimizer. The default value for this hyperparameter is 0.999, as recommended in [22].
- : The scaling parameter for the previously seen data in each curriculum. This hyperparameter dictates how much probabilistic mass we assign to the previously seen data in each curriculum. As is optimized during the training process, its initial value changes during training. The default value for this hyperparameter is .
- : The scaling parameter for the new, untrained data in each curriculum. This value is defined to be throughout training. Thus, the default initial value of is also .
- Number of Curricula: The total number of curricula, which dictates how many curricula we divide our data into. This is the central hyperparameter in CBCR. We show that the reconstruction accuracy of CBCR is robust to the choice of this hyperparameter. We also show that setting the number of curricula to one fifth of the total number of contacts approximately minimizes the run-time of CBCR. See Section 3.1 for details of these findings. Accordingly, the default value for the number of curricula is N/5, where N is the number of contact sites corresponding to the input data.
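The stopping rules and defaults described in this list can be sketched in a few lines of Python. This is only an illustrative sketch, not the CBCR implementation: all function and argument names are hypothetical, the threshold values in the usage example are placeholders rather than the published defaults, and rounding one fifth of the contacts down (with a floor of one curriculum) is an assumption.

```python
from typing import Callable, Iterable


def inter_curricula_converged(prev_dscc: float, curr_dscc: float, tol: float) -> bool:
    """True when two consecutive sub-curricula dSCC values differ by less than tol."""
    return abs(curr_dscc - prev_dscc) < tol


def intra_curriculum_converged(grad_norm: float, loss: float, factor: float) -> bool:
    """True when the gradient norm at this iteration is below factor * loss."""
    return grad_norm < factor * loss


def default_num_curricula(n_contacts: int) -> int:
    """One fifth of the contact count; flooring and the minimum of 1 are assumptions."""
    return max(1, n_contacts // 5)


def select_conversion_factor(
    candidates: Iterable[float], dscc_for: Callable[[float], float]
) -> float:
    """Return the candidate conversion factor with the highest dSCC.

    dscc_for is a hypothetical callback that runs a reconstruction for one
    conversion factor and returns the resulting dSCC.
    """
    return max(candidates, key=dscc_for)


if __name__ == "__main__":
    # Placeholder numbers chosen only to exercise the functions.
    print(inter_curricula_converged(0.90, 0.9005, tol=1e-3))
    print(intra_curriculum_converged(grad_norm=0.01, loss=100.0, factor=1e-3))
    print(default_num_curricula(100))
    # Toy score that peaks at 0.5, standing in for a full CBCR run.
    print(select_conversion_factor([0.1, 0.5, 1.0], lambda a: -(a - 0.5) ** 2))
```

Passing the dSCC evaluation in as a callback keeps the selection of the conversion factor independent of how a single reconstruction is run.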
3. Results
3.1. Finding the Optimal Number of Curricula
3.2. Time Performance
3.3. Reconstruction Accuracy: Real Data
4. Validation
4.1. Reproducibility Across Hi-C Experiments: Primary vs. Replicate
4.2. Reproducibility Across Restriction Enzymes: MboI vs. DpnII
4.3. Validation on FISH Data
4.4. Validation on ChIA-PET and HiChIP Data
4.5. Comparison of Outputs Between CBCR and 3DMax
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
3D | Three dimensional |
3C | Chromosome Conformation Capture |
dSCC | Distance Spearman Correlation Coefficient |
dPCC | Distance Pearson Correlation Coefficient |
dRMSE | Distance Root Mean Squared Error |
CBCR | Curriculum Based Chromosome Reconstruction |
GSDB | Genome Structure Database |
FISH | Fluorescence in Situ Hybridization |
ChIA-PET | Chromatin Interaction Analysis by Paired-End Tag Sequencing |
HiChIP | Hi-C Chromatin Immunoprecipitation |
GEO | Gene Expression Omnibus |
Appendix A
Appendix A.1
Appendix A.2
Appendix B
Appendix C
References
1. Sati, S.; Cavalli, G. Chromosome conformation capture technologies and their impact in understanding genome function. Chromosoma 2017, 126, 33–44.
2. Cremer, T.; Cremer, C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat. Rev. Genet. 2001, 2, 292–301.
3. Dekker, J. Capturing Chromosome Conformation. Science (Am. Assoc. Adv. Sci.) 2002, 295, 1306–1311.
4. Simonis, M.; Klous, P.; Splinter, E.; Moshkin, Y.; Willemsen, R.; de Wit, E.; van Steensel, B.; de Laat, W. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet. 2006, 38, 1348–1354.
5. Dostie, J.; Richmond, T.A.; Arnaout, R.A.; Selzer, R.R.; Lee, W.L.; Honan, T.A.; Rubio, E.D.; Krumm, A.; Lamb, J.; Nusbaum, C.; et al. Chromosome Conformation Capture Carbon Copy (5C): A massively parallel solution for mapping interactions between genomic elements. Genome Res. 2006, 16, 1299–1309.
6. van Berkum, N.L.; Lieberman-Aiden, E.; Williams, L.; Imakaev, M.; Gnirke, A.; Mirny, L.A.; Dekker, J.; Lander, E.S. Hi-C: A method to study the three-dimensional architecture of genomes. J. Vis. Exp. JoVE 2010.
7. De Wit, E.; de Laat, W. A decade of 3C technologies: Insights into nuclear organization. Genes Dev. 2012, 26, 11–24.
8. Lieberman-Aiden, E.; van Berkum, N.L.; Williams, L.; Imakaev, M.; Ragoczy, T.; Telling, A.; Amit, I.; Lajoie, B.R.; Sabo, P.J.; Dorschner, M.O.; et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science (Am. Assoc. Adv. Sci.) 2009, 326, 289–293.
9. Oluwadare, O.; Highsmith, M.; Cheng, J. An Overview of Methods for Reconstructing 3-D Chromosome and Genome Structures from Hi-C Data. Biol. Proced. Online 2019, 21, 7.
10. Lesne, A.; Riposo, J.; Roger, P.; Cournac, A.; Mozziconacci, J. 3D genome reconstruction from chromosomal contacts. Nat. Methods 2014, 11, 1141–1143.
11. Oluwadare, O.; Zhang, Y.; Cheng, J. A maximum likelihood algorithm for reconstructing 3D structures of human chromosomes from chromosomal contact data. BMC Genom. 2018, 19, 161.
12. Zhang, Z.; Li, G.; Toh, K.C.; Sung, W.K. Inference of Spatial Organizations of Chromosomes Using Semi-definite Embedding Approach and Hi-C Data; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7821, pp. 317–332.
13. Adhikari, B.; Trieu, T.; Cheng, J. Chromosome 3D: Reconstructing three-dimensional chromosomal structures from Hi-C interaction frequency data using distance geometry simulated annealing. BMC Genom. 2016, 17, 886.
14. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum Learning; ACM: New York, NY, USA, 2009; pp. 41–48.
15. Pombo, A.; Nicodemi, M. Physical mechanisms behind the large scale features of chromatin organization. Transcription 2014, 5, e28447.
16. Barbieri, M.; Chotalia, M.; Fraser, J.; Lavitas, L.M.; Dostie, J.; Pombo, A.; Nicodemi, M. Complexity of chromatin folding is captured by the strings and binders switch model. Proc. Natl. Acad. Sci. USA 2012, 109, 16173–16178.
17. Sexton, T.; Yaffe, E.; Kenigsberg, E.; Bantignies, F.; Leblanc, B.; Hoichman, M.; Parrinello, H.; Tanay, A.; Cavalli, G. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 2012, 148, 458–472.
18. Chiariello, A.M.; Annunziatella, C.; Bianco, S.; Esposito, A.; Nicodemi, M. Polymer physics of chromosome large-scale 3D organisation. Sci. Rep. 2016, 6, 1–8.
19. Shi, G.; Thirumalai, D. Conformational heterogeneity in human interphase chromosome organization reconciles the FISH and Hi-C paradox. Nat. Commun. 2019, 10, 1–10.
20. Trieu, T.; Cheng, J. 3D genome structure modeling by Lorentzian objective function. Nucleic Acids Res. 2017, 45, 1049–1058.
21. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
22. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
23. Zou, C.; Zhang, Y.; Ouyang, Z. HSA: Integrating multi-track Hi-C data for genome-scale reconstruction of 3D chromatin structure. Genome Biol. 2016, 17, 40.
24. Duan, Z.; Andronescu, M.; Schutz, K.; McIlwain, S.; Kim, Y.J.; Lee, C.; Shendure, J.; Fields, S.; Blau, C.A.; Noble, W.S. A three-dimensional model of the yeast genome. Nature 2010, 465, 363–367.
25. Rao, S.S.P.; Huntley, M.H.; Durand, N.C.; Stamenova, E.K.; Bochkov, I.D.; Robinson, J.T.; Sanborn, A.; Machol, I.; Omer, A.D.; Lander, E.S.; et al. A three-dimensional map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014, 159, 1665–1680.
26. Oluwadare, O.; Highsmith, M.; Turner, D.; Lieberman Aiden, E.; Cheng, J. GSDB: A database of 3D chromosome and genome structures reconstructed from Hi-C data. BMC Mol. Cell Biol. 2020, 21, 60.
27. Knight, P.A.; Ruiz, D. A fast algorithm for matrix balancing. IMA J. Numer. Anal. 2013, 33, 1029–1047.
28. Carey, M.F.; Peterson, C.L.; Smale, S.T. Chromatin immunoprecipitation (chip). Cold Spring Harb. Protoc. 2009, 2009, pdb-prot5279.
29. Li, G.; Fullwood, M.J.; Xu, H.; Mulawadi, F.H.; Velkov, S.; Vega, V.; Ariyaratne, P.N.; Mohamed, Y.B.; Ooi, H.S.; Tennakoon, C.; et al. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol. 2010, 11, 1–13.
30. Mumbach, M.R.; Rubin, A.J.; Flynn, R.A.; Dai, C.; Khavari, P.A.; Greenleaf, W.J.; Chang, H.Y. HiChIP: Efficient and sensitive analysis of protein-directed genome architecture. Nat. Methods 2016, 13, 919–922.
31. Mumbach, M.R.; Satpathy, A.T.; Boyle, E.A.; Dai, C.; Gowen, B.G.; Cho, S.W.; Nguyen, M.L.; Rubin, A.J.; Granja, J.M.; Kazane, K.R.; et al. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nat. Genet. 2017, 49, 1602.
32. Bhattacharyya, S.; Chandra, V.; Vijayanand, P.; Ay, F. Identification of significant chromatin contacts from HiChIP data by FitHiChIP. Nat. Commun. 2019, 10, 1–14.
33. Pekowska, A.; Benoukraf, T.; Ferrier, P.; Spicuglia, S. A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res. 2010, 20, 1493–1502.
34. Barrett, T.; Suzek, T.O.; Troup, D.B.; Wilhite, S.E.; Ngau, W.C.; Ledoux, P.; Rudnev, D.; Lash, A.E.; Fujibuchi, W.; Edgar, R. NCBI GEO: Mining millions of expression profiles—Database and tools. Nucleic Acids Res. 2005, 33, D562–D566.
35. Langer-Safer, P.R.; Levine, M.; Ward, D.C. Immunological method for mapping genes on Drosophila polytene chromosomes. Proc. Natl. Acad. Sci. USA 1982, 79, 4381–4385.
36. Solovei, I.; Cavallo, A.; Schermelleh, L.; Jaunin, F.; Scasselati, C.; Cmarko, D.; Cremer, C.; Fakan, S.; Cremer, T. Spatial preservation of nuclear chromatin architecture during three-dimensional fluorescence in situ hybridization (3D-FISH). Exp. Cell Res. 2002, 276, 10–23.
37. Rousseau, M.; Fraser, J.; Ferraiuolo, M.A.; Dostie, J.; Blanchette, M. Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling. BMC Bioinform. 2011, 12, 1–16.
38. Varoquaux, N.; Ay, F.; Noble, W.S.; Vert, J.P. A statistical approach for inferring the 3D structure of the genome. Bioinformatics 2014, 30, i26–i33.
39. Tang, Z.; Luo, O.J.; Li, X.; Zheng, M.; Zhu, J.J.; Szalaj, P.; Trzaskoma, P.; Magalska, A.; Wlodarczyk, J.; Ruszczycki, B.; et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 2015, 163, 1611–1627.
Chromosome | L1-L2 Distance | L2-L3 Distance |
---|---|---|
11 | 15.8 | 17.2 |
13 | 5.7 | 21.5 |
14 | 15.7 | 27.5 |
17 | 7 | 40.2 |
Chromosome | L1-L2 Distance | L2-L3 Distance |
---|---|---|
11 | 13.7 | 29.5 |
13 | 8 | 31 |
14 | 28.9 | 33.4 |
17 | 16.2 | 42.2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hovenga, V.; Oluwadare, O. CBCR: A Curriculum Based Strategy For Chromosome Reconstruction. Int. J. Mol. Sci. 2021, 22, 4140. https://doi.org/10.3390/ijms22084140