1. Introduction
Soybean (
Glycine max L.) is an important crop worldwide that represents a major source of protein and vegetable oil for the human diet and animal feed. As of 2021, soybean was the largest source of protein meal in the world, at 243.6 t, and the second-largest source of vegetable oil (58.7 t) after palm oil (
http://soystats.com/, accessed on 12 February 2023). In Asian countries, soybean seeds are used to produce a number of food products, including soymilk, tofu, soybean paste, natto, and soy sauce. In the West, soybean is typically used for soybean meal and seed oil. Soybean seeds generally consist of 40% protein and 20% oil [
1], and these traits are negatively correlated with each other [
2,
3]. For this reason, it is very difficult to improve both traits simultaneously. In addition, because there is a negative correlation between seed yield and protein content [
4], high-protein varieties need to be developed with care. In addition to being easily influenced by environmental factors, the protein and oil content of soybean seeds is regulated by the polygenes and quantitatively inherited [
5,
6]. These polygenes can be divided into major genes, which are less influenced by the environment and that have a significant influence on these levels; and minor genes, which have a weaker influence.
Wild soybean (
G. soja Sieb. and Zucc.), the ancestor of cultivated soybean, has high genetic diversity and is thus valuable as a breeding material for soybean breeding programs [
7,
8]. Various studies have used wild soybean to improve biological stress resistance, abiotic stress tolerance, nutrition, and yields [
9]. The average protein content of wild soybean is reported to be higher than that of cultivated soybean, although this may be due to correlations with the yield or oil content [
9]. In the study by Chen and Nelson (2004), the protein content of wild and cultivated soybean lines was about 47% and 40%, respectively, while the oil content was 15% and 11%, respectively [
10].
After the publication of the soybean genome for the first time by Schmutz et al. (2010) [
11], Ha et al. (2012) [
12] advanced genomic research further with the integration of physical maps for
G. max and
G. soja. QTL mapping uses F
2, backcross (BC), and recombinant inbred line (RIL) populations derived from bi-parental crosses. In many soybean populations, the QTLs for proteins and oils have been mapped to genomic regions on chromosomes 15 and 20 [
13,
14,
15,
16,
17]. Major QTLs for protein and oil content were identified by Diers et al. (1992) using RFLP markers for the F2 population through the crossbreeding of
G. max (A81-356022) and
G. soja (PI 468916) [
18]. The QTL located on chromosome 15 has been fine-mapped at an interval of 535 kb between simple sequence repeat (SSR) markers (Kim et al. 2016) [
19], while the candidate gene for the QTL located on chromosome 20 has been cloned as
Glyma.20G85100 encoding the CCT domain [
20].
To date, 255 and 322 protein and oil content-related QTLs, respectively, have been identified using bi-parental populations (
https://www.soybase.org/, accessed on 12 February 2023). However, these QTLs may include multiple duplicate detections, so the Soybean Genetics Committee has emphasized the importance of experimentally confirming QTLs and adding “
cq” in front of the original QTL name to indicate that it has been confirmed [
20,
21]. In total, 16 QTLs each have been confirmed for protein and oil content (
https://www.soybase.org/), and these are distributed across 11 chromosomes, including chromosomes 15 and 20 [
18,
19,
20]. Of these, the only QTLs derived from wild soybean are
cqpro-003 and
cqoil-004 [
22]. Therefore, the purpose of this study was to discover new genes for protein and oil content from wild soybean using two progeny types derived from high-protein wild soybean lines.
3. Discussion
QTL analysis of the protein and oil content in soybean has been well-studied in previous research. The present study searched for QTLs related to high protein content using two F
2 and BC
1F
2 populations derived from a cross between cultivated soybean variety Daepung and wild soybean variety GWS-1887. The protein content of cultivated soybean is known to be about 40% [
1], whereas wild soybean GWS-1887 has protein levels close to 50% [
9,
10,
23]. This suggests that wild soybean GWAS-1887 may be useful for QTL analysis in terms of mapping the genetic regions associated with high protein content in soybean. However, the crossbreeding between
G. max and
G. soja may lead to linkage drag and consequent negative introgression such as reduced yields [
9]. Daepung exhibited a higher annual variation in its protein content (37.10% in 2019 and 40.05% in 2020) than did GWS-1887 (50.37% in 2019 and 49.28% in 2020) (
Table 1 and
Table 2). Several studies have reported that a lack of soil moisture reduces the protein content of soybean [
24,
25,
26]. The rainfall in the experimental area in August 2019 during the soybean development period was lower than normal, while the rainfall in August 2020 was above average (
http://www.kma.go.kr/, accessed on 12 February 2023). Therefore, it appears that the protein content of wild soybean has a higher environmental stability than cultivated soybean.
Previous QTL analysis of the protein and oil content in soybean seeds has been conducted with various populations, with the identified QTLs distributed across 20 chromosomes (
https://www.soybase.org/). Of these, major QTLs for protein and oil content are present on chromosomes 15 and 20 [
18], and several researchers have attempted to narrow down their precise location [
13,
19,
20,
21,
22,
27]. In the present study, the QTLs related to protein and oil content in the F
2:3 population were identified as Gm20_29512680 and Gm15_3621773, respectively, whereas in the BC
1F
2:3 population, marker Gm20_27578013 was identified for both protein and oil. Kim et al. (2016) reported the fine-mapping of a backcross line of Williams 82 and PI 407788A with 96 BARCSOYSSR markers and found that the QTL related to the protein and oil content on chromosome 15 was present in a 535 kb region from the physical position 3.59 Mbp to 4.12 Mbp [
19]. These results are consistent with the SNP marker Gm15_2621773 at the physical position 3.63 Mbp detected for oil content in the F
2:3 population in the present study. Recently, cqSeed protein-003 located on chromosome 20 was narrowed down through fine-mapping to a 77.8 kb region between genetic marker BARCSOYSSR_20_0670 and BARCSOYSSR_20_0674 (31.74 to 31.82 Mbp), and the
Glyma.20G85100 gene encoding the CCT domain was identified as a candidate gene involved in protein content [
20].
In our results, protein-related QTLs were mapped to Gm20_29512680 at 30.61 Mbp and Gm20_27578013 at 28.69 Mbp on chromosome 20 in the F
2:3 and BC
1F
2:3 populations, respectively. These results are consistent with the 24.55–32.91 Mbp range reported by Bolon et al. (2010) [
15] and the 28.7–31.1 Mbp range reported by Hwang et al. (2014) [
13], but not with the 32.71–33.08 Mbp range identified by Vaughn et al. (2014) [
14] and the 31.74–31.82 Mbp range found by Fliege et al. (2022) [
20]. The reason for these inconsistencies could the low LD with the surrounding markers [
20], and it is known that the wild soybean variety used as a parent in this study has a lower LD than cultivated soybean [
28]. For more accurate confirmation of the location, crossovers around the QTL detected in BC
1F
2:3 were identified but could not be found, and it was concluded that a crossover occurred at markers Gm20_33049242 and Gm20_32603292 in the two BC
1F
3:4 lines, which was advanced one generation by selecting high-protein lines. Therefore, it was predicted that the protein-related gene is present in the region downstream of Gm20_32603292.
Based on the results of QTL mapping, InDels were then searched for in the candidate genes located at around 30 Mbp between the Williams 82 reference genome and GWS-1887. Interestingly, no mutation was detected in the Glyma.20G85100 gene of the CCT motif family protein, which was recently cloned as a major protein-related gene [
20]. These results suggest that other major protein-related genes may be present in a similar genetic region. Wang et al. (2021) selected protein-related candidate gene
Glyma.20g088000 using DEG analysis via RNA-seq, and it was found that
Glyma.20g088000 had a significant difference in its sequence between the high-protein Nanxiadou 25 and low-protein Tongdou 11 varieties due to InDels [
16]. Interestingly,
Glyma.20g088000 (S-adenosyl-
l-methionine-dependent methyltransferase) was also selected as a candidate gene in the present study because small and large InDels occurred within several regions of the gene. In addition,
Glyma.20g086900 (aldehyde dehydrogenase-related) and
Glyma.20g088400 (oxidoreductase, 2-oxoglutarate-Fe(II) oxygenase family protein) genes were selected by Lee et al. (2019) as a result of a GWAS for the soybean seed protein content from maturation groups I to IV [
17]. These two genes were also identified as candidate genes in this study.
In the present study, it was confirmed that
Glyma.20g088000 and
Glyma.20g088400 had a large InDel in the 5′ first exon and a small InDel in the 3′ third exon, respectively (
Figure 5). Nonsense mutations that create stop codons and frameshifts in which amino acids are rearranged can disrupt the function of a gene [
29]. In one example, truncated polypeptides generated as a result of nonsense mutations resulted in the loss of anthocyanin pigments associated with the color of soybean flowers [
30]. In particular, the stop codon in
Glyma.20g088000 is expected to greatly simplify the structure of the protein (
Figure 6), thus it is likely to have a significant effect on the expression of its function. Although these candidate genes have potential functions in metabolism, the mechanisms of how they relate to seed composition require further study. In addition, the results collectively suggest that protein content may be regulated by the complex interaction of multiple genes located at around 30 Mbp on chromosome 20.
4. Materials and Methods
4.1. Plant Materials
In the present study, 180 F
3 and 90 BC
1F
4 populations derived from a cross between Daepung and GWS-1887 were analyzed. Daepung, which was used as the female, recurrent parent, is an elite Korean variety that is strongly resistant to disease and shattering and has high yields [
31], while GWS-1887, which was used as the male parent, was selected from the core collection of wild soybean accessions from the Rural Development Administration (RDA) because it has a protein content of 50% or higher. In the summer of 2018, F
1 seeds were obtained from artificial crossbreeding in an experimental field at Chonnam University (Gwangju, 36°17′ N, 126°39′ E, Republic of Korea). The F
1 seeds were planted in a greenhouse during the 2018–2019 winter season to obtain F
2 seeds, with the generation then advanced from F
2 to F
3 in the summer of 2019. At the same time, F
1 seeds were backcrossed in the summer of 2019 to obtain BC
1F
1 seeds. The produced BC
1F
1 seeds were planted in a greenhouse during the 2019–2020 winter period to obtain BC
1F
2 seeds. Finally, in the summer of 2020, the BC
1F
2 generation was advanced and BC
1F
3 seeds were obtained.
4.2. Analysis of Protein and Oil Content
All harvested seed samples were dried at 40 °C for 7 d and then pulverized using a coffee grinder to produce 3 g each for subsequent analysis. The crude protein content was measured using the Kjeldahl method. Reagents required for digestion, distillation, and titration were prepared, including 0.1N hydrochloric acid, a decomposition accelerator (containing 10 g of potassium sulfate and 1 g of copper sulfate), 40% sodium hydroxide solution, and 1% boric acid solution with 100 mL and 70 mL of Bromocresol green and methyl red solutions, respectively. The sample solution was prepared by mixing 0.7–1.0 g of the ground seed and 7–8 g of the decomposition accelerator with 10 mL of sulfuric acid in a decomposition bottle. Digestion was carried out by heating the sample solution at a slow ramping rate until no visible bubbles remained and the solution became transparent. The solution was then analyzed using a Kjeltec 1030 Autoanalyzer (FOSS Tecator AB, Hogans, Sweden) following the manufacturer’s instructions.
The crude oil content was measured using ether extraction. For this, an oil metering bottle was pre-dried at around 95–100 °C for 2 h followed by cooling in a desiccator for 30 min. Following this, 2–3 g of the sample wrapped in No. 2 filter paper was dried at the same temperature and for the same duration of time as the oil metering bottle. After drying, the sample was placed in a Soxtec 1043 instrument (FOSS Tecator AB, Hogans, Sweden), and subjected to a flow of ether at 80 °C for 8 h to extract the oil. The processed ether was then collected in an oil metering bottle and subsequently dried (95–100 °C for 3 h) followed by cooling in a desiccator (40 min) and weighing. The oil content was determined by subtracting the weight of the empty oil metering bottle from the weight of the bottle containing the extract.
4.3. DNA Extraction and SNP Genotyping
Fresh leaf tissue was collected at the beginning of growth for DNA extraction and ground using liquid nitrogen in a mortar. Genomic DNA was isolated from 20 mg of lyophilized leaf tissue using a DNeasy Plant Mini Kit (QIAGEN, Valencia, CA, USA) according to the manufacturer’s protocol. The quality and quantity of the extracted total DNA were verified using a Nano-MD UV-Vis spectrophotometer (Scinco, Seoul, Republic of Korea). The extracted DNA was stored in a freezer at −80 °C until further use. A total of 270 samples, consisting of 180 F2 and 90 BC1F2 plants and two replications of each parental plant (Daepung and GWS-1887) were genotyped using a SoySNP6K Illumina BeadChip (Illumina, San Diego, CA, USA) at TNT Research Co. (Anyang, Republic of Korea). The SNP alleles were called using Illumina’s GenomeStudio (Illumina, Inc., San Diego, CA, USA).
4.4. Genetic Linkage Analysis
A genetic linkage map was constructed using the Kosambi mapping function in Joinmap v4.1 (Kyazma, Wageningen, The Netherlands). For genetic analysis, MQM mapping was employed with MapQTL 6.0 (Kyazma, Wageningen, The Netherlands). In the F2 population, permutations were conducted to determine the genome-wide significance threshold for the LOD scores, with the number of permutations set at 1000. In the BC1F2 population, an LOD score of ≥3.0 was set as the threshold for determining the presence of a QTL. LOD graphs and the location maps for the QTLs were created with MapChart2.2.
4.5. Re-Sequencing
Re-sequencing analysis was commissioned by Insilicogen (Yongin, Republic of Korea) and performed using an Illumine Novaseq 6000 platform. A library was constructed from DNA fragments with 151 bp paired ends read using a DNA Sample Prep Kit (Illumina) following the manufacturer’s instructions. An analysis pipeline for detecting mutations in the sequencing data for the entire genome was employed with an NF-Core/SAREK workflow [
32]. The snpEff tool was used for genetic variation annotation and effect prediction, while the snpEff database was referenced to Glycine max var. Williams 82 [
11]. The whole genome sequencing data of GWS-1887 were deposited in NCBI under the BioProject PRJNA915129.