1. Introduction
As the saying goes, “Among grains, buckwheat is king”.
Buckwheat (Fagopyrum Mill.) is an important minor grain crop with two main cultivated species: common buckwheat (
Fagopyrum esculentum Moench) and Tartary buckwheat (Fagopyrum tataricum (L.) Gaertn.) [
1].
Tartary buckwheat (F. tataricum) is one of the most important buckwheat species cultivated in China and is well suited to cool climates. Tartary buckwheat grains are notable for their nutritional and functional components, including protein, resistant starch, and flavonoids. Buckwheat protein is abundant and of high quality: it contains 19 amino acids, including the 8 essential amino acids, in balanced proportions that exceed the standards set by the Food and Agriculture Organization of the United Nations [
2].
The starch content of Tartary buckwheat is about 70%, and its proportion of resistant starch far exceeds that of rice and other grains. Resistant starch resists digestion and absorption in the human small intestine and provides significant physiological and health benefits, such as maintaining intestinal activity and controlling blood glucose levels [
3,
4]. The flavonoid content of Tartary buckwheat exceeds 2%; the primary compounds are rutin, quercetin, kaempferol, and hypericin, with rutin accounting for over 80% of the total flavonoids. These compounds help lower blood sugar, blood pressure, and blood lipids, and have anti-inflammatory and antibacterial properties [
5]. In recent years, the high nutritional value of Tartary buckwheat has driven rapidly growing demand in the functional food and pharmaceutical industries, among others, making it a promising crop of the 21st century with substantial market potential and development prospects [
1,
3,
6].
Starch is the main component of Tartary buckwheat grains, accounting for 43.80% to 84.67% of their total weight, and is composed of amylose (11.06–49.24%) and amylopectin (8.97–61.85%) [
7]. The ratio of amylose to amylopectin significantly influences food quality and plays a crucial role in yield composition, nutrition, health benefits, and processing quality [
3]. Resistant starch, which is classified into five types (RS1–RS5), refers to starch and its degradation products that cannot be absorbed in the human small intestine. It is a functional component of dietary fiber in foods that helps to control postprandial blood glucose levels, and its yield is proportional to amylose content [
8,
9]. Therefore, improving the starch content and composition of Tartary buckwheat grains, particularly by breeding varieties with high amylose content, a high amylopectin ratio, and high resistant starch content, is essential for developing high-yield and high-quality buckwheat. Currently, total starch, amylose, and amylopectin contents are primarily determined by ultraviolet spectrophotometry [
10,
11,
12,
13], and resistant starch content is mainly determined by the chromogenic enzymolysis method [
14]. Some researchers have developed an asymmetrical flow field-flow fractionation technique to determine the resistant starch content in potatoes (
Solanum tuberosum L.) [
15]. These methods are laborious, time-consuming, and costly, posing a bottleneck in the genetic research and breeding of high-yield and high-quality Tartary buckwheat.
Near-infrared spectroscopy (NIRS) detection technology offers rapid, accurate, cost-effective, and non-destructive analysis, saving time and cost in sample processing [
15]. Once a calibration model has been built, component contents can be determined rapidly through relatively simple operations and data processing. NIRS has been widely used in food quality control [
16], food adulteration [
17], the real-time batch process monitoring of medicines [
18], and other fields [
15]. NIRS has also been applied to buckwheat [
19,
20,
21,
22,
23]. For example, Sato et al. used NIR reflectance spectroscopy to analyze moisture, fat, protein, and physiological activity in buckwheat flour and found that NIR could estimate these properties well enough for simple and rapid breeding selection [
20]. Shruti et al. developed NIRS prediction models for oil, protein, amino acids, and fatty acids in amaranth (
Amaranthus tricolor L.) and buckwheat, enabling the identification of trait-specific germplasm as potential gene sources and aiding crop improvement programs [
21]. Chai et al. found that NIRS combined with chemometrics showed excellent predictive performance and applicability for detecting the adulteration of Tartary buckwheat flour with common buckwheat flour [
22]. NIR reflectance spectroscopy has also been applied to determine the contents of rutin and D-chiro-inositol in Tartary buckwheat [
19].
Previous research highlights two primary near-infrared starch-absorption ranges: 1063–1639 nm and 1834–2175 nm [
23], equivalent to wavenumbers of 6101–9407 cm⁻¹ and 4598–5453 cm⁻¹. Scientists have developed NIR detection models to determine the total starch, amylose, and amylopectin contents of buckwheat, sorghum (
Sorghum bicolor (L.) Moench), and rice (
Oryza sativa L.) [
24,
25,
26]. NIR models have also been developed and used to predict resistant starch content and material screening in barley (
Hordeum vulgare L.) [
27], potato [
28], rice [
29], and sweet potato (
Ipomoea batatas (L.) Lam.) [
30]. However, many studies collect spectra only after the grain has been ground [
20,
22], which increases the modeling workload and makes the approach unsuitable for scarce materials produced during breeding. Additionally, the use of NIR spectroscopy to determine resistant starch content in Tartary buckwheat has not yet been reported.
Building near-infrared models requires modeling samples whose chemical values are diverse and continuously distributed. When Sato et al. constructed a predictive model for the moisture content of buckwheat flour, they used 96 buckwheat varieties from 25 countries to obtain samples with large differences in chemical values [
20]. In another study, pure turmeric was mixed with starch at levels from 1% to 50% in 1% increments to improve modeling, and the resulting near-infrared model could accurately identify adulteration in turmeric (
Curcuma longa L.) [
17]. Previous studies by our research group found that the starch content of the recombinant inbred line (RIL) population is approximately normally distributed, with a coefficient of variation between 6% and 18% [
10], which makes it suitable for constructing near-infrared models. In this study, data were collected from the RIL population of Tartary buckwheat to obtain chemical values with a large coefficient of variation, and NIRS-based rapid detection models for total starch, amylose, amylopectin, and resistant starch were established. These models are valuable for the quality evaluation of Tartary buckwheat and the development of functional foods.
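The coefficient of variation used above to judge sample diversity is the sample standard deviation divided by the mean. The following minimal Python sketch illustrates the calculation; the function name and the six total starch values are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample coefficient of variation (SD / mean) in percent."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical total starch contents (mg/g), for illustration only.
total_starch = [560.0, 610.0, 640.0, 655.0, 700.0, 735.0]
cv = coefficient_of_variation(total_starch)
print(f"CV = {cv:.1f}%")
```

For these illustrative values the coefficient of variation lies within the 6% to 18% interval mentioned above.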
3. Discussion
In plant breeding research such as QTL mapping, populations are generally large, and determining chemical values is particularly time-consuming and laborious. Researchers have built models on RIL populations to predict chemical values for QTL mapping, with good results [
31,
32,
33]. In this study, near-infrared models were successfully constructed to predict the amylose, amylopectin, total starch, and resistant starch contents of Tartary buckwheat. The correlation coefficients of calibration (Rc) and prediction (Rp) of the best models were both above 0.93, indicating that models built on this RIL population can support the high-quality production, breeding, and QTL mapping of Tartary buckwheat. However, modeling with RIL populations also has disadvantages. Zhang et al. compared models built on RIL populations with models built on natural populations and found that the natural-population model predicted exotic samples better. They hypothesized that this could be related to the narrower range of chemical values in RIL populations [
32]. Model construction is influenced by various factors, of which sample representativeness and diversity are among the most critical. A broader range of chemical values in the samples widens the applicability of the resulting model. Previous analyses of starch in Tartary buckwheat germplasm found significant variation in starch content among varieties: total starch ranged from 604.9 mg/g to 779.8 mg/g, amylose from 115.9 mg/g to 283.0 mg/g, amylopectin from 377.1 mg/g to 591.5 mg/g, and resistant starch from 47.4 mg/g to 225.3 mg/g [
1,
9,
34,
35,
36,
37]. By comparison, in the RIL population measured in this study, total starch ranged from 532.1 mg/g to 741.5 mg/g, amylose from 176.8 mg/g to 280.2 mg/g, amylopectin from 318.8 mg/g to 497.0 mg/g, and resistant starch from 45.1 mg/g to 105.2 mg/g. Predictions should therefore be more accurate for samples whose chemical values lie within the range of the RILs. To extend the scope of the model, samples from outside the RIL population need to be added to the modeling samples.
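The range comparison above can be expressed as a simple check of whether a reference value falls inside the interval covered by the modeling samples. The sketch below uses the RIL ranges and the upper germplasm limits quoted in this section; the dictionary and function names are hypothetical, and such a check is only a rough guide to where a model is likely to extrapolate.

```python
# Chemical value ranges (mg/g) of the RIL population, as quoted in the text.
CALIBRATION_RANGE = {
    "total_starch": (532.1, 741.5),
    "amylose": (176.8, 280.2),
    "amylopectin": (318.8, 497.0),
    "resistant_starch": (45.1, 105.2),
}


def within_calibration_range(trait, value):
    """Return True if value (mg/g) lies inside the modeled range for trait."""
    low, high = CALIBRATION_RANGE[trait]
    return low <= value <= high


# Upper limits (mg/g) reported for Tartary buckwheat germplasm in earlier studies.
germplasm_max = {
    "total_starch": 779.8,
    "amylose": 283.0,
    "amylopectin": 591.5,
    "resistant_starch": 225.3,
}
flags = {t: within_calibration_range(t, v) for t, v in germplasm_max.items()}
print(flags)
```

All four germplasm maxima lie above the corresponding RIL range, so predicting such samples would require the model to extrapolate.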
The predictions of a near-infrared model depend closely on the chemical values of the samples used to build it. Resistant starch content in this study (45.1 mg/g to 105.2 mg/g) fell outside the range of prior measurements [
1], likely because detection methods differ in their cooking and other processing steps. Fu et al. found that the resistant starch content of ‘Sichuan Buckwheat No. 1’ Tartary buckwheat measured directly by the Goñi method was 4.74%. After the determination method was improved, the measured resistant starch content reached 13.38% [
34], which is relatively close to the resistant starch content we measured. Zheng et al. used the Englyst method to determine the resistant starch content of black Tartary buckwheat. Using different enzyme dosages and ratios, they obtained resistant starch contents between 20% and 33% [
1], which is significantly higher than the values we measured. Therefore, the resistant starch content predicted by the model established here reflects the chemical determination method used in this study.
In the modeling process, the way the sample set is partitioned influences model performance. Common partitioning methods include random partitioning, the Kennard–Stone (KS) algorithm, and the SPXY (sample set partitioning based on joint x–y distance) algorithm. The KS algorithm selects samples according to the distances between them, which yields well-balanced subsets, prevents bias towards specific sample types, and improves model generalization across subsets [
38]. SPXY, an extension of the KS algorithm, considers both the sample concentration (y) and the spectral distance (x) when selecting samples. Wang et al. applied the SPXY and KS algorithms to partition sample sets and model soybean meal nutrients, finding KS to be superior for water and protein modeling [
39]. In this study, the KS algorithm effectively partitioned samples for modeling, reaffirming its utility in model construction.
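The KS selection rule can be sketched in a few lines of Python. This is a minimal illustration of the general algorithm, not the implementation used in this study: it starts from the two most distant samples and then repeatedly adds the sample whose nearest already-selected neighbor is farthest away. The function name, the toy two-point "spectra", and the 4:1 split are assumptions made for illustration.

```python
import math


def kennard_stone(spectra, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    n = len(spectra)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all spectra.
    dist = [[math.dist(a, b) for b in spectra] for a in spectra]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected sample is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy spectra with two variables each (hypothetical values).
toy = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [5.0, 0.0]]
calibration = kennard_stone(toy, 4)
validation = [i for i in range(len(toy)) if i not in calibration]
print(calibration, validation)
```

Because the rule fills the spectral space from its extremes inward, the calibration set spans the full range of the toy data, and the sample left for validation lies inside that range.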
Prior to model creation, spectral preprocessing is essential to reduce errors caused by noise and other factors during spectrum scanning. Each preprocessing method has different effects, so various methods need to be explored and compared [
38]. Normalization primarily removes the influence of irrelevant factors, such as instrument sensitivity, sample size, and optical path length, to highlight the signal [
40]. The standard normal variate (SNV) transformation reduces the effects of particle size, surface scattering, and light path variations on diffuse reflected light [
41]. Multiplicative scatter correction (MSC) addresses scattering effects resulting from non-uniform particle distribution and size [
42,
43]. Savitzky–Golay (SG) filtering aids in signal denoising, data smoothing, and feature extraction [
44]. Derivatives effectively eliminate baseline and background noise, enhancing resolution and sensitivity [
45]. In this study, models built with “normalization + derivative” spectral preprocessing gave the best predictions. Normalization scales the data to a specific range, which reduces dimensional differences among features and improves the efficiency and stability of model training. While this processing may induce baseline drift [
45], derivative processing effectively mitigates baseline drift and band superposition, significantly improving model predictions. SNV mainly corrects for differences in grain size, which makes it effective for whole-grain spectrum modeling.
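The preprocessing steps discussed above can be illustrated with a short pure-Python sketch. Several assumptions apply: normalization is taken here to be min–max scaling to [0, 1], the derivative is a plain first difference rather than a Savitzky–Golay derivative, and the six-point spectrum is hypothetical. The example shows that a constant baseline offset disappears after differentiation and that SNV centers and scales each spectrum.

```python
from statistics import mean, stdev


def min_max_normalize(spectrum):
    """Scale one spectrum to the range [0, 1] (assumes it is not constant)."""
    lo, hi = min(spectrum), max(spectrum)
    return [(a - lo) / (hi - lo) for a in spectrum]


def snv(spectrum):
    """Standard normal variate: center and scale each spectrum individually."""
    m, s = mean(spectrum), stdev(spectrum)
    return [(a - m) / s for a in spectrum]


def first_derivative(spectrum):
    """First difference between adjacent points (a simple derivative estimate)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


# Hypothetical absorbance values and the same spectrum with a baseline offset.
raw = [0.52, 0.55, 0.61, 0.70, 0.66, 0.58]
shifted = [a + 0.10 for a in raw]

# "Normalization + derivative" pipeline under the assumptions stated above.
processed = first_derivative(min_max_normalize(raw))

# The derivative alone already removes the constant offset.
offset_effect = max(
    abs(a - b) for a, b in zip(first_derivative(raw), first_derivative(shifted))
)
print(processed, offset_effect)
```

In practice, smoothing and differentiation are usually combined in a single Savitzky–Golay step, which is less sensitive to noise than a plain difference.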
Because NIR spectra have low absorption intensity, broad bands, overlapping peaks, high information redundancy, and strong collinearity across the whole spectrum [
46], variable selection methods are essential to extract useful wavelength variables. The competitive adaptive reweighted sampling (CARS) algorithm identifies key variables for modeling from thousands of wavenumbers, improving model predictability and reducing complexity [
47]. For example, in soil hyperspectral modeling, CARS compressed the characteristic bands for each soil type to less than 16% of the total wavenumbers, substantially reducing variable dimensionality and computational complexity and thereby improving the predictive ability of the calibration model [
48]. Previous research highlights two primary near-infrared starch-absorption ranges: 1063–1639 nm and 1834–2175 nm [
23], equivalent to wavenumbers of 6101–9407 cm⁻¹ and 4598–5453 cm⁻¹. Zhang et al. observed significant differences in absorbance among sorghum varieties at 932 and 978 nm while assessing sorghum amylose and amylopectin using near-infrared techniques [
25]. In our study, the wavenumber points selected for the optimal models were concentrated in the 4000–7000 cm⁻¹ range, with additional points in the 10,000–12,000 cm⁻¹ range, consistent with prior findings and underscoring the suitability of CARS for improving near-infrared modeling and spectrum utilization.
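The wavelength and wavenumber values quoted in this paragraph are related by wavenumber (cm⁻¹) = 10⁷ / wavelength (nm). The following short Python sketch reproduces the conversion for the starch absorption ranges and for the two sorghum wavelengths; the function and variable names are illustrative.

```python
def nm_to_wavenumber(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1e7 / wavelength_nm


# Starch absorption ranges (nm) quoted in the text.
starch_bands_nm = [(1063, 1639), (1834, 2175)]
# A longer wavelength gives a smaller wavenumber, so the limits swap places.
starch_bands_cm = [
    (round(nm_to_wavenumber(hi)), round(nm_to_wavenumber(lo)))
    for lo, hi in starch_bands_nm
]

# Wavelengths (nm) at which sorghum varieties differed in absorbance.
sorghum_cm = [round(nm_to_wavenumber(nm)) for nm in (932, 978)]
print(starch_bands_cm, sorghum_cm)
```

The converted starch ranges match the wavenumber intervals given above, and the 932 and 978 nm wavelengths fall at roughly 10,730 and 10,225 cm⁻¹, inside the 10,000–12,000 cm⁻¹ interval.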
Recent studies have successfully combined near-infrared spectroscopy with chemometric methods to develop models for the rapid determination of plant starch content [
49]. However, studies indicate that models built on the spectra of ground samples often outperform those built on whole-grain spectra [
20,
22]. While grinding samples has little impact when grain is assessed as a raw food material [
50], it is less suitable for NIR applications in breeding, particularly for predicting component contents in valuable seed materials. Recent research has made spectral processing considerably more effective, improving the efficiency of spectrum utilization. Some researchers have begun to model whole-grain scan spectra directly, extracting more spectral information from samples while simplifying the modeling workflow [
51]. To meet the need for the rapid and non-destructive determination of amylose and amylopectin in sorghum breeding, Peiris et al. developed a near-infrared model to detect amylose and amylopectin directly in grains, achieving a correlation coefficient of about 0.8 [
52]. In this study, spectral processing methods were further optimized and characteristic-band modeling was refined, yielding near-infrared models that predict total starch, amylose, amylopectin, and resistant starch contents in Tartary buckwheat grains. The test sets of the best models had correlation coefficients above 0.93, indicating improved spectrum utilization efficiency and the feasibility of predicting nutrients in whole grains. Promisingly, near-infrared hyperspectral imaging has enabled models to detect starch content in individual corn kernels [
53,
54]. In this study, the CARS algorithm extracted useful variables directly from whole-grain Tartary buckwheat spectra. This approach simplifies the workflow, eliminates the need to hull or grind Tartary buckwheat grains during modeling, and makes the models more practical for breeding, improving breeding efficiency and reducing costs. Future work will explore near-infrared or hyperspectral imaging of individual Tartary buckwheat grains to develop rapid, non-destructive predictive models for various nutrient components, facilitating early-stage progeny screening in breeding.
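The correlation coefficients used in this section to compare models can be computed from paired measured and predicted values. The sketch below shows the Pearson correlation coefficient together with the root mean square error, a common companion statistic that is not reported in this section. The function names and the five amylose values are hypothetical and are not results from this study.

```python
import math


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(measured)
    mx, my = sum(measured) / n, sum(predicted) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy / math.sqrt(sxx * syy)


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(measured, predicted)) / n)


# Hypothetical amylose contents (mg/g): chemical values vs. model predictions.
measured = [180.0, 200.0, 220.0, 240.0, 260.0]
predicted = [185.0, 196.0, 224.0, 236.0, 262.0]
rp = pearson_r(measured, predicted)
rmsep = rmse(measured, predicted)
print(f"Rp = {rp:.3f}, RMSEP = {rmsep:.2f} mg/g")
```

A high correlation coefficient alone does not guarantee small prediction errors, which is why the two statistics are usually read together.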