1. Introduction
Rice is a tropical cereal belonging to a genus of annual and perennial herbaceous plants of the grass family. Many nations treat it as a second bread. Given how long it has been cultivated and its valuable qualities, it is rightfully considered the most popular cereal in the world.
Due to its high agricultural importance, the problems and risks associated with rice cultivation are studied by scientists in a wide range of fields, from agronomy to molecular biology and biotechnology. In agronomy, different approaches to rice cultivation are being explored, and there is active research into the genetic differences between domesticated rice species and their wild relatives in order to identify loci that can be used to create a new generation of resistant rice varieties. Molecular biologists and biotechnologists are exploring genome editing of rice to improve its nutritional properties and increase its environmental resilience [1].
The rice genome is quite well studied, sequenced, and annotated. In 2005, the International Rice Genome Sequencing Project (IRGSP) provided, for the first time, a reference rice genome sequence covering 95% of the O. sativa Nipponbare genome [2]. Later, the Rice Annotation Project Database (RAP-DB; available at https://rapdb.dna.affrc.go.jp/, accessed on 20 May 2022) [3] and a database providing structural and functional annotation of the rice genome based on pseudomolecules [4] were created. In addition, other informative resources and tools for working with the rice genome have been actively developed, with the exception of programs and algorithms for predicting regulatory elements of the rice genome, in particular promoter sequences, and databases for storing information about them.
A promoter is a DNA sequence in the vicinity of the transcription start site (TSS) that is necessary for the formation of the pre-initiation complex (PIC). Promoters are usually 100 to 1000 bp long, although the length can vary significantly from gene to gene. Eukaryotic promoters are usually divided into three regions: core, proximal, and distal [5]. The core promoter is the smallest promoter region capable of initiating transcription and is required for assembly of the PIC; it typically includes the TSS and spans approximately −60 to +40 bp relative to it [6]. Many core promoters also contain the TATA box, a conserved DNA sequence (5′-TATAAA-3′) located 25–45 bp upstream of the TSS, where general transcription factors can bind. The TATA box is an important functional signal in eukaryotic promoters and, in some cases, can independently direct the precise initiation of transcription by RNA polymerase II even in the absence of other transcriptional elements [7]. Many promoters of highly expressed genes contain a TATA box in their core region; however, there are also large groups of highly expressed genes whose promoters lack the TATA box, for example, housekeeping and photosynthesis genes. This means that the presence of the TATA box in the core promoter region is not a mandatory criterion for promoter function [8]. The proximal region of the promoter is located ~250–1000 bp upstream of the core promoter. The proximal promoter usually contains several transcription factor binding sites that are responsible for the specific regulation of transcription. The distal promoter regions can also contain transcription factor binding sites, but these regions mainly contain regulatory elements: enhancers, silencers, and insulators. The distal promoter, together with regulatory elements, is often required for the accurate reproduction of expression patterns, and distal cis-regulatory elements can be located in introns, which can make the computational study of these regions difficult.
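The positional constraint on the TATA box described above can be illustrated with a short sketch. It assumes an exact match to the consensus 5′-TATAAA-3′ (real TATA boxes are degenerate) and a plus-strand sequence that ends immediately before the TSS; the function name and the example sequence are hypothetical.

```python
import re

# Consensus taken from the text; exact matching is a simplification.
TATA_CONSENSUS = "TATAAA"

def find_tata_boxes(upstream_seq, min_dist=25, max_dist=45):
    """Return the distances (bp upstream of the TSS) at which the consensus starts.

    upstream_seq is the sequence immediately 5' of the TSS, written 5'->3',
    so its last base is position -1 relative to the TSS.
    """
    seq = upstream_seq.upper()
    hits = []
    # A lookahead pattern reports overlapping occurrences as well.
    for match in re.finditer(f"(?={TATA_CONSENSUS})", seq):
        distance = len(seq) - match.start()
        if min_dist <= distance <= max_dist:
            hits.append(distance)
    return hits

# Hypothetical 60-bp upstream region with a TATA box starting 30 bp before the TSS.
example = "G" * 30 + "TATAAA" + "C" * 24
print(find_tata_boxes(example))  # [30]
```

A position weight matrix would be a more realistic model of the motif; exact matching is used here only to keep the positional logic visible.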
Proper identification of promoter sequences plays an important role in understanding the dynamics, patterns, and regulation of gene expression. In addition, when editing a genome or combining DNA fragments into new synthetic sequences, it is critical to predict whether new promoter sequences may form. In recent years, a great number of bioinformatics tools have been created to predict promoter sequences in the genome or in a DNA region of interest. However, the exact identification of promoter sequences remains a major challenge in computational biology, as even the best algorithms can generate false results that are indistinguishable from true ones.
Various sequencing-based methods have been developed to detect promoters in the genome and to characterize and quantify their activity, such as CAGE (cap analysis of gene expression), which measures the expression of eukaryotic capped RNAs and simultaneously maps promoter regions [9]; PRSeq (promoter RNA sequencing), a large-scale quantitative method for analyzing promoter specificity and strength, based on the creation of a template DNA pool in which each template carries information about its own promoter in its transcribed region [10]; PEAT (paired-end analysis of transcription start sites), based on mapping transcription initiation patterns using paired-end sequencing [11]; and RAMPAGE (RNA annotation and mapping of promoters for the analysis of gene expression), which uses highly specific sequencing of 5′-complete complementary DNA to identify transcription start sites (TSSs) throughout a genome [12].
In silico promoter prediction mainly relies on three strategies for identifying regulatory elements: scoring functions, machine learning-based algorithms, and deep learning-based approaches.
In 2019, Meng Zhang and colleagues compared and evaluated computational promoter prediction tools developed from 2000 to 2019 [13]. After analyzing 58 datasets and 19 predictors, they concluded that scoring function-based approaches are the easiest to implement but usually perform worse than other approaches. Deep learning approaches tend to be more time-consuming and computationally intensive but achieve very good predictive performance. Traditional machine learning methods are more balanced in terms of computational load and algorithm complexity. The researchers also concluded that scoring tools tend to predict more false positives or have low sensitivity when applied to predefined promoter sequences. However, the prediction results from these methods can still be useful for some meaningful biological inferences.
In this work, we studied sequences from the first chromosome of the O. sativa rice genome taken from the Database of Potential Promoter Sequences (available online at http://victoria.biengi.ac.ru/cgi-bin/dbPPS/index.cgi, accessed on 20 August 2022), which were predicted to be promoter elements using a mathematical sequence prediction method based on multiple alignments [14]. The multiple alignments were created using MAHDS [15], a mathematical method for calculating multiple alignments of highly divergent sequences. Computational promoter searches face the problem of false-positive results. The number of false positives is usually high and often devalues all the results obtained, because it is difficult to distinguish noise from a real signal. With the MAHDS method, the number of false positives for the randomly shuffled rice genome is less than 10⁻⁸ per nucleotide [14]. In general, the number of false positives in promoter sequence searches is at least 10⁻⁴ per nucleotide. By this measure, other computationally derived databases are at least four orders of magnitude worse than the database used here. Thus, the potential promoters included in the database we used are practically unaffected by random processes. In this sense, the results of this database are the most promising for experimental verification.
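To make the comparison of false-positive rates concrete, the following sketch converts a per-nucleotide rate into an expected count over a scanned sequence. The chromosome length of 43 Mb is an assumed round figure for rice chromosome 1, not a value taken from this work, and the function name is illustrative.

```python
def expected_false_positives(rate_per_nt: float, length_nt: int) -> float:
    """Expected number of false-positive hits when scanning length_nt nucleotides."""
    return rate_per_nt * length_nt

# Assumption: rice chromosome 1 is taken to be roughly 43 Mb long.
CHR1_LENGTH_NT = 43_000_000

# The text gives < 10^-8 per nucleotide for MAHDS, so this is an upper bound.
mahds_upper_bound = expected_false_positives(1e-8, CHR1_LENGTH_NT)
# The text gives >= 10^-4 per nucleotide for typical searches, so this is a lower bound.
typical_lower_bound = expected_false_positives(1e-4, CHR1_LENGTH_NT)

print(f"MAHDS: fewer than {mahds_upper_bound:.2f} expected false positives")
print(f"Typical search: at least {typical_lower_bound:.0f} expected false positives")
```

Under this assumption, the stated rates correspond to fewer than about 0.43 expected false hits for the whole chromosome versus at least about 4300, which is the four-orders-of-magnitude difference discussed above.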
2. Results
Firstly, all selected promoters under study were analyzed for the absence of an intersection with annotated promoters and transcripts within 1000 bp downstream of the annotated transcripts.
Next, TATA-motif regions were identified in the selected promoters under study (Table S1). Cis-regulatory elements were also identified in the putative promoter regions (Table S1). The identified cis-regulatory elements are involved in abscisic acid responsiveness, low-temperature responsiveness, MeJA responsiveness, anaerobic induction, gibberellin responsiveness, phytochrome down-regulation of expression, light responsiveness, auxin responsiveness, and meristem expression.
2.1. Investigated Predicted Promoters vs. Predicted Promoters from Another Database
As a result of comparing the studied promoters from the Database of Potential Promoter Sequences with the promoters from the PlantRegMap database, it was found that only nine of the studied promoters—14, 15, 29, 51, 80, 90, 92, 105, and 116—intersect with the promoters from PlantRegMap. This may indicate the correctness of the prediction of the function of these sequences, since they were predicted to be promoters by two different independent algorithms.
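The comparison of two sets of predicted promoters reduces to an interval intersection test. The sketch below shows one minimal way to do this, assuming that all regions lie on the same chromosome and that coordinates are 1-based and inclusive; the `Region` type, the function names, and the coordinates are hypothetical and are not taken from either database.

```python
from typing import NamedTuple

class Region(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

def overlaps(a: Region, b: Region) -> bool:
    """True if two regions on the same chromosome share at least one base."""
    return a.start <= b.end and b.start <= a.end

def supported_by_reference(predicted, reference):
    """Names of predicted regions that intersect at least one reference region."""
    return [p.name for p in predicted if any(overlaps(p, r) for r in reference)]

# Hypothetical coordinates, for illustration only.
predicted = [Region("P1", 1_000, 1_600), Region("P2", 5_000, 5_600)]
reference = [Region("R1", 1_500, 2_000), Region("R2", 9_000, 9_500)]
print(supported_by_reference(predicted, reference))  # ['P1']
```

The nested scan is quadratic in the number of regions, which is acceptable for a few hundred predictions; for genome-wide sets, a sorted sweep or an interval tree would be a more suitable choice.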
2.2. Potentially New Transcription Start Sites behind the Investigated Predicted Promoters
While analyzing the CAGE-seq data, we found CAP sites mapped to regions of the rice genome free of known annotated genes, as shown in Figure 1 and highlighted in blue. This may indicate the existence of previously unannotated transcripts in the rice genome. The figure also shows that the transcription start sites of the annotated genes (mint color, “Annotated transcripts”) coincide with the peaks (red color) obtained by analyzing the CAGE-seq data. This supports the validity of our analysis of the cap-expression data and suggests the presence of unannotated transcription in those regions of the genome where there are noticeable peaks in the cap analysis but no annotated sequences.
Among such unannotated potential transcription start sites, some are located close to the investigated predicted promoters (Figure 2):
- Three potential TSSs located no further than 100 bp from the 3′-end of the predicted promoters (promoter numbers: 9, 24, 49);
- Nine potential TSSs overlapping with the 3′-end of the promoter (promoter numbers: 27, 41, 50, 59, 64, 84, 99, 101, 104);
- One potential TSS located no further than 300 bp from the 3′-end of the predicted promoter (promoter number: 77);
- One potential TSS with a distinct peak located no further than 1000 bp from the 3′-end of the predicted promoter (promoter number: 81) and lower peaks, the closest of which is no further than 250 bp from the 3′-end of the promoter.
Thus, these predicted promoters may indeed be promoter regions of the genome from which transcription starts, as there are potential unannotated transcription start sites behind them.
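The distance categories used above can be expressed as a simple binning rule. The following sketch assumes plus-strand coordinates and treats a TSS as overlapping when it falls inside the promoter interval, which is a simplification of the criterion applied to the CAGE-seq peaks; the function name and the example coordinates are hypothetical.

```python
def classify_tss(promoter_start: int, promoter_end: int, tss_pos: int) -> str:
    """Bin a candidate TSS by its distance downstream of the promoter's 3' end.

    Assumes the promoter is on the plus strand, so promoter_end is its 3' end.
    """
    if promoter_start <= tss_pos <= promoter_end:
        return "overlapping"
    distance = tss_pos - promoter_end
    if distance < 0:
        return "upstream"
    # Thresholds follow the categories listed in the text.
    for limit in (100, 300, 1000):
        if distance <= limit:
            return f"within {limit} bp"
    return "beyond 1000 bp"

# Hypothetical promoter at 1000-1600 and several candidate TSS positions.
for tss in (1590, 1680, 1850, 2500, 3000):
    print(tss, classify_tss(1000, 1600, tss))
```

For a promoter on the minus strand, the same rule would apply after mirroring the coordinates, so that the distance is measured from the promoter start toward lower positions.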
2.3. De Novo Assembled Non-Annotated Transcripts
We then searched for unannotated transcripts in the rice genome via de novo transcript assembly. First, we selected transcripts located up to 1000 bp in the 3′ (downstream) direction from the predicted promoter sequences. However, few transcripts were found in this range, so we extended the search range to 5000 bp. In principle, promoters can be separated from their genes by the insertion of mobile genetic elements. Therefore, we also considered it worthwhile to study the assembled transcripts located farther from the potential promoters.
As shown in Table 1, 10 of the 1825 assembled transcripts are within 5000 bp of the studied promoters. The length of the selected de novo transcripts ranges from 220 to 860 bp. All these transcripts have open reading frames (ORFs; fragments of DNA sequence between start and stop codons). Some transcripts have several ORFs; only the maximum ORF lengths are given in the table. Quantitative analysis of gene expression based on the normalization of these transcriptome samples shows that the transcripts are expressed, with TPM (transcripts per million) values ranging from 10 to ~1000 across transcripts.
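Two quantities reported for the assembled transcripts, the maximum ORF length and the TPM value, can be sketched as follows. This is an illustrative implementation under simple assumptions (ATG start codons only, the standard three stop codons, the stop codon counted in the ORF length, and TPM computed from read counts and transcript lengths); it is not the pipeline used in this work, and all names and inputs are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_orf(seq: str) -> int:
    """Length in nt of the longest ATG...stop ORF (stop included) in all six frames."""
    best = 0
    forward = seq.upper()
    for strand in (forward, reverse_complement(forward)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best

def tpm(read_counts, lengths_bp):
    """Transcripts per million: length-normalized counts rescaled to sum to 1e6."""
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]

# Hypothetical transcript with one 15-nt ORF (ATG AAA TTT GGG TAA).
print(longest_orf("CCATGAAATTTGGGTAACC"))  # 15
# Hypothetical read counts for two transcripts of equal length.
print(tpm([10, 20], [1000, 1000]))
```

Because TPM values are normalized to sum to one million within a sample, they describe relative abundance within that sample and are not directly comparable as absolute expression levels across samples.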
The results of the homology search are presented in Table 2. No homologous proteins were found in tomato. In Arabidopsis, two molecules of the photosystem were found for the transcript behind the 37th promoter: 5MDX and, with a smaller degree of homology, 7OUI. Also in Arabidopsis, the 6KKS protein was found with a small percentage of coverage and identity; it belongs to the MYB family of transcription factors and plays an important role during plant growth and development. Identity is the percentage of base pairs or amino acids that match between the sequence of the assembled transcript and the reference sample. Query cover is the percentage of the length of the query sequence included in the alignment. This protein is involved in two processes: primary carbon dioxide fixation and fragmentation of the pentose substrate during photorespiration.
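The two alignment statistics defined above can be computed from a single gapped pairwise alignment as in the following sketch. It assumes that identity is taken over the alignment length and that query cover is taken over the full query length; the function name and the example alignment are hypothetical, and search tools may aggregate several alignments per hit.

```python
def identity_and_query_cover(aligned_query: str, aligned_subject: str, query_length: int):
    """Return (identity %, query cover %) for one gapped pairwise alignment.

    aligned_query and aligned_subject are equal-length strings with '-' for gaps;
    query_length is the length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s and q != "-" for q, s in zip(aligned_query, aligned_subject))
    identity = 100.0 * matches / len(aligned_query)
    # Only non-gap query positions count toward coverage of the query.
    covered = sum(ch != "-" for ch in aligned_query)
    query_cover = 100.0 * covered / query_length
    return identity, query_cover

# Hypothetical alignment: 5 of 6 aligned positions match, covering half of a 12-residue query.
identity, cover = identity_and_query_cover("ACGTAC", "ACCTAC", query_length=12)
print(f"identity = {identity:.1f}%, query cover = {cover:.1f}%")
```

A hit can therefore show high identity with low query cover when only a short part of the query aligns well, which is why both values are reported together.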
2.4. Chromatin Accessibility in the Region of the Investigated Predicted Promoters
We also analyzed chromatin accessibility in the region of the predicted promoter sequences using ATAC-seq data. As a result, we selected 16 predicted promoters (Figure 3) that overlap with ATAC-seq peaks. This indicates that these predicted promoters are located in open regions of chromatin that are accessible for transcription factor binding.
As a result, we can distinguish four predicted promoters for which several features characteristic of promoter sequences were identified simultaneously by in silico analysis. Promoters 9, 41, 64, and 81 are located in an open region of chromatin, which physically allows transcription factors to bind to them, and they are also followed by previously unannotated transcription.
3. Discussion
To predict new promoter sequences in the rice genome, we used the previously described mathematical sequence prediction method based on multiple alignments (MAHDS). However, false-positive results remain an unsolved problem in sequence prediction throughout computational biology, including for the method we use, in which the number of false positives for a randomly shuffled rice genome is <10⁻⁸ per nucleotide. We therefore consider it necessary to conduct additional studies of the predicted sequences to confirm their functional role in the genome. In our opinion, the verification of predicted promoters should always conclude with experimental validation in vivo or in vitro.
We saw that prediction tools based on different algorithms (MAHDS and an algorithm based on searching for DNA-binding domains of transcription factors) produced predominantly divergent predictions of promoter regions in the rice genome. Only 9 of the 126 studied promoters predicted via the MAHDS method were also found in the PlantRegMap database, which is based on another prediction algorithm. Can we be sure that these nine sequences are really promoters, since each of them was predicted by two independent methods? Probably, yes. However, we did not detect unannotated TSSs or transcripts behind any of these nine predicted promoters, nor did we see accessible open chromatin in the region of these sequences.
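The lack of agreement between these lines of evidence can be checked directly from the promoter numbers reported in the Results. The sketch below uses only those numbers; the variable names are illustrative.

```python
# Promoter numbers reported in Section 2.1 (shared with PlantRegMap).
plantregmap_supported = {14, 15, 29, 51, 80, 90, 92, 105, 116}

# Promoter numbers reported in Section 2.2 (unannotated TSS at or near the 3' end).
tss_supported = {9, 24, 49, 27, 41, 50, 59, 64, 84, 99, 101, 104, 77, 81}

# Promoters highlighted in the Results as most promising.
most_promising = {9, 41, 64, 81}

print(len(tss_supported))                             # 14
print(sorted(plantregmap_supported & tss_supported))  # []
print(most_promising <= tss_supported)                # True
```

The empty intersection reflects the observation that none of the nine promoters shared with PlantRegMap has an unannotated TSS behind it, while all four of the most promising candidates do.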
It is well known that accessible chromatin does not physically prevent the binding of transcription factors, which is important to consider when studying the properties of putative promoter sites. In our study, we examined the degree of chromatin openness in the neighborhood of the predicted regulatory sequences. As a result, we obtained 16 predicted promoters located in open chromatin (Figure 3). At this time, we do not have data showing that the other predicted promoters lie in regions accessible for transcription factor binding. However, this does not exclude the other predicted sequences from the range of potential promoters. It might seem that predicted sequences located in closed chromatin cannot be functional promoters. However, there is evidence that certain transcription factors can bind to closed chromatin and recruit chromatin remodeling factors to open it, enabling Pol II binding and the initiation of transcription [17,18]. To better characterize the nucleosome-free regions of the rice genome, data from other sequencing methods can also be studied in the future, for example, FAIRE-seq (formaldehyde-assisted isolation of regulatory elements), DNase-seq (sequencing of regions sensitive to cleavage by DNase I), and MNase-seq (micrococcal nuclease digestion with deep sequencing).
The presence of a promoter in the genome implies the presence of transcription behind it. One way to find promoters or confirm the role of predicted promoters is to search for transcripts or transcription start sites. In this work, we used both strategies. The TSS search gave more results, showing the presence of transcript 5′ ends directly near the predicted promoters (Figure 2). The closest de novo assembled transcript is 82 bp away from its predicted promoter, while the others are more than 1000 bp away (Table 1). However, the absence of transcripts immediately adjacent to a promoter does not prove that the promoter drives no expression, nor does it show that the predicted sequences are not promoters. The genome of monocotyledons, particularly rice, is characterized by a large number of mobile elements that can be located between the promoter and the transcription start site (TSS), physically distancing the promoter [19].
The theoretically predicted sequences may also be other regulatory elements of the genome, such as enhancers. Enhancers are small regions of DNA that, when bound by transcription factors, can increase the level of transcription of a gene or group of genes. Enhancers do not need to be in close proximity to the genes they affect. Recently, it has been shown that enhancers can also perform promoter functions [20]. For example, when a 72 bp tandem repeat from the polyomavirus SV40 was inserted into a plasmid without a promoter, this element (an identified enhancer) initiated a low level of transcription, indicating that it can activate RNA Pol II through promoter-like activity [21]. The fact that enhancers can play a promoter role can significantly complicate the prediction and accurate identification of regulatory elements. Consequently, additional studies of the properties of predicted regulatory sequences are needed to determine or confirm their functional role in the genome. For example, the epigenetic marks characteristic of known, confirmed promoters and enhancers can be examined on the predicted sequences to help establish whether a given sequence acts as an enhancer or a promoter.
The lack of visible transcription behind the promoter region may also be due to weak transcription levels or the silencing of genes through methylation of their promoter region. Methylation of these regions prevents RNA polymerases and/or various transcription factors from binding to the promoter region of DNA due to steric hindrance, thereby reducing and even inhibiting gene transcription [22]. Methylation status is therefore an important aspect in understanding the lack of transcription behind potential promoters; it can be studied by analyzing bisulfite genome sequencing data.
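As an illustration of how methylation status could be summarized from bisulfite sequencing data, the sketch below computes a per-cytosine methylation level and its average over a region. It assumes that read counts supporting C (methylated) and T (unmethylated, converted) are already available for each cytosine; the function names and counts are hypothetical, and no such analysis was performed in this work.

```python
def methylation_level(c_reads: int, t_reads: int) -> float:
    """Fraction of reads supporting methylation at one cytosine.

    After bisulfite conversion, unmethylated cytosines are read as T,
    while methylated cytosines remain C.
    """
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at this position")
    return c_reads / total

def mean_region_methylation(site_counts):
    """Average per-site methylation over a region; uncovered sites are skipped.

    site_counts is a list of (c_reads, t_reads) tuples, one per cytosine.
    Returns None if no site in the region has coverage.
    """
    covered = [(c, t) for c, t in site_counts if c + t > 0]
    if not covered:
        return None
    return sum(methylation_level(c, t) for c, t in covered) / len(covered)

# Hypothetical read counts at three cytosines in a predicted promoter.
print(mean_region_methylation([(9, 1), (8, 2), (0, 10)]))
```

A high average level over a predicted promoter would be consistent with silencing, although any threshold for calling a region methylated would need to be chosen and justified separately.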
5. Conclusions
In this study, from the 126 predicted promoter sequences of chromosome 1 of the rice genome, we identified those most likely to be promoters for further experimental validation. Most sequences contain TATA motifs and some cis-regulatory elements. We identified 14 predicted promoter sequences followed downstream by unannotated transcription start sites, 5 sequences followed downstream by potential unannotated transcripts, and 16 sequences located in open chromatin regions accessible for TF binding. We identified four predicted promoter sequences (Sr. No.: 9, 41, 64, and 81) which we consider the most promising for further experimental analysis.
For in vivo/in vitro testing, constructs with reporter systems such as GUS or GFP placed behind the potential promoter sequences can be used. Y2H/Y1H assays or bimolecular fluorescence complementation (BiFC) can also be used. Unfortunately, even such analyses may not reflect the true situation, as they depend on a number of factors. The tissue specificity of certain classes of spatiotemporal promoters is difficult to assess properly because of the apparent lack of suitably differentiated tissues. Also, the size of the isolated promoter can negatively affect the assay, as the cloned fragment may be missing essential cis-elements, or the test system may lack the transcription factors necessary for interaction with the cis-elements. Additionally, in in vivo assays, promoter sequences can be randomly integrated into the genome in independent transformation events, which can lead to large expression variability depending on genomic location and may also cause silencing.
Despite these difficulties, the study of genomes, and in particular of regulatory sequences, is an important task of modern biology. The genomes of many organisms are already fairly well annotated, but gaps and functionally uncharacterized regions remain. Knowledge of the exact reference sequence of the genome and its functional components can play an important role in genome editing technology. It is important to know all the regions in which editing can take place and to sensibly assess the possible off-target effects of editing. Precise knowledge of the location of functional elements of the genome will allow us to notice and control random mutations in these regions, which can affect the expression of some genes and have a noticeable or imperceptible effect on the whole, living, edited organism.