1. Introduction
The conservation (within a wide phylogenetic range) of sequence structures within certain genome regions, like genes, exons, and especially coding DNA sequences (CDS) is out of the question. For DNA structures conserved between all known lifeforms (including viruses) within these regions, its information-storing capacity of the chemical structure of proteins, known as amino acid codons, has been known for many years [
1]. At first declared as useless junk DNA [
2,
3], the existence of conserved structures within
non-coding regions (defined as regions not coding for proteins, e.g., introns and intergenic regions) is also not a very new discovery [
4,
5]. Conserved structures found within introns can range from the sequences of individual introns of related genes [
4] to global intron sequence structures, found using powerful, alignment-free methods [
5,
6,
7]. Likewise, intergenic regions (defined as the regions between genes, excluding introns, 5′ untranslated regions (UTR), 3′UTR, and known structural elements like centromeres), have, in recent years, been found to harbor interesting and conserved sequence structures. The first hints at a functional connection between introns and intergenic regions in Animalia, namely a correlation between the size of the two regions, was found in [
8]. Unlike the analysis and results shown in this article, the focus in that study was not on finding conserved sequence structures within one genome region, but instead, whether conserved patterns between (two or more) such regions (e.g., between exons and introns) could be found.
When trying to search for conserved sequence structures between complete genome sequences or regions of a comparable order of magnitude in size, such as regions consisting of all genes, introns, or intergenic regions with a large number of organisms, problems to face are those of large datasets and limited computational power. Additionally, between two such regions, and especially between non-coding regions, one cannot expect to find long, linearly conserved DNA sequences, as known to be present within exons and CDS, which can be easily pairwise aligned, using established alignment algorithms. Accordingly, standard tools like the NCBI Basic Local Alignment Search Tool (BLAST) [
9], which deliver reliable results for the comparison of genes at the cost of relatively high consumption of computational resources, cannot be effectively used to face this task [
10]. Fortunately, more recently developed alignment-free, computational, and much less complex algorithms were developed and proven to produce just as reliable results if used patiently [
5,
11,
12]. We decided to use a quite simplistic but powerful method called
k-mer-analysis [
12] to pairwise-compare genome regions of a wide phylogenetic range for 39 Animalia organisms, with completely sequenced and assembled genome sequences. The
k-mer-analysis delivers a quantification of correlation (using Pearson correlation coefficient [
13]) between sequence structures found within pre-defined regions (e.g., introns, exons). If a significantly high correlation is found within a wide phylogenetic range of organisms, this can be interpreted as conservation of such structures within this phylogenetic range. While a standard
k-mer-analysis is sufficient to discover conservation of sequence structures, it does not deliver information about the conserved sequence patterns responsible for this conservation; therefore, we went one step further and additionally analyzed the results by performing a decomposition of the correlation coefficient (see
Section 2.2 for details) to quantify the contribution of different DNA words (e.g., X
n, (XY)
n with X,Y ∈{A,C,G,T}). These lists of DNA words were then compared between the genome regions and between the organisms.
3. Results
We analyzed different genome sequence regions (see
Section 2.2) within all 39 organisms mentioned in
Table 1 using the
k-mer analysis described in
Section 2.1. We used different word lengths
k (1 ≤
k ≤ 11), and the results are shown in
Figure 1. An example of the underlying data for each data point within
Figure 1 is shown as a heatmap image in
Figure 2 (for
k = 7,
k = 11 is given in the
Supplementary Materials).
Patterns associated with relatively low correlation values (below 0.4) are visible for
Nasonia vitripennis and
Ficedula albicollis in
Figure 2. Possible explanations range from special sequence features present within these two species, not conserved within close relatives (e.g., other birds in the case of
Ficedula albicollis) to quality issues of the two assembled genomes (e.g., within tandem repeats). Since, to our current knowledge, no related discoveries were published, we decided to keep them within our dataset in order to reduce selection bias.
Some of the high correlation values we found can be trivially explained. For small
k (
k < 8), the full genome sequences (genome) showed high correlations with intergenic regions since they make up most of the genome sequences. The same is true for correlations between genes and introns, where introns make up most of the gene sequences. It is also not very surprising to see that there is a lower correlation (<0.65) between introns–intergenic regions and exons, and even lower correlation values (< 0.3) between introns/intergenic and CDS (
Figure 1). These low correlations between non-coding regions and exons/CDS were expected, since it is known that exons, and especially the CDS, have a well-known conserved biological function (coding for proteins), and thus have associated conserved sequence structures (amino acid codons) [
1] whereas known functions and associated sequence structures of introns or intergenic (non-coding) regions are different [
23] (e.g., not limited to DNA triplets).
The high correlation values between introns and intergenic regions for small
k (
k < 8), visible in
Figure 1 and
Figure 2 (and in
Figure S1 also for
k = 11, though mainly in the same organism, not between several organisms) and the resulting correlation between the intergenic regions and genes cannot be trivially explained. The correlation between introns and intergenic regions visible in
Figure 1 must be the source of this gene–intergenic correlation since there is no significant correlation between exons and intergenic regions. The correlation values, shown as a heatmap in
Figure 2, are an illustration of the high correlation between introns and intergenic regions for
k = 7 (also visible as one respective data point in
Figure 1). The correlation values in
Figure 2 are relatively high (mostly > 0.6, and visible as mostly red colors in
Figure 2) for most pairs of introns and intergenic regions, and especially high if they are in the same organism, as can be seen by the dark red and non-trivial diagonal line. While this feature can also be seen to be a high mean correlation for the respective data point within
Figure 1,
Figure 2 illustrates that this mean value is not the result of a small number of very high correlations between the genome sequence regions of things such as closely related organisms, but the result of an overall high correlation between regions of all analyzed genome sequence regions. However, as a good illustration of that circumstance, we decided to rely on calculated uncertainties based on standard deviations (shown as error bars within
Figure 1) herein. These uncertainties can be interpreted as a summary of the range or spread of values within images like
Figure 2. Another interesting feature in
Figure 1 is the fact that the correlation values within the exon and CDS regions, as well as the correlation between the two, is significantly lower for small
k (
k < 8) when compared to intergenic–intron regions. Since exons and CDS are responsible for important biological functions (coding for proteins), one would expect a very high conservation of sequence structures, and thus relatively high correlation values in relation to non-coding parts like introns or intergenic regions. This behavior is only visible for higher word lengths (
k > 8) in
Figure 1, indicating that these conserved structures found in exons/CDS have a minimum size of about 9 bp on average. This observation is reasonable since proteins typically consist of more than three amino acids (corresponding to three codons, therefore a length of 9 bp of CDS).
The observation that introns and intergenic regions show significantly higher correlation values in
Figure 1 for smaller word lengths (
k < 8) than exons and/or CDS indicates that there are conserved sequence structures with typical length scales of less than 8 bp. To analyze both of these word length regimes, we focused our further analysis on two word lengths for
k = 7 in the regime of higher non-coding correlations, while still maintaining the optimal
k-mer word length range for sequences of the size analyzed [
16] and
k = 11, in the regime of
k > 8 where coding sequences showed higher correlation values. We chose the prime number
k = 11 over other possibilities like
k = 9 or
k = 10 to exclude resonance effects, which are potentially caused by smaller structures, such as amino acid codons (e.g., because 9 is a multiple of 3). This also further motivated the usage of
k = 7 over other values like
k = 6.
We analyzed the
k-mer words responsible for interesting correlations at
k = 7 and
k = 11 by applying a decomposition of the correlation values as described in
Section 2.3. The word sets we used consisted of one word of respective length
k within each set (e.g., for
k = 7:
S1 = {AAAAAAA},
S2 = {AAAAAAC},
S3 = {AAAAAAG},
S4 = {AAAAAAT},
S5 = {AAAAACA}, …). This decomposition obviously satisfies the requirements of a valid decomposition described in
Section 2.3. The
k-mer words which contributed most to high correlations (
Figure 1) are shown in
Table 3 (
k = 7) and
Table 4 (
k = 11).
The top 10 words shown in
Table 3 and
Table 4 show clear preferences of specific
k-mer word patterns for different pairs of regions compared. Even though only the top 10 words are shown, the tandem repeats observed also dominate in complete lists (not shown for reasons of clarity; for the top 100 see
Supplementary Materials,
Table S2 for
k = 7, and
Table S3 for
k = 11). The most obvious feature shared between all pairs of regions shown in
Table 3 and
Table 4 is the fact that tandem repeats (e.g., multiple repetitions of one small unit, such as ATATATA, is a tandem repetition of the repeat unit AT), with repeat unit lengths smaller than 4, make up a considerable fraction of the top 10 contributing words between all pairs of sequences (36–100%) (
Figure 3). If the contribution was to be evenly distributed over all
k-mer words, the correlation value should be about 4
−k; therefore one would expect contributions to be lower than 0.0006 for
k = 7 and even less for
k = 11. Since there are only 4
b possible tandem repeats for a given length b (in base pairs) of the repeat unit, one would expect a fraction of 4
b/4
k = 4
b−k (~0.4% for b = 3 and
k = 7, and even less for b < 3 or
k = 11) for tandem repeat
k-mer words within the top 10 contributing words. Especially when small deviations in the sequences between the listed words and tandem repeats (e.g., one mismatch) are considered, nearly 100% of the words shown are tandem repeat-based. Accordingly, tandem repeats seem to be overrepresented within the lists shown in
Table 3 and
Table 4, in which the overrepresented repeat unit length also differs depending on the pair of regions compared. Lists of top 10 (and also top 100) contributing words pairing CDS are dominated by unit lengths of 3 bp. This is partially true also for pairing exons or even CDS and exons, especially for lower
k (
k = 7). For other pairs, the lists contain mainly tandem repeats with smaller unit lengths of 1 bp and 2 bp (partially true also for exons, especially with higher
k,
k = 11). The lists shown in
Table 3 and
Table 4 also show that except for pairs containing CDS, the top contributing words are A/T words (
k-mer words consisting of A and T only), while C/G words are only rarely found. And even though polyA and polyT words give a large contribution to the correlation value, the contribution of all tandem repeats with
b = 2 together is bigger.
The data shown in
Table 3 and
Table 4 clearly suggest that the structures of most-contributing
k-mer words are different for the intron–intergenic correlation when compared with the exons–CDS, even though exons seem to be a hybrid under this perspective between CDS and intron–intergenic regions. This indicates a different nature of underlying sequence structures responsible for high correlations found by the applied
k-mer analysis. It is peculiar that the top-contributing words found for CDS, and to some extent also for exons, represent tandem repeat patterns of amino acid codons, namely units of 3 bp (e.g., (ATT)
n), while the words found for intron–intergenic regions show different patterns.
To further quantify these results, we went from analyzing structures of the top contributing words to analyzing the contributions of all words representing tandem repeats, with a given repeat unit length b. In the context of correlation contribution, this could be achieved by defining the following four different word sets:
S1 = {(X)
n for X ∈ {A,C,G,T}},
S2 = {(XY)
n for X, Y ∈ {A,C,G,T} and not in
S1},
S3 = {(XYZ)
n for X, Y, Z ∈ {A,C,G,T} or
S1 or
S2},
S4 = {not element of
S1,
S2,
S3}. One can easily check that this decomposition satisfies the requirements of a valid decomposition, since the sets are all mutual exclusives with
S1–
S3 containing tandem repeats with repeat unit lengths of 1–3, respectively, and
S4 contains all remaining
k-mer words (including all non-tandem repeat words). Using these sets, we performed a decomposition of the correlation coefficient, as described in
Section 2.2. The exemplary results, which show contributions of different repeat unit lengths
b, are shown in
Figure 3 and listed for all interesting pairs in
Table 5.
Table 5. While non-tandem repeat words contribute more than 80% (for
k = 7) to the correlation values of all pairs, tandem repeats with unit lengths of 1 bp and 2 bp contribute significantly to the correlations within and between introns and intergenic regions, while repeats with unit lengths of 3 bp contribute mostly to correlations within and between exons and CDS. Once again, exons also show some features of introns and intergenic regions, as for exons–exons, the amount of unit lengths of 1 bp is also quite high. The reason behind this observation could arguably be the known presence of poly-A tracts within UTRs, which are part of exons but not part of the CDS [
24].
Table 3 shows that the highest-contributing words between exons and introns are mainly Poly-A/T words, which supports this interpretation.
For
k = 11, the contributions of tandem repeats further increase to more than 50% for pairs of introns and intergenic regions. This supports the observation made in
Table 3 and
Table 4, as well as the idea that the repeat unit length for exons and CDS is associated with conserved sequence structures (amino acid codons), while the sequence structures conserved between introns and intergenic regions consist of smaller repeat units. It is remarkable that the remaining words contribute more to the correlation values of exons (> 72%) and CDS (> 89%), while values < 85% (down to < 50%) are observed for introns and intergenic regions.
The fact that more than 80% of contributions are based on non-tandem repeat words for k = 7 could lead to the misleading impression that tandem repeat words are proven to be irrelevant for the observed correlations and therefore for conserved sequence structures, but one should keep in mind that for k = 7 there are only 43 = 64 tandem repeat words with unit length b < 4, while there are 47 = 16,384 words overall, meaning that ~0.4% of all k-mer words are responsible for 8–18% (depending on the paired regions) of the respective correlation values.
Besides the fact that tandem repeat words are overrepresented within the top-contributing word lists, shown in
Table 3 and
Table 4, it is also clear that there is a tendency for A/T-rich words in the top-contributing words to be in intron intergenic regions, and an opposite tendency for top-contributing words between CDS and exons. To quantify this observation, we performed a correlation decomposition based on the G/C content of respective
k-mer words by defining sets like
SG/C<X% = {words with G/C content < X%} (
SA/T<X% can be defined analogously). The results, shown in
Table 6, supported the observations concerning the G/C (or A/T content) made in
Table 3 and
Table 4.
Words with low G/C content contribute most to the correlation values between intron and intergenic regions while words with a relatively balanced G/C to A/T ratio contribute most to the correlation values between exons and CDS for k = 7 and k = 11. This could illustrate constraints to the sequence structures of CDS and exons, resulting from the need to code for many different amino acids, in the sense that there would not be enough words consisting of mainly A/T or mainly G/C to encode for all amino acids with a reasonable level of redundancy. The fact that the G/C content in the words contributing most to the correlation values between introns and intergenic regions is very low, while words with high amounts of G/C contribute very little to the correlations between introns and intergenic regions, supports the idea that the underlying conserved sequence structures do not encode for biological functions in the same way as CDS does.
We further investigated the preference for low G/C or A/T rich words respectively, within introns and intergenic regions by analyzing tandem repeat
k-mer words allowing one mismatch (e.g., AAAAGAA is counted as AAAAAAA). That corresponding words contribute significantly to the correlation between introns and intergenic regions is illustrated by the abundances of these words within the top 10 contributing word lists shown in
Table 3 and
Table 4 (or top 100 in
Supplementary Materials,
Table S2 and
Table S3).
Table 7 shows the contribution of all tandem repeat words with unit lengths
b < 3 to the correlation between introns and intergenic regions for
k = 7 and
k = 11. The decomposition was made analogous to the decomposition for creating the top lists in
Table 3 by allowing one mismatch for tandem repeat identification.
Table 7 again shows that words contribute more to the correlation between introns and intergenic regions if they have lower G/C content for
k = 7 and
k = 11. The contribution of words consisting of A/T only reaches up to 10%, while words consisting of G/C only contribute down to 0.01% for (GC)
n and (CG)
n. The contributions of the complementary DNA words (AC)
n, (CA)
n and (TG)
n, (GT)
n show slightly higher contributions than other words with the same G/C content for
k = 7, and this discrepancy to the general trend further increases for
k = 11. The high contribution of non-polyA and non-polyT words taken together is also noteworthy.
To check possible indications on whether these words discovered were completely without function (a product of random mutations) or whether they lay in an unfunctional region, we analyzed possible preferences for specific nucleotides (A, C, G, T) within mismatches for all tandem repeat
k-mer words with unit lengths < 3 bp, allowing only one mismatch. The results within introns and intergenic regions are shown in
Table 8. Assuming that tandem repeats with mismatches were, to a significant extent, the results of point mutations in previously complete repeat words or sequence patterns mutating towards tandem repeats, these results can be interpreted as being preferences for mutations nearby or within tandem repeat words. Similar methods were developed, published, and used in recent publications [
7,
18]. While further investigation would be needed for a more definitive statement, the observation that these mismatches do not seem to be equivalently distributed for all four bases or all words could indicate that these sequences are functional.
While the general tendency towards mismatches consisting of A or T could be explained by a higher A/T content in general within analyzed organisms (data not shown), the fact that there are differences in mismatch preferences between different
k-mer words cannot be explained by global A/T or G/C content. The same is true for both introns and intergenic regions, as these preferences are conserved between them.
Table 8 shows that for
k = 7, the mismatches preferred are A/T over G/C. This does not occur randomly, as A is preferred when already containing A, and T if already containing T, and this tendency increases with the content of A/T (Poly-A and Poly-T are excluded since they cannot have mismatches of A and T respectively). If interpreted as point mutations, this indicates that there is a tendency of mutation towards higher A content in words containing A and towards higher T content in words containing T. An analogous tendency for C or G was not found (see
Table 8). This further supports the idea that sequence structures containing high content of A or T are preferred. The general tendency also exists for
k = 11, with the exception that (AC)
n, (CA)
n, (TG)
n, and (GT)
n do not show this tendency for
k = 11. They even show a slight tendency toward the complementary nucleotide (i.e., more T mismatches in words containing A and more A mismatches in words containing T). It is remarkable that these belong in the set of words that showed increased correlation contributions for
k = 7 and especially for
k = 11 (see
Table 7), apart from pure A/T words. If interpreted as mutations, this could indicate that repeats of (AC)
n, (CA)
n, (TG)
n, and (GT)
n do not mutate into homogeneous poly-A/T sequences. That these observations concerning (AC)
n, (CA)
n, (TG)
n, and (GT)
n, and other
k-mer words are conserved between introns and intergenic regions in Animalia supports the idea of a function shared between introns and intergenic regions.
4. Discussion and Conclusions
In summary, conserved sequence structures between different genomic regions were found, with positive correlations between sequence regions already known to be highly conserved (e.g., exons and CDS), as well as between regions where conservations could not be trivially explained as being in and between introns and intergenic regions. The general observation that short nucleotide units, such as dinucleotides and trinucleotides, are most relevant for sequence structures is consistent with the discoveries of former studies [
6].
All correlations within exons and CDS and between the two were the result of tandem repeats, with repeat unit lengths of 3 bp (triplet repeats). This observation is consistent with the well-known sequence patterns (amino acid codons) which are responsible for their biological function (coding for proteins) and accordingly responsible for their conservation [
1]. While the amino acid code alone cannot explain the relevance of repeats, some amino acid codon triplet repeats, consistent with the
k-mer words identified as being most relevant for exon/CDS correlations (as shown in
Table 3 and
Table 4), are known to code for important protein structures [
25]. The observation that for
k = 7 and
k = 11 non-repeat words still contribute more than 70% to the correlation values of exons and CDS (as shown in
Table 5A,B), as well as the relevance of words with balanced G/C to A/T content most contributing to exon and CDS correlations (as shown in
Table 6A,B), support the argument that the correlation of the two regions is a product of sequence conservation due to constraints based on the included amino acid code, since these observations can be explained by the need of many different DNA words as (redundant) codons for amino acids. The presence of smaller repeat units and the slightly higher relevance of A/T words for exon–exon correlations, when compared to correlations with CDS as well as other described similarities between exons on one side and introns and intergenic regions on the other, show the hybrid composition of exons. This can be explained by the presence of UTR which are part of exons but not of CDS. Accordingly, sequence structures in exons deviating from the structures found in CDS (as observed within this article) are expectable because UTRs do not code for amino acids and thus are not affected by amino acid codon-related constraints. In summary, the correlation found between exons and CDS, as well as the responsible sequence structures identified in this article, are consistent with known conserved sequence properties of these genome regions.
The correlation between non-coding regions (introns and intergenic regions) cannot be explained as easily. The difference between structures identified as being responsible for correlations of non-coding regions compared to coding regions (e.g., exons, CDS) is consistent with the fact that they do not share the same function (coding for proteins or functional RNA). The most impressive results are the high conservation of tandem repeats with a repeat unit length of
b ≤ 2 bp between introns and intergenic regions in each of the analyzed organisms, as well as the high conservation between these regions when compared across the species, mainly in Deuterostomia. The observation that introns and intergenic regions show high correlations and the same sequence patterns responsible for intron—intron and intergenic—intergenic correlations are also responsible for intron—intergenic correlations indicates phylogenetic conservation of these structures. As this is commonly seen as a sign for preserved function, the possibility of functional elements shared between introns and intergenic regions as an explanation for the observations made in this article, seems to be the most promising hypothesis. The identification of tandem repeats with b ≤ 2 bp favors this, as the set of words found is stable over large parts of evolution, for the big group of Deuterostomia at least for 500 million years ago, despite all base mutations, insertions, deletions, or rearrangements which have occurred during all these ages. And also, despite all these evolutionary alterations, correlation values of
k-mer distributions in intron and intergenic regions, strongly being under domination of these tandem repeats, are, in each organism, very high (above 0.8), higher than between species, indicating a highly preserved but not fixed homogeneity. This idea gets even higher support by the analysis of single-base mismatches in these DNA words, as these regions do not seem, for the exception of pure C/G words, to preserve every possible observable mutation, but favoring, depending on their own sequence, special transitions, mainly towards A/T. It may be necessary here to say that even though polyA and polyT have a huge influence in all analyses, the obsesrved patterns and properties are more strongly influenced by the sum of all other tandem repeats with
b ≤ 2 bp and may not be explained by just arguing on polyA and polyT. How these preservations are maintained or where they may have risen from remains debatable. One explanation for conserved structures between introns and intergenic regions could be a shared origin of, or a constant exchange between both regions. A corresponding observation (emergence of a gene from intergenic DNA in
Mus musculus) was made in [
26]. In this case, it could be beneficial to conserve sequence structures between intergenic regions as a source and introns as a target, to reduce the number of mutations needed for an active gene to emerge. While this hypothesis could explain the observed correlations between introns and intergenic regions, our observations do not support or refute it. The reason for this is that conserved sequence structures could well be a disadvantage, since a small number of mutational steps needed to create genes from intergenic regions may also lead to many potentially harmful active genes.
Nevertheless, for the large sequence amounts of introns and intergenic regions, this rarely-observed event is not so convincing. The opposite pathway, namely, intergenic regions as products of gene (mainly consisting of introns) degeneration, supported by the existence of pseudogenes still recognizable to have gene-like sequence structures but not coding for proteins [
27], is also possible and was, in fact, one of the first theories about the origin of non-coding DNA [
3,
28]. This mechanism can only explain conserved sequence structures, if intergenic regions already include functional DNA motifs that are then shared between introns and intergenic regions or do not have enough time to deviate from intron-like sequence structures through mutations. The last possibility seems unreasonable considering the time for mutations corresponding to the phylogenetic range of organisms analyzed (about 500–600 million years), especially since tandem repeats in general are known to have high mutational rates [
29].
A possible family of functional DNA motifs known to be shared between introns and intergenic regions are transposons. Since these functions were found to be associated with embryogenesis [
30], a special role of transposons for sequence-structure conservation within Animalia would not be surprising. Transposons make up a considerable amount of genome sequences of Animalia and accordingly intergenic regions. Retrotransposons especially are found in very large numbers [
31] in many Animalia species. Since retrotransposons are found as repetitions of similar sequence motifs within genome sequences, they could, in general, explain conserved sequence structures as observed. While a quantitative determination of the roles of transposons for the correlation content would require a formal analysis of sequence patterns of all known transposon families in Animalia (which is out of the scope of this article), the first qualitative results suggest that the tandem repeat structures we found to be responsible for high correlation values are not typical for sequence patterns of transposons [
31,
32]. Transposons are also known to be flanked by direct repeats, but these repetitions are usually not tandem repeats. An exception to this is Poly-A, found as (A/T)
7 and (A/T)
11 in this article. Poly-A (Poly-T on the opposing strand) tracks are known to flank many transposons as a result of reverse transcription [
24]. They could, in principle, explain the occurrence of A
n and T
n words within our analysis, but not the presence of other tandem repeats.
Another known functional sequence motif conserved between introns and intergenic regions, present in all analyzed organisms, are transcriptional factors (more precisely, their binding sites), especially binding motifs for silencers and enhancers [
33]. While the exact sequence structure of these transcriptional-factor binding sites is not completely understood [
23], and while a formal analysis would again be required to confirm these qualitative results, patterns typically annotated as such binding sites do not show a significant amount of tandem repeats [
34]. Like transposons, transcriptional-factor binding sites were found to be flanked by sequences consisting of either A or T, but since their typical size is 3 bp–5 bp [
34], they cannot be responsible for the results we found with word lengths > 6 bp.
Another discovery made is that transcriptional factors are associated with local DNA topography [
36]. This observation is of special interest for the interpretation of the observations made in this article, since dinucleotides (DNA words of length 2 bp) are often used to determine such structural and topological properties of DNA molecules [
36,
37] and the structures we observed to contribute most to intron intergenic correlations consist mainly of repeats of such dinucleotides. Since transcriptional factors influence distant genes through DNA loops [
33], it seems reasonable that local physical DNA properties that favor such loops are beneficial when associated with transcriptional factor binding sites. Since A/T-rich sequence structures—found responsible for introns/intergenic correlation in this article—were observed to induce loops within DNA molecules [
38], this is suggestive of the hypothesis that the function potentially associated with conserved sequence structures between introns and intergenic regions could be associated with local bending properties, and thus with 3D or more general conformation, with the topology of the DNA molecules. If true, the role of the different observations of (AC)
n, (CA)
n, (TG)
n, and (GT)
n in contrast to (AG)
n, (GA)
n, (TC)
n, and (CT)
n could be associated with specific structural dimer properties. The tandem repeat sequences identified in this article could also influence transcriptional-factor activity, and likewise, DNA topology through their mere presence, since the distance between transcriptional-factor and regulated genes would be increased or decreased based on the length of the tandem repeats located between them, and the same is true for the general length of the DNA molecules which could influence their general topological properties (sequence-dependent and sequence-independent).
If this is true, the fast mutational rates of tandem repeats and their tendency to change rapidly in length [
29], manuscript in preparation, could favor them over other sequence patterns to allow relatively fast adaptions of this regulatory function. In this case, the explanation of the preference of specific tandem repeats over others, especially the observed preference of A/T-rich repeats in these assumed interactions, may be an important and promising stepping-stone for further research in this field.