1. Introduction
Chicken is the most widespread poultry, widely used for food production and as model organisms for studying the mechanisms of embryonic development. Over the long history of human domestication, a large number of breeds of chicken have been developed. According to the Domestic Animal Diversity Information System of the Food and Agriculture Organisation (FAO), currently, there are more than 1500 chicken breeds (
https://www.fao.org/dad-is/browse-by-country-and-species/en/ accessed on 1 August 2024). In recent years, the genomes of chickens have been actively studied; currently, the genomes of 26 breeds have been sequenced [
https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9031 accessed on 1 August 2024]. Such a collection gives an opportunity to comprehensively analyze the genetic features associated with unique breed-specific features in morphology, physiology and other aspects.
The Russian White chicken breed was created in the USSR by crossing the White Leghorn with local Russian breeds. The RSW breed was derived from the Russian White, and during 1954–1970 was subjected to a high degree of inbreeding and selection for extraordinary resistance to the cold and some diseases, like viral leukemia and Marek’s disease [
1,
2]. Selection for cold resistance took place over 25 generations based on the principle of chicken survival in the first five days of life at low temperatures. As a result, RSW chickens have become cold-resistant and are able to survive from a day-old age at temperatures of 16 °C, and adult chickens can be kept in unheated poultry houses in the winter [
1,
2]. The creation of a breed by selection for the traits of interest should have been reflected in the genome, which is of great interest to researchers, since genome research can shed light on the molecular mechanisms of resistance to cold and disease [
1,
2,
3,
4,
5,
6].
To date, research on the genome of the Russian White and its derived population RSW has been limited to the individual genes and genome-wide SNP (single nucleotide polymorphism) chips. In this way, several genomic loci associated with unique RSW characteristics have been identified [
3,
4,
6,
7]. At the same time, the unusual tolerance of the RSW to cold and the remarkable phenotypic changes compared to other chicken breeds suggest strong changes in the genome. The de novo assembly and annotation of the RSW would provide a new, promising layer of knowledge about the link among the structure of the genes and the regulatory elements in the genome resulting in such drastic physiological changes. In this paper we sequenced, assembled, annotated, and analysed the genome of the RSW for the first time and revealed several strong unique features of the genome to be associated with the unique properties of this breed and the evolution of resistance to unfavourable environmental conditions in birds in general.
3. Discussion
Although the chicken genome is relatively small (1.1 Gb), its sequencing poses certain challenges. Thus, some genes remain invisible when the genomic DNA is sequenced, even though they are visible when the RNA transcripts are sequenced [
10]. The reason appears to be the large number of GC-rich regions and G4s in the DNA structure, which reduces the efficiency of the PCR amplification of these regions [
11,
12]. To overcome this problem, the depth of the sequencing was artificially increased by integrating sequencing data from different breeds and creating a so-called pangenome [
11], which is an addition to the reference genome sequence of 159 million base pairs in length. Hidden GC-rich regions of the Huxu chicken breed genome have also been successfully sequenced using Nanopore sequencing technology, which is less susceptible to this problem [
9]. Increasing the sequencing depth to 250× allowed us to recover most of the RSW genome, as evidenced by the high BUSCO completeness (96.8%), greater than in the reference assemblies GRCg6a, GRCg7b, and GGswu. The high quality of the GGRsw1 assembly allows it to be used as a reference genome in future studies. However, there are 2.5% of unplaced sequences in the genome assembly, and this could be refined in future studies using long-read sequencing methods, Hi-C, and optical mapping. Furthermore, we provide a gene annotation for the GGRsw1 assembly that contains most of the known chicken genes. The gene annotation can also be improved in the future by sequencing the transcriptome under various experimental conditions and tissues. GGRsw1 is the first genomic assembly of a unique RSW breed, which opens up great opportunities for future research. We sequenced only one chicken and were able to reconstruct its heterozygous sites, which opens the door to understanding the genetic diversity within the RSW breed, but more research is necessary in this area.
Phylogenetic analysis of the RSW genome has shown its relationship with European and Mediterranean chicken breeds, apparently separated early from the other breeds in the history of domestication. The relationship of the RSW to the White Leghorn breed is consistent with the historical data on its origin. We have established a relationship with the old French Houdan breed and the ancient Egyptian Fayoumi breed for the first time. The relationship of the Russian White with another French breed, the Faverolles Salmon, was obtained earlier based on multilocus DNA fingerprinting [
13].
A comparative analysis of the RSW genome with the genomes of the other chicken breeds revealed a large number of RSW-specific mutations. Such RSW-specific mutations were expected to reflect the unique characteristics of this breed, such as the cold resistance, the resistance to viral diseases, and the snow-white colour of the day-old chicks. Analysing the RSW-specific mutations, we found an excess frequency of mutations in genes from the pathways responsible for body temperature maintenance. The altered thermoregulation pathways include the nervous system, thermoreceptors, purine receptors, and the TGF-beta pathway. This can be explained by the artificial selection for cold resistance during RSW breeding. The control of body temperature is carried out by the nervous system. It is also known that the development of an organism under cold conditions triggers the mechanisms of adaptation to the cold through the modulation of neuronal differentiation [
14,
15]. The mechanism of thermosensitivity of the neurones is based on the action of thermosensitive TRP ion channels. When the temperature changes, the TRP ion channels can alter the entry of ions into the cell, which can lead to changes in the membrane potential [
16]. We discovered a large number of RSW-specific mutations in the genes associated with neuronal differentiation and growth. Furthermore, seven TRP ion channels are specifically mutated in the RSW breed. Four of them are high temperature receptors (
TRPV1,
TRPV2,
TRPM3,
TRPC4) [
17,
18,
19]. The remaining
TRPM1,6,7 channels have no known function in temperature sensitivity. Interestingly, we did not detect RSW-specific mutations in the coding parts of the cold-sensitive receptor genes
TRPM8 and
TRPA1. The purinergic receptors
P2X are cation channels activated in response to extracellular ATP, with important roles in various biological functions. We found an increased frequency of mutations in genes from the purinergic signaling pathway. The genes specifically mutated in the RSW include
P2RX4,
HCN1,
TLR7,
P2RX7, and
PKD2, many of which are involved in body temperature maintenance.
P2RX7 knockout mice have a lower core body temperature and an increased expression of
P2RX4 mRNA in the hypothalamus [
20].
HCN1 is a hyperpolarization-activated cyclic nucleotide-gated channel that functions in the heart and central nervous system, including the hypothalamus.
HCN1 has been shown to be involved in long-term cold adaptation in mice [
21].
HCN channels have also been linked to cold perception. That is,
HCN1 shows increased expression in oxaliplatin-induced cold hypersensitivity, and its specific inhibitor ivabradine abolishes cold hypersensitivity [
22]. Also,
PKD2 (also known as
TRPP1) was not implicated in body temperature regulation; it is a member of the transient receptor potential (TRP) family, many of which are well-known temperature sensors. We hypothesise that the genetic adaptations to cold in the RSW may be directed toward modulating the nervous system.
The transforming growth factor beta (TGF-beta) signaling pathway participates in many cellular processes, including thermoregulation and immune response [
23,
24]. We found that the genes involved in the TGF-beta pathway have an increased frequency of RSW-specific mutations (
Table 2 and
Table 3). TGF-beta is involved in the thymic selection of T cells and in the homeostasis of the naive T-cell pool. Also, TGF-beta controls the development of B cells, natural killer cells (NKs), macrophages, dendritic cells, and granulocytes [
24,
25]. In addition, many of the components of TGF-beta are involved in thermoregulation [
23]. We hypothesise that these mutations in the genes of the TGF-beta pathway may appear due to selection for resistance to viral diseases or cold conditions.
The RSW breed has been selected for resistance to viral leukemia and Marek’s disease [
1,
2], which could be reflected in its genome. In fact, we found an excess frequency of RSW-specific mutations in the genes associated with the immune system (
Table 4 and
Table 5). Interestingly, this excess of mutation frequency is observed near the genes (±10 kb) but not in the coding regions, indicating the possible involvement of transcriptional regulation. Marek’s disease (MD) is a highly contagious viral disease caused by
Gallid alphaherpesvirus 2 (GAHV-2) that results in the rapid development of T-cell lymphomas in chickens [
26]. Additionally, RSW chickens were selected for resistance to leukemia caused by the
Rous sarcoma virus (RSV) [
2]. The RSV is a retrovirus with a single-stranded RNA genome that can integrate into the genome of chickens through reverse transcriptase and integrase enzymes. Interestingly, we discovered the integration of the RSV into the RSW genome, as well as into the genomes of other breeds (Huxu, GRCg7b), which may affect the susceptibility of these breeds to this virus. Among the most mutated genes in the RSW breed are the genes related to immunity,
T-cell receptor variable 20-like,
mitochondrial antiviral signaling protein (
MAVS), and
INPP5D (SHIP1) [
27], which may be caused by selection for the resistance to viral diseases. In addition to the role of purinoceptors in thermoregulation, purinoceptors are also involved in immunity. That is, the
P2RX7 receptor is crucial for controlling how antigens are presented and how T cells are activated [
28]. Furthermore, many purinergic receptors, including
P2RX4 and
P2RX7, have distinct expression patterns in MD-resistant and MD-sensitive chicken breeds during MD infection [
29], indicating that the adaptation of purinergic signaling may be due to the selection for MD resistance in the RSW breed.
In addition to the RSW-specific SNP analysis, we identified long RSW-specific genomic regions, most of which are rich in tandem repeats and G4s. The high content of G4s in the RSW-specific regions may be explained by the fact that they were missed previously when they were sequenced with a low sequencing coverage. The tandem repeats are likely to have evolved through unequal crossing-over and strand slippage during DNA replication [
30]. The only such genomic region without tandem repeats and G4s contains a gene of the myosin heavy chain family, which has apparently been lost in most chicken breeds except the RSW. Another RSW-specific genomic region with a low concentration of tandem repeats and G4s apparently contains an additional 5′ exon of the
OPLAH gene, since the known chicken
OPLAH mRNA (XM_040696025 970 bp) is significantly shorter than the similar human mRNA (XM_047421688 4289 bp). We cannot determine whether this sequence has been lost by many breeds of chicken or whether it is missing from the genome assemblies due to sequencing difficulties, as it contains a significant number of G4s.
A search for RSW-specific SNPs in the regions previously identified by GWAS in the RSW yielded two good candidate causal SNPs for the snow-white trait in the
ZFHX4 and
RALYL genes, and for the egg weight trait in the
TLL1 gene. Furthermore, the
RALYL gene is located in the genomic region that presumably was recently under selection pressure [
3]. Interestingly, using a GWAS, a connection between the
RALYL gene and the colour of duck feathers was recently discovered [
31], supporting the idea of its role in aves colour. Therefore, the missense mutation of the
RALYL gene (chr2: 122881444: G > T) may play a key role in the pigmentation of day-old RSW chicks, but the molecular mechanisms require further study.
4. Materials and Methods
The tissue samples of the RSW chickens were provided by the Russian Research Institute of Farm Animal Genetics and Breeding. The genomic DNA was isolated from the liver tissue of one chicken using a DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany). 250 ng of total DNA was used for the paired-end DNA library preparation using the NEBNextUltra II DNA Library Prep Kit (New England Biolabs, Ipswich, MA, USA) and NEB Next Single Index set (New England Biolabs, Ipswich, MA, USA) according to manufacturer instructions. Whole-genome sequencing was carried out by the NovaSeq6000 sequencing platform (Illumina, San Diego, CA, USA) with a 2 × 150 bp read length. Sequencing was performed at the Skoltech Genomics and biovisualization Core Facility.
Read quality control was conducted using FASTQC v0.11.8 [
8]. Sequencing adapters were identified in 2.1% of the read pairs. The adapter sequences were:
Adapter1:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
Adapter2:
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.
The adapters were removed before the subsequent mapping to the genome and were kept for the de novo assembly procedure.
As part of the preliminary analysis before the de novo genome assembly, we mapped the sequenced reads to the chicken genome using the pangenome [
11] as a reference. A total of 89.7% of the read pairs were uniquely mapped, and 6.43% were ambiguously mapped due to genomic repeats. The remaining sequences were classified into chimeric (0.87%), either due to structural variants or PCR template switching artefacts; unmapped, due to low base call quality (0.59%); and novel sequences, partially mapped to the pangenome (1.64%) or fully unmapped (0.76%). This preliminary analysis showed a high sequencing quality and the existence of a small number of novel sequences missing from the reference genome. Analysis of the unmapped reads showed the presence of a small amount of contamination. Among the unmapped reads, there is rat and
Neogale vison DNA, which we explain by rare errors in the read demultiplexing (reads related to the samples sequenced in the same run). In addition, the samples contained avian viruses,
Galliform chaphamaparvovirus,
Avian endogenous retrovirus EAV-HP, and
Rous Sarcoma virus, that appear to be located in the tissues under study or directly in the chicken genome. Furthermore, we identified bacterial 16S rRNA of unknown origin. The remaining unmapped reads can be successfully mapped to the genomes of other bird species, and apparently belong to the genomic regions of chickens missed during the sequencing of other chicken breeds (due to the high GC and/or G4s), or to the genomic regions specific to the RSW breed.
Mapping to the pangenome [
11] was performed with bowtie2 v2.2.3 [
32] in paired-end mode, allowing up to 1000 bp inserted between the read pairs. The mapping statistics were collected using an in-house program (
https://github.com/yevshin/bioutils accessed on 1 September 2024). Analysis of the unmapped reads was conducted by mapping them with Blast [
33] to the nt database.
The de novo contigs were assembled using the MASURCA v4.1.0 genome assembler [
34] with the parameters listed in [
Supplementary Table S7]. The contig to chromosome assignment and ordering was conducted using the reference-guided scaffolder RAGTAG v2.1.0 [
35,
36], with the “-r -g 1” parameters. To clear the non-bird sequences from the unmapped contigs, we aligned them to the nt database with blastn v2.9.0+ [
33].
Due to the relatively high coverage of the mitochondrial genome compared to the nuclear chromosomes, MASURCA was not able to recover the complete mitochondrial chromosomes when assembling from all input reads. The mitochondrial genome was assembled separately. To save computer time, 10 million subsets of raw reads were selected and these reads were aligned to the Huxu mitochondrial genome, allowing sequence boundary crossing with the minimap2 v2.26-r1175 aligner [
37] in “-ax sr” mode. The mapped reads were collected and used for the de novo assembly with MASURCA. Artificial overhangs, which appeared due to the circular nature of the mitochondrial genome, were manually removed.
To access the completeness of the assembly based on the analysis of orthologous genes, we used BUSCO v5.7.0 [
38] in the genomic mode (“-m genome”) with the augustus gene predictor (“-augustus”) and the “aves_odb10.2024-01-08” ortholog gene dataset.
The gene annotation of the GGRsw1 genome was performed by mapping the existing annotation of the GRCg7b assembly using the Liftoff v1.6.3 tool [
39].
To identify the heterozygous sites of the sequenced chickens, raw reads were mapped to the GGRsw1 assembly using bowtie2; the alignments with mapq ≥ 10 were selected and used for variant calling. The variant calling was conducted by the standard bcftools v1.8 pipeline [
40]. The variant calls with QUAL < 200 were filtered out and the remaining variants were converted into the BCF file format.
The pairwise alignment of the genomic sequences of GGRsw1 with the 33 chicken genomes of the other breeds was performed using winnowmap v2.03 [
41,
42].
For the phylogenetic analysis, SNPs were called based on the pairwise whole-genome alignments between GGRsw1 and each of the 33 chicken genomes. We restricted the analysis to the genomic regions present in the all of studied genomes. In total, we identified 11.7 million such SNPs. The phylogenetic tree was built using RaxML [
43] v8.2.12 under the GTRGAMMA substitution model. Bootstrap analysis with RaxML showed a high confidence of all branches (Internode Certainty > 80). The resulting tree was rooted at the Red Junglefowl and plotted using the APE v5.8 [
44] R package.
To identify the RSW-specific mutations, genomic variants distinguishing the breeds were called based on the whole-genome alignments. Due to the incompleteness of some genomic assemblies, we studied the regions represented in at least 30 of the 33 assemblies. If a particular allele was found only in the RSW genome and not in any other genome, it was classified as RSW-specific.
4.1. Test for Enrichment of RSW-Specific Mutations
By pairwise-comparing the RSW genome with many genomes of other chicken breeds, we find many mutations that distinguish the corresponding breeds. Some of these mutations are unique to the RSW; that is, they are not found in the other breeds, and are of particular interest. An increased frequency of unique mutations in a gene, group of genes (genes from the same GO term), or genomic region may indicate the genetic processes occurring in the RSW. The accumulation of mutations occurs unevenly throughout the genome, so when searching for genes (groups of genes or genomic regions) with an increased frequency of unique mutations, the appropriate statistical methods must be used. Consider the set S of genomic positions on the RSW genome that have a nucleotide difference compared to one of the other breeds. This set S may be a set of mutations of a particular gene, a group of genes with similar functions, or a particular region of a genome. Some of these positions are unique to the RSW in the sense that such a nucleotide is found only in the RSW breed. Let us denote the set of such unique mutations by U. Then, the proportion of unique mutations will be
p =|U|/|S|. Let us set S’ as a subset of S, with a corresponding set U’; for example, S’ can be a set of the mutations of one gene, all mutations of the genes belonging to a given GO category, or the mutations of a certain genomic region. As a null hypothesis, we use the statement that the proportion of unique mutations is the same in S and S’. In practice, S is sufficiently large, so |U’| has a binomial distribution,
where the success probability
p = |U|/|S|, the number of trials
n = |S’|, and the number of successes
k. We consider the excess of the proportion of unique mutations significant if the
p-value < 0.001, and additionally impose a filter on the effect size.
The applications of this method for various tasks include the following:
Searching for genes with an increased frequency of RSW-specific missense mutations. S is the set of all missense mutations of all genes, and S’ is the set of missense mutations of a given gene;
Searching for genes with an increased frequency of RSW-specific mutations near a gene (gene bounds ±10 kbp). S is the set of all mutations located near any gene, and S’ is the set of mutations near the given gene;
Searching for GO categories with an increased frequency of missense mutations. S is the set of all missense mutations of all genes from all GO categories, and S’ is the set of missense mutations of all genes of a given GO category;
Searching for GO categories with a high rate of RSW-specific mutations in a ±10 kbp window from the gene bounds. S is the set of all mutations located within the boundaries of genes ±10 kbp of all GO categories; S’ is the set of all mutations located within the boundaries of genes ±10 kbp of a given GO category.