2.1. LCRs in the 3D Structure of the Complex of HTT with HAP40
Human HTT (database record in UniProt: P42858) has been known since 1995 to be largely composed of alpha-helical tandem repeats [
9], but most of its structure was not solved until 2018, by electron microscopy (database record in the Protein Data Bank, PDB: 6EZ8, resolution 4.00 Å; [
5]). Notably, this was achieved in a complex with HAP40, which suggests that HTT is flexible and that its interactions may stabilize it, making its structural study feasible. The interaction between HTT and HAP40 is conserved from human to (at least) fish [
10].
Currently, the only other structures of HTT (and of HAP40) available in the PDB are two further structures of the HTT:HAP40 complex, both solved by electron microscopy and without peer-reviewed publications: 6RMH (9.60 Å), which investigates the effect of polyQ expansion [
11] and 6X9O (2.60 Å) [
12] (
Figure 1). We use the latter because of its high resolution.
In the 6X9O structure, the following gaps in HTT suggest disordered regions: 1–96, 407–665, 1165–1227, 1378–1422, and 2633–2662. Using these gaps, predictions of disorder (see later), and domain definitions from [
5] and [
1], we divided HTT into five domains (see caption of
Figure 1 for details).
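As a small worked example of the gap bookkeeping above, the following sketch computes the length of each unresolved HTT region listed for 6X9O and the resolved segments lying between consecutive gaps. The function names are illustrative assumptions. The segments are derived from the gaps alone, so they need not coincide with the five domains defined in the caption of Figure 1, and the stretch after the last gap is omitted because the total chain length is not stated here.

```python
# Gaps in the HTT chain of PDB 6X9O, as listed in the text (1-based, inclusive).
HTT_GAPS = [(1, 96), (407, 665), (1165, 1227), (1378, 1422), (2633, 2662)]

def gap_lengths(gaps):
    """Return the number of residues in each inclusive (start, end) gap."""
    return [end - start + 1 for start, end in gaps]

def resolved_segments(gaps):
    """Return the inclusive segments lying between consecutive gaps."""
    ordered = sorted(gaps)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]

lengths = gap_lengths(HTT_GAPS)
segments = resolved_segments(HTT_GAPS)
print(lengths, sum(lengths))
print(segments)
```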
HAP40 is tightly bound to the core of HTT, but there is a clear difference between the interface of HAP40 with the blue HTT domain (no insertions) and its interface with the cyan HTT domain (possibly containing interacting insertions). For HAP40, the gaps suggesting disordered regions are at positions: 1–82, 215–257, and 304–309. Two of these are situated close to disordered regions in the cyan HTT domain (
Figure 2).
Taken together with the presence of an A-rich region in the
C-terminal half of the HAP40 molecule (see alanines in HAP40 colored in yellow in
Figure 1 and
Figure 2), these observations suggest a mechanism of interaction between these two proteins that starts with the tightly fitting interaction of the
C-terminal of HTT with the
C-terminal half of HAP40, followed by a closing of HTT on the hydrophobic part of HAP40, with the cyan HTT domain grasping two disordered regions of HAP40 using its own disordered regions. This set of interacting flexible fragments in HAP40 and HTT could then serve to detect signaling events, via post-translational modifications (PTMs) that control the opening and closing of HTT on HAP40.
Since structural information is lacking for many of the LCRs in HTT and HAP40, we will present approaches to study the function and conservation of LCRs in HTT, in HAP40, and in other proteins that interact with HTT. We aim to discover (i) commonalities in the LCRs of multiple proteins interacting with HTT that could indicate similar LCR-modulated modes of interacting with HTT; (ii) the co-presence of LCRs in HTT interactors that could indicate mechanisms of interaction requiring the coordination of multiple LCRs; and (iii) co-evolution of LCRs in HTT and its interaction partners that could hint at their direct physical interaction.
2.2. Prediction of LCRs in HTT-Interactors
We obtained a list of human proteins interacting with HTT from the HIPPIE database of human protein–protein interactions, scored according to the reliability of the experimental evidence for the interaction (HIPPIE v2.0; [
14]). The current version of the database reports 402 HTT-interactors (including HTT itself as a self-interactor) with links to the manuscripts that report the experimental evidence of the interactions.
The enrichment of functions in this set (see the Methods section for details) reports many terms related to protein interactions (e.g., the top one is “protein binding”, but see also “protein complex”, “identical protein binding”, “protein stabilization”, and “transcription factor binding”), which suggests that the function of HTT as a structural hub is also shared by its partners (terms with
p-value < 1 × 10⁻⁵ in
Table 1; see
Supplementary File S1 for details). Other terms hint at the multiple subcellular locations and organelles that HTT might coordinate (e.g., cytosol, nucleoplasm, mitochondrion, membrane, extracellular exosome, autophagosome).
Although HTT is widely studied in a neural context, it is notable that “myelin sheath” is the only neural-specific term, whereas protein degradation terms (e.g., ubiquitin protein ligase binding) and terms related to cell–cell adhesion (e.g., cadherin binding involved in cell–cell adhesion) are also present. This is consistent with the role of HTT in clathrin-mediated endocytosis, vesicle transport, cell signaling, morphogenesis, and transcriptional regulation, with its localization in nuclei, cell bodies, dendrites, and nerve terminals, and with its ubiquitous expression, which is highest in the nervous system in accordance with its functions in neuronal transport processes, post-synaptic signaling, and neuron protection from apoptosis [
8].
Collecting predictions of LCRs in the set of proteins interacting with HTT and in their orthologs requires two careful decisions: (i) the selection of a reference set of species in which to find orthologs of HTT and its interactors, noting that the orthologs might not interact in the corresponding species, and (ii) the selection of the LCR types and the protocols to predict them. Here we chose coiled coils (CCs), because they are involved in protein interactions, together with intrinsically disordered regions (IDRs), compositionally biased regions (CBRs), and homorepeats.
CCs are a structural motif with a seven amino acid repeat resulting in alpha-helices that use hydrophobic residues to form homo-dimers or hetero-dimers with other coiled coils [
15]. Although CCs adopt structure when they multimerize, they are disordered in the monomeric state, which reflects their intrinsically disordered character [
16]. IDRs are defined by their lack of fixed structure and can be predicted in sequence by analyzing the propensity of consecutive amino acids to result in interactions stabilizing protein structure [
17]. CBRs are regions that have an amino acid composition that differs from average, and are usually characterized as regions rich in a particularly abundant amino acid [
18]. Homorepeats (or polyX) are consecutive stretches of the same amino acid [
19].
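To make the homorepeat definition concrete, the following minimal sketch finds perfect runs of a single amino acid in a sequence. It is an illustration only and not the detection protocol of this study (see the Methods section). The function name, the `min_len` threshold, and the toy sequence are assumptions, and practical polyX detectors usually tolerate a few interruptions within a window, which this sketch does not.

```python
import re

def find_perfect_homorepeats(sequence, min_len=6):
    """Return (residue, start, end) for uninterrupted runs of one residue.

    Coordinates are 1-based and inclusive. min_len is an assumed threshold.
    """
    # (.)\1{n,} matches a residue followed by at least n more copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start() + 1, m.end()) for m in pattern.finditer(sequence)]

# Hypothetical toy sequence embedding the HAP40 motif "PPPPPPAPQP" quoted in the text.
hits = find_perfect_homorepeats("MKAPPPPPPAPQPLL", min_len=6)
print(hits)
```

With an exact-run definition, only the six consecutive prolines of the motif are reported; the trailing "APQP" would be captured only by a detector that allows mismatches.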
These LCRs overlap in their definitions [
6]. For example, CCs often have reduced types of amino acids and have repetitive properties that make them similar to compositionally biased regions [
20]. Both CBRs and homorepeats can be of 20 types, one for each amino acid, while CCs and IDRs are not specifically associated with any amino acid. Generally, homorepeats are shorter (10 to 40 amino acids) than CCs, IDRs, and CBRs, which can reach lengths above 100 amino acids.
The choice of species is very relevant because the taxonomic levels included define which variations in LCRs can be detected. In this exploratory analysis we decided to cover two fish and eight Tetrapoda species (including two rodents and two Sauria) to have the full range of emergence of the mammalian N-terminal polyQ, without going too far in the vertebrate lineage. The four pairs of taxonomically related species (fish, Sauria, rodents, and primates) allow for the detection of conserved features at independent taxonomic ranges, which are important to discriminate functional types of LCRs.
The list of species used is visible in
Figure 3,
Figure 4 and
Figure 5 and in the Methods section. The details about how LCRs are calculated are provided in the Methods section. Importantly, we describe the presence or absence of each LCR per protein across the 10 species. This requires the consolidation of LCRs from different sequences that overlap in the alignment. See details in the Methods section.
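The consolidation step can be pictured with the following sketch, which merges LCRs of one type that overlap in alignment coordinates and records the species in which each merged feature is present, yielding a presence/absence string per feature. This is an assumed simplification (merging on any overlap); the actual consolidation rules are those given in the Methods section, and the species labels and coordinates below are hypothetical.

```python
def consolidate(features):
    """Merge overlapping alignment intervals and track the species carrying them.

    features: list of (species, start, end) for ONE LCR type in ONE alignment,
    with start/end given as inclusive alignment columns.
    Returns a list of (start, end, set_of_species).
    """
    merged = []
    for species, start, end in sorted(features, key=lambda f: (f[1], f[2])):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous consolidated feature: extend it.
            last_start, last_end, members = merged[-1]
            merged[-1] = (last_start, max(last_end, end), members | {species})
        else:
            merged.append((start, end, {species}))
    return merged

def conservation_profile(members, species_order):
    """Encode presence (1) or absence (0) of a feature in a fixed species order."""
    return "".join("1" if s in members else "0" for s in species_order)

# Hypothetical example with four species.
species_order = ["sp1", "sp2", "sp3", "sp4"]
features = [("sp1", 10, 30), ("sp2", 25, 40), ("sp4", 100, 120)]
merged = consolidate(features)
profiles = [conservation_profile(m, species_order) for _, _, m in merged]
print(merged, profiles)
```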
The number of consolidated features detected is reported in
Table 2. Note that it is possible to have multiple occurrences of the same LCR type (e.g., CCs, IDRs, or polyQs) in the same alignment, and that LCRs of different types can overlap; this happens often since, as mentioned before, their definitions overlap.
We found an average of six LCRs per alignment: in total 236 polyX, 422 CCs, 658 IDRs, and 1035 CBRs. In general, CBRs of a given type were more frequent than the corresponding polyX, with the exception of L-rich CBRs and polyL (found in 4 and 11 alignments, respectively). Note that these LCRs are detected in alignments of orthologs, so they can be present in one or a few sequences and not necessarily in the human protein. We will study their conservation below.
To analyze if the LCR usage in HTT-interactors differs from other proteins, we computed the enrichment for the frequency of the LCRs found in the human proteins compared to their occurrence in the entire human proteome (
Supplementary File S10). The enrichment in CCs is very significant (89 of the 402 interactors had them, compared to 2083 of 20,609 proteins in the human proteome,
p-value = 3.48 × 10⁻¹²), and that of IDRs is modest (208 of the HTT-interactors had them, compared to 9084 in the human proteome,
p-value = 0.00233).
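To give a sense of the size of the CC enrichment, the following sketch computes the expected number of CC-containing interactors, the fold enrichment, and a one-sided hypergeometric tail probability from the counts quoted above. The choice of a hypergeometric test, and the treatment of the 402 interactors as a draw from the 20,609 background proteins, are assumptions made for illustration; the test actually used is described in the Methods section, so the resulting value need not match the reported p-value exactly.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n proteins from N, of which K carry the feature."""
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total  # exact big-integer ratio converted to float

# Counts quoted in the text for coiled coils.
interactors, with_cc = 402, 89
proteome, proteome_with_cc = 20609, 2083

expected = interactors * proteome_with_cc / proteome
fold = with_cc / expected
p_cc = hypergeom_upper_tail(with_cc, interactors, proteome_with_cc, proteome)
print(f"expected={expected:.1f} fold={fold:.2f} p={p_cc:.3g}")
```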
HTT has a polyQ, an LCR that has been proposed to function in the modulation of interactions between CCs [
22], and it is known that proteins that interact with polyQ proteins are enriched in CCs [
23]. Regarding LCRs by type, E- and K-rich CBRs are significantly enriched (
p-values = 2.72 × 10⁻¹² and 7.65 × 10⁻⁵, respectively) with moderate enrichments of Q-, S-, and T-rich CBRs (
p-value < 0.01). The only significantly enriched polyX we found was polyQ (10 out of 177,
p-value = 0.00224), also consistent with the typical enrichment of polyQ proteins among proteins that interact with polyQ proteins [
23].
Next, we illustrate the predictions obtained for HTT and HAP40, to understand how different LCRs overlap, how they correspond to the current 3D information that we have available, and how they may play a role in the protein interactions of HTT.
2.3. LCRs in HTT and HAP40: Categories, Overlap, and Conservation
Regarding the HTT alignment (with human P42858 as the leading sequence; group number #4; alignment available in
Supplementary File S4), we detected 15 LCRs: zero CCs, four IDRs, seven CBRs, and four polyX (
Table 3; also as entries for P42858 in
Supplementary File S2). Some of these LCRs reflect known functional disordered regions in HTT and overlap each other. Most of them are absent from the 3D structure due to their flexibility. For example, the large fragment missing between HEAT repeats 6 and 7 (from positions 400 to 674 in human HTT) is quite accurately detected as a completely conserved (present in all ten orthologs) IDR and also as an S-rich CBR (
Table 3; note that the coordinates there are from the alignment of the orthologs and will differ from the protein coordinates). We show three fragments of the HTT alignment #4 in
Figure 3 to illustrate other disordered regions.
In
Figure 3A we see the
N-terminal region of HTT, which in the human sequence contains the polyQ whose expansion causes a genetic disease that manifests as protein aggregates. The polyQ followed by polyP functions in the modulation of protein interactions mediated by coiled coils [
23]. The polyQ prolongs the preceding alpha-helix in an
N-terminal gradient where closest residues adopt the helical conformation more often [
24]. The polyP overlaps a proline-rich domain (PRD) that adopts a stiff structure thought to reduce the aggregation propensity of the preceding polyQ [
25]. This PRD was noted in HTT [
26]. Its structure has been studied in the context of disease caused by the expanded polyQ but it has a function and participates in interactions with a number of proteins [
8]. The polyQ has been detected also as a longer Q-rich CBR and both overlap a polyP and a P-rich CBR (
Figure 3A). These features (including the IDR encompassing them all) are mostly detected in five of the orthologs (mammalian species).
In
Figure 3B we focus on a small region that is absent from the 3D coordinates, and thus likely disordered, but was not detected as an IDR. This region is too small for the window used and indicates the limits of the detection method. Had an insertion in any of the orthologs made the region longer, it would have been identified. However, in this set of species the length of this region is completely conserved: there are neither insertions nor deletions in the alignment. This region is between the cyan and the orange domain (
Figure 1) and could be in contact with unresolved disordered regions from HAP40 (
Figure 2).
The third fragment we showcase is another strongly conserved LCR with an E-bias (
Figure 3C) corresponding to HTT human coordinates 2614–2664 within the blue domain (
Figure 1) and facing away from the pocket holding HAP40. As far as we know, this has not been noticed before and it is not known to have a function, but its conservation suggests that it must be functional. It is predicted to be part of an
N-terminally extended IDR and this is precisely matched by the absence of 3D information (
Figure 3C). The slightly lower frequency of glutamic acid residues in both fish species makes the region fall just below the detection thresholds (see patterns of conservation in
Table 3 for features in alignment coordinates around 2732–2763). However, manual inspection of the alignment indicates that the acidic character of the region is conserved with some aspartic residues in place of the glutamic residues observed in Tetrapoda species (
Figure 3C).
We now discuss the LCRs detected in HAP40 and contrast them with the structural information available for HAP40 in complex with HTT. HAP40 (UniProt: P23610) has a HIPPIE score of 0.73 (>0.72; high confidence HIPPIE score category) for its interaction with HTT. We have an alignment for orthologs in eight of the ten species (missing
S. harrisii and
G. gallus; alignment #6 in
Supplementary File S4). We detected five LCRs in this alignment (
Table 3). The most significant ones are encompassed in the fragment shown in
Figure 4: one IDR, a more extended conserved A-rich region, an overlapping P-rich region, and a shorter overlapping polyP. The A-rich region corresponds in the human sequence to VAEAG-AALGA (coordinates 43–261) including a polyP “PPPPPPAPQP” (coordinates 223–232). The P-rich region is missing from the 3D structure (
Figure 2) and it is likely flexible. It is far away from the
N-terminal of HTT and possibly not competing with its polyP.
In the complex between HAP40 and HTT, this disordered P-rich region in HAP40 is close to several disordered regions in HTT (
Figure 2), including the one we depict in
Figure 3B. This HTT region is serine-rich. Serine phosphorylation is a mechanism by which protein interactions are regulated and this occurs particularly in IDRs, which are appropriate for this regulatory function and modification since they are often exposed [
4]. Regulation by serine phosphorylation is already known for HTT. For example, CDK5 phosphorylates serines at positions 1179 and 1199 in response to DNA damage [
27]. These serine residues happen to be in a region missing from the 3D structure within the cyan domain (human HTT coordinates 1162–1226) and are predicted to be a conserved IDR overlapping a conserved S-rich CBR (
Table 3; alignment coordinates 1245–1305 and 1033–1312, respectively).
Apparently, this polyP in HAP40 is dispensable for interaction with HTT [
5], reflecting the fact that it is not in the interface between these two proteins and seems to be flexible even when they form a complex. However, it might have a regulatory role. The more disordered regions are used to close and decorate an interaction, the more complex the combination required to unlock the interaction could be. The interplay between disordered regions proximal in 3D from two interacting partners is likely to provide an explosion of combinatorial possibilities for regulation.
2.4. Different Modes of Interaction with HTT
Since HTT is a large protein, there can be different modes of interaction with it, and HAP40 could be an example of one of them. If LCRs are used in these interactions, we could detect other proteins binding like HAP40 if they have a similar profile of LCRs.
Unlike HAP40, many interactors bind the
N-terminal of HTT. This is relevant to the pathogenicity of polyQ-expanded HTT because these interactors increase the size of pathogenic aggregates. This is an effect that has been observed for multiple proteins and can be explained by the spoiling effect that the expanded polyQ has in the modulated coiled-coil interaction [
28]. Proteins that bind HTT elsewhere may not affect the formation of aggregates and may even reduce the aggregates because they can sequester the toxic protein from the aggregates instead of contributing to them. This effect has been demonstrated in a different polyQ expanded protein: a toxic construct of Ataxin1 [
29].
The PRD is a relevant feature in the
N-terminal of HTT and a few proteins are known to interact with it [
8]. Some of them do so with WW domains [
30] and some with SH3 domains [
31]. Interestingly, HTT-interactors are very significantly enriched in WW and SH3 domain-containing proteins as both of them are domains that interact with P-rich regions [
32]: 11 interactors vs. 53 in background (
p-value = 4 × 10⁻⁹) and 23 interactors vs. 225 in background (p-value = 7.34 × 10⁻¹¹), respectively. No HTT interactor has both domains (in the entire human proteome only four proteins have both SH3 and WW domains) but each of these domains tends to occur more than once, often in tandem. This multiplicity of domains binding proline-rich regions, together with the presence of competing proline-rich regions in the sequences themselves (see e.g., the P-rich region modulating the interaction of WW domain-containing protein HYPB/SETD2 with the PRD in HTT [
33]), suggests that interactions between multiple LCRs and the domains recognizing them can result in very rich and complex regulatory patterns. Interestingly, the HTT-interactors containing WW/SH3 domains have an average HIPPIE score of 0.72, which is a high confidence level, and is above the 0.66 average value of all HTT-interactors.
The PRD has a 1111100000 conservation profile (
Figure 3A), which suggests that HTT gained interactions with WW and SH3 domain-containing proteins within the mammalian lineage. Since this domain is not conserved throughout evolution, this raises the question of whether the WW proteins that interact with human HTT also do so in species that lack the PRD.
2.5. LCRs in HTT Interactors: Conservation, Abundance, and Co-Occurrence
To look for groups of HTT interactors using similar LCRs we need to evaluate (i) their levels of conservation, (ii) which types of LCRs are (unusually) frequent among HTT interactors, and ultimately (iii) whether some of them occur together and could reveal LCR-regulated modes of interaction. The data we present in the next paragraphs report the frequency, composition, and conservation patterns of LCRs in HTT interactors, and we will use these data to guide our search for LCRs that could work together for interactions with HTT among those unusually frequent and similarly conserved in HTT interactors.
The general conservation patterns by feature are described in
Supplementary Figure S1. For CCs, IDRs, and CBRs, the most frequent pattern is full conservation, particularly for IDRs, which appear to be the most stable type of LCR. In contrast, polyX appears to be the most evolutionarily dynamic LCR in the taxonomic sampling of 10 species chosen for this study. A total of six polyX conservation patterns are found 10 or more times, showing specificity for the most distant species: for each of the fish, for both of them, for
A. carolinensis,
X. tropicalis, or for
S. harrisii.
Interestingly, the set of HTT interactors containing at least one fully conserved LCR has a higher average HIPPIE score than the full set of 402 interactors (values of 0.71, 0.69, 0.69, and 0.72 for CCs, IDRs, CBRs, and polyX, respectively, versus 0.67 for the full set), suggesting that IDR conservation is associated with a reliably measured interaction.
It is known that there are differences in the conservation patterns of CBRs [
7] and polyX [
34] depending on the type of amino acid. We show the main conservation patterns by amino acid type in HTT interactors in
Supplementary Figure S2.
Regarding CBRs, two of the most frequent (
Table 2) are E- and S-rich CBRs, which show a clear bias towards full conservation (
Supplementary Figure S2A). Other frequent CBRs are G-, A-, K-, Q-, and P-rich, with full conservation not being so separated from the rest of the patterns. These results agree with the general stability previously observed for E-rich regions and the more dynamic behavior observed for Q-rich regions [
7]. Most of the CBR conservation patterns indicate conservation in one or very few species, indicating fast selection. Patterns where the LCR is lost in one species and conserved in the rest are rare, suggesting that once a CBR is gained it is difficult to remove it. These results indicate that CBRs are selected and functional.
Regarding polyX, which were identified on fewer occasions than CBRs, the most frequent are polyE, polyP, and polyS (the same amino acids associated with the top CBRs;
Table 2). Unlike for CBRs, however, the most frequent conservation patterns correspond to conservation in one species, rarely in more (
Supplementary Figure S2B).
We computed the enrichment of LCRs by contrasting the number of proteins containing LCRs of each type found among the 402 human HTT-interactors with the background of 20,609 human proteins (see the Methods section for details). We note that the size of HTT-interactors is not significantly different from the size of other human proteins (614 to 551, respectively).
We found a significant enrichment in IDRs (208 have IDRs,
p-value = 0.002), which tend to be, but are not necessarily, regions of low complexity [
6]. Coiled coil regions tend to have a reduced number of amino acid types and are low in complexity but adopt alpha-helical conformation and are often involved in interactions with other coiled coils, thus they tend to be a motif for protein interaction [
35] and this can be detected by their co-evolution in interacting proteins [
36]. The enrichment in coiled coil-containing proteins is extremely significant among HTT-interactors (89 contain them,
p-value = 3.48 × 10⁻¹²). Considering that HTT has a mode of interaction via polyQ following the
N-terminal helix, this is not too surprising, and we expect that many of these proteins interact with the
N-terminal of HTT.
Regarding amino acid-rich regions, there is a very strong enrichment in E-rich regions (110 interactors containing them,
p-value = 2.72 × 10⁻¹²) and K-rich regions (53 interactors,
p-value = 7.65 × 10⁻⁵). E-rich regions were also often found to be conserved in all 10 species considered (
Supplementary Figure S2A). Q-rich regions were also enriched (40 interactors,
p-value = 3.34 × 10⁻⁴) but were more variable among species (
Supplementary Figure S2A).
Given the lower number of polyX identified, it was more unlikely to find significantly enriched polyX in HTT-interactors. The only one found was polyQ (10 interactors,
p-value = 0.0022), which is also consistent with CC interactions with HTT and its polyQ. This association is found when examining the entire proteome (CC and polyQ tend to coexist [
23]) but we could not find this association in HTT interactors given the low numbers (only 1 of the 10 HTT-interactors with polyQ was predicted to contain CC).
Next, we calculated the co-occurrence of features (IDRs, CCs, the 20 CBRs, and the 20 polyX) in HTT interactors and their orthologs in the 10 species. Co-occurrence was calculated across all proteins by generating, for each LCR, a vector of presence and absence in these proteins and then computing the Jaccard score between all pairs of vectors (see Methods section for details). A score of 0 means that the pair of LCRs never co-occurs in any protein, and a score of 1 represents perfect co-occurrence in all proteins. LCRs that occur more often have a better chance of producing higher values than those that occur very rarely. However, for a given LCR the matrix of values allows us to find the LCRs with which it co-occurs most often (
Supplementary Figure S3).
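The Jaccard score between two presence/absence vectors is the number of proteins containing both LCRs divided by the number containing at least one of them. The following sketch illustrates the calculation on hypothetical toy vectors; the function names and data are assumptions for illustration, and the convention of returning 0 when neither LCR occurs in any protein is also an assumption.

```python
def jaccard(a, b):
    """Jaccard score between two equal-length 0/1 presence vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def cooccurrence_matrix(presence):
    """presence: dict mapping LCR name -> 0/1 vector over the same proteins."""
    names = sorted(presence)
    return {(p, q): jaccard(presence[p], presence[q]) for p in names for q in names}

# Hypothetical presence/absence of three LCR types in five proteins.
toy = {
    "E-rich": [1, 1, 0, 1, 0],
    "K-rich": [1, 0, 0, 1, 0],
    "polyQ":  [0, 0, 1, 0, 0],
}
scores = cooccurrence_matrix(toy)
print(scores[("E-rich", "K-rich")], scores[("E-rich", "polyQ")])
```

In the toy data, the E-rich and K-rich features share two of the three proteins in which either occurs, giving a score of 2/3, whereas the E-rich and polyQ features never coincide and score 0.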
Co-occurrence can be due to LCRs being detected for the same region (overlapping features). This explains, for example, the high score obtained for the co-occurrence of CCs with IDRs: CCs are often identified as IDRs reflecting their disordered nature when they are in monomeric state [
37]. Beyond this high co-occurrence between IDRs and CCs, corresponding to a tendency to identify CCs as being part of IDRs, other high values of co-occurrence with IDRs probably due to overlap are observed for E- and S-rich CBRs, which are the two CBRs observed more frequently among HTT-interactors (
Supplementary Figure S2A).
Co-occurrence due to overlap can also be expected between each polyX and the corresponding X-rich CBR. This is observed for the most frequent ones, such as polyG and G-rich CBRs (value of 0.25;
Supplementary Figure S3). Regarding co-occurrence of different polyX, unlikely to result from overlap, the highest value corresponds to the known association of polyP with polyQ (0.10).
The scores of co-occurrence between different X-rich regions suggest other interesting pairings not due to overlap. The strongest signal is from E-rich and K-rich regions (0.24). Considering consolidated regions (
Table 2), E-rich regions are the most frequent CBRs, but K-rich ones are less frequent than S-, P-, and Q-rich CBRs. In contrast, P- and Q-rich CBRs seem to co-occur with S-rich regions (scores of 0.22 and 0.20, respectively).
If we compare the co-occurrence values for IDRs (1st row) with those for CCs (2nd row), we can see that values are generally lower for CCs (e.g., A-rich regions score 0.08 with IDRs and 0.05 with CCs). The prominent exception is the Q-rich regions, which co-occur more strongly with CCs. This agrees with the role of polyQ in modulating CC interactions [
23] that seems to extend in this context also to Q-rich regions.
Since CCs could be characteristic of the mode of interaction with the
N-terminal helix and polyQ of HTT, we can consider the CBRs most depleted in co-occurrence with CCs as candidates for a different mode of interaction. In fact, according to
Supplementary Figure S3, A-rich CBRs have a high ratio of co-occurrence with IDRs versus CC (0.08 to 0.05), thus making them a good LCR candidate. Since polyP has one of the top co-occurrence scores of polyX with A-rich CBRs (0.06), it seems appropriate to search for HTT-interactors with these two features to propose proteins that could bind like HAP40. HAP40 has a conserved A-rich CBR followed by a polyP conserved in human and
P. troglodytes that aligns to an apparently disordered region in the other species (
Figure 4). We could then mine the data to identify HTT interactors with these LCRs as HAP40-like HTT interactors.
Among the 402 HTT-interactor alignments, we identified 76 A-rich CBRs in 66 alignments. Since this could be an important feature for conserved interaction, we next considered A-rich regions conserved in at least four species, which reduced the list to 15 A-rich regions in 15 alignments. To ensure that they were diffuse as in HAP40 (as opposed to arising from a polyA, that is, a stretch of consecutive alanines), we removed those overlapping with any polyA (no matter how conserved it was); this reduced the set to 9 alignments. Interestingly, all 9 contained at least one P-rich CBR. To increase similarity to HAP40, we next required them to have a polyP (no matter how conserved), so that they did not have just diffuse P-rich regions. Four of them had a polyP, that is, HAP40 and three other proteins. Interestingly, in all three other cases the polyP bordered the A-rich region, twice at the
C-terminal, as in HAP40, and once at the
N-terminal (
Figure 5).
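The successive filters described above can be summarized in the following sketch, which flags an alignment as HAP40-like when it has an A-rich CBR conserved in at least four species that does not overlap any polyA, together with a polyP of any conservation level. The `Feature` record, the function names, and the example coordinates are hypothetical and serve only to illustrate the logic; they are not the data structures used in this study.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    kind: str        # "CBR" or "polyX"
    residue: str     # one-letter amino acid code
    start: int       # alignment column, inclusive
    end: int         # alignment column, inclusive
    n_species: int   # number of species in which the feature is present

def overlaps(a, b):
    """True if two features share at least one alignment column."""
    return a.start <= b.end and b.start <= a.end

def is_hap40_like(features, min_species=4):
    """Apply the filters from the text to the features of one alignment."""
    poly_a = [f for f in features if f.kind == "polyX" and f.residue == "A"]
    has_poly_p = any(f.kind == "polyX" and f.residue == "P" for f in features)
    diffuse_a_rich = [
        f for f in features
        if f.kind == "CBR" and f.residue == "A" and f.n_species >= min_species
        and not any(overlaps(f, p) for p in poly_a)  # discard polyA-driven regions
    ]
    return bool(diffuse_a_rich) and has_poly_p

# Hypothetical alignments (coordinates and species counts are invented).
candidate = [Feature("CBR", "A", 40, 260, 8), Feature("polyX", "P", 220, 230, 2)]
poly_a_driven = candidate + [Feature("polyX", "A", 50, 60, 3)]
weakly_conserved = [Feature("CBR", "A", 40, 260, 2), Feature("polyX", "P", 220, 230, 2)]
print(is_hap40_like(candidate), is_hap40_like(poly_a_driven), is_hap40_like(weakly_conserved))
```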
In human RASA1 (P20936), the A-rich CBR and its N-terminal polyP are in the N-terminal 150 amino acids of the protein, before the most N-terminal predicted domain, an SH2 domain (protein coordinates 181–272). In human SYN2 (Q92777), the A-rich CBR and its C-terminal polyP occupy positions identified as a linker (31–113) N-terminal from domains binding actin and synaptic-vesicles. In human KAT2B (Q92831) the A-rich CBR and the polyP are in the N-terminal 130 amino acids of the protein, well outside the N-acetyltransferase (503–651) and Bromo domains (740–810) known in this sequence. Our results suggest that these three proteins might use a similar mode of interaction with HTT as HAP40, consisting of a region rich in alanines (therefore hydrophobic yet relatively flexible, since alanine is an amino acid with a small side chain), flanked by a polyP that could be modulating their interaction with HTT.
2.6. IDR Coevolution of HTT and HTT-Interactors
In this last section of the results, we explore an alternative use of the correlation study between groups of orthologs focused on IDRs. Now, because we are going to use IDR predictions, we ignore the amino acid type, but become more precise by comparing the position of the IDRs in individual aligned sequences, that is, we will not be using the consolidated data.
The main hypothesis remains: because many IDRs are involved in protein interactions, we hypothesize that the evolution of IDRs could be correlated (i) between proteins that interact using a similar mode of interaction, and additionally (ii) with the protein (or protein fragment in this case, see below) they interact with. For example, if a number of proteins interact with a region of HTT where some IDRs are inserted in evolution that require the parallel insertion (or deletion) of an interacting IDR in these proteins, then this should generate parallel variations in the IDRs of the corresponding species as well as parallel variations of IDRs in HTT.
We previously applied this hypothesis to study the correlation between orthologs conserved across five metazoan species [
7]. Here we apply this approach in a more focused way to find IDR correlations between HTT and the eleven WW-domain containing HTT interactors, to try to get information on the mode of interaction of these proteins with HTT.
Since HTT is a large protein, we considered the five HTT regions defined above separated by large IDRs (
Figure 1). We wanted to find out if the HTT fragments behaved differently, and if we could discriminate between the putative WW-domain containing HTT-interactors and find where they bind to. Correlations in IDR variation between the HTT fragments and the interactors should point to the position of the interface of interaction in HTT.
HTT fragments and orthologs were clustered according to the scores of overlap between all possible pairs of species (see Methods section for details,
Figure 6). HTT fragments 2, 4, and 5 clustered with four of the interactors. Three of these are among the top four of the eleven WW-domain containing HTT-interactors according to their scores in the HIPPIE database, indicating that this cluster detected interactors with the best experimental information reporting their interaction with HTT (average score of 0.83 versus an average score of 0.66 for the other seven interactors). The interactor in the cluster with the lowest score (0.63) was WBP4 (O75554), for which no specific publication experimentally describes its interaction with HTT.
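One way to picture this kind of comparison is sketched below: for each protein (or HTT fragment), an overlap score is computed between the IDR positions of every pair of species, and the resulting vectors are then compared between proteins. The specific choices here are assumptions made for illustration, namely a Jaccard overlap of alignment columns, a score of 1 when neither species has an IDR, and Pearson correlation as the similarity between profiles; the scores and clustering procedure actually used are those described in the Methods section. All names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def overlap_score(cols_a, cols_b):
    """Jaccard overlap between two sets of alignment columns predicted as IDR."""
    union = cols_a | cols_b
    # Assumption: two species that both lack IDRs are treated as identical.
    return len(cols_a & cols_b) / len(union) if union else 1.0

def species_pair_profile(idr_columns, species_order):
    """Vector of overlap scores for every pair of species, in a fixed order."""
    return [overlap_score(idr_columns[a], idr_columns[b])
            for a, b in combinations(species_order, 2)]

def pearson(x, y):
    """Pearson correlation between two equal-length vectors (0.0 if constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical IDR columns for an HTT fragment and an interactor in three species.
species = ["s1", "s2", "s3"]
fragment = {"s1": set(range(10, 20)), "s2": set(range(10, 20)), "s3": set()}
interactor = {"s1": set(range(50, 80)), "s2": set(range(50, 80)), "s3": set()}

frag_profile = species_pair_profile(fragment, species)
inter_profile = species_pair_profile(interactor, species)
r = pearson(frag_profile, inter_profile)
print(frag_profile, inter_profile, r)
```

In this toy case both proteins lose their IDR in the same species, so their profiles of between-species variation coincide and the correlation is maximal, which is the kind of parallel variation the hypothesis above predicts for interacting regions.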
Regarding the HTT fragments 1 and 3, they did not display much variability and lacked IDRs for some species, so they clustered together. Fragments 2, 4, and 5 had very similar profiles of variation and clustered together. This result suggests coordinated IDR evolution between the large domains of HTT but does not help to define a position for the interaction of the WW-domain containing interactors. It does suggest that WW domain interactors must interact with other HTT regions beyond their expected interaction with the PRD of the
N-terminal of HTT. One could expect this to happen on the opposite side of the pocket admitting HAP40, which would allow these proteins to contact fragments 2, 4, and 5 of HTT (violet, orange, and blue in
Figure 1). This could also mean that the interactions of these proteins with HTT are more stable (thus easier to detect) than those of the average interactor. In fact, three of these interactors were detected in an early yeast two-hybrid (Y2H) study [
30] with ten other non-WW interactors, and received further support in the meantime from other studies.