1. Introduction
Direct methods play a crucial role in solving protein crystal structures directly from diffraction data without relying on prior information such as heavy-atom derivatives or homologous structures [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]. Traditional direct methods for phasing small-molecule crystals [17,18,19,20,21,22,23,24,25,26,27,28,29] use the triplet relation and the tangent formula [21,22,23], the vive la différence algorithm (VLD) [28], the electron-density modification (EDM) and the difference electron-density modification (DEDM) [29], etc. There are packages for ab initio crystal structure solutions for small and medium molecules using direct methods, such as SHELX [27] and SIR (semi-invariants representation) [30]. The ab initio phase retrieval for macromolecule crystals utilizes iterative projection algorithms, such as the hybrid input–output algorithm (HIO) [1] and the difference map algorithm (DM) [5]. HIO is usually used for ab initio phase retrieval of macromolecules. It is cyclic and easy to implement, with a large convergence radius when the crystal has a high solvent content. While the direct method for macromolecule crystals has been successfully tested in retrieving known crystal structures of proteins and viruses [8,9,10,11,12,13,14,15,16], it has not been applied to solve the atomic structure of unknown protein crystals.
The term “envelope” in this context refers to a boundary that demarcates the region occupied by the protein from the bulk solvent in the crystal unit cell. Researchers have used various methods to determine a support or envelope for single particles, such as employing a simple geometric shape, manual inspection of a poor electron-density map, utilizing a homologous structure or cryo-EM reconstruction, examining an averaged map, and employing autocorrelation functions, density connectivities and shrinking support techniques [7,31,32,33,34,35,36,37]. The envelope plays a crucial role while using direct methods to solve protein crystal structures. Protein and bulk solvent have different density constraints. In order to apply density constraints correctly, a good envelope is required. A well-designed strategy for reconstructing the envelope can limit the search space, effectively reduce degrees of freedom, and speed up phase retrieval.
When dealing with protein crystals, direct methods typically utilize a low-resolution envelope estimate. Previous studies have employed fixed and predefined protein boundaries along with the HIO algorithm to phase protein crystals [8]. Additionally, iterative projection algorithms such as DM and low-resolution envelopes have been employed to retrieve lost phases in protein and virus crystals [9,11,38]. When the protein crystal has a high solvent content, the protein envelope and the structure can be reconstructed from the diffraction amplitudes alone [10,15].
Constructing a good envelope during direct phasing can be challenging, especially when starting from random phases, as the calculated density is nearly random [39]. To address this, we propose a transition approach to refine the reconstructed envelope, thereby improving direct phasing.
Direct phasing of protein crystals demands a balance between two competing constraints: a large bulk solvent and a well-defined envelope. To achieve a unique solution for the phase problem, a large solvent fraction (>65%) is necessary [4,8,10,15,40,41]. Iterative projection algorithms, such as HIO and DM, retrieve the lost phases by modifying the calculated density outside the protein envelope [8,10,11,15]. Therefore, the envelope must encompass all protein residues in the unit cell, and a large and loose envelope is typically employed to cover the complete protein structure.
However, this approach can lead to undesirable effects where the loose protein envelope squeezes the bulk solvent outside the envelope within the unit cell, hindering direct phasing. To overcome this challenge, we propose the use of a transition region, a thin layer expected to contain a well-defined protein envelope, to differentiate between the protein and bulk solvent regions in the crystal unit cell.
The proposed transition region approach involves the construction of the transition region from scratch using a weighted-average density map with a larger averaging radius, starting from random phases and diffraction data. Inside this region, we introduce a transition hybrid input–output (THIO) algorithm to reconstruct and refine the protein envelope on a finer weighted-average density map with a smaller averaging radius. The benefits of the proposed transition region for direct phasing are elaborated in Section 2 and Section 3.
In our experiments, we tested the transition HIO method on diffraction data from six protein crystals, discussed in Section 3, five of which had high-resolution data ranging from 1.5 to 1.97 Å, with solvent contents of 60% to 65%. Our results clearly demonstrate that the proposed transition region significantly increases the success rate and speed of direct phasing compared to traditional HIO methods. Additionally, we assessed the transition region’s performance with low-resolution diffraction data.
The results presented in this study establish the efficacy of the proposed transition region approach for refining protein envelopes and its potential for enhancing direct phasing in protein crystallography.
2. Methodology
2.1. Direct Phasing Method of Protein Crystallography
Analyzing the uniqueness of the solution is a critical step in solving the phase problem of protein crystallography [4,8,10,40,41]. The unit cell in real space is divided into $N$ grid points, and the goal is to determine the densities at these points. After a fast Fourier transform, $N$ complex structure factors in reciprocal space are obtained, with $N/2$ of them being independent due to the real nature of the densities.
In experimental setups, phases of diffracted beams are lost, and diffraction intensities of the independent structure factors are recorded. This results in an under-determined phase problem since we need to retrieve N densities from diffraction intensities. To address this issue and ensure a unique solution, density constraints such as high solvent content or non-crystallographic symmetry (NCS) are essential to increase the redundancy of the phase problem. If the bulk solvent in the crystal unit cell occupies more than half of the unit cell, the phase problem becomes overdetermined, but in practice, due to factors like missing reflections, measurement errors, and an inaccurate protein envelope, a high solvent content is still required for successful direct phasing of protein crystals.
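The counting argument above can be checked numerically. The following is a minimal sketch, assuming a small cubic grid of random real densities (the grid size and variable names are illustrative, not taken from this work). It verifies that the structure factors of a real density obey Friedel symmetry, so only about half of the N magnitudes are independent.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = rng.normal(size=(8, 8, 8))      # N = 512 real densities (illustrative)
F = np.fft.fftn(rho)                  # N complex structure factors

# The Friedel mate of index h is (-h) modulo the grid size.
F_mate = np.roll(np.flip(F), shift=1, axis=(0, 1, 2))
friedel_ok = np.allclose(F_mate, np.conj(F))   # F(-h) = conj(F(h))

# Count reflections that are not related to another one by Friedel symmetry.
idx = np.indices(rho.shape).reshape(3, -1).T   # every index h
mate = (-idx) % np.array(rho.shape)            # every index -h
self_conjugate = int(np.sum(np.all(idx == mate, axis=1)))
n_independent = (rho.size + self_conjugate) // 2

print(friedel_ok, rho.size, n_independent)
```

On this grid the count is slightly above N/2 because a few reflections are their own Friedel mates; the difference is negligible for realistic grid sizes.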
To find the unique solution, a direct method with the HIO algorithm is employed [1,8,10,12,13,14,16]. The process starts from random numbers, and lost phases are iteratively retrieved through thousands of iterations. In each iteration, the calculated density is modified based on density constraints in the protein crystal. A weighted-average density $w_i$ is computed from the calculated density using Equation (1), where $\rho_j$ is the calculated density at the $j$th grid point, $d_{ij}$ is the distance between two grid points $i$ and $j$, and $\sigma$ is a constant which controls the averaging radius. A cutoff value for $w_i$ is searched in accordance with the estimated solvent content of the unit cell. This cutoff value is used to distinguish between protein and bulk solvent regions.
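As an illustration of the weighted averaging and the solvent-content cutoff, the sketch below assumes a Gaussian weight in the inter-grid-point distance, an orthogonal unit cell and periodic boundaries. The weight form and the function names `weighted_average_density` and `protein_mask` are assumptions made for this example and are not taken from Equation (1).

```python
import numpy as np

def weighted_average_density(rho, sigma, voxel=1.0):
    """Gaussian-weighted average of rho on a periodic grid (assumed weight form)."""
    # Squared periodic distance from the origin, accumulated axis by axis.
    d2 = np.zeros(rho.shape)
    for axis, n in enumerate(rho.shape):
        k = np.arange(n)
        d = np.minimum(k, n - k) * voxel        # shortest periodic distance
        shape = [1] * rho.ndim
        shape[axis] = n
        d2 = d2 + (d ** 2).reshape(shape)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()                      # weights sum to 1
    # Periodic convolution through the convolution theorem.
    return np.real(np.fft.ifftn(np.fft.fftn(rho) * np.fft.fftn(kernel)))

def protein_mask(w, solvent_content):
    """Mark the top (1 - solvent_content) fraction of grid points as protein."""
    cutoff = np.quantile(w, solvent_content)
    return w > cutoff

rng = np.random.default_rng(0)
rho = rng.normal(size=(16, 16, 16))
w = weighted_average_density(rho, sigma=2.0)
mask = protein_mask(w, solvent_content=0.63)
print(mask.mean())   # fraction of grid points labeled protein
```

Using the FFT keeps the averaging cost near O(N log N) instead of summing over all pairs of grid points, which matters because the average is recomputed in every iteration.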
The calculated density in the protein region is improved using traditional histogram matching [42], while the density in the solvent region is modified using the HIO algorithm [1] or solvent flattening [43]. The HIO algorithm’s negative feedback pushes the solvent density toward zero, leading the calculated density closer to a solution. After density modification, the complex structure factors in reciprocal space are recalculated using an inverse Fourier transform. The observed diffraction data are then incorporated by replacing the calculated magnitudes with the observed values while keeping the calculated phases intact. A flowchart is shown in Figure 1.
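One cycle of the procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the density is expressed relative to the solvent level, histogram matching in the protein region is omitted, the observed magnitudes are available on the full FFT grid, and NumPy's forward/inverse FFT sign convention is used. The function name `hio_cycle` and the value of `beta` are hypothetical.

```python
import numpy as np

def hio_cycle(rho_prev_mod, f_obs, protein, beta=0.8):
    """One cycle: impose observed magnitudes, then apply the HIO solvent update.

    rho_prev_mod : modified density from the previous iteration (real array)
    f_obs        : observed magnitudes on the full FFT grid (same shape)
    protein      : boolean mask, True inside the current envelope
    """
    # Reciprocal space: keep the calculated phases, replace the magnitudes.
    F = np.fft.fftn(rho_prev_mod)
    F = f_obs * np.exp(1j * np.angle(F))
    rho = np.real(np.fft.ifftn(F))

    # Real space: keep the protein density as calculated (histogram matching
    # omitted here) and apply negative feedback in the solvent region.
    rho_mod = np.where(protein, rho, rho_prev_mod - beta * rho)
    return rho, rho_mod

rng = np.random.default_rng(0)
rho_true = rng.normal(size=(8, 8, 8))
f_obs = np.abs(np.fft.fftn(rho_true))
protein = rng.random(rho_true.shape) > 0.63
rho, rho_mod = hio_cycle(rho_true, f_obs, protein)
```

Starting the cycle from the true density leaves the calculated density unchanged, which is a convenient sanity check for the magnitude-replacement step.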
To achieve a solution, several hundred independent runs are usually performed in parallel on a multi-thread server. The iterative cycle involves the computation of four error metrics that characterize the retrieved phases and the calculated structure factors. To avoid overfitting, 5% of the diffraction data are reserved as a free data set; they are not involved in phase retrieval and are used to compute $R_{\mathrm{free}}$ using Equation (2). The remaining diffraction data are used to compute $R_{\mathrm{work}}$ using Equation (3), which represents the difference between the calculated magnitudes and the diffraction data. In these equations, $F_{\mathrm{obs}}$ and $F_{\mathrm{calc}}$ are the observed and calculated magnitudes, and $k$ is a scale factor.
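The two R values can be illustrated with the conventional crystallographic R factor. The sketch below assumes the standard definition (sum of absolute magnitude differences divided by the sum of observed magnitudes) and a least-squares scale factor; both are assumptions for this example and may differ in detail from Equations (2) and (3). The helper names are hypothetical.

```python
import numpy as np

def r_factor(f_obs, f_calc):
    """R = sum| F_obs - k F_calc | / sum F_obs, with a least-squares scale k."""
    f_obs = np.asarray(f_obs, dtype=float)
    f_calc = np.asarray(f_calc, dtype=float)
    k = np.sum(f_obs * f_calc) / np.sum(f_calc ** 2)   # assumed scale choice
    return np.sum(np.abs(f_obs - k * f_calc)) / np.sum(f_obs)

def split_free_set(n_reflections, free_fraction=0.05, seed=0):
    """Boolean mask marking the reflections reserved as the free set."""
    rng = np.random.default_rng(seed)
    free = np.zeros(n_reflections, dtype=bool)
    n_free = int(round(free_fraction * n_reflections))
    free[rng.choice(n_reflections, size=n_free, replace=False)] = True
    return free

rng = np.random.default_rng(1)
f_obs = rng.random(1000) + 0.1
f_calc = f_obs * (1.0 + 0.1 * rng.normal(size=1000))   # noisy model magnitudes
free = split_free_set(f_obs.size)
r_free = r_factor(f_obs[free], f_calc[free])
r_work = r_factor(f_obs[~free], f_calc[~free])
```

The free reflections are excluded from the magnitude replacement during phasing, so their R value is an unbiased check on the retrieved phases.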
Additionally, the mean phase error ($\Delta\varphi$) and the correlation coefficient (CC) are computed to assess the quality of the retrieved phases according to Equations (4) and (5). These metrics are utilized for testing the direct method and are unavailable for unknown structures. In these equations, $\varphi_{\mathrm{calc}}$ are the retrieved phases and $\varphi_{\mathrm{true}}$ are the true phases computed from the structure posted in the Protein Data Bank.
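For test structures, the two reference-based metrics can be sketched as below. The unweighted mean of wrapped phase differences and the Pearson correlation are assumptions for illustration; Equations (4) and (5) may use different weighting. The sketch also assumes the retrieved phases have already been brought to the same origin and hand as the reference.

```python
import numpy as np

def mean_phase_error(phi_calc, phi_true):
    """Mean absolute phase difference in degrees, wrapped into [0, 180]."""
    diff = np.asarray(phi_calc) - np.asarray(phi_true)
    wrapped = np.angle(np.exp(1j * diff))        # wrap into (-pi, pi]
    return float(np.degrees(np.mean(np.abs(wrapped))))

def correlation_coefficient(a, b):
    """Pearson correlation between two arrays, e.g. two density maps."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

phi_calc = np.radians([0.0, 350.0])
phi_true = np.radians([10.0, 0.0])
err = mean_phase_error(phi_calc, phi_true)       # both differences wrap to 10 degrees
```

Wrapping matters because a raw difference of 350° is really a 10° error; for completely random phases the unweighted mean is expected to be near 90°.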
The reconstruction of missing reflections resulting from the beam stop [8] and the treatment of observed reflections with significant measurement errors are also important aspects of the process and are handled with Equation (6). Furthermore, the free data used to compute $R_{\mathrm{free}}$ are also subjected to reconstruction to ensure accurate results.
2.2. Introducing a Transition Region to Refine Protein Envelopes
The direct method employs distinct constraints for protein and solvent regions within a crystal unit cell. The reconstruction of a reliable protein envelope, starting from random phases and diffraction data, is critical for its success. In our previous works [10,12,13,14,16], we sought a cutoff value on a weighted-average density map based on the estimated solvent content of the crystal. Grid points with a weighted-average density above this cutoff were designated as the protein region, while the remaining points constituted the solvent region. Instead of relying on a single cutoff value, we propose a transition region here. When starting from random phases, the calculated density is almost random, and the envelope reconstructed from it deviates from the true envelope: some protein structure lies outside the envelope and some bulk solvent lies inside it. The proposed transition region corresponds to that ambiguous zone. Figure 2 illustrates the transition region concept.
The transition region is defined by two cutoff values on the weighted-average density map. Let $s$ be the estimated solvent content. By sorting the grid points in the unit cell based on their weighted-average density, the grid points in the bottom $s - 5\%$ form the solvent region, while the ones in the top $1 - s - 5\%$ form the protein region. The remaining 10% of grid points constitute the transition region. It is important to strike a balance for the size of the transition region. If it is too large, it diminishes the volume occupied by the bulk solvent, which is not preferred for the direct phasing method. On the other hand, if it is too small, it cannot adequately cover and refine the complete protein envelope, rendering it equivalent to a single cutoff value as used in our previous work. In practice, the transition region empirically occupies approximately 10% of the unit cell volume at the outset of phase retrieval. We also tested slightly larger or smaller transition regions, yielding similar results. However, the volume of the transition region is not fixed for all iterations; instead, it linearly shrinks to zero towards the end of the iterations.
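The rank-based labeling and the linear shrinkage described above can be sketched as follows. The function names, the label constants and the use of quantiles to find the two cutoffs are assumptions made for this illustration; the equal split of the transition region between the solvent side and the protein side follows the text.

```python
import numpy as np

SOLVENT, TRANSITION, PROTEIN = 0, 1, 2

def transition_fraction(iteration, n_iterations, initial=0.10):
    """Transition-region volume fraction, shrinking linearly to zero."""
    return initial * max(0.0, 1.0 - iteration / float(n_iterations))

def label_regions(w, solvent_content, t_fraction):
    """Label grid points by rank of the weighted-average density w."""
    lo = np.quantile(w, solvent_content - t_fraction / 2.0)   # lower cutoff
    hi = np.quantile(w, solvent_content + t_fraction / 2.0)   # upper cutoff
    labels = np.full(w.shape, TRANSITION, dtype=np.int8)
    labels[w < lo] = SOLVENT
    labels[w > hi] = PROTEIN
    return labels

rng = np.random.default_rng(1)
w = rng.normal(size=10000)                    # stand-in for a density map
labels = label_regions(w, 0.63, transition_fraction(0, 1000))
fractions = [float(np.mean(labels == k)) for k in (SOLVENT, TRANSITION, PROTEIN)]
```

With a solvent content of 63% and a 10% transition region, the labels come out near 58% solvent, 10% transition and 32% protein; as the fraction shrinks to zero the two cutoffs merge into the single cutoff of the conventional scheme.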
Starting from random phases, the calculated density provides limited information about the protein structure. To obtain a broad protein region, a larger averaging radius is employed to compute a weighted-average density map. Using a single cutoff value to discriminate between protein and solvent in this map can lead to the mislabeling of some protein residues as solvent. To avoid this mislabeling, another average density using a smaller radius is computed within the transition region. This density $w'_i$ is employed to assess the probability $p$ that a grid point is inside the protein. $p$ is defined by Equation (7), where $w'_{\max}$ and $w'_{\min}$ represent the maximum and minimum of $w'_i$ within the transition region. A grid point with $p$ close to 1 is likely to be located within the protein.
2.3. Introducing the Transition Hybrid Input–Output Algorithm for Refined Protein Envelope Reconstruction
Prior to reaching a final solution, the calculated density may not be entirely accurate, leading to a less precise reconstructed protein envelope. Grid points within the transition region exhibit a mixed state of protein and solvent, with a high weighted-average density indicating a higher probability of being protein and vice versa. HIO primarily modifies the calculated density in the solvent region through negative feedback, leaving the protein region unaffected [1]. As grid points in the transition region exist in a hybrid state, a novel transition hybrid input–output (THIO) algorithm is introduced in Equation (8), where $\rho_n(i)$ represents the calculated density of the $i$th grid point in the $n$th iteration, while $\rho'_n(i)$ signifies the modified density of the same grid point in the same iteration. A constant $\beta$ is utilized, typically set to a value ranging from 0.7 to 0.9. The parameter $p$ was defined in Equation (7).
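One plausible reading of the transition update is a probability-weighted blend of the two conventional HIO branches: the calculated density is kept with weight p, and the negative-feedback update is applied with weight 1 − p. The sketch below implements that reading only; the blending form, the function name and the label constants are assumptions and should not be taken as the exact content of Equation (8).

```python
import numpy as np

SOLVENT, TRANSITION, PROTEIN = 0, 1, 2

def thio_update(rho, rho_prev_mod, labels, p, beta=0.8):
    """Density modification with a transition region (assumed blending form).

    rho          : calculated density in the current iteration
    rho_prev_mod : modified density from the previous iteration
    labels       : per-grid-point region labels
    p            : probability of being protein, used in the transition region
    """
    hio = rho_prev_mod - beta * rho              # conventional solvent update
    blended = p * rho + (1.0 - p) * hio          # mix of protein and solvent rules
    out = np.where(labels == PROTEIN, rho, hio)
    return np.where(labels == TRANSITION, blended, out)

labels = np.array([SOLVENT, TRANSITION, TRANSITION, PROTEIN])
rho = np.ones(4)
rho_prev_mod = np.full(4, 2.0)
p = np.array([0.0, 0.0, 1.0, 0.0])
rho_mod = thio_update(rho, rho_prev_mod, labels, p, beta=0.5)
```

With this form the update reduces to the solvent rule at p = 0 and to the protein rule at p = 1, so the modified density varies continuously across the boundary.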
The solvent region provides the constraint needed to solve the phase problem. In the conventional HIO algorithm, due to inaccuracy of the boundary, incorrect density modification is applied near the boundary, effectively weakening the constraint. The introduction of THIO is a remedy for that. It is particularly important for phasing crystals with low solvent contents.
The proposed transition HIO algorithm distinguishes itself from conventional HIO by not solely modifying the density in the solvent region. Instead, it introduces continuity in the modified density across the boundary between protein and solvent. In real protein crystals, the boundary of the protein is diffuse rather than sharp. The introduction of the transition region aims to achieve this continuous modification of density.
3. Results
3.1. Enhancing Unit Cell Selection with the Transition Region: Breaking Crystallographic Degeneracy
Protein crystals often possess several equivalent representations due to crystal symmetry, leading to allowed origin translations or origin choices for the unit cell. Additionally, the crystal and its enantiomorph exhibit identical diffraction patterns. For example, in the space group, the crystal remains invariant under arbitrary translations, resulting in an infinite number of equivalent origin choices for the unit cell. In the case of crystals in the space group, there are eight origin choices, as well as eight enantiomorphs, due to the allowed origin translations. Those equivalent representations sometimes make direct phasing difficult, as the protein envelope lacks the required precision to differentiate between them.
A significant improvement to overcome this challenge is the introduction of the transition region. It proves to be vital in breaking apart crystallographic equivalents, as seen in Figure 3 with pairs of unit cells (a,c) and (b,d). The transition region allows for the refined reconstruction of the protein envelope, which aids in making the final choice of the unit cell. As the phase retrieval progresses, the calculated envelope becomes more accurate, ultimately differentiating between the crystallographic equivalent pairs (a,c) and (b,d). This accurate envelope selection leads to consistent evolution and brings the calculated envelope closer to the true envelope, ensuring a successful solution.
We used protein 4iqk as an example to illustrate the construction of the transition region. With a data completeness of 99.62% and diffracting at 1.97 Å [44], the crystal has a solvent content ($s$) of 63.77%, set to 63% during phase retrieval. Starting from random phases and diffraction data, the density was computed using a fast Fourier transform. A weighted-average density $w_i$ was derived from the calculated density, employing Equation (1), where $\sigma$ controlled the averaging radius, set at
Å.
To construct the transition region, two cutoff values, $w_1$ and $w_2$, were searched for on the weighted-average density map. Shrinkage inward by 5% from the expected protein envelope and expansion outward by 5% defined the transition region. Grid points with $w_i < w_1$ corresponded to the bulk solvent region, occupying a volume of $s - 5\%$, while those with $w_i > w_2$ corresponded to the protein region, occupying a volume of $1 - s - 5\%$. Grid points with $w_1 \le w_i \le w_2$ constituted the transition region, occupying 10% of the unit cell volume. Within this transition region, another weighted-average density $w'_i$ was computed using a shorter averaging radius (
Å).
$p$, calculated using Equation (7), represented the probability of a grid point being located inside the protein region.
Error metrics were computed in each iteration cycle to monitor the phase retrieval progress. As depicted in
Figure 4,
and
exhibited an evident drop when a solution was reached. In 100 independent runs starting from random phases but without the transition region, all runs failed to reach a solution due to insufficient precision in reconstructing the protein envelope. However, with the inclusion of the transition region, the calculated protein envelope significantly improved, leading to 7 out of 100 runs converging to a solution. Successful runs were distinguished from failed ones based on R values. The retrieved density map was interpretable, facilitating direct model building. The rebuilt model aligned well with the structure in the Protein Data Bank, with an r.m.s.d. of about 1 Å.
Although the structure of 4iqk has non-crystallographic symmetry (NCS), NCS density averaging was not applied here in ab initio phasing, since the NCS operators are unknown when starting from random phases.
3.2. Enhancing Protein Envelope Accuracy with the Transition Region: Reconstructing Protein Envelopes with Rough Surfaces
The construction of an accurate protein envelope is a critical factor in the success of direct phase retrieval. Typically, the calculated density does not converge to the correct density until the calculated protein envelope covers approximately 90% of the true protein structure. However, achieving an accurate envelope can be challenging, especially for proteins with residues buried deep in bulk solvent after crystallization. In such cases, the surface of the protein envelope within a unit cell appears rough and uneven. Outlier residues buried in the solvent often evade inclusion in the reconstructed protein envelope, leading to failed direct phasing attempts.
Using a single cutoff value on the weighted-average density map to reconstruct the protein envelope with a rough surface in the unit cell proved inadequate. For instance, four protein structures (3tqe [45], 4q82 [46], 2fg0, and 2evr [47], shown in Figure 5) demonstrated outlier residues buried in the solvent after crystallization, resulting in non-smooth surfaces for the true protein envelope within the unit cell. Employing a single cutoff value approach in reconstructing the protein envelope from the calculated density led to a low success rate. In 100 independent runs starting from random phases for each of the four diffraction data sets, only a few successful runs were obtained, as shown in Table 1. The reconstructed envelopes in these failed runs often missed the outlier residues crucial for the accurate representation of the protein structure.
The introduction of a transition region proved to be valuable when dealing with rough surface envelopes. Instead of searching for a single cutoff value on the weighted-average density map, the transition region is identified first, with the expectation that the true protein envelope lies within this region. A smaller $\sigma$ is used to compute a more detailed weighted-average density based on Equation (1). This detailed weighted-average density aids in identifying an accurate protein envelope that can encompass as many outlier residues as possible. Consequently, the reconstructed protein envelope becomes more accurate, significantly benefiting the phase retrieval process. In our trial calculations, the success rate nearly doubled, as shown in Table 1. The number of successful runs increased significantly, indicating the effectiveness of the transition region in refining the calculated protein envelope, especially when dealing with rough surfaces.
As seen in Table 1, we conducted 100 runs for each structure, starting from independent random phase sets. When comparing the two cases with and without a transition region, we used the same random phase sets. The introduction of a transition region generally led to both an increased success rate and improved phasing speed. However, it did not contribute to reducing the final mean phase error, which remained closely associated with the quality of the measured diffraction data.
3.3. Enhancing Direct Phasing with the Transition Region for Protein Crystals with Limited Solvent Contents
Phasing protein crystals with lower solvent contents proved to be a challenge for direct phasing methods that prefer higher solvent contents, typically above 65% [8,9,10,11,12,13,14,15,16]. As most protein crystals exhibit solvent contents below this threshold, we explored the use of a transition region to maximize the utilization of limited solvent during direct phasing.
The introduction of the transition region played a crucial role in direct phasing, particularly when using the THIO algorithm described in Equation (8). Unlike the conventional HIO method, where a single cutoff value separates the unit cell into protein and solvent regions, the THIO algorithm introduces a transition region that occupies 10% of the unit cell, comprising 5% solvent and 5% protein. This modification ensures that all grid points within the transition region contribute to phase retrieval, making it possible for the 5% protein content to aid in the phasing process. This proves advantageous when dealing with crystals with a solvent content less than 65%.
The evolution of the transition region during iterations is depicted in Figure 6. Initially occupying 10% of the unit cell, the transition region shrank linearly to zero as the iterations progressed. We tested the THIO algorithm on protein crystals with solvent contents ranging from 60% to 65%, including five crystal structures: 4iqk, 3tqe, 4q82, 2fg0, and 2evr.
The THIO algorithm works well for low-solvent-content protein crystals, particularly when the structure exhibits non-crystallographic symmetry (NCS). NCS-related copies share the same density, reducing the number of unknown variables and increasing the phase problem’s redundancy. This overdetermined state proves beneficial for low-solvent-content crystals with NCS, as described in our previous works [14,16]. However, without NCS, both THIO and HIO methods may not work effectively. When NCS is absent, the phase problem of protein crystallography generally becomes underdetermined for solvent contents below 60%.
The retrieved phases typically exhibit a mean phase error of around 30° for high-resolution diffraction data. The calculated density is interpretable, facilitating model building with tools like ARP/wARP [49] or Phenix AutoBuild [50]. The rebuilt models demonstrate a close match with the structures posted in the Protein Data Bank, with an r.m.s.d. of approximately 1 Å, as shown in Figure 7. In Section 4, we discuss structures with low-resolution diffraction data, such as 6c4z.
4. Discussion
Our approach to finding an accurate protein envelope involved the introduction of a transition region between the protein and solvent within the unit cell. We computed a weighted-average density map from the calculated density, enabling us to identify the inner and outer surfaces of the transition region, which provided an approximate location of the protein envelope. To refine the envelope, we employed a smaller radius to compute a finer weighted-average density within the transition region, and this was utilized by the THIO algorithm. We wondered if using only the smaller radius would yield accurate results, but our trial calculations indicated that the density was fragmented and not interpretable, rendering it unsuitable for determining a complete envelope and bulk solvent. Additionally, experimenting with multiple layers of transition regions did not improve the results.
Our trial calculations primarily focused on high-resolution diffraction data ranging from 1.50 to 1.98 Å. However, we sought to explore whether the transition region could prove beneficial for low-resolution diffraction data as well. We tested the 6c4z crystal, a human-designed amyloid-like structure with a solvent content of 81.22% and diffraction at 3.30 Å [51]. In this case, the transition region failed to aid phase retrieval due to the low-resolution data. The calculated protein density extended significantly into the solvent, blurring the boundary between protein and solvent and making the search for a clear protein envelope impractical. For both HIO and THIO, direct phasing of low-resolution diffraction data is still a challenge. Trial calculations on several structures with low-resolution diffraction data failed for both HIO and THIO. A more effective approach to deal with low-resolution data is under research.
Empirically, we set the volume of the transition region to 10% of the unit cell in our trial calculations. We experimented with larger and smaller transition regions but found that 10% was an appropriate choice. During the initial iterations, the calculated density map is nearly random, with grid points having high or low weighted-average density located deep within the protein or bulk solvent. Other grid points remain undetermined and may be either protein or solvent. That prompted us to assign a probability (Equation (7)) to such grid points. The volume of the transition region can vary with iteration cycles, and in our tests, it shrank linearly from 10% of the unit cell volume at the beginning to zero at the end of the iterations. A balance was maintained to ensure the transition region was neither too large, as it would reduce the volume treated as bulk solvent, nor too small, as it would be indistinguishable from the single cutoff value used in the conventional HIO phasing method.
Comparing our transition method with other related approaches, we observed that the transition region aids in refining the calculated protein envelope, resulting in increased success rates and faster phasing. Liu et al. proposed a block region outside of a fixed protein boundary, setting the density in that region to zero [8]. That approach ignores outlier residues and does not contribute to envelope refinement. In contrast, our method refines the protein envelope by updating the transition region in each iteration cycle, proving to be more effective.
The transition region not only increases the volume of density modified by phasing algorithms in bulk solvent but also includes density in the transition region itself, making it beneficial for the direct phasing of protein crystals with lower solvent contents. Utilizing as much solvent as possible is crucial for direct methods, as protein folding often results in pockets buried deep within the protein envelope, which can be occupied by solvent molecules. In future research, we plan to employ new algorithms to exploit those small pockets effectively.
The code for our approach is available online [52].
5. Conclusions
In conclusion, our introduction of the transition region and the transition hybrid input–output algorithm proved to be highly effective in refining the calculated protein envelope from random phases and diffraction data. The transition region contributes to the direct phasing method in multiple ways. Firstly, it aids in breaking crystal translation symmetry and determining the origin choice, addressing a critical challenge in phase retrieval. Secondly, it proves to be particularly valuable in reconstructing the protein envelope when dealing with crystals that have a rough surface, ensuring accuracy even in challenging cases. Lastly, the transition region significantly increases the volume of the density modified by iterative projection algorithms, benefiting phase retrieval, especially for protein crystals with lower solvent contents.
Our approach demonstrated remarkable success when applied to high-resolution diffraction data. However, for low-resolution diffraction data, where a distinct protein boundary is absent, the transition region does not provide the same benefits. Nevertheless, the transition region, in conjunction with the transition hybrid input–output algorithm, consistently results in an accurate protein envelope, leading to increased success rates and accelerated phase retrieval. This enhancement makes the direct method more adept at phasing protein crystals with limited solvent contents. The improved performance of our method signifies a promising advancement in the field of protein crystallography, making it more versatile and reliable for various crystal structures.