1. Introduction
Thanks to the increased brightness of X-rays at modern synchrotron sources and the appearance of fast and nearly noise-free detectors, it becomes possible to collect diffraction data with fine time sampling or with fine steps in pressure, temperature or anisotropic external fields. These in situ experiments open up new possibilities to reveal the structural mechanisms behind the functional response of the material under study. The dark side of such experimentation is a large volume of the data to be analyzed. Ideally the data analysis should be fast enough for online optimization or modification of the experiment in order to reach the necessary performance for a given in situ setup. There is hence a growing need for high-throughput experiments to be accompanied by high-throughput data processing. One of the Big Data tools, Principal Component Analysis (PCA), has been proposed to facilitate the analysis of powder diffraction data, extracting information about solid-state kinetics [
1]. One of the declared advantages of the method is its blind reduction of initial dimensionality of the data to obtain a representation with a lower number of variables (called Principal Components, PCs) [
2]. Those components, for diffraction data, represent products of functions that solely depend on the Bragg angle (loadings,
) with evolutionary functions (scores,
) that represent the change of the proportion for a given component as a function of time or external variables. For a series of diffractograms collected as a function of time within an in situ experiment, the decomposition on principal components reads as:
where
are the loadings,
are the scores.
The components are sorted in a descending manner, and the decomposition is truncated when a given level of correspondence between the data and decomposition is reached. While the mathematics behind PCA is well-defined and can be found in many nice texts [
3], the physical or chemical sense of the above decomposition is not always clear, especially when it comes to its application to diffraction data. In the literature we can find that PCA has already been used quite often to analyse diffraction patterns aiming to, i.e., clusterization of the data related to materials under study: blend composition ratio of cocaine to sodium hydrocarbonate [
4], classification of counterfeit coins [
5], bulk properties of multiwall carbon nanotubes [
6], grouping materials measured by a pixelated detector [
7], evolution of diffraction signal from in situ electrochemical cell [
8].
For time-resolved in situ diffraction, the scores have been used to extract information on reaction kinetics [
1], and seem to be easy for interpretation. In opposite to the scores, the attempts to analyze loadings are usually stated as “the principal components are often abstract representations of the chemical information and not easily interpreted” [
8]. In [
9] it was pointed out that some features of loadings ‘correspond’ to Bragg lines in the consequent powder diffractograms, but no further analysis was done.
A further application of PCA to the XRPD and PDF data has recently been reported [
10]. PCA output was compared with conventional structural analysis. It was shown that the scores of different components may reflect trends in the temperature evolution of the phase fraction in a multi-phase mixture, change in geometrical parameters of a structural fragment or something that varies twice faster than the temperature. The loadings were composed of positive and negative peaks and/or had an asymmetric shape. These features were proposed to originate from a variation of the unit cell size with temperature. Obviously, without a comparison with traditional structural analysis it is difficult, if possible at all, that a blind use of PCA to diffraction data could receive a rational interpretation.
A tool for data analysis of a big volume of XRPD data collected under periodically changed external conditions, e.g., temperature or pressure, Modulation-Enhanced Diffraction (MED) was proposed in [
11]. The method is based on the decomposition of diffraction data on the time average and varying with time contributions. A Fourier transform of diffraction data from time to frequency domain gives a set of harmonics. Assuming that the variation of the structure factor is a linear function of an external perturbation and in absence of a variation of the unit cell dimensions one derives that the first harmonic corresponds to a cross term between the time average and time changing parts of the structure factor. The second harmonic gives a sole contribution of the time changing part of the structure factor squared. For detailed analysis and practical applications, see [
11,
12,
13,
14].
A first comparison of MED and PCA was done in [
15] where it was shown that the first principal component may correspond to a cross product of the time average and time changing parts of the structure factor, while the second component is proportional to the square of the time changing part of the structure factor, in close similarity with MED. Unfortunately, this observation and conclusion (for both MED and PCA) is correct only for data with a negligibly small variation of the unit cell dimensions. Even when this condition is fulfilled, PCA output has to closely inspected and additional corrections for the rotational ambiguity may be required [
16]. The effective approach to do that, Optimal Constrained Component Rotation (OCCR) is considered in details in [
17]. However, as it stated in [
17], “if the hypothesis that the peak shape and position do not change with the stimulus does not hold, the PCA decomposition cannot be accomplished”.
One of the palliatives is an adaptive adjustment of the line positions, see [
18] as an example. The other possible option is to apply PCA to single crystal data where the intensities and positions of Bragg reflections are measured separately [
19]. However, even for those cases the knowledge on what to expect from PCA applied to diffraction data is a necessary prerequisite, similar to modulation-enhanced diffraction. The information content of the MED output for more realistic conditions assuming both variations of diffraction intensities and unit cell dimensions under changing external conditions is given in Ref. [
14]. However, a clear understanding of the PCA output for this case is still missing. The goal of the present paper is to fill this gap and propose a rigorous scheme for a unambiguous interpretation of the PCA output on diffraction data.
2. Theory
The effect of external perturbations on crystalline materials, such as temperature, pressure, variation of chemical surrounding, can result in atomic displacements, and variation of occupancy and Debye–Waller factors. Changes in positions and intensities of diffraction lines associated with the external perturbation vary with time
t and are bound to each other as well as the structure factors and the unit cell dimensions. We start with an assumption that time-dependent structure factor,
,
, can be expressed as a linearized form:
where
represents the susceptibility of the structure factor if one varies a parameter
S that also modifies the parameter
x (
S and
x may be occupancy, atomic position, scattering function, and DWFs). Conjugated changes in the position of Bragg reflection then writes:
with
may be also seen as the susceptibility relating the line shift and change of the unit cell dimensions
. It is natural to assume a linear link
, this link has, however, to be taken with care if changes in average structure and lattice response may differ, like, for example, for relaxation processes.
We now write the powder diffraction intensity as a function of
and time
t:
that is a sum over all the Bragg reflection intensities convoluted with the line shape function
.
The line shape function,
, in turn, can be expressed as a Taylor series with respect to a small parameter—a shift of the line position in
d:
where the bar represents the averaging over all the measurements as before, the derivatives
and
are anti-symmetric and symmetric profile functions, readily calculated for a Gaussian (see
Appendix A).
The Bragg intensities for every reciprocal vector
are given by the following terms:
The scattered intensity as a function of
d is therefore given by the product of Equations (
5) and (
6), and may be considered as a sum:
where
is the time average,
, the other components and their time evolution functions are given in
Table 1.
The components listed above differ by the time evolution function and by the shape of diffraction lines. Let us consider a few simple cases, first without any change in unit cell dimensions, , where the line positions stay the same, but the diffraction intensities change. There are only two components that depend on time as and , the corresponding diffraction patterns are composed from symmetric lines that may have positive and negative intensities or be strictly positive . The second simple case corresponds to the unit cell evolution without any change in the diffracted intensities. Here, there are also only two components left: the first is built from the anti-symmetric line profiles given by the first derivative of the profile function; the symmetric line shapes for the second component are defined by the second derivative.
The Principal Component Analysis assumes the following representation of a set of powder diffractograms collected as a function of time
t:
where
i is the number of a component,
L and
S are the loading and score, correspondingly. All the components are sorted in the descending order with respect to their contribution to the data variation. The similarity with Equation (
7) is clear, and one may expect that loadings and scores for this decomposition are similar to those given in
Table 1. However, one has to keep in mind that a linear combination may also serve as a component. This problem originated from the rotational ambiguity, inherent for PCA [
16]. It can be exemplified for two components as followed (see also
Appendix B):
This rotational correction is similar to one used for OCCR approach [
17]; here, for the sake of simplicity, we keep the components orthogonal.
We apply the theory described above, first, to the simulated data, and then, to real experimental data. In the simulated data we change specifically variations of the structure factor or positions of the Bragg lines or both. The experimental data from a gas uptake by a porous solid contain both variations and also deviates from linear response.
4. PCA Analysis of Real Data for Mg(Bh) + Kr
The data we use for PCA are the same as used in [
21]. A sequential Rietveld refinement of the data made a detailed analysis of kinetics barriers for Kr uptake by this porous solid possible [
21]. A specific feature of these data is a time-dependent background that comes from Kr fluorescence. This background serves as an independent measure of the absorbed Kr and can be directly compared with the outputs of PCA.
We use a dataset collected at 170 K with
bar pressure of Kr, as a function of time after the “sample + gas” container was quickly inserted in the cold N
stream. The data were recorded with 4 seconds time sampling. In agreement with kinetics analysis given in Ref. [
21] we consider every pattern as a snapshot of the equilibrium state, an overview of this collection of powder patterns is given at
Figure 5. The data were partitioned in three partial subsets, we took only first 1000 patterns of full data (subset A), than 1000 patterns with every second pattern (subset B), and finally 1000 patterns with every fourth pattern (subset C). We use this partitioning to examine how the PCA outputs from partial data correspond to each other and to the Kr fluorescence background.
The “blind” application of PCA gives results that are far from satisfactory, shown in
Figure 6. The first scores can be overlapped with a constant shift with small but notable difference with the Kr background signal. The first loadings do not look the same for three subsets.
The PCA with the dataset A gives one dominating component that accounts for 99 percent of all changes, with loading and score shown in
Figure 7 after a small manual rotational correction of
rad. The expected evolution function and intensity distribution are given in
Table 1 and in Equation (
14). The comparison with intensity estimates are given for the loadings in
Figure 7.
The match of the calculated loading with observed is satisfactory for
, assuming one leading term (Equation (
14),
Table 1), namely:
similar to the model case considered above. For
many lines have anti-symmetric line shapes indicating that the major contributions come from the following terms:
A comparison of the first contribution with the second loading is given in
Figure 7 for the dataset A.
The PCA output for data subsets B and C correlates well with the results for the subset A after the rotational correction (
Appendix B). After shifting with a constant all the first scores overlap and match the time dependence of the Kr fluorescence background (
Figure 8). All first loadings are also looking similar, close inspection evidence a systematic shift in line positions and distortion of the shape. The shift reflects a change in the average unit cell dimension with time. The change in line shape indicates that for datasets covering longer experimental time higher order terms from
Table 1 become significant. A similar effect is reflected in the second loadings, where one sees that the first term in Equation (
18) goes down and the second term increases for datasets with longer time. This change in ratio of different diffraction contributions into the PCA components indicates a certain difference in time evolution for Kr occupancy and unit cell dimension: the unit cell size keeps changing when Kr occupancy is nearly saturated.
5. Conclusions
We have performed a theoretical analysis on how specific changes in time resolved diffraction data are represented into PCA output. The analysis enumerates terms that comes alone or as linear combinations into loadings and scores given by PCA. The above analysis is illustrated by applications of PCA to synthetic and real powder diffraction data. The immediate result shows that this method has an inherent danger when it is applied “blindly”, that is sometimes considered as an advantage. Namely loadings may have a complex structure being a sum of contributions mixing up the lattice and structural response. Leaving quantitative analysis of the loadings aside and focusing on solely on the, e.g., kinetic information extracted from the scores is not less dangerous. Rotational ambiguity makes certain linear combinations of diffraction contributions to be preferred components for PCA, resulting in inaccurate rates and activation energies when applied blindly. The first practical recommendation stemming from the theoretical analysis is to find a rotation correction that makes
as close as possible to a square of
. Successful applications of this recommendation were exemplified with both simulated and real data. We have also shown that the loadings can be rationalized in terms of contributions with different profile functions. The decomposition on contributions is trivial for powder diffraction if there is no change in the unit cell dimension. This makes PCA an interesting tool for single crystal applications where diffraction intensities can be collected independently from unit cell changes [
19]. The information stored in loadings may be used to analyse active (responsive, evolving) substructure from the second loading, or to retrieve partial phase information from the first loading that is defined by the cross-product of structure factors. For powder diffraction additional information is contained in the line shapes. Thus, for an optimal rotational correction the line shapes of low angle reflections for first loading are predominantly symmetric functions of Bragg angle, while for the second loading that may be a sum of both, symmetric and anti-symmetric profiles.
The number of applications of PCA for the fast analysis of big diffraction data will certainly grow. It is one of the efficient data tools that are highly requested especially for very productive large scale experimental facilities. However, a successful use of PCA necessarily implies that the results are carefully analyzed and a further manual rotational correction is applied. If the PCA output is used blindly, the outputs have to be taken with a grain of salt, as we have shown here they almost certainly contain no physically meaningful result.
The other possible option is to use Multivariate curve resolution with alternating least squares (MCR–ALS), see Ref. [
8] as an example. This method, being based on PCA, allows one to constrain the output. For a diffraction probe of structural processes, the constraints have to be set accounting for the diffraction components listed in
Table 1. Potentially MCR–ALS, as well as post-PCA corrections, like OCCR [
17], with proper constraints may be used for targeting individual diffraction contributions. The possible constrains might be based on the shape of corresponding profile functions, but their exact formulation and quantitative analysis need more theoretical efforts and model calculations.
In spite of the difficulties and complications listed above, the interpretation of the PCA applied to single phase powder diffraction is entirely possible with all contributions enumerated in
Table 1. The present report provides one more step towards the rational use of PCA and similar multivariate tools to diffraction data. The next steps in the development and understanding may include quantitative analysis of the loadings, and development of the software tools for on-line analysis of data from in situ experiments.