1. Introduction
Chromatin spatial organization plays a critical role in genome structure and transcriptional regulation [
1,
2,
3]. During the last decade, great strides have been made in the mapping of long-range chromatin interactions, thanks to the rapid development of chromatin conformation capture (3C) based technologies. Among them, Hi-C enables genome-wide measurement of chromatin spatial organization [
4,
5] and has been widely used in practice. To ensure scientific rigor, various methods have been developed to assess the reproducibility of Hi-C data [
6,
7,
8,
9,
10]. For example, HiCRep [
6] first performs 2D smoothing to reduce the stochastic noise resulting from the sparsity of Hi-C data, and then quantifies reproducibility by calculating a stratified correlation, which is a weighted average of correlation coefficients between contact frequencies across specific one-dimensional (1D) genomic distance bands. HiC-Spector [
8] adopts a different approach, transforming symmetric Hi-C contact matrices to their corresponding Laplacian matrices and then calculating similarity as the average of the distances between normalized eigenvectors. Similar to HiCRep, GenomeDISCO [
7] relies on data smoothing, which is performed over a range of steps of the random walk to determine an optimal separation between biological replicates and non-replicates as measured by area under the precision–recall curve. The reproducibility measure is a function of distances between two contact matrices smoothed using this optimized number of steps. QuASAR-Rep [
9] determines a local correlation matrix by comparing observed interaction counts to background signal–distance values within a specified distance. This local correlation matrix is subsequently transformed by element-wise multiplication with a matrix of scaled interaction counts. The reproducibility between two samples is defined as the Pearson correlation coefficient between the corresponding transformed matrices.
Recently, HiChIP [
11] and PLAC-Seq [
12] technologies (hereafter collectively referred to as HP for brevity) have been developed to study protein-mediated long-range chromatin interactions at a much reduced cost and greatly enhanced resolution relative to Hi-C. While the chromatin immunoprecipitation (ChIP) step involved in HP technologies allows for the cost and resolution benefits, it also introduces additional layers of systematic biases, which make analysis methods developed for Hi-C data potentially unsuitable for HP data.
To fill in this gap, we propose a novel method, HPRep, to measure the similarity or reproducibility between two HP datasets. HPRep is motivated by HiCRep [
6], the previously described method developed for quantifying reproducibility of Hi-C data. Similar to HiCRep, HPRep leverages the dependence of chromatin contact frequency on 1D genomic distance; however, in contrast, HPRep models different ChIP enrichment levels (
Section 2.1.2), which contribute to the systematic biases specific to HP data, and also incorporates an unbalanced data matrix that addresses the targeted structure of HP data in comparison to Hi-C data.
3. Results
Currently available methods to quantify reproducibility in Hi-C datasets such as HiCRep, HiC-Spector, GenomeDISCO, and QuASAR-Rep (systematically evaluated in [
10]), all involve derivation of a similarity metric between two contact frequency matrices. The input Hi-C data consists of
n ×
n symmetric matrices of non-negative integers, where each row/column represents one genomic locus (i.e., bin) and
n is the total number of bins. The (
i,j) element of such a matrix represents the number of paired-end reads spanning between bin
i and bin
j.
These existing methods are conceptually inappropriate for HP data due to the unbalanced read distribution due to ChIP enrichment that is introduced in the HP experiments.
In addition, while Hi-C data consist of interactions among all bin pairs, HP data are restricted to bin pairs where at least one bin overlaps a binding region of the protein of interest. Such overlapping bins are referred to as the anchor bins, and two HP datasets may have different sets of anchor bins. We further define bin pairs consisting of two anchor bins as the “AND” pairs, and those consisting of only one anchor bin are defined as the “XOR” pairs. In contrast, the “NOT” pairs, for which neither bin is an anchor bin, are not meaningful due to the nature of HP technologies and therefore not used in HP data analysis [
13].
The data structure in HPRep is an
N ×
m matrix (
Figure 1), where
N represents the number of anchor bins and
m = 2 * 1 Mb/resolution, where resolution refers to the bin size (1 Mb is set as the default but can be modified by the user). The (
i,j) th element represents the normalized contact frequency between anchor
i and the bin
j bin widths away from the anchor,
j ∈ {−
m/2, …,−1, 1, …,
m/2}. The number of anchor bins,
N, is the cardinality of the union set of anchor bins for all datasets in the study. Normalization is performed via a two-step procedure. (1) Raw counts are adjusted for the biases introduced by effective fragment length, GC content, mappability, and ChIP efficiency by fitting a positive Poisson regression model, following the approach detailed in the MAPS method [
13]. Separate models are fit to the AND and XOR pairs since the AND pairs are expected to have significantly higher contact frequencies due to double ChIP enrichment. (2) Using the fitted models, the data are normalized by taking the log
2 value of (1 + observed/expected counts). Further details can be found in
Section 2.1.
Similar to HiCRep [
6], the distance metric used by HPRep is a weighted Pearson correlation coefficient that is stratified by 1D genomic distance. Note in
Figure 1 that these strata are the pairs of columns of the previously described data matrix, which are equal-distant from the center. Due to the sparsity of HP data, especially for long-range chromatin interactions, the normalized count values were smoothed. The smoothing procedure used was a 1D arithmetic mean of values within a window of
d bins away along the same row (see
Section 2.2 for optimization procedure). Each of the
m/2 correlations was weighted based on the variation of the smoothed values at that distance such that the weights sum to one. Therefore, the resultant metric was restricted to [−1, 1] and had a similar interpretation as a standard Pearson correlation coefficient.
Let
ak and
bk be two vectors of length 2
N from samples a and
b, respectively, whose elements are normalized contact counts, where
N represents the number of anchor bins in the union set of anchor bins from all samples in the study, and
k indexes bins that are ±
k units away. Let
ak′ and
bk′ be the resulting vectors of length
Nk ≤ 2
N after removing any elements that are 0 in identical positions in both two vectors. The weight for stratum
k,
wk, is defined as
where
K is the total number of strata, which is analogous to the weights used in HiCRep [
6]. The numerator of
wk is the product of strata size and the standard deviations of
ak′ and
bk′, while the denominator is the sum of the numerators over all strata. Consequently, the weights were restricted to [0, 1] and the sum to 1, where larger and more variable strata carry more weight than smaller and less variable strata. The final reproducibility metric was the weighted sum of correlations between each stratum. This workflow is diagrammed in
Figure 1.
3.1. Mouse H3K4me3 PLAC-Seq Data
To evaluate the performance of HPRep, we first analyzed published H3K4me3 PLAC-Seq datasets from mouse embryonic stem cells (mESCs) [
13] and mouse brain tissues [
15], both consisting of two samples, by applying HPRep at 10 Kb resolution. Samples from the same cell type or tissue were labeled as biological replicates while those cross cell type or tissue were labeled non-replicates, yielding two pairs of biological replicates and four pairs of non-replicates. Pseudo replicates were generated by pooling two samples of the same cell type or tissue together, and then partitioning the pooled contact frequency in each bin pair randomly via binomial (
p = 0.5) sampling.
We would expect that pseudo replicates are most similar, followed by biological replicates, and that non-replicates are least similar. Indeed, this expected pattern is observed using HPRep (
Figure 2), with results also exhibiting highly consistent patterns across chromosomes (
Supplementary Figure S1). The higher metric for replicate mESC samples relative to mouse brain samples is due to the higher sampling depth of the former.
We next compared HPRep with alternative methods, specifically two Hi-C reproducibility methods: HiCRep [
6] and HiC-Spector [
8] as well as a naïve Pearson correlation (
Section 2.3). Since the Hi-C specific methods are designed using n × n symmetric contact matrices as the standard input, for these comparisons, in addition to restricting to bin pairs in the AND and XOR sets, we generated a “pseudo Hi-C” dataset from a HP dataset by also using all bin pairs (including the AND, XOR and NOT sets). The naïve Pearson correlation consisted simply of converting the entire upper triangular Hi-C contact matrices for each sample to single vectors and calculating the Pearson correlation coefficient between them. The methods were performed separately on all 19 autosomal chromosomes and the resulting metrics were reported as the arithmetic mean. The HiCRep and HiC-Spector methods were applied with the default parameters. The results are displayed in
Figure 3.
All methods except for naïve Pearson correlation yielded results consistent with what we expected, namely higher similarity for the biological replicates and lower similarity for the non-replicates. The similarity or reproducibility values for the biological replicates were similar among these three methods, which is expected for HPRep and HiCRep, since both methods are based on stratified Pearson correlation, but is noteworthy for HiC-Spector, since it is based on a rather different method, and was restricted to [0, 1] as opposed to [−1, 1]. The difference among these methods, with the exclusion of HiC-Spector when including the NOT set, manifests largely in values for non-replicates, with HPRep yielding much smaller values relative to the others, although in each case, the four non-replicate pair results were very consistent. Interestingly, the naïve Pearson correlation fails with the mouse brain sample, yielding a reproducibility score nearly identical to those of the non-replicates, whereas the result from mESC replicates is consistent with the other three methods. This failure is obviated in HiCRep and HPRep, the other Pearson based methods. For example, for biological replicates, HPRep yields a mean reproducibility metric of 0.92 compared to a mean value of 0.25 for non-replicates. For the experiments using bin pairs in the AND, XOR and NOT sets, the mean reproducibility metrics comparing replicates and non-replicates were 0.80 vs. 0.51, 0.99 vs. 0.73, and 0.88 vs. 0.76 for HiC-Spector, HiCRep, and Pearson correlation coefficients, respectively.
3.2. Human HiChIP Data
In addition, we applied HPRep to measure the reproducibility of H3K27ac HiChIP data from GM12878 cells (two biological replicates) and K562 cells (three biological replicates) at 10 Kb resolution [
16], resulting in four pairs of biological replicates (one pair from GM12878, three pairs from K562) and six pairs of non-replicates (
Figure 4). We anticipated a priori that differences between replicates and non-replicates would be more pronounced in this human dataset than the previous mouse H3K4me3 PLAC-Seq dataset due to the greater dissimilarity in H3K27ac anchor bins between GM12878 cells and K562 cells. Specifically, the GM12878 and K562 cell lines contained 31,980 and 26,963 H3K27ac 10 Kb anchor bins genome-wide (autosomal), respectively, with only 14,304 shared (Jaccard index 0.32). In contrast, mESC and mouse brain had 28,903 and 21,778 H3K4me3 10 Kb anchor bins, with 17,722 overlapping, (Jaccard index 0.54), which was expected since active promoters are largely shared across tissues and cell lines. For this human dataset, all methods were performed individually on all 22 autosomal chromosomes and the resulting metrics were averaged across chromosomes.
The results from the human HiChIP data were consistent with those from mouse PLAC-Seq data: the biological replicates yielded high similarity (close to 1) while the non-replicates yielded uniformly lower similarity. While all autosomal chromosomes were used in these analyses and the results were largely consistent across them using HPRep, HiCRep, and Pearson correlation coefficients, the results were quite inconsistent using HiC-Spector (
Supplementary Figure S2). Specifically, HiC-Spector used 20 eigenvectors in the computation of a reproducibility metric, however, for several chromosomes, convergence failed, so fewer eigenvectors were used, which yielded erratic results (
Supplementary Table S1). Again, HPRep results in the lowest metrics for the non-replicates, which were all close to zero, highlighting the influence on anchor bin identity in this method.
3.3. Human PLAC-Seq Data
We next applied HPRep to a more complex H3K4me3 PLAC-Seq dataset at 5 Kb resolution, consisting of 11 samples from four brain cell types in human fetal brain obtained via fluorescence-activated cell sorting [
17]: three samples from neurons (N), three samples from interneurons (IN), two samples from radial glial (RG), and three samples from intermediate progenitor cells (IPC). These samples had varying sequencing depths (detailed in
Supplementary Table S2 in [
17]), with the number of intra-chromosomal reads ranging from 47.5 million for RG2 (the second replicate of RG) to 390 million for RG1 (the first replicate of RG). The anchor bins were defined as the union of 1D H3K4me3 peaks from all four cell types. In
Figure 5a, reproducibility obtained by HiCRep showed no differentiation between inter- and intra-cell types. In contrast, HPRep showed a clear pattern of higher similarity for replicates from the same cell type compared to those from different cell types.
Focusing on bin pairs in the AND and XOR sets highlights the effect of normalizing ChIP enrichment level.
Figure 6 is analogous to 5a excluding bin pairs in the NOT set. The cell type clustering is more in line with the known truth, however, still has misspecifications according to the dendrogram: neuron, interneuron, and IPC cells were correctly grouped, but radial glial cells were misclassified into two groups.
Recent studies have shown that HiCRep is sensitive to sequencing depth [
10]. To evaluate the robustness of HPrep with respect to different sequencing depths, we performed downsampling to the original PLAC-Seq data from four human brain cell types. This was performed by sampling from a multinomial distribution with
n equal to the original count multiplied by a downsampling factor and count probabilities set to match the distribution in the original data (
Section 2.4).
The first downsampling was performed so that all samples matched the depth of the sample (RG2), which had the lowest sequencing depth. Note the identical color scales for
Figure 5b and
Figure 7, but the decrease in metric values for many pairwise comparisons for samples of the same cell type such as interneuron cells. To quantify this reduced discernibility between samples, we utilized the silhouette procedure [
14], treating reproducibility score as a distance metric and reporting the average of the 11 silhouette values, one for each sample (
Section 2.5). We obtained 0.717 and 0.685 for the original experiment and downsampled results respectively, where smaller numbers indicate worse clustering performance.
Subsequent downsampling was performed uniformly across all samples so that the total counts were reduced to 80%, 60%, 40%, and 20% of their original values following the previously described sampling protocol. As expected, in
Figure 8, we observed decreased discernibility among samples from different cell types, most strikingly with IPC and RG where the within sample HPRep reproducibility metric dropped to as low as 0.26 and 0.43, respectively. Applying the modified silhouette procedure described above to these four downsampled datasets, we obtained a silhouette score of 0.700, 0.678, 0.634, and 0.518 for downsampling to 80%, 60%, 40%, and 20%, respectively.
We next sought to investigate the extent to which our HPRep metric was driven by the 1D ChIP (anchor) signals relative to the 3D bin contact signals. To this end, we compared the irreproducible discovery rate (IDR) [
18] (
Section 2.7) to the HPRep results utilizing the highest read depth brain sample (RG1). This was accomplished by pairwise comparisons between the original ChIP-Seq data (IDR) or AND/XOR data (HPRep) and corresponding samples that had been downsampled to 80%, 60%, 40%, and 20% to the original depth. As expected, both IDR and HPRep metrics decreased with more aggressive downsampling, however, the effect on IDR, as measured by fraction of peaks passing a false discovery rate threshold of 5%, was far more pronounced. HPRep metrics were 0.97, 0.96, 0.93, and 0.88 compared to IDR of 0.80, 0.68, 0.24, and 0.06 at 80%, 60%, 40%, and 20% of the original depth, respectively. This effect difference suggests that 1D information does not dominate our results; if the HPRep results were merely a reflection of anchor similarity, we would expect a more consistent trend between the two experiments.
4. Discussion
Quantification of data reproducibility is critical to ensure scientific rigor, however, methods tailored for HiChIP and PLAC-Seq data are still lacking. Here, we propose HPRep, the first model-based approach to account for ChIP enrichment in measuring HP data reproducibility. Given the lack of HP specific tools, we compared HPRep to existing methods designed for Hi-C data, specifically HiCRep and HiC-Spector. Additionally, since our method, similar to HiCRep, relies on a weighted average of Pearson correlation coefficients, we also compared HPRep to the naïve Pearson correlation coefficient.
Our HPRep method, improving on existing Hi-C specific methods, was tailored to HP data for the measurement of reproducibility in two fundamental ways. First, HPRep was designed to accommodate the specific structure of HP data: while Hi-C data consist of contact frequencies among all bin pairs, HP data focuses on bin pairs where at least one bin overlaps with a ChIP-Seq peak for a protein of interest. This was different from the standard
n × n symmetric Hi-C contract matrix. We focused on the data matrix on anchor bins, regions that overlapped with ChIP-Seq peaks, and pairs between bins within a specified window of these anchors as illustrated in
Figure 1.
Second, HPRep fits a positive Poisson regression model to normalize HP-specific ChIP enrichment and uses the residuals as the normalized contact frequencies. It also analyzes bin pairs in the AND and XOR sets separately, effectively accounting for ChIP enrichment for the two different types of bin pairs.
Our results from mouse H3K4me3 PLAC-Seq data demonstrated very low variability in metrics between chromosomes (
Figure 2), which is consistent with HiCRep (
Supplementary Figure S3). In addition, we also compared HPRep with other existing methods using human H3K27ac HiChIP data from GM12878 and K562 cells as well as H3K4me3 PLAC-Seq data from four human brain cell types. Our results demonstrated the superior performance of HPRep, in terms of accurate clustering of samples from the human brain cell types, which was not achievable using HiCRep, although better clustering accuracy was observed when excluding bin pairs in the NOT set.
Future work involves exploring the potential of using this method to determine minimum per sample sequencing depth or maximum allowable (if any) differential depth across samples for accurate quantification of HP data reproducibility. We show that sample differentiation and expected clustering were robust to downsampling, but rigorous analysis needs to be performed in order to demonstrate practical use, as more high-depth HP data become available from more tissues, cell lines, or cell types. Additionally, we plan to examine the use of this general framework with capture Hi-C datasets including those targeting a relatively small number of loci identified from genome-wide association studies, and these genome-wide promoter capture Hi-C experiments. The use of pre-defined anchors by these methods suggests that the HPRep framework will be also applicable to these capture Hi-C methods, therefore these extensions are highly warranted but are beyond the scope of our current HPRep work.
In terms of computational efficiency, for the human PLAC-Seq data consisting of 11 samples, tuning the smoothing parameter and determining all 55 pairwise reproducibility metrics for all 22 autosomal chromosomes took 1 h and 5 min using a single core on a 2.50 GHz Intel processor with 4GB of RAM. One can choose to apply HPRep to one chromosome for almost the same result. Using the same data, HPRep takes 35 min to perform tuning and analysis solely on chromosome 1 using the same single core.