Comparative Study
Genome Biol. 2018 Dec 10;19(1):217.
doi: 10.1186/s13059-018-1596-9.

Comparison of computational methods for the identification of topologically associating domains

Marie Zufferey et al. Genome Biol.

Abstract

Background: Chromatin folding gives rise to structural elements among which are clusters of densely interacting DNA regions termed topologically associating domains (TADs). TADs have been characterized across multiple species, tissue types, and differentiation stages, sometimes in association with regulation of biological functions. The reliability and reproducibility of these findings are intrinsically related to the correct identification of these domains from high-throughput chromatin conformation capture (Hi-C) experiments.

Results: Here, we test and compare 22 computational methods to identify TADs across 20 different conditions. We find that TAD sizes and numbers vary significantly among callers and data resolutions, challenging the definition of an average TAD size, but strengthening the hypothesis that TADs are hierarchically organized domains, rather than disjoint structural elements. Performances of these methods differ based on data resolution and normalization strategy, but a core set of TAD callers consistently retrieve reproducible domains, even at low sequencing depths, that are enriched for TAD-associated biological features.

Conclusions: This study provides a reference for the analysis of chromatin domains from Hi-C experiments and useful guidelines for choosing a suitable approach based on the experimental design, available data, and biological question of interest.

Keywords: Hi-C; Method comparison; Topologically associating domain.


Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Identification of topologically associating domains (TADs) in chromosome 6 of the GM12878 cell line from ICE-normalized Hi-C data. a The performance of 22 TAD callers (listed on the left and right) was assessed using Hi-C data of chromosome 6 of the lymphoblastoid cell line GM12878. b Total number of TADs detected in the ICE-normalized Hi-C data of chromosome 6 at four different resolutions (10, 50, 100, 250 kb) by each of the 22 TAD callers. Color intensity is proportional to the number of TADs in log-scale, and gray boxes correspond to TAD callers that did not successfully identify TADs at a given resolution. c Mean size (measured in kb) of the TADs detected in the ICE-normalized Hi-C data of chromosome 6 at four different resolutions (10, 50, 100, 250 kb) by each of the 22 TAD callers. Color intensity is proportional to the mean size of the TADs in log-scale, and gray boxes correspond to TAD callers that did not successfully identify TADs at a given resolution. d, e Variation of the mean size of the TADs measured in kb (d) or in number of bins (e) across Hi-C matrix resolutions. Each line refers to a TAD caller (numbered as indicated at the bottom of the plot), and only TAD callers that successfully identified TADs at all four resolutions are shown. f Slopes derived from the linear fit of the curves in panel d (TAD size in kb across resolutions) versus slopes derived from the linear fit of the curves in panel e (TAD size in number of bins across resolutions). Dots are colored based on the general approach used by the tool (“linear score,” “clustering,” “statistical model,” “network features”; see Table 1). The dashed line indicates the linear fit.
Fig. 2
Concordance of TADs identified using different normalization strategies and resolutions. a Concordance between TAD partitions obtained with each TAD caller from ICE- and LGF-normalized matrices at four different resolutions (10, 50, 100, 250 kb) was assessed using the Measure of Concordance (MoC). MoC varies from 0 (absence of concordance, white) to 1 (full concordance, dark red). TAD callers are ranked based on the average of the MoC values across all resolutions (from highest to lowest). TAD callers that did not successfully identify TADs at a given resolution (gray boxes) were scored as 0 for the purpose of ranking by average MoC. b Concordance between TAD partitions obtained at different resolutions was assessed in a pairwise manner (e.g., 10 kb vs. 50 kb, 10 kb vs. 100 kb, etc.; results for the ICE data only are shown here) using the MoC. MoC varies from 0 (absence of concordance, white) to 1 (full concordance, dark red). TAD callers are ranked based on the average of the MoC values across all comparisons (from highest to lowest), and resolution comparisons are ordered according to the fold difference between matrix resolutions. TAD callers that did not successfully identify TADs at a given resolution (gray boxes) were scored as 0 for the purpose of ranking by average MoC. c Concordance across normalizations (average MoC obtained by comparing ICE and LGF partitions at different resolutions) versus concordance across resolutions (average MoC obtained by comparing Hi-C matrices of different resolutions for a fixed normalization). The dashed line indicates the linear fit.
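The caption relies on the MoC without defining it. As a rough illustration, the sketch below uses the overlap-based definition commonly given for this measure: the squared overlap of every pair of domains, normalized by the two domain sizes, summed and rescaled by the number of domains in each partition. The formula, the function name `moc`, and the representation of a partition as half-open `(start, end)` intervals in base pairs are assumptions for illustration, not taken from this record or from the authors' code.

```python
from math import sqrt


def moc(p, q):
    """Measure of Concordance between two TAD partitions (illustrative sketch).

    p, q: non-empty lists of (start, end) half-open intervals in bp.
    Returns a value in [0, 1] for non-overlapping domains within each partition.
    """
    if not p or not q:
        raise ValueError("both partitions must contain at least one TAD")
    # Convention assumed here: two single-domain partitions are fully concordant.
    if len(p) == 1 and len(q) == 1:
        return 1.0
    total = 0.0
    for p_start, p_end in p:
        for q_start, q_end in q:
            # Length of the intersection of the two domains (0 if disjoint).
            overlap = max(0, min(p_end, q_end) - max(p_start, q_start))
            total += overlap ** 2 / ((p_end - p_start) * (q_end - q_start))
    return (total - 1.0) / (sqrt(len(p) * len(q)) - 1.0)


if __name__ == "__main__":
    two_tads = [(0, 100), (100, 200)]
    one_tad = [(0, 200)]
    print(moc(two_tads, two_tads))  # identical partitions
    print(moc(two_tads, one_tad))   # one partition merges both domains
```

Under this definition, identical partitions score 1, while a partition compared against a single domain spanning it scores 0, which matches the 0-to-1 range described in the caption.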
Fig. 3
Concordance of TADs identified with different sequencing depths. Top panel (table). From the top: percentage of reads retained, actual number of reads (million), and estimated cost for generating the corresponding number of reads based on 150-bp paired-end sequencing in 2017. Bottom panel (heatmap). Measure of Concordance (MoC) values between TADs identified using 100% of the reads (rightmost column contoured in black) and TADs identified using a given percentage of reads (columns) for each TAD caller (rows). MoC values are color-coded with cold colors indicating low values (blue = 0) and warm colors indicating high values (red = 1). Percentages of reads used increase from left to right (see top panel for detailed values). Gray boxes correspond to TAD callers that did not successfully identify TADs with a given number of reads. Hi-C matrices were ICE-normalized and binned at 50 kb. TAD callers are ranked (from top to bottom) based on the minimal percentage of reads (values are reported on the right) required to identify TADs obtaining a MoC of at least 0.75 when compared to TADs identified using 100% of the reads
Fig. 4
Pairwise comparison of TADs identified by all TAD callers (ICE-normalized Hi-C data at 10-kb resolution). a Histograms of the numbers of unique TAD boundaries (start and end positions of each TAD) identified by a given number of TAD callers (on the rows) with an increasing tolerance radius ranging from 0 (± 0 kb) to 5 bins (± 50 kb). b Histograms of the numbers of TADs identified by a given number of TAD callers (on the rows) with an increasing tolerance radius ranging from 0 (± 0 kb) to 5 bins (± 50 kb) for each TAD boundary. c Fraction of TAD boundaries identified by each TAD caller (rows) that were also identified by 5 or fewer (blue), between 6 and 10 (green), between 11 and 15 (orange), and more than 15 (red) other TAD callers. d Fraction of TADs identified by each TAD caller (rows) that were also identified by 1 (blue), between 2 and 5 (green), between 6 and 10 (orange), and more than 10 (red) other TAD callers. e Average Measure of Concordance (MoC) between TADs identified by each caller (rows) versus TADs identified by all other callers. TAD callers are annotated by the general approach they adopt (colored dots) and ranked (from top to bottom) by decreasing average MoC. f Map of the t-SNE analysis performed on the Pearson’s correlation matrix of the matrix of pairwise MoC between TADs identified by all callers. TAD callers are annotated by the general approach they adopt (colored dots). Three clusters were manually annotated, and the mean MoC within and between each of these groups is reported. ClusterTAD and PSYCHIC were not included in any of the clusters, as their MoC values with each member or most members of the closest group were below the average of the group. g Boxplot of the number of TADs detected by the callers in each of the three groups identified from the t-SNE analysis.
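To make the tolerance radius in panels a and c concrete, here is a minimal sketch of one plausible way to count how many callers support each unique boundary. The matching rule (a caller supports a boundary if it has any boundary within ± radius), the function name `boundary_support`, and the input layout are all assumptions; the record above does not describe the exact procedure used in the study.

```python
def boundary_support(boundaries_by_caller, radius):
    """Count, for every unique boundary, how many callers report a boundary
    within +/- radius of it (positions and radius in the same unit, e.g. bp).

    boundaries_by_caller: dict mapping a caller name to a list of positions.
    Returns a dict mapping each unique boundary position to its caller count.
    """
    unique = sorted({b for bs in boundaries_by_caller.values() for b in bs})
    support = {}
    for boundary in unique:
        # A caller counts once, however many of its boundaries fall in range.
        support[boundary] = sum(
            any(abs(boundary - other) <= radius for other in bs)
            for bs in boundaries_by_caller.values()
        )
    return support


if __name__ == "__main__":
    # Hypothetical boundary positions (in kb) for three made-up callers.
    calls = {"A": [100, 500], "B": [110, 500], "C": [300]}
    print(boundary_support(calls, radius=0))
    print(boundary_support(calls, radius=10))
```

With a radius of 0 only exact matches count, so the boundaries at 100 and 110 are each supported by a single caller; widening the radius to one 10-kb bin lets them support each other.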
Fig. 5
Assessment of TAD calling with biological features. a Schematic representation of the structural proteins CTCF (orange), RAD21 (blue), and SMC3 (red) that are enriched at TAD boundaries. b, c Representative examples of ChIP-seq peak signals (average number of peaks in 5-kb intervals) for TopDom (b) and spectral (c). Peak signals for CTCF (orange line), RAD21 (blue line), and SMC3 (red line) are overlaid. d Fold change of structural protein peak signals at TAD boundaries for CTCF (orange bar), RAD21 (blue bar), or SMC3 (red bar). TAD callers are ordered from left to right by increasing average fold change of peak signals of the three proteins. The fold change was computed as the ratio of protein binding signal at TAD boundaries (upper left, red area) versus flanking regions (upper left, gray areas) minus 1. e Fold change of CTCF peak signal for boundaries called by at least 50% of the callers (red bars) or less than 50% of the callers (blue bars). f Mean fold change across callers of CTCF peak signal for shared (red track) and not shared (blue track) boundaries as a function of the minimum number of callers to define shared boundaries. Error areas correspond to one standard deviation. g CTCF fold change vs TAD mean size for each hierarchy level of the different callers. Hierarchical levels are labeled by increasing numbers (L1, …, Ln), with L1 being the level including TADs that do not contain nested TADs. Overall Pearson’s correlation = 0.37; Pearson’s correlation within the window [250–1250 kb] (gray area) is 0.05. The size of the dots is proportional to the number of TADs. h Schematic representation of H3K36me3 (green) or H3K27me3 (red) histone mark ChIP-seq read counts observed within TADs: TADs are typically enriched either for H3K36me3 marks (example on the left) or for H3K27me3 marks (example on the right) in a mutually exclusive manner. i For each TAD caller, fraction of TADs with a significant (high or low) H3K27me3/H3K36me3 log10-ratio (FDR < 0.1).
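The fold change in panel d is stated as the ratio of boundary signal to flanking signal, minus 1. A small sketch of that arithmetic follows. Which bins count as "boundary" and which as "flanking" is not specified in this record, so the index arguments, the function name `boundary_fold_change`, and the use of a simple mean over bins are assumptions made for illustration.

```python
from statistics import mean


def boundary_fold_change(profile, boundary_bins, flank_bins):
    """Fold change of a ChIP-seq peak signal at TAD boundaries.

    profile: average number of peaks per bin around boundaries.
    boundary_bins, flank_bins: indices into profile (hypothetical choice).
    Returns (mean boundary signal / mean flanking signal) - 1.
    """
    boundary_signal = mean(profile[i] for i in boundary_bins)
    flank_signal = mean(profile[i] for i in flank_bins)
    if flank_signal == 0:
        raise ValueError("flanking signal is zero; fold change is undefined")
    return boundary_signal / flank_signal - 1


if __name__ == "__main__":
    # Made-up profile: signal peaks in the two central bins.
    profile = [1, 1, 1, 3, 3, 1, 1, 1]
    print(boundary_fold_change(profile, boundary_bins=[3, 4], flank_bins=[0, 1, 6, 7]))
```

A value of 0 therefore means no enrichment over the flanks, and a value of 2 means the boundary signal is three times the flanking signal.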
Fig. 6
TAD caller performance across independent datasets. a, b Mean size of the TADs identified in the GM12878 Hi-C dataset (X-axis) versus the mean size of the TADs identified in four Hi-C datasets (Y-axis): GM12878 replicate 1 (dark blue), GM12878 replicate 2 (light blue), IMR90 (green), mouse cortical neurons (MNC, red). Results are shown for 6 TAD callers and Hi-C matrices using 10-kb bin size (a) and 50-kb bin size (b). For some of the TAD callers, jitter on the X-axis was added to improve visibility of the dots; the gray thin lines indicate the TAD mean size values in the GM12878 dataset. c Measure of Concordance (MoC) between TAD partitions determined by all pairs of callers (among the 6 tested; see panel a) in the GM12878 dataset (X-axis) and the other tested datasets (Y-axis). Results for each of these datasets are shown in a distinct subplot; datasets are labeled and color-coded. d, e CTCF binding fold change (d) and ratios between H3K27me3 and H3K36me3 within TADs (e) obtained by the TAD callers for the GM12878 dataset (X-axis) versus the corresponding values obtained in the four tested datasets (Y-axis, color-coded as in panels a and b). For some of the TAD callers, jitter on the X-axis was added to improve visibility of the dots; the gray thin lines indicate the TAD mean size values in the GM12878 dataset.
