Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses
Genes Dev. 2011 Sep 15;25(18):1915-27.
doi: 10.1101/gad.17446611. Epub 2011 Sep 2.


Moran N Cabili et al. Genes Dev. 2011.

Abstract

Large intergenic noncoding RNAs (lincRNAs) are emerging as key regulators of diverse cellular processes. Determining the function of individual lincRNAs remains a challenge. Recent advances in RNA sequencing (RNA-seq) and computational methods allow for an unprecedented analysis of such transcripts. Here, we present an integrative approach to define a reference catalog of >8000 human lincRNAs. Our catalog unifies previously existing annotation sources with transcripts we assembled from RNA-seq data collected from ∼4 billion RNA-seq reads across 24 tissues and cell types. We characterize each lincRNA by a panorama of >30 properties, including sequence, structural, transcriptional, and orthology features. We found that lincRNA expression is strikingly tissue-specific compared with coding genes, and that lincRNAs are typically coexpressed with their neighboring genes, albeit to an extent similar to that of pairs of neighboring protein-coding genes. We distinguish an additional subset of transcripts that have high evolutionary conservation but may include short ORFs and may serve as either lincRNAs or small peptides. Our integrated, comprehensive, yet conservative reference catalog of human lincRNAs reveals the global properties of lincRNAs and will facilitate experimental studies and further functional classification of these genes.

Figures

Figure 1.
lincRNA catalog generation. (A) An integrative computational pipeline to map, reconstruct, and determine the coding potential of lincRNAs based on known annotations and computational methods, and its application to human lincRNAs. The pipeline takes as input RNA-seq data (top, red) and existing annotation sources (top) (RefSeq NR, GENCODE, and UCSC annotation for humans). RNA-seq data are assembled by two assemblers: Cufflinks (gold) and Scripture (blue). Transcripts from all inputs are filtered by known annotations, presence of a Pfam domain, and positive coding potential. Transcripts annotated by RefSeq NR (*) were not filtered by the Pfam domain scan and the coding potential score. Finally, only multiexonic transcripts >200 base pairs (bp) are retained. (B) The number of lincRNA loci identified and their overlap with other annotation sources. The Venn diagram shows the overlap between transcripts from RNA-seq assembly (red), GENCODE and UCSC (purple), and RefSeq (green). (C) A representative example of a noncoding transcript that was reconstructed by Cufflinks and Scripture and was also curated in GENCODE and UCSC. (Top) The human genomic locus of the lincRNA (red) and its protein-coding neighbors. (Black, arrowhead) Direction of transcription. (Bottom) Magnified view of the lincRNA locus showing the coverage of RNA-seq reads from the testes (red) and the transcripts identified by each source (black). (iso) Isoform.
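The filtering steps described for panel A can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Transcript` fields, the function name `is_candidate_lincrna`, and the reading of "positive coding potential" as a score above zero are all hypothetical, and the 200-bp cutoff is assumed here to apply to the summed exon length.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    name: str
    exons: List[Tuple[int, int]]  # half-open (start, end) intervals
    overlaps_known_gene: bool     # hypothetical flag: hits a known annotation
    has_pfam_domain: bool         # hypothetical flag: Pfam domain scan result
    coding_score: float           # assumption: > 0 means positive coding potential
    from_refseq_nr: bool = False  # annotated by RefSeq NR

    @property
    def length(self) -> int:
        # Spliced length: sum of exon lengths (an assumption about the cutoff).
        return sum(end - start for start, end in self.exons)


def is_candidate_lincrna(t: Transcript, min_len: int = 200) -> bool:
    """Apply the caption's filters in order; True if the transcript is kept."""
    if t.overlaps_known_gene:
        return False
    # RefSeq NR transcripts skip the Pfam and coding-potential filters.
    if not t.from_refseq_nr and (t.has_pfam_domain or t.coding_score > 0):
        return False
    # Keep only multiexonic transcripts longer than min_len.
    return len(t.exons) > 1 and t.length > min_len
```

In practice the three flags would be computed upstream (annotation overlap, a domain scan, and a coding-potential scorer); here they are plain inputs so that the order of the filters stays visible.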
Figure 2.
Tissue specificity of lincRNAs and coding genes. (A) Abundance of 4273 lincRNA (rows, left panel) and 28,803 protein-coding genes (rows, right panel) across tissues (columns). Rows and columns are ordered based on a k-means clustering of lincRNAs and protein-coding genes. Color intensity represents the fractional density across the row of log-normalized FPKM counts as estimated by Cufflinks (saturating <4% of the top normalized expression values) (Supplemental Methods). (B) lincRNAs are more lowly expressed than protein-coding genes. Maximal expression abundance (log2-normalized FPKM counts as estimated by Cufflinks) of each lincRNA (left panel, blue) and coding (left panel, black) transcript across tissues. The right panel shows the expression levels of 1508 lincRNAs (top right) and 8906 coding genes (bottom right) that have a maximal expression level within the range bounded by the dashed segments in the left panel ([1.6–4.3] log2 FPKM) (see Supplemental Material). Heat maps are clustered and visualized as in A. (C) Tissue-specific expression. Shown are distributions of maximal tissue specificity scores calculated for each transcript across the tissues from the data in A for coding genes (black), lincRNAs (blue), and the 1508 highly expressed lincRNAs (pink; as in B). Examples of the tissue specificity score of coding genes with known tissue-specific patterns are marked by gray dots.
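The caption refers to a per-row fractional density and a maximal tissue specificity score without defining either here (details are in the Supplemental Methods). The sketch below shows one way such quantities could be computed, assuming the input is a vector of non-negative log-normalized expression values per tissue and assuming an entropy-based score built on the Jensen-Shannon divergence. The function names and the choice of score are illustrative assumptions, not a statement of the authors' exact method.

```python
import math
from typing import List, Sequence


def fractional_density(expr: Sequence[float]) -> List[float]:
    """Scale a row of non-negative expression values so that it sums to 1."""
    total = sum(expr)
    if total <= 0:
        raise ValueError("expression row must have a positive sum")
    return [x / total for x in expr]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def max_tissue_specificity(expr: Sequence[float]) -> float:
    """Assumed score: 1 - sqrt(JS) against each one-tissue profile, maximized."""
    p = fractional_density(expr)
    n = len(p)
    best = 0.0
    for t in range(n):
        ideal = [1.0 if i == t else 0.0 for i in range(n)]
        jsd = max(js_divergence(p, ideal), 0.0)  # guard against rounding below 0
        best = max(best, 1 - math.sqrt(jsd))
    return best
```

Under this definition a transcript expressed in a single tissue scores 1, and a transcript expressed evenly across all tissues scores lowest.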
Figure 3.
Chromosomal domains of gene expression. (A) Correlation of expression patterns between pairs of neighboring genes. Shown are distributions of Pearson correlation coefficients in expression levels across the tissues in Figure 2A between either 6524 pairs of coding gene neighbors (black), 497 pairs of lincRNAs and their neighboring coding gene (blue), or 10,000 random pairs of protein-coding genes (gray; null model) (*). (B) Shown are distributions of Pearson correlation coefficients calculated as in A, but only for 223 pairs of divergently transcribed pairs of lincRNA and protein-coding gene (blue) or 1575 pairs of divergently transcribed protein-coding genes (*). (C) Expression patterns of pairs of divergently expressed genes. Shown are expression patterns (presented as in Fig. 2A) for pairs of divergently transcribed lincRNA (rows, top left) and protein-coding genes (rows, top right), or pairs of divergently transcribed protein-coding genes (rows, bottom left and right panels) (*). (*) Only lincRNAs that have spliced read support when maximally expressed and that are not testes-specific are presented (refer to Supplemental Material, “Estimating expression abundance,” for further details).
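The comparison in panel A (neighboring pairs versus random pairs as a null model) can be illustrated with a short Python sketch. It assumes expression is held in a dictionary mapping a gene name to its per-tissue values; the names `neighbor_correlations` and `random_pair_correlations` are hypothetical, and the random sampling is only a stand-in for however the null pairs were actually drawn.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def neighbor_correlations(
    expr: Dict[str, Sequence[float]], pairs: List[Tuple[str, str]]
) -> List[float]:
    """Correlation across tissues for each listed pair of neighboring genes."""
    return [pearson(expr[a], expr[b]) for a, b in pairs]


def random_pair_correlations(
    expr: Dict[str, Sequence[float]], n_pairs: int, seed: int = 0
) -> List[float]:
    """Null model: correlations for randomly drawn pairs of distinct genes."""
    rng = random.Random(seed)
    genes = sorted(expr)
    out = []
    for _ in range(n_pairs):
        a, b = rng.sample(genes, 2)
        out.append(pearson(expr[a], expr[b]))
    return out
```

Comparing the two resulting distributions (for example by their medians or with a rank-based test) is what tells you whether neighbors are more coexpressed than expected by chance.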
Figure 4.
Orthologous transcripts of human lincRNAs in mammals and other vertebrates. (A) A human lincRNA with syntenic trans-map mappings to mice and cows. Shown are UCSC browser (Kent et al. 2002) tracks showing two isoforms of the human lincRNA (black, top tracks), the mouse and cow transcripts (green, middle tracks) that were trans-mapped to their human locus, and the base-wise conservation calculated by PhyloP at this locus (red–blue, bottom track). (B) Syntenic trans-mapping to XIST. Tracks presented as in A. (C) Syntenic trans-mapping to p53. (D) Species distribution of 993 human lincRNAs with trans-mapped orthologs (columns) and the species in which the trans-mapped transcripts were found (rows, purple). (E) Characteristics of trans-mapping to human lincRNAs. Box plots of the fraction of the human lincRNA transcript that is aligned to an ortholog (first and second boxes) and the fraction of the lincRNA genomic locus covered by the syntenic mapping of the ortholog (third and fourth boxes) for all trans-mapped lincRNAs (first and third boxes) or only for those lincRNAs that were mapped to mouse coding transcripts (second and fourth boxes). The gray square, star, and circle represent XIST, HOTAIR, and the lincRNA shown in A, respectively. (F) Distribution of the percentage of identical bases across the FSA (Bradley et al. 2009) pairwise alignments between human and mouse trans-mapped transcript pairs. (Blue) lincRNAs and their mouse orthologs; (black) human coding genes and their mouse orthologs; (green) randomly selected 1-kb human and mouse syntenic blocks; (gray) random pairing of human lincRNAs and mouse transcripts (from the set marked in blue). All statistics presented in this figure were calculated at the locus level (i.e., each lincRNA locus was counted once, rather than once per isoform).
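The percentage of identical bases in panel F can be illustrated with a small Python function that works on two already-aligned sequences of equal length, with "-" marking gaps. The caption does not state which columns form the denominator, so this sketch assumes every alignment column is counted except columns that are gaps in both sequences; `percent_identity` is a hypothetical helper, not part of FSA.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences have the same base.

    Assumption: columns that are gaps in both rows are ignored; a gap opposite
    a base counts as a non-identical column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue
        columns += 1
        if a == b:
            identical += 1
    if columns == 0:
        return 0.0
    return 100.0 * identical / columns
```

Other conventions (for example, dividing by the length of the shorter sequence or by ungapped columns only) give higher values for the same alignment, so the chosen denominator should always be reported alongside the number.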
Figure 5.
Novel transcripts with potential coding capacity. (A) Characteristics of transcripts of uncertain coding potential (TUCP). Shown is a Venn diagram of the 2305 TUCP set transcripts annotated as pseudogenes (purple), containing a Pfam domain (green), having a PhyloCSF score higher than the pipeline's set criterion (pink), or combinations thereof. (B) Expression levels of TUCP transcripts. Shown are distributions of maximal expression abundance (log-normalized FPKM counts as estimated by Cufflinks) in TUCP (red), stringent set lincRNA (blue), and coding (black) transcripts. (C) Tissue specificity of TUCP transcripts. Shown are distributions of maximal tissue specificity scores calculated for each transcript in the TUCP set (red), stringent lincRNA set (blue), coding (black), and higher-expressed lincRNAs (magenta) (transcripts defined as in Fig. 2C). (D) PhyloCSF scores of TUCP transcripts. Shown is the distribution of PhyloCSF scores of the TUCP transcripts (red), all noncoding genes in RefSeq (blue), or the subset of RefSeq classified as lincRNA by our pipeline (light blue). (Inset) The corresponding distribution for protein-coding genes, which spans a much wider range of positive scores. (E,F) Putative ORFs in TUCP transcripts. Shown are scatter plots of the fraction of each transcript spanned by an ORF (E; X-axis) or of the ORF size (F, in nucleotides; X-axis) versus the PhyloCSF score of that ORF (Y-axis), for the 1404 TUCP transcripts that had a PhyloCSF score >0. (G) Orthologs for TUCP transcripts. Shown are 838 TUCP transcripts (columns) with trans-mapped orthologs and the species in which the trans-mapped transcripts were found (rows, purple).
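The two X-axis quantities in panels E and F (ORF size in nucleotides and the fraction of the transcript spanned by the ORF) can be sketched as below. The caption does not say how ORFs were called, so this example assumes the simplest convention: the longest ATG-to-stop stretch in any of the three forward reading frames, with the stop codon included in the ORF length. The function names are hypothetical.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> Tuple[int, int]:
    """Return half-open (start, end) of the longest forward-strand ORF.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    Returns (0, 0) when no complete ORF is found.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > best[1] - best[0]:
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def orf_size_and_fraction(seq: str) -> Tuple[int, float]:
    """ORF size in nucleotides and the fraction of the transcript it spans."""
    start, end = longest_orf(seq)
    size = end - start
    fraction = size / len(seq) if seq else 0.0
    return size, fraction
```

For example, in the sequence `CCATGAAATAGCC` the only complete ORF is `ATGAAATAG` at positions 2 to 11, which is 9 nucleotides and spans 9/13 of the transcript.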

References

    1. Amaral PP, Clark MB, Gascoigne DK, Dinger ME, Mattick JS 2010. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Res 39: D146–D151 doi: 10.1093/nar/gkq1138
    2. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, et al. 2004. Global identification of human transcribed sequences with genome tiling arrays. Science 306: 2242–2246
    3. Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816
    4. Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L 2009. Fast statistical alignment. PLoS Comput Biol 5: e1000392 doi: 10.1371/journal.pcbi.1000392
    5. Brown CJ, Hendrich BD, Rupert JL, Lafreniere RG, Xing Y, Lawrence J, Willard HF 1992. The human XIST gene: analysis of a 17 kb inactive X-specific RNA that contains conserved repeats and is highly localized within the nucleus. Cell 71: 527–542