Integrating massive RNA-seq data to elucidate transcriptome dynamics in Drosophila melanogaster - PMC Skip to main content
Briefings in Bioinformatics logoLink to Briefings in Bioinformatics
. 2023 May 25;24(4):bbad177. doi: 10.1093/bib/bbad177

Integrating massive RNA-seq data to elucidate transcriptome dynamics in Drosophila melanogaster

Sheng Hu Qian 1,#, Meng-Wei Shi 2,#, Dan-Yang Wang 3,#, Justin M Fear 4,#, Lu Chen 5, Yi-Xuan Tu 6, Hong-Shan Liu 7, Yuan Zhang 8, Shuai-Jie Zhang 9, Shan-Shan Yu 10, Brian Oliver 11,, Zhen-Xia Chen 12,13,14,15,16,17,
PMCID: PMC10505420  PMID: 37232385

Abstract

The volume of ribonucleic acid (RNA)-seq data has increased exponentially, providing numerous new insights into various biological processes. However, due to significant practical challenges, such as data heterogeneity, it is still difficult to ensure the quality of these data when integrated. Although some quality control methods have been developed, sample consistency is rarely considered and these methods are susceptible to artificial factors. Here, we developed MassiveQC, an unsupervised machine learning-based approach, to automatically download and filter large-scale high-throughput data. In addition to the read quality used in other tools, MassiveQC also uses the alignment and expression quality as model features. Meanwhile, it is user-friendly since the cutoff is generated from self-reporting and is applicable to multimodal data. To explore its value, we applied MassiveQC to Drosophila RNA-seq data and generated a comprehensive transcriptome atlas across 28 tissues from embryogenesis to adulthood. We systematically characterized fly gene expression dynamics and found that genes with high expression dynamics were likely to be evolutionarily young and expressed at late developmental stages, exhibiting high nonsynonymous substitution rates and low phenotypic severity, and they were involved in simple regulatory programs. We also discovered that human and Drosophila had strong positive correlations in gene expression in orthologous organs, revealing the great potential of the Drosophila system for studying human development and disease.

Keywords: sequence read archive, gene expression dynamics, computational methods, data integration and quality control, massively parallel sequencing

INTRODUCTION

A thoroughly understood transcriptome is the basis of biological studies of almost all species [1–3]. The D. melanogaster has approximately 65% of human orthologous genes associated with disease and is therefore attracting growing attention as a model organism [4–6]. Some previous studies have investigated the gene expression throughout the fly life cycle in specific tissues [7–10]. Recent work has applied ribonucleic acid (RNA)-seq technology to uncover the diversity, dynamics and genetic basis of the Drosophila transcriptome [3, 11–13]. Similarly, emerging single-cell or single-nucleus studies provide valuable data on gene expression in different types of cells [14–16].

While a significant amount of transcriptomic data is publicly available, there remains a need for systematic integration of both new and existing data remains. Meta-analysis offers the potential to identify rare transcripts and facilitate comprehensive and multidimensional characterization of gene expression patterns. The value of reusing datasets is recognized in the FAIR Data Principles (Findable, Accessible, Interoperable and Reusable) [17], which are increasingly being implemented as policy. However, the challenge of ensuring data quality and adherence to FAIR standards remains due to the heterogeneity of data resulting from the specific needs of focused studies and procedures in different laboratories, as well as poorly organized metadata. Practical, rather than intellectual, challenges are the main barriers to the comprehensive analysis of large gene expression datasets from both model organisms and humans.

It is crucial to ensure the data quality when integrating large-scale public sequencing data. However, commonly used data quality control tools, such as FastQC, fastp [18] and MultiQC [19], only examine the quality of the sequencing sample itself, such as the sequencing quality of raw read and contamination from other species. They did not take into account the differences between samples, so the error caused by many other factors, such as mis-annotation of sequencing type, would be ignored. Therefore, some software for quality control of high-volume data, such as RNA-SeQC2 [20], RNA-QC-chain [21] and Qualimap [22], extracted some features from samples to characterize the quality of alignment and expression. However, these software products neither integrate these features nor provide criteria for subsequent filtering. While some studies have used expression correlation as an indicator to exclude outliers [20, 23], this approach would result in significant loss of information and is not applicable to large-scale heterogeneous data (e.g. different tissues and developmental stages).

Based on the unsupervised machine learning algorithm, Isolation Forest (iForest) [24], we developed a tool named MassiveQC to perform quality control and outlier detection on large amounts of public data. In addition to checking the sequencing quality of raw reads, MassiveQC extracts 21 important features from the sequencing samples and generates an iForest model based on the features of all samples for outlier detection. Using MassiveQC, we collected and filtered 62,927 Drosophila RNA-seq samples from Sequence Read Archive (SRA), resulting in 12 K high-quality samples for further analysis. To highlight the value of this exercise, we investigated the gene expression dynamics and systematically characterized the correlations between expression dynamics and other amenable high-throughput measures such as evolutionary constraint, regulatory input and functional categories. Finally, we analyzed the conservation in gene content and expression profile between human and Drosophila, providing a greatly improved view of the commonalities between expressions in these two organisms. The striking conservation of expression patterns by organ provides the basis for the high translatability of disease gene research between human and Drosophila. The integration of massive Drosophila RNA-seq data not only serves the model organism community but also provides a test of our methods for use in other organisms.

MATERIALS AND METHODS

SraMongo: downloading NCBI databases

We developed a Python package (SraMongo, https://sramongo.readthedocs.io/en/latest/) to query, download, parse and store records from the SRA. SraMongo uses the Entrez utility API to query Entrez databases and download their XML records. Each sample is an individual sequencing library assigned with a unique SRA accession number (SRX/ERX/DRX) that is associated with one or more sequencing runs (SRR/ERR/DRR), which constitutes technical replicates. Although only 24 K samples were analyzed, SraMongo is capable of handling larger numbers of datasets. Each XML record is parsed into a local document database (MongoDB Community server) and indexed by experiment accession (SRX/ERX/DRX). We used this local database to generate a processing queue, explore technical and biological metadata and aggregate metadata for use in downstream analyses.

MassiveQC: quality control of massive high-throughput sequencing data

MassiveQC obtained the download URL of each run (SRR) from the ENA Portal API (https://www.ebi.ac.uk/ena/portal/api/filereport) and used the command ‘ascp’ or ‘wget’ command to download the raw sequencing file (.fastq.gz). The downloaded files were examined by checking the sequencing reads. Single-end runs with at least 1000 reads and an average read length > 10 bp were considered successful. Meanwhile, pair-end runs with both R1 and R2 having at least 1000 reads and an average read length > 10 bp were considered successful. For runs where either R1 or R2 failed, the workflow kept the passing FASTQ files (R1 or R2) and re-annotated the library as single-ended. Runs that completely failed these criteria were flagged as having download issues and removed from further processing. Additionally, runs with malformed quality scores or sequenced using the ABI Solid color space were similarly flagged and removed.

Following ‘Check reads’, contamination levels were quantified using FastQ Screen [25] and adapter and poly(T/A) sequences were trimmed using Atropos. Reads shorter than 25 bp after trimming were removed. The runs were aligned to the reference genome using HISAT2 with the parameter‘—max_intronlen 300000—known-splicesite-file <splicesite-file>’. Additional alignment statistics were then generated using ‘samtools stats’ [26], ‘bamtools stats’ [27] and ‘CollectRNASeqMetrics’ (https://gatk.broadinstitute.org/hc/en-us). Library complexity was estimated using ‘MarkDuplicates’. Library strandedness was determined by calculating the percentage of reads mapping to each strand using ‘CollectRNASeqMetrics’. A run was assigned as forward stranded if 75% of the reads mapped to the same strand as the gene model, reverse stranded if 75% of the reads mapped to the opposite strand as the gene model or unstranded. Subsequently, MassiveQC generated a set of genomic counts (genic, intergenic, junctions) using ‘featureCounts’ [28]. After ‘Quality control’ and ‘Alignment’, we trained an iForest model (scikit-learn; v0.22.1) using the 21 features generated by the above process and detected outliers based on this model.

iForest model construction and evaluation in D. melanogaster RNA-Seq

The Library Strategy information obtained by SraMongo describes which sequencing method was used (e.g. RNA-Seq, EST, WGS, ChIP-Seq). Currently, the SRA allows for 36 different terms. Here we focused on the most common Library Strategy (RNA-Seq). In MassiveQC, we used an anomaly detection method (iForest) to identify and remove outlier RNA-Seq samples. Initially, the RNA-Seq samples were segregated into training and testing datasets, with the former being used to train the iForest model by utilizing 100 decision trees. We validated the RNA-Seq iForest model using RNA-Seq test data to generate a test dataset consisting of different proportions of EST, WGS or ChIP-Seq runs. To better understand why iForest classified a sample as an outlier, we used Shapley Values to help interpret the model [29]. Briefly, given the RNA-Seq iForest model, local Shapley values were calculated for each sample describing how each feature contributed to moving the sample toward or away from the mean.

Alignment workflow

The alignment workflow generated experiment level (SRX) deliverables, including genomic feature counts and coverage tracks (BigWigs). The workflow began by removing adapter sequences, poly(T/A) sequences and low-quality bases (−q 20) using Atropos [30]. Reads shorter than 25 bp after trimming are excluded. Runs were then aligned to the D. melanogaster Release 6 plus ISO1 MT (GCA_000001215.4) using ‘hisat2—max_intronlen 300000—known-splicesite-file <r6.11 GTF> —rna-strandness <forward or reverse>’. The library strandedness information was obtained from the pre-alignment workflow. Multi-mapping reads and low-quality alignments were removed using ‘samtools -q 20’, and all runs from a library (i.e. SRRs from an SRX) were merged using ‘samtools merge’. We also performed a post hoc analysis using correlation to flag libraries whose runs did not look like technical replicates. The workflow generated genomic coverage counts (genic, intergenic and junction) using ‘featureCounts’ with the options ‘-p -P -C -J’ for pair-end or ‘-J’ for single-end libraries. Stranded genome browser tracks were then generated using ‘bamCoverage—binSize 1—normalizeTo1x 129,000,000—ignoreForNormalization chrX—outFileFormat bedgraph—filterRNAstrand <forward or reverse>’ [31].

Normalization of biological metadata

The normalization of biological metadata was carried out for RNA-Seq samples, by creating a set of keywords for each sample using natural language processing. Briefly, for each sample, we tokenized all author biological metadata, including sample titles and used term frequency inter-document frequency (TFIDF), to identify and weight keywords using NLTK [32]. Next, we aimed to create a set of normalized biological attributes for the four main attribute classes (sex, developmental stage, tissue and cell type). We aggregated author-supplied attributes related to these categories and harmonized class labels by hand (e.g. developmental stage versus dev_stage). For each attribute class, we manually mapped values to the FlyBase controlled vocabulary and imputed missing values by searching the sample keywords.

Analysis of gene expression dynamics

All analyses in this study used FPKM (Fragments per Kilobase of transcript per Million mapped reads) to quantify gene expression. We applied the R package sva to remove batch effects among samples and estimate gene expression variations [33]. Genes with FPKM > 0.1 in at least one sample were retained for further analysis. The evolutionary age of genes was retrieved from GenOrigin [34] and GenTree [35]. The estimates of dN/dS values were provided by flyDIVaS [36]. The promoter of each gene was defined as the region 500 bp upstream to 500 bp downstream of the transcription start site. We measured sequence conservation of gene promoters using phastCons and PhyloP, with basewise conservation scores acquired from the UCSC Genome Browser [37]. Transcription factor (TF) binding information was downloaded from GTRD, a database of transcription factor binding sites generated from ChIP-seq data [38]. Bedtools intersect was used to identify the overlap between of TF binding sites and promoter regions [39]. The number of types of different TFs binding in a region was indicative of the TF diversity. The number and diversity of TFs represented the complexity of transcription regulation. GO enrichment analysis was performed using clusterProfiler [40] based on the background of all annotated genes.

RESULTS

MassiveQC detects outliers based on isolated forests using sample self-reporting

When processing a large-scale public sequencing data from SRA, it is essential to capture the data quality and technical experimental details [17]. The SRA stores metadata in addition to RNA-Seq reads. We found that SRA samples were sometimes mis-annotated, and more often, that important information was scattered across fields, databases and subsections of the underlying manuscripts. There was no central source for this information, which made it difficult for us to use a large amount of public data for analysis. Almost all quality control software considered only sequencing quality of the samples and did not address the above issues [18, 19]. Fortunately, many of the critical metadata affect the data and are thus self-reporting. Therefore, we developed a machine learning-based software, MassiveQC, that takes advantage of this self-reporting to (1) eliminate poor quality samples, samples collected for special purposes (e.g. 5′ or 3′ end analysis) and mis-annotated samples (e.g. DNA-seq labeled as RNA-seq) (Supplementary Figure S1) and (2) to separate a major type of sequencing samples of interest (RNA-seq) from a few other types of contaminated sequencing samples (e.g. ChIP-seq) using iForest [24].

To ensure the quality of the sequencing and to gain more information from their self-reporting, MassiveQC used a variety of methods to extract the features of sequencing samples (Figure 1). First, MassiveQC checked and filtered the reads in each FASTQ file after the download (Figure 1A) and integrated FastQ Screen [25] and Atropos [30] to detect contamination and trim adapters, respectively (Figure 1B). Meanwhile, the features of FASTQ were extracted from the results (‘FASTQ features’, Figure 1B and D). Then, the processed FASTQs with high quality were aligned to the reference genome using HISAT2 [41]. Picard and Bamtools were employed to extract alignment statistics, and count statistics were summarized from the featureCounts results (Figure 1C and D). Finally, MassiveQC obtained 21 features from each sample and used these features to build an RNA-Seq iForest model and to identify outliers among all samples (Figure 1E) (see Methods). Given that MassiveQC identifies outliers depending only on read features of samples provided by the users, it can be applied to any other species.

Figure 1.

Figure 1

Overview of the MassiveQC. (A) Download: Use the pysradb package to search the download URL of the SRA accession, and use aspera or wget to download the raw FASTQs. (B) Quality Control.Check sequencing reads, detect FASTQ contamination and trim reads adapters with ‘FastQ Screen’ and ‘Atropos’. By the way, some FASTQ features are extracted from the result, like ‘libsize’ and ‘Reads that were too short’ (C) Alignment. Reads were mapped using Hisat2 and quantified with FeatureCounts. ‘bamtools’, ‘samtools’ and ‘picard’ are used to extract alignment features from bam files. (D) Features Selection. Organize the features of all samples and select the 21 most important features for downstream analysis. (E) Isolation Forest. An outlier detection model was trained using selected features from all samples.

Compared to other tools designed for data quality control, such as FastQC, RNA-SeQC2 and RNA-QC-chain, the advantages of MassiveQC are 1) it can integrate features from multiple dimensions (read, alignment and expression) to make the results more accurate; 2) it generates cutoffs based on the self-reporting of all samples, which is not affected by artificial factors and 3) isolated forests are robust to multi-modal distributions and are applicable to publicly available heterogeneous data (e.g. different tissues and developmental stages) (Table 1).

Table 1.

Comparison of quality control tools

FastQC fastp RNA-SeQC2 RNA-QC-chain Qualimap2 MassiveQC
Read QC Yes Yes Yes Yes Yes Yes
Alignment QC No No Yes Yes Yes Yes
Expression QC No No Yes Yes Yes Yes
Identification method No No No No No iForest
Cutoff Manual Manual Manual Manual Manual Self-reporting

To evaluate the performance of MassiveQC, we used the large RNA-Seq datasets with 952 runs in the SRA project SRP045429 as an example [42]. The median size of the runs is 346.4 Mb (Figure 2B) and the median running time is 11 minutes (Figure 2A). The above tasks were performed with 8 threads per run on a PBS cluster with 40 nodes whose CPU is Intel(R) Xeon(R) CPU E5-2690 v4. Each node has 256G running memory and 28 threads. Each sample was processed by 8 threads. We then collected 30 samples of other sequencing types as contaminating samples (Supplementary Table S1) to examine the accuracy of MassiveQC. We randomly added k samples (k = 5 to 30) from the 30 contaminating samples to the SRP045429 dataset and used MassiveQC to identify outliers. After 100 simulations, we calculated the mean probability that a contaminating sample was identified as an outlier. In all simulations, contaminating samples were identified as outliers (Figure 2C). In addition, we performed 100 simulations with n normal samples (n = 100 to 800) and 30 contaminated samples to evaluate the stability of the model in the samples of different sizes (Figure 2D). When samples of the target type accounted for more than 90% of the total samples, 99.8% of all contaminating samples would be identified as outliers.

Figure 2.

Figure 2

Performance of MassiveQC. The distribution of running time (A) and sample size (B) for 952 runs from GSE60314. (C) The average accuracy of outlier detection with k contaminated samples (k = 5 to 30). The accuracy is calculated by dividing the number of contaminated samples identified as outliers by the total number of contaminated samples. The y-axis represents the average accuracy of 100 simulations with different numbers of contaminated samples. (D) The average accuracy in different numbers of normal samples.

A pipeline for collection and filtration of RNA-seq data from SRA using MassiveQC

Next, we built a pipeline based on MassiveQC for integrating massive RNA-seq data to generate a comprehensive transcriptome atlas of D. melanogaster, by curating the SRA [43], the well-known global repository of raw sequencing data (Figure 3). In this pipeline, we first queried the SRA for ‘D. melanogaster’ [organism]’ and downloaded all associated metadata from the NCBI BioProject, BioSample, Gene Expression Omnibus (GEO) and PubMed repositories using SraMongo which we developed to store and organize SRA metadata (Figure 3A). As of March 2020, there were 62,927 D. melanogaster samples (Supplementary Figure S2). We individually downloaded and processed all 73,539 runs using MassiveQC, except for 107 runs that could not be downloaded and 4923 ABI Solid runs that we excluded to streamline downstream processing (Figure 3B). These results are available on GEO (this study, GSE117217) and the composite tracks are available on FlyBase (‘Oliver lab SRA Aggregated RNA-Seq’) [44].

Figure 3.

Figure 3

Curation of the Drosophila Sequence Read Archive (SRA). (A) Schematic of metadata download. SraMongo queries all D. melanogaster samples from the SRA and downloads all associated metadata from multiple NIH databases to a local MongoDB database. (B) Schematic of MassiveQC. This tool downloads FASTQs from the SRA and processes the files using a number of third-party tools. Generated data is used to select well-behaving RNA-Seq samples. (C) Schematic of the alignment workflow. This workflow processes selected RNA-Seq samples using third-party tools to align reads, merge technical replicates (SRRs), and generate final counts tables and coverage tracks. The resulting files are uploaded to the Gene Expression Omnibus (GEO). (D) Schematic of the metadata workflow. This workflow normalizes sample descriptions to a set of controlled vocabularies and uses gene expression to flag outlying samples. Updated metadata is uploaded to GEO. (E) Schematic of outlier detection using Isolation Forest. An outlier detection model was trained using samples annotated as RNA-Seq in the SRA (Library Strategy) and selected features from the pre-alignment workflow. Isolation Forest builds a set of decision trees while randomly selecting a subset of samples and features. Each decision (node or dashed line) splits the samples into various leaves. An inlier requires many splits to be isolated, while an outlier is isolated with few splits.

Then, we explored the marginal contributions of feature importance using Shapley values (Supplementary Figure S1C and D) [29] and found that the most important indicators of library mRNA quality were ‘% rRNA reads’ and ‘% alignment with transcripts’, and that metadata mis-annotation was indicated by ‘% mRNA’ and ‘Mid Gene Body Coverage’. To test the model, we mixed different proportions of samples from other library strategies [such as expressed sequence tags (EST) and whole genome sequencing (WGS)] (Supplementary Figure S1B) with RNA-Seq samples. Our data showed that WGS and ChIP-Seq libraries (DNA-Seq samples) were consistently identified as RNA-Seq outliers by iForest, confirming the ability of iForest to identify mis-annotations (Figure 3E). Using this method, we removed 875 RNA-Seq outliers (Supplementary Figure S1A). We also excluded libraries with low complexity (< 5 K expressed genes with fragments per kilobase of transcript per million mapped reads (FPKM) > 1) or low read depth (< 2 million mapped reads) (Supplementary Figure S3). As a result of these processes, we retained 12,049 high-quality RNA-seq samples for further analysis. After filtering, we used more detailed parameters to carry out the alignment workflow and generate experimental-level (SRX) deliverables including genomic feature counts and coverage tracks (BigWigs). In addition, we normalized metadata from SRA to facilitate downstream analysis. We identified and weighted keywords using NLTK [32] and manually aggregated them into 4 major attribute classes (sex, developmental stage, tissue and cell type) and mapped them to FlyBase-controlled vocabularies (Figure 3D). The metadata are also available at GEO (GSE117217).

Drosophila transcriptome atlas based on 12,049 high-quality public RNA-seq samples

A large collection of samples also facilitates the grouping of results by fly anatomy. We manually annotated the RNA samples by organ, developmental stage and sex using available metadata and explored the expression variation among samples to identify outliers. When parsed by organ anatomy (Figure 4A), we found 28 sources, with the two largest groups being whole body (n = 5225) and head (n = 2436, including the brain). We also parsed by five developmental stages, including egg, embryo, larva, pupa and adult. The adult stage has the largest sample size (n = 7835). Given that the technical biases among our data could confound biological outcomes [45], we applied the R package sva to remove batch effects between samples (Supplementary Figure S4). The large-scale dataset allows us to examine gene expression patterns in depth and systematically. We first examined the breadth of gene expression, defined as the frequency of samples in which a gene was expressed and revealed that 6455 (39.06%) genes were expressed (FPKM >1) in more than 50% samples (only protein-coding genes, lncRNAs and pseudogenes were considered, same below). For example, eukaryotic translation initiation factor 1A (eIF1A), required for mRNA translation [46], was expressed in almost all samples. Meanwhile, we found that 4559 (27.59%) genes were expressed in less than 10% samples (Supplementary Figure S5), suggesting their distinct gene expression spectrums. Another example was the olfactory receptors (ORs). The olfactory system of Drosophila contains 61 OR genes [47] and the saturation curve showed that the more head tissues, the higher the detection rates of OR genes, with Or43b detected in 93.44% of all head samples (2186) (Supplementary Figure S6).

Figure 4.

Figure 4

Overview of Drosophila transcriptome data. (A) Schematic overview of the dataset. The two-layer ring diagram shows the distribution of filtered samples across tissues (inner) and stages (outer), with sample sizes indicated in the legend. Due to the different sample contents, we manually merge different parts of the same organization and merge two similar organizations to facilitate analysis. It is worth noting that the brain is classified into the head, and the sperm and accessory glands are classified into the testes. (BD) Uniform manifold approximation and projection (UMAP) of 12,049 samples colored by tissue (B), sex (C), and developmental stage (D).

It is not easy to visualize expression distances between 12 K samples. Therefore, we next subjected the samples to low-dimensional embedding and identified clusters [48]. The 2D Uniform Manifold Approximation and Projection (UMAP) plot showed that gene expressions of functionally relevant samples (organs, tissues or cells) were clustered together, for example, ovary and egg, as well as midgut and digestive system (Figure 4B), suggesting the common gene expression programs. Likewise, tissues were also grouped by developmental stage. Many adult samples were sexed and they clearly separated by sex in the UMAPs (Figure 4C and D), which was consistent with previous studies that sex dimorphism is fully manifested at the adult stage [3, 16, 49–51].

Dynamic pattern of gene expression

We further applied the two-dimensional uniform manifold approximation and projection (UMAP) implemented in Seurat to calculate the standardized variance of each expressed gene [48]. Among the 16,679 genes, roX2, a long non-coding RNA that is functionally redundant with roX1 and essential for the assembly of the dosage compensation complex [52, 53] exhibited the highest expression variance (Figure 5A). Consistently, the expression of roX2 was dynamic during development and was limited to a few tissues based on FlyBase expression data [50]. Furthermore, our observation was also consistent with a recent study that roX2 was the top male-biased specific gene at the single cell level [16], confirming our evaluation of gene expression dynamics. We also introduced another metric tau [54] to evaluate the expression specificity of genes and observed the strong consistency between gene expression dynamics and expression specificity (Spearman correlation, Rho = 0.77, P = 0) (Supplementary Figure S7). We then ranked all genes based on their expression variance and showed the expression pattern of the top dynamic genes (5%) and the bottom dynamic genes (also known as stably expressed genes) (5%). As expected, the most dynamically expressed genes were expressed in one or limited tissues (Figure 5B). For example, a set of genes was preferentially expressed in larval anal pad (box1) and another set of genes exhibited a higher expression level in the adult retina than in other tissues (box2). In contrast, the least dynamically expressed genes were found in a wide range of organs or tissues (Figure 5C).

Figure 5.

Figure 5

Expression variation of Drosophila genes. (A) Standardized variance and average expression level of each gene. Each point represents one gene. (B) Heatmap shows the expression pattern of dynamically expressed genes (C) Same as in (B) while stably expressed genes were used. (D) UMAP plots of the normalized expression of Lsm11 (FBgn0033450). (E) The t-distributed stochastic neighbor embedding (t-SNE) plots of the normalized expression of Lsm11 across fly tissues (Fly Cell Atlas data), which is generated by SCope (https://scope.aertslab.org/#/FlyCellAtlas). The annotation of tissues and cell types could be found in the article [16] (Figure 1G and Supplementary Figure S18). (F) UMAP plots of the normalized expression of eEF2 (FBgn0000559). (G) Same as in (E) with the gene eEF2 investigated.

For an intuitive understanding of their distinct expression dynamics, we displayed the expression profile of Lsm11, which was unevenly distributed in the ovary samples and expressed at the adult stage (Figure 5D). We also confirmed and extended these observations using single cell data from Fly Cell Atlas [16]. Once again, and congruently, Lsm11 was found to be expressed in cells of fat body and ovary (Figure 5E). Interestingly, when we further examined the expression pattern of Lsm11 in different cell types of ovary, we observed that it was preferentially expressed in young germ cell, 16-cell germline cyst in germarium region and oviduct (Supplementary Figure S8A), showing remarkable expression dynamics at single-cell resolution.

However, eukaryotic translation elongation factor 2 (eEF2) was widely expressed in nearly all samples (12,041 out of 12,049) (Figure 5F). Similarly, its expression profile at the single-cell level recapitulated the above observation (Figure 5G). We also showed the expression pattern of eEF2 in different cell types of ovary and found that it was expressed in the majority of cells (Supplementary Figure S8B). Considering that the ubiquitously expressed genes are associated with frequent protein-DNA-binding events [55], we further examined whether these genes were involved in complex expression regulation.

Regulation of genes with different expression dynamics

Gene expression in organism is modulated by spatial–temporally dynamic regulatory programs [56]. To further reveal whether the genes with dynamic expression have distinct evolutionary and regulatory patterns, we first investigated the association of gene types with expression dynamics at the basic level of encoded gene product. We observed that protein-coding genes exhibited the lowest expression variation, followed by lncRNAs and pseudogenes (based on FlyBase annotation) (Wilcoxon test) (Figure 6A), consistent with previous reports that lncRNAs and pseudogenes are expressed in a highly tissue- or condition-specific manner [57]. We also looked for correlations with the time when a gene first arose in the lineage. Interestingly, the origination time of genes (gene age) was negatively correlated with its expression dynamics (Spearman correlation, Rho = −0.43, P = 0), with younger genes displaying higher expression variation (Figure 6B, Supplementary Figure S9). Previous studies have shown that young genes tend to have a testis-specific expression and subsequently acquire broader expression (Supplementary Figure S10) [34, 58]. Our work showed that this gradual repurposing of genes is a general trend.

Figure 6.

Figure 6

Relationships between expression dynamics and evolutionary constraints and regulations. (A) Expression variation of protein-coding gene, lncRNA, and pseudogene. Pseudogene is based on flybase annotation. Wilcoxon test is used to evaluate the statistical significance. (B) Relationship between gene expression dynamics and their ages (origination times). The age of a gene is negatively correlated with its expression dynamics. Myr, million years. The evolutionary age of genes was retrieved from GenOrigin, a comprehensive protein-coding gene origination database [34]. (C) Genes expressed later during development show expression dynamics at significantly higher values than those expressed early. (D) Expression dynamics of the transcriptome across organs. Higher values indicate higher dynamics of expressed genes. (E) Relationship between gene expression dynamics and transcription factor (TF)-binding sites. TF-binding sites are defined as the number of binding sites overlapping the promoters of each gene. (F) Relationship between gene expression dynamics and number of RNA binding protein (RBP).

We next surveyed the forces governing the evolution of Drosophila genes at the protein level. First, we examined the evidence of selection on dynamically expressed genes using dN/dS value, which was defined as the ratio of the rates of nonsynonymous and synonymous codon substitutions, with high values (> 1) indicating positive selection and low values indicating purifying selection [59]. The expression dynamics were positively correlated with selection strength and relatively stably expressed (or widely expressed) genes were enriched for signatures of purifying selection (Supplementary Figure S11), suggesting that purifying selection might constrain the dynamics of gene expression.

While adult phenotypes are entirely dependent on the successful completion of increasingly complex gene expression patterns during development, it is thought that pathways evolve in reverse. In other words that as long as the final phenotype is achieved, it is possible to exchange mechanisms for initiating development [60]. If this is applicable to gene expression, then we might expect to see more variance in early developmental stages. Surprisingly, we discovered that late expressed genes showed higher expression variation than early-expressed genes (Figure 6C). This suggested that even if early development was evolutionarily fungible, there was tight regulation at micro-timescales. The overall trend was linear, but showed two peaks at adult 14-day and 24-day ages. This result might be explained by the mutation accumulation theory of aging, where the force of purifying selection decreases with age, as proposed by Sir Peter Medawar [61]. Moreover, the genes expressed in testis showed significantly larger expression variations than those in other organs (Figure 6D), the pattern might result from rapid gene evolution in testis and pervasively accessible chromatin allowing extensive transcription [56, 62]. We also observed that genes expressed in the ovary were much less dynamic than those in other tissues, which might because genes important for Drosophila ovarian function are highly conserved during evolution [63], suggesting that these genes might also be expressed in other tissues and be pleiotropic. Collectively, these observations (Figure 6B–D) provided a link between gene expression patterns in an extant species and the molecular evolution and life history theory of the lineage (Medawar’s hypothesis).

To evaluate whether our observations of the associations between expression dynamics and evolutionary constraints could be extended to regulatory programs, we examined the sequence conservation in promoter regions which governed the transcription initiation and gene activity [64]. Expectedly, the promoters of genes with higher expression dynamics were significantly, albeit slightly, less evolutionarily conserved (Supplementary Figure S12). Interestingly, we also found that the promoters of genes with lower expression dynamics harbored more TF-binding sites and higher diversity of TFs (Figure 6E, Supplementary Figure S13) (Spearman correlation, Rho = −0.53, P = 0). These results were reminiscent of high-occupancy target (HOT) regions, defined as the segments that were bound by high amount of TFs, and genes related to HOT regions were stably expressed [55, 65], with their expression variability resembling that of housekeeping genes [66]. Additionally, the transcripts derived from less dynamically expressed genes were bound by more RNA-binding proteins (RBPs) than those from highly dynamically expressed genes (Figure 6F, Supplementary Figure S14). These data indicated a stronger and more complex transcriptional and post-transcriptional regulation of these less dynamic genes. Given the diverse expression patterns, genes that need to be expressed widely require more TF-binding sites to ensure that they could be expressed in essentially all cell types. If the levels of proteins encoded by these genes were also in a narrow tolerance range, then they might be more subject to translational control by RBPs.

Function and phenotypic essentiality of genes with different expression dynamics

Evolutionary constraints and regulatory complexity are strong predictors of functional importance. To extend this inference, we hypothesized that the dynamically expressed genes might perform lineage-specific functions while the less dynamically expressed genes might be associated with the maintenance of basic cellular functions (acting as ‘housekeeping genes’). Considering the different functional paradigms between protein-coding genes and lncRNAs [57], we first focus on protein-coding genes. In support of our notion, the dynamically expressed genes were significantly enriched in developmental process, immune response and neuropeptide signaling pathway gene ontology terms (Figure 7A), while the less dynamically expressed genes were enriched in protein phosphorylation, cell junction organization and GTPase-mediated signal transduction terms (Figure 7B), indicating the different functional categories of these two types of genes.

Figure 7.

Figure 7

Gene ontology (GO) enrichment analysis. (A) Enriched biological processes of highly dynamically expressed protein-coding genes. Top 5% of all dynamic protein-coding genes are used for analysis. (B) Enriched biological processes of less dynamically expressed protein-coding genes. Bottom 5% of all dynamic protein-coding genes are used for analysis. (C) Expression dynamics of genes with different phenotypic severity. (D) Circos plot showing genome-wide lncRNA–protein-coding gene contacts based on expression correlation. The first track (shown by coding) indicates protein-coding genes, and second track (shown by pseudo) represents lncRNAs. (E) Enriched biological processes of protein-coding genes with significantly high expression correlation with highly dynamically expressed lncRNAs. Top 5% dynamic lncRNAs is user for analysis. (F) Same as in E, bottom 5% dynamic lncRNAs used.

To further assess whether the less dynamically expressed genes were required for housekeeping functions (i.e. are essential), we assigned genes to different classes of phenotypic severity based on a public large-scale transgenic short-guide RNA library that enabled efficient disruption of specific target genes [67]. We compared the degree of expression variation for genes in these different classes and revealed a clear association between expression dynamics and the severity of phenotypes when those genes were disrupted (Figure 7C). For example, genes required for lethality showed the lowest expression dynamics. Using another dataset of essential genes predicted by a machine learning algorithm [68], we found that essential genes were enriched for less dynamically expressed genes (Supplementary Figure S15). Collectively, our data reinforced and extended the idea that genes with less dynamic expression have important, conserved functions in Drosophila.

Considering our relatively poor understanding of specific lncRNA functions, it is difficult to compare expression dynamics with functional importance. To get around the poor functional annotation, we borrowed functional information from protein-coding genes based on a previous method [69], by constructing a co-expression network between protein-coding genes and lncRNAs. We used this functional relationship to infer lncRNA functions. With P < 0.01 and absolute Pearson correlation coefficient R > 0.65, a total of 21,613 co-expression pairs were identified between 2383 coding genes and 89 lncRNAs (Figure 7D). Protein-coding genes significantly correlated with dynamically expressed lncRNAs were involved in specific differentiation, such as the sperm-associated functions of spermatid development and differentiation (Figure 7E), while protein-coding genes correlated with stably expressed lncRNAs were involved in basic biological processes, such as cell–cell signaling and cell junction assembly (Figure 7F). These observations on protein-coding genes and lncRNAs indicated that expression dynamics and functional categories were intertwined in these gene pairs.

Conservation and divergence of gene content and expression profile between Drosophila and human

As a premier model species, Drosophila is routinely used to elucidate the progress and underlying molecular mechanisms of human development and diseases [4, 70–73], under two assumptions: 1) a large fraction of human genes, especially disease-associated genes, have counterparts in the Drosophila genome; 2) the genetic regulatory programs are largely conserved between the two. However, how these two assumptions interact has not been systematically characterized. For example, approximately ranging from 65% to 77% of human disease genes have orthologs in Drosophila [71, 74], but how these genes are regulated in the two species is underexplored. Our dataset provided important genome-wide information on the transcriptional regulation of orthologs.

To this end, we classified human genes into different categories based on their phenotypic severity by integrating a dataset of essential genes [75] with a dataset of disease-associated genes [76]. This analysis obtained 2460 vital genes, 1145 vital and disease-suppressing genes, 3150 disease-suppressing genes and 10,463 other genes, with gradually decreasing severity of loss-of-function phenotypes. We matched these with 20,400 ortholog pairs between human (9223 genes) and Drosophila (6937 genes) using Ensembl Compara [77]. Interestingly, we observed a clear and significant association between orthology in Drosophila and phenotypic severity in human loss-of-function (Figure 8A). Specifically, 68.5% of human vital genes were conserved in the Drosophila genome, followed by vital and disease-suppressing genes (65.33%), disease-suppressing genes (49.3%) and other genes (42.34%), suggesting that functionally important genes have been conserved by evolutionary forces. The utility of Drosophila is partly due to the lower number of paralogs, but we also discovered that 40.57% of human vital genes had a one-to-one (1:1) orthology in Drosophila, significantly higher than genes in other three phenotypic groups. Overall, these analyses illustrated that the vital or disease-suppressing genes in human ‘obey’ strict 1:1 orthology relationships, suggesting that most of them could be investigated in Drosophila avatars.

Figure 8.

Figure 8

Expression dynamics of orthologs between human and Drosophila. (A) Proportion of human genes in each type that have homologs in Drosophila. One2one represents that the human genes in the pair have only one ortholog in Drosophila (1:1 orthologous relationship). One2many, 1:m orthologous relationship (m ≥ 2). Many2many, m:m orthologous relationship (m ≥ 2). No homolog, human genes that do not have orthologs in Drosophila. Fisher’s exact test, *P < 0.05; **P < 0.01; ***P < 0.001. Expression correlation of human genes and Drosophila orthologs in muscle (B), ovary (C), testis (D). The correlation coefficients of vital genes, vital and disease-suppressing genes, disease-suppressing genes, and other genes are represented by rho1, rho2, rho3, and rho4, respectively. 1:1 orthologs are used in this analysis.

Much of the knowledge of the genetic programs underlying human biology was first discovered in model organisms like rhesus macaque, mouse and Drosophila. Although the expression conservation between human and mammalian models has been elucidated [78], such analyses on human genes and Drosophila orthologs are understudied. Therefore, we evaluated the expression conservation and dynamics of 1:1 orthologs in 9 matched tissues (Figure 8B–D, Supplementary Figure S16) and found that the expression correlations ranged from 0.08 to 0.67 (Spearman correlation coefficient), with the lowest value in muscle (Figure 8B) and the highest value in ovary (Figure 8C), suggesting the high conservation of gene expression in ovary over hundreds of millions of years of divergence. Interestingly, the general linear relationship did not depend on gene category, but when looking closely, we found that the vital genes exhibited the highest expression conservation than other types of genes in 6 of the tissues, with the highest value of 0.65 in ovary, while the other genes exhibited the lowest conservation in 7 tissues, with the lowest value of 0.08 in muscle. We also found that the expression similarity of genes expressed in testis was considerably lower than other tissues, especially for disease-suppressing and other human genes (Figure 8D). These observations strongly suggested that the clinical phenotypes of human vital and vital and disease-suppressing genes were more likely to be recapitulated by the Drosophila model. For example, the high expression conservation and low expression variability of human Drosophila ortholog pairs in ovaries made them an ideal organ for translating disease etiology research between human and Drosophila.

DISCUSSION

The scientific community generates enormous amounts of sequence data. The journals have done a good job of insisting that these data find a home in accessible data repositories. However, these data are not reused optimally due to the lack of interoperability inherent in the metadata structure. It is crucial to ensure the data quality when integrating large-scale public sequencing data. However, current quality control software does not provide appropriate criteria for users to screen samples. Moreover, these methods do not consider the heterogeneity in public data and are not applicable to data with multi-modal distribution. To this aim, we developed a tool called MassiveQC based on Isolation Forest to perform quality control and outlier detection on large amounts of public data. It could automatically identify outliers based on the self-reporting of all samples and is suitable for data with multi-modal pattern. To highlight its value, we downloaded and filtered all Drosophila RNA-seq data from NCBI SRA using MassiveQC and developed SraMongo to automatically parse the sample information. Our work finally generated 12 K high-quality quality-controlled RNA-seq samples, with several orders of magnitude larger than previous Drosophila genomics studies. This general approach is species agnostic and could be applied to other species and should be amenable to automation. Such database implementations would be a major functional advance in support of the FAIR policy.

The unprecedented breadth and depth allowed systematic characterization of gene expression dynamics and associations. We observed that highly dynamic genes were enriched for signatures of relaxed selective constraint and tended to be expressed at late stages, whereas less dynamic genes were more likely to be under purifying selection and expressed at early stages, suggesting their different functional categories. Genes with less dynamic expression might have a broad expression spectrum and might perform fundamental and vital functions. These ‘housekeeping’ genes, therefore, were of great importance in defining a basal genome required for basic cell function and served as a reference of gene expression in computational and experimental biology (i.e. normalization for RT-qPCR data) [79–81]. While the housekeeping genes in human and mouse have been identified by mining large-scale RNA-seq datasets [66], a consensus list in Drosophila still remains to be defined and the less dynamic genes among the 12 K samples (combining tau value) might be a potential candidate for further validation. Lists of candidate ‘housekeeping’ genes are almost exclusively protein coding. Consistent with prior studies of lncRNAs with tight spatiotemporal specificity [82] and functions [69], we found only a few lncRNAs with less dynamic expression. However, there were some. For example, among the top 200 ‘housekeeping’ genes in our data, one lncRNA, CR46268, was found to be widely expressed in 11,626 samples (96.5%) (FPKM > 0.1), and its top 10 co-expressed protein-coding genes were associated with protein processing and modification. Our data resource could also be used to identify specifically expressed genes in different tissues or at different developmental stages using some tools (e.g. SEGtool [83]).

Gene regulation is extremely useful in comparative genomics and is essential for determining whether Drosophila will be a good human disease model for a particular case. Our meta-analysis of Drosophila gene expression connected to human gene expression and phenotypes provides a direct, easy-to-use window for exploring cross-species comparisons.

In summary, our study has for the first time examined Drosophila gene expression dynamics across organs, developmental stages and sexes by integrating the large-scale transcriptome data and systematically characterized the association of expression dynamics with functional categories, regulatory complexity, evolutionary properties and applicability to human biology. As that the Drosophila has been successfully used as a leading model organism in fundamental biological research, we anticipate that our observations on the expression dynamics are also applicable to other species in general. Our data and the software (MassiveQC and SraMongo), along with the accompanying online resource (http://flybase.org/jbrowse/?data=data/json/dmel, ‘Oliver lab SRA Aggregated RNA-Seq’), which allows the gene expression dynamics to be investigated in an interactive manner, will facilitate the disentangling of the underlying mechanisms in the future.

Key Points

  • We developed a novel software, MassiveQC, based on an unsupervised machine learning algorithm (Isolation Forest), to perform quality control and detect outliers from large amounts of public data using their self-reporting.

  • We applied MassiveQC to collect and filter 62,927 Drosophila RNA-seq samples from Sequence Read Archive (SRA), resulting in a comprehensive transcriptome of 12 K high-quality samples across organ, developmental stage and sex.

  • We systematically characterized fly gene expression dynamics and its associations with function, regulation, evolution patterns and applicability in human biology.

Supplementary Material

Supplementary_Figures_bbad177
Supplementary_Table_S1_bbad177

ACKNOWLEDGEMENTS

B.O. and Z.C. conceived and designed the project; B.O. and J.F. acquired the data; S.Q., M.S., D.W. and J.F. performed data analysis and wrote the manuscript with the input from B.O. and Z.C.; L.C., Y.T., H.L., Y.Z., S.Z. and S.Y. provided helpful comments on the manuscript.

Author Biographies

Sheng Hu Qian is currently a PhD candidate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Meng-Wei Shi is currently a PhD candidate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Dan-Yang Wang is currently a PhD candidate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Justin M. Fear is a postdoc researcher at Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, USA.

Lu Chen is a postdoc researcher at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Yi-Xuan Tu is currently a PhD candidate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Hong-Shan Liu is a master’s graduate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Yuan Zhang is a master’s graduate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Shuai-Jie Zhang is a master’s graduate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Shan-Shan Yu is a master’s graduate at Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology and College of Biomedicine and Health, Huazhong Agricultural University, China.

Brian Oliver is a professor of Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, USA.

Zhen-Xia Chen is a professor of Hubei Hongshan Laboratory, Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, China; Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shenzhen 518000, China; Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518,000, China. Her research group has been working in evolutionary and developmental genomics.

Contributor Information

Sheng Hu Qian, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Meng-Wei Shi, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Dan-Yang Wang, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Justin M Fear, Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, MD 20892, USA.

Lu Chen, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Yi-Xuan Tu, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Hong-Shan Liu, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Yuan Zhang, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Shuai-Jie Zhang, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Shan-Shan Yu, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China.

Brian Oliver, Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, MD 20892, USA.

Zhen-Xia Chen, Hubei Hongshan Laboratory, College of Biomedicine and Health, Huazhong Agricultural University, Wuhan 430070, China; Section of Developmental Genomics, National Institute of Diabetes and Kidney and Digestive Diseases, National Institutes of Health, Bethesda, MD 20892, USA; Hubei Key Laboratory of Agricultural Bioinformatics, College of Life Science and Technology, Huazhong Agricultural University, Wuhan 430070, China; Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, China; Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shenzhen 518000, China; Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518000, China.

FUNDING

The science and technology major program of Hubei Province (2021ABA011); Foundation of Hubei Hongshan Laboratory (2021hszd012 and 2022hszd024); HZAU-AGIS Cooperation Fund (SZYJY2021010).

DATA AVAILABILITY STATEMENT

All raw and processed RNA-seq data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE117217. The composite tracks are available on FlyBase (http://flybase.org/jbrowse/?data=data/json/dmel, ‘Oliver lab SRA Aggregated RNA-seq track’). The python package SraMongo and MassiveQC has been deposited in github (https://github.com/jfear/sramongo and https://github.com/shimw6828/MassiveQC) and figshare (https://figshare.com/articles/software/MassiveQC/21674198). The computer codes for the workflows of transcriptome atlas of Drosophila melanogster have been deposited in github (https://github.com/jfear/ncbi_remap).

REFERENCES

  • 1. Papili Gao  N, Ud-Dean  SMM, Gandrillon  O, Gunawan  R. SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics  2018;34:258–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Hillenbrand  P, Maier  KC, Cramer  P, Gerland  U. Inference of gene regulation functions from dynamic transcriptome data. Elife  2016;5:e12188. 10.7554/eLife.12188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Yang  H, Jaime  M, Polihronakis  M, et al.  Re-annotation of eight Drosophila genomes. Life Sci Alliance  2018;1:e201800156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Link  N, Bellen  HJ. Using Drosophila to drive the diagnosis and understand the mechanisms of rare human diseases. Development  2020;147:dev191411. 10.1242/dev.191411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Mirzoyan  Z, Sollazzo  M, Allocca  M, et al.  Drosophila melanogaster: a model organism to study cancer. Front Genet  2019;10:51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Banerjee  S, Benji  S, Liberow  S, et al.  Using Drosophila melanogaster to discover human disease genes: an educational primer for use with “amyotrophic lateral sclerosis modifiers in Drosophila reveal the phospholipase D pathway as a potential therapeutic target”. Genetics  2020;216:633–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Parisi  M, Nuttall  R, Edwards  P, et al.  A survey of ovary-, testis-, and soma-biased gene expression in Drosophila melanogaster adults. Genome Biol  2004;5:R40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Manak  JR, Dike  S, Sementchenko  V, et al.  Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat Genet  2006;38:1151–8. [DOI] [PubMed] [Google Scholar]
  • 9. Ranz  JM, Castillo-Davis  CI, Meiklejohn  CD, Hartl  DL. Sex-dependent gene expression and evolution of the Drosophila transcriptome. Science  2003;300:1742–5. [DOI] [PubMed] [Google Scholar]
  • 10. Stolc  V, Gauhar  Z, Mason  C, et al.  A gene expression map for the euchromatic genome of Drosophila melanogaster. Science  2004;306:655–60. [DOI] [PubMed] [Google Scholar]
  • 11. Vedelek  V, Bodai  L, Grezal  G, et al.  Analysis of Drosophila melanogaster testis transcriptome. BMC Genomics  2018;19:697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Shi  MW, Zhang  NA, Shi  CP, et al.  SAGD: a comprehensive sex-associated gene database from transcriptomes. Nucleic Acids Res  2019;47:D835–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Huang  W, Carbone  MA, Magwire  MM, et al.  Genetic basis of transcriptome diversity in Drosophila melanogaster. Proc Natl Acad Sci U S A  2015;112:E6010–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Fu  Y, Huang  X, Zhang  P, et al.  Single-cell RNA sequencing identifies novel cell types in Drosophila blood. J Genet Genomics  2020;47:175–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Calderon  D, Blecher-Gonen  R, Huang  X, et al.  The continuum of Drosophila embryonic development at single-cell resolution. Science  2022;377:eabn5800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Li  H, Janssens  J, De Waegeneer  M, et al.  Fly cell atlas: a single-nucleus transcriptomic atlas of the adult fruit fly. Science  2022;375:eabk2432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Scheffler  M, Aeschlimann  M, Albrecht  M, et al.  FAIR data enabling new horizons for materials research. Nature  2022;604:635–42. [DOI] [PubMed] [Google Scholar]
  • 18. Chen  S, Zhou  Y, Chen  Y, et al.  Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics  2018;34:i884–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Ewels  P, Magnusson  M, Lundin  S, et al.  MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics  2016;32:3047–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Graubert  A, Aguet  F, Ravi  A, et al.  RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts. Bioinformatics  2021;37:3048–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Zhou  Q, Su  X, Jing  G, et al.  RNA-QC-chain: comprehensive and fast quality control for RNA-Seq data. BMC Genomics  2018;19:144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Okonechnikov  K, Conesa  A, Garcia-Alcalde  F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics  2016;32:292–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Pembroke  WG, Hartl  CL, Geschwind  DH. Evolutionary conservation and divergence of the human brain transcriptome. Genome Biol  2021;22:52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Liu  FT, Ting  KM, Zhou  Z-H. Isolation-based anomaly detection. ACM Trans Knowl Discov Data  2012;6:3. [Google Scholar]
  • 25. Wingett  SW, Andrews  S. FastQ screen: a tool for multi-genome mapping and quality control. F1000Res  2018;7:1338. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Danecek  P, Bonfield  JK, Liddle  J, et al.  Twelve years of SAMtools and BCFtools. Gigascience  2021;10:giab008. 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Barnett  DW, Garrison  EK, Quinlan  AR, et al.  BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics  2011;27:1691–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Liao  Y, Smyth  GK, Shi  W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics  2014;30:923–30. [DOI] [PubMed] [Google Scholar]
  • 29. Lundberg  SM, Erion  G, Chen  H, et al.  From local explanations to global understanding with explainable AI for trees. Nat Mach Intell  2020;2:56–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Didion  JP, Martin  M, Collins  FS. Atropos: specific, sensitive, and speedy trimming of sequencing reads. PeerJ  2017;5:e3720. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Ramirez  F, Dundar  F, Diehl  S, et al.  deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res  2014;42:W187–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Loper  E, Bird  S. NLTK: the natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2002, pp. 63–70.
  • 33. Leek  JT, Johnson  WE, Parker  HS, et al.  The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics  2012;28:882–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Tong  YB, Shi  MW, Qian  SH, et al.  GenOrigin: a comprehensive protein-coding gene origination database on the evolutionary timescale of life. J Genet Genomics  2021;48:1122–9. [DOI] [PubMed] [Google Scholar]
  • 35. Shao  Y, Chen  C, Shen  H, et al.  GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes. Genome Res  2019;29:682–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Stanley  CE, Jr, Kulathinal  RJ. flyDIVaS: a comparative genomics resource for Drosophila divergence and selection. G3 (Bethesda)  2016;6:2355–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Lee  CM, Barber  GP, Casper  J, et al.  UCSC genome browser enters 20th year. Nucleic Acids Res  2020;48:D756–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Yevshin  I, Sharipov  R, Kolmykov  S, et al.  GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res  2019;47:D100–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Quinlan  AR. BEDTools: the Swiss-Army tool for genome feature analysis. Curr Protoc Bioinformatics  2014;47:11 12 11-34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Wu  T, Hu  E, Xu  S, et al.  clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (N Y)  2021;2:100141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Kim  D, Paggi  JM, Park  C, et al.  Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol  2019;37:907–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Lin  Y, Golovnina  K, Chen  ZX, et al.  Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics  2016;17:28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Sayers  EW, Beck  J, Bolton  EE, et al.  Database resources of the National Center for biotechnology information. Nucleic Acids Res  2021;49:D10–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Gramates  LS, Agapite  J, Attrill  H, et al.  FlyBase: a guided tour of highlighted features. Genetics  2022;220:iyac035. 10.1093/genetics/iyac035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Goh  WWB, Yong  CH, Wong  L. Are batch effects still relevant in the age of big data?  Trends Biotechnol  2022;40:1029–40. [DOI] [PubMed] [Google Scholar]
  • 46. Geng  R, Zhu  X, Tao  X, et al.  EIF1A depletion restrains human pituitary adenoma progression. Transl Oncol  2022;15:101299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Gomez-Diaz  C, Martin  F, Garcia-Fernandez  JM, et al.  The two main olfactory receptor families in Drosophila, ORs and IRs: a comparative approach. Front Cell Neurosci  2018;12:253. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Hao  Y, Hao  S, Andersen-Nissen  E, et al.  Integrated analysis of multimodal single-cell data. Cell  2021;184:3573–3587.e29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Chen  ZX, Sturgill  D, Qu  J, et al.  Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res  2014;24:1209–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Brown  JB, Boley  N, Eisman  R, et al.  Diversity and dynamics of the Drosophila transcriptome. Nature  2014;512:393–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Graveley  BR, Brooks  AN, Carlson  JW, et al.  The developmental transcriptome of Drosophila melanogaster. Nature  2011;471:473–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Lee  H, Oliver  B. Non-canonical Drosophila X chromosome dosage compensation and repressive topologically associated domains. Epigenetics Chromatin  2018;11:62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Kim  M, Faucillion  ML, Larsson  J. RNA-on-X 1 and 2 in Drosophila melanogaster fulfill separate functions in dosage compensation. PLoS Genet  2018;14:e1007842. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Qian  SH, Chen  L, Xiong  YL, et al.  Evolution and function of developmentally dynamic pseudogenes in mammals. Genome Biol  2022;23:235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Wreczycka  K, Franke  V, Uyar  B, et al.  HOT or not: examining the basis of high-occupancy target regions. Nucleic Acids Res  2019;47:5735–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Qian  SH, Xiong  YL, Chen  L, et al.  Dynamic spatial-temporal expression ratio of X chromosome to autosomes but stable dosage compensation in mammals. Genomics Proteomics Bioinf  2022. [DOI] [PubMed] [Google Scholar]
  • 57. Statello  L, Guo  CJ, Chen  LL, et al.  Gene regulation by long non-coding RNAs and its biological functions. Nat Rev Mol Cell Biol  2021;22:96–118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Zhang  JY, Zhou  Q. On the regulatory evolution of new genes throughout their life history. Mol Biol Evol  2019;36:15–27. [DOI] [PubMed] [Google Scholar]
  • 59. Khodursky  S, Svetec  N, Durkin  SM, et al.  The evolution of sex-biased gene expression in the Drosophila brain. Genome Res  2020;30:874–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Salazar-Ciudad  I. On the origins of morphological variation, canalization, robustness, and evolvability. Integr Comp Biol  2007;47:390–400. [DOI] [PubMed] [Google Scholar]
  • 61. Cheng  C, Kirkpatrick  M. Molecular evolution and the decline of purifying selection with age. Nat Commun  2021;12:2657. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Xia  B, Yan  Y, Baron  M, et al.  Widespread transcriptional scanning in the testis modulates gene evolution rates. Cell  2020;180:248–262.e21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Elis  S, Desmarchais  A, Cardona  E, et al.  Genes involved in Drosophila melanogaster ovarian function are highly conserved throughout evolution. Genome Biol Evol  2018;10:2629–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Haberle  V, Stark  A. Eukaryotic core promoters and the functional basis of transcription initiation. Nat Rev Mol Cell Biol  2018;19:621–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Ramaker  RC, Hardigan  AA, Goh  ST, et al.  Dissecting the regulatory activity and sequence content of loci with exceptional numbers of transcription factor associations. Genome Res  2020;30:939–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Hounkpe  BW, Chenou  F, de  Lima  F, et al.  HRT atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res  2021;49:D947–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Port  F, Strein  C, Stricker  M, et al.  A large-scale resource for tissue-specific CRISPR mutagenesis in Drosophila. Elife  2020;9:e53865. 10.7554/eLife.53865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Aromolaran  O, Beder  T, Oswald  M, et al.  Essential gene prediction in Drosophila melanogaster using machine learning approaches based on sequence and functional features. Comput Struct Biotechnol J  2020;18:612–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Sarropoulos  I, Marin  R, Cardoso-Moreira  M, et al.  Developmental dynamics of lncRNAs across mammalian organs and species. Nature  2019;571:510–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Bellen  HJ, Wangler  MF, Yamamoto  S. The fruit fly at the interface of diagnosis and pathogenic mechanisms of rare and common human diseases. Hum Mol Genet  2019;28:R207–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Ji  JY, Han  C, Deng  WM. Understanding human diseases using Drosophila. J Genet Genomics  2019;46:155–6. [DOI] [PubMed] [Google Scholar]
  • 72. Baldridge  D, Wangler  MF, Bowman  AN, et al.  Model organisms contribute to diagnosis and discovery in the undiagnosed diseases network: current state and a future vision. Orphanet J Rare Dis  2021;16:206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Ma  M, Moulton  MJ, Lu  S, Bellen  HJ. 'Fly-ing' from rare to common neurodegenerative disease mechanisms. Trends Genet  2022;38:972–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Markow  TA. The secret lives of Drosophila flies. Elife  2015;4:e06793. 10.7554/eLife.06793. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Bartha  I, di  Iulio  J, Venter  JC, et al.  Human gene essentiality. Nat Rev Genet  2018;19:51–62. [DOI] [PubMed] [Google Scholar]
  • 76. Stenson  PD, Mort  M, Ball  EV, et al.  The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet  2017;136:665–77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Howe  KL, Achuthan  P, Allen  J, et al.  Ensembl 2021. Nucleic Acids Res  2021;49:D884–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Cardoso-Moreira  M, Sarropoulos  I, Velten  B, et al.  Developmental gene expression differences between humans and mammalian models. Cell Rep  2020;33:108308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Monaco  G, Lee  B, Xu  W, et al.  RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep  2019;26:1627–1640.e7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Wei  K, Zhang  T, Ma  L. Divergent and convergent evolution of housekeeping genes in human-pig lineage. PeerJ  2018;6:e4840. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Gonzalez-Bermudez  L, Anglada  T, Genesca  A, et al.  Identification of reference genes for RT-qPCR data normalisation in aging studies. Sci Rep  2019;9:13970. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Li  K, Tian  Y, Yuan  Y, et al.  Insights into the functions of LncRNAs in Drosophila. Int J Mol Sci  2019;20:4646. 10.3390/ijms20184646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83. Zhang  Q, Liu  W, Liu  C, et al.  SEGtool: a specifically expressed gene detection tool and applications in human tissue and single-cell sequencing data. Brief Bioinform  2018;19:1325–36. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary_Figures_bbad177
Supplementary_Table_S1_bbad177

Data Availability Statement

All raw and processed RNA-seq data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE117217. The composite tracks are available on FlyBase (http://flybase.org/jbrowse/?data=data/json/dmel, ‘Oliver lab SRA Aggregated RNA-seq track’). The python package SraMongo and MassiveQC has been deposited in github (https://github.com/jfear/sramongo and https://github.com/shimw6828/MassiveQC) and figshare (https://figshare.com/articles/software/MassiveQC/21674198). The computer codes for the workflows of transcriptome atlas of Drosophila melanogster have been deposited in github (https://github.com/jfear/ncbi_remap).


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES