mixOmics: An R package for 'omics feature selection and multiple data integration

PLoS Comput Biol. 2017 Nov 3;13(11):e1005752.
doi: 10.1371/journal.pcbi.1005752. eCollection 2017 Nov.

Florian Rohart et al. PLoS Comput Biol. 2017.

Abstract

The advent of high-throughput technologies has led to a wealth of publicly available 'omics data coming from different sources, such as transcriptomics, proteomics and metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have focused on identifying small subsets of molecules (a 'molecular signature') to explain or predict biological conditions, but mainly for a single type of 'omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous 'omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple 'omics data sets or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of 'omics data available from the package.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig 1. Overview of the mixOmics multivariate methods for single and integrative ‘omics supervised analyses.
X denotes a predictor ‘omics data set, and y a categorical outcome response (e.g. healthy vs. sick). Integrative analyses include N-integration with DIABLO (the same N samples are measured on different ‘omics platforms), and P-integration with MINT (the same P ‘omics predictors are measured in several independent studies). Sample plots depicted here use the mixOmics functions (from left to right) plotIndiv, plotArrow and plotIndiv in 3D; variable plots use the mixOmics functions network, cim, plotLoadings, plotVar and circosPlot. The graphical output functions are detailed in Supporting Information S1 Text.
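
The two integration layouts described in the caption can be illustrated with a small sketch. It is written in Python rather than R and does not use mixOmics; the data, the container layout and the function names `shared_samples` and `shared_features` are hypothetical and only show what "same N samples" and "same P predictors" mean for the shape of the input.

```python
# Toy illustration of the two input layouts (all names and values are made up).

def shared_samples(blocks):
    """N-integration: every 'omics block is measured on the same samples."""
    sample_lists = [list(block.keys()) for block in blocks.values()]
    return all(samples == sample_lists[0] for samples in sample_lists)


def shared_features(studies):
    """P-integration: every independent study measures the same features."""
    feature_lists = [features for features, _rows in studies.values()]
    return all(features == feature_lists[0] for features in feature_lists)


# N-integration: same samples (s1, s2), different features per platform.
blocks = {
    "mRNA": {"s1": [0.2, 1.4, 3.1], "s2": [0.9, 1.1, 2.7]},
    "protein": {"s1": [5.0, 6.2], "s2": [4.8, 6.9]},
}

# P-integration: same features (g1, g2), different samples per study.
studies = {
    "study_A": (["g1", "g2"], {"a1": [1.0, 2.0], "a2": [1.5, 2.5]}),
    "study_B": (["g1", "g2"], {"b1": [0.7, 2.2]}),
}

print(shared_samples(blocks))    # layout check for N-integration
print(shared_features(studies))  # layout check for P-integration
```

In this toy layout the blocks may have different numbers of features, and the studies may have different numbers of samples; only the shared dimension has to match.
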
Fig 2. Prediction area visualisation on the Small Round Blue Cell Tumors (SRBCT [35]) data, described in the Results section, with respect to the prediction distance.
From left to right: ‘maximum distance’, ‘centroid distance’ and ‘Mahalanobis distance’. Sample prediction area plots from a PLS-DA model applied to a microarray data set with the expression levels of 2,308 genes measured on 63 samples. Samples are classified into four classes: Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), Neuroblastoma (NB), and Rhabdomyosarcoma (RMS).
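
As a rough illustration of how two of these prediction distances can assign a class, the Python sketch below uses a simplified reading of the rules: the maximum distance picks the class with the largest predicted score, and the centroid distance picks the class whose training centroid is closest in Euclidean distance. This is an assumption-laden sketch, not the package's implementation; the function names, scores and coordinates are hypothetical, and the Mahalanobis variant is omitted.

```python
import math


def predict_max_dist(class_scores):
    """Maximum distance (simplified): class with the largest predicted score."""
    return max(class_scores, key=class_scores.get)


def class_centroids(train_points, train_labels):
    """Mean position of the training samples of each class."""
    sums, counts = {}, {}
    for point, label in zip(train_points, train_labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid_dist(point, centroids):
    """Centroid distance (simplified): class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical coordinates of training samples on two components.
train_points = [(1.0, 1.0), (1.2, 0.8), (-1.0, -1.0), (-0.8, -1.2)]
train_labels = ["EWS", "EWS", "BL", "BL"]
centroids = class_centroids(train_points, train_labels)

print(predict_centroid_dist((0.9, 1.0), centroids))
print(predict_max_dist({"BL": 0.10, "EWS": 0.70, "NB": 0.15, "RMS": 0.05}))
```
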
Fig 3. Illustration of a single ‘omics analysis with mixOmics.
A) Unsupervised preliminary analysis with PCA, A1: PCA sample plot, A2: percentage of explained variance per component. B) Supervised analysis with PLS-DA, B1: PLS-DA sample plot with confidence ellipse plots, B2: classification performance per component (overall and BER) for three prediction distances using repeated stratified cross-validation (10×5-fold CV). C) Supervised analysis and feature selection with sparse PLS-DA, C1: sPLS-DA sample plot with confidence ellipse plots, C2: arrow plot representing each sample pointing towards its outcome category, see more details in Supporting Information S1 Text. C3: Clustered Image Map (Euclidean Distance, Complete linkage) where samples are represented in rows and selected features in columns (10, 300 and 30 genes selected on each component respectively), C4: ROC curve and AUC averaged using one-vs-all comparisons.
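
Panel B2 reports both the overall error rate and the balanced error rate (BER). The Python sketch below shows how the two can differ on unbalanced classes, assuming the usual definition of BER as the mean of the per-class error rates. The function name `error_rates` and the toy labels are hypothetical, and the sketch is independent of mixOmics.

```python
def error_rates(true_labels, predicted_labels):
    """Return (overall error rate, balanced error rate)."""
    class_total, class_wrong = {}, {}
    for truth, predicted in zip(true_labels, predicted_labels):
        class_total[truth] = class_total.get(truth, 0) + 1
        if predicted != truth:
            class_wrong[truth] = class_wrong.get(truth, 0) + 1
    overall = sum(class_wrong.values()) / len(true_labels)
    # BER averages the error rate of each class, so small classes count as
    # much as large ones.
    per_class = [class_wrong.get(c, 0) / n for c, n in class_total.items()]
    ber = sum(per_class) / len(per_class)
    return overall, ber


# Eight samples of class A (all correct), two of class B (one wrong).
truth = ["A"] * 8 + ["B"] * 2
predicted = ["A"] * 8 + ["A", "B"]
overall, ber = error_rates(truth, predicted)
print(overall, ber)
```

Here one of ten samples is misclassified overall, but half of the minority class is wrong, which is what the BER is designed to expose.
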
Fig 4. Illustration of N-integrative supervised analysis with DIABLO.
A: sample plot per data set, B: sample scatterplot from plotDiablo displaying the first component in each data set (upper diagonal plot) and Pearson correlation between each component (lower diagonal plot). C: Clustered Image Map (Euclidean distance, Complete linkage) of the multi-omics signature. Samples are represented in rows, selected features on the first component in columns. D: Circos plot shows the positive (negative) correlation (r > 0.7) between selected features as indicated by the brown (black) links, feature names appear in the quadrants, E: Correlation Circle plot representing each type of selected features, F: relevance network visualisation of the selected features.
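
The idea behind the thresholded links in panel D can be sketched in Python as follows. The sketch assumes the cutoff applies to the absolute value of the correlation and computes plain Pearson correlations on raw values for every pair of features; how the package itself derives the correlations and which pairs it links is not stated here. The feature names, values and the function `correlation_links` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_links(features, cutoff=0.7):
    """List feature pairs whose absolute correlation exceeds the cutoff."""
    names = list(features)
    links = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(features[first], features[second])
            if abs(r) > cutoff:
                sign = "positive" if r > 0 else "negative"
                links.append((first, second, sign))
    return links


# Made-up measurements of four features on five samples.
features = {
    "gene_1": [1, 2, 3, 4, 5],
    "protein_1": [2, 4, 6, 8, 10.5],
    "metabolite_1": [5, 4, 3, 2, 1],
    "gene_2": [1, -1, 1, -1, 1],
}
links = correlation_links(features)
for link in links:
    print(link)
```
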
Fig 5. Illustration of MINT analysis in mixOmics.
A: Parameter tuning of a MINT sPLS-DA model with two components using Leave-One-Group-Out cross-validation and maximum distance, BER (y-axis) with respect to number of selected features (x-axis). Full diamond represents the optimal number of features to select on each component, B: Performance of the final MINT sPLS-DA model including selected features based on BER and classification error rate per class, C: Global sample plot with confidence ellipse plots, D: Study specific sample plot, E: Clustered Image Map (Euclidean Distance, Complete linkage). Samples are represented in rows, selected features on the first component in columns. F: Loading plot of each feature selected on the first component in each study, with color indicating the class with a maximal mean expression value for each gene.
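
Leave-One-Group-Out cross-validation, as used in panel A, holds out all samples from one study at a time and trains on the remaining studies. A minimal Python sketch of how such splits can be generated is given below; the function name `leave_one_group_out` and the study labels are hypothetical, and the sketch only produces index splits without fitting any model.

```python
def leave_one_group_out(study_labels):
    """Yield (held-out study, training indices, test indices) for each study."""
    for held_out in sorted(set(study_labels)):
        train = [i for i, s in enumerate(study_labels) if s != held_out]
        test = [i for i, s in enumerate(study_labels) if s == held_out]
        yield held_out, train, test


# Study membership of six samples (made-up labels).
study_of_sample = ["study_1", "study_1", "study_2", "study_2", "study_2", "study_3"]
splits = list(leave_one_group_out(study_of_sample))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Each study serves as the test set exactly once, so the number of folds equals the number of studies rather than being a free parameter.
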

References

    1. Lê Cao KA, Rohart F, Gonzalez I, Déjean S, Gautier B, Bartolo F, et al. mixOmics: Omics Data Integration Project; 2017. Available from: https://CRAN.R-project.org/package=mixOmics.
    2. Boulesteix AL, Strimmer K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief Bioinform. 2007;8(1):32–44. doi: 10.1093/bib/bbl016
    3. Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC. Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in bioinformatics. 2016; p. bbv108. doi: 10.1093/bib/bbv108
    4. Labus JS, Van Horn JD, Gupta A, Alaverdyan M, Torgerson C, Ashe-McNalley C, et al. Multivariate morphological brain signatures predict patients with chronic abdominal pain from healthy control subjects. Pain. 2015;156(8):1545–1554. doi: 10.1097/j.pain.0000000000000196
    5. Cook JA, Chandramouli GV, Anver MR, Sowers AL, Thetford A, Krausz KW, et al. Mass Spectrometry–Based Metabolomics Identifies Longitudinal Urinary Metabolite Profiles Predictive of Radiation-Induced Cancer. Cancer research. 2016;76(6):1569–1577. doi: 10.1158/0008-5472.CAN-15-2416

Grants and funding

FR was supported, in part, by the Australian Cancer Research Foundation (ACRF) for the Diamantina Individualised Oncology Care Centre at The University of Queensland Diamantina Institute. KALC was supported, in part, by the National Health and Medical Research Council (NHMRC) Career Development fellowship (APP1087415). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.