Ten Simple Rules for Digital Data Storage
Editorial
PLoS Comput Biol. 2016 Oct 20;12(10):e1005097.
doi: 10.1371/journal.pcbi.1005097. eCollection 2016 Oct.

Edmund M Hart et al. PLoS Comput Biol.
No abstract available

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Example of an untidy dataset (A) and its tidy equivalent (B).
Dataset A is untidy because it mixes observational units (species, location of observations, measurements about individuals), the units are mixed and listed with the observations, more than one variable is listed (both latitude and longitude for the coordinates, and genus and species for the species names), and several formats are used in the same column for dates and geographic coordinates. Dataset B is an example of a tidy version of dataset A that reduces the amount of information that is duplicated in each row, limiting chances of introducing mistakes in the data. By having species in a separate table, they can be identified uniquely using the Taxonomic Serial Number (TSN) from the Integrated Taxonomic Information System (ITIS), and it makes it easy to add information about the classification of these species. It also allows researchers to edit the taxonomic information independently from the table that holds the measurements about the individuals. Unique values for each observational unit facilitate the programmatic combination of information using “join” operations. With this example, if the focus of the study for which these data were collected is based upon the size measurements of the individuals (weight and length), information about “where,” “when,” and “what” animals were measured can be considered metadata. Using the tidy format makes this distinction clearer.
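
The "join" operation mentioned in the caption can be illustrated with a minimal sketch. It assumes two tidy tables, one for species keyed by TSN and one for measurements of individuals, held in an in-memory SQLite database. The table names, column names, TSN values, and taxon names are hypothetical placeholders chosen for illustration; they are not taken from the figure or from ITIS.

```python
import sqlite3

# Illustrative only: every name and value below is a made-up placeholder,
# not data from Fig 1 and not a real ITIS Taxonomic Serial Number.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE species (
    tsn     INTEGER PRIMARY KEY,   -- unique identifier for each species
    genus   TEXT,
    species TEXT
);
CREATE TABLE measurements (
    individual_id INTEGER PRIMARY KEY,
    tsn           INTEGER REFERENCES species(tsn),
    weight_g      REAL,
    length_mm     REAL
);
""")

# Taxonomic information is stored once per species, not once per individual.
conn.executemany(
    "INSERT INTO species VALUES (?, ?, ?)",
    [(900001, "Genusone", "alpha"), (900002, "Genustwo", "beta")],
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(1, 900001, 12.5, 80.0), (2, 900001, 14.0, 85.0), (3, 900002, 30.2, 120.0)],
)

# The shared key (tsn) lets the two tables be recombined on demand.
rows = conn.execute("""
    SELECT m.individual_id, s.genus, s.species, m.weight_g, m.length_mm
    FROM measurements AS m
    JOIN species AS s ON s.tsn = m.tsn
    ORDER BY m.individual_id
""").fetchall()

for row in rows:
    print(row)
```

Because the species names live in one place, correcting a taxonomic name in this sketch would mean updating a single row in the species table, and every joined measurement would reflect the change.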

Grants and funding

DL was funded by the Energy Biosciences Institute and the National Center for Supercomputing Applications. FM was funded by iDigBio (Integrated Digitized Biocollections), and therefore this material is based upon work supported by the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program (Cooperative Agreement EF-1115210). KHW was funded by Washington State University. NBZ was funded by the Gordon and Betty Moore Foundation through Grant GBMF 2550.03 to the Life Sciences Research Foundation. PB was funded by the Natural Sciences & Engineering Research Council of Canada and the Academic Development Fund of the University of Western Ontario. TP was funded by an NSERC Discovery Grant and a Start-Up grant from the Université de Montréal. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.