This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
Creation of this archive was supported by a grant from the Information and Data Management Program at the National Science Foundation. The archive is intended to serve as a permanent repository of publicly accessible data sets for research in KDD and data mining. It complements the original UCI Machine Learning Archive, which typically focuses on smaller classification-oriented data sets.
In addition to storing data and description files, we also archive task files that describe a specific analysis, such as clustering or regression, for the data sets stored. The call for data sets lists typical data types and tasks of interest.
If you publish material based on databases obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following pseudo-APA reference format for referring to this repository:
Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.
We also request that you send the citation information for your article to kdd '@' ics.uci.edu. If your article is available online and you provide us with a URL, we will link the data set's documentation to your file.
We are always looking for additional data sets and task files. Note that you may submit: (1) data and a description file, (2) a task file describing a particular analysis for a data set, or (3) both. There may be multiple task files for the same data set and the author of a task file may be different from the data donor.
If you are in doubt as to whether a data set or task file would be of interest, please contact the librarian. Donations may be made with anonymous FTP as follows:
Alternatively, you may provide us with a web URL and we will download the data. If neither of these methods is suitable, please contact the librarian and we will arrange the transfer of data in the manner most convenient for you.
As many researchers use this archive, please carefully fill out a data documentation form when you submit data. If you are submitting an analysis of data, please fill out a task documentation form:
There are several sample files, which may help you fill out the documentation:
We prefer that the data have a standard format. For multivariate data sets that can be represented by a table, please format the data with one instance/example per line, no spaces, comma-separated attribute values, and missing values denoted by "?". For other types of data, use your best judgment.
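As an illustration of the preferred tabular format described above, here is a minimal Python sketch that writes one instance per line with comma-separated values and "?" for missing entries. The function names `format_instance` and `format_table` are hypothetical and not part of any archive tooling. The sketch also makes two assumptions of its own: `None` and empty strings are treated as missing, and internal whitespace in a value is replaced with an underscore to satisfy the "no spaces" rule. Neither choice is prescribed by the archive.

```python
from typing import Iterable, Sequence, Union

Value = Union[str, int, float, None]
MISSING = "?"  # marker for missing values, as requested by the archive


def format_instance(values: Sequence[Value]) -> str:
    """Format one instance as a single comma-separated line without spaces."""
    fields = []
    for value in values:
        # Assumption: None and blank strings both count as missing.
        if value is None or str(value).strip() == "":
            fields.append(MISSING)
            continue
        # Assumption: collapse internal whitespace to underscores.
        text = "_".join(str(value).split())
        if "," in text:
            raise ValueError(f"value contains the separator: {text!r}")
        fields.append(text)
    return ",".join(fields)


def format_table(rows: Iterable[Sequence[Value]]) -> str:
    """Format a table as one instance per line, each line newline-terminated."""
    return "".join(format_instance(row) + "\n" for row in rows)


if __name__ == "__main__":
    rows = [["red", 3, None, 2.5], ["dark blue", 7, 1, ""]]
    print(format_table(rows), end="")
```

Values that contain a comma are rejected rather than quoted, because the format above does not describe a quoting convention.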
Thank you for your donations.
David Newman (librarian)