Multi-Label Classification Dataset Repository – Knowledge Discovery and Intelligent Systems – KDIS – University of Córdoba

Multi-Label Classification Dataset Repository

On this website we provide a large compilation of multi-label classification datasets, obtained from different sources. For further information, please contact Jose M. Moyano (jmoyano@uco.es).

Datasets

For each dataset we provide a short description as well as some characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity is the number of distinct labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance across all labels; the greater the avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of label pairs that are dependent at 99% confidence. A broader description of all the characterization metrics and of the partition methods used is given in the MLDA documentation. We also used MLDA for the characterization and partitioning of the datasets.
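As a rough illustration of the metrics described above, the following Python sketch computes them for a small binary label matrix. It is not the MLDA implementation. Two points are assumptions: the denominator of diversity is taken as min(m, 2^q), chosen because CAL500 reports a diversity of 1.000 with only 502 instances, and the imbalance ratio of a label is taken as the count of the most frequent label divided by that label's own count. The function name `characterize` is hypothetical.

```python
def characterize(labels, n_attributes):
    """Compute characterization metrics for a binary label matrix.

    labels: list of m rows, each a list of q values in {0, 1}.
    n_attributes: d, the number of input attributes.
    """
    m = len(labels)
    q = len(labels[0])

    # Cardinality: average number of labels per instance.
    cardinality = sum(sum(row) for row in labels) / m
    # Density: cardinality divided by the number of labels.
    density = cardinality / q

    # Assumption: the number of possible labelsets is bounded by min(m, 2^q).
    distinct_labelsets = len({tuple(row) for row in labels})
    diversity = distinct_labelsets / min(m, 2 ** q)

    # Assumption: imbalance ratio of a label is the count of the most
    # frequent label divided by the count of that label.
    counts = [sum(row[j] for row in labels) for j in range(q)]
    max_count = max(counts)
    ratios = [max_count / c for c in counts if c > 0]
    avg_ir = sum(ratios) / len(ratios)

    # Complexity as defined in the text: m x q x d.
    complexity = m * q * n_attributes

    return {
        "m": m,
        "q": q,
        "Card": cardinality,
        "Dens": density,
        "Div": diversity,
        "avgIR": avg_ir,
        "complexity": complexity,
    }


toy_labels = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
metrics = characterize(toy_labels, n_attributes=5)
```

The rDep metric is omitted here because it requires a chi-square test for every pair of labels.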

We include the following partitions for each dataset:

  • Original: dataset as originally provided by its authors.
  • Full dataset: the entire dataset in Mulan/Meka format.
  • Random train-test: the dataset was randomly partitioned into train and test files, using 67% of data for training and 33% for testing.
  • Stratified train-test: the partition in 67% train and 33% test was performed by following the Iterative Stratification method proposed by Sechidis et al. 2011.
  • Random 5-folds CV: the dataset was randomly partitioned into 5 folds, which were then combined into 5 different train-test partitions, where in each case 4 folds are used for training and the remaining one for testing. Thus, each train-test partition uses different data for testing.
  • Stratified 5-folds CV: the dataset was partitioned into 5 stratified folds following the Iterative Stratification method proposed by Sechidis et al. 2011, and the folds were then combined into 5 different train-test partitions, where in each case 4 folds are used for training and the remaining one for testing.
  • Stratified 10-folds: a stratified partition in 10 folds was performed by following the method proposed by J. Motl. In this case, the 10 folds partitions are directly provided, instead of giving the train-test partitions obtained as combination of folds. For further information about the 10-folds partition, please contact jan.motl@fit.cvut.cz.

In all cases, with the exception of the original data, the datasets are provided in both Mulan and Meka formats.
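The random partitions described above can be sketched in Python as follows. This is only an illustration of the random variants: the stratified partitions use Iterative Stratification, which is not reproduced here. The function names and the seed values are hypothetical and are not the ones used to build the repository files.

```python
import random


def random_train_test(n_instances, train_fraction=0.67, seed=0):
    """Shuffle instance indices and split them into train and test lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]


def random_folds(n_instances, k=5, seed=0):
    """Shuffle instance indices and deal them into k disjoint folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def folds_to_partitions(folds):
    """Build one (train, test) pair per fold: that fold is the test set
    and the remaining folds are joined to form the training set."""
    partitions = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        partitions.append((train, test))
    return partitions


train, test = random_train_test(100)
partitions = folds_to_partitions(random_folds(103, k=5))
```

The same `folds_to_partitions` step can be applied to the 10 folds that are provided directly, if train-test pairs are needed.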

Dataset Domain m d q Card Dens Div avgIR rDep m×q×d
20NG Text 19300 1006 20 1.029 0.051 0.003 1.007 0.984 3.88E+08
3s-bbc1000 Text 352 1000 6 1.125 0.188 0.234 1.718 0.733 2.11E+06
3s-guardian1000 Text 302 1000 6 1.126 0.188 0.219 1.773 0.667 1.81E+06
3s-inter3000 Text 169 3000 6 1.142 0.190 0.172 1.766 0.400 3.04E+06
3s-reuters1000 Text 294 1000 6 1.126 0.188 0.219 1.789 0.667 1.76E+06
Bibtex Text 7395 1836 159 2.402 0.015 0.386 12.498 0.111 2.16E+09
Birds Audio 645 260 19 1.014 0.053 0.206 5.407 0.123 3.19E+06
Bookmarks Text 87860 2150 208 2.028 0.010 0.213 12.308 0.315 3.93E+10
CAL500 Music 502 68 174 26.044 0.150 1.000 20.578 0.192 5.94E+06
CHD_49 Medicine 555 49 6 2.580 0.430 0.531 5.766 0.267 1.63E+05
Corel16k001 Image 13770 500 153 2.859 0.019 0.349 34.155 0.142 1.05E+09
Corel16k002 Image 13760 500 164 2.882 0.018 0.354 37.678 0.128 1.13E+09
Corel16k003 Image 13760 500 154 2.829 0.018 0.350 37.058 0.137 1.06E+09
Corel16k004 Image 13840 500 162 2.842 0.018 0.351 35.899 0.126 1.12E+09
Corel16k005 Image 13850 500 160 2.858 0.018 0.364 34.936 0.133 1.11E+09
Corel16k006 Image 13860 500 162 2.885 0.018 0.361 33.398 0.128 1.12E+09
Corel16k007 Image 13920 500 174 2.886 0.017 0.371 37.715 0.120 1.21E+09
Corel16k008 Image 13860 500 168 2.883 0.017 0.357 36.200 0.121 1.16E+09
Corel16k009 Image 13880 500 173 2.930 0.017 0.373 36.446 0.119 1.20E+09
Corel16k010 Image 13620 500 144 2.815 0.020 0.345 32.998 0.147 9.81E+08
Corel5k Image 5000 499 374 3.522 0.009 0.635 189.568 0.030 9.33E+08
Delicious Text 16110 500 983 19.020 0.019 0.981 71.134 0.143 7.92E+09
Emotions Music 593 72 6 1.868 0.311 0.422 1.478 0.933 2.56E+05
Enron Text 1702 1001 53 3.378 0.064 0.442 73.953 0.141 9.03E+07
EukaryoteGO Biology 7766 12690 22 1.146 0.052 0.014 45.012 0.281 2.17E+09
EukaryotePseAAC Biology 7766 440 22 1.146 0.052 0.014 45.012 0.281 7.52E+07
Eurlex-dc Text 19350 5000 412 1.292 0.003 0.083 - - 3.99E+10
Eurlex-ev Text 19350 5000 3993 5.310 0.001 0.851 - - 3.86E+11
Eurlex-sm Text 19350 5000 201 2.213 0.011 0.129 - - 1.94E+10
Flags Image 194 19 7 3.392 0.485 0.422 2.255 0.381 2.58E+04
Foodtruck Recommend. 407 21 12 2.290 0.191 0.285 7.095 0.409 1.03E+05
Genbase Biology 662 1186 27 1.252 0.046 0.048 37.315 0.157 2.12E+07
GnegativeGO Biology 1392 1717 8 1.046 0.131 0.074 18.448 0.536 1.91E+07
GnegativePseAAC Biology 1392 440 8 1.046 0.131 0.074 18.448 0.536 4.90E+06
GpositiveGO Biology 519 912 4 1.008 0.252 0.438 3.861 0.667 1.89E+06
GpositivePseAAC Biology 519 440 4 1.008 0.252 0.438 3.861 0.667 9.13E+05
HumanGO Biology 3106 9844 14 1.185 0.085 0.027 15.289 0.418 4.28E+08
HumanPseAAC Biology 3106 440 14 1.185 0.085 0.027 15.289 0.418 1.91E+07
Image Image 2000 294 5 1.236 0.247 0.625 1.193 0.900 2.94E+06
Imdb Text 120900 1001 28 2.000 0.071 0.037 25.124 0.868 3.39E+09
Langlog Text 1460 1004 75 1.180 0.016 0.208 39.267 0.035 1.10E+08
Mediamill Video 43910 120 101 4.376 0.043 0.149 256.405 0.342 5.32E+08
Medical Text 978 1449 45 1.245 0.028 0.096 89.501 0.039 6.38E+07
Ohsumed Text 13930 1002 23 1.663 0.072 0.082 7.869 0.526 3.21E+08
Nus-Wide BoW Image 269599 501 81 1.869 0.023 0.068 - - 1.09E+10
Nus-Wide cVLADplus Image 269600 129 81 1.869 0.023 0.068 - - 2.82E+09
PlantGO Biology 978 3091 12 1.079 0.090 0.033 6.690 0.318 3.63E+07
PlantPseAAC Biology 978 440 12 1.079 0.090 0.033 6.690 0.318 5.16E+06
rcv1subset1 Text 6000 47240 101 2.88 0.029 0.171 54.492 0.202 2.86E+10
rcv1subset2 Text 6000 47240 101 2.634 0.026 0.159 45.514 0.179 2.86E+10
rcv1subset3 Text 6000 47240 101 2.614 0.026 0.157 68.333 0.183 2.86E+10
rcv1subset4 Text 6000 47230 101 2.484 0.025 0.136 89.371 0.163 2.86E+10
rcv1subset5 Text 6000 47240 101 2.642 0.026 0.158 69.682 0.170 2.86E+10
Reuters-K500 Text 6000 500 103 1.462 0.014 0.135 54.081 0.080 3.09E+08
Scene Image 2407 294 6 1.074 0.179 0.234 1.254 0.933 4.25E+06
Slashdot Text 3782 1079 22 1.181 0.054 0.041 19.462 0.273 8.98E+07
Stackex_chemistry Text 6961 540 175 2.109 0.012 0.436 56.878 0.056 6.58E+08
Stackex_chess Text 1675 585 227 2.411 0.011 0.644 85.790 0.030 2.22E+08
Stackex_coffee Text 225 1763 123 1.987 0.016 0.773 27.241 0.017 4.88E+07
Stackex_cooking Text 10490 577 400 2.225 0.006 0.609 37.858 0.034 2.42E+09
Stackex_cs Text 9270 635 274 2.556 0.009 0.512 85.002 0.049 1.61E+09
Stackex_philosophy Text 3971 842 233 2.272 0.010 0.566 68.753 0.040 7.79E+08
tmc2007 Text 28600 49060 22 2.158 0.098 0.047 15.157 0.818 3.09E+10
tmc2007-500 Text 28600 500 22 2.22 0.101 0.041 17.134 0.818 3.15E+08
VirusGO Biology 207 749 6 1.217 0.203 0.266 4.041 0.400 9.30E+05
VirusPseAAC Biology 207 440 6 1.217 0.203 0.266 4.041 0.400 5.46E+05
Water-quality Chemistry 1060 16 14 5.073 0.362 0.778 1.767 0.473 2.37E+05
Yahoo_Arts Text 7484 23150 26 1.654 0.064 0.080 94.738 0.338 4.50E+09
Yahoo_Business Text 11210 21920 30 1.599 0.053 0.021 880.178 0.209 7.37E+09
Yahoo_Computers Text 12440 34100 33 1.507 0.046 0.034 176.695 0.364 1.40E+10
Yahoo_Education Text 12030 27530 33 1.463 0.044 0.042 168.114 0.199 1.09E+10
Yahoo_Entertainment Text 12730 32000 21 1.414 0.067 0.026 64.417 0.367 8.55E+09
Yahoo_Health Text 9205 30610 32 1.644 0.051 0.036 653.531 0.192 9.02E+09
Yahoo_Recreation Text 12830 30320 22 1.429 0.065 0.041 12.203 0.455 8.56E+09
Yahoo_Reference Text 8027 39680 33 1.174 0.036 0.034 461.863 0.169 1.05E+10
Yahoo_Science Text 6428 37190 40 1.45 0.036 0.071 52.632 0.196 9.56E+09
Yahoo_Social Text 12110 52350 39 1.279 0.033 0.030 257.704 0.189 2.47E+10
Yahoo_Society Text 14510 31800 27 1.67 0.062 0.073 302.068 0.382 1.25E+10
Yeast Biology 2417 103 14 4.237 0.303 0.082 7.197 0.670 3.49E+06
Yelp Text 10810 671 5 1.638 0.328 1.000 2.876 0.700 3.63E+07

Description of the datasets

20NG [Lang 2008]: It is a compilation of around 20000 posts to 20 newsgroups. Around 1000 posts are available for each group.

3sources [Greene et al. 2009]: These datasets include 948 news articles covering 416 distinct news stories from the period February–April 2009. They were collected from 3 sources: BBC, Reuters and The Guardian. Of these stories, 169 were reported in all three sources, 194 in two sources, and 53 appeared in a single news source. Each story was manually annotated with one or more of six topical labels: business, entertainment, health, politics, sport, technology. In this way, three datasets were created with the news from BBC, Reuters and The Guardian respectively. A feature selection method was applied to reduce the feature space and achieve better performance, selecting 1000 features for each dataset. In addition, a dataset with the intersection of these three datasets (3sources-inter3000, the news that appear in all three sources) was created with the union of the 1000 features of each dataset. The 3sources-inter3000 dataset can also be considered a Multi-View Multi-Label (MVML) dataset, since it includes features from 3 distinct sources. The original data was downloaded from http://mlg.ucd.ie/datasets/3sources.html

Bibtex [Katakis et al. 2008]: This dataset is based on the data of the ECML/PKDD 2008 discovery challenge. It contains 7395 bibtex entries from the BibSonomy social bookmark and publication sharing system, annotated with a subset of the tags assigned by BibSonomy users.

Birds [Briggs et al. 2013]: It is a dataset for predicting the set of bird species that are present in a ten-second audio clip.

Bookmarks [Katakis et al. 2008]: It is based on the data of the ECML/PKDD 2008 discovery challenge and contains bookmark entries from the BibSonomy system.

CHD_49 [Shao et al. 2013]: This dataset contains information on coronary heart disease (CHD) in traditional Chinese medicine (TCM). It was filtered by specialists, who removed irrelevant features and kept only 49 features.

CAL500 [Turnbull et al. 2008]: It is a music dataset composed of 502 songs. Each one was manually annotated by at least three human annotators, who employed a vocabulary of 174 tags related to semantic concepts. These tags span 6 semantic categories: instrumentation, vocal characteristics, genres, emotions, acoustic quality of the song, and usage terms.

Corel5k [Duygulu et al. 2002]: Corel5k is a popular benchmark for image classification and annotation methods. It is based on 5000 Corel images.

Corel16k [Barnard et al. 2003]: It is derived from the popular ECCV 2002 benchmark dataset by eliminating the less frequent labels.

Delicious [Tsoumakas et al. 2008]: This dataset contains textual data of web pages along with their tags.

Emotions [Tsoumakas et al. 2008]: Also called Music in [Read 2010]. It is a small dataset for classifying music into the emotions that it evokes according to the Tellegen-Watson-Clark model of mood: amazed-suprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely and angry-aggresive. It consists of 593 songs with 6 classes.

Enron [Read et al. 2008]: The Enron dataset is a subset of the Enron email corpus, labelled with a set of categories. It is based on a collection of email messages that were categorized into 53 topic categories, such as company strategy, humour and legal advice.

Eukaryote [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 7766 sequences for Eukaryote species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 22 subcellular locations (acrosome, cell membrane, cell wall, centrosome, chloroplast, cyanelle, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, extracell, golgi apparatus, hydrogenosome, lysosome, melanosome, microsome, mitochondrion, nucleus, peroxisome, spindle pole body, synapse and vacuole).

EUR-Lex [Loza and Fürnkranz 2008]: The EUR-Lex text collection is a collection of 19348 documents about European Union law. It contains many different types of documents, such as treaties, legislation, case-law and legislative proposals, which are indexed according to several orthogonal categorization schemes to allow for multiple search facilities. The most important categorization is provided by the EUROVOC descriptors, which form a topic hierarchy with almost 4000 categories regarding different aspects of European law.

Flags [Gonçalves et al. 2013]: This dataset contains details of some countries and their flags, and the goal is to predict some of the features. The dataset was first used for multi-label classification in [Gonçalves et al. 2013], and the original dataset can be found at the UCI repository.

Foodtruck [Rivolli et al. 2017]: The food truck dataset was created from the answers provided by 407 survey participants. They were either approached at fast food festivals and popular events or anonymously received a request to fill out a questionnaire, in Portuguese, describing their personal information and their preferences when choosing among food trucks.

Genbase [Diplaris et al. 2005]: It is a dataset for protein function classification. Each instance is a protein and each label is a protein class. The dataset is small relative to its large number of labels.

Gnegative [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 1392 sequences for Gram-negative bacterial (Gnegative) species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 8 subcellular locations (cell inner membrane, cell outer membrane, cytoplasm, extracellular, fimbrium, flagellum, nucleoid and periplasm).

Gpositive [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 519 sequences for Gram-positive species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 4 subcellular locations (cell membrane, cell wall, cytoplasm and extracell).

Human [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 3106 sequences for Human species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 14 subcellular locations (centriole, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, extracell, golgi apparatus, lysosome, microsome, mitochondrion, nucleus, peroxisome, plasma membrane, and synapse).

Image [Zhang and Zhou 2007]: This dataset is composed of 2,000 images. Each color image is first converted to the CIE Luv space, a more perceptually uniform color space in which perceived color differences correspond closely to Euclidean distances. The image is then divided into 49 blocks using a 7×7 grid, and in each block the first and second moments (mean and variance) of each band are computed, corresponding to a low-resolution image and to computationally inexpensive texture features respectively. Finally, each image is transformed into a 49×3×2 = 294-dimensional feature vector.
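The 49×3×2 = 294 feature layout can be illustrated with a short Python sketch. It assumes the image has already been converted to the Luv space and is given as a nested list of (L, u, v) tuples whose height and width are multiples of 7. The function name and the ordering of the features in the output vector are assumptions, not the layout of the distributed files.

```python
def block_moment_features(image, grid=7):
    """Return the mean and variance of each band in each grid block.

    image: list of rows, each row a list of (L, u, v) tuples.
    The output has grid * grid * 3 * 2 values (294 for a 7x7 grid).
    """
    height, width = len(image), len(image[0])
    block_h, block_w = height // grid, width // grid
    n_bands = len(image[0][0])

    features = []
    for by in range(grid):
        for bx in range(grid):
            for band in range(n_bands):
                # Collect the values of this band inside the current block.
                values = [
                    image[y][x][band]
                    for y in range(by * block_h, (by + 1) * block_h)
                    for x in range(bx * block_w, (bx + 1) * block_w)
                ]
                mean = sum(values) / len(values)
                variance = sum((v - mean) ** 2 for v in values) / len(values)
                features.extend([mean, variance])
    return features


# A constant 14x14 toy image: every block has the same mean and zero variance.
toy_image = [[(1.0, 2.0, 3.0)] * 14 for _ in range(14)]
toy_features = block_moment_features(toy_image)
```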

IMDB [Read 2010]: It contains 120919 movie plot text summaries from the Internet Movie Database (www.imdb.com), labelled with one or more genres.

LangLog [Read 2010]: It was compiled from the Language Log Forum, which discusses various topics relating to language; 75 topics represent the label space.

Mediamill [Snoek et al. 2006]: It is a multimedia dataset for generic video indexing, which was extracted from the TRECVID 2005/2006 benchmark. This dataset contains 85 hours of international broadcast news data categorized into 101 labels, and each video instance is represented as a 120-dimensional vector of numeric features.

Medical [Pestian et al. 2007]: The dataset is based on the data made available during the Computational Medicine Center's 2007 Medical Natural Language Processing Challenge. It consists of 978 clinical free text reports labelled with one or more out of 45 disease codes.

Nus-Wide [Chua et al. 2009]: We provide two versions of the full NUS-WIDE dataset. In the first version, images are represented using 500-D bag of visual words features provided by the creators of the dataset [Chua et al. 2009]. In the second version, images are represented using 128-D cVLAD+ features described in [Spyromitros et al. 2014]. In both cases, the 1st attribute is the image id.

Ohsumed [Joachims 1998]: This collection includes medical abstracts from the MeSH categories of the year 1991. The specific task was to categorize the abstracts into 23 cardiovascular disease categories.

Plant [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 978 sequences for Plant species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 12 subcellular locations (cell membrane, cell wall, chloroplast, cytoplasm, endoplasmic reticulum, extracellular, golgi apparatus, mitochondrion, nucleus, peroxisome, plastid, and vacuole).

Reuters-RCV1 [Lewis et al. 2004]: This dataset is a well-known benchmark for text classification methods. It has 5 subsets, each one with 6000 articles assigned to one or more of 101 topics. The Reuters-K500 dataset was obtained by selecting 500 features with the method proposed in [Tsoumakas et al. 2007].

Scene [Boutell et al. 2004]: It is an image dataset that contains 2407 images, annotated with up to 6 classes: beach, sunset, fall foliage, field, mountain and urban. Each image is described with 294 visual numeric features corresponding to spatial colour moments in the LUV space.

Slashdot [Read 2010]: It consists of article blurbs with subject categories representing the label space, mined from http://slashdot.org.

Stackex [Charte et al. 2015]: It is a collection of six datasets generated from the text collected in a selection of Stack Exchange forums. It includes stackex_chess, stackex_chemistry, stackex_coffee, stackex_cooking, stackex_cs and stackex_philosophy.

TMC2007 [Srivastava et al. 2005]: It is a subset of the Aviation Safety Reporting System dataset. It contains 28596 aviation safety free text reports that the flight crew submit after each flight about events that took place during the flight. The goal is to label the documents with respect to the types of problem they describe. The dataset has 49060 discrete attributes corresponding to terms in the collection. The safety reports are provided with 22 labels, each of them representing a problem type that appears during a flight. The dataset TMC2007-500, obtained by selecting the top 500 features, is also included.

Virus [Xu et al. 2016]: This dataset is used to predict the sub-cellular locations of proteins according to their sequences. It contains 207 sequences for Virus species. Both the GO (Gene Ontology) features and PseAAC (including 20 amino acid, 20 pseudo-amino acid and 400 dipeptide components) are provided. There are 6 subcellular locations (viral capsid, host cell membrane, host endoplasmic reticulum, host cytoplasm, host nucleus and secreted).

Water quality [Blockeel et al. 1999]: This dataset is used to predict the quality of water of Slovenian rivers, given 16 characteristics such as the temperature, pH, hardness, NO2 or CO2.

Yahoo [Ueda and Saito 2002]: It is a dataset for categorizing web pages. It consists of 14 top-level categories, each of which is divided into a number of second-level categories. By focusing on the second-level categories, 11 of the 14 independent text categorization problems were used.

Yeast [Elisseeff and Weston 2001]: This dataset contains micro-array expressions and phylogenetic profiles for 2417 yeast genes. Each gene is annotated with a subset of 14 functional categories (e.g. Metabolism, energy, etc.) of the top level of the functional catalogue.

Yelp [Sajnani et al. 2013]: This dataset was obtained from users' reviews and ratings of businesses and services on Yelp. It is used to categorize whether the food, service, ambiance, deals and price of one of these businesses are good or not. It contains more than 10000 user reviews. The dataset was downloaded from http://www.ics.uci.edu/~vpsaini/.

Description of the format of the datasets

All the datasets included in the repository are in Mulan [Tsoumakas et al. 2011] and Meka [Read et al. 2016] formats, both based on Weka’s arff format [Hall et al. 2009].

Mulan dataset format

In Mulan, each dataset consists of two files: an xml file and an arff file.

  • XML file: since the arff file does not indicate which attributes are labels, the xml file must list the attributes that act as labels. It has a simple format, as shown in the following example:
    <?xml version="1.0" encoding="utf-8"?>
    <labels xmlns="http://mulan.sourceforge.net/labels">
      <label name="amazed-suprised"></label>
      <label name="happy-pleased"></label>
      <label name="relaxing-calm"></label>
      <label name="quiet-still"></label>
      <label name="sad-lonely"></label>
      <label name="angry-aggresive"></label>
    </labels>

  • ARFF file: this file declares the full set of attributes and labels (without distinguishing between them) followed by the instances. The specific format of this file is as follows:
    • The relation name goes on the first line, following the statement @relation.
    • If the relation name has spaces or special characters, it must be enclosed in quotes.
    • Each attribute is defined on its own line.
      • The attribute name must be enclosed in quotes if it has spaces or special characters.
      • Labels are always binary {0, 1}.
      • Attribute types can be the following:
        • numeric (integer and real are treated as numeric).
        • <nominal-values>: all the possible values, between braces and separated by commas.
        • string
        • date[<format>]
    • The instances follow the statement @data, each one on its own line. Attribute values are separated by commas and must appear in the same order in which the attributes were declared.
      • There is also a shorter way to write instances, in which attributes with zero value are omitted. In this case, each instance is enclosed in braces and consists of comma-separated pairs, each composed of an attribute and its value.
    • Comments are introduced with the character %.
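The shorter (sparse) way of writing instances can be expanded back to a dense row with a few lines of Python. This sketch assumes the Weka sparse convention, in which each pair is written as a 0-based attribute index followed by a space and the value, for example `{1 0.5, 4 1}`. The function name is hypothetical, values are kept as strings, and quoted values containing commas are not handled.

```python
def sparse_to_dense(line, n_attributes):
    """Expand one sparse ARFF instance into a dense list of string values.

    Attributes that are not mentioned in the sparse line are set to "0".
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("a sparse instance must be enclosed in braces")

    dense = ["0"] * n_attributes
    inner = body[1:-1].strip()
    if inner:
        for pair in inner.split(","):
            # Assumption: each pair is "<0-based index> <value>".
            index, value = pair.split(None, 1)
            dense[int(index)] = value.strip()
    return dense


dense_row = sparse_to_dense("{1 0.5, 4 1}", n_attributes=6)
empty_row = sparse_to_dense("{}", n_attributes=3)
```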

    A Mulan arff file is shown in the following example:

    @relation emotions_test

    @attribute att1 numeric
    @attribute att2 numeric
    @attribute att3 numeric
    ...
    @attribute att72 numeric
    @attribute amazed-suprised {0,1}
    @attribute happy-pleased {0,1}
    @attribute relaxing-calm {0,1}
    @attribute quiet-still {0,1}
    @attribute sad-lonely {0,1}
    @attribute angry-aggresive {0,1}

    %Starting with the data
    @data
    0.094829,0.204498,0.082824, ..., 0.335371,1,0,0,0,1,1
    0.065248,0.117975,0.08597, ..., 0.442898,0,0,0,1,0,0
    0.101287,0.23254,0.078028, ..., 1.183461,1,1,0,0,0,0
    ...
    0.172427,0.378696,0.081777, ..., 1.294949,1,1,1,0,0,0
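Reading a Mulan dataset therefore means combining both files: the xml file gives the label names, and the arff header gives the column order. The following Python sketch shows one way to do this with the standard library only. It is not part of Mulan; the function names are hypothetical, and the header parsing is deliberately minimal (it relies on `shlex` to handle quoted attribute names and ignores everything after @data).

```python
import shlex
import xml.etree.ElementTree as ET

# Namespace declared in the Mulan labels file.
MULAN_NS = "{http://mulan.sourceforge.net/labels}"


def read_label_names(xml_text):
    """Return the label names listed in a Mulan xml labels file."""
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    return [label.get("name") for label in root.iter(MULAN_NS + "label")]


def read_attribute_names(arff_text):
    """Return the attribute names of an arff header, in declaration order."""
    names = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("@data"):
            break
        if stripped.lower().startswith("@attribute"):
            # The second token is the (possibly quoted) attribute name.
            names.append(shlex.split(stripped)[1])
    return names


def split_columns(xml_text, arff_text):
    """Return (feature_indices, label_indices) for the arff columns."""
    labels = set(read_label_names(xml_text))
    names = read_attribute_names(arff_text)
    feature_idx = [i for i, name in enumerate(names) if name not in labels]
    label_idx = [i for i, name in enumerate(names) if name in labels]
    return feature_idx, label_idx


example_xml = """<?xml version="1.0" encoding="utf-8"?>
<labels xmlns="http://mulan.sourceforge.net/labels">
  <label name="happy-pleased"></label>
  <label name="sad-lonely"></label>
</labels>"""

example_arff = """@relation toy
@attribute att1 numeric
@attribute att2 numeric
@attribute 'att 3' numeric
@attribute happy-pleased {0,1}
@attribute sad-lonely {0,1}
@data
0.1,0.2,0.3,1,0
"""

feature_idx, label_idx = split_columns(example_xml, example_arff)
```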

Meka dataset format

In Meka format, only an arff file is necessary to define the dataset. In this case, the separation between attributes and labels is done in the arff file itself.
Meka’s arff format is the same as Mulan’s, except for the relation name line, which indicates the attributes that are labels. After the relation name and a colon, the option “-C” is followed by a positive integer q if the first q attributes are labels, or by a negative integer -q if the labels are the last q attributes.

@relation "relationName: -C q"

In the following example a dataset in Meka format is shown:

@relation "emotions: -C -6"

@attribute att1 numeric
@attribute att2 numeric
@attribute att3 numeric
...
@attribute att72 numeric
@attribute amazed-suprised {0,1}
@attribute happy-pleased {0,1}
@attribute relaxing-calm {0,1}
@attribute quiet-still {0,1}
@attribute sad-lonely {0,1}
@attribute angry-aggresive {0,1}

@data
0.094829,0.204498,0.082824, ..., 0.335371,1,0,0,0,1,1
0.065248,0.117975,0.08597, ..., 0.442898,0,0,0,1,0,0
0.101287,0.23254,0.078028, ..., 1.183461,1,1,0,0,0,0
...
0.172427,0.378696,0.081777, ..., 1.294949,1,1,1,0,0,0
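The “-C” convention can be decoded with a short Python sketch. It is not taken from Meka; the function names are hypothetical, and the regular expression only looks for the “-C” option and ignores any other option that may appear in the relation name.

```python
import re


def parse_meka_relation(line):
    """Return (relation_name, c_value) from a Meka @relation line."""
    match = re.match(r"@relation\s+(.*)$", line.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError("not a @relation line")
    text = match.group(1).strip().strip("'\"")

    option = re.search(r"-C\s+(-?\d+)", text)
    if option is None:
        raise ValueError("no -C option found in the relation name")

    name = text.split(":", 1)[0].strip()
    return name, int(option.group(1))


def label_indices(c_value, n_attributes):
    """Positive value: the first q attributes are labels.
    Negative value: the last q attributes are labels."""
    q = abs(c_value)
    if c_value > 0:
        return list(range(q))
    return list(range(n_attributes - q, n_attributes))


name, c_value = parse_meka_relation('@relation "emotions: -C -6"')
# The emotions example declares 72 features followed by 6 labels.
emotions_labels = label_indices(c_value, n_attributes=78)
```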

Software

We have developed MLDA [Moyano et al. 2017], a tool for the exploration and analysis of multi-label and multi-view multi-label datasets. It includes both a GUI tool and a Java API. It provides an easy-to-use tool for the analysis of multi-label datasets, including a wide set of characterization metrics, charts for measuring the imbalance and the relationships among labels, several methods for data preprocessing and transformation, characterization of multi-view multi-label datasets, and the ability to load several datasets simultaneously. It is released under the GPLv3 license.

More information about MLDA is available at the GitHub repository. The latest release, version 1.2.4, is available here, and the documentation in PDF is available here.

References

[Barnard et al. 2003]: Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan. Matching Words and Pictures. Journal of Machine Learning Research. 2003. Vol 3, pp 1107-1135.
[Blockeel et al. 1999]: H. Blockeel, S. Džeroski, and J. Grbovic. Simultaneous prediction of multiple chemical parameters of river water quality with tilde. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 1704:32–40, 1999.
[Boutell et al. 2004]: Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, September 2004.
[Briggs et al. 2013]: Forrest Briggs, Yonghong Huang, Raviv Raich, Konstantinos Eftaxias, Zhong Lei, William Cukierski, Sarah Frey Hadley, Adam Hadley, Matthew Betts, Xiaoli Z. Fern, Jed Irvine, Lawrence Neal, Anil Thomas, Gábor Fodor, Grigorios Tsoumakas, Hong Wei Ng, Thi Ngoc Tho Nguyen, Heikki Huttunen, Pekka Ruusuvuori, Tapio Manninen, Aleksandr Diment, Tuomas Virtanen, Julien Marzat, Joseph Defretin, Dave Callender, Chris Hurlburt, Ken Larrey, and Maxim Milakov. The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment. In IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2013, Southampton, United Kingdom, September 22-25, 2013, pages 1–8, 2013.
[Charte et al. 2015]: Francisco Charte and David Charte. Working with multilabel datasets in R: The mldr package. The R Journal, 7(2):149–162, 2015.
[Chua et al. 2009]: Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. “NUS-WIDE: A Real-World Web Image Database from National University of Singapore”, ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009.
[Diplaris et al. 2005]: Sotiris Diplaris, Grigorios Tsoumakas, Pericles Mitkas, and Ioannis Vlahavas. Protein Classification with Multiple Algorithms. In Panhellenic Conference on Informatics, pages 448–456. 2005.
[Duygulu et al. 2002]: Pinar Duygulu, Kobus Barnard, Nando de Freitas, and David Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. 7th European Conference on Computer Vision, pp IV:97-112, 2002.
[Elisseeff and Weston 2001]: Andre Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems 14, volume 14, pages 681–687, 2001.
[Gonçalves et al. 2013]: E.C. Goncalves, Alexandre Plastino, and Alex A. Freitas. A genetic algorithm for optimizing the label ordering in multi-label classifier chains. In IEEE 25th International Conference on Tools with Artificial Intelligence, pages 469–476. IEEE Computer Society Conference Publishing Services (CPS), 2013.
[Greene et al. 2009]: Derek Greene and Pádraig Cunningham. A matrix factorization approach for integrating multiple data views. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part I, ECML PKDD ’09, pages 423–438, 2009.
[Hall et al. 2009]: Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann & Ian H Witten. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explor. Newsl., 11, 10-18.
[Joachims 1998]: Thorsten Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Nédellec C., Rouveirol C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol 1398.
[Katakis et al. 2008]: Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel Text Classification for Automated Tag Suggestion. In Proceedings of the ECML/PKDD 2008 Discovery Challenge, 2008.
[Lang 2008]: K. Lang. 2008. The 20 newsgroup dataset. http://people.csail.mit.edu/jrennie/20Newsgroups/.
[Lewis et al. 2004]: David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[Loza and Fürnkranz 2008]: Eneldo Loza Mencía and Johannes Fürnkranz. Efficient Pairwise Multilabel Classification for Large-Scale Problems in the Legal Domain. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-2008), Part II, pages 50–65. 2008.
[Moyano et al. 2017]: Jose M. Moyano, Eva L. Gibaja, Sebastián Ventura, MLDA: A tool for analyzing multi-label datasets, Knowledge-Based Systems, Volume 121, 1 April 2017, Pages 1-3, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2017.01.018.
[Pestian et al. 2007]: John P. Pestian, Christopher Brew, Pawel Matykiewicz, D. J. Hovermale, Neil Johnson, K. Bretonnel Cohen, and Wodzislaw Duch. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing (BioNLP ’07), pages 97–104, 2007.
[Read et al. 2008]: Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Multi-label Classification Using Ensembles of Pruned Sets. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, volume 0, pages 995–1000, Washington, DC, USA, 2008. IEEE Computer Society.
[Read 2010]: Jesse Read. Scalable multi-label classification. PhD Thesis, University of Waikato, 2010.
[Read et al. 2016]: Jesse Read, Peter Reutemann, Bernhard Pfahringer & Geoff Holmes (2016). MEKA: A Multi-label/Multi-target Extension to Weka. Journal of Machine Learning Research, 17, 1-5.
[Rivolli et al. 2017]: Adriano Rivolli, Larissa C. Parker, and Andre C.P.L.F. de Carvalho. Food Truck Recommendation Using Multi-label Classification. In EPIA 2017: Progress in Artificial Intelligence, pages 585–596, 2017.
[Sajnani et al. 2013]: H. Sajnani, V. Saini, K. Kumar, E. Gabrielova, P. Choudary, C. Lopes. 2013. Classifying Yelp reviews into relevant categories. http://www.ics.uci.edu/~vpsaini/.
[Sechidis et al. 2011]: K. Sechidis, G. Tsoumakas, I. Vlahavas. 2011. On the Stratification of Multi-label Data. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD ’11, pages 145–158, 2011.
[Shao et al. 2013]: H. Shao, G.Z. Li, G.P. Liu, and Y.Q. Wang. Symptom selection for multi-label data of inquiry diagnosis in traditional Chinese medicine. Science China Information Sciences, 56(5):1–13, 2013.
[Snoek et al. 2006]: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W.M. Smeulders. 2006. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, 421-430.
[Spyromitros et al. 2014]: E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, G. Tsoumakas, I. Vlahavas, “A Comprehensive Study over VLAD and Product Quantization in Large-scale Image Retrieval”, IEEE Transactions on Multimedia, 2014.
[Srivastava et al. 2005]: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005).
[Tsoumakas et al. 2007]: Grigorios Tsoumakas and Ioannis Vlahavas. Random k-Labelsets: An Ensemble Method for Multilabel Classification, pages 406–417. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
[Tsoumakas et al. 2008]: G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
[Tsoumakas et al. 2011]: G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas (2011) “Mulan: A Java Library for Multi-Label Learning”, Journal of Machine Learning Research, 12, pp. 2411-2414.
[Turnbull et al. 2008]: Douglas Turnbull, Luke Barrington, David Torres and Gert Lanckriet. Semantic Annotation and Retrieval of Music and Sound Effects, IEEE Transactions on Audio, Speech and Language Processing 16(2), pp. 467-476, 2008.
[Ueda and Saito 2002]: N. Ueda, K. Saito: Parametric mixture models for multi-labeled text, In Neural Information Processing Systems 15 (NIPS 15), MIT Press, pp. 737-744, 2002.
[Xu et al. 2016]: Jianhua Xu, Jiali Liu, Jing Yin, and Chengyu Sun. A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously. Knowledge-Based Systems, 98:172–184, 2016.
[Zhang and Zhou 2007]: Min-Ling Zhang and Zhi-Hua Zhou. ML-kNN: a lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
