1 Introduction

In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary and the secondary categories that are attached to a given document. The primary category (or categories) of a document represents the topic(s) that are central to it, i.e., what the document is mainly about. The secondary categories instead represent topics that are somehow touched upon, albeit peripherally, and do not represent, thematically speaking, the main thrust of the document. This distinction has so far been neglected in text categorization (TC) research. For instance, a systematic search through the literature shows that all authors who have tested their TC systems on the WIPO-alpha collection (Fall et al. 2003) have either considered the primary categories alone, thus basically using WIPO-alpha as a single-label classification dataset (Cai and Hofmann 2004; Fall et al. 2003; Hofmann et al. 2003; Tikk and Biró 2003; Tikk et al. 2004; Tsochantaridis et al. 2004; Vishwanathan et al. 2006), or collapsed primary and secondary categories, thus using it as a multi-label classification dataset (Cai and Hofmann 2007; Rousu et al. 2006). The same happens for those who have worked on the OHSUMED collection (Hersh et al. 1994), who have all collapsed primary and secondary categories (Forman 2003; Lam and Ho 1998; Lewis et al. 1996; Ruiz and Srinivasan 2002; Yang et al. 2003).

We instead contend that the distinction between primary and secondary categories is important, and deserves to be explicitly tackled by TC research. The main reason is that, in most contexts in which the distinction is made, misclassifications may be more or less serious, depending on whether they involve a primary or a secondary category.

For instance, when a patent application is submitted to the European Patent Office (EPO), a primary category from the International Patent Classification (IPC) scheme is attached to the application, and that category determines the expert examiner who will be in charge of evaluating the application. Secondary categories are instead attached for the sole purpose of identifying related prior art, since the appointed expert examiner will need to determine the novelty of the proposed invention against existing patents classified under either the primary or any of the secondary categories. For the purposes of the EPO, failing to recognize the true primary category of a document is thus a more serious mistake than failing to recognize a true secondary category.

Another instance is represented by the ACM Computing Reviews magazine, which publishes reviews of articles and books related to computer science, each classified according to one primary and several secondary categories from the ACM Computing Classification System. Here the primary category determines in which section of the magazine the review is going to be printed, while the secondary categories, together with the primary category, are used for facilitating search (e.g., allowing a user to search only the reviews belonging to a particular category). Again, for the purposes of ACM Computing Reviews, getting the primary category of a document wrong is a more serious mistake than failing to recognize a true secondary category.

This paper represents a first attempt at solving preferential text categorization, a task which we define as the attribution to a textual document \(d_i\) not of a subset \(C_{i}\subseteq C\) of the set of categories C (as in standard multi-label—aka "n-of-m"—text categorization), but of a partial ordering on the set of categories C; this partial ordering specifies which category "applies more" (or "is preferred") to the document than which other category. This is itself a variant of the so-called "category ranking" problem, which consists in attributing to a textual document a total ordering on the categories (Crammer and Singer 2002). We will here discuss a special case of preferential TC, namely the case in which each document is associated with a "three-layer" partial order, consisting of a top layer of one or more primary categories, each of which is preferred to each of a mid layer of secondary categories, which are in turn each preferred to each of a bottom layer of "non-categories" (i.e., categories that do not apply at all to the document).

The original contribution of this paper is threefold. First, we propose an evaluation measure for preferential TC, in which different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness.

Second, we establish baseline results for this task on the WIPO-alpha collection. These results (expressed in terms of the evaluation measure defined in the previous step) are obtained by decomposing the 3-layered preferential TC problem into simpler TC problems for which standard state-of-the-art learning tools can be used. In WIPO-alpha each document is associated with a single primary category and multiple secondary categories. We therefore first test a "poor man's baseline" consisting of generating binary classifiers for all categories involved by means of binary SVMs, and picking, as the primary category of document d, the category which has received the highest score. We then establish an alternative, stronger baseline by decomposing our problem into (i) a single-label (aka 1-of-m) TC problem, aimed at determining the primary category, for which we use a recent, top-performing type of multiclass SVM (Platt et al. 1999); and (ii) a multi-label TC problem, aimed at determining which of the remaining m − 1 categories are secondary categories of the document, for which we use plain binary SVMs (Joachims 1998).

The combined use of these two technologies allows us to attach primary and secondary categories to the test documents, but uses the categories attached to the training documents in a traditional way. That is, a training document d′ for which category c is a primary category has the same impact on the classifier for c as another training document d′′ for which c is a secondary category; we find this unintuitive. Finally, we formulate the problem as a (multivariate) ordinal regression task where, given a document, each category is assigned one of three possible ordered ranks, i.e., primary, secondary, or non-category.

Our third contribution is an improvement on the above baseline results, obtained by using a learning model, dubbed the Generalized Preference Learning Model (Aiolli and Sperduti 2005), that was explicitly devised for learning from training data expressed in preferential form, i.e., in the form "category c′ is preferred to category c′′ for document d". This model, which does not require any decomposition into subproblems, allows us to draw a fine distinction between primary and secondary categories not only in the testing phase but also in the learning phase, and to leverage the different importance that the primary and secondary categories attached to a training document have.

1.1 Outline of the paper

The paper is structured as follows. In Sect. 2 we propose an evaluation function for preferential TC. Section 3 introduces the learning algorithms by means of which we tackle the preferential TC task, from the "standard" (both multiclass and binary) SVM technology that allows us to obtain baseline results (Sect. 3.1) to the more novel "preference learning" technology by means of which we improve on this baseline (Sect. 3.2). Section 4 reports on our experiments, briefly reviewing the WIPO-alpha dataset and describing the experiments we have conducted on it. Section 5 concludes.

2 Preferential text categorization and its evaluation

The de facto standard measure for the evaluation of a binary classifier for category c is

$$ F_{1}(c)={\frac{2\pi(c)\rho(c)}{\pi(c)+\rho(c)}} $$
(1)

which corresponds to the harmonic mean of precision (\(\pi(c)\)) and recall (\(\rho(c)\)) (Lewis 1995). Since multi-label TC reduces to a set of independent binary TC problems, averaging \(F_1(c)\) across the categories is the standard way of evaluating multi-label TC.

However, \(F_1\) is too coarse for preferential TC, since the very notions of precision and recall are too coarse for properly addressing this task. For instance, precision is the probability that, if a category has been assigned to a document, this assignment was correct. But what does "assigning a category" mean in preferential TC? Assigning it as a primary category and assigning it as a secondary category are different things.

2.1 Why Kendall distance is unsuitable

It might be tempting to hypothesize that the "right" evaluation measure for preferential TC is one from the tradition of evaluating automatically produced rankings (predicted rankings) \(\sigma_P\) against a "true" ranking \(\sigma_T\). The (normalized) Kendall distance with penalization p, noted \(K^{p}(\sigma_P,\sigma_T)\) (see Fagin et al. 2004, 2006), is nowadays considered the standard function for the evaluation of rankings in which "ties" (i.e., pairs of objects which have the same place in one of the two rankings) might occur; \(K^p\) and its variants are now heavily used in the evaluation of several IR-related tasks involving ranking, such as ranking system runs based on their results on a given test collection (Voorhees 1998), ranking topics by estimated difficulty (Yom-Tov et al. 2005), or computing the similarity between features in terms of the document rankings that their use brings about (Geng et al. 2007). It is a non-symmetrical distance (i.e., \(K^{p}(\sigma_{i},\sigma_{j})\not =K^{p}(\sigma_{j},\sigma_{i})\)) defined as

$$ K^{p} = {\frac{n_{d}+p\cdot n_{u}}{Z}} $$
(2)

where \(n_d\) is the number of swappings, i.e., pairs of objects ordered one way in \(\sigma_T\) and the other way in \(\sigma_P\); \(n_u\) is the number of false ties, i.e., pairs not tied in \(\sigma_T\) and tied in \(\sigma_P\); p is a penalization to be attributed to each false tie; and Z is a normalization factor (equal to the number of pairs that are ordered in the true ranking) whose aim is to make the range of \(K^p\) coincide with the [0,1] interval. True ties (i.e., pairs tied in \(\sigma_T\)) are not considered by \(K^p\). The penalization factor is typically set to \(p={\frac{1}{2}}\), which is equal to the probability that a ranking algorithm correctly orders the pair by random guessing, so that there is no advantage to be gained from either random guessing or assigning ties between objects. If \(\sigma_P \equiv \sigma_T\), then \(K^{p}(\sigma_P,\sigma_T)=0\); and if \(\sigma_P\) is exactly the reverse of \(\sigma_T\), then \(K^{p}(\sigma_P,\sigma_T)=1\). That is, lower values of \(K^p\) indicate better performance.

For preferential TC, one would use \(K^p\) on a document-by-document basis, i.e., to measure, given a document d, the difference between how the categories in C are ordered in the true ranking (that is, the ones that most apply to d placed at the top of the ranking, and the ones that least apply to d at the bottom) and how they are ordered in the predicted ranking.

However, a closer analysis reveals that this measure is unsuitable for dealing with situations in which a large number of objects to be ranked are allowed to fall into a much smaller number of "layers", and these layers are imbalanced in \(\sigma_T\). This is the case for 3-layered preferential TC, in which the number of categories that we need to rank is ≫ 3 (in WIPO-alpha we use 614 categories), and in which the first and second layers (primary and secondary categories, respectively) are much less populated than the third one (non-categories), given that the average document has one primary category, a handful of secondary categories, and hundreds of non-categories. In such cases, ties (in both \(\sigma_P\) and \(\sigma_T\)) obviously tend to be the norm rather than the exception, and a violation involving a tie in one of the overpopulated layers entails a large penalization, as in the following example.

Example 1

Suppose a WIPO-alpha document \(d_1\) is such that its set of primary categories is \(P(d_1)=\{c_1\}\), its set of secondary categories is \(S(d_1)=\{c_2,c_3,c_4\}\), and its set of non-categories is \(N(d_1)=\{c_5,\ldots,c_{614}\}\). Suppose that the only mistake a classifier Φ makes is incorrectly deeming \(c_4\) a non-category. This will bring about a cost of 610p (since this will generate 610 "false ties" between \(c_4\) and \(c_5,\ldots,c_{614}\)), and \(K^p = .249\). Instead, another classifier Φ′ whose only mistake is to incorrectly deem \(c_1\) a secondary category (arguably a more serious mistake) will only bring about a cost of 3p (since this will generate 3 "false ties" between \(c_1\) and \(c_2, c_3, c_4\)), and \(K^p = .001\). Since lower values of \(K^p\) are better, Φ′ is incorrectly deemed a much better system than Φ.
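To make the computation concrete, here is a small sketch (ours; the function name is illustrative) that computes \(K^p\) for layered rankings such as the ones in this example. Note that the distances reported above correspond to setting p = 1.

```python
from itertools import combinations

def kendall_p(true_layer, pred_layer, p=0.5):
    """Normalized Kendall distance with penalization p (Eq. 2). Both arguments
    map each category to a layer index (0 = primary, 1 = secondary,
    2 = non-category); a lower index means ranked higher."""
    n_d = n_u = Z = 0
    for a, b in combinations(sorted(true_layer), 2):
        dt = true_layer[a] - true_layer[b]
        if dt == 0:
            continue                       # true tie: ignored by K^p
        Z += 1                             # pair ordered in the true ranking
        dp = pred_layer[a] - pred_layer[b]
        if dp == 0:
            n_u += 1                       # false tie
        elif (dt > 0) != (dp > 0):
            n_d += 1                       # swapping
    return (n_d + p * n_u) / Z

cats = ['c%d' % i for i in range(1, 615)]
true = {c: 0 if c == 'c1' else (1 if c in ('c2', 'c3', 'c4') else 2) for c in cats}
phi = dict(true, c4=2)                     # c4 wrongly deemed a non-category
phi1 = dict(true, c1=1)                    # c1 wrongly deemed a secondary category
print(kendall_p(true, phi, p=1))           # 0.2497 (= 610/2443), i.e., .249
print(kendall_p(true, phi1, p=1))          # 0.0012 (= 3/2443),   i.e., .001
```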

A second reason why \(K^p\) is unsuitable for our case is that it is sensitive only to the relative order of the objects, rather than to their having been placed in the correct layer, as in the following example.

Example 2

Suppose a WIPO-alpha document \(d_2\) is such that \(P(d_2)=\{c_1\}\), \(S(d_2)=\{\}\), and \(N(d_2)=\{c_2,\ldots,c_{614}\}\): a system Φ′′ which correctly deems \(c_1\) the primary category and incorrectly deems all of \(c_2,\ldots,c_{614}\) secondary categories would obtain a perfect score, i.e., \(K^p = 0\).

What Example 2 shows is that preferential TC is fundamentally different from (category) ranking. Guessing the perfect ranking is not enough for preferential TC, since one (also) needs to guess, for each category, whether it is above or below the threshold that separates one layer from the next. This is not unlike binary classification, in which it is not enough, given a document, to rank the categories according to their estimated degree of relevance to the document. For this reason, none of the evaluation measures currently used (alongside \(K^p\)) in other ranking tasks in IR—such as mean average precision (MAP), precision at position n (P@n), normalized discounted cumulative gain (NDCG), etc.—is suitable for preferential TC.

2.2 3-layered \(F_1\)

Below we thus propose an alternative evaluation function consisting of a weighted combination of different \(F_1\) measures. First of all observe that, while a "standard" binary classifier for category c can be evaluated in terms of the standard 4-cell contingency matrix (on which \(\pi(c)\) and \(\rho(c)\) are based), n-layered preferential TC brings about an \(n^2\)-cell such matrix; for instance, in 3-layered preferential TC category c is associated with the 9-cell contingency matrix \(M_{PSN}\) of Table 1. From this latter matrix, the three 4-cell sub-matrices \(M_{PS}\), \(M_{SN}\), \(M_{PN}\) of Table 2 can be extracted, each of them detailing, with respect to category c, how many documents are correctly placed into or erroneously swapped between two layers \(t_i\) and \(t_j\). For each category c we thus have a sub-matrix for the first and second layers (\(M_{PS}\)), one for the second and third layers (\(M_{SN}\)), and one for the first and third layers (\(M_{PN}\)). On each such sub-matrix, \(\pi(c)\), \(\rho(c)\), and \(F_1(c)\) can be computed as usual, although their meaning is clearly changed. For instance, recall as computed on sub-matrix \(M_{SN}\) of category c (denoted \(\rho^{SN}(c)\), where S and N stand for "secondary category" and "non-category") is computed as \(\rho^{SN}(c)={\frac{SS}{SS+SN}}\), i.e., as the fraction of documents for which c has been correctly deemed a secondary category, out of the total number of documents for which c is in fact a secondary category and has been deemed either a secondary category or a non-category. The meanings and definitions of \(\rho^{PS}(c)\) and \(\rho^{PN}(c)\), and those of \(\pi^{PS}(c)\), \(\pi^{SN}(c)\), \(\pi^{PN}(c)\), \(F_1^{PS}(c)\), \(F_1^{SN}(c)\), and \(F_1^{PN}(c)\), should now be obvious.

We propose 3-layered \(F_1\) as a measure for evaluating 3-layered preferential TC; this is defined as

$$ F_{1}^{3}(c)=\sum_{i\in\{PS,SN,PN\}}\alpha_{i}F_{1}^{i}(c) $$
(3)

i.e., as a linear combination of the three \(F_1(c)\) functions computed on the \(M_{PS}\), \(M_{SN}\), \(M_{PN}\) sub-matrices. The \(\alpha_i\) are to be set depending on the constraints of the application. We propose that their default values be \(\alpha_{PS}=.25\), \(\alpha_{SN}=.25\), and \(\alpha_{PN}=.50\); these values have the effect of considering erroneous swappings between the first and second layers to bring about a cost as high as swappings between the second and third layers, but only half as high as erroneous swappings between the first and third layers. That is, cost is viewed simply in terms of the distance between the true and the predicted layer of a category. The definitions of \(\pi^3(c)\) and \(\rho^3(c)\) are completely analogous to that of \(F_1^3(c)\).

Of course, several variants of these measures can be used, including ones in which the \(\beta\) parameter of any of the three component \(F_\beta\) measures is not necessarily set to 1; setting \(\beta\) to values different from 1 is well known to have the effect of emphasizing precision at the expense of recall (\(\beta < 1\)), or vice versa (\(\beta > 1\)). For n-layered preferential TC we may analogously define \(F_1^n(c)\), a function that depends on the computation of \(F_1\) on \(\frac{1}{2}n(n-1)\) different sub-matrices.

It can be easily seen that \(F_1^3\) does not suffer from the problems from which \(K^p\) suffers, as described in Sect. 2.1. For instance, in Example 1 system Φ would receive a score of \(F_1^3 = .95\) (resulting from \(F_1^{PS}=1\), \(F_1^{SN}=.80\), \(F_1^{PN}=1\)) while system Φ′ would receive a score of \(F_1^3 = .75\) (resulting from \(F_1^{PS}=0\), \(F_1^{SN}=1\), \(F_1^{PN}=1\)), i.e., Φ would correctly be deemed a much better system than Φ′. Also, in Example 2, system Φ′′ would receive a score of \(F_1^3 = .75\) (resulting from \(F_1^{PS}=1\), \(F_1^{SN}=0\), \(F_1^{PN}=1\)), correctly stating that Φ′′ is far from being the perfect system.
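A sketch of the computation (ours; names are illustrative), which pools the contingency counts over the categories of a document and reproduces the scores of Examples 1 and 2:

```python
def f1(tp, fp, fn):
    """F1 on one sub-matrix; an empty sub-matrix with no errors counts as perfect."""
    if tp == fp == fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_3layered(true_layer, pred_layer,
                alpha={'PS': .25, 'SN': .25, 'PN': .50}):
    """3-layered F1 (Eq. 3), micro-pooled over the categories of one document.
    Both arguments map each category to a layer in {'P', 'S', 'N'}."""
    submatrices = {'PS': ('P', 'S'), 'SN': ('S', 'N'), 'PN': ('P', 'N')}
    score = 0.0
    for name, (pos, neg) in submatrices.items():
        tp = fp = fn = 0
        for c, t in true_layer.items():
            pr = pred_layer[c]
            if {t, pr} <= {pos, neg}:      # the pair falls inside this sub-matrix
                tp += (t == pos and pr == pos)
                fp += (t == neg and pr == pos)
                fn += (t == pos and pr == neg)
        score += alpha[name] * f1(tp, fp, fn)
    return score

# Example 1: the classifier that deems c4 a non-category scores .95
true = {'c1': 'P', 'c2': 'S', 'c3': 'S', 'c4': 'S',
        **{'c%d' % i: 'N' for i in range(5, 615)}}
print(f1_3layered(true, dict(true, c4='N')))   # ~0.95
```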

Table 1 The 9-cell contingency matrix \(M_{PSN}\) for 3-layered preferential TC; P, S, and N stand for "primary category", "secondary category", and "non-category", respectively; PS stands for the number of documents for which \(c_i\) is a true primary category and a predicted secondary category; the interpretation of the other double-letter symbols is analogous
Table 2 The three 4-cell sub-matrices (\(M_{PS}\), \(M_{SN}\), \(M_{PN}\)) of the contingency matrix \(M_{PSN}\) of Table 1

3 Learning algorithms for preferential text categorization

3.1 Two baselines: binary and multiclass SVMs

We concentrate on the problem of predicting, for each document, a single primary category and several (possibly zero) secondary categories. We thus defer to future work the problem of predicting several primary and several secondary categories for the same document; in any case, this problem admits solutions similar to the ones presented here.

At first glance, the 3-layered classification problem of attributing a single primary category and a (possibly empty) set of secondary categories to a given test document d seems to have a very simple solution. In fact, one could build a binary classifier for each \(c_i \in C\) (by using as positive examples of category \(c_i\) all the documents that have \(c_i\) either as a primary or as a secondary category) and use the real-valued scores output by each classifier for d: the category for which the largest score has been obtained would be selected as the primary category, while the set of secondary categories could then be identified by optimizing a threshold for each individual category and selecting the categories whose associated classifier has returned a score above its associated threshold. We have indeed implemented this approach (by using standard binary SVMs); in Sect. 4 this is dubbed "Baseline1".
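A minimal sketch of Baseline1, here written with scikit-learn's linear SVM rather than the SVMlight setup we actually use (function names and the per-category thresholds are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_baseline1(X, Y):
    """One binary SVM per category; the positives for category i are the
    training documents having i as either primary or secondary category
    (Y is a 0/1 matrix of shape [n_docs, n_cats] encoding exactly that)."""
    return [LinearSVC(C=1.0).fit(X, Y[:, i]) for i in range(Y.shape[1])]

def classify_baseline1(classifiers, x, thresholds):
    """Primary = the highest-scoring category; secondaries = the remaining
    categories whose score exceeds their individually optimized threshold."""
    scores = np.array([clf.decision_function(x)[0] for clf in classifiers])
    primary = int(np.argmax(scores))
    secondary = [i for i, s in enumerate(scores)
                 if i != primary and s > thresholds[i]]
    return primary, secondary
```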

However, this simple approach has a major drawback: it does not use the distinction between primary and secondary categories in the training phase. In other words, a training document d′ for which category \(c_i\) is a primary category has the same impact on the classifier for \(c_i\) as another training document d′′ for which \(c_i\) is a secondary category; we find this unintuitive.

A stronger approach (dubbed "Baseline2" in Sect. 4) consists in performing two different classification tasks, a first one (by means of a single-label classifier \(h_P\)) aimed at identifying the primary category of d, and a second one (by means of a multi-label classifier \(h_S\) consisting of m binary classifiers \(h_i^S\), one for each category \(c_i \in \{c_1,\ldots,c_m\}\)) aimed at identifying, among the remaining categories, the secondary categories of d. The \(h_P\) classifier is trained by using, as positive examples of each \(c_i\), only the training documents that have \(c_i\) as primary category. Each of the \(h_i^S\) is instead trained by using as positive examples only the training documents that have \(c_i\) as secondary category, and as negative examples only the training documents that have \(c_i\) as non-category (those that have \(c_i\) as primary category are discarded).

As an aside, we have also tested a slight variant of Baseline2 in which each of the \(h_i^S\) is trained by using as positive examples the training documents that have \(c_i\) either as primary or as secondary category; this corresponds to using the single-label classifier \(h_P\) of Baseline2 and the \(h_i^S\) binary classifiers of Baseline1. In our experiments this "Baseline2a" has given results inferior to those of Baseline2 (see Table 3), and we will thus not discuss it any further.

Table 3 Micro-averaged \(F_1^3\) values obtained by the classifiers

In our experiments, for generating \(h_P\) we use a "multiclass" (i.e., single-label) SVM based on combining binary classifiers, each of them generated through standard binary SVMs, into a Decision Directed Acyclic Graph (DDAG—see Platt et al. 1999 for details; see also Fig. 1). Briefly, a DDAG-based multiclass SVM for categories \(\{c_1,\ldots,c_m\}\) generates a rooted, layered binary DAG with the structure of a "Pachinko machine": layer 1 (the root) is placed at the top, the j-th layer (\(j \in \{1,\ldots,m-1\}\)) contains j internal nodes, down to the m-th (bottom) layer, consisting of m leaves. The i-th node of layer j < m points to the i-th and (i + 1)-th nodes of the (j + 1)-th layer. Each of the \(\frac{1}{2}m(m-1)\) internal nodes is associated with a binary classifier, generated by a binary SVM, in charge of deciding whether a category c′ is less suitable than another category c′′ for test document d. The less suitable category is discarded from consideration; through a recursive descent of the DDAG, categories are repeatedly discarded until only one is left for d. With this method, a total of \(\frac{1}{2}m(m-1)\) binary classifiers need to be trained; however, this is less expensive than it might appear at first, since each classifier is trained only with a small subset of the examples (namely, the training documents which have either of the two categories as primary category). This makes DDAG-based SVMs more efficient at classification time than other classes of multiclass SVMs, such as those presented in Crammer and Singer (2001): the fact that few training documents are used brings about a smaller number of support vectors, which ultimately means that the generated classifiers consist of sparser vectors. Note also that, while \(\frac{1}{2}m(m-1)\) binary classifiers need to be generated at training time, only m − 1 need to be invoked at classification time for each test document.
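The classification-time descent can be sketched as follows (a simplified list-based rendering of ours; `pairwise` is a hypothetical map from ordered category pairs to the corresponding trained binary decision functions):

```python
def ddag_classify(x, categories, pairwise):
    """Descend the DDAG: at each node compare the two 'outermost' surviving
    categories and discard the less suitable one; after m - 1 comparisons a
    single category is left. pairwise[(ci, cj)](x) > 0 means ci is deemed
    more suitable than cj for document x."""
    alive = list(categories)
    while len(alive) > 1:
        ci, cj = alive[0], alive[-1]
        if pairwise[(ci, cj)](x) > 0:
            alive.pop()        # cj is the less suitable category: discard it
        else:
            alive.pop(0)       # ci is the less suitable category: discard it
    return alive[0]            # the predicted primary category
```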

Fig. 1

Scheme of a DDAG-based multiclass SVM, exemplified for a set of four categories {1,2,3,4}. Each box represents a binary classifier c′ versus c′′. Each circle represents the category which is finally picked for d after a descent in the DDAG

The multi-label classifier \(h_S\) is simply formed by the m binary classifiers \(\{h_i^S : c_i \in \{c_1,\ldots,c_m\}\}\) generated by binary SVMs; see Sect. 4.1 for details on how we have performed parameter optimization.

3.2 Moving further: the generalized preference learning model

The method proposed in the previous section decouples the problems of finding the primary and the secondary categories of a document. In a sense, the overall problem has been simplified and reduced to two almost independent modules whose predictions are then combined. This approach would be reasonable if the primary and secondary categories were attached to the document independently of each other. Unfortunately, in many applicative domains this is not the case. For example, it is plausible that the set of secondary categories associated with a WIPO-alpha patent depends on its primary category. Therefore, a classifier which aims at separating secondary categories from non-categories should have access to information about the primary category of the document.

As a principled solution of the 3-layered classification problem, we propose the adoption of the generalized preference learning model (GPLM) (Aiolli 2005; Aiolli and Sperduti 2005), a recent framework which generalizes a large class of supervised learning problems by using the notion of preference between categories. The next sections show how this can be adapted to our needs.

3.2.1 The GPLM

The GPLM assumes the existence of a real-valued relevance function that, for each document d and category c, returns a score r(d, c) (the relevance value) which measures the degree to which category c applies to document d. For each document d the relevance function thus induces a ranking of the categories. A preference is a constraint on categories and documents that should be satisfied by the relevance function. Specifically, GPLM focuses on two types of preferences: (i) qualitative preferences \(c_{i}\,\vartriangleright_d\,c_{j}\) ("category \(c_i\) applies to document d more than \(c_j\) does"), which means that \(r(d, c_i) > r(d, c_j)\); and (ii) quantitative preferences of type \(c\,\vartriangleright_d\,\tau\) ("the degree to which category c applies to document d is at least τ", where \(\tau\in{\mathbb{R}}\)), which means that \(r(d, c) > \tau\).

In this learning framework, supervision for a training document is provided as a set of preferences (of either type). These preferences constitute constraints on the form of the relevance function which has to be learned. The aim of the learning process is to return a relevance function which is as consistent as possible with these constraints.

As a very simple example of how supervised problems can be modelled in the GPLM, let us consider the (single-label) classification problem in which a classifier has to predict the primary category P(d) of a test document d. This case can be modelled by stating, for each training document d′, the set of preferences \(\{P(d')\,\vartriangleright_{d'}\,c_{i}\}_{c_{i}\,\neq\,P(d')}\). Note that, when classifying a test document d, its primary category will correspond to \(\arg\max_{c \in C} r(d, c)\).

As a further example, a multi-label classification problem can instead be modelled by stating, for each training document d′, the set of preferences

$$ \{c_{i}\,\vartriangleright_{d'}\,\tau\}_{c_{i} \in C(d')} \cup \{\tau\,\vartriangleright_{d'}\,c_{j}\}_{c_{j} \in C \setminus C(d')} $$

where τ is a real-valued threshold to be optimized, C is the set of categories, and C(d′) is the set of categories to which d′ belongs. In this case, the set of categories to which a test document d is deemed to belong is obtained by comparing the relevance values with τ: d is deemed to belong to category c if and only if r(d, c) > τ.
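For illustration, here is how the two mappings just described could be generated (a sketch of ours, not GPLM's actual API: categories are integer indices, and a preference is encoded as a tagged tuple—('qual', i, j) for \(c_i\,\vartriangleright_d\,c_j\), ('above', i, k) for \(c_i\,\vartriangleright_d\,\tau_k\), ('below', j, k) for \(\tau_k\,\vartriangleright_d\,c_j\); the document itself enters later, when the preference is embedded):

```python
def singlelabel_prefs(primary, categories):
    """1-of-m supervision: the primary category is preferred to every other."""
    return [('qual', primary, c) for c in categories if c != primary]

def multilabel_prefs(relevant, categories):
    """n-of-m supervision: the relevant categories lie above the (single)
    threshold, indexed 0 here; all the other categories lie below it."""
    return ([('above', c, 0) for c in relevant] +
            [('below', c, 0) for c in categories if c not in relevant])
```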

It should be stressed that in the GPLM any set of preferences can be associated with a document d, so if no information about the relative ranking of two categories for d is available, no preference involving these two categories need be stated. This allows us to impose on the learner only those constraints which are actually needed.

We may instantiate the GPLM by assuming that the relevance of a document to a category can be expressed in linear form, i.e.

$$ r(d,c_{i}) = {\mathbf{w}}_{i} \cdot {\mathbf{d}} $$
(4)

where \({\mathbf{d}}\in {\mathbb{R}}^{D}\) is a (possibly weighted) vectorial representation of d (and D is the size of this vector) and \({\mathbf{w}}_{i} \in {\mathbb{R}}^{D}\) is a weight vector (containing parameters to be learnt) associated with category \(c_i\). Interestingly, for this case it is possible to give effective algorithms which explicitly attempt to minimize the number of wrong predictions on the training set. In fact, following Eq. 4, qualitative and quantitative preferences can be conveniently reformulated as linear constraints. Specifically, let us consider the qualitative preference \(\lambda_{1} \equiv (c_{i}\,\vartriangleright_d\,c_{j})\). This preference imposes the constraint \(r(d, c_i) > r(d, c_j)\) on the relevance function r, which using Eq. 4 can be rewritten as \({\mathbf{w}}_i \cdot {\mathbf{d}} > {\mathbf{w}}_j \cdot {\mathbf{d}}\), or \(({\mathbf{w}}_i \cdot {\mathbf{d}} - {\mathbf{w}}_j \cdot {\mathbf{d}}) > 0\). Similar transformations can be applied to quantitative preferences.

A uniform treatment of quantitative and qualitative preferences can then be obtained by concatenating all the vectors \({\mathbf{w}}_i\) (for \(i \in \{1,\ldots,m\}\)) and all the thresholds \(\tau_1,\ldots,\tau_q\) involved in the formulation of the problem into a single vector \({\mathbf{w}} = ({\mathbf{w}}_1,\ldots,{\mathbf{w}}_m,\tau_1,\ldots,\tau_q)\in{\mathbb{R}}^{mD+q}\).

In the qualitative case, assuming i < j with no loss of generality, \(\lambda_1\) can then be further rewritten as

$$ {\mathbf{w}} \cdot \underbrace{(\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{i-1},{\mathbf{d}},\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{j-i-1}, -{\mathbf{d}},\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{m-j},\underbrace{0,\ldots,0}_{q})}_{\psi(\lambda_{1})} > 0 $$
(5)

where \(\psi(\lambda_{1})\in{\mathbb{R}}^{mD+q}\) is a representation of \(\lambda_1\) and \({\mathbf{0}}\) stands for a vector of all 0's of length D.

In the quantitative case, the preference \(\lambda_{2}\equiv (c_{j}\,\vartriangleright_d\,\tau_k)\) can similarly be expressed as

$$ {\mathbf{w}} \cdot (\underbrace{\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{j-1},{\mathbf{d}},\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{m-j}, \underbrace{{0},\ldots,{ 0}}_{k-1},-1,\underbrace{{0},\ldots,{0}}_{q-k}}_{\psi(\lambda_{2})}) > 0 $$
(6)

while preference \(\lambda_{3}\equiv (\tau_k\,\vartriangleright_d\,c_{j})\) is expressed as

$$ {\mathbf{w}} \cdot (\underbrace{\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{j-1},-{\mathbf{d}},\underbrace{{\mathbf{0}},\ldots,{\mathbf{0}}}_{m-j}, \underbrace{{0},\ldots,{ 0}}_{k-1},1,\underbrace{{0},\ldots,{0}}_{q-k}}_{\psi(\lambda_{3})}) > 0 $$
(7)

In general, the training data can then be reduced to a set of linear constraints of the form \({\mathbf{w}} \cdot \psi(\lambda) > 0\), where \({\mathbf{w}}\) is the vector of weights and thresholds and \(\psi(\lambda)\) is a suitable representation of preference λ. As a consequence, any preference learning problem can be seen as a (homogeneous) linear problem in \({\mathbb{R}}^{mD+q}\). Specifically, any algorithm for linear optimization (e.g., a perceptron or a linear programming package) can be used to solve it, provided the problem has a solution.
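A sketch of the embedding of Eqs. 5–7, using the tuple encoding introduced above (0-based indices; ours):

```python
import numpy as np

def psi(pref, d, m, q):
    """Embed a preference into R^{mD+q} so that it reads w . psi(pref) > 0."""
    D = len(d)
    v = np.zeros(m * D + q)
    kind, a, b = pref
    if kind == 'qual':             # Eq. 5: +d in block a, -d in block b
        v[a * D:(a + 1) * D] = d
        v[b * D:(b + 1) * D] = -d
    elif kind == 'above':          # Eq. 6: +d in block a, -1 at threshold b
        v[a * D:(a + 1) * D] = d
        v[m * D + b] = -1.0
    else:                          # Eq. 7 ('below'): -d in block a, +1 at threshold b
        v[a * D:(a + 1) * D] = -d
        v[m * D + b] = 1.0
    return v
```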

Unfortunately, the set of preferences may generate a set of linear constraints that has no solution (i.e., the set of the ψ(λ)'s is not linearly separable), i.e., such that there is no weight vector \({\mathbf{w}}\) able to fulfil all the constraints induced by the preferences in the training set. To deal with training errors we may minimize, consistently with the principles of Structural Risk Minimization (SRM) theory (Vapnik 1998), an objective function which increases with the number of unfulfilled preferences (the training error) while maximizing the margin \(1/\|{\mathbf{w}}\|\) (where \(\|\cdot\|\) denotes the 2-norm of a vector). To this end, let us consider the quantity \(\rho(\lambda|{\mathbf{w}}) \equiv {\mathbf{w}} \cdot \psi(\lambda)\) as the degree of satisfaction of a preference λ given the hypothesis \({\mathbf{w}}\). This value is greater than zero when the hypothesis is consistent with the preference and smaller than zero otherwise. Now, let us assume a training set \(Tr = \{(d_i, \Lambda_i)\}_{i=1,\ldots,n}\), where \(\Lambda_i\) is the set of preferences associated with the i-th document. We aim at minimizing the number of preferences which are unfulfilled (and thus the number of wrong predictions) on the training set, while trying to maximize the margin \(1/\|{\mathbf{w}}\|\). Let L(·) be a convex, always positive, non-increasing function such that L(0) = 1. It is not difficult to show that the function \(L(\rho(\lambda|{\mathbf{w}}))\) is an upper bound on the error function for the preference λ. Thus, a fairly general approach is one that attempts to minimize a (convex) function like

$$ D({\mathbf{w}}) = {\frac{1}{2}}\|{\mathbf{w}}\|^2 + \gamma \sum_{i=1}^n\sum_{\lambda \in \Lambda_i} L(\rho(\lambda|{\mathbf{w}})) $$
(8)

where γ is a parameter that determines the relative contributions of the regularization and loss terms in the objective function. Note that, if we adopt the hinge loss \(L(\rho) = [1-\rho]_+ = \max(0, 1-\rho)\), the required optimization is the same as that of a binary SVM with an extended training set consisting of all the \(\psi(\lambda) \in {\mathbb{R}}^{mD+q}\), one per preference, each taken as a (positive) example (see Aiolli 2005 for details).

3.2.2 GPLM mappings for 3-layered classification

We now propose GPLM models for a principled solution of the 3-layered classification task. In the following, we denote by d a document having \(P(d) = \{c_p\}\) (a singleton) as the set of its primary categories, \(S(d)=\{c_{s_{1}},\ldots,c_{s_{k}}\}\) as the (possibly empty) set of its secondary categories, and \(N(d)=\{c_{n_{1}},\ldots,c_{n_{l}}\}\) as the set of its non-categories, such that C = P(d) ∪ S(d) ∪ N(d).

3.2.2.1 GPLM: ordinal regression for 3-layered classification

One could be tempted to interpret the 3-layered classification problem as a (multi-variate) ordinal regression (OR) problem, i.e., the problem of attributing a label (here called rank) from the ordered set \(\{{\mathsf{primary}},\;{\mathsf{secondary}},\;{\mathsf{non{\text -}category}}\}\) to each pair consisting of a document and a category.

We first present a GPLM mapping which can be shown to be equivalent to the ordinal regression method of Chu and Keerthi (2007) (see Aiolli 2005 for details). We then discuss why, in our opinion, this setting does not exactly match the notion of preferential classification; the experiments reported in Sect. 4 support this claim.

Specifically, for ordinal regression a GPLM model is built by considering two thresholds \(\tau_p\) and \(\tau_s\) (see Fig. 2). For each training document, the relevance score of a primary category should be above the threshold \(\tau_p\), while the relevance score of any other category (either secondary or non-category) should be below \(\tau_p\). Likewise, the relevance score of any secondary category should be above the threshold \(\tau_s\), while that of any non-category should be below \(\tau_s\). Summarizing, the preference graph for a given training document is as in Fig. 2. As a simple example, consider the set of categories \(C=\{c_1,c_2,c_3,c_4,c_5\}\) and a training document d such that \(P(d)=\{c_1\}\), \(S(d)=\{c_2,c_3\}\), and \(N(d)=\{c_4,c_5\}\). The set of preferences we thus generate is

$$ \Lambda = \{(c_1\,\vartriangleright_d\,\tau_p),(\tau_p\,\vartriangleright_d\,c_2),(\tau_p\,\vartriangleright_d\,c_3),(c_2\,\vartriangleright_d\,\tau_s),(c_3\,\vartriangleright_d\,\tau_s),(\tau_s\,\vartriangleright_d\,c_4),(\tau_s\,\vartriangleright_d\,c_5)\} $$

The standard classification procedure of ordinal regression would select as primary categories the categories whose relevance score is above \(\tau_p\), and as secondary categories all the other categories whose relevance score is above \(\tau_s\). However, a classification performed in this way cannot guarantee that the predictions are consistent with the requirements of certain preferential tasks. For example, in WIPO-alpha there is one and only one primary category per document, and the predictions of ordinal regression as presented above do not necessarily have this property. To overcome this problem in WIPO-alpha, we slightly modify the standard ordinal regression classification procedure by selecting as primary category the category with the highest relevance score, and as secondary categories all the other categories whose relevance score is above \(\tau_s\).
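In code, the modified decision rule reads as follows (a sketch; `scores` is a hypothetical map from categories to the relevance values r(d, c) computed by the learned model):

```python
def predict_3layers(scores, tau_s):
    """Modified ordinal-regression prediction: exactly one primary category
    (the top-scoring one); secondaries = all other categories above tau_s."""
    primary = max(scores, key=scores.get)
    secondary = [c for c, r in scores.items() if c != primary and r > tau_s]
    return primary, secondary
```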

Fig. 2

GPLM mapping for ordinal-regression supervision

At this point we can discuss the OR-based preference model in more detail. In particular, (multi-variate) ordinal regression assumes that, for each document, the rank of a category is independent of the ranks of the other categories. This assumption would be reasonable when discriminating between relevant categories (primary or secondary) and non-categories, since this is not a "competitive" decision, but it is far less reasonable when one has to choose exactly one category (the most relevant one) among the relevant categories as the primary category of a document, since in this case we actually have a "competitive" decision. In this latter case, the choice of the primary category strongly depends on which categories are relevant. This difference is reminiscent of the difference between single-label classification (which is competitive) and multi-label classification (which is not). In other words, requiring the relevance score of the primary category to be higher than a given threshold seems an unnecessary constraint, which may end up degrading the overall performance.

3.2.2.2 GPLM: a mapping tailored to 3-layered classification

In this section we present a variant of the ordinal regression scheme which seems more suitable for the task of 3-layered classification. In this case, a GPLM model is built as follows. We interpret the primary category as the most relevant among the relevant categories. This constraint is introduced by inserting a set of qualitative preferences between the primary category and all the secondary categories. Moreover, given the multi-label nature of the problem of discerning the secondary categories from the non-categories, a single threshold τ on the relevance scores is added between the secondary categories and the non-categories. The categories reaching a relevance score above the threshold (apart from the one recognized as the primary category) are predicted as secondary categories. See Fig. 3a for a graphical representation of this kind of preference model. Note that whenever S(d) = ∅ the relevance values for the categories in \(C \setminus P(d)\) must all be below the threshold; to cope with this situation, the qualitative preferences are collapsed into a direct quantitative preference between the primary category and the threshold (see Fig. 3b). As a simple example, consider the set of categories \(C=\{c_1,c_2,c_3,c_4,c_5\}\) and a training document d such that \(P(d)=\{c_1\}\), \(S(d)=\{c_2,c_3\}\), and \(N(d)=\{c_4,c_5\}\). The set of preferences we generate is

$$ \Lambda = \{(c_1\,\vartriangleright_d\,c_2),(c_1\,\vartriangleright_d\,c_3),(c_2\,\vartriangleright_d\,\tau),(c_3\,\vartriangleright_d\,\tau),(\tau\,\vartriangleright_d\,c_4),(\tau\,\vartriangleright_d\,c_5)\} $$

Similarly, if d is instead such that \(P(d)=\{c_1\}\), \(S(d)=\emptyset\), and \(N(d)=\{c_2,c_3,c_4,c_5\}\), we generate the set of preferences

$$ \Lambda = \{(c_1\,\vartriangleright_d\,\tau),(\tau\,\vartriangleright_d\,c_2),(\tau\,\vartriangleright_d\,c_3),(\tau\,\vartriangleright_d\,c_4),(\tau \,\vartriangleright_d\, c_5)\} $$
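With the tuple encoding used earlier, the supervision of Fig. 3 can be generated as in the following sketch (ours; the single threshold τ has index 0):

```python
def threelayer_prefs(primary, secondary, categories):
    """Fig. 3 mapping: primary preferred to each secondary, secondaries above
    the threshold, non-categories below it; with no secondaries, the primary
    is compared directly with the threshold (Fig. 3b)."""
    non = [c for c in categories if c != primary and c not in secondary]
    if secondary:
        prefs = [('qual', primary, s) for s in secondary]   # Fig. 3a
        prefs += [('above', s, 0) for s in secondary]
    else:
        prefs = [('above', primary, 0)]                     # Fig. 3b
    return prefs + [('below', c, 0) for c in non]
```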
Fig. 3

GPLM mapping for supervision with (a) a non-empty secondary category set and (b) an empty secondary category set

3.2.3 GPLM by using an Adatron-like algorithm

A problem with the direct optimization of Eq. 8 is the huge number of preferences involved. This implies that the naïve solution (e.g., the one that involves training an SVM directly) is impractical. This is especially due to the fact that the number of examples in the extended training set is O(ne), where n is the number of documents in the original training set and e is the average number of preferences associated with each such document. Note that in our application e is O(|C|), and |C| can be as high as many hundreds. As a consequence, even keeping all the examples in memory may be an issue for a common SVM optimization package.

To overcome this problem we propose using a version of the iterative Adatron algorithm (Friess et al. 1998) tailored to our needs. The original Adatron algorithm is capable of efficiently finding the solution of a hard-margin SVM, defined by

$$ \begin{aligned} &\min_{{\mathbf{w}}} {\frac{1}{2}}\|{\mathbf{w}}\|^2 \\ & \hbox{subject to: } {\mathbf{w}}\cdot\psi(\lambda_i^j) \geq 1,\quad \forall i \in \{1,\ldots,n\},\;\forall j: \lambda_i^j \in \Lambda_i \end{aligned} $$
(9)

by optimizing the dual problem. More formally, let \(\lambda_i^j\) be the j-th preference associated with the i-th document, and let \(\alpha_i^j\) be the associated dual variable; then the optimization problem, as instantiated in our setting, is

$$ \begin{aligned} &\max_{\alpha} \sum_{i,j} \alpha_i^{j} - {\frac{1}{2}} \sum_{i^{\prime},j^{\prime}} \sum_{i^{\prime\prime},j^{\prime\prime}}\alpha_{i^{\prime}}^{j^{\prime}} \alpha_{i^{\prime\prime}}^{j^{\prime\prime}} k_{\epsilon}(\psi(\lambda_{i^{\prime}}^{j^{\prime}}), \psi(\lambda_{i^{\prime\prime}}^{j^{\prime\prime}})) \\ &\hbox{subject to: } \alpha_i^{j} \ge 0,\quad \forall i \in \left\{ 1,\ldots,n \right\}, \forall j: \lambda_i^j \in \Lambda_i \end{aligned} $$
(10)

where \(k_{\epsilon}\) is defined by

$$ k_{\epsilon}({\mathbf{v}}_i,{\mathbf{v}}_j) = {\mathbf{v}}_i\cdot{\mathbf{v}}_j + \epsilon \delta_{ij} $$

with \(\delta_{ij}\) the Kronecker delta, which has value 1 if i = j and 0 otherwise (Cristianini and Shawe-Taylor 2000). It can be shown (Herbrich et al. 2001) that, for all \(\epsilon \ge 0\), \(k_{\epsilon}\) is indeed a kernel, and that the higher the value of \(\epsilon\), the more likely it is that the training set is linearly separable in feature space. Moreover, solving this formulation is equivalent to solving a soft-margin SVM with quadratic hinge loss (Vapnik 1998).

The Adatron algorithm updates a single dual variable \(\alpha_i^j\) at each step. Specifically, the update is made in such a way as to maximize the value of the objective function when all the other variables are kept fixed. At each iteration, the value of the \({\mathbf{w}}\) vector can be computed as

$$ {\mathbf{w}} = \sum_{i,j} \alpha_i^j \psi(\lambda_i^j) $$
(11)

Note that, since all the terms of the sum are linear, we may directly keep \({\mathbf{w}}\) in memory.

It can be shown that, given the j-th preference of the i-th example, i.e., \(\lambda_i^j\), the update to perform on the associated \(\alpha_i^j\) in order to maximize the objective function can be given in closed form as

$$ \Delta \alpha_i^j = {\frac{1- ({\mathbf{w}}\cdot\psi(\lambda_i^j) + \epsilon \alpha_i^j)}{\|\psi(\lambda_i^j)\|^2+\epsilon}} $$

In order to keep the solution admissible for the quadratic problem, when \(\alpha_i^j + \Delta\alpha_i^j < 0\) the new value of \(\alpha_i^j\) is simply set to 0. We can also note that, by construction of the preference representations given in Sect. 3.2.1, \(\|\psi(\lambda_i^j)\|^2 = 2\|{\mathbf{d}}_i\|^2\) whenever \(\lambda_i^j\) is a qualitative preference and \(\|\psi(\lambda_i^j)\|^2 = \|{\mathbf{d}}_i\|^2 + 1\) whenever \(\lambda_i^j\) is a quantitative preference.

The complete Adatron-like algorithm, instantiated to our setting, is given below; a compact rendering in code follows it. Recall that, with the GPLM construction discussed earlier, all the preferences can be seen as positive examples.

1. Let \(Pr = \{\psi(\lambda_i^j)\}_{i,j}\) be the set of training preferences. Set \(\alpha_i^j = 0\) for all i, j, set \({\mathbf{w}} = {\mathbf{0}}\), and set \(\epsilon\) to a positive value.

2. For all \(i \in \{1,\ldots,n\}\) and for all \(\lambda_i^j \in \Lambda_i\):

    • Compute the "degree of satisfaction" of \(\psi(\lambda_i^j)\) as \(z_i^j \leftarrow {\mathbf{w}} \cdot \psi(\lambda_i^j) + \epsilon\alpha_i^j\);

    • Compute the current update for the dual variable as \(\Delta \alpha_i^j \leftarrow {\frac{1-z_i^j}{\|\psi(\lambda_i^j)\|^2 + \epsilon}}\);

    • If \(\alpha_i^j + \Delta\alpha_i^j < 0\), set \(\Delta\alpha_i^j \leftarrow -\alpha_i^j\);

    • Compute the new weight vector \({\mathbf{w}} \leftarrow {\mathbf{w}} + \Delta\alpha_i^j\,\psi(\lambda_i^j)\);

    • Update the dual variable \(\alpha_i^j \leftarrow \alpha_i^j + \Delta\alpha_i^j\).

3. If the stopping criterion (see Sect. 4.1) is not satisfied, go back to Step 2.
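A compact rendering of this loop, reusing the `psi` sketch of Sect. 3.2.1 (ours; the early stopping after few iterations and the committee of runs mirror Sects. 3.2.3 and 4.1):

```python
import numpy as np

def adatron(preferences, m, q, n_iter, eps, seed=0):
    """Adatron-like training. 'preferences' is a flat list of (pref, d) pairs,
    with pref a tagged tuple and d the document vector; returns the weight-
    and-threshold vector w in R^{mD+q}."""
    D = len(preferences[0][1])
    w = np.zeros(m * D + q)
    alpha = np.zeros(len(preferences))
    order = np.random.default_rng(seed).permutation(len(preferences))
    for _ in range(n_iter):                    # early stopping: few iterations
        for i in order:                        # presentation order o_k
            v = psi(*preferences[i], m, q)
            z = w @ v + eps * alpha[i]         # degree of satisfaction
            delta = (1.0 - z) / (v @ v + eps)  # closed-form optimal step
            delta = max(delta, -alpha[i])      # keep alpha[i] >= 0
            w += delta * v
            alpha[i] += delta
    return w

# Committee of K runs with different presentation orders (Sect. 3.2.3):
# w_avg = np.mean([adatron(prefs, m, q, 3, 0.01, seed=k) for k in range(20)], axis=0)
```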

Given that our Adatron-like algorithm is iterative, we can choose not to let the training converge, and instead consider the number of iterations as a further parameter to be optimized by validation on a hold-out sample, similarly to what is done for the \(\epsilon\) parameter. This seems reasonable because (i) the GPLM loss function that our Adatron-like algorithm minimizes is different from \(F_1^3\), which means that minimizing the former is in any case suboptimal for our needs, and (ii) given the enormous number of preferences, this procedure can bring about a huge speedup in training (in fact, as we will see empirically, this "early stopping" strategy leads to stopping the algorithm after very few iterations).

The Adatron-like algorithm as presented above eventually converges to the same solution independently of the order of presentation of the preferences in Step 2. However, when the algorithm is stopped well before complete convergence, the produced models may still be immature. This implies that runs of the algorithm performed with different orders of presentation of the preferences can produce different models and, more importantly, can generate different kinds of errors. In these cases, machine learning theory tells us that an improvement can be obtained by taking these diverse "good" classifiers and combining them into a single one, for example by averaging. Note that, since the classifiers \({\mathbf{w}}_k\) are linear, the average over K classifiers is computed as \(\hat{{\mathbf{w}}} = {\frac{1}{K}}\sum_k {\mathbf{w}}_k\).

4 Experiments

4.1 Experimental setting

We have evaluated our algorithms on the WIPO-alpha dataset, a large (3 GB) collection published by the World Intellectual Property Organization (WIPO) in 2003. The dataset consists of 75,250 patents classified according to version 8 of the International Patent Classification scheme (IPC). Each document d has one primary category (known as the main IPC symbol of d) and a variable (possibly zero) number of secondary categories (the secondary IPC symbols of d). The IPC scheme consists of a four-level hierarchy comprising 8 "sections" (1st level—the root coincides with level 0), 120 "classes" (2nd), 630 "subclasses" (3rd), and about 69,000 "groups" (4th). A typical category might be "D05C 1/00", which is to be read as Section D ("Textiles; Paper"), Class 05 ("Sewing; Embroidering; Tufting"), Subclass C ("Embroidering; Tufting"), and Group 1/00 ("Apparatus, devices, or tools for hand embroidering"). In order to avoid problems due to excessive sparsity, and consistently with previous literature (Fall et al. 2003), we only consider categories at the subclass level; each of the 630 IPC subclasses is thus viewed as containing the union of the documents contained in its subordinate groups.

Among the 614 IPC subclasses that are attached to at least one WIPO-alpha training document, either as a primary or as a secondary category, only 451 appear as the primary category of some training document; the other 163 appear in the training set only in the role of secondary categories. This poses an interesting evaluation problem: should we test the system on the first 451 categories only, or should we also consider the latter 163? Our parameters for predicting secondary categories have sometimes (e.g., in Baseline1) been collectively optimized on all 614 categories, so using only the first 451 would penalize these approaches. However, the classifier for primary categories in Baseline2 is a 1-of-451 classifier. If we evaluated on all 614 categories, it would necessarily misclassify any test document whose true primary category is one of the remaining 163 categories; and we cannot ask a system to predict a concept for which we provide no training data. As a consequence, we consider the first 451 categories the potential primary categories, and all 614 categories the potential secondary categories, consistently with what our training set tells us. This means that, if a system predicts as primary category a category not in the set of 451 potential primary categories (this can be the case for both Baseline1 and our GPLM system), we simply treat it as a predicted secondary category, and at the same time treat as the predicted primary category the top-scoring predicted secondary category which is in the set of 451 potential primary categories.

WIPO-alpha comes partitioned into a training set Tr of 46,324 documents and a test set Te of 28,926 documents. Each category appears as the primary category of at least 20 and at most 2,000 training documents, and of at least 10 and at most 1,000 test documents; each category appears as a secondary category of at least 1 and at most 1,857 training documents, and of at least 1 and at most 1,621 test documents. The percentage of documents which are associated with at least one secondary category is 34% in the training set and 33% in the test set; most documents thus have no secondary categories attached. Categories that are attached as secondary categories to at least ten documents are 62% of the total set of categories for the training set and 39.7% for the test set.

In our experiments we use the entire WIPO-alpha set of 75,250 documents. Each document includes a title, a list of inventors, a list of applicant companies or individuals, an abstract, a claims section, and a long description. Similarly to Fall et al. (2003), we have only used the title, the abstract, and the first 300 words of the long description. Preprocessing consisted of stop word removal, punctuation removal, down-casing, number removal, and Porter stemming. Vectorial representations have been generated for each document by the well-known "ltc" variant of cosine-normalized tfidf weighting.
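For illustration, a rough scikit-learn analogue of this pipeline (an approximation, not our actual setup: "ltc" is rendered as sublinear tf with cosine normalization, and Porter stemming, which scikit-learn does not provide, is omitted):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'texts' would hold, for each patent, the concatenation of its title,
# abstract, and the first 300 words of its long description.
vectorizer = TfidfVectorizer(
    lowercase=True,            # down-casing
    stop_words='english',      # stop word removal
    token_pattern=r'[a-z]+',   # drops punctuation and numbers
    sublinear_tf=True,         # logarithmic tf, as in "ltc"
    norm='l2',                 # cosine normalization
)
# X = vectorizer.fit_transform(texts)
```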

For both baselines we have used "soft-margin" SVMs (in Thorsten Joachims' SVMlight implementation) with a linear kernel; this required us to optimize the c parameter, which determines the tradeoff between the complexity of the generated model and its training error.

In training the baseline classifiers, the ordinal regression classifier, and our GPLM-based classifier, we have performed thorough parameter optimization. For Baseline1, the validation process was performed by selecting a unique value of c for all the 614 binary SVMs; this is due to the fact that the overall prediction of Baseline1 depends upon all the classifiers associated with the categories. It follows that allowing different choices of c for individual SVMs, say among q different values, would have required an exponential number, namely \(q^{614}\), of classifier combinations to be evaluated.

For Baseline2, validation was performed independently on the single-label (\(h_P\)) and on the multi-label (\(h_S\)) classifiers. For producing \(h_P\) we have generated a DDAG with \(\frac{1}{2}(451 \times 450) = 101{,}475\) binary classifiers, each trained, on average, from 1,027 positive training examples per category. Since 101,475 is a very high number, we have chosen to optimize c only once for all the binary classifiers. In order to do this, we have divided the training set into a validation set (Va) and a "true" training set (Tr − Va), obtained by attributing 70% of the positive training examples of each category to Tr − Va and the remaining 30% to Va. The classifiers generated from Tr − Va were organized into a DDAG-based single-label classifier, which was then evaluated on Va in terms of accuracy (the percentage of documents that are correctly classified). As values of c we have tested \(10^i\) with \(i \in \{-4, -3, -2, -1, 0, 1, 2, 3, 4\}\); the best result we obtained on Va (48.62%) was for i = 1.

For the multi-label classifier \(h_S\) of Baseline2, the validation process was performed by selecting a possibly different value of c for each of the 614 binary SVMs. We fixed the threshold at 0, so that all the categories whose associated classifier attributes a score higher than 0 to d are attached to d as secondary categories. For each category \(c_i\), 70% of the training documents were used as "true" training examples and the other 30% as validation examples. The values of c we tested are the same as those used for optimizing \(h_P\). The classifiers generated from Tr − Va were evaluated on Va in terms of \(F_1\).

Validation for the ordinal regression and the GPLM classifiers was instead aimed at optimizing the \(\epsilon\) parameter, along with the number of iterations, in our Adatron-like algorithm. The training set, which generates 28,859,852 preferences when using the approach described in Sect. 3.2.2, was divided into a true training set Tr − Va and a validation set Va as for the \(h_S\) classifier. Model selection was performed by testing all values of (i) the parameter \(\epsilon \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}\) and (ii) the number of iterations \(t \in \{1,\ldots,30\}\), and choosing the best-performing values.

The overall validation procedure for both the individual GPLM models and the averaged model can be summarized as follows:

1. For \(t \in \{1,\ldots,30\}\), for \(\epsilon \in \{10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}\), and for \(k = 1,\ldots,K\), do:

    (a) Fix an order \(o_k\) of presentation of the preferences;

    (b) Run t iterations of the Adatron algorithm with parameter \(\epsilon\) over the set Tr − Va, apply the resulting classifier to Va, and compute the value of \(F_1^3\), denoting it by \(V_k(t,\epsilon)\);

2. For each pair \((t,\epsilon)\) compute the average performance \(\hat{V}(t,\epsilon) = \frac{1}{K} \sum_{k=1}^{K} V_k(t,\epsilon)\);

3. Return \((\hat{t},\hat{\epsilon}) = \arg\max_{(t,\epsilon)} \hat{V}(t,\epsilon)\).

Once the "optimal" parameters \((\hat{t},\hat{\epsilon})\) have been determined as described above, K = 20 runs of \(\hat{t}\) iterations of the Adatron algorithm with parameter \(\hat{\epsilon}\) are executed over the whole training set. As proposed in Sect. 3.2.3, the obtained models are also averaged to form a classifier committee. In our case, we obtained \(\hat{t} = 3\) and \(\hat{\epsilon} = 0.01\).

Finally, note that we do not compare our results to the ones previously obtained by other researchers on WIPO-alpha, since

  • these authors work on tasks different from preferential classification, i.e., single-label classification in Cai and Hofmann (2004), Fall et al. (2003), Hofmann et al. (2003), Rousu et al. (2006), Tikk and Biró (2003), Tsochantaridis et al. (2004), Vishwanathan et al. (2006) and multi-label classification in Cai and Hofmann (2007), Tikk et al. (2004), and

  • only a few of them (Fall et al. 2003; Rousu et al. 2006; Tikk and Biró 2003) test their systems on the full set of 75,250 WIPO-alpha documents. For instance, Altun et al. (2007), Hofmann et al. (2003), Seeger (2007), Shahbaba (2007), Tikk et al. (2004), Tsochantaridis et al. (2004), and Vishwanathan et al. (2006) only use 1,710 documents, while Cai and Hofmann (2004, 2007) only use 9,406.

4.2 Results

The results obtained for the different classifiers are summarized in Table 3.

The first three rows report the performance of the two baseline classifiers and of the Baseline2a variant described in Sect. 3.1. It can be observed that the first two have almost identical \(F_1^3\). Both baselines are good at telling primary categories apart from secondary categories. This is especially true of Baseline2, and can be explained by recalling that the single-label classifier that selects the primary category is trained in such a way that both secondary categories and non-categories are considered not relevant. Since the number of secondary categories is much smaller than the number of non-categories, this classifier is thus more likely to wrongly predict a non-category, rather than a secondary category, as the primary category. As a consequence, there will be only few cases in which a secondary category is deemed a primary category, which improves \(F_1^{PS}\), while it will more frequently be the case that a non-category is predicted as the primary category, which worsens \(F_1^{PN}\). Baseline1 is also not very good at telling secondary categories apart from non-categories (\(F_1^{SN}\)).

The fourth row reports the performance of the ordinal regression classifier, which turns out to achieve the best separation between primary categories and non-categories (\(F_1^{PN}\)) but a low performance in separating primary from secondary categories (\(F_1^{PS}\)). These results seem consistent with the analysis given in Sect. 3.2.2.1, as the separation between primary categories and non-categories is over-constrained by the ordinal regression model. The overall performance (\(F_1^3\)) is roughly equal to that of the baseline classifiers.

The fifth row reports the average performance of the GPLM Adatron over 20 different runs (which differ in the order of presentation of the preferences). The standard deviation is also reported, to show that the performance does not depend severely on the order of presentation of the preferences. With respect to the baselines and the ordinal regression classifier, there is a clear improvement in \(F_1^{SN}\), while \(F_1^{PS}\) decreases. Overall, however, there is a significant improvement in \(F_1^3\). The remaining rows report the performance obtained by committees with different numbers of members; the members are the classifiers obtained from the 20 runs described above, and the committee with i members is composed of the classifiers obtained from the first i runs. It can be noted that \(F_1^3\) increases with the number of members in the committee; however, adding further members to a committee of ten does not yield a significant improvement.

5 Conclusions

We have addressed the problem of how to learn a classifier that distinguishes between the primary and the secondary categories of a document, and argued that this task deserves to be explicitly tackled by TC research.

The first problem in dealing with this novel task is how to evaluate the performance of a classifier. We observed that known evaluation measures, defined either for standard categorization tasks or for ranking tasks, have drawbacks. We have thus proposed an \(F_1\)-based evaluation measure in which different kinds of misclassifications involving either primary or secondary categories have a different impact.

Then, by using state-of-the-art learning technology such as multiclass SVMs (for detecting the unique primary category) and binary SVMs (for detecting the secondary categories), we have established strong baseline results for this task on a patent classification dataset in which the distinction between primary and secondary categories is present. In addition, we have provided a slightly better solution based on an ordinal regression model, implemented by a recently proposed class of learning algorithms explicitly geared to learning from training data expressed in preferential form, i.e., in the form "for document \(d_i\), category c′ is preferred to category c′′".

Finally, we have shown that it is possible to improve on the baselines and on the ordinal regression classifier, by defining a preferential model according to the true nature of the problem, i.e., where the primary category is in competition with the secondary categories. Thanks to this approach we have been able to give proper treatment to primary and secondary categories not only in the testing phase but also in the learning phase. The proposed learner is incremental, improves its performance rapidly with the learning iterations, and generates a single model per class.

In the future we plan to investigate preferential text classification further by tailoring the training loss function used in the GPLM to the specific measure eventually used for evaluating the results of classification (\(F_1^3\), in our case). Note in fact (see Sect. 3.2.1) that we have made no attempt at selecting a training loss function that closely fits the chosen evaluation measure, since in our case the former only measures the number of unfulfilled preferences, without distinguishing which kinds of preferences are unfulfilled. Aiming at a better fit might improve effectiveness since, as reflected in the different values that the \(\alpha_i\) parameters of Eq. 3 may typically be given, different types of misclassification may have a different impact on the computed performance.