Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification
IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification
Heum PARKHyuk-Chul KWON
Author information
JOURNAL FREE ACCESS

2011 Volume E94.D Issue 4 Pages 855-865

Details
Abstract

This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision tree. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved to be better than the other methods. However, we found that the Gini-Index still shows a feature selection bias in text classification, specifically for unbalanced datasets having a huge number of features. The feature selection bias of the Gini-Index in feature selection is shown in three ways: 1) the Gini values of low-frequency features are low (on purity measure) overall, irrespective of the distribution of features among classes, 2) for high-frequency features, the Gini values are always relatively high and 3) for specific features belonging to large classes, the Gini values are relatively lower than those belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selections. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we could improve the overall classification performances when we used the local DR method. The total averages of the classification performance were increased by 19.4%, 15.9%, 3.3%, 2.8% and 2.9% (kNN) in Micro-F1, 14%, 9.8%, 9.2%, 3.5% and 4.3% (SVM) in Micro-F1, 20%, 16.9%, 2.8%, 3.6% and 3.1% (kNN) in Macro-F1, 16.3%, 14%, 7.1%, 4.4%, 6.3% (SVM) in Macro-F1, compared with tf*idf, χ2, Information Gain, Odds Ratio and the existing Gini-Index methods according to each classifier.

Content from these authors
© 2011 The Institute of Electronics, Information and Communication Engineers
Previous article Next article
feedback
Top