2011 Volume E94.D Issue 4 Pages 855-865
This paper presents an improved Gini-Index algorithm to correct feature-selection bias in text classification. Gini-Index has been used as a split measure for choosing the most appropriate splitting attribute in decision tree. Recently, an improved Gini-Index algorithm for feature selection, designed for text categorization and based on Gini-Index theory, was introduced, and it has proved to be better than the other methods. However, we found that the Gini-Index still shows a feature selection bias in text classification, specifically for unbalanced datasets having a huge number of features. The feature selection bias of the Gini-Index in feature selection is shown in three ways: 1) the Gini values of low-frequency features are low (on purity measure) overall, irrespective of the distribution of features among classes, 2) for high-frequency features, the Gini values are always relatively high and 3) for specific features belonging to large classes, the Gini values are relatively lower than those belonging to small classes. Therefore, to correct that bias and improve feature selection in text classification using Gini-Index, we propose an improved Gini-Index (I-GI) algorithm with three reformulated Gini-Index expressions. In the present study, we used global dimensionality reduction (DR) and local DR to measure the goodness of features in feature selections. In experimental results for the I-GI algorithm, we obtained unbiased feature values and eliminated many irrelevant general features while retaining many specific features. Furthermore, we could improve the overall classification performances when we used the local DR method. The total averages of the classification performance were increased by 19.4%, 15.9%, 3.3%, 2.8% and 2.9% (kNN) in Micro-F1, 14%, 9.8%, 9.2%, 3.5% and 4.3% (SVM) in Micro-F1, 20%, 16.9%, 2.8%, 3.6% and 3.1% (kNN) in Macro-F1, 16.3%, 14%, 7.1%, 4.4%, 6.3% (SVM) in Macro-F1, compared with tf*idf, χ2, Information Gain, Odds Ratio and the existing Gini-Index methods according to each classifier.