Abstract
Feature maximization (F-max) is an unbiased metric for estimating the quality of unsupervised classification (clustering) that favours clusters with a maximal feature F-measure value. In this article we show that an adaptation of this metric within the framework of supervised classification allows efficient feature selection and feature contrasting to be performed. We experiment with the method on different types of textual data. In this context, we demonstrate that this technique significantly improves the performance of classification methods as compared with the use of state-of-the-art feature selection techniques, notably in the case of the classification of unbalanced, highly multidimensional and noisy textual data gathered in similar classes.
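To make the idea concrete, the sketch below implements one common formulation of F-max-style selection: for each feature and class, a feature recall (the feature's weight in the class relative to its weight over all classes) and a feature precision (its weight relative to the total feature weight of the class) are combined into a feature F-measure, and features whose best class-wise F-measure exceeds the average are retained. This is a minimal illustration under those assumptions, not the authors' exact procedure; the function name `fmax_select` and the selection threshold (the mean of the best F-measures) are illustrative choices.

```python
import numpy as np

def fmax_select(X, y):
    """Select features by a class-wise feature F-measure (F-max sketch).

    X: (n_samples, n_features) array of non-negative feature weights.
    y: (n_samples,) array of class labels.
    Returns the indices of features whose best class-wise F-measure
    exceeds the average best F-measure over all features.
    """
    classes = np.unique(y)
    # Per-class summed feature weights: shape (n_classes, n_features).
    W = np.vstack([X[y == c].sum(axis=0) for c in classes])
    # Feature recall: weight of the feature in the class relative to
    # its total weight across all classes (columns sum to 1).
    recall = W / np.clip(W.sum(axis=0, keepdims=True), 1e-12, None)
    # Feature precision: weight of the feature in the class relative to
    # the total weight of all features in that class (rows sum to 1).
    precision = W / np.clip(W.sum(axis=1, keepdims=True), 1e-12, None)
    # Harmonic mean of recall and precision per (class, feature) pair.
    f_measure = 2 * recall * precision / np.clip(recall + precision, 1e-12, None)
    best = f_measure.max(axis=0)  # best class-wise F-measure per feature
    return np.where(best > best.mean())[0]
```

On data where some features are concentrated in one class and others are spread uniformly, the uniformly spread features score a low F-measure in every class and are discarded, which is the intended contrasting effect.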
Acknowledgments
This work was carried out in the context of the QUAERO program (http://www.quaero.org) supported by OSEO (http://www.oseo.fr/), the French agency for research development.
Copyright information
© 2017 Springer International Publishing Switzerland
Cite this chapter
Lamirel, JC., Cuxac, P., Hajlaoui, K. (2017). A Novel Approach to Feature Selection Based on Quality Estimation Metrics. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_7
DOI: https://doi.org/10.1007/978-3-319-45763-5_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45762-8
Online ISBN: 978-3-319-45763-5
eBook Packages: Engineering (R0)