A Novel Approach to Feature Selection Based on Quality Estimation Metrics | SpringerLink

A Novel Approach to Feature Selection Based on Quality Estimation Metrics

  • Chapter
Advances in Knowledge Discovery and Management

Part of the book series: Studies in Computational Intelligence (SCI, volume 665)


Abstract

Feature maximization (F-max) is an unbiased quality estimation metric for unsupervised classification (clustering) that favours clusters with a maximal feature F-measure value. In this article we show that adapting this metric to the framework of supervised classification enables efficient feature selection and feature contrasting. We experiment with the method on different types of textual data. In this context, we demonstrate that the technique significantly improves the performance of classification methods compared with state-of-the-art feature selection techniques, notably for the classification of unbalanced, highly multidimensional and noisy textual data gathered in similar classes.
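The core idea can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes the usual F-max definitions, where a feature's recall for a class is its weight in that class over its weight across all classes, its predominance (precision) is its weight in that class over the total feature weight of the class, and the feature F-measure is their harmonic mean; features whose best class-wise F-measure exceeds the average are kept. Function names and the selection threshold are illustrative assumptions.

```python
import numpy as np

def feature_fmeasure(X, y):
    """Per-class feature F-measure in the spirit of the F-max metric (sketch).

    X: (n_samples, n_features) array of nonnegative feature weights (e.g. TF-IDF)
    y: (n_samples,) array of integer class labels
    Returns an (n_classes, n_features) array of F-measure values.
    """
    classes = np.unique(y)
    # Total weight of each feature within each class.
    W = np.stack([X[y == c].sum(axis=0) for c in classes])
    eps = 1e-12  # guard against division by zero
    # Recall: weight of the feature in the class / its weight over all classes.
    recall = W / (W.sum(axis=0, keepdims=True) + eps)
    # Predominance (precision): weight of the feature in the class /
    # total feature weight of the class.
    precision = W / (W.sum(axis=1, keepdims=True) + eps)
    # Harmonic mean of recall and predominance.
    return 2 * recall * precision / (recall + precision + eps)

def select_features(X, y):
    """Keep features whose best class-wise F-measure exceeds the mean (sketch)."""
    F = feature_fmeasure(X, y)
    best = F.max(axis=0)
    return np.where(best > best.mean())[0]
```

On a toy corpus where two features are each concentrated in one class and a third is spread evenly, the contrastive features score high and the shared one falls below the average and is discarded, which is the intended contrasting behaviour.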





Acknowledgments

This work was carried out in the context of the QUAERO program (http://www.quaero.org) supported by OSEO (http://www.oseo.fr/), Agence française de développement de la recherche.

Author information


Corresponding author

Correspondence to Jean-Charles Lamirel.



Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Lamirel, JC., Cuxac, P., Hajlaoui, K. (2017). A Novel Approach to Feature Selection Based on Quality Estimation Metrics. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_7


  • DOI: https://doi.org/10.1007/978-3-319-45763-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45762-8

  • Online ISBN: 978-3-319-45763-5

  • eBook Packages: Engineering (R0)
