Towards an Automated Classification of Spreadsheets | SpringerLink
Skip to main content

Towards an Automated Classification of Spreadsheets

  • Conference paper
  • First Online:
Software Technologies: Applications and Foundations (STAF 2016)

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 9946))

Abstract

Many spreadsheets in the wild do not have documentation nor categorization associated with them. This makes difficult to apply spreadsheet research that targets specific spreadsheet domains such as financial or database.

We introduce with this paper a methodology to automatically classify spreadsheets into different domains. We exploit existing data mining classification algorithms using spreadsheet-specific features. The algorithms were trained and validated with cross-validation using the EUSES corpus, with an up to 89% accuracy. The best algorithm was applied to the larger Enron corpus in order to get some insight from it and to demonstrate the usefulness of this work.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
JPY 3498
Price includes VAT (Japan)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
JPY 5719
Price includes VAT (Japan)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
JPY 7149
Price includes VAT (Japan)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    Please see the spreadsheet horror stories available at http://www.eusprig.org/horror-stories.htm.

  2. 2.

    This information is not clearly specified by the authors, but range from password protected files to spreadsheets with disruptive macros [4].

  3. 3.

    https://bitbucket.org/SSaaPP/spreadsheet-classification/src/2259e60/paper/data/.

  4. 4.

    The Enron spreadsheet corpus is available through here: www.felienne.com/enron.

  5. 5.

    Only tools that can be used locally were considered to avoid issues related to transfering much data across networks.

References

  1. Abreu, R., Cunha, J., Fernandes, J.P., Martins, P., Perez, A., Saraiva, J.: Smelling faults in spreadsheets. In: 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 111–120, September 2014

    Google Scholar 

  2. Apache Software Foundation: Apache POI. http://poi.apache.org

  3. Aurigemma, S., Panko, R.R.: The detection of human spreadsheet errors by humans versus inspection (auditing) software. In: Proceedings of EuSpRIG Conference (2010)

    Google Scholar 

  4. Fisher, M., Rothermel, G.: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In: Proceedings of the First Workshop on End-User Software Engineering (WEUSE I), pp. 1–5. ACM (2005)

    Google Scholar 

  5. Google: Google. https://www.google.com

  6. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)

    Article  Google Scholar 

  7. Hermans, F., Murphy-Hill, E.: Enron’s spreadsheets and related emails: a dataset and analysis. In: Proceedings of the 37th International Conference on Software Engineering, ICSE 2015, vol. 2. pp. 7–16. IEEE Press, Piscataway (2015)

    Google Scholar 

  8. Jannach, D., Schmitz, T., Hofer, B., Wotawa, F.: Avoiding, finding and fixing spreadsheet errors - a survey of automated approaches for spreadsheet QA. J. Syst. Softw. 94, 129–150 (2014)

    Article  Google Scholar 

  9. John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Mateo (1995)

    Google Scholar 

  10. Klimt, B., Yang, Y.: Introducing the Enron corpus. In: 1st Conference on Email and Anti-Spam (CEAS) (2004)

    Google Scholar 

  11. Kohavi, R.: The power of decision tables. In: Lavrac, N., Wrobel, S. (eds.) Machine Learning, vol. 912, pp. 174–189. Springer, Heidelberg (1995)

    Google Scholar 

  12. Mccallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: Workshop on Learning for Text Categorization, AAAI 1998 (1998)

    Google Scholar 

  13. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993)

    Google Scholar 

  14. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (2005)

    MATH  Google Scholar 

  15. Yusof, Y., Rana, O.F.: Classification of software artifacts based on structural information. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.) KES 2010. LNCS, vol. 6279, pp. 546–555. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15384-6_58

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jorge Mendes .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Mendes, J., Do, K.N., Saraiva, J. (2016). Towards an Automated Classification of Spreadsheets. In: Milazzo, P., Varró, D., Wimmer, M. (eds) Software Technologies: Applications and Foundations. STAF 2016. Lecture Notes in Computer Science(), vol 9946. Springer, Cham. https://doi.org/10.1007/978-3-319-50230-4_26

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-50230-4_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-50229-8

  • Online ISBN: 978-3-319-50230-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics