Abstract
Many spreadsheets in the wild have neither documentation nor categorization associated with them. This makes it difficult to apply spreadsheet research that targets specific spreadsheet domains, such as financial or database spreadsheets.
In this paper, we introduce a methodology to automatically classify spreadsheets into different domains. We exploit existing data mining classification algorithms using spreadsheet-specific features. The algorithms were trained and validated with cross-validation on the EUSES corpus, reaching an accuracy of up to 89%. The best-performing algorithm was then applied to the larger Enron corpus to gain insight into it and to demonstrate the usefulness of this work.
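The train-and-validate loop described above can be illustrated with a minimal sketch. It is not the authors' setup: it substitutes a hand-written Gaussian naive Bayes classifier for the existing data mining algorithms, and the three features (formula ratio, numeric-cell ratio, sheet count), the two domain labels, and the synthetic data are all assumptions made purely for illustration.

```python
import math
import random
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-6))  # small floor avoids a zero variance
        model[label] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Return the class with the highest log-posterior for one feature vector."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def cross_validate(rows, labels, k=10, seed=0):
    """Shuffle once, split into k folds, and return pooled held-out accuracy."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit_gaussian_nb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)

def synthetic_corpus(n_per_class=60, seed=1):
    """Hypothetical stand-in for a labelled corpus: (formula ratio, numeric ratio, sheets)."""
    rng = random.Random(seed)
    centers = {"financial": (0.40, 0.80, 3.0), "database": (0.05, 0.30, 1.0)}
    rows, labels = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            rows.append([rng.gauss(c, 0.05) for c in center])
            labels.append(label)
    return rows, labels

rows, labels = synthetic_corpus()
accuracy = cross_validate(rows, labels, k=10)
print(f"10-fold accuracy on synthetic data: {accuracy:.2f}")
```

Because the synthetic classes are deliberately well separated, the accuracy printed here says nothing about real corpora; the sketch only shows how each spreadsheet is scored by a model that never saw it during training.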
Notes
- 1. Please see the spreadsheet horror stories available at http://www.eusprig.org/horror-stories.htm.
- 2. This information is not clearly specified by the authors, but the issues range from password-protected files to spreadsheets with disruptive macros [4].
- 3.
- 4. The Enron spreadsheet corpus is available at www.felienne.com/enron.
- 5. Only tools that can be used locally were considered, to avoid issues related to transferring large amounts of data across networks.
References
Abreu, R., Cunha, J., Fernandes, J.P., Martins, P., Perez, A., Saraiva, J.: Smelling faults in spreadsheets. In: 2014 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 111–120, September 2014
Apache Software Foundation: Apache POI. http://poi.apache.org
Aurigemma, S., Panko, R.R.: The detection of human spreadsheet errors by humans versus inspection (auditing) software. In: Proceedings of EuSpRIG Conference (2010)
Fisher, M., Rothermel, G.: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. In: Proceedings of the First Workshop on End-User Software Engineering (WEUSE I), pp. 1–5. ACM (2005)
Google: Google. https://www.google.com
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Hermans, F., Murphy-Hill, E.: Enron’s spreadsheets and related emails: a dataset and analysis. In: Proceedings of the 37th International Conference on Software Engineering, ICSE 2015, vol. 2, pp. 7–16. IEEE Press, Piscataway (2015)
Jannach, D., Schmitz, T., Hofer, B., Wotawa, F.: Avoiding, finding and fixing spreadsheet errors - a survey of automated approaches for spreadsheet QA. J. Syst. Softw. 94, 129–150 (2014)
John, G.H., Langley, P.: Estimating continuous distributions in Bayesian classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Mateo (1995)
Klimt, B., Yang, Y.: Introducing the Enron corpus. In: 1st Conference on Email and Anti-Spam (CEAS) (2004)
Kohavi, R.: The power of decision tables. In: Lavrac, N., Wrobel, S. (eds.) Machine Learning, vol. 912, pp. 174–189. Springer, Heidelberg (1995)
McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: Workshop on Learning for Text Categorization, AAAI 1998 (1998)
Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems, 2nd edn. Morgan Kaufmann Publishers Inc., San Francisco (2005)
Yusof, Y., Rana, O.F.: Classification of software artifacts based on structural information. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds.) KES 2010. LNCS, vol. 6279, pp. 546–555. Springer, Heidelberg (2010). doi:10.1007/978-3-642-15384-6_58
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Mendes, J., Do, K.N., Saraiva, J. (2016). Towards an Automated Classification of Spreadsheets. In: Milazzo, P., Varró, D., Wimmer, M. (eds) Software Technologies: Applications and Foundations. STAF 2016. Lecture Notes in Computer Science, vol 9946. Springer, Cham. https://doi.org/10.1007/978-3-319-50230-4_26
DOI: https://doi.org/10.1007/978-3-319-50230-4_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50229-8
Online ISBN: 978-3-319-50230-4
eBook Packages: Computer Science, Computer Science (R0)