Abstract
Defect prediction helps software practitioners anticipate where bugs are likely to occur in the code. To improve prediction accuracy, dozens of supervised and unsupervised methods have been proposed, and many achieve good results. One limiting factor is that defect datasets are small, which restricts the scope of application of defect prediction models. In this study, we construct larger defect datasets by merging available datasets that share the same measurement dimensions, and we examine whether the larger data yields better defect prediction performance for supervised and unsupervised models. Our experimental results show that a larger-scale dataset does not improve the performance of either supervised or unsupervised classifiers.
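The merging step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the row layout (a dict of metric values plus a `defective` label), the function name `merge_by_schema`, and the project and metric names are all hypothetical. The sketch assumes that "same measurement dimension" means that two datasets record exactly the same set of columns.

```python
from collections import defaultdict


def merge_by_schema(datasets):
    """Concatenate datasets that share exactly the same set of columns.

    `datasets` maps a project name to a list of row dicts (hypothetical
    layout). The result maps each column set, as a sorted tuple, to the
    merged rows, each tagged with the project it came from.
    """
    merged = defaultdict(list)
    for name, rows in datasets.items():
        if not rows:
            continue  # nothing to merge from an empty dataset
        schema = tuple(sorted(rows[0].keys()))
        for row in rows:
            if tuple(sorted(row.keys())) != schema:
                raise ValueError(f"inconsistent columns in {name}")
            # Keep the origin so per-project results can still be compared.
            merged[schema].append({**row, "source": name})
    return dict(merged)


# Hypothetical toy data: proj_a and proj_b share metrics, proj_c does not.
datasets = {
    "proj_a": [
        {"loc": 120, "cc": 7, "defective": True},
        {"loc": 30, "cc": 2, "defective": False},
    ],
    "proj_b": [{"loc": 200, "cc": 11, "defective": True}],
    "proj_c": [{"churn": 14, "authors": 3, "defective": False}],
}
merged = merge_by_schema(datasets)
```

Here `proj_a` and `proj_b` end up in one merged dataset of three rows, while `proj_c` stays separate because its metrics differ. A classifier could then be trained once on the merged rows and once on each source alone to compare the two settings.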
Acknowledgement
This work is supported by the National Key R&D Program of China (2018YFB1003901) and the National Natural Science Foundation of China (Grant No. 61872177).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, X., Li, Y. (2019). Is Bigger Data Better for Defect Prediction: Examining the Impact of Data Size on Supervised and Unsupervised Defect Prediction. In: Ni, W., Wang, X., Song, W., Li, Y. (eds) Web Information Systems and Applications. WISA 2019. Lecture Notes in Computer Science, vol 11817. Springer, Cham. https://doi.org/10.1007/978-3-030-30952-7_16
DOI: https://doi.org/10.1007/978-3-030-30952-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30951-0
Online ISBN: 978-3-030-30952-7
eBook Packages: Computer Science, Computer Science (R0)