Abstract
Much research in software engineering (SE) is focused on modeling data collected from software repositories. Insights gained over the last decade suggest that such datasets contain a high amount of variability, and recent research indicates that this variability has a detrimental effect on model quality. In this paper, we propose to work around this high variability by splitting the data into smaller, more homogeneous subsets and learning a set of individual statistical models, one for each subset. Our case study on a variety of SE datasets demonstrates that such local models can significantly outperform traditional global models with respect to both model fit and predictive performance. However, analysts need to be aware of potential pitfalls when building local models: first, the choice of clustering algorithm and its parameters can have a substantial impact on model quality. Second, the data being modeled needs to contain enough variability to take full advantage of local modeling; for example, our case study on social data shows no advantage of local over global modeling, as clustering fails to derive appropriate subsets. Lastly, the interpretation of local models can become very complex when there is a large number of variables or data subsets. Overall, we find that a hybrid approach between local and traditional global modeling, such as Multivariate Adaptive Regression Splines (MARS), combines the best of both worlds. MARS models are non-parametric and thus do not require prior calibration of parameters, are easily interpretable by analysts, and outperform both local and traditional global models out of the box in four out of five datasets in our case study.
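To make the approach concrete, the following is a minimal sketch of local versus global modeling in R. It assumes the mclust package (model-based clustering, Fraley and Raftery 2009) and the earth package (an implementation of MARS, Friedman 1991); the data frame df and its variables x1, x2, and y are synthetic placeholders for illustration only, not one of the datasets studied in the paper.

# Minimal sketch: global model vs. local models vs. MARS hybrid.
# Assumes the 'mclust' and 'earth' packages; 'df' is synthetic data.
library(mclust)   # model-based clustering (Fraley and Raftery)
library(earth)    # MARS (Friedman 1991)

set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 3 * df$x1 + ifelse(df$x2 > 0, 5 * df$x2, -2 * df$x2) + rnorm(200)

# Global model: one linear regression fitted to the entire dataset.
global_fit <- lm(y ~ x1 + x2, data = df)

# Local models: cluster the predictor space into more homogeneous
# subsets, then fit one regression model per subset.
clusters   <- Mclust(df[, c("x1", "x2")])$classification
local_fits <- lapply(split(df, clusters),
                     function(d) lm(y ~ x1 + x2, data = d))

# Hybrid: MARS learns piecewise (local) terms within a single global model.
mars_fit <- earth(y ~ x1 + x2, data = df)

# Compare goodness of fit on the training data.
summary(global_fit)$r.squared
sapply(local_fits, function(m) summary(m)$r.squared)
mars_fit$rsq

In this sketch the local and MARS fits capture the kink in the relationship between x2 and y that a single global linear model smooths over, which mirrors the intuition behind local modeling described in the abstract.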
References
Ackerman M, Ben-David S (2009) Clusterability: a theoretical study. In: Proceedings of the twelfth international conference on artificial intelligence and statistics (AISTATS '09), JMLR workshop and conference proceedings, vol 5, pp 1–8
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Andreou A, Papatheocharous E (2008) Software cost estimation using fuzzy decision trees. In: Proceedings of the 23rd IEEE/ACM international conference on automated software engineering (ASE 2008), pp 371–374
Attoh-Okine N, Mensah S, Nawaiseh M, Hall D (2001) Using multivariate adaptive regression splines (MARS) in pavement roughness prediction. Strategy
Barkmann H, Lincke R, Lowe W (2009) Quantitative evaluation of software quality metrics in open-source projects. In: Proceedings of the 2009 international conference on advanced information networking and applications workshops, WAINA ’09. IEEE Computer Society, Washington, DC, pp 1067–1072
Bettenburg N, Hassan AE (2010) Studying the impact of social structures on software quality. In: Proceedings of the 2010 IEEE 18th international conference on program comprehension, ICPC ’10. IEEE Computer Society, Washington, DC, pp 124–133
Bettenburg N, Nagappan M, Hassan AE (2012) Think locally, act globally: improving defect and effort prediction models. In: Proceedings of the 9th IEEE working conference on mining software repositories (MSR '12), pp 60–69
Di Penta M (2011) Nothing else matters: what predictive model should I use? In: Proceedings of the 7th international conference on predictive models in software engineering, PROMISE '11. ACM, New York, NY, pp 10:1–10:3
Elish KO, Elish MO (2008) Predicting defect-prone software modules using support vector machines. J Syst Softw 81:649–660
Fox J (2008) Applied regression analysis and generalized linear models, 2nd edn. Sage, Los Angeles, London
Fraley C (2007) Bayesian regularization for normal mixture estimation and model-based clustering. J Classif 24(2):155–181
Fraley C, Raftery AE (2009) Mclust version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics, Seattle, 2006 (subsequent revisions)
Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–67
Harrell FE (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer Series in Statistics. Springer, New York, 571 pp
Hartigan JA, Wong MA (1979) A k-means clustering algorithm. JSTOR: Appl Stat 28(1):100–108
Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: Proceedings of the 2010 IEEE international conference on software maintenance, ICSM ’10. IEEE Computer Society, pp 1–10
Li M, Zhang H, Wu R, Zhou ZH (2012) Sample-based software defect prediction with active and semi-supervised learning. Autom Softw Eng 19(2):201–230. doi:10.1007/s10515-011-0092-1
McQuitty L (1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educ Psychol Meas 26(4):825–831
Menzies T, Butcher A, Cok D, Layman L, Marcus A, Shull F, Turhan B, Zimmermann T (2013) Local vs. global lessons from defect prediction and effort estimation. IEEE Trans Softw Eng (to appear)
Menzies T, Butcher A, Marcus A, Zimmermann T, Cok D (2011) Local vs global models for effort estimation and defect prediction. In: Proceedings of the 26th IEEE/ACM international conference on automated software engineering
Menzies T, Caglayan B, Kocaguneli E, Krall J, Peters F, Turhan B (2012) The promise repository of empirical software engineering data. http://promisedata.googlecode.com
Menzies T, Greenwald J, Frank A (2007) Data mining static code attributes to learn defect predictors. IEEE Trans Softw Eng 33(1):2–13
Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5:169–180
Mockus A, Weiss DM, Zhang P (2003) Understanding and predicting effort in software projects. In: Proceedings of the 25th international conference on software engineering, ICSE ’03. IEEE Computer Society, Washington, DC, pp 274–284
Mockus A, Zhang P, Li PL (2005) Predictors of customer perceived software quality. In: Proceedings of the 27th international conference on Software engineering, ICSE ’05. ACM, New York, NY, pp 225–233
Nagappan N, Ball T (2005) Use of relative code churn measures to predict system defect density. In: Proceedings of the 27th international conference on software engineering, ICSE ’05. ACM, pp 284–292
Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering, ICSE ’06. ACM, New York, NY, pp 452–461
Nguyen THD, Adams B, Hassan AE (2010) Studying the impact of dependency network measures on software quality. In: Proceedings of the 2010 IEEE international conference on software maintenance. IEEE Computer Society, pp 1–10
Osei-Bryson KM, Ko M (2004) Exploring the relationship between information technology investments and firm performance using regression splines analysis. Inf Manag 42(1):1–13
Posnett D, Filkov V, Devanbu P (2011) Ecological inference in empirical software engineering. In: Proceedings of the 26th IEEE/ACM international conference on automated software engineering (ASE '11), pp 362–371
Raftery AE (2007) Bayesian model selection in social research. Sociol Methodol 25:111–163
Rahman F, Devanbu P (2013) How, and why, process metrics are better. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Computer Society, Washington, DC, pp 432–441
Rice JA (2001) Mathematical statistics and data analysis, 2nd edn. Wadsworth Statistics/Probability Series. Duxbury Press, Pacific Grove, 602 pp
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Shepperd M (1988) A critique of cyclomatic complexity as a software metric. Softw Eng J 3(2):30–36
Shihab E, Bird C, Zimmermann T (2012) The effect of branching strategies on software quality. In: ESEM '12, pp 301–310
Witten IH, Frank E (2005) Data Mining: practical machine learning tools and techniques, 2nd edn (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco
York TP, Eaves LJ (2001) Common disease analysis using multivariate adaptive regression splines (MARS): Genetic Analysis Workshop 12 simulated sequence data. Genet Epidemiol 21 Suppl 1:S649–S654
Zimmermann T, Nagappan N, Gall H, Giger E, Murphy B (2009) Cross-project defect prediction: a large scale experiment on data vs. domain vs. process. In: Proceedings of the the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, ESEC/FSE ’09. ACM, New York, NY, pp 91–100
Zimmermann T, Premraj R, Zeller A (2007) Predicting defects for eclipse. In: Proceedings of the third international workshop on predictor models in software engineering, PROMISE ’07. IEEE Computer Society, Washington, DC, p 9
Additional information
Communicated by: Massimiliano Di Penta and Tao Xie
Appendices
Appendix
Further descriptions of the individual metrics used in these datasets can be found in the work by Menzies et al. (2011) and our previous work on social metrics (Bettenburg and Hassan 2010).
1.1 Metrics in the Lucene 2.4 Dataset
- Dependent Variables: bug
- Independent Variables: amc, avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, max_cc, mfa, moa, noc, npm
1.2 Metrics in the Xalan 2.6 Dataset
- Dependent Variables: bug
- Independent Variables: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm
1.3 Metrics in the CHINA Dataset
- Dependent Variables: Effort
- Independent Variables: Input, Output, Enquiry, File, Interface, Changed, PDR_UFP, Resource, Duration
1.4 Metrics in the NASACOC Dataset
- Dependent Variables: months
- Independent Variables: pmat, rely, data, cplx, time, stor, pvol, pcap, apex, plex, ltex, site, sced, kloc, effort
1.5 Metrics in the Eclipse 3.0 (Code) Dataset
- Dependent Variables: post
- Independent Variables: pre, ImportDeclaration, VG_sum, PrefixExpression, NOM_avg, NOF_avg, TLOC, NullLiteral, NOM_max, BooleanLiteral, SwitchCase, SimpleName, SuperMethodInvocation, LabeledStatement, Block, NORM_Assignment, Initializer, NSM_avg, InfixExpression, Assignment, NumberLiteral, VariableDeclarationFragment, Javadoc, NORM_FieldDeclaration, ForStatement, Modifier, NORM_ArrayCreation, MethodInvocation, VariableDeclarationExpression, ArrayCreation, ExpressionStatement, NORM_PostfixExpression, InstanceofExpression, SwitchStatement, ArrayType, SynchronizedStatement
1.6 Metrics in the Eclipse 3.0 (Social) Dataset
- Dependent Variables: post
- Independent Variables: NSCOM, PATCHS, NSOURCE, NPATCH, NTRACE, TRACES, NLINK, NPART, NDEVS, NUSERS, SNACENT, NMSG, REPLY, REPLYE, DLEN, DLENE, INT, INTE, WA, WAE