Abstract
To support program comprehension, software artifacts can be labeled (for example, within software visualization tools) with a set of representative words, hereafter referred to as labels. Such labels can be obtained using various approaches, including Information Retrieval (IR) methods and simple heuristics. They provide a bird's-eye view of the source code, allowing developers to scan software components quickly and make more informed decisions about which parts of the source code they need to analyze in detail. However, few empirical studies have been conducted to verify whether the extracted labels make sense to software developers. This paper investigates (i) to what extent various IR techniques and other simple heuristics overlap with (and differ from) labeling performed by humans; (ii) what kinds of source code terms humans use when labeling software artifacts; and (iii) what factors, in particular what characteristics of the artifacts to be labeled, influence the performance of automatic labeling techniques. We conducted two experiments in which we asked a group of students (38 in total) to label 20 classes from two Java software systems, JHotDraw and eXVantage. Then, we analyzed to what extent the words identified by automated techniques (including Vector Space Models, Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and customized heuristics that extract words from specific source code elements) overlap with those identified by humans. Results indicate that, in most cases, simpler automatic labeling techniques, based on words extracted from class and method names as well as from class comments, better reflect human-based labeling. Clustering-based approaches (LSI and LDA), in contrast, are more worthwhile for source code artifacts with high verbosity, as well as for artifacts that require more effort to label manually.
The obtained results help to define guidelines on how to build effective automatic labeling techniques, and provide some insights on the actual usefulness of automatic labeling techniques during program comprehension tasks.
Notes
The number of unique terms ranges from 26 to 186, while the number of documents (i.e., methods) ranges from 4 to 37.
Note that both LSI and LDA have been used in the same way by other authors to support different software engineering tasks. For instance, both techniques have been applied at class level to compute class cohesion and coupling, with good results (Liu et al. 2009; Marcus and Poshyvanyk 2005; Poshyvanyk and Marcus 2006).
Note that in our case the asymmetric Jaccard overlap coincides with the precision measure (Baeza-Yates and Ribeiro-Neto 1999). Assuming that K(C_i) represents the set of “correct” keywords, the overlap measures the number of identified keywords that are actually correct, i.e., precision.
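The relation between the asymmetric overlap and precision described in the note above can be illustrated with a minimal Python sketch. It assumes that the overlap is normalized by the size of the automatically identified keyword set, which is what makes it coincide with precision; the function names and the example keywords are hypothetical and are not taken from the study.

```python
def asymmetric_overlap(correct_keywords, identified_keywords):
    """Share of identified keywords that are also in the 'correct' set K(C_i).

    Assumption: the denominator is the size of the identified set,
    so the value equals precision.
    """
    correct = set(correct_keywords)
    identified = set(identified_keywords)
    if not identified:
        return 0.0  # no keywords identified, nothing to be precise about
    return len(correct & identified) / len(identified)


def symmetric_jaccard(correct_keywords, identified_keywords):
    """Classic Jaccard index, shown only for contrast with the asymmetric form."""
    correct = set(correct_keywords)
    identified = set(identified_keywords)
    union = correct | identified
    if not union:
        return 0.0
    return len(correct & identified) / len(union)


# Hypothetical labels for one class: 3 of the 5 identified words are "correct".
human_labels = ["draw", "figure", "handle", "tool"]
automatic_labels = ["figure", "handle", "tool", "listener", "event"]

print(asymmetric_overlap(human_labels, automatic_labels))  # 0.6
print(symmetric_jaccard(human_labels, automatic_labels))   # 0.5
```

The asymmetric form does not penalize an automatic technique for missing human keywords (here, "draw"); it only penalizes identified words that humans did not choose.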
References
Antoniol G, Canfora G, Casazza G, De Lucia A, Merlo E (2002) Recovering traceability links between code and documentation. IEEE Trans Softw Eng 28(10):970–983
Asuncion HU, Asuncion A, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 95–104
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley
Baker RD (1995) Modern permutation test software. In: Edgington E (ed) Randomization tests. Marcel Dekker
Baldi P, Lopes CV, Linstead E, Bajracharya SK (2008) A theory of aspects as latent topics. In: Proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications. ACM Press, Nashville, TN, USA, pp 543–562
Binkley D, Feild H, Lawrie D, Pighin M (2007) Software fault prediction using language processing. In: Proceedings of the testing: academic and industrial conference practice and research techniques. IEEE Computer Society, pp 99–110
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Buse RPL, Weimer W (2010) Automatically documenting program changes. In: Proceedings of the 25th IEEE/ACM international conference on automated software engineering. ACM Press, Antwerp, Belgium, pp 33–42
Canfora G, Cerulo L (2005) Impact analysis by mining software and change request repositories. In: Proceedings of 11th IEEE international symposium on software metrics. IEEE CS Press, Como, Italy, pp 20–29
Chang J, Blei DM (2010) Hierarchical relational models for document networks. Ann Appl Stat 4(1):124–150
Cleland-Huang J, Czauderna A, Gibiec M, Emenecker J (2010) A machine learning approach for tracing regulatory codes to product specific requirements. In: Proc. of ICSE, pp 155–164
Cullum JK, Willoughby RA (1998) Lanczos algorithms for large symmetric eigenvalue computations, vol 1, chapter Real rectangular matrices. Birkhäuser, Boston
De Lucia A, Di Penta M, Oliveto R (2011) Improving source code lexicon via traceability and information retrieval. IEEE Trans Softw Eng 37(2):205–227
De Lucia A, Fasano F, Oliveto R, Tortora G (2007) Recovering traceability links in software artefact management systems using information retrieval methods. ACM Trans Softw Eng Methodol 16(4), article no. 13
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Detienne F (2002) Software design: cognitive aspects. Springer Verlag
Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2011) On integrating orthogonal information retrieval methods to improve traceability recovery. In: Proceedings of the 27th international conference on software maintenance. IEEE Press, Williamsburg, USA, pp 133–142
Gethers M, Savage T, Di Penta M, Oliveto R, Poshyvanyk D, De Lucia A (2011) CodeTopics: which topic am I coding now? In: Proceedings of the 33rd international conference on software engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, 21–28 May 2011. ACM, pp 1034–1036
Grissom RJ, Kim JJ (2005) Effect sizes for research: a broad practical approach, 2nd edn. Lawrence Erlbaum Associates
Guerrouj L, Di Penta M, Antoniol G, Guéhéneuc Y-G (2011) TIDIER: an identifier splitting approach using speech recognition techniques. J Softw Evol Process 25(6):575–599
Haiduc S, Aponte J, Marcus A (2010) Supporting program comprehension with source code summarization. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 223–226
Haiduc S, Aponte J, Moreno L, Marcus A (2010) On the use of automated text summarization techniques for summarizing source code. In: Proceedings of the 17th working conference on reverse engineering. IEEE Computer Society, Beverly, MA, USA, pp 35–44
Hayes JH, Dekhtyar A, Sundaram SK (2006) Advancing candidate link generation for requirements tracing: The study of methods. IEEE Trans Softw Eng 32(1):4–19
Hindle A, Bird C, Zimmermann T, Nagappan N (2012) Relating requirements to implementation via topic analysis: Do topics extracted from requirements make sense to managers and developers? In: Proceedings of the 28th international conference on software maintenance. IEEE CS Press, Riva del Garda, Italy
Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 8th international working conference on mining software repositories. IEEE CS Press, Waikiki, Honolulu, USA, pp 163–172
Holm S (1979) A simple sequentially rejective Bonferroni test procedure. Scand J Stat 6:65–70
Ko AJ, Myers BA, Coblenz MJ, Aung HH (2006) An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Trans Softw Eng 32(12):971–987
Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: Identifying topics in source code. Inf Softw Technol 49(3):230–243
LaToza TD, Venolia G, DeLine R (2006) Maintaining mental models: a study of developer work habits. In: Proceedings of the 28th international conference on software engineering. ACM Press, Shanghai, China, pp 492–501
Lavrenko V (2009) A generative theory of relevance, vol 26. Springer
Lawrie D, Feild H, Binkley D (2007) An empirical study of rules for well-formed identifiers. J Softw Maint 19(4):205–229
Liblit B, Begel A, Sweetser E (2006) Cognitive perspectives on the role of naming in computer programs. In: Proceedings of the 18th annual workshop on psychology of programming. University of Sussex, Brighton, UK
Linstead E, Lopes CV, Baldi P (2008) An application of latent Dirichlet allocation to analyzing software evolution. In: Proceedings of the 7th international conference on machine learning and applications. IEEE CS Press, San Diego, California, USA, pp 813–818
Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modeling class cohesion as mixtures of latent topics. In: Proc. of ICSM, pp 233–242
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural information. In: Proceedings of 23rd international conference on software engineering. IEEE CS Press, Toronto, Ontario, Canada, pp 103–112
Marcus A, Maletic JI (2001) Identification of high-level concept clones in source code. In: Proceedings of 16th IEEE international conference on automated software engineering. IEEE CS Press, San Diego, California, USA, pp 107–114
Marcus A, Maletic JI (2003) Recovering documentation-to-source-code traceability links using latent semantic indexing. In: Proceedings of 25th international conference on software engineering. IEEE CS Press, Portland, Oregon, USA, pp 125–135
Marcus A, Poshyvanyk D (2005) The conceptual cohesion of classes. In: Proceedings of 21st IEEE international conference on software maintenance. IEEE CS Press, Budapest, Hungary, pp 133–142
Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300
Medini S, Antoniol G, Guéhéneuc Y-G, Di Penta M, Tonella P (2012) Scan: an approach to label and relate execution trace segments. In: Proceedings of the 19th working conference on reverse engineering. IEEE Press, Kingston, Ontario, Canada
Murphy G (1996) Lightweight structural summarization as an aid to software evolution. PhD thesis, University of Washington
Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M (2008) Fast collapsed Gibbs sampling for latent Dirichlet allocation. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 569–577
Porter MF (1980) An algorithm for suffix stripping. Program 14(3):130–137
Poshyvanyk D, Guéhéneuc Y-G, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Trans Softw Eng 33(6):420–432
Poshyvanyk D, Marcus A (2006) The conceptual coupling metrics for object-oriented systems. In: Proceedings of 22nd IEEE international conference on software maintenance. IEEE CS Press, Philadelphia, PA, USA, pp 469–478
Rajlich V, Wilde N (2002) The role of concepts in program comprehension. In: Proceedings of the 10th international workshop on program comprehension. IEEE Computer Society, Paris, France, pp 271–280
Rastkar S (2010) Summarizing software concerns. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering – student research competition. ACM Press, Cape Town, South Africa, pp 527–528
Rastkar S, Murphy GC, Murray G (2010) Summarizing software artifacts: a case study of bug reports. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM Press, Cape Town, South Africa, pp 505–514
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623–656
Sridhara G, Hill E, Muppaneni D, Pollock LL, Vijay-Shanker K (2010) Towards automatically generating summary comments for Java methods. In: Proceedings of the 25th IEEE/ACM international conference on automated software engineering. ACM Press, Antwerp, Belgium, pp 43–52
Sridhara G, Pollock LL, Vijay-Shanker K (2011) Automatically detecting and describing high level actions within methods. In: Proceedings of the 33rd international conference on software engineering. ACM Press, Honolulu, HI, USA, pp 101–110
Srivastava A, Sahami M (2009) Text mining: classification, clustering, and applications. Chapman & Hall/CRC
Storey M-AD (2006) Theories, tools and research methods in program comprehension: past, present and future. Softw Qual J 14(3):187–208
Takang A, Grubb P, Macredie R (1996) The effects of comments and identifier names on program comprehensibility: an experiential study. J Program Lang 4(3):143–167
Teh YW, Newman D, Welling M (2006) A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In: NIPS, pp 1353–1360
Thomas SW, Adams B, Hassan AE, Blostein D (2010) Validating the use of topic models for software evolution. In: Tenth IEEE international working conference on source code analysis and manipulation, SCAM 2010. IEEE Computer Society, Timisoara, Romania, 12–13 Sept 2010, pp 55–64
Thomas SW, Adams B, Hassan AE, Blostein D (2011) Modeling the evolution of topics in source code histories. In: Proceedings of the 8th international working conference on mining software repositories. IEEE Press, Honolulu, HI, USA, pp 173–182
Acknowledgements
We would like to thank all the students who participated in our study. We would also like to thank the anonymous reviewers for their careful reading of our manuscript and their high-quality feedback. Their detailed comments helped us improve the original version of this paper.
Additional information
Communicated By: Michael Godfrey and Arie van Deursen
This paper is an extension of the work “Using IR Methods for Labeling Source Code Artifacts: Is It Worthwhile?”, which appeared in the Proceedings of the 20th IEEE International Conference on Program Comprehension, Passau, Bavaria, Germany, pp. 193–202, 2012. IEEE Press.
Cite this article
De Lucia, A., Di Penta, M., Oliveto, R. et al. Labeling source code with information retrieval methods: an empirical study. Empir Software Eng 19, 1383–1420 (2014). https://doi.org/10.1007/s10664-013-9285-5
DOI: https://doi.org/10.1007/s10664-013-9285-5