Abstract
Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is fast but yields clone candidates that are often not syntactic units. Current techniques based on abstract syntax trees—on the other hand—find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find syntactic clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of a large case study in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors. The Bellon benchmark consists of clone pairs validated by humans for eight software systems written in C or Java from different application domains. The new contributions of this paper over the conference paper are the additional analysis of Java programs, the exploration of an alternative path that uses parse trees instead of abstract syntax trees, and the investigation of the impact on recall and precision when clone analyses insist on consistent parameter renaming.
Similar content being viewed by others
Notes
A trie is an ordered tree data structure that is used to store an associative array where the keys are strings.
An alternative to suffix trees are suffix arrays, which offer the advantage of less space consumption (Manber and Myers 1991) but at the cost of more runtime.
Please be reminded that (a, b, l) denotes a clone pair starting at token index a and b, respectively, with length l.
Personal communication; March 2007.
Personal communication at Dagstuhl, July 2006.
References
Antoniol G, Casazza G, Penta MD, Merlo E (2001) Modeling clones evolution through time series. In: International conference on software maintenance. IEEE CS Press, pp 273–280
Antoniol G, Villano U, Merlo E, Penta M (2002) Analyzing cloning evolution in the linux kernel. Inf Softw Technol 44(13):755–765
Bailey J, Burd E (2002) Evaluating clone detection tools for use during preventative maintenance. In: SCAM
Baker BS (1992) A program for identifying duplicated code. In: Computer science and statistics 24: Proceedings of the 24th symposium on the interface
Baker BS (1995) On finding duplication and near-duplication in large software systems. In: Working conference on reverse engineering. IEEE CS Press
Baker BS (1996) Parameterized pattern matching: algorithms and applications. JCSS
Baker BS (2007) Finding clones with Dup: analysis of an experiment. IEEE Trans Softw Eng 33(9):608–621 (September)
Baker BS, Giancarlo R (2002) Sparse dynamic programming for longest common subsequence from fragments. J Algorithms 42(2):231–254 (February)
Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (1999) Measuring clone based reengineering opportunities. In: IEEE symposium on software metrics. IEEE CS Press, pp 292–303 (November)
Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (2000) Advanced clone-analysis to support object-oriented system refactoring. In: Working conference on reverse engineering, IEEE CS Press, pp 98–107 (October)
Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM
Bellon S (2002) Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart, Germany
Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33(9):577–591 (September)
Bentley JL, Ottmann TA (1979) Algorithms for reporting and counting geometric intersections. IEEE Trans Comput C-28:643–647
Bruntink M, van Deursen A, Tourwe T, van Engelen R (2004) An evaluation of clone detection techniques for crosscutting concerns. In: International conference on software maintenance, pp 200–209
Bruntink M, van Engelen R, Tourwe T (2005) On the use of clone detection for identifying crosscutting concern code. IEEE Trans Softw Eng 31(10):804–818
Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: Symposium on operating systems principles, pp 73–88
Cordy JR (2003) Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: International workshop on program comprehension, IEEE CS Press
Cordy JR, Dean TR, Synytskyy N (2004) Practical language-independent detection of near-miss clones. In: CASCON, IBM Press
Di Lucca G, Di Penta M, Fasolino, A (2002) An approach to identify duplicated web pages. In: COMPSAC
Ducasse S, Rieger M, Demeyer S (1999) A language independent approach for detecting duplicated code. In: ICSM
Fowler M (1999) Refactoring: improving the design of existing code. Addison Wesley, Boston, MA, USA
Gitchell D, Tran N (1999) Sim: a utility for detecting similarity in computer programs. In: SIGCSE, ACM Press
Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: Workshop on principles of software evolution (September)
Higo Y, Ueda Y, Kamiya T, Kusumoto S, Inoue K (2002) On software maintenance process improvement based on code clone analysis. In: International conference on product focused software process improvement. Lecture notes in computer science, vol 2559. Springer
Johnson JH (1993) Identifying redundancy in source code using fingerprints. In: CASCON, IBM Press
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670
Kapser C, Godfrey M (2003a) A taxonomy of clones in source code: the reengineers most wanted list. In: Working conference on reverse engineering. IEEE CS Press
Kapser C, Godfrey MW (2003b) Toward a taxonomy of clones in source code: a case study. In: Evolution of large scale industrial software architectures
Kapser C, Godfrey M (2005) Improved tool support for the investigation of duplication in software. Proceedings of the 21st IEEE international conference on software maintenance
Kapser C, Godfrey MW (2006) “Clones considered harmful” considered harmful. In: Working conference on reverse engineering
Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: International symposium on empirical software engineering. IEEE CS Press, pp 83–92
Kim M, Sazawal V, Notkin D, Murphy GC (2005) An empirical study of code clone genealogies. In: European software engineering conference and foundations of software engineering (ESEC/FSE)
Komondoor R, Horwitz S (2001) Using slicing to identify duplication in source code. In: Proc. int. symposium on static analysis
Kontogiannis K, Mori RD, Merlo E, Galler M, Bernstein M (1996) Pattern matching for clone and concept detection. Autom Softw Eng 3(1/2):79–108
Koschke R, Girard JF, Würthner M (1998) Intermediate representations for reverse engineering. In: Working conference on reverse engineering. IEEE CS Press, pp 241–250
Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: Working conference on reverse engineering. IEEE CS Press, pp 253–262
Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE
Lague B, Proulx D, Mayrand J, Merlo E, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: International conference on software maintenance, pp 314–321
Laguë B, Proulx D, Mayrand J, Merlo EM, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: ICSM
Lanubile F, Mallardo T (2003) Finding function clones in web applications. In: Conference on software maintenance and reengineering
Leitao AM (2003) Detection of redundant code using R2D2. In: Workshop source code analysis and manipulation. IEEE CS Press
Li Z, Lu S, Myagmar S, Zhou Y (2004) Cp-miner: a tool for finding copy-paste and related bugs in operating system code. In: Operating system design and implementation, pp 289–302
Li Z, Lu S, Myagmar S, Zhou Y (2006) Copy-paste and related bugs in large-scale software code. IEEE Trans Softw Eng 32(3):176–192 (March)
Manber U, Myers G (1991) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948 (October)
Marcus A, Maletic J (2001) Identification of high-level concept clones in source code. In: Conference on automated software engineering
Mayrand J, Leblanc C, Merlo EM (1996) Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM. IEEE Computer Society Press
McCreight E (1976) A space-economical suffix tree construction algorithm. J ACM 32(2):262–272
Monden A, Nakae D, Kamiya T, Sato S, Matsumoto K (2002) Software quality analysis by code clones in industrial legacy software. In: IEEE symposium on software metrics, pp 87–94
Prechelt L, Malpohl G, Philippsen M (2000) Jplag: finding plagiarisms among a set of programs. Technical report, University of Karlsruhe, Department of Informatics
Rieger M (2005) Effective clone detection without language barriers. Dissertation, University of Bern, Switzerland
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the SIGMOD, pp 76–85
Shannon CE (1984) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656
Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
Van Rysselberghe F, Demeyer S (2004) Evaluating clone detection techniques from a refactoring perspective. In: Conference on automated software engineering
Walter V, Seipel D, von Gudenberg JW, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: Workshop source code analysis and manipulation
Yang W (1991) Identifying syntactic differences between two programs. Software Pract Ex 21(7):739–755
Acknowledgements
We would like to thank Stefan Bellon for providing us with his benchmark and ccdiml, his support in the evaluation, and for comments on this paper. We also like to thank Felix Beckwermert for his support in the evaluation and Thilo Mende for comments on this paper. We would also like to thank the anonymous reviewers for their valuable comments.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Massimiliano Di Penta and Susan Sim
Rights and permissions
About this article
Cite this article
Falke, R., Frenzel, P. & Koschke, R. Empirical evaluation of clone detection using syntax suffix trees. Empir Software Eng 13, 601–643 (2008). https://doi.org/10.1007/s10664-008-9073-9
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10664-008-9073-9