Empirical evaluation of clone detection using syntax suffix trees | Empirical Software Engineering
Skip to main content

Empirical evaluation of clone detection using syntax suffix trees

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

Reusing software through copying and pasting is a continuous plague in software development despite the fact that it creates serious maintenance problems. Various techniques have been proposed to find duplicated redundant code (also known as software clones). A recent study has compared these techniques and shown that token-based clone detection based on suffix trees is fast but yields clone candidates that are often not syntactic units. Current techniques based on abstract syntax trees—on the other hand—find syntactic clones but are considerably less efficient. This paper describes how we can make use of suffix trees to find syntactic clones in abstract syntax trees. This new approach is able to find syntactic clones in linear time and space. The paper reports the results of a large case study in which we empirically compare the new technique to other techniques using the Bellon benchmark for clone detectors. The Bellon benchmark consists of clone pairs validated by humans for eight software systems written in C or Java from different application domains. The new contributions of this paper over the conference paper are the additional analysis of Java programs, the exploration of an alternative path that uses parse trees instead of abstract syntax trees, and the investigation of the impact on recall and precision when clone analyses insist on consistent parameter renaming.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26

Similar content being viewed by others

Notes

  1. http://www.axivion.com.

  2. A trie is an ordered tree data structure that is used to store an associative array where the keys are strings.

  3. An alternative to suffix trees are suffix arrays, which offer the advantage of less space consumption (Manber and Myers 1991) but at the cost of more runtime.

  4. Please be reminded that (a, b, l) denotes a clone pair starting at token index a and b, respectively, with length l.

  5. http://www.edg.com.

  6. Personal communication; March 2007.

  7. Personal communication at Dagstuhl, July 2006.

References

  • Antoniol G, Casazza G, Penta MD, Merlo E (2001) Modeling clones evolution through time series. In: International conference on software maintenance. IEEE CS Press, pp 273–280

  • Antoniol G, Villano U, Merlo E, Penta M (2002) Analyzing cloning evolution in the linux kernel. Inf Softw Technol 44(13):755–765

    Article  Google Scholar 

  • Bailey J, Burd E (2002) Evaluating clone detection tools for use during preventative maintenance. In: SCAM

  • Baker BS (1992) A program for identifying duplicated code. In: Computer science and statistics 24: Proceedings of the 24th symposium on the interface

  • Baker BS (1995) On finding duplication and near-duplication in large software systems. In: Working conference on reverse engineering. IEEE CS Press

  • Baker BS (1996) Parameterized pattern matching: algorithms and applications. JCSS

  • Baker BS (2007) Finding clones with Dup: analysis of an experiment. IEEE Trans Softw Eng 33(9):608–621 (September)

    Article  Google Scholar 

  • Baker BS, Giancarlo R (2002) Sparse dynamic programming for longest common subsequence from fragments. J Algorithms 42(2):231–254 (February)

    Article  MATH  MathSciNet  Google Scholar 

  • Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (1999) Measuring clone based reengineering opportunities. In: IEEE symposium on software metrics. IEEE CS Press, pp 292–303 (November)

  • Balazinska M, Merlo E, Dagenais M, Lague B, Kontogiannis K (2000) Advanced clone-analysis to support object-oriented system refactoring. In: Working conference on reverse engineering, IEEE CS Press, pp 98–107 (October)

  • Baxter ID, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM

  • Bellon S (2002) Vergleich von Techniken zur Erkennung duplizierten Quellcodes. Master’s thesis, University of Stuttgart, Germany

  • Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Trans Softw Eng 33(9):577–591 (September)

    Article  Google Scholar 

  • Bentley JL, Ottmann TA (1979) Algorithms for reporting and counting geometric intersections. IEEE Trans Comput C-28:643–647

    Article  Google Scholar 

  • Bruntink M, van Deursen A, Tourwe T, van Engelen R (2004) An evaluation of clone detection techniques for crosscutting concerns. In: International conference on software maintenance, pp 200–209

  • Bruntink M, van Engelen R, Tourwe T (2005) On the use of clone detection for identifying crosscutting concern code. IEEE Trans Softw Eng 31(10):804–818

    Article  Google Scholar 

  • Chou A, Yang J, Chelf B, Hallem S, Engler DR (2001) An empirical study of operating system errors. In: Symposium on operating systems principles, pp 73–88

  • Cordy JR (2003) Comprehending reality—practical barriers to industrial adoption of software maintenance automation. In: International workshop on program comprehension, IEEE CS Press

  • Cordy JR, Dean TR, Synytskyy N (2004) Practical language-independent detection of near-miss clones. In: CASCON, IBM Press

  • Di Lucca G, Di Penta M, Fasolino, A (2002) An approach to identify duplicated web pages. In: COMPSAC

  • Ducasse S, Rieger M, Demeyer S (1999) A language independent approach for detecting duplicated code. In: ICSM

  • Fowler M (1999) Refactoring: improving the design of existing code. Addison Wesley, Boston, MA, USA

    Google Scholar 

  • Gitchell D, Tran N (1999) Sim: a utility for detecting similarity in computer programs. In: SIGCSE, ACM Press

  • Godfrey M, Tu Q (2001) Growth, evolution and structural change in open source software. In: Workshop on principles of software evolution (September)

  • Higo Y, Ueda Y, Kamiya T, Kusumoto S, Inoue K (2002) On software maintenance process improvement based on code clone analysis. In: International conference on product focused software process improvement. Lecture notes in computer science, vol 2559. Springer

  • Johnson JH (1993) Identifying redundancy in source code using fingerprints. In: CASCON, IBM Press

  • Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multi-linguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28(7):654–670

    Article  Google Scholar 

  • Kapser C, Godfrey M (2003a) A taxonomy of clones in source code: the reengineers most wanted list. In: Working conference on reverse engineering. IEEE CS Press

  • Kapser C, Godfrey MW (2003b) Toward a taxonomy of clones in source code: a case study. In: Evolution of large scale industrial software architectures

  • Kapser C, Godfrey M (2005) Improved tool support for the investigation of duplication in software. Proceedings of the 21st IEEE international conference on software maintenance

  • Kapser C, Godfrey MW (2006) “Clones considered harmful” considered harmful. In: Working conference on reverse engineering

  • Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in OOPL. In: International symposium on empirical software engineering. IEEE CS Press, pp 83–92

  • Kim M, Sazawal V, Notkin D, Murphy GC (2005) An empirical study of code clone genealogies. In: European software engineering conference and foundations of software engineering (ESEC/FSE)

  • Komondoor R, Horwitz S (2001) Using slicing to identify duplication in source code. In: Proc. int. symposium on static analysis

  • Kontogiannis K, Mori RD, Merlo E, Galler M, Bernstein M (1996) Pattern matching for clone and concept detection. Autom Softw Eng 3(1/2):79–108

    Google Scholar 

  • Koschke R, Girard JF, Würthner M (1998) Intermediate representations for reverse engineering. In: Working conference on reverse engineering. IEEE CS Press, pp 241–250

  • Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: Working conference on reverse engineering. IEEE CS Press, pp 253–262

  • Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE

  • Lague B, Proulx D, Mayrand J, Merlo E, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: International conference on software maintenance, pp 314–321

  • Laguë B, Proulx D, Mayrand J, Merlo EM, Hudepohl J (1997) Assessing the benefits of incorporating function clone detection in a development process. In: ICSM

  • Lanubile F, Mallardo T (2003) Finding function clones in web applications. In: Conference on software maintenance and reengineering

  • Leitao AM (2003) Detection of redundant code using R2D2. In: Workshop source code analysis and manipulation. IEEE CS Press

  • Li Z, Lu S, Myagmar S, Zhou Y (2004) Cp-miner: a tool for finding copy-paste and related bugs in operating system code. In: Operating system design and implementation, pp 289–302

  • Li Z, Lu S, Myagmar S, Zhou Y (2006) Copy-paste and related bugs in large-scale software code. IEEE Trans Softw Eng 32(3):176–192 (March)

    Article  Google Scholar 

  • Manber U, Myers G (1991) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948 (October)

    Article  MathSciNet  Google Scholar 

  • Marcus A, Maletic J (2001) Identification of high-level concept clones in source code. In: Conference on automated software engineering

  • Mayrand J, Leblanc C, Merlo EM (1996) Experiment on the automatic detection of function clones in a software system using metrics. In: ICSM. IEEE Computer Society Press

  • McCreight E (1976) A space-economical suffix tree construction algorithm. J ACM 32(2):262–272

    Article  MathSciNet  Google Scholar 

  • Monden A, Nakae D, Kamiya T, Sato S, Matsumoto K (2002) Software quality analysis by code clones in industrial legacy software. In: IEEE symposium on software metrics, pp 87–94

  • Prechelt L, Malpohl G, Philippsen M (2000) Jplag: finding plagiarisms among a set of programs. Technical report, University of Karlsruhe, Department of Informatics

  • Rieger M (2005) Effective clone detection without language barriers. Dissertation, University of Bern, Switzerland

  • Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: Proceedings of the SIGMOD, pp 76–85

  • Shannon CE (1984) A mathematical theory of communication. Bell Syst Tech J 27(379–423):623–656

    MathSciNet  Google Scholar 

  • Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260

    Article  MATH  MathSciNet  Google Scholar 

  • Van Rysselberghe F, Demeyer S (2004) Evaluating clone detection techniques from a refactoring perspective. In: Conference on automated software engineering

  • Walter V, Seipel D, von Gudenberg JW, Fischer G (2004) Clone detection in source code by frequent itemset techniques. In: Workshop source code analysis and manipulation

  • Yang W (1991) Identifying syntactic differences between two programs. Software Pract Ex 21(7):739–755

    Article  Google Scholar 

Download references

Acknowledgements

We would like to thank Stefan Bellon for providing us with his benchmark and ccdiml, his support in the evaluation, and for comments on this paper. We also like to thank Felix Beckwermert for his support in the evaluation and Thilo Mende for comments on this paper. We would also like to thank the anonymous reviewers for their valuable comments.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rainer Koschke.

Additional information

Editors: Massimiliano Di Penta and Susan Sim

Rights and permissions

Reprints and permissions

About this article

Cite this article

Falke, R., Frenzel, P. & Koschke, R. Empirical evaluation of clone detection using syntax suffix trees. Empir Software Eng 13, 601–643 (2008). https://doi.org/10.1007/s10664-008-9073-9

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10664-008-9073-9

Keywords