Abstract
Automated trace retrieval methods based on machine-learning algorithms can significantly reduce the cost and effort needed to create and maintain traceability links between requirements, architecture, and source code. However, there is always an upfront cost to train such algorithms to detect the relevant architectural information for each quality attribute in the code. In practice, training supervised or semi-supervised algorithms requires an expert to collect several files of architectural tactics that implement a quality requirement and then to train a learning method on them. Establishing such a training set can take weeks to months. Furthermore, the effectiveness of this approach depends largely on the knowledge of the expert. In this paper, we present three baseline approaches for the creation of training data: (i) Manual Expert-Based; (ii) Automated Web-Mining, which generates training sets by automatically mining tactics' APIs from technical programming websites; and (iii) Automated Big-Data Analysis, which mines ultra-large-scale code repositories to generate training sets. We compare the trace-link creation accuracy achieved using each of these three baseline approaches and discuss the costs and benefits associated with them. Additionally, in a separate study, we investigate the impact of training set size on the accuracy of recovering trace links. The results indicate that automated techniques can create a reliable training set for the problem of tracing architectural tactics.
Notes
Weka’s NaiveBayesMultinomialText method was used.
Please see terms in the figures: http://www.1tech.eu/clients/casestudy_ventraq
Additional information
Communicated by: Patrick Mäder, Rocco Oliveto and Andrian Marcus
Cite this article
Zogaan, W., Mujhid, I., S. Santos, J.C. et al. Automated training-set creation for software architecture traceability problem. Empir Software Eng 22, 1028–1062 (2017). https://doi.org/10.1007/s10664-016-9476-y
DOI: https://doi.org/10.1007/s10664-016-9476-y