Usage and attribution of Stack Overflow code snippets in GitHub projects

Empirical Software Engineering

Abstract

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
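
To make the notion of a file "with a reference to SO" concrete, the following is a minimal sketch of how such references could be detected in source text. It is not the tooling used in the study: the class name `SoReferenceFinder`, its methods, and the assumed link shapes (`/questions/<id>`, `/q/<id>`, `/a/<id>`) are hypothetical choices made for illustration. A link back to the question or answer is treated here only as a proxy for attribution; it does not by itself show that a project complies with CC BY-SA 3.0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not the study's tooling): finds Stack Overflow links in source text. */
final class SoReferenceFinder {

    // Assumed link shapes: /questions/<id>, /q/<id>, and /a/<id> on stackoverflow.com.
    private static final Pattern SO_LINK = Pattern.compile(
            "https?://(?:www\\.)?stackoverflow\\.com/(?:questions|q|a)/(\\d+)[^\\s\"')>]*",
            Pattern.CASE_INSENSITIVE);

    private SoReferenceFinder() {}

    /** Returns every Stack Overflow URL found in the given file content, in order of appearance. */
    static List<String> findReferences(String fileContent) {
        List<String> links = new ArrayList<>();
        Matcher matcher = SO_LINK.matcher(fileContent);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }

    /** True if the content contains at least one link to a Stack Overflow question or answer. */
    static boolean hasReference(String fileContent) {
        return SO_LINK.matcher(fileContent).find();
    }
}
```

For example, a Java file whose comment reads `// Source: http://stackoverflow.com/a/3758880` would count as containing a reference, whereas a file with the same code and no link would not. Counting projects with at least one such file is one simple way to arrive at a ratio of the kind reported above, under the assumption that developers attribute by URL.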

Acknowledgements

The authors would like to thank the participants of the online surveys, the anonymous reviewers, and Bernhard Baltes-Götz for their valuable feedback. Moreover, we thank Richard Kiefer for his help with the calibration of CPD and the extraction of the snippet sets and Florian Reitz for his help with database-related issues.

Author information

Corresponding author

Correspondence to Sebastian Baltes.

Additional information

Communicated by: Ahmed E. Hassan

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Baltes, S., Diehl, S. Usage and attribution of Stack Overflow code snippets in GitHub projects. Empir Software Eng 24, 1259–1295 (2019). https://doi.org/10.1007/s10664-018-9650-5

  • DOI: https://doi.org/10.1007/s10664-018-9650-5
