Code corpora, as observed in large software systems, are now known to be far more repetitive and predictable than natural language corpora. But why? Does the difference simply arise from the syntactic limitations of programming languages? Or does it arise from the differences in authoring decisions made by the writers of these natural and programming language texts? We conjecture that the differences are not entirely due to syntax, but also arise from the fact that reading and writing code is un-natural for humans and requires substantial mental effort; so, people prefer to write code in ways that are familiar to both reader and writer. To support this argument, we present results from two sets of studies: 1) a first set aimed at attenuating the effects of syntax, and 2) a second set aimed at measuring the repetitiveness of text written in other settings (e.g. second language, technical/specialized jargon), which are also effortful to write. We find that this repetition in source code is not entirely the result of grammar constraints, and thus some repetition must result from human choice. While the evidence we find of similar repetitive behavior in technical and learner corpora does not conclusively show that such language is used by humans to mitigate difficulty, it is consistent with that theory. This discovery of “non-syntactic” repetitive behavior is actionable, and can be leveraged for statistically significant improvements on the code suggestion task. We discuss this finding and its future implications for practice and for research.
Precise details on the datasets and language models will be presented later in their respective sections.
While this category is very rarely updated, there could be unusual and significant changes in the language – for instance a new preposition or conjunction in English.
Indeed, it is not clear how to even design such an experiment.
Bidirectional LSTMs can make use of context both before and after a token.
A good explanation of the details of LSTM cell structure can be found at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
One exception for Java is the Eclipse project, which was not hosted on GitHub but was selected for its significance within the Java community.
We add \, proc, forall, mdo, family, data, and type.
We add __ENCODING__, __END__, __FILE__, and __LINE__.
We add recur, set!, monitor-enter, monitor-exit, throw, try, catch, finally, and /, along with some operators Pygments had classified as Token.Names.
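The kind of adjustment described in these notes can be illustrated with a minimal sketch. It assumes that the added words are treated as closed category even when the lexer labels them as names, and that token types beginning with Keyword, Operator, or Punctuation are closed category by default; both are assumptions made for illustration, not a statement of the exact rules used in the study. The token type labels imitate Pygments-style names, but no lexer is invoked, and `EXTRA_CLOSED`, `is_closed`, and `split_categories` are hypothetical names.

```python
from typing import FrozenSet, Iterable, List, Tuple

Token = Tuple[str, str]  # (token type label, token text)

# Assumption: these token-type prefixes mark closed-category tokens by default.
CLOSED_TYPE_PREFIXES = ("Keyword", "Operator", "Punctuation")

# A subset of the manually added words listed above (illustrative only).
EXTRA_CLOSED: FrozenSet[str] = frozenset(
    {"recur", "set!", "throw", "try", "catch", "finally", "/"}
)


def is_closed(token: Token, extra: FrozenSet[str] = EXTRA_CLOSED) -> bool:
    """Return True if the token is closed category by type or by manual override."""
    token_type, text = token
    return token_type.startswith(CLOSED_TYPE_PREFIXES) or text in extra


def split_categories(tokens: Iterable[Token]) -> Tuple[List[str], List[str]]:
    """Split token texts into (open category, closed category), preserving order."""
    open_tokens: List[str] = []
    closed_tokens: List[str] = []
    for token in tokens:
        target = closed_tokens if is_closed(token) else open_tokens
        target.append(token[1])
    return open_tokens, closed_tokens


# Hand-written sample in which "try" and "/" carry a Name label,
# so only the manual override moves them to the closed category.
sample = [
    ("Punctuation", "("),
    ("Name", "try"),
    ("Name.Variable", "x"),
    ("Name", "/"),
    ("Literal.Number", "2"),
    ("Punctuation", ")"),
]
open_tokens, closed_tokens = split_categories(sample)
```

Keeping the overrides in a separate set makes it easy to move a word between categories and rerun the analysis when checking how sensitive the results are to the category boundaries.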
In particular, we treated these primitive types as open category to be consistent with how other programming languages, like Haskell, treat their types.
Additionally, in our experience, tweaking the boundaries of these categories results only in slight changes in repetition.
(e.g. ADVP-TMP reflects that the adverbial phrase serves a temporal function).
Indeed, when running LSTMs over just the nonterminals, we see that the Java grammar is more predictable than the English grammar.
The size of the entropy difference between English and Code open category words is less, though still larger than between all tokens.
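As a reminder of what an entropy figure of this kind measures, the sketch below computes the average number of bits per token that a model assigns to held-out text. It uses a deliberately simple add-one-smoothed unigram model with a vocabulary that includes the test tokens; this is a stand-in chosen for brevity, not one of the language models used in this work, and `unigram_cross_entropy` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Sequence


def unigram_cross_entropy(train: Sequence[str], test: Sequence[str]) -> float:
    """Average bits per token of `test` under an add-one-smoothed unigram model.

    Simplifying assumption: the vocabulary is the union of train and test
    tokens, so every test token receives non-zero probability.
    """
    if not test:
        raise ValueError("test sequence must be non-empty")
    counts = Counter(train)
    vocab_size = len(set(train) | set(test))
    denominator = len(train) + vocab_size  # add-one smoothing
    total_bits = 0.0
    for token in test:
        probability = (counts[token] + 1) / denominator
        total_bits -= math.log2(probability)
    return total_bits / len(test)


# A highly repetitive sequence versus one in which every token is distinct.
repetitive = unigram_cross_entropy(["x"] * 8, ["x"] * 4)
varied = unigram_cross_entropy(list("abcdefgh"), list("abcd"))
```

Lower values indicate more predictable text, so comparing two corpora amounts to comparing such averages computed under the same kind of model and the same tokenization.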
In the simplified parse tree in the case of English.
We note that an independent study on commit message entropy and build failure found similar ranges (Santos and Hindle 2016).
Allamanis et al. (2017) have extensively surveyed such applications.
See Tobe et al., PNAS, 114(22), May 2017.
This framework can be found at https://github.com/SLP-team/SLP-Core.
We would like to thank Professors C. Sutton, Z. Su, V. Filkov, and R. Aranovich, along with the UC Davis DECAL and NLP Reading groups, for comments and feedback on this research. We would also especially like to thank V. Hellendoorn for his feedback and input on our experiment comparing parse trees in Java and English. We also acknowledge support from NSF Grant #1414172, Exploiting the Naturalness of Software. Finally, we are grateful to the reviewers and editors of this journal for their thoughtful comments, which were very helpful in improving the work presented in this paper.
