How Programmers Comment When They Think Nobody's Watching

Abstract
Documentation is essential to software development. Experienced programmers know this well from having worked with poorly documented code. They wish to improve their documentation techniques and habits, but there is little consensus for them to follow. Somehow, the many different standards must be compared objectively. This desire motivates my work, which aims to better understand existing documentation practices.
This work focuses exclusively on comments within program code. Programming is a complex human activity, despite a widespread misconception among programmers that writing code is a mechanical process. That complexity is especially visible in comments, where programmers express themselves freely. My work fills a gap in research on software documentation by systematically investigating the comments in a unique database of code written by programmers under natural conditions.
The true variety of programming behaviour is surprising. But this variety does not mean that the output of programmers is completely arbitrary; there are patterns in this data, which my research aims to understand.
This work makes three contributions:
- A novel taxonomy of comments developed from the data, which to date is the most thorough description of commenting behaviour actually exhibited by programmers.
- Empirical hypotheses regarding large-scale commenting behaviour, which were validated on separate test data. These hypotheses describe underlying regularities in programming which appear to transcend individual differences.
- The database of code I collected, which offers unique opportunities for further research on software development, and is therefore available for use by other researchers.
Open Data
The aggregate data studied in Chapters 3 and 4 is available for download with all textual information removed.
The full database of code studied in this thesis is available for legitimate research purposes. The data is not freely available for download, because it is impossible to ensure that all personal identifiers have been removed. Therefore, in order to protect the anonymity of the programmers, the data is available only upon request. To request it, please contact the current custodian of the data, whose email address is given below.