GitHub Systematic Review: Know your data.

While I was working on my Master's in Game Technology at Breda University of Applied Sciences (BUas), I came across an interesting problem: How do I execute a systematic review of GitHub repositories?

When searching on the Internet, it was challenging to find good sources. I came across the fantastic work from the Department of Information and Computing Sciences, Utrecht University: A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare by Zhengru Shen and Marco Spruit. Their paper explains step by step how to do a systematic review of GitHub repositories! As I mentioned in my previous blog post: “None of the standard literature regarding methodology covers this case. Okay, okay… GitHub is relatively new and special!”

You can basically follow along and do as they described. See my previous blog post for a TL;DR version. There, I describe how to execute the data collection method in a more practical way, with a Python script-based example. The example code is, of course, on GitHub itself: collect-data-from-github

This is all great, but I wanted to discuss the data collection and processing part a bit more. As I found out, the kind of data you need to collect from the GitHub REST API differs greatly from topic to topic. It is also important to discuss the reasoning behind the data and its processing.

TL;DR

When gathering data from GitHub, we need to know in advance what we need for our study. There are two kinds of data we can collect: textual data and numerical data. The textual data can be separated into source code and metadata. The metadata contains tags, topics, descriptions, README files, and documentation links. The numeric values are either dates, such as the last date the project was updated, or counts and sizes, such as the size of the project, the number of people starring the project, the number of issues, etc. The textual and numeric data must always be analyzed within the proper context. Several existing tools can help with the analysis.

A few important takeaways about some specific data categories:

  • A.I. is important for making sense of textual content such as README files or documentation when manual labor is to be avoided. Examples are IBM Watson Natural Language Understanding, the Natural Language Toolkit (NLTK) for Python [9], or Google’s Natural Language A.I. [10].
  • Source code analysis can be done with static analysis tools, documentation tools, or codebase visualizers such as Sourcetrail and Source Graph.
  • Stars are an indication of interest in a project, which needs to be put into proper context; tools such as RepoReaper and GHTorrent can help with this. Stars are also used simply as bookmarks, which does not guarantee that the user is actually interested in the project. In addition, fake stars or promotion campaigns can distort star growth.
  • Commits are not a reliable measure of whether a project is successful or popular when matched with stars, because every project has its own commit policy [14].
  • Forks are somewhat more reliable since there is a strong indication that they are connected with the number of stars. However, one should keep in mind that bots exist that do nothing but fork projects [14].

What kind of data can you collect on GitHub?

GitHub provides a REST API [2], a service that lets the user send requests in a defined way via the HTTP protocol and receive answers [3]. The service provides many endpoints. For a systematic review of GitHub projects, the search endpoint, the part of the REST API concerned with searching for repositories, is very important. Besides this, a few other endpoints are very useful; they are listed in Table 1.

Endpoint | Description
Repositories | “The Repos API allows access information about GitHub repositories.”
Projects | “The Projects API let you fetch projects in a repository.”
Organizations | “The Organizations API let you fetch information about a GitHub organizations.”

Table 1: Useful API endpoints other than Search.
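To make the search endpoint a bit more concrete, here is a minimal Python sketch of how a request to it could be put together using only the standard library. The function names, the example query, and the optional token handling are my own assumptions for illustration; they are not taken from the paper or from the collect-data-from-github script.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_search_url(query, sort="stars", order="desc", per_page=30, page=1):
    """Build the URL for the repository search endpoint (no request is sent)."""
    params = urllib.parse.urlencode({
        "q": query,            # search qualifiers, e.g. "topic:healthcare"
        "sort": sort,
        "order": order,
        "per_page": per_page,  # GitHub caps this at 100
        "page": page,
    })
    return f"{API_ROOT}/search/repositories?{params}"


def search_repositories(query, token=None, **kwargs):
    """Send one search request and return the list of repository objects.

    This performs a real HTTP call; `token` is an optional personal access
    token that raises the rate limit.
    """
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(build_search_url(query, **kwargs), headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["items"]


# Hypothetical query: repositories tagged "healthcare" written in Python.
print(build_search_url("topic:healthcare language:python", per_page=5))
```

Keeping the URL construction separate from the HTTP call makes it easy to log or review the exact query that was used, which helps when documenting the search strategy of a review.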

From the GitHub REST API you can gather two main kinds of data: numerical and textual. These can be used for further analysis. There are two kinds of textual information you can collect on GitHub: source code and metadata that describes the repository [1,2].

The source code can, for example, be scanned manually, with a static analysis tool, or with any other tool that can understand source code and give us a meaningful understanding of it. Alberto S. Nuñez-Varela et al. have shown in their study Source code metrics: A systematic mapping study [5] what kinds of metrics can be applied when analyzing source code. The results of their work are very interesting since they report almost 300 source code metrics. They also concluded that object-oriented metrics were the most commonly found and that more research is needed on metrics for feature-oriented aspects.

Furthermore, from a practical point of view, tools that visualize a codebase are useful if one is looking for certain patterns. Sourcetrail is such an open source tool that lets you navigate through the codebase visually. Another useful tool is Source Graph, which allows you to search your code and 2M+ open source repositories [6].

The other metadata, such as README files, wikis, or documentation stored in the project, needs to be analyzed by hand. Alternatively, with the rise of AI, one can use tools such as IBM Watson Natural Language Understanding [7] to help understand and process natural language. How to use IBM’s Watson is explained well in the article Getting started with NLP using IBM Watson Studio by Aritro Mukherjee [8]. There are, of course, alternatives such as the Natural Language Toolkit (NLTK) for Python [9] or Google’s Natural Language AI [10]. These AI-driven tools can help to identify themes and topics or to search for the needed information within the metadata of a repository.
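Before reaching for one of these services, a very rough first look at a README can be done with plain Python. The sketch below assumes the README was fetched from the repository README endpoint, which returns the file base64-encoded in a "content" field. It then counts the most frequent terms. This is a simple word count, not NLTK or Watson, and the stop word list and function names are made up for illustration.

```python
import base64
import re
from collections import Counter

# Tiny, hand-picked stop word list (an assumption; real NLP tools ship better ones).
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "are", "from"}


def decode_readme(payload):
    """Decode the base64 'content' field of a README API response into text."""
    if payload.get("encoding") != "base64":
        raise ValueError("expected a base64-encoded README payload")
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def top_terms(text, n=5):
    """Return the n most frequent words, ignoring stop words and very short words."""
    words = re.findall(r"[a-z][a-z0-9+#-]*", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical payload standing in for a real API response.
sample = {
    "encoding": "base64",
    "content": base64.b64encode(
        b"Clinical imaging toolkit.\nImaging tools for clinical imaging."
    ).decode("ascii"),
}
print(top_terms(decode_readme(sample), 2))
```

A frequency count like this only hints at themes; for anything beyond a first impression, the tools mentioned above are the better choice.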

Numerical values again come in two types: dates and plain numeric values [1,4]. The dates can be used to limit our selection, since GitHub gives us only three dates [4]: pushed_at, updated_at, and created_at. updated_at changes any time the repository object is updated, for example when someone changes the description or the primary language of the repository. pushed_at, on the other hand, represents the date and time of the last commit. So updated_at is the timestamp of the last change to the repository, which might be a commit but may also be something else, such as changing the description of the repo or creating wiki pages. That is why one can say that commits are a subset of updates, and the pushed_at timestamp will therefore either be the same as the updated_at timestamp or be an earlier one [2,11].
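As a small illustration of using these dates as a filter, the following Python sketch parses the timestamps (GitHub returns them as ISO 8601 strings in UTC) and checks whether a repository was pushed to within a given number of days. The one-year default and the sample repository are assumptions for the example, not a recommendation from the paper.

```python
from datetime import datetime, timedelta, timezone


def parse_github_time(value):
    """Parse a GitHub timestamp such as '2022-01-15T08:30:00Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_recently_pushed(repo, now, max_age_days=365):
    """True if the last push is at most max_age_days before `now`."""
    age = now - parse_github_time(repo["pushed_at"])
    return age <= timedelta(days=max_age_days)


# Hypothetical repository object reduced to its three date fields.
repo = {
    "created_at": "2019-03-01T10:00:00Z",
    "updated_at": "2022-02-01T12:00:00Z",
    "pushed_at": "2022-01-15T08:30:00Z",
}
reference_date = datetime(2022, 4, 2, tzinfo=timezone.utc)
print(is_recently_pushed(repo, reference_date))
```

Passing the reference date in explicitly, instead of calling the current time inside the function, keeps the filter reproducible when the review is re-run later.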

The other numeric values of interest that can be gathered from a GitHub repository are described in Table 2:

Numeric Value | Description
Stargazers | “Stargazers refers to the number of times a repository is bookmarked. It reflects an approximate level of interest in the repository” [1,2]
Forks | “A fork is a copy of a repository. Forking is necessary for developers to contribute a project. Forks refers to the number of forks.” [1,2]
Contributors | “The number of contributors who have worked for a repository.” [1,2]
Commits | “The total number of commits.” [1,2]
Issues | “Issues is the number of open issues in a repository” [1,2]
Size of source codes | “The size is valued as the size of the whole repository (including all of its history), in KB.” [1,2]
Size of README file | “The size of the README file of a repository, in B” [1,2]

Table 2: Overview of numerical values one can gather on GitHub*
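Several of these values come directly with every repository object returned by the API. The Python sketch below pulls them into a flat record; the field names on the right are the ones used in the API response, while the function name and the output keys are my own choice. Contributors, the total number of commits, and the README size are not part of the repository object, so I assume they have to be collected with separate requests.

```python
def extract_metrics(repo):
    """Reduce a repository object from the API to the numeric values of interest."""
    return {
        "full_name": repo["full_name"],
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
        # Note: GitHub counts open pull requests in this number as well.
        "open_issues": repo["open_issues_count"],
        "size_kb": repo["size"],
    }


# Hypothetical repository object with made-up numbers.
repo = {
    "full_name": "example/clinical-tool",
    "description": "Only here to show that extra fields are ignored",
    "stargazers_count": 120,
    "forks_count": 30,
    "open_issues_count": 4,
    "size": 2048,
}
print(extract_metrics(repo))
```

Storing one such record per repository (for example as rows in a CSV file) gives a table that can be fed straight into the statistical analysis.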

This data can be processed with methods such as generalized additive models (GAMs), an extension of generalized linear models (GLMs). A GAM is an additive modeling technique that captures the impact of the predictive variables through smooth functions [1,2].

The meaning of numeric values: Stars, Forks and Commits

Zhengru Shen and Marco Spruit, as well as other researchers [13,14,15], suggest using stargazers (stars) as an indication of popularity. Research has shown the “stargazers-based classifier […] to exhibit high precision (97%)” when trying to retrieve engineered software projects [16]. The research by Munaiah, N. et al. resulted in “RepoReaper”. RepoReaper “is a tool used to assess a GitHub repository in the form of a score. It considers a number of different attributes in order to perform a thorough assessment.” [17] The tool is intended to be used “together with a database of metadata provided by the GHTorrent project, reaper considers both contextual information such as commit history as well as the contents of the repository itself [17].” RepoReaper can be helpful for scoring the results of your systematic review of a subset of specific projects.

The meaning of stars from a developer point of view

However, it is important to discuss the meaning of stars, forks, and commits. As stated by GitHub, stars are a way to keep track of projects that you find interesting and to discover related projects via the explore/“news” feed [18]. They are also frequently used by the community as bookmarks [15]. As explained in several sources, such as Zhengru Shen and Marco Spruit and many other researchers [13,14,15], stars have two major functions: they indicate that someone likes the project, or they bookmark the project [15]. A developer on Open Source Stack Exchange states this perfectly: “[…] Users on the GitHub website are able to “star” other people’s repositories, thereby saving them in their list of Starred Repos. Some people use “stars” to indicate that they like a project, other people use them as bookmarks so they can follow what’s going on with the repo later. […]” [15, Left SE On 10_6_19].

Hudson Borges et al. provide the academic data behind the statements gathered on Open Source Stack Exchange [15]: in their survey of 791 developers, they describe how developers use stars on GitHub [14]. Their Table 2 [14], reproduced here as Table 3, shows this.

Note: One answer can receive more than one theme, so the numbers might not add up to 791. See the paper for more details.

Reason | Total | %
To show appreciation | 415 | 52.5
Bookmarking | 404 | 51.1
Due to usage | 290 | 36.7
Due to recommendations | 36 | 4.6
Unknown reasons | 5 | 0.6

Table 3: Why do users star GitHub repositories? (95% confidence level with a 3.15% confidence interval). Based on Table 2 from “What’s in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform” by Hudson Borges et al.
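The percentages in Table 3 are each theme's count divided by the 791 respondents, which is also why they do not sum to 100%. A few lines of Python reproduce the arithmetic from the counts in the table; the variable and function names are only illustrative.

```python
RESPONDENTS = 791

# Counts per theme as listed in Table 3.
THEME_COUNTS = {
    "To show appreciation": 415,
    "Bookmarking": 404,
    "Due to usage": 290,
    "Due to recommendations": 36,
    "Unknown reasons": 5,
}


def theme_percentages(counts, respondents):
    """Share of respondents per theme, in percent, rounded to one decimal."""
    return {theme: round(100 * n / respondents, 1) for theme, n in counts.items()}


percentages = theme_percentages(THEME_COUNTS, RESPONDENTS)
for theme, share in percentages.items():
    print(f"{theme}: {share}%")

# One answer can carry several themes, so the counts exceed the number of respondents.
print("Total theme assignments:", sum(THEME_COUNTS.values()))
```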

The reason one needs to look at stars critically is that, from a technical perspective, they are simple to fake: it is easy to create fake GitHub accounts and write a bot that stars your project [15]. Besides that, a huge growth in stars might be the result of promotion on social media (e.g. Twitter) [13]. Hudson Borges et al. suggest that “when ranking projects, we should check whether stars are result of active promotion” in their recommendations for researchers at the end of their journal article [14]. In short, there are two main reasons to look critically at stars: they might have been created through fake accounts, and they might be the result of active promotion on social media platforms.
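One simple way to spot candidates for such promotion or bot activity is to look for days on which a repository received far more stars than usual. The Python sketch below assumes you already have the starred_at timestamps of a repository (the stargazers endpoint can return them when the star media type is requested in the Accept header). The factor of 5 and the comparison against the median are an arbitrary heuristic of mine, not a method from the cited papers, and a flagged day still has to be checked by hand.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def stars_per_day(starred_at):
    """Count stars per calendar day from ISO 8601 timestamps such as '2022-01-05T09:00:00Z'."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date() for ts in starred_at
    )


def burst_days(starred_at, factor=5.0):
    """Return the days whose star count exceeds `factor` times the median active day."""
    per_day = stars_per_day(starred_at)
    if not per_day:
        return []
    typical = median(per_day.values())  # median over days with at least one star
    return sorted(day for day, n in per_day.items() if n > factor * typical)


# Made-up timestamps: one star on each of four days, then twelve on a single day.
timestamps = [f"2022-01-0{d}T12:00:00Z" for d in range(1, 5)]
timestamps += [f"2022-01-05T{h:02d}:00:00Z" for h in range(12)]
print(burst_days(timestamps))
```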

Moreover, Hudson Borges et al. suggest in their conclusion that stars are important for users when picking a project. The study found that 3 out of 4 of the 791 surveyed developers check the stars metric before using or contributing to a project. Despite this strong link between stars and popularity, the paper suggests that other factors such as code quality and documentation are important too. Both factors can be evaluated with RepoReaper in order to relate popularity to documentation and code quality [16,17]. For many developers, stars are a strong signal when deciding whether to use or contribute to a project, but other factors such as code quality and documentation matter as well.

Forks and commits

Research has shown that forks can be used to evaluate the popularity of a project since there is a strong correlation between stars and forks [15]. It is worth noting that fork bots exist that fork projects automatically, since forking can also be automated via the REST API [2]. That bots may create forks is worth considering when using the fork count as an indication of popularity. Forking via bots is less common than faking stars, so it can usually be ignored, but it is still important to keep in mind when building an argument about the correlation between stars and forks.

Commits, however, show only a weak correlation with stars [15]. The practical explanation is that the way projects handle commits differs greatly, since every bigger project has its own policies on how often one should commit and how big commits should be. For example, the Qt project has its own Commit Policy, as does the KDE project (Policies/Commit Policy). From a developer's point of view, this explains why the number of commits cannot be tightly linked to stars or to a project's popularity in general. What commits mainly indicate is how actively a project is maintained, since every commit changes the pushed_at date.
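If you want to check how strongly stars, forks, and commits are related in your own sample, a rank correlation is one option. The Python sketch below implements Spearman's rank correlation with the standard library only. Choosing Spearman here is my assumption, not a statement about the method used in the cited study, and the numbers are invented.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; raises ValueError if one of the inputs is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented sample of five repositories.
stars = [10, 200, 3000, 45000, 120]
forks = [1, 20, 250, 900, 15]
commits = [500, 40, 2500, 300, 9000]
print("stars vs forks:  ", round(spearman(stars, forks), 2))
print("stars vs commits:", round(spearman(stars, commits), 2))
```

Rank correlation is used instead of the raw values because star and fork counts are usually heavily skewed, with a few very popular projects dominating the sample.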

Takeaways

The main takeaway is that when analyzing GitHub repositories, one has to take the different limiting factors into account. These limiting factors are sometimes better described by users of GitHub than by pure academia. The best example is the discussion on Open Source Stack Exchange, “GitHub Stars” is a very useful metric. But for what?, asked by Left SE On 10_6_19 [15]. Moreover, it is important to understand the meaning behind the possible metrics. Several authors have done great work on this [1,5,13,16]. Therefore it can be said that:

  • A.I. is important for making sense of textual content such as README files or documentation when manual labor is to be avoided. Examples are IBM Watson Natural Language Understanding, the Natural Language Toolkit (NLTK) for Python [9], or Google’s Natural Language A.I. [10].
  • Source code analysis can be done with static analysis tools, documentation tools, or codebase visualizers such as Sourcetrail and Source Graph.
  • Stars are an indication of interest in a project, which needs to be put into proper context; tools such as RepoReaper and GHTorrent can help with this. Stars are also used simply as bookmarks, which does not guarantee that the user is actually interested in the project. In addition, fake stars or promotion campaigns can distort star growth.
  • Commits are not a reliable measure of whether a project is successful or popular when matched with stars, because every project has its own commit policy [14].
  • Forks are somewhat more reliable since there is a strong indication that they are connected with the number of stars. However, one should keep in mind that bots exist that do nothing but fork projects [14].

References

[1] Department of Information and Computing Sciences, Utrecht University: A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare by Zhengru Shen and Marco Spruit

[2] GitHub REST API (accessed on 02 April 2022).

[3] What is a REST API? (accessed on 02 April 2022).

[4] GitHub Repositories (accessed on 02 April 2022).

[5] Alberto S. Nuñez-Varela, Héctor G. Pérez-Gonzalez, Francisco E. Martínez-Perez, Carlos Soubervielle-Montalvo, Source code metrics: A systematic mapping study, Journal of Systems and Software, Volume 128, 2017 https://doi.org/10.1016/j.jss.2017.03.044.

[6] Source Graph (accessed on 02 April 2022).

[7] IBM Watson Natural Language Understanding. Available online: https://www.ibm.com/cloud/watson-natural-language-understanding (accessed on 02 April 2022).

[8] Getting started with NLP using IBM Watson Studio by Aritro Mukherjee (accessed on 02 April 2022).

[9] Natural Language Toolkit (NLTK) (accessed on 02 April 2022).

[10] Google’s Natural Language AI (accessed on 02 April 2022).

[11] Difference between “updated_at” and “pushed_at” in repositories list response (accessed on 02 April 2022).

[12] Wood, S.N. Generalized Additive Models: An Introduction with R; Chapman and Hall/CRC: London, UK, 2006.

[13] Characterizing and predicting the popularity of github projects by Hudson Silva Borges https://repositorio.ufmg.br/handle/1843/BIRC-BBLN2S (accessed on 02 April 2022).

[14] Hudson Borges, Marco Tulio Valente, What’s in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform, Journal of Systems and Software, Volume 146, 2018, Pages 112-129, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2018.09.016.

[15] “GitHub Stars” is a very useful metric. But for what? asked by Left SE On 10_6_19 on opensource.stackexchange.com (accessed on 02 April 2022).

[16] Munaiah, N., Kroh, S., Cabrey, C. et al. Empir Software Eng (2017) 22: 3219. https://doi.org/10.1007/s10664-017-9512-6

[17] RepoReaper (accessed on 02 April 2022).

[18] Saving repositories with stars (accessed on 02 April 2022).

Simon Renger
CI & Tools Engineer

Write programs that do one thing and do it well. Write programs to work together — McIlroy Unix philosophy