Systematic review of repositories on GitHub with Python (Game Dev Style)
While working on my Master’s in Game Technology at Breda University of Applied Sciences (BUas), I came across an interesting problem: how do I carry out a systematic review of GitHub repositories?
None of the standard literature regarding methodology covers this case. Okay okay… GitHub is relatively new and special!
The lack of answers in the literature led me into the wilds of the Internet. Sadly, when googling ‘Systematic Review of GitHub Repositories’, you find neither a “HowTo” nor a good GitHub repository with some software.
That I find little on regular Google is interesting because Google Scholar gives you a lot of entries when looking for Reviews on GitHub. Sadly, many of these reviews are, as per usual, behind a paywall. Moreover, those papers usually just claim, “We did this … description,” but there is seldom any code to be found, or if there is code, it is written in some obscure language an average Game Developer does not use…
Well, I just adjusted my search a bit. Finally, I came across this fantastic article from the Department of Information and Computing Sciences, Utrecht University: A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare by Zhengru Shen and Marco Spruit. They did a fantastic job by creating a paper that explains step-by-step how to do a systematic review of GitHub repositories!
Let me give you a TL;DR (for more details, you need to read my paper later or the original paper linked above):
Data Extraction
1. We do a preliminary search of our topic with some keywords or topics of our liking, using GitHub Search and GitHub Topics.
2. We try this in multiple languages to find the right one (I am talking about natural languages such as English). Maybe our specific topic is more common in Spanish than in English, or we need to analyze both. This is important to know.
3. Do we already have a time frame based on our literature review? This is important to take into account when searching for anything on GitHub.
4. We also need to decide: are the programming languages important or not? In general, what do we need from this search for our data analysis? For this, I recommend opening the REST API’s reference manual: GitHub search REST API. It lists the things you can extract.
4.1 Do I need some extra information besides the repository? Or is the byte size usage essential? If yes, check the rest of the API’s documentation: the API’s Reference.
5. We connect to the GitHub search REST API from our Python script using a token: How to use a token to authenticate? We are using Python and PyGithub, which does most of the work for us.
6. We can define multiple queries for the search if the topic needs it. Besides regular queries such as “scripting languages”, GitHub also lets you search for topics: “topic:scripting-languages”. Using topic queries in addition to regular queries may increase the number of results. Moreover, you can exclude things. For example, to exclude all Visual Studio extensions from your search, all you need is: “scripting languages NOT Visual+Studio” (the + is important because otherwise only “Visual” is excluded and not “Visual Studio”). For more info about the search syntax, check: Understanding the search syntax
7. After the last step, you need to store the findings in some form. For example, you can keep your results in a CSV file or in a database.
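The search and storage steps can be sketched in a few lines. This is a minimal example, assuming PyGithub 2.x is installed (pip install PyGithub); the helper names search_repositories, to_row and write_csv, the chosen columns and the limit of 50 results are placeholders of mine, not something taken from the paper.

```python
import csv

# Columns to keep; these are attribute names of PyGithub's Repository object.
FIELDS = ("id", "full_name", "description", "language", "stargazers_count")


def search_repositories(token, query, limit=50):
    """Yield at most `limit` repositories matching `query` (needs network access)."""
    # Imported here so the helpers below also work without PyGithub installed.
    from github import Auth, Github

    client = Github(auth=Auth.Token(token))
    for count, repo in enumerate(client.search_repositories(query=query)):
        if count >= limit:
            break
        yield repo


def to_row(repo, fields=FIELDS):
    """Turn a repository object into a plain dict; missing attributes become None."""
    return {name: getattr(repo, name, None) for name in fields}


def write_csv(rows, path, fields=FIELDS):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

With a real token you would collect rows with [to_row(r) for r in search_repositories(token, "topic:scripting-languages")] and pass them to write_csv(rows, "repositories.csv"). Keeping the network call in its own function makes the conversion and storage parts easy to try out offline.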
Data Processing
This is where my TL;DR ends since this highly depends on your topic. For example, you can do a Descriptive Analysis, an example of which can be found in the original paper. You can also use Generalized Additive Models to process the data. Moreover, you might need an AI to analyze all the README files, descriptions, etc., to extract the extra data you need. Zhengru Shen and Marco Spruit use Watson for some of their topic modeling (see 2.4. Topic Modeling).
Anyways, as a game dev (or a game dev to be), I love sample code and practical things! This is why I really like the paper I mentioned before: the authors also provided the GitHub repository with the source code used for their study. From an academic point of view, this delights my heart since I could reproduce their paper. The developer in me is happy since I have a script example for my paper! You can find the original source code here: ianshan0915/clinical-opensource-projects (a collection of Python scripts).
Warning: This repository is not made to be reused for your own project. You can (like I did), but you still need to read the source code and strip out the things you do not need. As I said, I love, absolutely love sample code! This is what this code is for.
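As a small taste of what a descriptive analysis can look like, the sketch below counts repositories per programming language and per creation year using only the standard library. The record layout (a language field and a created_at field holding an ISO date string) and the sample records are assumptions about how you stored your data, not something prescribed by the paper.

```python
from collections import Counter


def describe(records):
    """Count repositories per language and per creation year."""
    per_language = Counter(r["language"] or "unknown" for r in records)
    # The first four characters of an ISO date such as "2019-05-01" are the year.
    per_year = Counter(r["created_at"][:4] for r in records)
    return per_language, per_year


# Made-up sample records, only for illustration.
records = [
    {"full_name": "a/one", "language": "Python", "created_at": "2019-05-01"},
    {"full_name": "b/two", "language": "C++", "created_at": "2019-11-20"},
    {"full_name": "c/three", "language": None, "created_at": "2021-02-14"},
]
languages, years = describe(records)
print(languages.most_common())
print(sorted(years.items()))
```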
What does this repo contain? Before we open the GitHub repo and clone it, let’s have a look at how they describe it in their paper:
“The data extraction pipeline was written in Python using a third-party library, PyGitHub. The pipeline took the chosen search terms as input and received repository data in JSON. The JSON responses were first filtered and then converted to database records and pushed to tables in a MySQL database. Repositories with no description or no programming language specified were excluded from further analysis for the reason that clinical software was the focus of our study. The whole process is reproducible by running the Python scripts at Reference [22]. Moreover, replacing the search term with others scales the pipeline to other domains.” (22: Source Codes of Open Source Clinical Software. Available online: https://github.com/ianshan0915/clinical-opensource-projects (accessed on 25 November 2018)) A Systematic Review of Open Source Clinical Software on GitHub for Improving Software Reuse in Smart Healthcare
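The exclusion rule in that quote is simple enough to sketch. This is my reading of the description, not the authors’ code; the dictionary keys are assumed to mirror the description and language fields of the GitHub API response, and the sample records are made up.

```python
def keep_repository(record):
    """Return True if the record has both a description and a language."""
    description = (record.get("description") or "").strip()
    language = (record.get("language") or "").strip()
    return bool(description) and bool(language)


# Made-up sample records, only for illustration.
records = [
    {"full_name": "a/one", "description": "EHR viewer", "language": "Python"},
    {"full_name": "b/two", "description": None, "language": "Java"},
    {"full_name": "c/three", "description": "Notes", "language": None},
]
kept = [r["full_name"] for r in records if keep_repository(r)]
print(kept)
```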
After reading the description of their data extraction pipeline, we open the repo. However, there are no instructions on how to use/install the scripts in the README. But no worries, I have a quick summary of what you need:
- PyGithub for communication with the GitHub REST API
- pandas
- Watson Developer Cloud Python SDK for analysing the readme files and descriptions
- textrazor
My conclusion: it is excellent to have the code, but I would spend too much time bending it to my will. I am better off following their lead and writing my own script, a feeling I think all game devs know.
My take on it
So this is what I did, and I ended up with a small Python script that serves my purposes. If you just need data collection from GitHub, you can use it too: GitHub search query python
The script allows you to write a simple configuration JSON file:
{
  "token": "my token",
  "readme_dir": "./",
  "output": "./",
  "format": "CSV",
  "criteria": {
    "time": {
      "min": 2010,
      "max": 2022
    }
  },
  "terms": [
    "MY SEARCH QUERY",
    "MY SEARCH QUERY"
  ],
  "attrs": [
    "id",
    "full_name"
  ]
}
To communicate with the GitHub API, you need a token, which you can obtain via your GitHub account. The token field in the config is optional: you can also pass the token as an argument to the script if you prefer! As of writing, you get 5000 requests per hour. In addition, the script waits for a cooldown time between requests so that the API’s DDoS protection does not lock you out.
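A cooldown of this kind fits in a few lines. The sketch below is a simplified illustration rather than the script’s actual code: the Throttle class and the two-second interval in the usage note are assumptions.

```python
import time


class Throttle:
    """Make sure consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one throttle = Throttle(2.0) and call throttle.wait() right before every request. The clock and sleep parameters exist only so the behaviour can be checked without actually waiting.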
The output field lets you define where your collected data will be stored. The file name will be repositories_DATE.[csv,json]: I decided to write out a CSV or JSON file so that you can parse it later if you need to. If you need to download README files, you also provide a readme_dir field. They will be stored in there, named by repo + date. If the field is not present, the script assumes you do not need them. The criteria field takes the time frame you are interested in, and only repositories within that time frame are collected.
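To show how these fields fit together, here is a rough sketch of reading such a file and deriving the output path. It is a simplified illustration rather than the script’s actual code: load_config, output_path, the defaults and the YYYY-MM-DD date format in the file name are assumptions.

```python
import json
from datetime import date
from pathlib import Path


def load_config(path):
    """Read the JSON configuration and fill in assumed defaults."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    config.setdefault("output", "./")
    config.setdefault("format", "CSV")
    config.setdefault("readme_dir", None)  # None: do not download READMEs
    if not config.get("terms"):
        raise ValueError("the config needs at least one search term")
    return config


def output_path(config, today=None):
    """Build <output>/repositories_<date>.<csv|json>."""
    today = today or date.today()
    extension = config["format"].lower()
    return Path(config["output"]) / f"repositories_{today.isoformat()}.{extension}"
```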
The heart of your config is a list of the terms you are searching for:
"terms": [
"topic:visual-scripting-language NOT Visual+Studio",
"topic:visual-programming-language NOT Visual+Studio",
"topic:visual-programming NOT Visual+Studio",
"topic:visual-scripting NOT Visual+Studio",
"topic:visual-programming-editor NOT Visual+Studio",
"topic:dataflow-programming NOT Visual+Studio"
],
The criteria are applied to every one of those search terms!
Last but not least, we have attrs, which lets you define the fields you care about from the repository object returned by the REST API. There is more info on what to write in there: GitHub search REST API
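One way to picture this: every term is combined with the time frame into its own query. The sketch below assumes the time frame is expressed through GitHub’s created: search qualifier; build_queries is a hypothetical helper, and the real script may differ in detail.

```python
def build_queries(config):
    """Combine each search term with the configured time frame."""
    time_frame = config.get("criteria", {}).get("time")
    queries = []
    for term in config["terms"]:
        if time_frame:
            # Example result: "... created:2010-01-01..2022-12-31"
            term = (
                f"{term} created:{time_frame['min']}-01-01"
                f"..{time_frame['max']}-12-31"
            )
        queries.append(term)
    return queries
```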
Now you might wonder how to actually run this script:
python github-search-query.py --help
The previous command will give you some ideas on how to run it. But there is a faster way:
python github-search-query.py config.json
And if you want to pass a token along:
python github-search-query.py --token my_token config.json
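For the curious, a command line like the ones above could be parsed roughly as follows. This is a simplified sketch, not the script’s real parser; in particular, letting --token take precedence over the token from the config file is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the config path and an optional token from the command line."""
    parser = argparse.ArgumentParser(
        description="Collect repositories from GitHub search."
    )
    parser.add_argument("config", help="path to the JSON configuration file")
    parser.add_argument("--token", default=None, help="GitHub access token")
    return parser.parse_args(argv)


def resolve_token(args, config):
    """Prefer the command-line token, fall back to the config file."""
    token = args.token or config.get("token")
    if not token:
        raise SystemExit("no token given: use --token or the 'token' config field")
    return token
```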
Well, that’s pretty much it! Have fun collecting data!