This week, Alexander Peltzer (@apeltzer) will present: GitHub contribution basics. This will cover:
- GitHub and git - accounts and organisations
- GitHub contribution basics - Forking / editing / pull requests
- Best practices - Using feature branches / commit messages
- How to do a good code review
The talk will be live-streamed on YouTube:
- YouTube: https://youtu.be/gTEXDXWf4hE
This text has been edited to make it more suitable for reading.
We’re going to cover a bit of GitHub contribution basics today, as well as a basic introduction to Git for beginners. The talk is targeted at beginners, and there are most likely parts that we cannot cover today because of the limited time that we have.
So we’re going to start with the basics of Git, then cover a bit of how we can actively contribute to nf-core on GitHub, go over some best practices and finally learn how we can contribute and collaborate productively, because this is something that most beginners struggle with a bit. We will cover code review, do’s and don’ts while reviewing code, and what should ideally be done while reviewing.
Let’s start by understanding Git. Git is a free and open source version control system that most people know at least to some extent. It’s quite powerful, so even long-time users who have dealt with it for years still seem to learn new things. There are some very advanced features in Git that can help organise code or documentation in a version-controlled manner very efficiently. The basic features are actually not too complicated to use, and there are also graphical user interfaces that help people get started with Git if they want to adopt it. There are also a lot of how-tos and introductions for beginners available on GitHub that cover most cases and explain things. You could also just use Google to learn more about Git.
Two terms that are common nomenclature in Git are commits and repositories (or repos for short). If you’re talking with people in nf-core, they might point you towards a repo. A repository is basically a project where your code is stored. An example would be the nf-core/rnaseq repository that hosts all the code for the RNA-seq pipeline. We have many of these repositories: each of the pipelines has a separate one, but we also have repositories that host the code for our web page, some documentation or some test data sets. A commit is a package of changes that is applied to a repository: a set of logical edits that are chunked together and committed to the repository so that they are tracked by the version control system.
The git tree of a repository can have branches. My repository might have several separate branches, which is a way to organise code in the repository so that people can have a look at it later.
This slide shows some of the basic git commands. We have:
- git clone, which means download a copy of a repository;
- git add, which means stage changes that I have already made, like adding a new file, so that they are ready to commit;
- git commit, which more or less adds the staged changes to the repository together with a message. I can say “I’ve added this file”, which would be quite a nice commit message, because people then know what I’ve done;
- git push, which means I push these changes to the remote repository;
- git pull, which pulls changes that somebody else made, for example on the web, into my local copy of the repository.
There is also some additional vocabulary here. “PR” means pull request, and a fork is a derivative copy: I can fork a repository that is present on GitHub and make a copy of my own that I can work on independently. I could continue development on that and never return to the original repository. That happens; people abandon projects quite frequently. But I could also use the fork to develop my own feature, say a mapping method or some other change to a pipeline, and send it back to the main repository via a pull request.
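The five commands can be tried out without touching GitHub at all. The following is a minimal sketch, assuming git is installed; the local bare repository, the placeholder identity and the file name are hypothetical stand-ins for a repository hosted on GitHub, not part of the talk.

```shell
set -eu
# Hypothetical sandbox: a local bare repository stands in for a GitHub repository.
work=$(mktemp -d)
git init --quiet --bare "$work/remote.git"

# git clone: download a copy of the repository
git clone --quiet "$work/remote.git" "$work/local"
cd "$work/local"
git config user.name "Example User"        # placeholder identity for the sandbox
git config user.email "user@example.com"

# make a local change, then git add: stage it so it is ready to commit
echo "Some documentation" > README.md
git add README.md

# git commit: record the staged change together with a message
git commit --quiet -m "Add README with basic documentation"
branch=$(git rev-parse --abbrev-ref HEAD)

# git push: send the commit to the remote repository
git push --quiet origin "$branch"

# git pull: fetch and integrate changes that others may have pushed
git pull --quiet origin "$branch"
git log --oneline
```

With a real repository, only the URL passed to git clone changes; the remaining steps are the same.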
All of this sounds a bit more complicated than it actually is; it is something that one learns most easily by doing. Most people get along with just these five commands, but there are excellent tutorials, even interactive ones, where you can practise and try out more complicated things if you want. In nf-core we use branches quite extensively: we typically have at least three branches per pipeline repository. We have a dev branch; this is where all the development code is and where all the pull requests usually end up.
We have a master branch, which contains the stable releases only. This is what Nextflow typically pulls if you run a pipeline. We also have a TEMPLATE branch that holds the template functionality that we develop in the nf-core tools. It is kept in sync using a special sync approach that we’ve been developing in the team, based on the nf-core bot.
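To see what such a branch layout looks like locally, here is a small sketch in a throwaway repository. The repository and the empty commit are hypothetical; only the three branch names come from the description above.

```shell
set -eu
# Hypothetical sandbox repository; only the branch names mirror the talk.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the sandbox
git config user.email "user@example.com"
git commit --quiet --allow-empty -m "Initial commit"

git branch -M master      # stable releases only
git branch dev            # development code, target of pull requests
git branch TEMPLATE       # template code, kept in sync by the nf-core tooling

# List all local branches; the current one is marked with an asterisk.
git branch --list
```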
The next point in this talk is GitHub. GitHub, as you might guess from the name, is a hub for hosting git repositories: a cloud-based git repository hosting service. Some users refer to it as a social network for developers. You can share and collaborate on code, you can interact with code and also add documentation, plus a lot of other things that I won’t cover today. The basic functionality is very accessible and easy to work with. The basic GitHub account is free, and a pro account is free if you’re an academic user or covered by some academic licensing. The basic stuff is easy to learn, but the advanced stuff, like continuous integration services and the integration of Slack and other plugins, has a steeper learning curve. This will be partially covered in an upcoming bytesize talk.
For a basic overview of what GitHub looks like, log on!
Here’s my login now. You’ll see some recent activity in the middle, some recommendations such as repositories that you might be interested in, notifications and so on. It’s quite easy to understand; you can search for repositories or create new ones. So it’s not that difficult to get into, and again, the basic account is free. Most of the functionality is available only if you’re logged in, so you definitely need an account. One distinction that is crucial, and that some beginners seem to have issues with, is the one between personal and organisational accounts. apeltzer is my personal account, and our organisational account is nf-core, which hosts all of the code for the nf-core organisation. Multiple people with individual accounts can contribute to that organisation, and that’s exactly how we structured it in the past, to keep code that is developed within the scope of nf-core within nf-core and not within the private accounts of individuals.
Personal and organisational accounts can both have many repositories within them, so one is not limited to a single repository in either type of account.
If you visit the nf-core organisation on GitHub, you will see that we have 61 repositories, around 250 people, five teams and a couple of projects. Type into the “Find a repository” box and whatever you typed in will hopefully be found.
The basics of how you can actively contribute on GitHub follow an example workflow that I found on the web, called the fork and branch tutorial. It is based on a tutorial that someone else made, so I’m sharing the link here, and all the credit goes to that person. This workflow assumes that you have a GitHub account and that you’re a member of the nf-core organisation; otherwise certain steps of what I’m going to show now will not work well. It starts with forking a repository. Imagine you want to contribute to an existing pipeline, for example Sarek or Eager. You first have to make a copy of that repository, which later makes it possible to open pull requests. That is done using a so-called fork: you go to the web page of the pipeline that you’re interested in, and on the top right you will see the fork button. You click on it and then you have to specify where to fork the repository to. What people usually do is fork not to other organisation accounts that they might have access to, but to their own personal accounts.
I selected myself; things run for a couple of seconds, and then you see a link to the copy that was made. It is also listed in the main repository as a fork of the original nf-core repository, and you can start working on it.
After we have made that fork, we can make a local clone of it to work on the pipeline. We now have two versions of the repository, the nf-core one and our own, and we can work on adding bits here and there. For the first step, remember the basic Git introduction: we have to clone the repository, which means downloading it to our local machine. So what I do here is run git clone with the URL; it takes a couple of seconds depending on the pipeline size, and then you hopefully have a local copy of that pipeline available. That of course requires an installed version of Git. After that I can use an editor or whatever other tool I have at hand to change some things in the code, and then follow the same approach as before. I make local changes, then I add the changed files; for example, if I added a new document, a text file or an image, I can add it to the repository by using git add. I commit these changes with a nice little commit message. For example, if I add some images, a good commit message would be something like “I’ve added images that are used for documentation”; that is a message from which people can later understand what was done in this specific commit. And then I push these changes to my repository: I just do a git push, and this copies the changes back to my own repository on GitHub.
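The fork, clone, edit, add, commit and push sequence can be rehearsed locally. This is a sketch under stated assumptions: two local bare repositories stand in for the original nf-core repository and for the fork that the Fork button would create on GitHub, and the file path and identity are hypothetical.

```shell
set -eu
sandbox=$(mktemp -d)

# Hypothetical stand-in for the original repository, with one commit on dev.
git init --quiet "$sandbox/seed"
cd "$sandbox/seed"
git config user.name "Example User"        # placeholder identity for the sandbox
git config user.email "user@example.com"
git checkout --quiet -b dev
git commit --quiet --allow-empty -m "Initial pipeline commit"
git clone --quiet --bare "$sandbox/seed" "$sandbox/upstream.git"

# Stand-in for clicking "Fork" on GitHub: a separate copy under your own account.
git clone --quiet --bare "$sandbox/upstream.git" "$sandbox/fork.git"

# Clone the fork to the local machine and make a change.
git clone --quiet "$sandbox/fork.git" "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example User"
git config user.email "user@example.com"
mkdir -p docs/images
echo "placeholder for an image" > docs/images/workflow.txt
git add docs/images/workflow.txt
git commit --quiet -m "Add images used for documentation"

# Push the commit back to the fork.
git push --quiet origin dev

# Number of commits on dev in the fork and in the original repository.
git -C "$sandbox/fork.git" rev-list --count dev
git -C "$sandbox/upstream.git" rev-list --count dev
```

The two counts differ because the new commit only exists in the fork; getting it into the original repository is what the pull request is for.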
The good thing about that is that I can repeat these steps as often as I want and add as many changes as I want in individual steps. The next question is how I make the upstream project in nf-core aware that there are changes. How can I contribute them back to the main repository, to the main project itself? In nf-core this typically works via a pull request: I open a pull request, the open pull request undergoes code review, and someone then hopefully merges my changes into the main repository, so that my code or documentation ends up in nf-core. After forking, editing, committing and pushing the changes to our fork, we can now open a pull request (PR); that’s also some vocabulary that people often don’t understand in the beginning.
So we go to nf-core/test pipeline. In this case we click on pull requests, and then in this nice little interface we click on “compare across forks”, because otherwise our fork won’t come up here. We always have to open pull requests against the dev branch. That’s something you have to be aware of, otherwise you will get a notification from our internal checking script on GitHub telling you that you have incorrectly opened the pull request against the master branch. Then you can just click on “create pull request”, which typically also opens a little text editor where you can describe what you did in your pull request, so that people will be able to review it. That’s part of what we do to ensure that no breaking changes, and in fact nothing at all, end up in one of the nf-core repositories without proper review by at least a couple of people.
A good practice for keeping these branches apart from each other is that you should only have one conceptual change per branch. Say you found the pipeline to be very nice and working for your data, but you want to add a new feature or you found a bug, for example some output metrics are not the way you want them. Then the first step would be to discuss this with the core people in the Slack channel, or to make an issue on GitHub. After they agree that this is something that someone could contribute, you can work on it in a separate branch in your repository, and only on this one thing in that branch. Don’t work on multiple things in one branch, because that makes the review process extremely difficult. Reviews typically take time: people who look at your code first have to understand what you did there. They will probably have comments, which means that you have to edit your code and add changes. Separate branches also allow other people to work on other features at the same time as you. Especially for the bigger pipelines we typically have multiple branches and multiple pull requests open, with people working on separate things: a new mapping method might be worked on by one person, while somebody else is fixing a bug in some other step of the pipeline. If you have all of that in one branch, that’s problematic, and you especially don’t want to see that happening for bug-fix branches. These can be very tiny: for example, a typo in the documentation, which is typically a one-line change, plus a typo in the changelog that I fixed. That’s it, it doesn’t have to be much more than that!
So to summarise: small is good. The less complex, the better for the reviewers, which means that you also get feedback much quicker. It’s a good idea to keep pull requests small.
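A small bug-fix branch along these lines could look like the following sketch. The repository, the branch name fix-readme-typo and the file contents are hypothetical; the point is only that the branch carries a single, small change on top of dev.

```shell
set -eu
# Hypothetical sandbox repository with a dev branch containing a typo.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity for the sandbox
git config user.email "user@example.com"
git checkout --quiet -b dev
echo "Ths is the documentation." > README.md
git add README.md
git commit --quiet -m "Add README"

# One small, self-contained change gets its own branch, created from dev.
git checkout --quiet -b fix-readme-typo dev
echo "This is the documentation." > README.md
git commit --quiet -am "Fix typo in README"

# Back on dev, the fix is absent until the branch is merged via a pull request.
git checkout --quiet dev
git log --oneline dev..fix-readme-typo
```

The final command lists the commits that are on the fix branch but not on dev, which is what a reviewer would see in the pull request.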
Before going into best practices for code review, I should first talk about code reviews in general. Whenever somebody opens a pull request, we check the code against the rules that nf-core has specified: for example, whether it follows the same code style, whether the markdown documentation is in line with what we typically adhere to, and whether the Nextflow code is written in a way that is not known to produce any bugs that we are aware of. You can also look up the general rules in the guidelines that I just spoke about.
Pull requests to the dev branch, so to the experimental code, require one review by someone in the nf-core community, whereas pull requests to the master branch require two independent reviewers. That’s something a lot of people struggle with: they don’t know about these rules, although I think they are written down somewhere deep within these guidelines. There are also some nice how-tos on how code can be reviewed effectively.
An example that I’ve found particularly helpful is listed on the slides. It’s free, and most of the rules written in there are applied in the same way here in nf-core when we do code reviews.
One of the important steps in code review is to read things quite carefully. A lot of people write code at night and are very happy if others catch typos and things like that. So you need to read code and documentation carefully, and it’s very good to use certain features in GitHub.
For example, there is a new feature: whenever there’s a pull request open, you can click on “Files changed”, go through the files that have been changed in that pull request and click on “Insert a suggestion”.
Whenever you click on a line of the code, a text box opens in which you can suggest a change. The person who opened the pull request can later just click to accept that change.
For this typo here, for example, it would be very easy to type the fix into such a suggestion. People then don’t have to write their own code again or go back to the editor, because they can just click on the suggestion in the GitHub interface and amend things.
Another good idea for effective code review is to describe the motivation and the purpose of the changes you request. For example, if you have an idea or know of something particularly relevant, say an updated mapper for an alignment pipeline, that the person who opened the pull request might not know about, then it would be a good idea to write that down in your code review and tell them about it. That’s just one example among many; you should always describe why you’re giving the feedback you are.
It is a good idea to have a look at the tests. For example, we run certain linting tests, which check the code for a certain code structure, and checks for the markdown documentation, as you see here.
In the pull request that I opened yesterday, there is a failure in the markdown, so the checks that we have in place failed. You could click here on “Details” and find out what’s going on.
When you open a pull request, you can ask for code reviews. GitHub offers a nice feature on the right side where you can select who should review that pull request. Sometimes GitHub even makes suggestions, so suggested reviewers are pre-selected by GitHub. That is typically based on who contributed to that repository in the past, which is in many cases a good idea but not always perfect, because some of those people are inactive now or might not have the time to work on it. So if, for example, you select one of the main developers of that pipeline as a reviewer but nothing happens for two days, there’s also the possibility to ask in the request-review channel on Slack, where you can point to the pull request you just opened and request another review.
It is also a good idea to ask people with expertise on the pipeline. For example, if a couple of people have particular expertise in the part of the pipeline that your code touches, it might be a good idea to ask them to review it, because they might be able to evaluate it better. However, it’s always important to give beginners a chance too. Everyone in the nf-core community can review. It’s just a good idea to do that jointly with somebody who has experience of the particular pipeline, so it should maybe be a co-review in the beginning, until people feel confident that they understand the full pipeline.
There’s a sweet spot: having two to three reviewers, or maybe four, is fine, but if you have more than five reviewers, it can get very crowded and very messy, because there can be different opinions on certain topics as well. You might end up having a really hard time addressing all the changes that the reviewers have relayed back to you. So start with two to three, and if nothing happens, you can add more.
A more general thing that I think also applies to the nf-core community is to show respect and be nice to people. There might be people who have very little time to work on your review, so try to be nice. This also applies to beginners. Also apply some common sense when interacting with people, both when reviewing and when coding. Honestly, I also have to say: don’t do what in German is called a Gefälligkeitsgutachten. That means you should never do a review if you’re not entirely sure you can cover the entire piece of code or if you don’t have the time to work on it.
It’s a good idea to either do it properly or just leave it be. I have to live up to that standard as well myself in some cases.
I also wanted to point out our code of conduct, which also applies to the entire process.