Version Control (git)
Overview
Teaching: 15 min
Exercises: 15 minQuestions
How to keep track of changes in text files
Objectives
Learn the basic commands of git to version control text files
Version Control with Git
A version control system is a tool that keeps track of changes in a folder and all the files there in. That works by effectively creating different versions of our files. It allows us to decide which changes will be made to the next version (each record of these changes is called a commit), and keeps useful metadata about them. The complete history of commits for a particular project and their metadata make up a repository. Repositories can be kept in sync across different computers, facilitating collaboration among different people.
In scientific computing there are 3 typical scenarios where using a Version Control system is crucial in your effectiveness as researcher.
-
You write code, scripts and sometimes even data and you need to keep track on the changes on that code in case you need to go back and review for example, why a previous version was producing better results!.
-
You are writing a paper: Your paper will go into a spiral of reviews, by you, your collaborators, you PhD advisor, the referees. If you are using LateX for you paper, as you probably should do, it is easy to keep track of changes and keep your paper in sync across several computers.
-
You are writing your Thesis, same principle, your manuscript will have several cycles of review and corrections. You do not want to keep all your precious PhD Thesis on a single place and you do not want to get lost figuring out which version is the last version!.
Version control systems can help you with this, they are designed to keep track of changes on files, specially text files. Even if they work also for binary files like figures and plots, they are not able to detect the internal changes on those binary files, each version of the file, even if the change is small is an entirely different file.
We will go quickly to cover the basics of Version control using git, there is a complete lesson on Git on Software Carpentry Version Control with Git, consider to follow the lesson for a more detailed coverage about what git can do for you.
For the purpose of this hands-on, we will create and work entirely o an local repository. You can otherwise create an account on Github and get the ability to create public repositories for free or private repositories for a fee. WVU Research Computing offers to researchers private repositories for free via an special agreement with Github.
This are the first commands to prepare your environment for using with Github:
git config --global user.name "<Your Full Name>"
git config --global user.email "<username>@mix.wvu.edu"
{.source}
The default editor for commits is vim
and for the purpose of this introduction we will keep it like that.
Now, lets create a folder project_01
that we will convert into a git repository so Git can store versions of our files.
Move inside the folder (cd project_01
) and execute:
$ git init
If we use ls
to show the directory’s contents you will see no new file apparently.
$ ls
But if we add the -a
flag to show hidden files and folders,
we can see that Git has created a hidden folder within projexct_01
called .git
:
$ ls -a
. .. .git
Git uses this special sub-directory to store all the information about the project,
including all files and sub-directories located within the project’s directory.
If we ever delete the .git
sub-directory, we will lose the project’s history.
That folder is hidden for a good reason, in almost all circumstances, you are not suppose to manipulate the contents of the .git
folder directly.
We can check that everything is set up correctly by asking Git to tell us the status of our project:
$ git status
# On branch master
#
# Initial commit
#
nothing to commit (create/copy files and use "git add" to track)
We have now a repository, an empty one, lets start by adding some basic structure to it. Lets suppose that the project will include: the scripts that you use to create or manipulate the data, a curated set of data, the plots and the manuscript of your paper.
It is not a good idea to store the raw data, raw data is probably too big, we just need the data that will be actually used to create the plots.
A git repository is not indented to be a file backup. What we should keep in the repository is the code and text that allows to recreate the results from the raw data, if the raw data is lost (something that should never happen if you backup your data), you can reconstruct the results, by running all simulations again, using your scripts, producing equivalent plots and recreating your paper for submission.
Lets create the structure for project_01
$ mkdir data
$ mkdir scripts
$ mkdir plots
$ mkdir paper
$ touch data/README.md
$ touch scripts/README.md
$ touch plots/README.md
$ touch paper/README.md
With touch
we are creating some empty files, git does not allow us to add empty folders, those README.md
should contain instructions about the data contained on those folders, how the scripts, plots should operate of how the paper should be build. Research is a long time endeavor, do not assume that you will remember what you did several months ago. Now we are ready to version control those folders.
$ git add data paper plots scripts
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: data/README.md
new file: paper/README.md
new file: plots/README.md
new file: scripts/README.md
Now is time for our first commit:
$ git commit -m "My first commit"
[master (root-commit) 5b5a7ed] My first commit
4 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 data/README.md
create mode 100644 paper/README.md
create mode 100644 plots/README.md
create mode 100644 scripts/README.md
When we run git commit
, Git takes everything we have told it to save by using git add (right now empty files) and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is 5b5a7ed. Each identifier is different so you will see a different code.
Now we are ready to populate the repository with some files. On 1.IntroHPC/5.Git
there is a folder project with a tree structure that mimics the structure suggested above. Copy the files into your own repository, in particular copy the paper.tex
and Makefile
into folder paper
and Data Generator.py
into scripts. There is also a file Data Generator.ipynb
that is a IPython Notebook or Jupyter notebook.
Running the python script should produce both the data wave.dat
and the figure wave.pdf
in their respective folders. However, for the purpose of this lesson, the files are pre-generated so you can copy them and skip running the script.
$ git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
data/wave.dat
paper/Makefile
paper/paper.pdf
paper/paper.tex
plots/wave.pdf
scripts/Data_Generator.ipynb
scripts/Data_Generator.py
Git is telling you that there are a number of new files that are “Untracked”, it means, Git is not aware of having them Version Controlled. Git works better with text files, all the files above are text files except the pdf, the data file is not huge, those are basically the same points that we use for plotting. In real scenario your big datafiles should be out of version control. Lets add those files.
$ git add scripts paper plots data
$ git status
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
new file: data/wave.dat
new file: paper/Makefile
new file: paper/paper.pdf
new file: paper/paper.tex
new file: plots/wave.pdf
new file: scripts/Data_Generator.ipynb
new file: scripts/Data_Generator.py
We can commit those files and create a new commit point.
$ git commit "Adding scripts and paper"
[master 4aaa7ad] Adding scripts and paper
7 files changed, 811 insertions(+)
create mode 100644 data/wave.dat
create mode 100644 paper/Makefile
create mode 100644 paper/paper.pdf
create mode 100644 paper/paper.tex
create mode 100644 plots/wave.pdf
create mode 100644 scripts/Data_Generator.ipynb
create mode 100644 scripts/Data_Generator.py
The file paper.tex
is a LaTeX file that simulates a research paper. If you are not familiar with LaTeX all that you have to do is execute make
in the paper
folder and a PDF of the paper will be generated.
$ cd paper/
$ make
pdflatex paper.tex
This is pdfTeX, Version 3.14159265-2.6-1.40.19 (TeX Live 2018/MacPorts 2018.47642_1) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
(./paper.tex
LaTeX2e <2018-04-01> patch level 2
Babel <3.18> and hyphenation patterns for 46 language(s) loaded.
(/opt/local/share/texmf-texlive/tex/latex/base/article.cls
...(LONG OUTPUT)
Output written on paper.pdf (1 page, 63254 bytes).
Transcript written on paper.log.
Teaching LaTeX is out of scope here, at least on Spruce, the command for generating the paper should work, in your own computer, you need a LateX distribution see https://www.latex-project.org for LaTeX distros for your OS.
$ cd ..
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: paper/paper.pdf
Untracked files:
(use "git add <file>..." to include in what will be committed)
paper/paper.aux
paper/paper.log
no changes added to commit (use "git add" and/or "git commit -a")
From one side, paper.pdf
have changed, as binary file, even a small change such as different internal dates will create a different binary. In general it is not a good idea to keep track of those files. In any case we have the source paper.tex
and from there we can recreate the PDF. Notice also the paper.aux
and paper.log
files. They were generated during the process of compiling the paper. We can take now the decision of not keep track of those files. If you are using emacs and you are editing those files you will also see files like paper.tex~
and we do not want to keep track of those too.
The solution is editing the file .gitignore
and indicate git that we do not want to keep track of those files. Using your favorite editor create a .gitignore
file on the base folder project_01
*~
*.log
*.aux
*.pdf
$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: paper/paper.pdf
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
no changes added to commit (use "git add" and/or "git commit -a")
As we make the mistake of adding paper.pdf
before we need to remove it from the cache, we do not need and probably we do not want to remove it from your working folder.
$ git rm --cache paper/paper.pdf
rm 'paper/paper.pdf'
$ git status
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
deleted: paper/paper.pdf
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
$ git add .gitignore
$ git status
On branch master
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
new file: .gitignore
deleted: paper/paper.pdf
$ git commit -m "Adding .gitignore"
[master 0fcf4d1] Adding .gitignore
2 files changed, 4 insertions(+)
create mode 100644 .gitignore
delete mode 100644 paper/paper.pdf
Exercise: Using git
Edit the paper, just add some text on the places indicated with TO DO. After the file is edited, run make and add and commit the files with git
Change the script, if you do not know python, change for example np.sin() by np.cos(). That should be enough for changing the data file and the figure. Do the proper arrangements to version control the changes.
Working with remote Git Servers
When you use remote git servers, for example, github, gitlab or any other git server, two basic commands are added to the process:
git push
is most commonly used to publish an upload local changes to a central repository. You can accumulate local commits before a push.
git pull
command is used to fetch and download content from a remote repository and immediately update the local repository to match that content. Waiting too much before pulling the master repository increases the chances of conflicts specially when several individuals contribute.
Learning more about git
There are several resources available, we just explore the bare minimum you need to know to start using a version control system.
Software Carpentry offers an entire lesson on Git
Atlassian is a company that offers products for code collaboration. The have a very nice set of tutorials about git and their own service BitBucket.
Github is the most popular service for Git. The have several videos and tutorials called Github Guides
Key Points
Version control systems were created to keep track of changes in a set of files.
They work better with text files