Building Git in Elixir — Part 1 (Initialize Repo & Store blobs)

Meraj Molla
Published in ITNEXT
Mar 6, 2021 · 7 min read

I recently got hold of the book Building Git by James Coglan. It is a pretty thick book that covers, as the title says, building the Git version control system using the Ruby programming language. I thought it would be a fun exercise to implement Git in Elixir as I read the book. This article is the first in a series about my journey implementing Git in Elixir.

This is not an introductory Elixir lesson, so I am not going to focus on Elixir concepts or libraries; for those, please refer to [1] and [2]. For in-depth knowledge of Git internals, I suggest you read the book above and/or go through the docs [3].

Since this is only part 1 of the series, I will explain a bit of the theory behind what we are trying to build and then walk through the Elixir implementation.

A Bit of Theory

Git stores its repository information in the .git directory. On a fresh project, right after we run git init, the .git directory looks as below:

git repository structure

To give a brief overview —

  • HEAD: contains a reference to the current commit, either as a commit ID or as a symbolic reference to the current branch
  • config: contains configuration settings for this repository
  • description: contains the name (a short description) of the repository
  • hooks: contains scripts executed by various git commands
  • info: contains various metadata about the repository
  • objects: this is Git's database and contains all the content it tracks. In this article we will mostly focus on building and populating this directory
  • refs: stores various pointers into the .git/objects database. Most importantly, .git/refs/heads stores the latest commit on each local branch and .git/refs/tags stores tags

After committing a sample hello.txt file with content hello as below —

Repository structure looks as below with some new directories and files —

repository structure after initial commit

Here —

  • COMMIT_EDITMSG: contains the commit message given
  • index: contains binary data which is used to build the next commit
  • logs: contains various logs used by the reflog command
  • new objects/<sub-directories>: some new subdirectories are created, containing files with cryptic hashes as names. If we examine one of these with the ID printed by the git log command:

From commit ID we can see it matches one of the objects subdirectories —

We can use git cat-file -p to display information about this object from git database —

info about object

The output above refers to a tree with object ID aaa96ced2d9a1c8e72c56b253a0e2fe78393feb7.

We can examine this object using similar command —

tree representation

From the output we can see that this tree refers to another object, the blob with ID ce013625030ba8dba906f756967f9e9ca394464a, which represents the file hello.txt.

Examining this blob with the same command shows:

Now, to examine how Git stores objects on the filesystem, we can do:

The output shows the compressed content of the blob object as stored on the filesystem.

The book provides a handy command to inflate the compressed contents —

alias inflate='ruby -r zlib -e "STDOUT.write Zlib::Inflate.inflate(STDIN.read)"'

Using this command, we can see the plaintext content of this object:

hexdump command applied on inflated blob

As we can see, Git stores a blob by prepending its content with the word blob, a space, the length of the content in bytes, and a null byte, and then compressing the whole thing (header plus content) using zlib.
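Since the rest of this series is in Elixir, the same inspection can also be sketched with Erlang's :zlib module instead of the Ruby alias. This is only a minimal sketch and not code from the egit repository; the module and function names (InflateSketch, inflate/1, parse/1) are made up for illustration.

```elixir
defmodule InflateSketch do
  # Hypothetical helper module, not part of egit.

  # Inflate the zlib-compressed bytes of an object file.
  def inflate(compressed) when is_binary(compressed) do
    :zlib.uncompress(compressed)
  end

  # Split "<type> <size>\0<body>" into its three parts.
  def parse(object) when is_binary(object) do
    # :binary.split/2 splits on the first null byte only.
    [header, body] = :binary.split(object, <<0>>)
    [type, size] = String.split(header, " ")
    {type, String.to_integer(size), body}
  end
end

# Round trip on an in-memory object instead of a real file under .git/objects.
object = "blob 6" <> <<0>> <> "hello\n"

object
|> :zlib.compress()
|> InflateSketch.inflate()
|> InflateSketch.parse()
|> IO.inspect()
# prints {"blob", 6, "hello\n"}
```

For a real object, the bytes returned by File.read!/1 on the object file would be passed to inflate/1 instead of the in-memory binary.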

Focus of This Article

In this article, my focus is to start with nothing and build two commands:

  • git init: initializes .git with the objects and refs directories only
  • git commit: populates the objects database with the files found in the current working directory

I will not cover trees, commit history, or commit messages in this article; those will be the focus of future ones.

From now on, I will refer to this Elixir version by its executable name: egit

Elixir Code Walkthrough

The source code for this article is available from — https://github.com/imeraj/elixir_git

It contains a README file that explains how to build the executable and use the init and commit commands.

Parsing Command Line Arguments

cli.ex does command line arguments parsing —

Here,

  • parse_args (lines 12–17): parses command line arguments and creates tuples used to process the commands internally:

{:init, dir} → for egit init

{:commit, dir} → for egit commit

The egit init command can also take an optional directory as an argument. If nothing is passed, it creates the .git directory in the current working directory.

  • process (lines 51–57): processes the parsed command line arguments and invokes the appropriate module for init or commit
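The two functions described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of cli.ex: the module name CliSketch, the :help fallback, and the handlers map are all hypothetical, and the real code may well use OptionParser or call the Init and Commit modules directly.

```elixir
defmodule CliSketch do
  # Hypothetical module; the real cli.ex may be organized differently.

  # Turn raw arguments into the tuples described above.
  def parse_args(["init"]), do: {:init, File.cwd!()}
  def parse_args(["init", dir]), do: {:init, Path.expand(dir)}
  def parse_args(["commit"]), do: {:commit, File.cwd!()}
  # Assumption: anything else falls back to a usage message.
  def parse_args(_other), do: :help

  # Dispatch on the parsed command. The handlers are passed in as functions
  # so that this sketch stays self-contained.
  def process({:init, dir}, handlers), do: handlers.init.(dir)
  def process({:commit, dir}, handlers), do: handlers.commit.(dir)
  def process(:help, _handlers), do: IO.puts("usage: egit <init [dir] | commit>")
end

# Stand-in handlers that only report what they would do.
handlers = %{
  init: fn dir -> {:would_init, dir} end,
  commit: fn dir -> {:would_commit, dir} end
}

["init", "demo"]
|> CliSketch.parse_args()
|> CliSketch.process(handlers)
|> IO.inspect()
```

Matching on the argument list in function heads keeps each supported command on its own line, which makes it easy to add more commands later in the series.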

Helper Module

helpers.ex provides some helper functions —

Here —

  • git_path (lines 6–9): builds the path for the .git directory
  • db_path (lines 11–14): builds the path of the objects database
  • ls_r (lines 16–30): lists all files under the current working directory, including any subdirectories
  • generate_random_string (lines 33–38): generates a random alphanumeric string of a given length
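To make these helpers concrete, here is one way they could be written. The function names mirror the ones above, but the bodies, the arguments, and the decision to skip the .git directory while listing files are my assumptions for this sketch; helpers.ex in the repository may differ.

```elixir
defmodule HelpersSketch do
  # Hypothetical versions of the helpers; only the names come from the article.

  @alphanumerics Enum.concat([?a..?z, ?A..?Z, ?0..?9])

  def git_path(root), do: Path.join(root, ".git")
  def db_path(root), do: Path.join(git_path(root), "objects")

  # Recursively list regular files under `path`.
  # Assumption: the .git directory itself is skipped.
  def ls_r(path) do
    cond do
      File.regular?(path) ->
        [path]

      File.dir?(path) ->
        path
        |> File.ls!()
        |> Enum.reject(&(&1 == ".git"))
        |> Enum.flat_map(&ls_r(Path.join(path, &1)))

      true ->
        []
    end
  end

  # Build a string of `length` random characters from a-z, A-Z and 0-9.
  def generate_random_string(length) when is_integer(length) and length > 0 do
    for _ <- 1..length, into: "", do: <<Enum.random(@alphanumerics)>>
  end
end

IO.inspect(HelpersSketch.db_path("/tmp/demo"))
IO.inspect(HelpersSketch.generate_random_string(6))
```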

Implementing egit init

init.ex implements init command —

Here —

  • init (lines 8–12): gets called with the path where the .git directory should be created
  • make_dirs (lines 14–28): creates the directories under .git. For now it just creates .git/objects and .git/refs if .git does not already exist.
  • line 26: prints a message if the .git directory is initialized properly
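A rough sketch of this flow is shown below. The module name InitSketch, the return values, and the exact wording of the printed message are assumptions made for the example; only the overall behavior (create objects and refs unless .git already exists, then print a message) comes from the description above.

```elixir
defmodule InitSketch do
  # Hypothetical module; return values and message text are assumptions.
  @dirs ["objects", "refs"]

  # Create <root>/.git with its subdirectories unless it already exists.
  def init(root) do
    git_path = Path.join(Path.expand(root), ".git")

    if File.dir?(git_path) do
      {:error, :already_initialized}
    else
      make_dirs(git_path)
    end
  end

  defp make_dirs(git_path) do
    # mkdir_p! also creates .git itself as a parent directory.
    Enum.each(@dirs, fn dir -> File.mkdir_p!(Path.join(git_path, dir)) end)
    IO.puts("Initialized empty egit repository in #{git_path}")
    {:ok, git_path}
  end
end

# Try it in a scratch directory under the system temp directory.
scratch = Path.join(System.tmp_dir!(), "egit_init_demo")
IO.inspect(InitSketch.init(scratch))
```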

Implementing egit commit

commit.ex implements commit command —

Here,

  • commit (lines 8–20): implements the commit command. Right now it commits all files in the current directory and any subdirectories.

I will discuss what is going on here in more depth:

  • line 9: lists all files in the current directory, including any subdirectories.

It relies on the helper function ls_r() to do that.

  • lines 10–14: for each file obtained in the step above, it builds a BLOB struct from the file content and writes it to the objects database.

BLOB struct module (blob.ex) looks as below —

This struct has two fields: data and oid (I will discuss oid soon).
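A struct with these two fields, together with the loop that the commit command runs over the files, could look roughly like this. It is a sketch under assumptions: the module names BlobSketch and CommitSketch are hypothetical, and the store argument is a stand-in function used here so the example does not depend on the database module.

```elixir
defmodule BlobSketch do
  # data: raw file content; oid: object ID, filled in once the blob is stored.
  defstruct [:data, :oid]

  def new(data) when is_binary(data), do: %__MODULE__{data: data}
end

defmodule CommitSketch do
  # `paths` is the list of files to commit; `store` is any one-argument
  # function that takes a blob and returns its object ID.
  def commit(paths, store) when is_function(store, 1) do
    Enum.map(paths, fn path ->
      blob = path |> File.read!() |> BlobSketch.new()
      %{blob | oid: store.(blob)}
    end)
  end
end

# Demo with a fake store that derives a placeholder ID from the content size.
path = Path.join(System.tmp_dir!(), "egit_blob_demo.txt")
File.write!(path, "hello\n")

[path]
|> CommitSketch.commit(fn blob -> "fake-oid-#{byte_size(blob.data)}" end)
|> IO.inspect()
```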

database.ex provides the Database module and does most of the heavy lifting of populating objects directory —

Here —

  • line 10: builds the content with the structure: the word blob, a space, the length of the blob, and a null byte, followed by the plaintext file content as a string
  • line 11: builds the object ID by hashing the content with SHA-1
  • line 12: does the actual writing of the blob
  • line 19: builds object_path from db_path by joining the first two characters of the oid and the remaining characters. So we get a path like db_path/<2 chars>/<rest chars>, which matches the two-level layout Git uses in its objects database
  • line 21: creates a temporary file path using the generate_temp_name() function, which creates the filename in the same format actual Git does. The code first writes the blob to this temporary path and later renames it to the actual object_path (line 41).
  • lines 23–35: create the temporary file in read, write, and exclusive mode, so that opening it fails if the file already exists (in case our temporary filenames collide).
  • line 37: compresses the content
  • lines 38–39: write the compressed content and close the file
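The steps listed above can be sketched end to end as follows. This is my own minimal version for illustration, not the real database.ex: the module name DatabaseSketch, the function signatures, and the way the temporary suffix is generated are assumptions, and the file is opened here in write and exclusive mode only.

```elixir
defmodule DatabaseSketch do
  # Hypothetical module: a sketch of the steps listed above.
  # Uses :crypto and :zlib, which ship with Erlang/OTP.

  # Store `data` as a blob under `db_path` and return its object ID.
  def store(db_path, data) when is_binary(data) do
    # Header ("blob", space, byte length, null byte) followed by the content.
    content = "blob #{byte_size(data)}" <> <<0>> <> data
    oid = :crypto.hash(:sha, content) |> Base.encode16(case: :lower)
    :ok = write_object(db_path, oid, content)
    oid
  end

  defp write_object(db_path, oid, content) do
    # First two characters name the directory, the rest name the file.
    {dir, file} = String.split_at(oid, 2)
    object_dir = Path.join(db_path, dir)
    object_path = Path.join(object_dir, file)
    temp_path = Path.join(object_dir, "tmp_obj_" <> temp_suffix())

    File.mkdir_p!(object_dir)

    # :exclusive makes the open fail with {:error, :eexist} if the file exists.
    {:ok, device} = File.open(temp_path, [:write, :exclusive])
    IO.binwrite(device, :zlib.compress(content))
    :ok = File.close(device)

    # Rename only after the write has finished.
    File.rename(temp_path, object_path)
  end

  # Assumption: six random hex characters stand in for generate_temp_name().
  defp temp_suffix do
    :crypto.strong_rand_bytes(3) |> Base.encode16(case: :lower)
  end
end

db_path = Path.join(System.tmp_dir!(), "egit_db_demo")
IO.inspect(DatabaseSketch.store(db_path, "hello\n"))
```

If hello.txt was created with a trailing newline (as echo does), storing "hello\n" this way should give the same ID, ce013625030ba8dba906f756967f9e9ca394464a, that real Git reported for hello.txt earlier. Writing to a temporary file and renaming it afterwards means the final object path never holds a half-written file.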

Taking egit For a Spin

Now that egit init and commit are implemented, let's take egit for a spin and compare its database with real Git's database.

A sample session which initializes the repo and commits a file hello.txt with content hello is shown below:

a session with egit

Here —

  • I initialized .git at directory location egit
  • tree .git shows what’s under .git right after egit init
  • I created hello.txt with “hello” as content
  • tree .git shows what’s under .git after egit commit
  • finally, cat combined with inflate and hexdump shows the content of the blob in the objects database in plain text, which confirms that the data is stored in the same format real Git uses

Conclusion

In part 1 of this article series, I have implemented (very basic) egit init and egit commit commands from scratch and compared their output with that of the real Git version control system. Later articles in this series will keep adding and refining commands in Elixir as I progress through the book Building Git.

For more elaborate and in-depth technical posts in the future, please follow me here or on Twitter.

References

  1. https://elixir-lang.org/getting-started/introduction.html
  2. https://elixir-lang.org/docs.html
  3. https://git-scm.com/docs
  4. https://github.com/imeraj/elixir_git
