The Pains in Terraform Collaboration

Yi Lu · Published in ITNEXT · Dec 21, 2022

The snags that may stall your Terraform adoption and what to do


I divide Infrastructure as Code (IaC) into three categories. Markup languages like CloudFormation and ARM have a simple format, but the body of code sprawls enormously as more objects are lumped together. Domain-specific languages, such as Terraform’s HCL, feature flexible syntax and a mild dose of abstraction, creating a pleasant coding experience. Libraries that support general-purpose programming languages, such as AWS CDK and Pulumi, are extremely powerful yet require serious programming proficiency to tame the hyper-abstractions.

It is no surprise that Terraform, situated in the middle ground, has gained momentum. GitHub’s 2022 Octoverse report identifies HCL as the fastest-growing language. Along with the rising adoption comes a wave of teams suffering through arbitrary workflow setups. In this article I will articulate some pain points of collaborating in Terraform, and help you think through the planning of workflow tooling upfront.

From State to Workspace

Many of us were introduced to Terraform through a local run of “init-plan-apply”, a purposefully simplified scenario aiming to show the IaC magic in minutes. The init command pulls down modules and providers. The plan command projects changes. And the apply command commits the changes via the cloud service endpoint. The whole trilogy is a one-man show that plays on a single laptop. As we divide coding tasks among team members, the once-solid runbook starts to crumble. The first hurdle a team often stumbles over is state management.
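For readers who have not run the trilogy, here is a minimal sketch. The resource and names are illustrative; the random provider is used only to keep it self-contained, with no cloud credentials needed:

```hcl
# main.tf — a minimal example of the init-plan-apply trilogy.
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

resource "random_pet" "demo" {
  length = 2
}

output "pet_name" {
  value = random_pet.demo.id
}

# Local run, all on one laptop:
#   terraform init    # pulls down providers and modules
#   terraform plan    # projects changes against the (local) state
#   terraform apply   # commits the changes, recording them in terraform.tfstate
```

Note that apply writes a terraform.tfstate file next to the code; that file is exactly what becomes a problem once a second person needs it.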

In a Terraform workspace, a state file serves as a cache of the last known state of the cloud resources. By design, a state file falls out of date frequently and needs to be refreshed repeatedly. When sharing it remotely with a team, we have to handle race conditions, e.g. concurrent applies, or a plan during an apply. In addition, a state file contains sensitive data, so we must protect access to it. Indeed, the idea of using state was controversial from birth and has been challenged ever since. Eventually Terraform’s developers determined that state is a necessary evil, and summed up the reasoning here.

The three moving targets in Terraform operation

To manage states remotely, Terraform provides interfaces to a variety of storage backends. For example, on Azure, we can store the state file in blob storage. On AWS, we use S3 and DynamoDB together. Even with this wealth of options, self-managing state files still feels like a grind: we have to control access to state, and clear up the lock errors that arise from time to time. For peace of mind, consider a managed backend for remote state, and check out GitLab, Terraform Cloud, Scalr and Spacelift.
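A self-managed AWS backend looks roughly like the sketch below: S3 holds the state file, and a DynamoDB table provides the lock that prevents concurrent applies. Bucket and table names are hypothetical:

```hcl
# backend.tf — remote state on AWS (names are placeholders).
terraform {
  backend "s3" {
    bucket         = "my-team-tf-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-locks"   # table must have a "LockID" string partition key
    encrypt        = true
  }
}
```

Everything outside this block is on us: the bucket policy, state-file encryption keys, and manually releasing the lock (terraform force-unlock) when a crashed run leaves it behind.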

Managing state is just the first puzzle in operating a Terraform workflow. There are a few more problems. For code reusability, we often apply the same code base multiple times against different environments. In this case, one code repo is associated with multiple runs. As a result, we have to keep track of:

  1. one state for each run;
  2. the code revision (i.e. commit ID) that triggered each run;
  3. the environment-specific input variables used by each run.

Open-source Terraform keeps states in workspaces, which addresses the first problem. However, workspaces do not attempt to deal with the second and third problems.
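In practice, teams paper over the gap themselves. A common workaround keys environment-specific settings off the workspace name, as sketched below (environment names and values are hypothetical):

```hcl
# Workspaces give each run its own state, but per-environment inputs are on us.
locals {
  env = terraform.workspace   # "test", "prod", ...
  settings = {
    test = { instance_type = "t3.small" }
    prod = { instance_type = "m5.large" }
  }
  instance_type = local.settings[local.env].instance_type
}

# Typical commands:
#   terraform workspace new test     # creates the workspace and an empty state
#   terraform workspace select test
#   terraform apply                  # or: terraform apply -var-file="test.tfvars"
```

Nothing here records which commit produced each workspace's state, and any secrets in a -var-file still need separate protection; both remain the team's problem.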

Components of a multi-workspace deployment

For that reason, I regard the workspace feature in open-source Terraform as half-baked. It misses too much. I have seen teams using variable files to store per-workspace input variables. However, the input variables may contain secrets too. In addition, one more item to keep track of over time is whether each state remains consistent with the actual target resources (drift detection), which is also tricky.

The stateful nature of IaC

IaC manages the lifecycle of cloud resources. For example, we change attributes, upgrade database engines, or clean up resources by running terraform apply. The size of a workspace could range from a single database to hundreds of VMs. Once created, we often leave the resources alone for a long time. It is a major mistake to assume that the resources will remain in their original state forever. In reality, we should expect the resources to drift wildly into different states over time. The cause might be configuration drift, incomplete recovery from an outage, software bugs, or just disruptions of the cloud service. If IaC technology were perfectly idempotent, it would be built with the intelligence to cope with these outliers in its low-level logic. In reality, we should never assume such a utopia.

We declare desired states in IaC repos. With the apply action, the Terraform provider calls cloud SDKs to reconcile the target resources into alignment with the desired state. The dynamics and complexity of this process depend on the number of associated resources, their real-time status and their interdependencies. As a result, the ramification of managing running resources with IaC is that we cannot determine whether our code is good or bad until it has come through the apply stage, interacting with live resources.

The stateful nature of IaC

I refer to this characteristic as the stateful nature of IaC. The apply actions are the most unpredictable part of the workflow. This makes an IaC workflow fundamentally different from an SDLC workflow, where one builds the code and tests the artifact in isolation, prior to the deployment stage.

The stateful nature profoundly shapes the collaboration models available with Terraform, and the possible remedies. Let’s take a look at some specific annoyances.

The Iterative Journey

The stateful nature requires us to iterate a lot. Suppose we apply a new HCL commit to update three test workspaces first, and then three production workspaces. Given the stateful nature, we need to apply the new commit to all six workspaces with success before we are assured of the code quality, which can take several newer commits. During the trial-and-error iteration, be wary of the following pitfalls:

  1. One new commit may cause multiple types of errors on different workspaces;
  2. A commit to fix a problem with one workspace may fail on another workspace where the previous commit has already succeeded;
  3. Failures may change the target resource. The next pass may run into a different state on the same resource, and fail for a different reason;
  4. Changes to the production environment often require advance scheduling and approval;
  5. While one update has passed testing and is waiting to be applied to production, a higher-priority change may pop up and need to jump the queue.

All those scenarios are commonplace. If we keep a record of all workspace runs on each commit, it may look like the chart below:

The iterative journey of multiple workspace runs

The chart exemplifies an iterative journey of trial and error. At the top is a commit chain running left-to-right, with arrows pointing to newer commits. The table below it marks the result of each workspace run on each commit. If there’s a failure, we give it another shot with a new commit and start all over. It takes eight commits (with eight roller-coaster runs) until all workspace runs are successful, including four failures in production. The pain grows dramatically as the same source repo connects to more workspaces. For elusive errors in a sizeable footprint, the seemingly endless iterations quickly turn into a lengthy nightmare.

This iterative repetition of fail-and-commit until success was a challenge with SDLC workflows too. However, modern CI/CD pipelines empower us to shift failures to the left, thereby exposing code defects early in the pipeline. The premise of the shift-left capability is that building software is a stateless operation that can occur in isolation from the target environment. Unfortunately, when it comes to an IaC pipeline, the premise no longer holds true. Our ability to shift left (find as many errors as possible early on) is limited. Consequently, we have no choice but to iterate a lot.

Pull Request driven workflow

Given the stateful nature, the apply operation carries the most uncertainty. We’d better hold off commits to the main branch until we get past the apply risks. Some teams thereby turn to pull requests to drive apply activities. By raising a Pull Request (PR), the commit author proposes a change from a feature branch. The PR triggers automated actions to validate code quality, before interactive human approval. Upon approval, the PR merges the feature branch into the main branch. In the context of Terraform collaboration, a PR triggers plan and apply actions against multiple workspaces. If all are successful, the PR completes the code merge. While the PR-driven workflow is the right idea, it unfortunately entails even more nuances.

Competing plans

One action triggered by a PR is terraform plan. The output of plan is based on the most recent commit on the feature branch, and the current state.

When there are two feature branches, each with its own PR, the timing of the two plan actions causes different (and often unexpected) effects. If they happen at the same time, then the PR whose apply action completes first will invalidate the plan result of the other PR. Even if the second PR re-runs its plan action, its code, forked off an earlier commit on the main branch, still has no idea of the new resource.

A plan (at step 7) created from an outdated branch (6) with current state (5), unexpectedly projects to delete a recently added resource (from step 3 and 4)

For example, the original state starts with an EC2 instance. The first PR adds an RDS instance, and the second adds an S3 bucket. When the first PR is merged, it has created an RDS instance, leaving the current state with both an EC2 instance and an RDS instance. If the second PR runs terraform plan after the RDS addition, the code on its branch has no reference to the RDS instance. Therefore Terraform takes this as an intention to delete the RDS instance.
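In HCL terms, the second PR's branch looks like the sketch below (all resource names and arguments are illustrative placeholders):

```hcl
# What the second PR's branch contains, having forked before the RDS change merged.

resource "aws_instance" "app" {       # already in the shared state: unchanged
  ami           = "ami-12345678"
  instance_type = "t3.micro"
}

resource "aws_s3_bucket" "assets" {   # new in this branch: to be created
  bucket = "my-team-assets"
}

# The RDS instance created by the first PR exists in the shared state but has
# no block in this branch's code, so `terraform plan` here proposes roughly:
#   + aws_s3_bucket.assets   will be created
#   - aws_db_instance (...)  will be destroyed   <-- the unexpected deletion
```

Terraform is behaving correctly: it faithfully reconciles live state against whatever code it is given. The bug is in the workflow, not the tool.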

Atlantis brings two fixes. First, in the window between plan and apply, other attempts to run terraform plan are locked out; only one pending plan is allowed at a time. Second, just before any plan action, the PR should first sync the code in its branch from the main branch, so that terraform plan runs against the current desired state. When plan locking is applied too broadly, it may hold up velocity. To mitigate the interruption, Terrateam suggests putting conditions on the lock: a lock should be placed only on plans that involve changes, or during an apply action.

The merge-apply dilemma

The plan action occurs before interactive approval of the PR. After the approval, the PR has two more things to finish: merge the code from the feature branch into the main branch, and apply the commit to all workspaces. Deciding the order of these two activities is subtle. Neither way feels ideal at first glance:

  • merge-and-apply: if we merge the code into the main branch first, the new commit on the main branch will drive deployment. Given the stateful nature, we should assume a good chance of failure in the apply action. Therefore, we will have to go through the iterative journey along the main branch, via numerous PRs and merges. The main branch in this case is treated like a chatty scratchpad rather than a seriously guarded golden copy.
  • apply-and-merge: if we apply the commits from the feature branch first, we can take the iterative journey along the feature branch associated with the pull request, which keeps the main branch cleaner. The apply-and-merge approach handles the risks introduced by the stateful nature in an elegant way. However, awaiting us in the next step is the risk of merge conflicts. If merge problems occur, we’d have to go through the PR again, with unwanted apply actions.
The merge-apply dilemma

I call this the merge-apply dilemma in the PR-driven workflow. Atlantis proposes apply requirements as a fix. Because the PR has access to both branches, the merge result is much more predictable than the result of workspace runs (i.e. apply activities). The PR therefore first evaluates whether the merge will be successful, and kicks off the apply actions against the workspaces only on the condition that the merge is expected to succeed. If the workspace runs go smoothly, the merge gets completed afterwards. The apply-requirements feature minimizes merge risks and greatly improves version-control sanity.
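As a rough illustration, apply requirements are set in Atlantis's server-side repo configuration; the sketch below gates apply on the PR being approved and mergeable (the catch-all repo id is illustrative, and the exact keys may vary by Atlantis version, so check the docs for your release):

```yaml
# repos.yaml — Atlantis server-side repo config (sketch).
repos:
  - id: /.*/                     # apply to every repo this server handles
    apply_requirements: [approved, mergeable]
```

With "mergeable" in place, an `atlantis apply` comment is rejected whenever the branch would conflict with main, which is precisely the guard the merge-apply dilemma calls for.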

The two pain points above demonstrate how the PR-driven workflow is highly flexible but dangerous at the same time. Many small teams simply stay away from it. Instead, they use PRs just to review Terraform code, and keep them from performing apply actions.

Tools and Solutions

I was overloaded with choices of tools and solutions to facilitate Terraform collaboration. Tools usually target particular workflow problems, whereas workflow solutions aspire to deliver an end-to-end experience. As for tools, some teams might be tempted to extend the use of an existing automation pipeline (e.g. Jenkins, GitHub Actions) for Terraform collaboration.

There are many purpose-built extensions (GitHub, Azure DevOps) to facilitate Terraform installation and command invocation. However, as discussed, the real pain point with Terraform collaboration is the statefulness and its consequent issues. Automation pipelines fall short in this regard, despite their significant role in continuous integration in the SDLC. Their scripting capability can achieve virtually any programmable task, but it is no fun juggling numerous code paths to deal with state logistics and stateful resources. Check out this blog post from Spacelift to see what amount of effort this attempt may snowball into. In my experience, for multi-workspace deployments, we just cannot address those problems without making our pipelines awkwardly bloated.

Terragrunt is a well-known Terraform wrapper. It emerged to improve configurability and reusability in early versions of Terraform. However, Terraform’s own enhancements (coupled with pipelines) over the years have greatly improved code reusability. I would try to start new projects without Terragrunt.

For PR-based workflows, start with Atlantis and examine the customizations. Atlantis is an open-source Go application that requires users to self-host it on a VM or a Kubernetes cluster. For teams with PR-driven deployment and the capacity to manage state and configure the server, Atlantis has been a great choice since 2017. An alternative for PR-driven deployment is Terrateam, which offers a SaaS option. Just launched in 2022, it is still at an early stage.

Teams who need to focus on IaC coding may consider a workflow solution. These solutions are also known as Terraform Automation and Collaboration Software (TACOS). Notable players are Scalr, Terraform Cloud/Enterprise, Env0 and Spacelift. For the aforementioned multi-workspace deployment model, workflow solutions usually have their own enhanced implementations of workspaces (e.g. Scalr, Terraform Cloud) to close the gap of per-workspace input variables.

From the workflow perspective, these solutions usually favour a VCS-driven workflow. To drive deployment, users configure Git hooks on IaC repos. The solution then automatically starts plan actions on all workspaces as soon as a new commit lands on the monitored branch. Some of them allow users to plug their processes into existing pipelines or PR-driven workflows, using CLI tools and custom hooks.

Workflow solutions offer different hosting models and a suite of operations-oriented features. This review post of TACOS providers has more details.

Recommendation

The stateful nature makes an IaC workflow substantially different from SDLC processes. Prospective Terraform adopters should assess their technical capabilities and formulate well-thought-out workflow requirements.

For teamwork, I do not recommend an automation pipeline alone, because of the strenuous effort and excessive customization involved. If the team has strong Terraform expertise, Atlantis is the powerhouse for a customizable PR-driven workflow. For short-handed teams, spend the time to pick a full-blown workflow solution. No matter which path you take, thorough foresight into your future workflow can save your team from surprising pains in Terraform collaboration.
