Automatic for the people: Taking Terraform to task

Published in MPB Tech · Mar 30, 2023 · 6 min read

[Image: night-time lights on Earth seen from space, with the Terraform logo overlaid. Photo by NASA; logo by HashiCorp.]

By Mathew Robinson, Head of Site Reliability, Engineering and IT at MPB

You have to love automation. Quite apart from the whole basis-of-civilisation thing, today’s continuous deployment pipelines and containerised applications have transformed the business of software engineering.

All those digital spinning plates just do their thing, perfectly balanced on other spinning plates yet miraculously unbothered by whatever’s happening with their fellow virtual crockery items.

The whole process of getting decent code working and published feels less like a military campaign and more like a … coffee machine. But one that makes a decent flat white.

For site reliability engineers it beats the heck out of the hand-crafted artisanal servers of yesteryear.

Here’s the ‘but’.

Trying to do this kind of thing with a traditional set of tiered static test environments — Dev > Stage > Production, say — quickly gets sticky. You can’t code confidently unless the test environment you’re using reflects Production with 100% accuracy.

In a dynamic SOA (service-oriented architecture) environment, with dozens of coders working at once, what was unreliable now becomes impossible. And running a local copy of the environment on a laptop isn’t a solution, because by the time the job is complete other things will have changed.

Even containerisation solutions like Docker don’t fix this, because they don’t account for the nuances of network segmentation, databases, security, firewalls, etc.

For us at MPB, the logical solution was to create individual, ephemeral Feature Environments — virtual infrastructure that exactly replicates Production, but lives and dies with the code branch it was created for. To make Feature Environments, we use Terraform.

So that’s the answer? Just use Terraform?

If only. It’s a great tool, but it’s comparatively new and best practices are still emerging. You can do plenty of dumb things, and Terraform won’t necessarily warn you before you shoot yourself in the foot.

It’s very possible to create an unresolvable dependency situation, where everything works while you write the code and only fails when you try to create an environment from scratch.

It’s also possible to set things up so that your system becomes very inflexible indeed, in a way you thought you’d got away from.

So what I want to share here is what we’ve learned about using Terraform in a dynamic environment — how we implemented a fairly complex system and worked both with and around Terraform’s features and fails.

Terraform basics

Feel free to skip ahead if you know these things already …

Terraform is an Infrastructure as Code (IaC) tool that’s primarily designed for the provisioning stage of cloud deployments.

It’s declarative: you don’t tell it what to do, you tell it what your desired (virtual) infrastructure should look like. Terraform records what it has built in a JSON-based state file, compares that against your desired configuration and any pre-existing infrastructure, and makes only the changes needed.
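As a minimal illustration (all names here are hypothetical), a declarative definition of a single GCP VM looks like this:

```hcl
# Declare the desired state: one small VM in europe-west2.
# Terraform compares this against its state file and creates,
# updates or destroys resources until reality matches.
resource "google_compute_instance" "example" {
  name         = "example-vm"
  machine_type = "e2-small"
  zone         = "europe-west2-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```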

Handily, Terraform includes API translators (providers) for hundreds of cloud hosting services and provisioning tools. It provisions what you asked for — Kubernetes clusters, VM instances, databases, you name it — all with minimal human input.

Want to set up a slightly different configuration? Create workspaces and you can have multiple, separate state files for the same Terraform codebase.
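A workspace is created with `terraform workspace new feature-1234`, and inside the code the current workspace name is available as `terraform.workspace`, which makes per-environment naming easy. A small sketch (the bucket resource is hypothetical):

```hcl
# Each workspace keeps its own state file for the same codebase,
# so the same code can build many separate environments.
resource "google_storage_bucket" "assets" {
  name     = "assets-${terraform.workspace}" # e.g. assets-feature-1234
  location = "EU"
}
```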

Taming Terraform

Here, then, is the nitty gritty — my guidelines for keeping your infrastructure manageable and avoiding some pitfalls which await the unwary Terraformer.

Keep It Simple, SRE

It’s all too easy to start by trying to codify all your existing infrastructure into a single Terraform codebase. I’ve found that this quickly morphs into something that rigidly defines your set-up, suffers from configuration drift and becomes hard to replicate.

Instead, try thinking of your entire Terraform codebase as a set of interconnected but separate modular pipelines, each with a defined set of inputs that produce a defined output. These can then be combined into larger mix-and-match pipelines, which can save an awful lot of time.

Split Terraform into stages

Terraform (is it too early to start calling it TF?) is, at its heart, an automated build system which knows how to correctly resolve a whole lot of dependencies. But not all of them.

The first thing it does at runtime is to initialise its set of providers and query them for the current state. But sometimes a provider may rely on output from a source that hasn’t yet been provisioned. Awkward.

For example, say we want to put a Kubernetes cluster, containing a Kubernetes secret resource, onto GCP.

In this scenario, you have the following dependencies:

  • The cluster depends on the GCP provider (it needs to ‘know’ where to go)
  • The Kubernetes provider depends on the cluster resource (it needs to know which cluster to provision on to)
  • The secret resource depends on the Kubernetes provider, and by extension the cluster resource.

Everything works as it should … until you try to replicate this setup in a fresh environment. The Kubernetes provider appears on the scene first, can’t find any of the other resources, and promptly has a meltdown. The only way we found to resolve this was to split the Terraform codebase.
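Sketched in HCL (resource names are illustrative), the circular setup looks like this:

```hcl
data "google_client_config" "default" {}

# 1. The cluster depends on the GCP provider.
resource "google_container_cluster" "primary" {
  name     = "feature-cluster"
  location = "europe-west2"
}

# 2. The Kubernetes provider is configured from the cluster's
#    attributes, which don't exist yet in a fresh environment,
#    so provider initialisation fails on a from-scratch build.
provider "kubernetes" {
  host  = "https://${google_container_cluster.primary.endpoint}"
  token = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(
    google_container_cluster.primary.master_auth[0].cluster_ca_certificate
  )
}

# 3. The secret depends on the Kubernetes provider, and by
#    extension on the cluster resource.
resource "kubernetes_secret" "app" {
  metadata {
    name = "app-secret"
  }
  data = {
    api_key = "example" # placeholder
  }
}
```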

Here’s the sequence we arrived at.

Pipelines and stages

First, we create a single-stage bootstrap pipeline, containing dependencies such as access management, GCP projects, policies and project labels — anything which isn’t really part of the provisioning pipeline but is required.

The bootstrap pipeline is deployed using a workspace for each project, so any of its outputs can be used by downstream pipelines.

Once a project is built, the bootstrap pipeline triggers the provisioning pipeline, using the same workspace name (to keep the state file readable throughout the process). The provisioning pipeline does pretty much what it says on the tin. We set it up to run in stages, in the following order:

1. Infrastructure

2. Databases

3. Global cluster config settings

4. Any required Kubernetes custom resource definitions (CRDs)

5. MPB platform apps

6. Feature Environment bootstrap scripts (for example, loading in search results).
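In outline, a driver for those stages might look like the following shell sketch (the stage directory names are illustrative, and `workspace select -or-create` needs Terraform 1.4 or later):

```shell
#!/usr/bin/env bash
# Hypothetical driver: each stage is its own Terraform codebase,
# applied in order under one shared workspace name so downstream
# stages can read upstream outputs from remote state.
set -euo pipefail

WORKSPACE="$1" # e.g. feature-1234

for stage in infrastructure databases cluster-config crds platform-apps bootstrap-scripts; do
  terraform -chdir="stages/${stage}" init -input=false
  terraform -chdir="stages/${stage}" workspace select -or-create "${WORKSPACE}"
  terraform -chdir="stages/${stage}" apply -input=false -auto-approve
done
```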

We chose Jenkins to automate our pipelines, mainly because it was our first attempt at this kind of setup and it offered us a useful mix of observability, compatibility and customisability. I’m not sure what we’d use if, for some reason, we needed to do it all again — but Jenkins has served us pretty well.

Application versions

Deploying containerised apps to Kubernetes sounds like it should be a simple thing. But, in reality, they need to be wrapped in quite a lot of ‘stuff’. We use Helm charts to package everything up, and these charts need to be version controlled.

So how do we specify which version to deploy? As a variable in the tfvars file? That’s a slippery slope back to the bad old days: the more manual configuration required, the less automated the system becomes. Over time we’d also end up with configuration drift, since every deployment would be hand-cranked by updating a text file.

Our solution was to use optional variables and add the following logic in Terraform:

  • If the variable service_name_version is specified in tfvars, use that version.
  • If not, use the service_name_version specified in the current project’s state file.
  • If neither of the above is true, fetch the service_name_version value from the Staging state file.
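That fallback chain can be expressed with optional variables, `terraform_remote_state` and `coalesce()`. A sketch under our assumptions (a GCS state bucket with a hypothetical name, workspaces named per project, and a `service_name_version` output written by each apply):

```hcl
variable "service_name_version" {
  type    = string
  default = null # optional: unset means "fall back"
}

# This project's own most recently stored state.
data "terraform_remote_state" "current" {
  backend   = "gcs"
  workspace = terraform.workspace
  config = {
    bucket = "tf-state" # hypothetical state bucket
    prefix = "provisioning"
  }
}

# The Staging state, used as the default of last resort.
data "terraform_remote_state" "staging" {
  backend   = "gcs"
  workspace = "staging"
  config = {
    bucket = "tf-state"
    prefix = "provisioning"
  }
}

locals {
  # First non-null value wins: tfvars, then this project's own
  # state, then the Staging state file.
  service_version = coalesce(
    var.service_name_version,
    try(data.terraform_remote_state.current.outputs.service_name_version, null),
    data.terraform_remote_state.staging.outputs.service_name_version
  )
}
```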

Incidentally, it’s a good idea to have a couple of rules in place for variablising sanely in Terraform.

  • Use as few variables as possible
  • Prefer Booleans over integers, floats, etc.
  • Default to the version on your lowest-tier environment (as above).

Final thoughts

I’ve gone into a fair amount of detail here and it’s very possible you have other things to get on with, so I wouldn’t want to leave you without a tl;dr.

Here, then, are the key takeaways I wish I’d had. If you do nothing else before setting up your first Terraform codebase, a glance at the following points just might help you save a bunch of time:

  • Think of your Terraform code as a pipeline, which takes inputs and produces outputs
  • Think of your environments as a single output of these pipelines
  • Never use a single Terraform codebase to describe multiple environments at once. Instead, use Terraform codebases as a template (requiring inputs) for building an environment.
  • Use Terraform workspaces with remote state files.
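That last point can be as small as a single backend block (bucket name hypothetical):

```hcl
# Remote state in GCS: each workspace's state file is stored
# separately under the prefix, so one codebase acts as a
# template for many environments.
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "env"
  }
}
```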

Useful links

Mathew Robinson is Head of Site Reliability, Engineering and IT at MPB, the world’s largest platform for buying, selling and trading used photography and videography kit. https://www.mpb.com

The official blog for the MPB Product & Engineering team