
A Guide to Terraform for Data Scientists

As a data scientist leveling up from Jupyter notebooks, seeking to put your models into production, you sooner or later start using cloud compute services. After all, you can rent all manner of ready-made services for much less money and hassle than you’d need to build and maintain them yourself.

You might start out like I did, with a free-tier AWS account and a Sagemaker Notebook (it runs Jupyter!). It’s the same familiar data science environment you know and love, but running on an EC2 instance with way more RAM than your puny old laptop.

You soon start building more complex services. You want your model to process new data whenever it’s available, but you don’t want to pay to keep the model running idle all of the time. Since we now live in the brave new world of Function as a Service, you decide to build a serverless application on Amazon Web Services (AWS). Maybe something like the architecture Ritchie Vink describes in his blog post on how to deploy any machine learning model serverless in AWS.

Pretty neat! The problem is, by now you’ve started juggling AWS services and you’re losing track of how they were built. You chose your settings based on 10 different guides scattered throughout the internet. Did you write down everything you did? Could you build it again if you had to?

Creating these services by hand, pointing and clicking through the AWS graphical user interface, is a good way to get started but a terrible way to scale up. You want code to define everything about your cloud infrastructure. Code that you can share with your team and track on Github. Code that you can reuse for other projects. Code that you can run on a schedule, to tear the services down at night and spin them right back up in the morning. That’s when you commit to stop building artisanal cloud services and start writing infrastructure as code.

Who This Guide to Terraform is For

Here at Very, we use Terraform (and also the Serverless Framework) to build and keep track of our clients’ AWS services. I wrote this guide for the “me” of a few months ago, for data scientists new to the concept of infrastructure as code or just not familiar with how it’s implemented in Terraform.

You could always just go to the official TF Getting Started page, of course, but even with that resource I spent a long time thoroughly confused by the platform. That’s why I wrote this post: it covers the parts I actually find myself using day-to-day, combined with links to the documentation I kept going back to and the mental models that finally made the system clear to me.

I hope this helps! For me, at least, learning a new platform directly from the documentation was always a bit like learning German by reading a dictionary. Mein Gott!

How to Think About Terraform

I like HashiCorp’s high-level description of what the system is for. If you’re building an application with commercially available components, TF is what you use to describe, build, and manage that infrastructure. The components are called resources, and most of your time will be spent defining these resources in collections of .tf configuration files. These collections are called modules. When you’re done writing your whole TF configuration, you’ll use the TF command-line interface (CLI) to verify that the configuration is valid, automatically determine what needs to happen in order to build it, execute that build, and later tear it all down.

You’ll write the configuration files using TF’s own configuration language. It is a declarative language, so it just states what the resources are and how they are connected. The language does not state any actions or procedures. In fact, you could write the whole configuration in one giant .tf file, with all the resources in random order.

It’s much better, though, to use multiple configuration files and a hierarchy of modules. TF will consider each folder to be a different module. All .tf files in that folder will be lumped together, sharing a namespace of resources and variables. One good practice is to use separate files for a module’s variables and outputs, so you know where to look for them.
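For example, a small project might be laid out like this (a hypothetical layout, anticipating the app-cluster child module used below):

.
├── main.tf          # root module: provider, module calls
├── variables.tf     # root module inputs
├── outputs.tf       # root module outputs
└── app-cluster/     # a child module
    ├── main.tf
    ├── variables.tf
    └── outputs.tf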

It’s useful to think of modules in analogy to functions. Functions have three sorts of variables: input, output, and internal. In a TF module, the analogous structures are called variables, outputs, and locals. You declare all of these in the module’s .tf files.

How to Write Configuration Files in Terraform

Put a main.tf file in the root folder of your repo. Later, when you use the CLI, run commands from this folder. This file will be your root module, and is all you need for a simple TF configuration. If you add more .tf files to this folder, they will all contribute to the root module. To add more modules to your configuration, add more sub-folders to your directory and put .tf files in them. In order to connect child modules to the root module, call them with a module block in one of the .tf files in the root module:

module "servers" {
  source      = "./app-cluster"
  num_servers = 5
}

When calling the child module, you give it a new local name to be used within the parent namespace (in this case, servers). You declare the module’s folder path with the argument source, and give values to any variables that are declared in the child (num_servers).
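For this call to work, the child module has to declare the variable on its side. A minimal sketch of what ./app-cluster might contain (the description text is mine, not from any real module):

# ./app-cluster/variables.tf
variable "num_servers" {
  type        = number
  description = "Number of EC2 instances to launch"
}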

Variables are declared within a module like this:

variable "availability_zone_names" {  type = list(string)  default = ["us-west-1a"] } 

For the root module, the values of those variables get set at runtime. For any child modules, variables must have a default value or get a value when the child module is called. If you fail to give values to all the variables declared in a child, TF will throw errors when you run terraform validate.
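Those runtime values can come from a -var flag on the command line (for example, terraform apply -var 'availability_zone_names=["us-west-1a"]') or from a terraform.tfvars file in the root folder, which TF loads automatically. A minimal sketch using the variable declared above:

# terraform.tfvars
availability_zone_names = ["us-west-1a", "us-west-1b"]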

Outputs are declared like this:

output "instance_ip_addr" {  value = aws_instance.server.private_ip } 

These outputs can be (optionally) accessed by any parent modules that call this child, using the syntax module.<NAME>.<OUTPUT>. Only values declared as outputs are accessible from outside the module.
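For example, if the servers module called earlier declared the instance_ip_addr output above, the root module could pass it along as one of its own outputs (a sketch reusing names from the earlier snippets):

# In a root-module .tf file
output "server_private_ip" {
  value = module.servers.instance_ip_addr
}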

Finally, locals allow you to name arbitrary expressions that you find yourself using a lot:

locals {
  service_name = "forum"
  owner        = "Community Team"
}

Access them within the module using the syntax local.<NAME>.
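For example, you could reuse those locals as resource tags (a sketch; the instance arguments are placeholders):

resource "aws_instance" "forum_server" {
  ami           = "ami-a1b2c3d4"
  instance_type = "t2.micro"

  # Reuse the named expressions declared in the locals block
  tags = {
    Service = local.service_name
    Owner   = local.owner
  }
}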

Somewhere in your root module, you will declare a provider, such as AWS or Google Cloud Platform. This tells TF what sort of resources you’ll be building:

provider "aws" {  version = "~> 2.22" } 

Your choice of provider becomes important when you start declaring the most complex part of your configuration, the resources:

resource "aws_instance" "web" {  ami   = "ami-a1b2c3d4"  instance_type = "t2.micro" } 

Each resource is a unit of infrastructure, like an S3 bucket or an SQS queue. A resource block declares a resource of a particular type (aws_instance) with an arbitrary name issued by you (web). The types of resources you can choose from depend on your provider. If you’re using AWS, then refer to the AWS provider documentation for the very long list of resources available (use the much-too-discreet filter button on the top left to narrow down the list). For example, these are the resource types available for S3:

aws_s3_account_public_access_block
aws_s3_bucket
aws_s3_bucket_inventory
aws_s3_bucket_metric
aws_s3_bucket_notification
aws_s3_bucket_object
aws_s3_bucket_policy
aws_s3_bucket_public_access_block
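For example, a minimal bucket declaration looks like this (a sketch; S3 bucket names are globally unique, so the name here is hypothetical):

resource "aws_s3_bucket" "model_artifacts" {
  bucket = "my-team-model-artifacts" # hypothetical, must be globally unique
  acl    = "private"
}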

Each resource type takes certain arguments when you declare it, some optional and some required. A declared resource has many attributes, which you can reference with the syntax <TYPE>.<NAME>.<ATTRIBUTE>.

All the arguments become attributes, and most resource types have many more read-only attributes.
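For example, the aws_instance.web resource declared above exposes read-only attributes such as public_ip, which you could surface as an output:

# Reference a read-only attribute of the resource declared above
output "web_public_ip" {
  value = aws_instance.web.public_ip
}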

You’ll also notice data sources, which are similar to resources but don’t create infrastructure. They just return information about something. The AWS provider documentation lists many possible types of data sources, alongside resources. Data sources are referenced as data.<TYPE>.<NAME>.<ATTRIBUTE>, and declared like this:

data "aws_s3_bucket_object" "bootstrap_script" {
  bucket = "ourcorp-deploy-config"
  key    = "ec2-bootstrap-script.sh"
}

Data source types available for S3:

aws_canonical_user_id
aws_s3_bucket
aws_s3_bucket_object
aws_s3_bucket_objects
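Once declared, a data source’s attributes can feed into other resources. For example, you might inject the fetched script as EC2 user data (a sketch building on the bootstrap_script snippet above):

resource "aws_instance" "bootstrapped" {
  ami           = "ami-a1b2c3d4"
  instance_type = "t2.micro"

  # body holds the object's contents (for text-like objects)
  user_data = data.aws_s3_bucket_object.bootstrap_script.body
}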

When reading a TF project, one of the hardest parts is tracing all the dependencies. Declaring a resource means giving values to many arguments, and existing resources have many attributes that you can reference elsewhere. Once you’re familiar with the basic TF syntax, you’ll spend most of your time referring back to the AWS provider documentation to remember what arguments each resource takes, and which attributes you could be referencing elsewhere.

When declaring the values for these arguments, it will frequently be useful to build TF expressions, such as string interpolations, conditional evaluations, a few built-in functions, etc. The most common syntax will be the string interpolation, ${ … }, which evaluates the expression between the curly brackets, converts the result to a string if necessary, and then inserts it into the final string.
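For example, you might compose a bucket name from a variable (a sketch; the project variable is hypothetical):

variable "project" {
  type    = string
  default = "churn-model"
}

resource "aws_s3_bucket" "raw_data" {
  # Evaluates to "churn-model-raw-data" with the default value
  bucket = "${var.project}-raw-data"
}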

One final part of your root module to be aware of is the Terraform settings block. This configures important internals, such as which TF version to use or where to store the configuration state. Here’s an example that stores the state in an S3 bucket:

terraform {
  backend "s3" {
    bucket = "mybucket"
    key    = "path/to/my/key"
    region = "us-east-1"
  }
}

If you’d like to see an example of how this all fits together, check out the official library of Terraform examples.

How to Use the Terraform CLI

First, you’ll want to brew install terraform so that it’s available globally on your computer, and set up your AWS credentials. Once you’ve done that and navigated to the root folder, these are the most important commands you’ll need from the Terraform CLI:

terraform workspace list | (new | select <NAME>)
terraform init
terraform validate
terraform plan
terraform apply
terraform destroy

First, you’ll want to set up or choose a TF workspace. These are similar to Python environments, and were originally called that before the word “environment” was deemed to be too overloaded. You can see the current list with terraform workspace list, make a new workspace with terraform workspace new <NAME>, and choose one with terraform workspace select <NAME>. You can use workspaces to maintain several independent instances of the same infrastructure, most likely one for development and one for production.

It’s common practice to incorporate the name of the current workspace into various resources. You can access the current workspace’s name using the interpolation sequence ${terraform.workspace}.
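For example, suffixing a bucket name with the workspace keeps your dev and prod resources apart (a sketch; the name is hypothetical):

resource "aws_s3_bucket" "models" {
  # "mycompany-models-dev" in the dev workspace, and so on
  bucket = "mycompany-models-${terraform.workspace}"
}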

After first writing your configuration files, and after adding/deleting any modules, run terraform init to initialize the configuration, then terraform validate to make sure all the variables have been assigned and all the dependencies connected.

When you’re ready to build something, run terraform plan to create a long description of what changes it will make on AWS. Finally, and only after reviewing all the changes, run terraform apply to make it happen. When the work is done, you can run terraform destroy and know that TF will remove all your infrastructure.

Closing Thoughts for Data Scientists Using Terraform

That should be enough to get you started along the path towards more reproducible infrastructure. As with everything in reproducible data science, the theme is to only build things by hand once, when sketching them out. Your pipelines should generally be automated as soon as possible, and infrastructure as code is a great way to reach that goal. Happy coding!

If you want to learn more about how our data science capabilities can improve your next project, speak to a team member today, or learn more about our Machine Learning services.