EDIT: An updated and more accurate diagram can be found in the project’s repository. The repository also contains further technical implementation details.
I have wanted to learn more about GitOps and found the perfect opportunity to do that with a hands-on approach at the GitOps for Terraform MiniCamp. In this post, I'll walk you through the technical requirements of the project (what tools, resources, and setup are needed to implement a fully functional GitOps pipeline that deploys AWS infrastructure with Terraform) before delving into the implementation details in future posts.
In this project, I will create a GitHub Actions CI/CD pipeline that will run through certain jobs and deploy the AWS infrastructure. I have summarized the steps of the pipeline and the way it will be triggered and connected to other services in the diagram below. This is, of course, my current understanding based on the requirements, and it might be that when I start the actual implementation, I will notice some parts are not accurate and this diagram will be modified.
Backend resources
The first step in the project will be to create the backend infrastructure using CloudFormation. This will be done separately, not as part of the Terraform infrastructure, because it will store the Terraform state file itself. There will be a DynamoDB table keyed on a single attribute, LockID, which is used to make sure that several workflows can't make changes to the Terraform infrastructure simultaneously, which would lead to conflicts. While our workflow is modifying the infrastructure, it holds a lock item in the DynamoDB table, which prevents any other workflow from making changes until we have finished. The actual Terraform state file will be stored in an S3 bucket.
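As a sketch of how Terraform would point at this backend (the bucket name, state key, region, and table name below are placeholders, not the project's actual values):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-gitops-terraform-state"        # hypothetical bucket name
    key            = "gitops-minicamp/terraform.tfstate" # hypothetical state key
    region         = "eu-west-1"                         # hypothetical region
    dynamodb_table = "terraform-state-lock"              # table whose hash key is LockID
    encrypt        = true
  }
}
```

With this in place, `terraform init` configures the backend, and every `plan`/`apply` acquires the DynamoDB lock automatically before touching the state.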
Additionally, the OIDC role that will be needed in the later stages to access AWS will be created with CloudFormation. All of these are resources that will remain unchanged during later work and will lay the foundation for this project.
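A minimal sketch of what that CloudFormation could look like, assuming a GitHub OIDC identity provider plus a role scoped to a single repository (the org/repo value is a placeholder):

```yaml
# Hedged sketch: OIDC provider and an assumable role for GitHub Actions.
Resources:
  GitHubOidcProvider:
    Type: AWS::IAM::OIDCProvider
    Properties:
      Url: https://token.actions.githubusercontent.com
      ClientIdList:
        - sts.amazonaws.com
  GitHubActionsRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Federated: !GetAtt GitHubOidcProvider.Arn
            Action: sts:AssumeRoleWithWebIdentity
            Condition:
              StringLike:
                # Placeholder repo; restricts which workflows may assume the role
                token.actions.githubusercontent.com:sub: repo:my-org/my-repo:*
```

The `sub` condition is what keeps the trust narrow: only workflows from the named repository can exchange their OIDC token for credentials.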
GitHub Actions workflow
The first step I want to automate is one that should run even before any code changes can be committed: a pre-commit hook. The hook will run terraform fmt, whose main purpose is to ensure that the Terraform code is formatted properly according to the style guidelines.
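One common way to wire this up is the pre-commit framework with the community pre-commit-terraform hooks (an assumption on my part; a plain Git hook script would work just as well):

```yaml
# .pre-commit-config.yaml — sketch; pin rev to a current release tag
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1
    hooks:
      - id: terraform_fmt
```

After `pre-commit install`, every `git commit` runs terraform fmt and blocks the commit until the formatting is clean.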
The next step will be to push my code to a feature branch in my GitHub repository. The main branch will be protected, and the correct process will be to always push the code to a feature branch first, which will then trigger the GitHub Actions workflow. The workflow starts running through several different steps. These will include:

- TFLint (analyzes the Terraform code for best practices, syntax issues, and possible errors)
- terraform fmt (the same check as the pre-commit hook, just to make sure)
- terraform validate (makes sure that the Terraform configuration is valid according to the Terraform syntax)
- terraform plan (previews the changes that would be made to the infrastructure)
- Infracost
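The checks above could be wired up roughly like this (action names and versions are my assumptions, and the plan step would additionally need backend access and AWS credentials):

```yaml
# Hedged sketch of the validation job triggered by feature-branch pushes.
name: terraform-checks
on:
  push:
    branches-ignore: [main]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: terraform-linters/setup-tflint@v4
      - run: tflint --init && tflint
      - run: terraform fmt -check -recursive
      - run: terraform init -backend=false
      - run: terraform validate
      # terraform plan needs credentials and the real backend; shown for completeness
      - run: terraform plan -input=false
```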
In the last step, a tool called Infracost is used to estimate the cost of the Terraform infrastructure. At this point, a tool called Open Policy Agent (OPA) will also be integrated into the workflow. The idea is to enforce a policy that fails the workflow if the estimated cost of the infrastructure exceeds a certain threshold. This is a way of automatically enforcing cost controls in the Terraform workflow.
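A minimal sketch of what such an OPA policy could look like, assuming Infracost's JSON breakdown is passed in as the input document and using a hypothetical monthly budget of 10 USD:

```rego
package terraform.cost

# Deny if Infracost's estimated total monthly cost (a string in its
# JSON output) exceeds the hypothetical 10 USD threshold.
deny[msg] {
  cost := to_number(input.totalMonthlyCost)
  cost > 10
  msg := sprintf("Estimated monthly cost %.2f USD exceeds the 10 USD budget", [cost])
}
```

The workflow would then run the Infracost output through `opa eval` (or the conftest wrapper) and fail whenever the `deny` set is non-empty.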
Dispatch step
We want to make sure that there is some kind of human interaction before the actual resources are deployed on AWS, and there are several ways of doing that. The easiest will be to add a dispatch step, which ensures that the workflow requires manual approval before moving on to the deployment.
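One built-in way to get such a gate, as an alternative to a separately dispatched workflow, is a GitHub environment configured with required reviewers (the environment name here is a placeholder):

```yaml
jobs:
  apply:
    runs-on: ubuntu-latest
    # 'production' is a hypothetical environment set up in the repository
    # settings with required reviewers, so this job pauses until approved.
    environment: production
    steps:
      - run: terraform apply -auto-approve tfplan
```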
When the approval for deployment has been given, the workflow will need a way of authenticating itself to access your AWS account to provision resources. The most secure way of doing this is using OIDC (OpenID Connect). OIDC will utilize the IAM role that has already been created using CloudFormation. GitHub Actions will assume that IAM role, which will give it short-lived credentials to access AWS. These credentials will be destroyed afterwards and cannot be reused.
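In a workflow, that token exchange is typically handled by AWS's official credentials action; the role ARN and region below are placeholders:

```yaml
permissions:
  id-token: write   # required so the job can request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Placeholder ARN for the role created earlier with CloudFormation
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
          aws-region: eu-west-1
```

No long-lived access keys are stored as secrets; the action exchanges the job's OIDC token for temporary credentials via sts:AssumeRoleWithWebIdentity.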
Deployment
Once the workflow has access to AWS, Terraform will first acquire the state lock, which fails if another workflow already holds a LockID in the DynamoDB table. It will then move on to provision the resources that have been defined in the Terraform code; in this project, that would be a simple EC2 instance. In addition, some resources can be provisioned to monitor our infrastructure. AWS Config and EventBridge can be set up to run scheduled drift detection, which would notify us if the deployed infrastructure no longer matches our Terraform state file. Other services, such as Lambda and scheduled EventBridge rules (formerly CloudWatch Events), could be set up to run scheduled port accessibility checks for the EC2 instance.
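As a lighter-weight sketch of the drift-detection idea, a scheduled workflow can also detect drift directly with terraform plan's -detailed-exitcode flag (exit code 0 means no changes, 2 means the state and the real infrastructure have diverged); the cron, role ARN, and region are placeholders:

```yaml
# Hedged sketch: daily drift check inside GitHub Actions.
name: drift-check
on:
  schedule:
    - cron: "0 6 * * *"   # hypothetical schedule: daily at 06:00 UTC
jobs:
  drift:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-role  # placeholder
          aws-region: eu-west-1
      - run: terraform init
      # Fails the job (exit code 2) whenever drift is detected
      - run: terraform plan -detailed-exitcode -input=false
```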
After the required changes to the infrastructure have been made, the new state will be stored in the state file in the S3 bucket and the DynamoDB state lock will be released.
After successful deployment, it is time to merge the feature branch code into the main branch. In the approach we are taking, the infrastructure is the 'source of truth', meaning that the deployment happens first, and until the merge, the main branch will actually contain code that no longer matches the current infrastructure. The alternative approach would be to merge the code into the main branch first, which would keep the main branch strictly as the 'source of truth'. This would come with challenges, such as having to roll back code if there turns out to be an issue with the deployment, and for this reason, validating everything before merging is often the preferred workflow.
Next steps
There are several 'bonus challenges' that I would like to add to the project as soon as I have implemented the parts described above. One of them would be to deploy to multiple environments (stage, prod). It would also be very useful to automatically open an issue in the repository if the scheduled check notices that the infrastructure has drifted. Another nice addition would be to configure the GitHub Actions workflow to ignore non-Terraform changes, meaning the workflow would only start running if there have been changes to the Terraform code. There might be other extension ideas once I start working through the implementation.
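Ignoring non-Terraform changes is mostly a matter of a paths filter on the workflow trigger; for instance (file patterns are assumptions about the repo layout):

```yaml
on:
  push:
    paths:
      - '**.tf'
      - '**.tfvars'
```

With this filter, a push that only touches documentation or diagrams does not start the pipeline at all.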