Using Open Policy Agent (OPA) to Validate Terraform Security Standards

Building an OPA Framework for validating Terraform builds across cloud providers

Learning Open Policy Agent was a bit of a pain for me. Here, I'd like to share some tips for using Open Policy Agent to validate Terraform builds.

This is not an Open Policy Agent 101 tutorial; if you'd like that kind of content, I suggest looking at their documentation first. However, if you're technically savvy, you can certainly follow along here: this is less a tutorial and more a technical overview with examples, intended to help with implementation strategy.

Motivation

Hopefully, when you are deploying infrastructure to any cloud provider, you are using vendor-agnostic Infrastructure as Code technologies such as Terraform. Trusting users to click through a console to deploy cloud infrastructure is error-prone, and it's difficult to put enough guardrails in place to force those users to stick to security standards. While it might be fast to deploy insecure infrastructure by hand in the console if you know what you are doing, it's ultimately faster to deploy secure infrastructure at the organization level when you take a careful engineering approach and create reusable infrastructure as code. There's also the benefit that the infrastructure is secure, if you code it that way, and that you can automate the boring things and move on to solving new problems.

Suggested Reading: Why we use Terraform and not Chef, Puppet, Ansible, or Saltstack by Gruntwork.io

However, simply using Terraform does not guarantee repeatable security - you need to bake these security standards into your code. This goes beyond the CIS Benchmarks for AWS. From a technical perspective, you'd want to generally have a defined secrets management solution, infrastructure as code, backup capabilities, asset discovery, a mature and enforced identity and access management approach, enforcement of protections, and incident response preparation, to name a few.

Suggested Reading: AWS Security Maturity Roadmap by Scott Piper

Now, let's say we have baked secure S3 bucket configuration into our Terraform modules. But from a realistic adoption standpoint, many questions remain.

  • How do we verify that everybody is using our module?
  • If the main developer gets hit by a bus, how would a new team member validate that the Terraform module in fact meets the desired security requirements?
  • If someone else writes Terraform code that doesn't use our secure module, how would we validate that it meets security requirements anyway?
  • How do we ensure these security standards are met, whether or not they use our module?

Well, you could take a reactive approach to cloud security. In this case, that could include some combination of AWS Config, CloudWatch Alarms, Amazon SNS messages, and a notification solution of your choice like Slack (or email, if that's still your thing). But that still allows the vulnerable infrastructure to be created in the first place. Time to remediation is slower, and it doesn't fix the root issue (for example, when 20 noncompliant cases are due to 20 teams using the same noncompliant Terraform module).

Creating secure Terraform modules is an example of a proactive approach to cloud security. Ideally, you'd have a Continuous Integration (CI) server like Jenkins validating your infrastructure, and a Continuous Delivery (CD) server like Spinnaker or an Infrastructure-focused CD service like Terraform Enterprise to deploy your Terraform infrastructure. In fact, outside of Dev or Sandbox environments, it's best to force all cloud infrastructure to use this kind of solution.

If you do force all of your infrastructure to be deployed via Terraform Enterprise, Spinnaker, or a Jenkins Shared Library, you have a great opportunity to enforce security standards programmatically - by using Open Policy Agent.

Open Policy Agent TLDR

In a nutshell, OPA lets you parse through JSON very effectively (e.g., 8 lines of Python with nested loops to parse through JSON can be accomplished with 1 line of Rego). Essentially, OPA eats some JSON, makes a decision, and spits out the decision (allow, deny, or warn).

The way that it makes a decision is based on your policy as code - written using the Rego language. Now, I'm not much of a fan of the Rego language - but it is damn powerful. If you approach it correctly, you can have a uniform method for managing your Terraform infrastructure.
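To make the "JSON in, decision out" idea concrete, here is a minimal Python sketch of the kind of loop that a short Rego rule replaces. It assumes the plan has been exported with `terraform show -json` (which lists planned changes under `resource_changes`) and that the bucket ACL is set inline on `aws_s3_bucket`; newer AWS provider versions move ACLs to a separate resource, so treat the field names as illustrative. The bucket names and the `decide` helper are hypothetical.

```python
import json

# Trimmed stand-in for `terraform show -json plan.out` output; only the
# fields this check reads are included.
PLAN_JSON = """
{
  "resource_changes": [
    {"type": "aws_s3_bucket", "name": "logs",
     "change": {"actions": ["create"], "after": {"acl": "private"}}},
    {"type": "aws_s3_bucket", "name": "website",
     "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
    {"type": "aws_instance", "name": "web",
     "change": {"actions": ["create"], "after": {"instance_type": "t3.micro"}}}
  ]
}
"""

PUBLIC_ACLS = {"public-read", "public-read-write"}


def public_buckets(plan):
    """Return names of S3 buckets in the plan whose ACL is public."""
    names = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_s3_bucket":
            continue
        # `after` is null for resources being destroyed, hence the `or {}`.
        after = (change.get("change") or {}).get("after") or {}
        if after.get("acl") in PUBLIC_ACLS:
            names.append(change["name"])
    return names


def decide(plan):
    """Turn the list of violations into an allow/deny decision."""
    violations = public_buckets(plan)
    return {"decision": "deny" if violations else "allow",
            "violations": violations}


plan = json.loads(PLAN_JSON)
print(decide(plan))
```

In Rego, the body of `public_buckets` collapses into a single comprehension over `input.resource_changes`, which is where the line-count savings come from.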

OPA decision making

Planning our implementation

Let's take some basic principles for our implementation strategy:

  • Provision all Terraform infrastructure through a CICD pipeline
  • In that CICD pipeline, environment variables are supplied to indicate various characteristics about that pipeline's deployment. For example, COMPANY_SERVICE and ACCOUNT_ID. In real life, we should have more granular environment variables than this, but we'll leave it here for this example.
  • Exceptions requirements
    • We need to provide exceptions to rules, on a per-policy basis.
    • Terraform developers who do not work on OPA code should be able to make pull requests to add their pipelines to an exceptions file expressed in YAML. Those Terraform developers know which environment variables to use and which values.
    • Policy authors should have ONE uniform way of specifying exceptions when they write a policy. That is outlined below.
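The exceptions requirements above can be sketched in Python as follows. The dictionary stands in for a parsed per-service YAML file; its key names (`service_name`, `rules`) and the sample values are assumptions made for illustration, not the schema used in the example repository.

```python
# Hypothetical contents of an exceptions file for the "s3" service, as it
# might look after being parsed from YAML. Each rule maps to a list of
# entries; an entry lists the environment variable values that are exempt.
S3_EXCEPTIONS = {
    "service_name": "s3",
    "rules": {
        "private_acl": [
            {"COMPANY_SERVICE": "marketing-site", "ACCOUNT_ID": "111111111111"},
        ],
    },
}


def is_exception(exceptions, service_name, rule_name, pipeline_env):
    """Return True if the pipeline matches an exception entry for the rule."""
    if exceptions.get("service_name") != service_name:
        return False
    for entry in exceptions.get("rules", {}).get(rule_name, []):
        # Skip empty entries so a blank list item never exempts everyone.
        if entry and all(pipeline_env.get(k) == v for k, v in entry.items()):
            return True
    return False


env = {"COMPANY_SERVICE": "marketing-site", "ACCOUNT_ID": "111111111111"}
print(is_exception(S3_EXCEPTIONS, "s3", "private_acl", env))
```

Because exceptions are keyed by rule name, exempting a pipeline from one policy does not exempt it from any other, and adding an exemption is a data-only pull request.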

My example code that creates this is here.

The nice thing about the approach outlined in that repository is that the policies don't have to actually know which account IDs, environments, service names, etc. are allowed per policy - they just know that the exceptions logic will take care of it, since the exceptions logic essentially takes things in from two sources in this code base:

  • The policy file, which feeds the service name and rule name over to the exceptions logic
  • The plan file, which contains the Falcon environment variable values, such as the values for COMPANY_SERVICE, ACCOUNT_ID, etc.

As a result, the exceptions folder can really be managed by anybody (it doesn't have to be the devs, and it definitely doesn't have to be someone who knows Rego). After all, the YAML file tells the exceptions logic what to look for: the service_name, the rule_name, and the values for the environment variables.

…and the policy developers just have to include two lines in their code:

import data.exception_logic

# ...then, within the policy, call this function, specifying only the `service_name` and the `rule_name`
find_insecure_resources(resource_types) = {resource.name |
  # ...
  not exception_logic.is_account_id_exception("s3", "private_acl")
}

And the exception logic functions know how to evaluate the Terraform plan file for those environment variables, and read the exceptions/service_name.yaml file to determine whether this matches the exception criteria.
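The flow described above, where the policy supplies the service and rule names and the plan supplies the pipeline's values, might look like this in Python. It assumes the pipeline's environment variables reach Terraform as input variables and therefore appear under the plan JSON's top-level `variables` block; that wiring, the sample plan, and the `is_exception` callable are all hypothetical stand-ins for the Rego exception logic.

```python
PLAN = {
    "variables": {
        "COMPANY_SERVICE": {"value": "marketing-site"},
        "ACCOUNT_ID": {"value": "111111111111"},
    },
    "resource_changes": [
        {"type": "aws_s3_bucket", "name": "logs",
         "change": {"after": {"acl": "private"}}},
        {"type": "aws_s3_bucket", "name": "website",
         "change": {"after": {"acl": "public-read"}}},
    ],
}


def pipeline_env(plan, names=("COMPANY_SERVICE", "ACCOUNT_ID")):
    """Read the pipeline's identifying values from the plan's variables."""
    variables = plan.get("variables", {})
    return {n: variables[n]["value"] for n in names if n in variables}


def find_insecure_resources(plan, resource_types, is_exception):
    """Names of resources with a non-private ACL, unless the pipeline is exempt."""
    # The policy only names the service and the rule; the exception logic
    # decides whether this pipeline's values match an exception entry.
    if is_exception("s3", "private_acl", pipeline_env(plan)):
        return set()
    insecure = set()
    for change in plan.get("resource_changes", []):
        if change.get("type") not in resource_types:
            continue
        acl = ((change.get("change") or {}).get("after") or {}).get("acl")
        # Only an explicitly set, non-private ACL is flagged in this sketch.
        if acl is not None and acl != "private":
            insecure.add(change["name"])
    return insecure


# Stub that exempts nothing, so the violation is reported.
print(find_insecure_resources(PLAN, {"aws_s3_bucket"}, lambda s, r, env: False))
```

Passing the exception check in as a callable keeps the policy ignorant of account IDs and service names, which is the decoupling the post is after.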

Walking through how every bit of this code works might be material for another day. I might create a separate blog post that will set the stage for the technical know-how to understand the code, and then one after this blog post for a tutorial. Stay tuned…

Takeaways

  • When dealing with Cloud Infrastructure security, it's important to always take two approaches to any given security control: a proactive approach, and a reactive approach.
    • Generally speaking, it's easier to roll out these reactive controls, so in my opinion you should always have these monitoring and alerting solutions in place before you have proactive controls in place.
    • However, if you only have reactive controls in place, time to remediation is slower and you don't fix the root cause of the issue.
  • Build all of your cloud infrastructure through a CICD pipeline. Otherwise, it's over-privileged madness.
  • When developing an OPA Framework for evaluating Terraform code, it's critical to have the following:
    • A centralized OPA testing framework that gets executed at deploy time to prevent insecure infrastructure from being deployed
    • Make sure your testing includes all Cloud provider security standards, and enforces your organization standards as well (like which services/accounts are allowed to have public IPs, who is allowed to have open security groups, not just “nobody can have open S3 buckets”)
    • Decouple the policy logic and the exceptions data with carefully created exceptions mechanisms
      • Make it easy for developers to support the exceptions logic (see the earlier note about policy developers only needing to include two lines in their code)
    • Exceptions
      • Clear technical capabilities for handling exceptions
      • A human process for submitting those exceptions
      • Developers should be able to submit Pull Requests to add their pipelines to the exceptions
      • The exceptions criteria should be in a format that everyone can understand; they shouldn't have to be Rego developers or understand the syntax at all. Since everything in cloud is expressed in YAML or HCL these days, you can expect the developers and operations folks submitting these PRs to understand what they should place in YAML.
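As a rough illustration of the deploy-time gate in the first takeaway, a pipeline step could collect each policy's decision and fail the build on any deny. This is a sketch under assumptions: the result shape and the rule names are hypothetical, and a real pipeline would get the decisions from its OPA evaluation step.

```python
import sys


def gate(results):
    """Return a process exit code: 1 if any rule denies, otherwise 0."""
    denied = [r for r in results if r["decision"] == "deny"]
    warned = [r for r in results if r["decision"] == "warn"]
    for r in warned:
        # Warnings are reported but do not block the deployment.
        print(f"WARN {r['rule']}: {', '.join(r['resources'])}")
    for r in denied:
        print(f"DENY {r['rule']}: {', '.join(r['resources'])}", file=sys.stderr)
    return 1 if denied else 0


results = [
    {"rule": "s3/private_acl", "decision": "deny", "resources": ["website"]},
    {"rule": "ec2/public_ip", "decision": "warn", "resources": ["web"]},
]
exit_code = gate(results)
print("exit code:", exit_code)
# In a CI step, finish with sys.exit(exit_code) so a deny stops the deploy.
```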

I hope this helps. Looking forward to creating some more content on this soon. Stay tuned…

Kinnaird McQuade
Lead Cloud Security Engineer