Comparing Infrastructure Tools: A First Look at the AWS Cloud Development Kit

My background is as a developer, so when I think of “devops” and “infrastructure as code” I look for the loops and conditionals of a Turing-complete language. Unfortunately for me, popular devops tools lean toward a declarative format: you describe the environment that you want, and the tool makes whatever changes are needed to achieve that goal.

In the past, when I felt that I needed to express infrastructure as “real” code, I would turn to cfndsl, a Ruby gem that lets you generate CloudFormation templates. And while it simplifies some tasks, it has a lot of boilerplate and is an oddity for organizations that don’t already use Ruby.

Which is why I was excited about Amazon’s announcement of the Cloud Development Kit at the 2019 New York Summit. Not only does it provide a programmatic way to generate CloudFormation templates, it can abstract away much of the boilerplate: the demo at the Summit created a 700-line CloudFormation template from a few lines of TypeScript.

Clearly that example was picked for its “wow!” factor, but I was hopeful that other common tasks could be similarly condensed. To test that, I decided to do a head-to-head comparison between CDK, Terraform, and CFNDSL, using a common task: managing the users in your AWS accounts and the roles they can assume.

This is a long post … grab a cup of coffee.

The Task

Create IAM users, assign them to groups, and then allow the groups to assume roles.

This is a common infrastructure task for organizations that use multiple accounts to isolate workloads (e.g., development and production). When users join or leave the organization, or switch roles, that change must be reflected in the roles that they can assume. Since this happens rather frequently, we want to script the process and maintain a log of all changes in our source repository. I add a slight twist, in that the users are created in a third “operations” account, and then assume roles to gain permissions in the “deployment” accounts.

Note: Amazon’s Single Sign-On service provides an alternative way to accomplish this task, bypassing much of the AWS Identity and Access Management (IAM) infrastructure. For some organizations, particularly those that already use Microsoft Active Directory and/or want single sign-on for third-party applications, it may be a better choice. However, as of this writing it’s not scriptable and does not support TOTP-based multi-factor authentication, so I still consider AWS-centric user management a valid choice.

The Contenders

CloudFormation: the AWS-provided tool for managing infrastructure. A CloudFormation template describes resources such as EC2 instances, RDS databases, or in this case, IAM users and groups. I’m including CloudFormation in this post as a baseline: it does not provide any native scripting facilities; everything must be explicitly declared.

Terraform: the go-to tool for many devops engineers. Terraform scripts are declarative: you specify the resources that you want to create, and the attributes of those resources, and Terraform does the work of making your deployment look like the declaration. Terraform supports user-defined modules, which gives you a reusable toolbox of standard infrastructure, and it supports a limited form of iteration, where you can specify that a group of similar resources are to be created at the same time. Terraform stores the current state of your deployment in a state file that must be kept with the declaration (checked in alongside it, or stored in a remote backend); it compares the declaration against the recorded state to decide what to do.

CFNDSL: a Ruby gem that generates CloudFormation templates. It is typically invoked as a rake task, which works well if you’re already building Ruby applications, but is yet another tool to learn for those who don’t. It works by building an in-memory data structure that represents the CloudFormation template, then transforms that structure into JSON so that you can create a stack. Since you’re just manipulating a data structure, it’s easy to use Ruby constructs such as loops and conditionals. However, since the data structure must produce a CloudFormation template, you need to specify all of the information required by the template.
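To make the “in-memory data structure” idea concrete, here is a minimal TypeScript sketch of the same approach. It is not CFNDSL itself: the Resource and Template types and the buildUserTemplate function are hypothetical, heavily simplified stand-ins, and a real template has many more fields.

```typescript
// Hypothetical, simplified types: a real CloudFormation template has many more fields.
interface Resource {
  Type: string;
  Properties: Record<string, unknown>;
}

interface Template {
  AWSTemplateFormatVersion: string;
  Description: string;
  Resources: Record<string, Resource>;
}

// Build one AWS::IAM::User resource per name, using an ordinary loop.
function buildUserTemplate(userNames: string[]): Template {
  const template: Template = {
    AWSTemplateFormatVersion: "2010-09-09",
    Description: "Manages the account's users",
    Resources: {},
  };
  userNames.forEach((userName) => {
    // the user name doubles as the logical ID, so it must be alphanumeric
    template.Resources[userName] = {
      Type: "AWS::IAM::User",
      Properties: { UserName: userName },
    };
  });
  return template;
}

// Serialize the in-memory structure into the JSON text of a template.
const templateJson: string = JSON.stringify(buildUserTemplate(["user1", "user2"]), null, 2);
```

The point of the sketch is that the loop lives in the host language, while the output is still a flat template in which every resource is spelled out.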

CDK, the AWS Cloud Development Kit, is an Amazon-sponsored open-source tool that lets you build CloudFormation templates programmatically. Like CFNDSL, it does this by manipulating an in-memory model of the template. Unlike CFNDSL, CDK provides “constructs” that have intelligent default configuration values, allowing a relatively small source file to generate a large CloudFormation template. CDK supports several scripting languages, but the documentation uses TypeScript so that’s what I’ll use for this post.

The Implementations

My goal with this post is to give the flavor of these different approaches. So I’m going to assume that you (1) are familiar with how IAM users, groups, and roles work, and (2) have some experience with CloudFormation and the idea of declarative infrastructure. With these assumptions, I believe that you’ll be able to understand the other versions, even if you aren’t familiar with Terraform, Ruby, or TypeScript. Note that I also leave out parts of the scripts that distract from the points that I’m trying to make. If you want to look at fully functional versions of these scripts, they’re available from GitHub, along with execution instructions.

Since I’m a developer at heart, I want to manage my users via data tables, rather than repeated code blocks. Therefore, I’ve written each implementation (except native CloudFormation) around the following data structures:

  • The list of users.
  • The list of groups.
  • A lookup table that associates users with the groups they belong to.
  • A lookup table that associates groups with the roles that group members can assume.
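As a rough sketch of what these four tables can look like in code, the following TypeScript uses plain objects and arrays. The values match the examples used throughout the post; the rolesForUser helper is hypothetical and only illustrates how the two lookup tables compose.

```typescript
// The list of users and the list of groups.
const userNames: string[] = ["user1", "user2", "user3"];
const groupNames: string[] = ["group1", "group2"];

// Users -> the groups they belong to.
const userGroups: Record<string, string[]> = {
  user1: ["group1", "group2"],
  user2: ["group1"],
  user3: ["group2"],
};

// Groups -> the roles that members can assume, as [account name, role name] pairs.
const groupPermissions: Record<string, Array<[string, string]>> = {
  group1: [["dev", "FooAppDeveloper"], ["prod", "FooAppReadOnly"]],
  group2: [["dev", "BarAppDeveloper"], ["prod", "BarAppReadOnly"]],
};

// Hypothetical helper: every [account, role] pair a user can reach via any group.
function rolesForUser(userName: string): Array<[string, string]> {
  const result: Array<[string, string]> = [];
  (userGroups[userName] || []).forEach((groupName) => {
    (groupPermissions[groupName] || []).forEach((pair) => result.push(pair));
  });
  return result;
}
```

With this data, rolesForUser("user1") returns four pairs (two from each group), while an unknown user name returns an empty list.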

Step 1: Creating Users

Creating the users for your organization is a task that many people do manually, via the AWS Console. Partly, that’s because creating a user is just the start: you also have to enable access (programmatic and/or console), and securely convey the access keys and/or password to the person represented by that user. However, scripting can bring rigor and traceability to user management, especially in the case where a user leaves the organization.

For this task I create three users, and attach a pre-existing policy that lets them manage their own account (for example, to change the password). While I could also assign them an initial password, I consider that irrelevant to this post.

The CloudFormation variant of this task shows why other tools exist: you have to specify each user individually. In a large stack, this can easily become confusing, especially if you intermingle user resources with group assignments and group permissions. And in a larger organization, you will probably run into the limit of 200 resource definitions per stack.
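To get a feel for how quickly that limit approaches, here is a back-of-the-envelope sketch in TypeScript. It assumes the layout used in this post (one resource per user, one per group, and one inline policy per group); the function names are hypothetical, and the limit is a parameter because AWS quotas change over time.

```typescript
// Rough resource count for this post's layout:
// one AWS::IAM::User per user, one AWS::IAM::Group per group,
// and one AWS::IAM::Policy per group.
function estimateResourceCount(userCount: number, groupCount: number): number {
  return userCount + groupCount * 2;
}

// The per-stack limit cited in the text; passed as a default so it can be overridden.
const RESOURCE_LIMIT = 200;

function fitsInOneStack(
  userCount: number,
  groupCount: number,
  limit: number = RESOURCE_LIMIT
): boolean {
  return estimateResourceCount(userCount, groupCount) <= limit;
}

const smallTeam = fitsInOneStack(3, 2);     // 7 resources: true
const largerOrg = fitsInOneStack(180, 15);  // 210 resources: false
```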

I’m using YAML as my stack description language: it’s significantly more compact than JSON, and allows comments. For this first example I show the top-level stack elements; later examples will omit them.

AWSTemplateFormatVersion:           "2010-09-09"
Description:                        "Defines users, groups, and group permissions"

Parameters:

  # these will be defined in a later step

Resources:

  User1:
    Type:                           "AWS::IAM::User"
    Properties: 
      UserName:                     "user1"
      ManagedPolicyArns:            [ !Sub "arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy" ]

  User2:
    Type:                           "AWS::IAM::User"
    Properties: 
      UserName:                     "user2"
      ManagedPolicyArns:            [ !Sub "arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy" ]

  User3:
    Type:                           "AWS::IAM::User"
    Properties: 
      UserName:                     "user3"
      ManagedPolicyArns:            [ !Sub "arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy" ]

By comparison, creating multiple resources is an area where Terraform shines: you can specify your list of users as a variable, and then declare a single resource that is applied to everyone in the list:

provider "aws" {}

variable "users" {
    type = list
    default = [ "user1", "user2", "user3" ]
}

resource "aws_iam_user" "users" {
  count = length(var.users)
  name  = "${var.users[count.index]}"
  force_destroy = true
}

This, however, isn’t the entire script: we also need to add the reference to the basic user policy. Unlike CloudFormation, which declares the policy attachment inside the user resource, Terraform requires creating another resource, which is also driven by the users array. We also need to create a “data” object to gain access to the invoking AWS account ID.

data "aws_caller_identity" "current" {}

resource "aws_iam_user_policy_attachment" "base_user_policy_attachment" {
  count      = length(var.users)
  user       = "${var.users[count.index]}"
  policy_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:policy/BasicUserPolicy"
}

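Before I leave Terraform, I should note that this script doesn’t always work on the first try. I think it’s an eventual consistency issue: AWS has “created” the user, but it’s not ready to be assigned a policy. Another likely contributor is that the attachment takes the user name from var.users rather than from the aws_iam_user resource, so Terraform sees no dependency between the two resources and may try to create them at the same time; referring to aws_iam_user.users[count.index].name instead would make the ordering explicit. Running the script a second time will correct the failure, but it’s a little annoying.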

Moving on to CFNDSL: it looks a lot like the CloudFormation template. You can guess that IAM_User corresponds to an AWS::IAM::User resource, and it has the same attributes. However, IAM_User is actually a function that adds the resource specification onto a tree of resources. Which means that it can be called from inside a loop, so that you only need to specify the resource once. When you run this program using the cfndsl gem, it will generate a CloudFormation template that is equivalent to the one I showed above.

users = [ "user1", "user2", "user3" ]

CloudFormation do
  Description "Manages the account's users"

  users.each do |user|
    IAM_User("#{user}") do
      UserName            user
      ManagedPolicyArns   [ FnSub("arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy") ]
    end
  end
end

The CDK version is, surprisingly, the longest. There is a lot of boilerplate to a very simple stack: you have to create a TypeScript class that represents the stack, with a constructor function. Not shown here are the files generated by the cdk init command, which form the “application” that includes your stack.

import cdk = require('@aws-cdk/core');
import iam = require('@aws-cdk/aws-iam');

const userNames : string[] = [ "user1", "user2", "user3" ]

export class UsersStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // since we're creating constructs inside functions, need to cache "this"
    let stack = this

    userNames.forEach(function(userName) {
      const user = new iam.User(stack, userName, {
        userName: userName
      })

      user.addManagedPolicy({
        managedPolicyArn: stack.formatArn({
          service:      "iam",
          region:       "",
          resource:     "policy",
          resourceName: "BasicUserPolicy"
        })
      })
    })
  }
}

This particular example was also lengthened because I had to construct the attached policy ARN from its components. There’s a function that hides this for AWS-managed policies, but not for customer-managed policies. There is an open issue to fix this; a solution was created while I was writing this post, but it has not yet been released.
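For reference, the ARN being assembled has a simple shape. The sketch below is a hypothetical plain-string helper, not part of CDK: it assumes the standard “aws” partition and a literal account ID, whereas the CDK call works with the stack’s own account, which may not be known until deployment.

```typescript
// Hypothetical helper, not part of CDK: assembles an IAM ARN from its parts.
// IAM is a global service, so the region field between "iam" and the
// account ID is left empty (hence the double colon).
function iamArn(
  accountId: string,
  resourceType: "policy" | "role" | "user" | "group",
  resourceName: string
): string {
  return `arn:aws:iam::${accountId}:${resourceType}/${resourceName}`;
}

const basicUserPolicyArn = iamArn("123456789012", "policy", "BasicUserPolicy");
// "arn:aws:iam::123456789012:policy/BasicUserPolicy"
```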

Step 2: Adding Users to Groups

One of the AWS IAM Best Practices is to assign user permissions via groups. For example, if your team manages two applications, you could create a group for each and then grant permissions (such as reading specific CloudWatch logs) to the group. If a user works on both applications, she is a member of both groups; if she stops working on one, she can be removed from the appropriate group without concern for her ability to work on the other.

For this example I create two groups, then divide the users from the previous step between them: user1 belongs to both groups, while user2 and user3 belong to one apiece.

With CloudFormation, this is, again, all explicit, but not overly onerous: I create an AWS::IAM::Group resource for each group (only one shown here), then update each AWS::IAM::User with a Groups attribute that refers to the appropriate group(s). If you’re creating the users and groups in the same stack, this will also establish an implicit deployment dependency: CloudFormation will create the groups first so that they can be referenced by the users.

Resources:

  Group1:
    Type:                           "AWS::IAM::Group"
    Properties: 
      GroupName:                    "group1"

  User1:
    Type:                           "AWS::IAM::User"
    Properties: 
      UserName:                     "user1"
      ManagedPolicyArns:            [ !Sub "arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy" ]
      Groups:
       -                            !Ref Group1
       -                            !Ref Group2

The Terraform code to create groups looks a lot like the code that created users: a single resource definition that’s driven by an array.

variable "groups" {
    type = list
    default = [ "group1", "group2" ]
}

resource "aws_iam_group" "groups" {
  count = length(var.groups)
  name = "${var.groups[count.index]}"
}

Assigning members to groups is a bit more complex: as with attaching policies to users, you use a separate resource declaration to attach users to groups. And there are two ways to do that: either attach users to groups, or attach groups to users. The approach that you choose depends on how you want to manage your data structures.
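The two directions carry the same information, so whichever table you maintain by hand, the other can be derived from it. A minimal TypeScript sketch of that conversion follows; the invertMembership function and the sample data are illustrative, not part of any of the tools discussed here.

```typescript
// Turn a user -> groups table into the equivalent group -> users table.
function invertMembership(userGroups: Record<string, string[]>): Record<string, string[]> {
  const groupUsers: Record<string, string[]> = {};
  Object.keys(userGroups).forEach((userName) => {
    userGroups[userName].forEach((groupName) => {
      if (!groupUsers[groupName]) {
        groupUsers[groupName] = [];
      }
      groupUsers[groupName].push(userName);
    });
  });
  return groupUsers;
}

const byGroup = invertMembership({
  user1: ["group1", "group2"],
  user2: ["group1"],
  user3: ["group2"],
});
// byGroup is { group1: ["user1", "user2"], group2: ["user1", "user3"] }
```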

In my view, it’s more readable to list the groups associated with each user, so I chose a mapping between username and a list of groups:

variable "group_members" {
    type = map(list(string))
    default = {
      "user1"  = [ "group1", "group2" ],
      "user2"  = [ "group1" ],
      "user3"  = [ "group2" ]
    }
}

The resource definition itself iterates over the users variable, and looks up the groups for each user in the group_members variable. This is about as tricky as I like to get with Terraform, but it’s undeniably compact and relatively easy to read.

resource "aws_iam_user_group_membership" "group-membership" {
  count = length(var.users)
  user  = "${var.users[count.index]}"
  groups = "${var.group_members[var.users[count.index]]}"
}

The CFNDSL version again looks a lot like the CloudFormation script: we create the groups via IAM_Group (again using a loop), and reference the groups in IAM_User. The one tricky bit is that I take the list of group names, and use the map function to translate them to references to the group resources.

users = [ "user1", "user2", "user3" ]

groups = [ "group1", "group2" ]

group_members = {
  "user1"  => [ "group1", "group2" ],
  "user2"  => [ "group1" ],
  "user3"  => [ "group2" ]
}


CloudFormation do

  groups.each do |group|
    IAM_Group("#{group}") do
      GroupName group
    end
  end

  users.each do |user|
    IAM_User("#{user}") do
      UserName            user
      ManagedPolicyArns   [ FnSub("arn:aws:iam::${AWS::AccountId}:policy/BasicUserPolicy") ]
      Groups              group_members[user].map { |group| Ref("#{group}") }
    end
  end
end

With CDK, I had to specify the user/group relationships using a typed Map object. It’s more boilerplate than a JavaScript-style nested object, but it keeps the compiler happy.

const userGroups = new Map<string, string[]>()
userGroups.set("user1", [ "group1", "group2" ])
userGroups.set("user2", [ "group1" ])
userGroups.set("user3", [ "group2" ])

Keeping the compiler happy was also a big part of creating the user/group relationships. While the underlying IAM API refers to users and groups as strings, CDK scripts deal with objects. The User object has a method addToGroup() that takes an IGroup object, and there’s no way to simply add a group by name. This has several implications, the biggest being that you can’t create your users and groups in separate stacks, since you need the actual objects.

I resolved this issue by creating two maps: usersByName and groupsByName. These are populated by the code that creates the users and groups; the key is the user/group name, and the value is the CDK object.

export class UsersAndGroupsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // since we're creating constructs inside functions, we need to preserve "this"
    let stack = this

    // these two maps let us retrieve constructs for later use
    let groupsByName = new Map()
    let usersByName = new Map()

    groupNames.forEach(function(groupName) {
      const group = new iam.Group(stack, groupName, {
        groupName: groupName
      })
      groupsByName.set(groupName, group)
    })

    userNames.forEach(function(userName) {
      const user = new iam.User(stack, userName, {
        userName: userName
      })

      // add BasicUserPolicy

      usersByName.set(userName, user)
    })
  }
}

While I could have attached the groups while creating the users, I decided to write a separate loop to do so: I quickly learned that it was a lot easier to develop a working template if I added small pieces at a time. Since the CDK program is simply building an in-memory data model that it converts to a CloudFormation template, there’s no performance benefit to be gained from one approach over the other.

userGroups.forEach(function(groupNames, userName) {
  const user  = usersByName.get(userName)
  groupNames.forEach(function(groupName) {
    const group = groupsByName.get(groupName)
    if (user && group) {
        user.addToGroup(group)
    }
  })
})
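One consequence of the `if (user && group)` guard is that a misspelled name in the lookup table is skipped silently. A possible refinement, sketched here as a hypothetical standalone function that uses plain objects rather than CDK constructs, is to check the table against the user and group lists before building anything.

```typescript
// Report names used in the membership table that are missing
// from the user or group lists.
function findUnknownNames(
  userNames: string[],
  groupNames: string[],
  userGroups: Record<string, string[]>
): string[] {
  const problems: string[] = [];
  Object.keys(userGroups).forEach((userName) => {
    if (userNames.indexOf(userName) < 0) {
      problems.push(`unknown user: ${userName}`);
    }
    userGroups[userName].forEach((groupName) => {
      if (groupNames.indexOf(groupName) < 0) {
        problems.push(`unknown group: ${groupName} (user ${userName})`);
      }
    });
  });
  return problems;
}

const problems = findUnknownNames(["user1"], ["group1"], { user1: ["group1", "grop2"] });
// problems is [ "unknown group: grop2 (user user1)" ]
```

Failing the build when the returned list is non-empty would surface typos at synthesis time rather than as a missing group membership later.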

Step 3: Assigning Permissions to Groups

As I described at the top of this post, groups grant users the ability to assume roles that belong to different accounts within the organization. To make this happen I add an inline policy to each group that grants sts:AssumeRole for those roles (creating roles is beyond the scope of this post). While I could have created a managed policy rather than an inline policy, I don’t see that it adds much benefit in this case.

As always, I’m going to start with CloudFormation, which requires everything to be declared explicitly. I use parameters to hold account IDs to make the script more readable: especially with a large organization, it can be difficult to remember which numeric ID belongs to which account. And for brevity, I again limit myself to one group.

Parameters:

  DevAccountId:
    Description:                    "Symbolic name for the development account ID"
    Type:                           "String"
    Default:                        "123456789012"

  ProdAccountId:
    Description:                    "Symbolic name for the production account ID"
    Type:                           "String"
    Default:                        "234567890123"

Resources:

  # users and groups get created here

  Group1Policy:
    Type:                           "AWS::IAM::Policy"
    Properties:
      Groups:                       [ !Ref Group1 ]
      PolicyName:                   "Group1-AssumeRole"
      PolicyDocument:
        Version:                    "2012-10-17"
        Statement:
          -
            Effect:                 "Allow"
            Action:                 [ "sts:AssumeRole" ]
            Resource:
             -                      !Sub "arn:aws:iam::${DevAccountId}:role/FooAppDeveloperRole"
             -                      !Sub "arn:aws:iam::${ProdAccountId}:role/FooAppReadOnlyRole"

For Terraform I maintain the relationships in a map, similar to the way that I handled user/group relationships. This map, however, has more levels: each group is associated with a list of roles, and each role in turn is composed of an account name and a role name.

variable "group_permissions" {
    type = map(list(list(string)))
    default = {
      "group1"  = [
                        [ "dev",    "FooAppDeveloper" ],
                        [ "prod",   "FooAppReadOnly" ]
                      ],
      "group2"  = [
                        [ "dev",    "BarAppDeveloper" ],
                        [ "prod",   "BarAppReadOnly" ]
                      ]
    }
}

I have another lookup table that associates those human-readable account names with account IDs.

variable "account_id_lookup" {
    type = map
    default = {
        "dev"   = "123456789012",
        "prod"  = "234567890123"
    }
}

The actual resource definition starts with a “data” object that represents the policy. Data objects serve two very different purposes in Terraform. The first is retrieving information about the environment; you saw that in the first example, when I used one to retrieve the AWS account ID. The second purpose is to allow you to specify hierarchical data within the script using Terraform syntax, and then convert that object into JSON for use in a resource. This is the purpose that I’m using here: for each group, I create a policy object that holds the assumable roles, and then convert it to JSON to attach to the group.

data "aws_iam_policy_document" "group-policies" {
  count = length(var.groups)
  statement {
    sid = "1"
    actions = [
        "sts:AssumeRole"
    ]
    resources = [
      for acct_role in var.group_permissions[var.groups[count.index]]:
        "arn:aws:iam::${var.account_id_lookup[acct_role[0]]}:role/${acct_role[1]}"
    ]
  }
}

As with the previous definitions, group-policies uses a count to create multiple instances. However, it also makes use of another iteration technique, “for expressions,” which were introduced with version 0.12. A for-expression acts much like Ruby’s map function: it performs a transformation on each element of an array, producing an array of transformed values. Here I transform the list of account/role pairs associated with the group into ARNs for use in the policy. In terms of trickiness (and difficulty to read), this is past what I’m comfortable with, but it’s the only way I’ve found to make Terraform use nested data structures.

resource "aws_iam_group_policy" "group-policies" {
  count = length(var.groups)
  name = "group-policies-${count.index}"
  group = "${var.groups[count.index]}"
  policy = "${data.aws_iam_policy_document.group-policies[count.index].json}" 
}
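For readers more at home in a general-purpose language, the for-expression in the policy document above corresponds roughly to a map over the list of pairs. The TypeScript below is only an illustrative equivalent; the names accountIdLookup and roleArns are assumptions chosen to mirror the Terraform variables.

```typescript
// Account name -> account ID, mirroring the account_id_lookup variable.
const accountIdLookup: Record<string, string> = {
  dev: "123456789012",
  prod: "234567890123",
};

// Transform each [account name, role name] pair into a role ARN,
// which is what the for-expression does for one group's permissions.
function roleArns(pairs: Array<[string, string]>): string[] {
  return pairs.map(([accountName, roleName]) =>
    `arn:aws:iam::${accountIdLookup[accountName]}:role/${roleName}`
  );
}

const group1Arns = roleArns([["dev", "FooAppDeveloper"], ["prod", "FooAppReadOnly"]]);
// [ "arn:aws:iam::123456789012:role/FooAppDeveloper",
//   "arn:aws:iam::234567890123:role/FooAppReadOnly" ]
```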

Like the Terraform example, my CFNDSL example has a nested map defining account/role pairs, and a map defining a name/ID table for accounts.

account_lookup = {
  "dev"   => "123456789012",
  "prod"  => "234567890123"
}

group_permissions = {
  "group1"  => [
                  [ "dev",    "FooAppDeveloper" ],
                  [ "prod",   "FooAppReadOnly" ]
               ],
  "group2"  => [
                  [ "dev",    "BarAppDeveloper" ],
                  [ "prod",   "BarAppReadOnly" ]
               ]
}

And like the Terraform example, I create the policy document separately, and transform it to JSON when creating the policy. The Resource section of this document uses the map function, which you’ve seen before, to iterate over the list of permissions retrieved from the map. Unlike the Terraform for-in operation, however, this function executes a block of Ruby code. Which means that I can introduce “explanatory variables” that hold pieces of the role array, and make the code easier to understand.

CloudFormation do

  groups.each do |group|
    IAM_Group("#{group}") do
      GroupName group
    end

    policy_document = {
      "Version"   => "2012-10-17",
      "Statement" => [{
        "Effect"    => "Allow",
        "Action"    => [ "sts:AssumeRole" ],
        "Resource"  => group_permissions[group].map { |acct_role|
                        account_id = account_lookup[acct_role[0]]
                        role_name = acct_role[1]
                        "arn:aws:iam::#{account_id}:role/#{role_name}"
                       }
      }]
    }.to_json

    IAM_Policy("#{group}Policy") do
      PolicyName      "#{group}-AssumeRolePolicy"
      PolicyDocument  policy_document
      Groups          [ Ref(group) ]   # Ref makes the policy depend on the group resource
    end
  end

  # user creation omitted
end

Last up, CDK. At this point, I don’t think there’s anything new to describe: as with the previous examples I use a map of account names to IDs and a nested map of assignable roles. Then I iterate through that structure, create a policy statement, and add it to the group. Unlike other examples, where the policy statement had to be translated to JSON, CDK provides an object, iam.PolicyStatement, which manages the document elements and performs that translation internally.

const accountIds = new Map()
accountIds.set("dev",  "123456789012")
accountIds.set("prod", "234567890123")

const groupPermissions = new Map<string, Array<[string, string]>>()
groupPermissions.set("group1", [
    [ "dev",    "FooAppDeveloper" ],
    [ "prod",   "FooAppReadOnly" ]
])
groupPermissions.set("group2", [
    [ "dev",    "BarAppDeveloper" ],
    [ "prod",   "BarAppReadOnly" ]
])

export class UsersAndGroupsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // since we're creating constructs inside functions, we need to preserve "this"
    let stack = this

    // user and group creation/assignment go here; both are added to maps for future use

    groupPermissions.forEach(function(roleSpecs, groupName) {
      let assumableRoles = roleSpecs.map(function([accountName, roleName]) {
        // use a default value to avoid explicit existence test ... should always be valid
        let accountId = accountIds.get(accountName) || ""
        return stack.formatArn({
          service:      "iam",
          region:       "",
          account:      accountId,
          resource:     "role",
          resourceName: roleName
        })
      })
      let policyStatement = new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [ "sts:AssumeRole" ],
        resources: assumableRoles
      })
      let group = groupsByName.get(groupName)
      if (group) {
        group.addToPolicy(policyStatement)
      }
    })
  }
}

Conclusions

I think that, in the end, the debate is between Terraform and CDK. Pure CloudFormation is just too verbose for large deployments, and CFNDSL doesn’t provide a significant level of abstraction. Plus, Ruby may be off-putting for organizations that don’t already use it.

CDK is promising, but not ready for prime time. To be fair, this task did not highlight its strength, which is highly abstracted, reusable constructs. But an infrastructure tool has to be able to do the basics, not just the flash. There are also places where it’s just broken, although I was able to work around those for this post. Perhaps in six months I’ll be willing to re-evaluate it.

Terraform remains my go-to for creating infrastructure, even though its declarative nature opposes my imperative nature. It is, however, trying to meet me halfway, with the addition of for expressions in version 0.12 and the promise of more iteration in the future.