Using CodeArtifact with Poetry


In my last post, I showed how you could reference local project directories with Poetry, and said that the “correct solution” was to use an artifact repository such as CodeArtifact. This post is all about that solution.

CodeArtifact

CodeArtifact is an artifact server for Java, .NET, npm (JavaScript/NodeJS), and Python. If you’re not familiar with artifact servers, the basic idea is that you publish your company’s private libraries to the server, and then retrieve them in other projects. Most artifact servers also let you cache packages from “upstream” servers such as Maven Central or PyPI. Not only does this make you a good neighbor to those servers by reducing their load, it also means that your development team won’t grind to a halt if those servers become unavailable.

When working with CodeArtifact, there are three terms you should know:

  • A package (aka artifact) is a file stored in the repository. For Python, these might be “wheel” archives; for Java, JARs and WARs.
  • A repository holds packages, either locally published or retrieved from an upstream server. Each repository holds artifacts for a single language (Python, Java, …), but you can have multiple repositories to support different languages.
  • A domain is a collection of repositories. You can use the same CodeArtifact domain to serve artifacts for different languages.

Your build scripts interact with individual repositories, using an HTTPS URL that identifies the domain, account, and repository. For example:

https://example-123456789012.d.codeartifact.us-east-1.amazonaws.com/pypi/python/simple/
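
As a minimal sketch, the URL above can be assembled from its parts. The function name is my own, and the pattern is inferred from the example URL rather than taken from a specification; the aws codeartifact get-repository-endpoint command remains the authoritative source.

```python
def pypi_index_url(domain: str, owner: str, region: str, repository: str) -> str:
    """Assemble the "simple" index URL for a CodeArtifact Python repository.

    Assumption: the host follows the <domain>-<owner>.d.codeartifact.<region>
    pattern seen in the example URL above.
    """
    host = f"{domain}-{owner}.d.codeartifact.{region}.amazonaws.com"
    return f"https://{host}/pypi/{repository}/simple/"


print(pypi_index_url("example", "123456789012", "us-east-1", "python"))
```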

My preferred structure is to have a single “front-end” repository that your build scripts access to download artifacts. Behind it are repositories for locally-produced artifacts and artifacts that have been downloaded from PyPI:

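[Figure: Python Repository Hierarchy, showing the python front-end repository with the python-local and python-pypi repositories behind it]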

You can add other “upstream” repositories. For example, you might create a repository named python-thirdparty, to which you upload non-public packages from a third-party provider. By adding it as an upstream for your python front-end, you can transparently use those packages while keeping them separate from your internal packages or those downloaded from PyPI. Or, if the third-party provider has a public repository, you can add it as an upstream and cache those packages (this happens a lot in the Java world, less often with Python).

Authentication and Authorization

CodeArtifact requires all operations to be authenticated, and uses a combination of identity-based and resource-based policies to control access. The identity-based policies control the actions that a user or assumed role can perform, while the resource-based policies control what a repository or domain allows (and are typically used to manage cross-account access). As with other services that combine identity- and resource-based policies, cross-account access requires both policies to allow an action, so the actual permissions of a user or assumed role are the intersection of the two.
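
The intersection can be modeled with a few lines of Python. This is a deliberately simplified sketch: it ignores explicit denies, conditions, and resource matching, and the action lists are hypothetical examples rather than complete policies.

```python
from fnmatch import fnmatchcase


def allows(action_patterns, action):
    """Return True if any pattern (with IAM-style * wildcards) matches the action."""
    return any(fnmatchcase(action, pattern) for pattern in action_patterns)


def cross_account_allowed(identity_actions, resource_actions, action):
    """Simplified model: the action must be allowed by BOTH policies."""
    return allows(identity_actions, action) and allows(resource_actions, action)


# Hypothetical action lists, loosely based on the policies in this post.
identity = ["codeartifact:List*", "codeartifact:ReadFromRepository"]
resource = ["codeartifact:ReadFromRepository", "codeartifact:PublishPackageVersion"]

for action in (
    "codeartifact:ReadFromRepository",     # allowed by both policies
    "codeartifact:PublishPackageVersion",  # allowed by the resource policy only
    "codeartifact:ListPackages",           # allowed by the identity policy only
):
    print(action, cross_account_allowed(identity, resource, action))
```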

Resource-based policies

By default, the account owner has access to the domains and repositories in that account, so if you’re using CodeArtifact with a single account you can skip this section.

For cross-account access, you need to attach policies to the domain and its repositories that grant access to users from those accounts. The domain-level policy must allow the codeartifact:GetAuthorizationToken permission; it may include other permissions that allow users to learn about the repository. For example, to grant access to account 234567890123:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Principal": {
                "AWS": "arn:aws:iam::234567890123:root"
            },
            "Effect": "Allow",
            "Action": [
                "codeartifact:DescribeDomain",
                "codeartifact:GetAuthorizationToken",
                "codeartifact:ListRepositoriesInDomain"
            ],
            "Resource": "*"
        }
    ]
}
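
If you need to grant access to several accounts, it may be easier to generate the policy than to edit it by hand. The following is a sketch under that assumption: the function name is mine, and the second account ID is a made-up placeholder.

```python
import json


def domain_policy(account_ids):
    """Build a domain policy document that grants token access to the given accounts."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {
                    # One root principal per trusted account.
                    "AWS": [f"arn:aws:iam::{acct}:root" for acct in account_ids]
                },
                "Effect": "Allow",
                "Action": [
                    "codeartifact:DescribeDomain",
                    "codeartifact:GetAuthorizationToken",
                    "codeartifact:ListRepositoriesInDomain",
                ],
                "Resource": "*",
            }
        ],
    }


# The second account ID is a placeholder, not a real account.
print(json.dumps(domain_policy(["234567890123", "345678901234"]), indent=4))
```

You could save the output to a file and attach it in the Console, or with the CLI’s put-domain-permissions-policy command.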

Since I’ve separated the repositories for readers (python) and writers (python-local), I need policies for each. The python-pypi repository doesn’t need a policy because it’s not accessed directly.

The following policy is appropriate for the “front-end” repository: it lets other accounts get information about the repository and download packages.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Principal": {
                "AWS": "arn:aws:iam::234567890123:root"
            },
            "Effect": "Allow",
            "Action": [
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:Get*",
                "codeartifact:List*",
                "codeartifact:ReadFromRepository"
            ],
            "Resource": "*"
        }
    ]
}

The following policy grants write-only access to the python-local repository. Note that this policy does not grant read permission: as I said above, I want all reads to go through the python repository.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Principal": {
                "AWS": "arn:aws:iam::234567890123:root"
            },
            "Effect": "Allow",
            "Action": [
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:PublishPackageVersion",
                "codeartifact:PutPackageMetadata"
            ],
            "Resource": "*"
        }
    ]
}

Note: all of these policies use a wildcard (“*”) resource. Resource is a required component of a policy (at least when editing in the Console), so you have to put something there. But since the policy is attached to a single domain/repository the wildcard does not grant permissions on other domains/repositories in the account.

Identity-based policies

Regardless of whether you’re accessing CodeArtifact in the same account or a different account, you need to attach identity-based policies to the users or assumed roles that will access the repository. In the examples below, the domain is owned by account 123456789012, and I explicitly list the repositories/domain that the policy grants access to.

The first policy is the one that a developer would use: it grants read-only access to the “front-end” repository:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:GetServiceBearerToken"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "codeartifact:DescribeDomain",
                "codeartifact:GetAuthorizationToken",
                "codeartifact:ListRepositoriesInDomain"
            ],
            "Resource": "arn:aws:codeartifact:us-east-1:123456789012:domain/example"
        },
        {
            "Effect": "Allow",
            "Action": [
                "codeartifact:Describe*",
                "codeartifact:List*",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ReadFromRepository"
            ],
            "Resource": [
                "arn:aws:codeartifact:us-east-1:123456789012:repository/example/python",
                "arn:aws:codeartifact:us-east-1:123456789012:repository/example/python/*",
                "arn:aws:codeartifact:us-east-1:123456789012:package/example/python/*"
            ]
        }
    ]
}

This is a fairly complex policy, so I’ll break it down by statement:

  • The first statement grants sts:GetServiceBearerToken, which is required as part of the authentication process. As of this writing, the documentation for bearer tokens indicates that the statement should specify an explicit resource, but this is inaccurate and causes an error if you follow it. I’ve reported the problem, so the documentation may change.
  • The second statement grants permissions to the domain. This mirrors the domain’s resource-based policy.
  • The third statement grants permissions to access artifacts in a single repository. The permissions require three different resource specifications for the same repository, to identify the repository itself and the packages within it.
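
The three resource specifications in the third statement follow a regular pattern, so they can be generated. This is only a sketch; the helper name is hypothetical and the patterns are copied from the policy above.

```python
def repository_resources(region, owner, domain, repository):
    """Return the three ARN patterns that the reader policy lists for one repository."""
    prefix = f"arn:aws:codeartifact:{region}:{owner}"
    return [
        f"{prefix}:repository/{domain}/{repository}",    # the repository itself
        f"{prefix}:repository/{domain}/{repository}/*",  # paths beneath the repository
        f"{prefix}:package/{domain}/{repository}/*",     # packages in the repository
    ]


for arn in repository_resources("us-east-1", "123456789012", "example", "python"):
    print(arn)
```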

Next up is the writer policy. This is normally granted to a CI/CD user or role, in addition to the reader policy above. As with the reader policy, I limit writes to a single repository (and I don’t allow direct reads on that repository!).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:PublishPackageVersion",
                "codeartifact:PutPackageMetadata"
            ],
            "Resource": [
                "arn:aws:codeartifact:us-east-1:123456789012:repository/example/python-local",
                "arn:aws:codeartifact:us-east-1:123456789012:repository/example/python-local/*",
                "arn:aws:codeartifact:us-east-1:123456789012:package/example/python-local/*"
            ]
        }
    ]
}

Access Token

With these policies in place, you can now access your repository. CodeArtifact uses HTTP Basic Auth to download and publish packages. The username is aws, and the password is an expiring token that you retrieve via one of the following command-line programs:

  1. aws codeartifact login is intended to be used with supported tools (pip and twine), and updates the tool’s configuration files. It also has a “dry run” mode that I find more useful, as you’ll see below.
  2. aws codeartifact get-authorization-token returns just the authorization token, in a JSON payload that you can transform with the --query CLI parameter.

Whichever way you retrieve the token, it remains valid for 12 hours by default.
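
To make the Basic Auth scheme concrete, here is a sketch of the Authorization header that a client would send. The token value is a placeholder, and in practice pip and Poetry build this header for you.

```python
import base64


def basic_auth_header(token, username="aws"):
    """Build an HTTP Basic Authorization header value from the user name and token."""
    credentials = f"{username}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# "not-a-real-token" stands in for the value returned by get-authorization-token.
header = basic_auth_header("not-a-real-token")
print(header)
```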

Downloading artifacts with pip

While the title of this article is “Using CodeArtifact with Poetry,” I still use pip to build Lambda deployment bundles, so it needs to be able to pull from the repository. Fortunately, it’s easy to tell pip where to look for packages: you set the global.index-url configuration property to the repository’s URL.

While the aws codeartifact login command will update your configuration file, the expiring token means that, in practice, you have to run it on every build, which makes me worry about concurrent builds trying to update the file.

Instead, I prefer passing the repository URL on the pip command-line, using the --index-url option. This is where I use the codeartifact login “dry run” mode, combined with awk to extract the URL:

REPO_URL=$(aws codeartifact login --tool pip --domain-owner 123456789012 --domain example --repository python --dry-run | awk '{print $5}')
pip install --index-url "$REPO_URL" ## other options
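
If your build is driven from Python rather than a shell script, you can build the same kind of URL yourself from the repository URL and a token. This is a sketch that assumes the credentials-in-URL form (https://aws:TOKEN@host/...), which is what the awk command above extracts; the function name and token value are placeholders.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def authenticated_index_url(index_url, token, username="aws"):
    """Embed Basic Auth credentials in an index URL for use with pip --index-url."""
    parts = urlsplit(index_url)
    # Percent-encode the token in case it contains characters that are unsafe in a URL.
    netloc = f"{username}:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


repo_url = authenticated_index_url(
    "https://example-123456789012.d.codeartifact.us-east-1.amazonaws.com/pypi/python/simple/",
    "not-a-real-token",
)
print(repo_url)
```

The result can then be passed to pip as a single argument, for example via subprocess.run(["pip", "install", "--index-url", repo_url, ...]).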

Downloading artifacts with Poetry

Poetry makes things a little more difficult: you must provide the repository URL in pyproject.toml, and then use a configuration file or environment variables (my preference) to pass the login credentials.

Here’s an example configuration: the domain name is “example”, the account ID is “123456789012”, and the repository name is “python”. You can either substitute the values for your domain, or use the CLI to retrieve the URL for your repository. Beware that the simple/ on the end is important: without it, Poetry won’t find your packages!

[[tool.poetry.source]]
name = "frontend"
url = "https://example-123456789012.d.codeartifact.us-east-1.amazonaws.com/pypi/python/simple/"

To provide the credentials for this repository, set the following two environment variables. Note that the variable names include the source name from pyproject.toml (“frontend”), converted to uppercase.

export POETRY_HTTP_BASIC_FRONTEND_USERNAME=aws
export POETRY_HTTP_BASIC_FRONTEND_PASSWORD=$(aws codeartifact get-authorization-token --domain-owner 123456789012 --domain example --query 'authorizationToken' --output text)
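
The naming rule can be sketched as a small helper. The function is hypothetical, and the replacement of dots and dashes with underscores reflects my understanding of Poetry’s environment-variable conventions rather than anything specific to CodeArtifact.

```python
import re


def poetry_http_basic_env(source_name, token, username="aws"):
    """Return the credential environment variables for a named Poetry source."""
    # Assumption: Poetry expects the upper-cased name, with "." and "-" replaced by "_".
    key = re.sub(r"[.-]", "_", source_name).upper()
    return {
        f"POETRY_HTTP_BASIC_{key}_USERNAME": username,
        f"POETRY_HTTP_BASIC_{key}_PASSWORD": token,
    }


# "not-a-real-token" stands in for the value returned by get-authorization-token.
for name, value in poetry_http_basic_env("frontend", "not-a-real-token").items():
    print(f"export {name}={value}")
```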

Publishing artifacts with Poetry

Poetry lets you provide the name of a repository when publishing:

poetry publish -r PUBLISH

As with downloading, you configure the repository credentials with environment variables. Unlike downloading, you can also specify the repository URL with an environment variable:

export POETRY_REPOSITORIES_PUBLISH_URL=$(aws codeartifact get-repository-endpoint --domain-owner 123456789012 --domain example --repository python-local --format pypi --query 'repositoryEndpoint' --output text)
export POETRY_HTTP_BASIC_PUBLISH_USERNAME=aws
export POETRY_HTTP_BASIC_PUBLISH_PASSWORD=$(aws codeartifact get-authorization-token --domain-owner 123456789012 --domain example --query 'authorizationToken' --output text)
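
If you’d rather not export the token into your whole shell session, one option is to build the environment for a single poetry publish invocation. This is a sketch: the helper name is mine, and the endpoint and token values are placeholders for what the two CLI commands above return.

```python
import os


def poetry_publish_env(repo_name, endpoint, token, username="aws"):
    """Return the three environment variables Poetry reads for a publish repository."""
    key = repo_name.upper()
    return {
        f"POETRY_REPOSITORIES_{key}_URL": endpoint,
        f"POETRY_HTTP_BASIC_{key}_USERNAME": username,
        f"POETRY_HTTP_BASIC_{key}_PASSWORD": token,
    }


publish_vars = poetry_publish_env(
    "publish",
    # Placeholder endpoint; use the output of get-repository-endpoint instead.
    "https://example-123456789012.d.codeartifact.us-east-1.amazonaws.com/pypi/python-local/",
    "not-a-real-token",
)

# Merge with the current environment so that PATH and friends are preserved.
env = {**os.environ, **publish_vars}
# Then, for example: subprocess.run(["poetry", "publish", "-r", "PUBLISH"], env=env, check=True)
print(sorted(publish_vars))
```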

Beware: CodeArtifact does not allow you to overwrite artifacts. If you want to republish an artifact you’ll need to go into the repository in the Console, and delete that artifact from both the python and python-local repositories.

If you’re actively developing a shared library together with the code that uses it, I recommend using a development dependency as I showed in my last post, only switching to a “mainline” dependency once the shared library version is fully tested and ready for publication.

Limitations and Caveats

While CodeArtifact and Poetry work well together, all is not sunshine and roses. Here are a few of the issues that I found.

Package Search

CodeArtifact provides a very limited subset of Python repository commands. In particular, it doesn’t support search, instead delegating to PyPI.

This means that you can’t use dependency ranges with private packages, since Poetry has no way to discover which versions of those packages are in your private repositories. So if you have a shared library that gets regular updates, you will need to go through the projects that use that library and explicitly update them.

Cost

Downloading directly from PyPI is “free” (more correctly, somebody else pays for it). CodeArtifact, like other AWS services, is “pay as you go,” and there are many components to your total bill. Of these, I think data transfer is the most likely to cause pain: you pay (for US regions) $0.09 per GB for everything you download, and if you’re running your builds inside a VPC, you’ll also pay NAT charges (because CodeArtifact runs outside of the VPC).
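
For a rough sense of scale, here is a back-of-the-envelope calculation of the data transfer charge. The build size and build count are invented for illustration, and the estimate covers only the $0.09 per GB transfer charge, not NAT or any other component of the bill.

```python
def monthly_transfer_cost(gb_per_build, builds_per_month, price_per_gb=0.09):
    """Estimate the monthly data transfer charge in US dollars."""
    return gb_per_build * builds_per_month * price_per_gb


# Hypothetical workload: 0.5 GB of dependencies per build, 1,000 builds per month.
cost = monthly_transfer_cost(0.5, 1000)
print(f"${cost:.2f} per month")
```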

I don’t think that these charges will be excessive for a typical development organization, unless you’re constantly running builds with lots of large dependencies.

However, you should definitely check your bill for the first couple of months, and take advantage of whatever caching capabilities your build platform provides.

Wrapping Up

There are alternatives to CodeArtifact. In the past, I’ve used Sonatype Nexus and Artifactory. These products have been around longer than CodeArtifact, so they provide more features (Nexus, in particular, gives you more choices for artifact types, including Docker images and arbitrary files).

The question of whether to use CodeArtifact or one of these alternatives depends on how much time you have to manage your infrastructure. It takes about the same time to bring up a CodeArtifact domain as it does to bring up a Nexus server: an hour or two. You have to make the same changes to your build for both. The primary benefit in my opinion is that CodeArtifact is a managed service: once you configure your domain, “it just works.” By comparison, with a self-hosted Nexus repository you have to take steps to ensure that the repository is regularly backed up, and be prepared to restore from backups if something happens.

Whichever path you choose, I think a local repository is a “must-have” for any professional development team. It simplifies your builds, and protects you in case something happens to the central repository (such as someone deciding to unpublish a popular package).

 


 
