AWS Aurora Serverless V2 database doesn’t scale to zero after upgrading from Serverless V1

If you follow the migration guide to upgrade your Aurora Serverless V1 database to V2 (which a bunch of panicked folks are likely doing ahead of the deprecation date of 31st March), you might find that your new cluster doesn’t scale down to zero after the upgrade.

In my case, logical replication (rds.logical_replication) was enabled in a cluster-level parameter group, and that prevented instances from scaling down to zero. Disabling logical replication restored the correct auto-pause behaviour.

The migration guide directs you to enable logical replication so that you can use blue/green deployments. If you don’t need blue/green, or you don’t need logical replication once the upgrade is done, then turning it off and rebooting the instances seems to cure the problem.

How do I know if my cluster is scaling to zero properly?

The easiest way to tell is to look at the Monitoring tab of the RDS page in the AWS Console, and the ACUUtilization metric in particular.

The ACUUtilization metric tells you what proportion of your configured maximum ACUs is in use at any given point – 100% means the cluster is running at the maximum capacity you configured, and 0% means it’s scaled to zero.

If you find that the chart never hits zero, but gets stuck at some specific percentage (in the above, you can see a baseline level of 25% activity) then this indicates that your cluster isn’t auto-pausing. You can determine what your minimum activity level is by dividing your minimum ACUs by your maximum – the above case has a minimum of 0.5 and a maximum of 2, so when running at 0.5 ACUs the database is using 0.5 / 2 = 25% of its capacity.
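The baseline calculation is simple enough to sketch. This is only an illustration of the arithmetic above; the function name is my own and isn’t part of any AWS SDK.

```typescript
// Expected ACUUtilization floor (as a percentage) for a cluster that is
// sitting at its minimum capacity rather than pausing.
function baselineUtilisationPercent(minAcu: number, maxAcu: number): number {
    if (maxAcu <= 0 || minAcu < 0 || minAcu > maxAcu) {
        throw new RangeError("expected 0 <= minAcu <= maxAcu and maxAcu > 0");
    }
    return (minAcu / maxAcu) * 100;
}

// The example from the text: a minimum of 0.5 ACUs and a maximum of 2 ACUs.
console.log(baselineUtilisationPercent(0.5, 2)); // 25
```

If your chart flatlines at exactly this percentage, the cluster is idling at its minimum capacity rather than pausing.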

This might be because you have legitimate work being done in the database though – if you have idle database connections, the cluster will run at least at your minimum ACU level. You should also check the DatabaseConnections metric over the same time period.

If you’re seeing zero connections, but you’re still pinned to some minimum level of ACU utilisation then something might be blocking the cluster’s auto-pause functionality.

How can I tell what’s stopping the scaling to zero?

The easiest way is to look at the instance logs.

  • In the RDS console, choose the instance you’re interested in
  • On the Logs & Events tab, in the Logs section at the bottom of the page find the log named ‘instance/instance.log’
  • You will probably find, every few minutes, a log entry along the lines of:
    [INFO] Auto-pause blockers registered since 2025-03-04T02:39:48.276Z: replication capability configured
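If you’ve downloaded the log and want to pull the blockers out programmatically, something like the following might do. It’s a minimal sketch that assumes every entry follows the shape of the single line shown above; I haven’t seen a formal spec for the format, and the type and function names are my own.

```typescript
interface AutoPauseBlocker {
    since: Date;    // when the blocker was registered
    reason: string; // free-text reason, e.g. "replication capability configured"
}

// Returns null for any line that doesn't look like an auto-pause blocker entry.
function parseAutoPauseBlocker(line: string): AutoPauseBlocker | null {
    const match = /Auto-pause blockers registered since (\S+): (.+)$/.exec(line);
    if (!match) {
        return null;
    }
    const [, timestamp, reason] = match;
    if (!timestamp || !reason) {
        return null;
    }
    const since = new Date(timestamp);
    if (Number.isNaN(since.getTime())) {
        return null;
    }
    return { since, reason: reason.trim() };
}

const entry = parseAutoPauseBlocker(
    "[INFO] Auto-pause blockers registered since 2025-03-04T02:39:48.276Z: replication capability configured"
);
console.log(entry?.reason); // replication capability configured
```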

CDK S3 ‘BucketDeployment’ doesn’t have to be slow – increase its memoryLimit parameter

If you’re deploying a static site to Cloudfront via CDK, you might be using the BucketDeployment construct to combine shipping a folder to S3 and causing a Cloudfront invalidation.

Behind the scenes, BucketDeployment creates a custom resource backed by a Lambda that wraps a call to the AWS CLI’s s3 cp command to move files from the CDK staging area to the target S3 bucket.

While that’s happening within AWS’s infrastructure, the speed of that copy depends very strongly on the amount of resources the Lambda has – just like any other Lambda, CPU and network bandwidth scale with the requested memory limit.

The default memory limit for the custom resource Lambda is 128MiB – which is the smallest Lambda you can get, and accordingly the performance of that copy might be terrible if you have a lot of files, or large files, to transfer.

I’d strongly recommend upping that limit to 2048MiB or higher. This radically improved upload performance on two applications I deploy, with the upload rate going from ~700KiB/s to >10MiB/s – more than a 10x increase.

This has a negligible cost implication as this Lambda only runs during a deployment, so shouldn’t be running all too frequently anyway. However the performance improvement is potentially dramatic for complex apps. We saw one build’s S3 upload step drop from ~280s to ~45s – an 84% reduction in that deployment step’s execution time, and about a 15% reduction in the deployment time of that stack overall – just for changing one parameter.
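For reference, here’s roughly what that looks like. This is a sketch assuming CDK v2 (aws-cdk-lib), with the bucket and distribution passed in from elsewhere; the stack name, the props interface and the ./dist asset path are all placeholders rather than anything from a real project.

```typescript
import { Stack, StackProps } from "aws-cdk-lib";
import { IDistribution } from "aws-cdk-lib/aws-cloudfront";
import { IBucket } from "aws-cdk-lib/aws-s3";
import { BucketDeployment, Source } from "aws-cdk-lib/aws-s3-deployment";
import { Construct } from "constructs";

// Hypothetical props: the bucket and distribution are created elsewhere.
interface StaticSiteDeploymentProps extends StackProps {
    siteBucket: IBucket;
    distribution: IDistribution;
}

export class StaticSiteDeploymentStack extends Stack {
    constructor(scope: Construct, id: string, props: StaticSiteDeploymentProps) {
        super(scope, id, props);

        new BucketDeployment(this, "DeploySite", {
            sources: [Source.asset("./dist")], // placeholder path to the built site
            destinationBucket: props.siteBucket,
            distribution: props.distribution,  // triggers the invalidation
            distributionPaths: ["/*"],
            memoryLimit: 2048,                 // MiB; the default is 128
        });
    }
}
```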

Bucket named ‘cdk-abcd12efg-assets-123456789-eu-west-1’ exists, but not in account 123456789. Wrong account?

When deploying a stack via CDK, you may encounter an error such as

Bucket named 'cdk-abcd12efg-assets-123456789-eu-west-1' exists, but not in account ***. Wrong account?

The most likely culprit here is that the role you’re using to deploy doesn’t have the right permissions on the staging bucket. CDK requires:

  • GetBucketLocation
  • *Object
  • ListBucket

We hit this recently, and the underlying cause was that the IAM role used to deploy the stack had been amended to have a restricted set of permissions per least-privilege best practice. We’d deployed updates to the stack a number of times, but in this instance the particular change we were making required a re-upload of assets to the staging bucket, which uncovered the missing permission.
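As a sketch, the three permissions above could be written as IAM policy statements like this. It assumes only what’s listed above (a real bootstrap setup may need more, for example if the bucket is encrypted with a customer-managed key), and the function name is my own.

```typescript
interface PolicyStatement {
    Effect: "Allow";
    Action: string[];
    Resource: string[];
}

// Builds statements granting the staging-bucket permissions listed above.
// Bucket-level actions target the bucket ARN; object-level actions target its contents.
function stagingBucketStatements(bucketName: string): PolicyStatement[] {
    const bucketArn = `arn:aws:s3:::${bucketName}`;
    return [
        {
            Effect: "Allow",
            Action: ["s3:GetBucketLocation", "s3:ListBucket"],
            Resource: [bucketArn],
        },
        {
            Effect: "Allow",
            Action: ["s3:*Object"],
            Resource: [`${bucketArn}/*`],
        },
    ];
}

console.log(JSON.stringify(stagingBucketStatements("cdk-abcd12efg-assets-123456789-eu-west-1"), null, 2));
```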

Cognito error: “Cannot use IP Address passed into UserContextData”

When using Cognito’s Advanced Security and adaptive authentication features, you need to ship contextual data about the logging-in user via the UserContextData type.

Some of this data is collected via a Javascript snippet. However, you can also ship the user’s IP address (which the snippet cannot collect) in the same payload.

When doing so, you may get an error from Cognito:

“Cannot use IP Address passed into UserContextData”

Unhelpful error from Cognito

This is likely because you’ve not enabled ‘Accept additional user context data‘ on your user pool client – though the error message is pretty opaque.

You can do this in a number of ways:

  • Via the AWS console
  • Via the update-user-pool-client CLI command (the UpdateUserPoolClient API action)
  • Via CDK, if you drop down to the Level 1 construct and set “enablePropagateAdditionalUserContextData: true” on your CfnUserPoolClient

Even the latest L2 constructs for Cognito don’t seem to support setting enablePropagateAdditionalUserContextData when controlling a user pool client via CDK, but using the L1 escape hatch is easy enough:

const cfnUserPoolClient = userPoolClient.node.defaultChild as CfnUserPoolClient;
cfnUserPoolClient.enablePropagateAdditionalUserContextData = true;

GitHub Actions, ternary operators and default values

The ‘type’ metadata on GitHub Actions custom action or workflow inputs is pretty much just documentation – it doesn’t seem to be enforced, at least when it comes to supplying a default value. Just because you’ve claimed an input is a boolean doesn’t make it so.

And worse, it seems that default values get coerced to strings if you use an expression.

At TILLIT we have custom GitHub composite actions to perform various tasks during CI. We recently hit a snag with one roughly structured as follows

name: ...
inputs:
   readonly:
      type: boolean
      default: ${{ some logic here }}

runs:
  using: "composite"
  steps:
    - name: ...
      uses: ...
      with:
        some-property: ...${{ inputs.readonly && 'true-val' || 'false-val' }}...

That mess in the some-property definition is the closest you can get in GitHub Actions to a ternary operator, in the absence of any if-like construct, when you want to format a string based on some boolean.

In our case – the ‘true’ path was the only path ever taken. Diagnostic logging on the action showed that inputs.readonly was ‘false’. Wait, are those quotes?

Of course they are! The default value ended up being set to be a string, even though the input’s default value expression is purely boolean in nature and it’s specified as being a boolean.

The fix, then, is to change our ternary to be very explicit about the comparison being made.

with:
  some-property: ...${{ inputs.readonly == 'true' && 'true-val' || 'false-val' }}...
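
The same trap is easy to reproduce outside GitHub Actions. Here’s a TypeScript model of the two expressions, on the assumption that the input arrives as a string (which is what we observed). TypeScript and GitHub Actions agree that any non-empty string is truthy, so the behaviour carries over; the function names are just for illustration.

```typescript
// Models `inputs.readonly && 'true-val' || 'false-val'`.
function naiveTernary(readonlyInput: string): string {
    // Any non-empty string is truthy, including the string "false".
    return (readonlyInput && "true-val") || "false-val";
}

// Models `inputs.readonly == 'true' && 'true-val' || 'false-val'`.
function explicitTernary(readonlyInput: string): string {
    return (readonlyInput === "true" && "true-val") || "false-val";
}

console.log(naiveTernary("false"));    // true-val  (the bug)
console.log(explicitTernary("false")); // false-val
console.log(explicitTernary("true"));  // true-val
```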

AWS SAM error “[ERROR] (rapid) Init failed error=Runtime exited with error: signal: killed InvokeID=” in VS Code

When debugging a lambda using the AWS Serverless Application Model tooling (the CLI and probably VS Code extensions), you might find that your breakpoint isn’t getting hit and you instead see an error in the debug console:

[ERROR] (rapid) Init failed error=Runtime exited with error: signal: killed InvokeID=

A thing to check is whether you’re running out of RAM or timing out in execution:

  • Open your launch.json file for the workspace
  • In your configuration, under the lambda section, add a specific memoryMb value – in my case 512 got me moving

This is incredibly frustrating because the debug console gives you no indication as to why the emulator terminated your lambda – but also helpful, because you can tell how large you need to specify your lambda when you deploy it ahead of time.

Invalid Request error when creating a Cloudfront response header policy via Cloudformation

I love Cloudformation and CDK, but sometimes neither will show an issue with your template until you actually try to deploy it.

Recently we hit a stumbling block while creating a Cloudfront response header policy for a distribution using CDK. The cdk diff came out looking correct, no issues there – but on deploying we hit an Invalid Request error for the stack.

An error displayed in the Cloudfront 'events' tab, indicating that there was an Invalid Request but giving no further clues
Cloudformation often doesn’t give much additional colour when you hit a stumbling block

The reason? We’d added a temporarily-disabled XSS protection header, but kept in the reporting URL so that when we turned it on it’d be correctly configured. However, Cloudfront rejects the creation of the policy if you spec a reporting URL on a disabled header setup.

The Cloudfront response headers policy docs make it pretty clear this isn’t supported, but Cloudformation can’t validate it for us

A screenshot of a validation error message indicating that X-XSS-Protection cannot contain a Report URI when protection is disabled
Just jumping into the console to try creating the resource by hand is often the most effective debugging technique
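
Since neither CDK nor Cloudformation catches this ahead of time, a small pre-deploy check is one option. A minimal sketch follows: the settings shape is loosely modelled on the XSS protection options and is an assumption on my part, and it only encodes the one rule we actually hit.

```typescript
// Hypothetical shape for the X-XSS-Protection settings of a response headers policy.
interface XssProtectionSettings {
    protection: boolean;
    modeBlock?: boolean;
    reportUri?: string;
}

// Returns a list of problems; an empty list means the one known rule is satisfied.
function xssProtectionProblems(settings: XssProtectionSettings): string[] {
    const problems: string[] = [];
    if (!settings.protection && settings.reportUri !== undefined) {
        problems.push("reportUri must be omitted while protection is disabled");
    }
    return problems;
}

// The configuration that bit us: protection off, reporting URL left in place.
console.log(xssProtectionProblems({ protection: false, reportUri: "https://example.com/report" }));
```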

How to diagnose Invalid Request errors with Cloudformation

A lot of the time the easiest way to diagnose an Invalid Request error when deploying a Cloudformation template is to just do it by hand in the console in a test account, and see what breaks. In this instance, the error was very clear and it was a trivial patch to fix up the Cloudformation template and get ourselves moving.

Unfortunately, Cloudformation often doesn’t give as much context as the console when it comes to validation errors during stack creation – but hand-cranking the affected resource gives you both quicker feedback and a better feel for what the configuration options are and how they hang together.

A rule of thumb is that if you’re getting an Invalid Request back, chances are it’s essentially a validation error on what you’ve asked Cloudformation to deploy. Check the docs, simplify your test case to pinpoint the issue and don’t be afraid to get your hands dirty in the console.

DMARC failures even when AWS SES Custom Mail-From domain used

I was caught out by this, this week, so hopefully future-me will remember quicker how to fix this one.

Scenario

  • You want to get properly configured for DMARC for a domain you’re sending emails from via AWS SES
  • You’ve verified the sender domain as an identity
  • You’ve set up DKIM and SPF
  • You’ve set up a custom MAIL FROM
  • You’re still seeing SPF-related DMARC failures when sending emails

In my case, those failures were caused because I was sending email from a different identity that uses the same domain.

For example, I had ‘example.com’ set up as a verified identity in SES allowing me to send email from any address at that domain, and I configured a sender identity ‘contact@example.com’ to be used by my application to send emails so that I could construct an ARN for use with Cognito or similar.

What isn’t necessarily obvious is that, if you have multiple identities, you need to enable the custom MAIL FROM setting for the sender identity and not just for the domain identity. AWS SES does not fall back to the configuration for the domain identity, so you have to enable custom MAIL FROM individually for each sender identity – even if the configuration is identical.

So in my case, the fix was:

  • Edit the Custom MAIL FROM setting for contact@example.com
  • Enable it to use mail.example.com (which was already configured)
  • Save settings
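
The ‘no fallback’ behaviour can be expressed as a quick audit. This is only a sketch: it assumes you’ve already gathered each identity’s MAIL FROM setting into a plain list, and the SesIdentity shape and function name are hypothetical rather than anything from the SES API.

```typescript
// Hypothetical summary of an SES identity and its custom MAIL FROM domain, if any.
interface SesIdentity {
    identity: string;        // e.g. "example.com" or "contact@example.com"
    mailFromDomain?: string; // e.g. "mail.example.com"
}

// Every identity needs its own setting; nothing is inherited from the domain identity.
function identitiesMissingMailFrom(identities: SesIdentity[]): string[] {
    return identities
        .filter((candidate) => !candidate.mailFromDomain)
        .map((candidate) => candidate.identity);
}

console.log(identitiesMissingMailFrom([
    { identity: "example.com", mailFromDomain: "mail.example.com" },
    { identity: "contact@example.com" },
])); // [ 'contact@example.com' ]
```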

Using an AWS role to authenticate with Google Cloud APIs

I recently had a requirement to securely access a couple of Google Cloud APIs as a service account user, where those calls were being made from a Fargate task running on AWS. The until-relatively-recently way to do this was:

  • Create a service account in the Google Cloud developer console
  • Assign it whatever permissions it needs
  • Create a ‘key’ for the account – in essence a long-lived private key used to authenticate as that service account
  • Use that key in your Cloud SDK calls from your AWS Fargate instance

This isn’t ideal because of that long-lived credential in the form of the ‘key’. It can’t be scoped to require a particular originator, and while you can revoke it from the developer console, you’d need to know it had leaked in order to do so. Until then, a leaked key is an infinitely long-lived token usable from anywhere.

Google’s Workload Identity Federation is the new hotness in that regard, and is supported by almost all of the client libraries now. Not the .NET one though, irritatingly, which is why this post from Johannes Passing is, if you need to do this from .NET-land, absolutely the guide to go to.

The new approach is more in line with modern authentication standards. It uses federation between AWS and Google Cloud to generate short-lived, scoped credentials that are used for the actual work, with no secrets needing to be shared between the two environments.

The docs are broadly excellent, but I was pleased at how clever the AWS <-> Google Cloud integration is given that there isn’t any AWS-supported explicit identity federation actually happening, in the sense of established protocols (like OIDC, which both clouds support in some fashion).

How it works

On the Google Cloud side, you set up a ‘Workload identity pool’ – in essence a collection of external identities that can be given some access to Google Cloud services. Aside from some basic metadata, a pool has one or more ‘providers’ associated with it. A provider represents an external source of identities, for our example here AWS.

A provider can be parameterised:

  • Mappings translate between the incoming assertions from the provider and those of Google Cloud’s IAM system
  • Conditions restrict the identities that can use the identity pool via a rich syntax

You can also attach Google service accounts to the pool, allowing those accounts to be impersonated by identities in the pool. You can restrict access to a given service account via conditions, in a very similar way to restricting access to the pool itself.

To get an access token on behalf of the service account, a few things are happening (in the background for most client libraries, and explicitly in the .NET case).

Authenticating with the pool

In AWS land, we authenticate with the Google pool by asking it to exchange a provider-issued token for one that Google’s STS will recognise. For AWS, the required token is (modulo some encoding and formatting) a signed ‘GetCallerIdentity’ request that you might yourself send to the AWS STS.

Our calling code in AWS-land doesn’t finish the call – we don’t need to. Instead, we sign a request and then pass that signed request to Google which makes the call itself. We include in the request (and the fields that are signed over) the URI of the ‘target resource’ on the Google side – the identity pool that we want to authenticate to.

The response from AWS to Google’s call to the STS will include the ARN of the identity for whom credentials on the AWS side are available. If you’re running in ECS or EC2, these will represent the IAM role of the executing task.

We need share nothing secret with Google to do this, and we can’t fake an identity on AWS that we don’t have access to.

  • The ARN of the identity returned in the response to GetCallerIdentity includes the AWS account ID and the name of any assumed role – the only thing we could ship to Google is proof of an identity that we already have access to on the AWS side.
  • The Google workload identity pool identifier is signed over in the GetCallerIdentity request, so the token we send to Google can only be used for that specific identity pool (and Google can verify that, again with no secrets involved). This means we can’t accidentally ship a token to the wrong pool on the Google side.
  • The signature can be verified without access to any secret information by just making the request to the AWS STS. If the signature is valid, Google will receive an identity ARN, and if the payload has been tampered with or is otherwise invalid then the request will fail.

None of the above requires any cooperation between AWS and Google cloud, save for AWS not changing ARN formats and breaking identity pool conditions and mappings.
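
To make the ‘modulo some encoding and formatting’ part a little more concrete, here’s a sketch of packaging an already-signed request into a token. The SigV4 signing itself is deliberately left out. The JSON field names and the x-goog-cloud-target-resource header follow my understanding of Google’s documentation and should be checked against it; the function and type names are my own.

```typescript
interface SubjectTokenHeader {
    key: string;
    value: string;
}

interface AwsSubjectToken {
    url: string;
    method: string;
    headers: SubjectTokenHeader[];
}

// Assumed name of the signed header that carries the Google-side target resource.
const TARGET_RESOURCE_HEADER = "x-goog-cloud-target-resource";

// Packages the URL and headers of an already-signed GetCallerIdentity request.
// Refuses to build a token unless the expected pool provider was signed over.
function buildAwsSubjectToken(
    signedUrl: string,
    signedHeaders: Record<string, string>,
    expectedTargetResource: string
): string {
    const target = Object.entries(signedHeaders)
        .find(([key]) => key.toLowerCase() === TARGET_RESOURCE_HEADER);
    if (!target || target[1] !== expectedTargetResource) {
        throw new Error("signed request does not name the expected target resource");
    }
    const token: AwsSubjectToken = {
        url: signedUrl,
        method: "POST",
        headers: Object.entries(signedHeaders).map(([key, value]) => ({ key, value })),
    };
    return encodeURIComponent(JSON.stringify(token));
}
```

The check at the top mirrors the second bullet above: if the pool identifier wasn’t part of what was signed, there’s no point sending the token anywhere.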

What happens next?

All being well, the Google STS returns to us a temporary access token that we can then use to generate a real, scoped access token to use with Google APIs. That token can be nice and short lived, restricting the window over which it can be abused should it be leaked.

What about for long-lived processes?

Our tokens can expire in a couple of directions:

  • Our AWS credentials can and will expire and get rolled over automatically by AWS (when not using explicit access key IDs and just using the profile we’re assuming from the execution role of the environment)
  • Our short-lived Google service account credential can expire

Both are fine and handled the same way – re-run the whole process. Signing a new GetCallerIdentity request is quick, trivial and happens locally on the source machine. And Google just has to make one API call to establish that we’re still who we said we were and offer up a temporary token to exchange for a service account identity.
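
In code, ‘re-run the whole process’ boils down to a small cache around the exchange. This is a minimal sketch, not how any particular client library does it: the exchange function stands in for the sign, swap and impersonate steps described above, and all of the names are hypothetical.

```typescript
interface ExpiringToken {
    value: string;
    expiresAtMs: number; // epoch milliseconds
}

class RefreshingTokenSource {
    private cached: ExpiringToken | undefined;
    private readonly exchange: () => Promise<ExpiringToken>;
    private readonly now: () => number;
    private readonly skewMs: number;

    constructor(
        exchange: () => Promise<ExpiringToken>, // runs the full AWS -> Google exchange
        now: () => number = () => Date.now(),
        skewMs: number = 60_000                 // refresh a little before expiry
    ) {
        this.exchange = exchange;
        this.now = now;
        this.skewMs = skewMs;
    }

    async getToken(): Promise<string> {
        const stale = this.cached === undefined
            || this.cached.expiresAtMs - this.skewMs <= this.now();
        if (stale) {
            // Whichever credential expired, the remedy is the same: start over.
            this.cached = await this.exchange();
        }
        return (this.cached as ExpiringToken).value;
    }
}
```

Injecting the clock is purely for testability; in normal use you’d only pass the exchange function.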

Creating a Route 53 Public Hosted Zone with a reusable delegation set ID in CDK

What’s a reusable delegation set anyway?

When you create a Route 53 public hosted zone, four DNS nameservers are allocated to the zone. You then use these name servers with your domain registrar to delegate DNS resolution to Route 53 for your domain.

However: each time you re-create a Route 53 hosted zone, the DNS nameservers allocated will change. If you’re using CloudFormation to manage your public hosted zone this means a destroy and recreate breaks your domain’s name resolution until you manually update your registrar’s records with the new combination of nameservers.

Route 53 reusable delegation sets are stable collections of Route 53 nameservers that you can create once and then reference when creating a public hosted zone. That zone will now have a fixed set of nameservers, regardless of how often it’s destroyed and recreated.

Shame it’s not in CloudFormation

There’s a problem though. You can only create Route 53 reusable delegation sets using the AWS CLI or the AWS API. There’s no CloudFormation resource that represents it (yet).

Worse, you can’t even reference an existing, manually-created delegation set using CloudFormation. Again, you can only do it by creating your public hosted zone using the CLI or API.

The AWS CloudFormation documentation makes reference to a ‘DelegationSetId’ element that doesn’t actually exist on the Route53::HostedZone resource. Nor is the element mentioned anywhere else in that article or any SDK. I’ve opened a documentation bug for that. Hopefully its presence indicates that we’re getting an enhancement to the Route53::HostedZone resource some time soon…

So how can we achieve our goal of defining a Route 53 public hosted zone in code, while still letting it reference a delegation set ID?

Enter CDK and AwsCustomResource

CDK generates CloudFormation templates from code. I tend to use TypeScript when building CDK stacks. On the face of it, CDK doesn’t help us: if we can’t do something by hand-cranking some CloudFormation, surely CDK can’t do it either.

Not so. CDK also exposes the AwsCustomResource construct that lets us call arbitrary AWS APIs as part of a CloudFormation deployment. It does this via some dynamic creation of Lambdas and other trickery. The upshot is that if it’s in the JavaScript SDK, you can call it as part of a CDK stack with very little extra work.

Let’s assume that we have an existing delegation set whose ID we know, and we want to create a public hosted zone linked to that delegation set. Wouldn’t it be great to be able to write something like:

new PublicHostedZoneWithReusableDelegationSet(this, "PublicHostedZone", {
    zoneName:  `whatever.example.com`,
    delegationSetId: "N05_more_alphanum_here_K"
 // Probably pulled from CI/CD
});

Well we can! Again in TypeScript, and you’ll need to reference the @aws-cdk/custom-resources package:

import { IPublicHostedZone, PublicHostedZone, PublicHostedZoneProps } from "@aws-cdk/aws-route53";
import { Construct, Fn, Names } from "@aws-cdk/core";
import {
    AwsCustomResource,
    AwsCustomResourcePolicy,
    PhysicalResourceId,
    PhysicalResourceIdReference
} from "@aws-cdk/custom-resources";

export interface PublicHostedZoneWithReusableDelegationSetProps extends PublicHostedZoneProps {
    delegationSetId: string
}

export class PublicHostedZoneWithReusableDelegationSet extends Construct {
    private publicHostedZone: AwsCustomResource;
    private hostedZoneName: string;

    constructor(scope: Construct, id: string, props: PublicHostedZoneWithReusableDelegationSetProps) {
        super(scope, id);

        this.hostedZoneName = props.zoneName;

        const normaliseId = (id: string) => id.split("/").slice(-1)[0];
        const normalisedDelegationSetId = normaliseId(props.delegationSetId);

        this.publicHostedZone = new AwsCustomResource(this, "CreatePublicHostedZone", {
            onCreate: {
                service: "Route53",
                action: "createHostedZone",
                parameters: {
                    "CallerReference": Names.uniqueId(this),
                    "Name": this.hostedZoneName,
                    "DelegationSetId": normalisedDelegationSetId,
                    "HostedZoneConfig": {
                        "Comment": props.comment,
                        "PrivateZone": false
                    }
                },
                physicalResourceId: PhysicalResourceId.fromResponse("HostedZone.Id")
            },
            onUpdate: {
                service: "Route53",
                action: "getHostedZone",
                parameters: {
                    Id: new PhysicalResourceIdReference()
                },
                physicalResourceId: PhysicalResourceId.fromResponse("HostedZone.Id")
            },
            onDelete: {
                service: "Route53",
                action: "deleteHostedZone",
                parameters: {
                    "Id": new PhysicalResourceIdReference()
                }
            },
            policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE })
        });
    }

    asPublicHostedZone() : IPublicHostedZone {
        return PublicHostedZone.fromHostedZoneAttributes(this, "CreatedPublicHostedZone", {
            hostedZoneId: Fn.select(2, Fn.split("/", this.publicHostedZone.getResponseField("HostedZone.Id"))),
            zoneName: this.hostedZoneName
        });
    }
}

Note: thanks to Hugh Evans for patching a bug in this where the CallerReference wasn’t adequately unique to support a destroy and re-deploy

How does it work?

The tricky bits of the process are handled entirely by CDK – all we’re doing is telling CDK that when we create a ‘PublicHostedZoneWithReusableDelegationSet‘ construct, we want it to call the Route53::createHostedZone API endpoint and supply the given DelegationSetId.

On creation we track the returned Id of the new hosted zone (which will be of the form ‘/hostedzone/the-hosted-zone-id’).
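
That path-like form is why the construct does some string slicing in two places. Here’s a plain TypeScript illustration of both: the first function mirrors normaliseId, and the second mirrors what Fn.select(2, Fn.split("/", ...)) does at deploy time. The function names are my own, and the "/delegationset/" prefix in the example is what I’d expect the API to return, so treat it as an assumption.

```typescript
// Mirrors normaliseId: accept either a bare ID or a path-like one.
function lastPathSegment(id: string): string {
    const parts = id.split("/");
    return parts[parts.length - 1] ?? id;
}

// Mirrors Fn.select(2, Fn.split("/", id)): splitting "/hostedzone/abc" on "/"
// gives ["", "hostedzone", "abc"], so the ID sits at index 2.
function hostedZoneIdFromResponse(responseId: string): string | undefined {
    return responseId.split("/")[2];
}

console.log(lastPathSegment("/delegationset/N05_more_alphanum_here_K")); // N05_more_alphanum_here_K
console.log(lastPathSegment("N05_more_alphanum_here_K"));                // N05_more_alphanum_here_K
console.log(hostedZoneIdFromResponse("/hostedzone/the-hosted-zone-id")); // the-hosted-zone-id
```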

The above resource doesn’t support updates properly, but you can extend it as you wish. And the interface for PublicHostedZoneWithReusableDelegationSet is exactly the same as the standard PublicHostedZone, just with an extra property to supply the DelegationSetId – you can just drop in the new type for the old when needed.

When you want to reference the newly created PublicHostedZone, there’s the asPublicHostedZone method which you can use in downstream constructs.