Creating a Route 53 Public Hosted Zone with a reusable delegation set ID in CDK

What’s a reuable delegation set anyway?

When you create a Route 53 public hosted zone, four DNS nameservers are allocated to the zone. You then use these name servers with your domain registrar to delegate DNS resolution to Route 53 for your domain.

However: each time you re-create a Route 53 hosted zone, the DNS nameservers allocated will change. If you’re using CloudFormation to manage your public hosted zone this means a destroy and recreate breaks your domain’s name resolution until you manually update your registrar’s records with the new combination of nameservers.

Route 53 reusable delegation sets are stable collections of Route 53 nameservers that you can create once and then reference when creating a public hosted zone. That zone will now have a fixed set of nameservers, regardless of how often it’s destroyed and recreated.

Shame it’s not in CloudFormation

There’s a problem though. You can only create route 53 reusable delegation sets using the AWS CLI or the AWS API. There’s no CloudFormation resource that represents it (yet).

Worse, you can’t even reference an existing, manually-created delegation set using CloudFormation. Again, you can only do it by creating your public hosted zone using the CLI or API.

The AWS CloudFormation documentation makes reference to a ‘DelegationSetId’ element that doesn’t actually exist on the Route53::HostedZone resource. Nor is the element mentioned anywhere else in that article or any SDK. I’ve opened a documentation bug for that. Hopefully its presence indicates that we’re getting an enhancement to the Route53::HostedZone resource some time soon…

So how can we achieve our goal of defining a Route 53 public hosted zone in code, while still letting it reference a delegation set ID?

Enter CDK and AwsCustomResource

CDK generates CloudFormation templates from code. I tend to use TypeScript when building CDK stacks. On the face of it, CDK doesn’t help us as if we can’t do something by hand-cranking some CloudFormation, surely CDK can’t do it either.

Not so. CDK also exposes the AwsCustomResource construct that lets us call arbitrary AWS APIs as part of a CloudFormation deployment. It does this via some dynamic creation of Lambdas and other trickery. The upshot is that if it’s in the JavaScript SDK, you can call it as part of a CDK stack with very little extra work.

Let’s assume that we have an existing delegation set whose ID we know, and we want to create a public hosted zone linked to that delegation set. Wouldn’t it be great to be able to write something like:

new PublicHostedZoneWithReusableDelegationSet(this, "PublicHostedZone", {
    zoneName:  ``,
    delegationSetId: "N05_more_alphanum_here_K"
 // Probably pulled from CI/CD

Well we can! Again in TypeScript, and you’ll need to reference the @aws-cdk/custom-resources package:

import { IPublicHostedZone, PublicHostedZone, PublicHostedZoneProps } from "@aws-cdk/aws-route53";
import { Construct, Fn } from "@aws-cdk/core";
import { PhysicalResourceId } from "@aws-cdk/custom-resources";
import { AwsCustomResource, AwsCustomResourcePolicy } from "@aws-cdk/custom-resources";

export interface PublicHostedZoneWithReusableDelegationSetProps extends PublicHostedZoneProps {
    delegationSetId: string

export class PublicHostedZoneWithReusableDelegationSet extends Construct {
    private publicHostedZone: AwsCustomResource;
    private hostedZoneName: string;

    constructor(scope: Construct, id: string, props: PublicHostedZoneWithReusableDelegationSetProps) {
        super(scope, id);

        this.hostedZoneName = props.zoneName;

        const normaliseId = (id: string) => id.split("/").slice(-1)[0];
        const normalisedDelegationSetId = normaliseId(props.delegationSetId);

        this.publicHostedZone = new AwsCustomResource(this, "CreatePublicHostedZone", {
            onCreate: {
                service: "Route53",
                action: "createHostedZone",
                parameters: {
                    "CallerReference": id + "-" + this.hostedZoneName,
                    "Name": this.hostedZoneName,
                    "DelegationSetId": normalisedDelegationSetId,
                    "HostedZoneConfig": {
                        "Comment": props.comment,
                        "PrivateZone": false
                physicalResourceId: PhysicalResourceId.fromResponse("HostedZone.Id")
            policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE })

        new AwsCustomResource(this, "DeletePublicHostedZone", {
            onDelete: {
                service: "Route53",
                action: "deleteHostedZone",
                parameters: {
                    "Id": this.publicHostedZone.getResponseField("HostedZone.Id")
            policy: AwsCustomResourcePolicy.fromSdkCalls({ resources: AwsCustomResourcePolicy.ANY_RESOURCE })

    asPublicHostedZone() : IPublicHostedZone {
        return PublicHostedZone.fromHostedZoneAttributes(this, "CreatedPublicHostedZone", {
            hostedZoneId:, Fn.split("/", this.publicHostedZone.getResponseField("HostedZone.Id"))),
            zoneName: this.hostedZoneName

How does it work?

The tricky bits of the process are handled entirely by CDK – all we’re doing is telling CDK that when we create a ‘PublicHostedZoneWithReusableDelegationSet‘ construct, we want it to call the Route53::createHostedZone API endpoint and supply the given DelegationSetId.

On creation we track the returned Id of the new hosted zone (which will be of the form ‘/hostedzone/the-hosted-zone-id’).

We need to specify the deletion action as a separate resource so that we can grab a reference to the created hosted zone ID, working around a limitation of CDK.

The above resource doesn’t support updates, but you can extend it as you wish. And the interface for PublicHostedZoneWithReusableDelegationSet is exactly the same as the standard PublicHostedZone, just with an extra property to supply the DelegationSetId – you can just drop in the new type for the old when needed.

When you want to reference the newly created PublicHostedZone, there’s the asPublicHostedZone method which you can use in downstream constructs.

How to (not) do depth-first search in Neo4j

I found a Stack Overflow question with no answers that seemed like it should be straightforward – how can you traverse a tree-like structure in depth-first order. The problem had a couple of features:

  • Each node had an order property that described the order in which sibling nodes should be traversed
  • Each node was connected to its parent via a PART_OF relationship

A depth-first traversal of a tree is pretty easy to understand.

Whenever we find a node with children, we choose the first and explore as deep into the tree as we can until we can’t go any further. Next we step up one level and choose the next node we haven’t explored yet and go as deep as we can on that one until we’ve traversed the graph.

Neo4j supports a depth-first traversal of a graph by way of the function.

Given some tree-like graph where nodes of label ‘Node’ are linked by relationships of type :PART_OF:

// First, some test data to represent a tree with nodes connected by a
// 'PART_OF' relationship:
// N1 { order: 1 }
//  N2 { order: 1 }
//    N4 { order: 1 }
//      N5 { order: 1 }
//      N6 { order: 2 }
//  N3 { order: 2 }
//    N7 { order: 1 }
MERGE (n1: Node { order: 1, name: 'N1' })
MERGE (n2: Node { order: 1, name: 'N2' })
MERGE (n3: Node { order: 2, name: 'N3' })
MERGE (n4: Node { order: 1, name: 'N4' })
MERGE (n5: Node { order: 1, name: 'N5' })
MERGE (n6: Node { order: 2, name: 'N6' })
MERGE (n7: Node { order: 1, name: 'N7' })
MERGE (n2)-[:PART_OF]->(n1)
MERGE (n4)-[:PART_OF]->(n2)
MERGE (n5)-[:PART_OF]->(n4)
MERGE (n6)-[:PART_OF]->(n4)
MERGE (n3)-[:PART_OF]->(n1)
MERGE (n7)-[:PART_OF]->(n3)

We can see which nodes are visited by Neo4j’s DFS algorithm:

MATCH (startNode: Node { name: 'N1' } )
CALL'Node', 'PART_OF', 'BOTH', id(startNode))
YIELD nodeIds 
UNWIND nodeIds as nodeId
WITH algo.asNode(nodeId) as n

The outupt here will vary – possibly even between runs. While we’ll always see a valid depth-first traversal of the nodes in the tree, there’s no guarantee that we’ll always see nodes visited in the same order. That’s because we’ve not told Neo4j in what order to traverse sibling nodes.

If you need control over the order siblings are expanded, you should use application code to write the DFS yourself.

But: given some constraints and accepting some caveats…

  • That’s there’s only one relationship that links nodes in the tree
  • That sibling nodes are sortable by some numeric property – here ‘order`, which is mandatory
  • There are not more than 1,000,000 sibling nodes for any given parent
  • Sibling nodes all have a distinct order property value
  • That this will perform like a dog on large graphs – potentially not completing, given it has some N^2 characteristics…

…you can do this in pure Cypher. Here’s one approach, which we’ll then break down
to see how it works:

MATCH (root: Node { name: 'N1' }), pathFromRoot=shortestPath((root)<-[:PART_OF*]-(leaf: Node)) WHERE NOT ()-[:PART_OF]->(leaf)
WITH nodes(pathFromRoot) AS pathFromRootNodes
WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString
WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder
RETURN [x in traversalOrder |]

Finding the deepest traversals

Given some root node, we can find a list of traversals to each leaf node using shortestPath. A leaf node is a node with no children of its own, and shortestPath (so long as we’re looking at a tree) will tell us the series of hops that get us from that leaf back to the root.

Sorting the paths

We’re trying to figure out the order in which these paths would be traversed, then extract the nodes from those paths to find the order in which nodes would be visited.

The magic is happening in this line:

WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString

The reduce is, given a node from root to leaf, building up a string that combines the order property of each node in the path with forward-slashes to separate them. This is much like folder paths in a file system. To make this work, we need each segment of the path to be the same length – therefore we pad out the order property with zeroes to six digits, to get paths like:


These strings now naturally sort in a way that gives us a depth-first traversal of a graph using our order property. If we order by this path string we’ll get the order in which leaf nodes are visited, and the path that took us to them.

Deduplicating nodes

The new problem is extracting the traversal from these paths. Since each path is a complete route from the root node to the leaf node, the same intermediate nodes will appear multiple times across all those paths.

We need a way to look at each of those ordered paths and collect only new nodes – nodes we haven’t seen before – and return them. As we do this we’ll be building up the node traversal order that matches a depth-first search.

WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder

First we collect all the paths (which are now sorted by our traversal ordering) into one big list. The same nodes are going to appear more than once for the reasons above, so we need to remove them.

We can’t just DISTINCT the nodes, because there’s no guarantee that the ordering that we’ve worked hard to create will be maintained.

Instead, we use another reduce and iterate over the list of nodes, only adding a node to our list if we haven’t seen it before. Since the list is ordered, we take only the first of each duplicate and ignore the rest. Our CASE statement is doing the heavy lifting here:

WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder


  • Create a variable called distinctNodes and set it to be an empty list
  • For each node n in our flattened list of nodes in each path from root to each leaf:
  • If we’ve seen n before (if it’s in our ‘distinctNodes’ list) then set distinctNodes = distinctNodes – effectively a no-op
  • If we haven’t seen n before, set distinctNodes = distinctNodes + n – adding it to the list

This is a horrendously inefficient operation – for a very broad, shallow tree (one where each node has many branches) we’ll be doing on the order of n^2 operations. Still, it’s only for fun.

We’re done! From our original graph, we’re expecting a traversal order of:

N1, N2, N4, N5, N6, N3, N7

And our query?


Another for the annals of ‘Just because you can, doesn’t mean you should’.

Cypher performance: WITH statements and indexes

If you come from a SQL background, you would probably not expect aliasing a column or table to have a measurable impact on performance.

Not quite so in Cypher.

Let’s build a sample graph of one million nodes (Neo4j Desktop running 3.5.13 Enterprise, 16GB max heap, 8GB initial heap):


FOREACH (r in range(1, 1000000) | MERGE (n: Node { id: r }))

Profiling a basic query on :Node(id) shows that we’re hitting the index just fine and this query runs quick as you like:

WHERE = 123456

// 2 total db hits in 1 ms

If we introduce a WITH into the mix, before the WHERE clause – what would you expect? Let’s add a redundant one and see:

WHERE = 123456

// 2 total db hits in 0 ms

No difference, nor would you expect one – the WHERE clause obviously relates to an indexed property, irrespective of the WITH.

What then if we give the WITH an alias?

WITH n as nAlias
WHERE = 123456

// 2000001 total db hits in 422 ms

Even though to your eye that’s the same query, the optimiser in Neo4j has lost track of the indexability of the query, and we’re down to a NodeByLabelScan of 1,000,000 nodes.

I’d argue this is an optimiser bug, and have raised a bug for it (though it’s likely super low priority, and I suspect the sort of thing that sounds easier to fix than it actually is).

Still, it’s easy to fall into this sort of trap when refactoring big queries. For clarity and safety, if you’re going to try to hit an index do it in a WHERE clause attached directly to your MATCH. And regardless, re-profile your queries after refactoring them as part of your validation.

Commas in Cypher patterns vs multiple matches

I answered a Stack Overflow question the other day about why two queries performed very differently. One involved a chain of OPTIONAL MATCHes, the other a single OPTIONAL MATCH with portions of a pattern separated by commas.

There’s a difference between the two that will almost certainly give you different results.

Patterns in Neo4j

A pattern is a description of a structure in a graph we’re trying to match. Patterns can assign nodes and relationships to variables we use in subsequent processing. They express both the nodes we’re looking for and how they must be related. They’re fundamental to a graph query – and the docs aren’t super-clear on how patterns with commas work.

A pattern might be something like ‘Actors connected to Movies via an ‘ACTED_IN’ relationship’.

Patterns with commas

A MATCH statement where the pattern contains a comma is still one MATCH statement. Every component of the pattern must be satisfied for the MATCH to return anything.

Let’s say we’re using the Movies database, and we want to find the directors of movies Tom Cruise has starred in.

One way is to assign a variable to the Movie node, and use commas:

MATCH (tom: Person { name: 'Tom Hanks' })-[:ACTED_IN]->(m: Movie), (director: Person)-[:DIRECTED]->(m)

Even though it might look as though we’re specifying two patterns we’re not: it’s all part of the same pattern. If Neo4j can’t satisfy any portion of the MATCH, it’ll return nothing.

Alternatively we could also use two MATCH statements one after the other:

MATCH (tom: Person { name: 'Tom Hanks' })-[:ACTED_IN]->(m: Movie)
MATCH (m)<-[:DIRECTED]-(director: Person)

In this example there’s no difference between the two queries – they’re functionally identical. Both return the same data and have identical EXPLAIN and PROFILE results. For this specific query there’s no difference in the two approaches.

But one situation where there is a big difference between the two is when we’re using OPTIONAL MATCH instead of MATCH.


Recall that an OPTIONAL MATCH returns a row if our pattern is satisfied, and NULL if not – it’s basically an OUTER JOIN but expressed in a graph query.

Let’s look at a fairly contrived example using the Movies database. Here, we want every Movie and the list of people who produced it, if any are known. We use OPTIONAL MATCH because not all Movies have producers in our database, but we still want to return the Movie title.

MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m)
RETURN m.title, collect(
// 38 records returned

38 records are returned, one for each Movie in the database, of which 10 have a non-empty collection of producer names.

What if we now also wanted to list the names of anyone who reviewed those films, alongside the producers. We can chain two OPTIONAL MATCH statements for this:

MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m)
OPTIONAL MATCH (reviewer: Person)-[:REVIEWED]->(m)
RETURN m.title, collect(, collect(

We still get 38 records, some of which have a producer (like The Matrix and Jerry Maguire) and some of which have a reviewer (like Jerry Maguire and The Replacements). Most records only have one or the other – a few have both.

You might look at that query and wonder, “can’t we simplify that by doing all the OPTIONAL MATCHes in one go?”. It might look like this:

MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m),
               (reviewer: Person)-[:REVIEWED]->(m)
RETURN m.title, collect(, collect(

Well – no, you can’t. The two queries are expressing different things. The commas in the second query’s OPTIONAL MATCH are expressing a single pattern – we need to find a Person who :PRODUCED the movie, and a Person who :REVIEWED the movie to return anything from the optional match at all. If we can’t match any part of the comma-separated portions of the pattern, we’ll return nothing at all.

We no longer return any producer or reviewer information for The Matrix, nor The Replacements. The Matrix has a producer but no reviewer, and the Replacements has a reviewer but no producer.

Get your EKS pod to assume an IAM role using IRSA

IRSA, or ‘IAM Roles for Service Accounts’ is a new AWS mechanism that lets pods running in your EKS cluster automatically assume an IAM role when using other AWS resources.

For example, we might have a DynamoDB with a read-write role associated, and we want a payment processing pod in our cluster to write to it. IRSA should make this pretty transparent.

First off, follow this tutorial. The rest of this post is basically an errata sheet for the missing information and steps to get it working.

Our approach

We’re going to tackle this in two sections:

  • Get the s3-echoer example app working in our cluster – this proves the cluster setup is correct
  • Amend our application to correctly use IRSA

Getting the cluster configuration right

If you’re using Terraform…

Two possible screw-ups here depending on which tutorial you’re following – I went with this one, which needs some tweaks:

OpenID Provider CA thumprints missing

Using Terraform to spin out your cluster? This bug will currently leave the OpenID Connect provider in an invalid state, because no certificate thumbprints are added. You’ll need to either hack some Bash into your build, or hard-code the CA thumb in your .tf file. The hacking method is fine if you’re only ever targeting one AWS region. Without this step, you will find that the s3-echoer portion of the original tutorial fails.

Wrong role assumption policy

The tutorial adds a policy to your service account role that limits who can assume the role to your aws-node pods running in your cluster. Since your code will be running in your own pods using your own service account, you need to use your own namespace and service account name instead. For example, if our deployment were under a service user called ‘payments-service-user’ running in the default namespace, we’d want:

data "aws_iam_policy_document" "example_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"

    condition {
      test     = "StringEquals"
      variable = "${replace(aws_iam_openid_connect_provider.example.url, "https://", "")}:sub"
      values   = ["system:serviceaccount:default:payments-service-user"]

    principals {
      identifiers = ["${aws_iam_openid_connect_provider.example.arn}"]
      type        = "Federated"

You should now be at a point that you can get the s3-echoer to run.

You can check if your own cluster’s service account is setup OK by Bash’ing into any pod and running the env command. We need the following three environment variables set:


Amending our application

While the SDKs just work with other AWS hosting models, you need to do some manual work to get EKS IAM role assumption to work. This isn’t consistent across SDK languages – the Go SDK seems to fall into the Just Works category, but others like the Java SDK need a tweak.

In Java, in particular, you need to add an instance of STSAssumeRoleWithWebIdentitySessionCredentialsProvider to a credentials chain, and pass that custom chain to your SDK init code via the withCredentials builder method.

This class doesn’t automatically come as part of the credentials chain. Nor does it automatically initialise itself from environment variables the same way other providers do.

You’ll have to pass in the web identity token file, region name and role ARN to get it running.

Mirroring the NuGet Catalog API locally

TL;DR: You can pull the 3GB clone of the Catalog API from this link, though note that it will unpack to 4.8 million files over 1.6 million folders and weigh in at about 52GB uncompressed.

You can pull the .NET Core console application that produced the clone from GitHub.

NuGet is the .NET package manager, and hosts almost all publicly-published NuGet packages. A package is essentially a ZIP file with a metadata component and some binaries. The metadata portion details version and author information, descriptions and so on – it also lists the dependencies of the package upon other packages in the ecosystem using SemVer.

Package management data sources like this are interesting for playing around with graphs. They’re big, well-used, well-structured data sets of highly connected data . At time of writing, there are 167,733 unique packages published with over 1.8 million versions of those packages listed.

As part of an upcoming series, I wanted to load that information into a Neo4j graph database to see if there were any interesting insights or visualisations we could create when given access to such a big data set. Unfortunately each call to the API takes between 100ms and 500ms – doesn’t sound like much, but if you’re pulling 4.8 million documents you’re looking at around 23 days of time just sequentially pulling files.

You also have to process the files sequentially – each catalog page has a commit timestamp that gives it a strong ordering, and catalog entries are essentially events that have happened to a package version. It’s possible a single package version has multiple different package metadata pages associated with it spanning an arbitrary period of time as the package is listed, de-listed or metadata amended.

I wanted to have Neo4j load the data via REST API calls, rather than going with a standard file load as that was the point of the exercise. This meant that not only did I have to clone the dataset, but I had to host it locally so that it looked like the live API.

Catalog API

The Catalog API exposes every version of every package published to NuGet. Publishes and updates to published packages are recorded as separate documents, and the catalog is paged into batches of 500 or so changes per batch, plus or minus.

There are nearly 9,000 batches reported by the Catalog API, and a total of around 4.8 million documents. Some of those documents relate to the same version of a package – for example, when a package gets de-listed for some reason there will be an updated document in one of the Catalog API pages detailing the new state of the package.

There’s no rate limit on the Catalog API – it’s just files hosted in Azure blob storage – and each document contains navigable URLs to related information. If we wanted to pull a clone of the Catalog API, we could just start from the root document and crawl the links found.

Cloning the documents

Even though Neo4j needs to process the files sequentially, we don’t need to clone them sequentially. To retrieve 4.5 million documents in any sensible amount of time, we need to go multi-threaded. I bashed together a quick .NET console app to do just that.

The app just a pretty simple breadth-first search of links found in documents. We start off with the catalog root URL which details the URL of each of the 8,800 ish catalog pages:

We then search for anything that looks like a URI within the file, check we’ve not already processed that URI and then add it to a ConcurrentQueue. Since the .NET BCL doesn’t have a ConcurrentSet, we use a ConcurrentDictionary and just care about the keys within it.

The queue ends up containing 4.8 million items at its peak, and the ‘processed files set’ slowly grows as the queue is drained.

We spin out 64 System.Threading.Tasks.Task objects in an array. Each Task takes a single URI from the queue, quickly validates that we actually have work to do, pulls the contents of the file and parses out any new URIs. Each new URI is added to the processing queue, and the contents of the file are written to disk in a folder structure that mirrors the segments of the URI so that we can easily host it later. The Task then polls the queue again and waits until there’s more work to do, or until the queue is drained of work items.

The method doing the work just spins on the queue until we run out of work to do

Every minute or so a watchdog Task clones the work queue and the ‘processed’ set and persists them to disk. Fun fact – awaiting a WriteLineAsync is incredibly slow relative to just letting it block, especially when you’re calling it millions of times in a loop.

On startup, the app looks for these checkpoint files and reloads its state from them if found – that way we can pause the processing by just killing the process.

At peak the process was using around 2GB of RAM and pulling down files at around 10MBps.


You can find the source for the application on GitHub, as well as a link to the generated output but the tool itself would need more work before you could for example resume from a snapshot tarball such as this.

24 hours later, what I ended up with was… extensive.

Just calculating the summary took twelve minutes
The unnerving feeling of seeing 1.6 million folders in a single folder doesn’t really go away

Now what?

Serving a mirror of the Catalog API with Nginx

Now that we’ve got our archive of files, we can spin out a super simple docker-compose script to host Nginx, serve the files from some URL and then replace our usages of with localhost:8192 or whatever.

First off, the docker-compose file – we’re assuming that the file lives in the same folder as the root folder of documents:

  image: nginx
   - ./
   - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
   - "8192:80"

You’ll have to make sure Docker has access to the drive where the mirrored files are stored to pull this off. We just serve that folder directly as Nginx’s default document base, but then also map in a file to configure Nginx to rewrite any URL that looks like ‘’ to point to localhost:

server {
    listen 80;
    location / {
    	root /usr/share/nginx/html;
        sub_filter ''  'http://$http_host/';
        sub_filter_types 'application/json';
        sub_filter_once off;

Because the crawler just saved files to a directory structure that matches the URI components, we can trivially serve the files without needing to translate names.

A quick docker-compose up and we’re off to the races:

Of course, we could just use the .JSON files directly, without getting Nginx in the way but since the Neo4j post I did this for was about crawling a REST API, it seemed like a bit of a cheat to just index files on disk instead.


You can download the archive from this link – it’ll take… some time to decompress, and you’ll end up with around 52GB of space consumed on the drive once you’ve done so, as well as 4.8 million new files and 1.6 million new folders so maybe do this on a drive you don’t care that much about.

The source for the crawler is available on GitHub though it’s super rough-and-ready and provided with absolutely no guarantees, aside from that it’s likely to make your day worse by running it.

Neo4j 4.0 MR2 RBAC – an exploration (with examples)

Neo4j 4.0 brings fine-grained access control and general RBAC capabilities to the table. MR2 has started showing how these features are going to work, though they’re only available in the Enterprise edition and MR2 is obviously pre-release so very much not production-ready.

Role support

A role is a collection of zero or more privileges assigned to zero or more users. Roles let you collect together the privileges required to accomplish some task. By allocating a role to a user, you in essence have them take on the privileges of the role in conjunction with any other roles they have. If you add a privilege to a role, all users with that role immediately have the new privilege.

Privilege support

As of MR2, Neo4j supports four privileges:

  • READ – the ability to read properties of the object being secured
  • TRAVERSE – the ability to include the object in a graph traversal – that is, the ability to find the node or relationship at all
  • MATCH – a combination of the READ and TRAVERSE privileges
  • WRITE – the ability to modify nodes, relationships and properties. In MR2 this isn’t as tightly defined as the other roles (though it will be eventually), and doesn’t really count as fine-grained but in its current form it does at least let you support read-only users by not including the WRITE privilege

WRITE is currently a graph-wide ‘insert/update/delete’ privilege, though support for finer-grained label-based control is coming. READ, TRAVERSE and MATCH are more interesting.

Privileges are granted to Roles, which are allocated to Users.

Cloudbooks – an imaginary SaaS provider

We need an example graph to explore this stuff, so: Cloudbooks is an imaginary provider of accounting software. Their software is multi-tenant, which means that all of their customers inhabit a single database in the cloud.

Neo4j is used amongst other things to track which features of the software users are interacting with. Our data model is roughly as follows:

A Tenant (in red) has multiple Users (brown) as members. Each User may have used zero or more Features (green).

Getting setup – Neo4j 4.0 MR2 Enterprise in Docker

RBAC functionality is only available in Neo4j Enterprise edition, so let’s spin out a Docker container with it in.

docker load < neo4j-enterprise-4.0.0-alpha09mr02-docker-complete.tar
docker run --publish=7474:7474 --publish=7687:7687 --env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes -t neo4j:4.0.0-alpha09mr02-enterprise

Things have changed in Neo 4.0 – there’s now support for multiple databases. When we first log in using Neo4j Browser we’ll be connected to the default ‘neo4j’ database, but we also have an option to use the ‘system’ database. The system database is where we can create new databases as well as define roles, create users and assign access to roles using the new RBAC functionality so let’s do that.

We’ll create a new database called ‘cloudbooks’, then add some sample data to it.


:use cloudbooks

MERGE (t1: Tenant { id: 1 })
MERGE (t2: Tenant { id: 2 })
MERGE (t1u1: User { name: 'Geoff Capes', id: 100 })
MERGE (t1u2: User { name: 'Bill Kazmaier', id: 101 })
MERGE (t2u1: User { name: 'Jón Páll Sigmarsson', id: 102 })
MERGE (f1: Feature { name: 'Create account' })
MERGE (f2: Feature { name: 'Print PDF' })
MERGE (f3: Feature { name: 'Merge accounts' })
MERGE (t1)<-[:MEMBER_OF]-(t1u1)
MERGE (t1)<-[:MEMBER_OF]-(t1u2)
MERGE (t2)<-[:MEMBER_OF]-(t2u1)
MERGE (t1u1)-[:USED]->(f1)
MERGE (t1u1)-[:USED]->(f2)
MERGE (t1u2)-[:USED]->(f3)
MERGE (t2u1)-[:USED]->(f2)

Defining roles

We need a few roles to support this application:

  • Support – these members of the Cloudbooks team support staff can view all records in the system
  • Analyst – this role represents data analysts who might analyse feature usage by tenant. These users aren’t allowed access to information about individual users of the Cloudbooks system

While we’re at it, we’re going to create two new users, one per role

  • supportuser
  • analystuser

Each user’s password is just their username again, since this is just a test.

:use system



GRANT ROLE Support TO supportuser
GRANT ROLE Analyst to analystuser

If we log in as any of those users, we’ll find that by default we have no privileges at all.

Let’s allocate some privileges to the roles.

Graph-wide privileges

Neo4j 4.0 supports graph-wide privileges – that is, we can specify a privilege be allocated to every node and relationship in the graph without needing to call out specific labels or relationship types.

Let’s give the Support role read access to the entire graph:

:use system

GRANT MATCH (*) ON GRAPH cloudbooks TO Support

If we log in as the support user we can check that we see the whole graph:

But we can’t make any modifications to the graph:

Good – that’s basic, graph-wide RBAC working.

The READ privilege and how DENY works

Let’s look at the following query, first:

MATCH (t: Tenant)<-[:MEMBER_OF]-(:User)-[:USED]->(f: Feature)
RETURN as `Tenant ID`, AS `Feature Name`,
       count(*) AS `Usage Count`
ORDER BY `Tenant ID`, `Feature Name`

We want to list, per tenant, the features they’re using and how often they’re getting used. Our result if we log in as the admin account (which can see and do everything) is what you’d expect:

Notice that we don’t actually care which User used the feature, nor do we return any information about them – we just traverse that node to get the insights we want.

We could use a Deny permission on User so prevent anyone in the Analysis group from reading user details – we’re going to drop and recreate the role here to properly clean up behind the scenes, as there seems to be a couple of issues at the minute in MR2 with REVOKE not quite cleaning house how it should:

:use system

GRANT ROLE Analyst to analystuser
GRANT MATCH (*) ON GRAPH cloudbooks TO Analyst
DENY READ (*) ON GRAPH cloudbooks NODES User TO Analyst

If we log in as our analystuser account we’ll find that the same query still works:

But we can’t return any information about the user nodes themselves:

As with most such systems, DENY takes precedence over GRANT – even though we have MATCH on everything in the graph we still can’t read information from the privileged User nodes because of the DENY rule.

Property-specific privileges

Neo has nulled-out properties on the User object because we’ve got a DENY rule on our role. Let’s relax that a bit, and allow access to user IDs, but not user names.

GRANT ROLE Analyst to analystuser
GRANT MATCH (*) ON GRAPH cloudbooks NODES User TO Analyst
DENY READ (name) ON GRAPH cloudbooks NODES User TO Analyst

READ and MATCH both support specifying one or more properties to allow or deny access to, with the (*) wildcard representing ‘all properties’. In the above, the DENY READ (name) privilege prevents us from reading the ‘name’ property of User nodes, but since no other DENY exists for other properties on User nodes we can cheerily read out the user’s ID.

The TRAVERSE privilege vs the MATCH privilege

We’ve been using the MATCH permission above, which is a combination of READ and TRAVERSE. Let’s see what happens if you just use one or the other.

We’ll grant MATCH on everything except User, and then try the two fine-grained controls on just User nodes.

The TRAVERSE privilege

GRANT ROLE Analyst to analystuser
GRANT MATCH (*) ON GRAPH cloudbooks Nodes Tenant, Feature TO Analyst
GRANT TRAVERSE ON GRAPH cloudbooks Relationships * TO Analyst

With no further permissions, our query returns no results – because we aren’t permitted to traverse User nodes, we can’t make the jump from Tenant to Feature (even though we have a TRAVERSE privilege on every relationship in the graph):

MATCH (t: Tenant)<-[:MEMBER_OF]-(u:User)-[:USED]->(f: Feature)
RETURN as `Tenant ID`, AS `Feature Name`, AS `User name`, AS `User ID`,
       count(*) AS `Usage Count`

If we add the TRAVERSE privilege into the Analyst role on User nodes:

GRANT TRAVERSE ON GRAPH cloudbooks Nodes User TO Analyst

We start getting results again. We’ve had to include a TRAVERSE privilege on both the User node and the relationship types into and out of that node that we wanted to query on.

Because our privileges don’t extend to READ on the User node’s properties, we don’t get any data back about the user:

The MATCH privilege

MATCH is a combination of READ and TRAVERSE, so we would expect a MATCH privilege on User nodes to let us read out the properties of those nodes in a way TRAVERSE wouldn’t. Let’s try it:

GRANT MATCH (id) ON GRAPH cloudbooks NODES User TO Analyst


Neo’s new RBAC functionality isn’t finished yet, and in MR2 things like the WRITE privilege not being fine-grained and REVOKE not cleaning things up properly are issues that will be fixed.

That said, it’s super promising and the splitting out of TRAVERSE from READ (and ability to specify privileges at the node and relationship level separately) will allow some interesting data protection scenarios to be cooked up.

Minecraft Crafting Tree in Neo4j – Part 6 – Retrospective

This is the last in a series of posts where we created a Neo4j model of the Minecraft crafting tree and played around with querying it to build a shopping list for any given item.

Over the course of the series we’ve:

What have we learned?

Visualising your data to cleanse it works well

We found data quality issues and corrected them, but finding them was made easier because of Neo4j Browser’s easy visualisation of our graph. You can see data issues quicker than you can query for them, and a human will spot anomalies pretty quickly. For example:

  • Isolated nodes where you expected a fully-connected graph
  • Cycles in the graph where you expect it to be acyclic
  • Multiple relationships between nodes where you expect only a single relationship

If you’re working with a large dataset (remember: the Minecraft dataset was just a toy) you could consider sampling your source data and testing your ingest on that sample. You may not find all data issues in the sample, but quickly visualising and proving out your graph model is much easier if you can fit the whole model on a few screens.

Neo4j is radical overkill for this specific problem

We knew this going in to it – we don’t need a proper database for this, we could just as well have represented the whole graph in memory in Javascript and gotten the same answers out.

That said – while we’re not using any large fraction of the power of Neo4j for our example, we did touch upon a good few core concepts:

  • Data loading via LOAD CSV and MERGE
  • Constraints and indexes
  • Visualisation
  • Querying in Cypher
  • Aggregation in Cypher
  • APOC

APOC is huge and very capable

We used a single function from the APOC library which boasts over 450 functions exposed to your Neo4j instance. It addresses a lot of functionality shortfalls in Cypher, and if you start digging into it further features like graph refactoring support, virtual nodes and node grouping all become very attractive tools for your mental toolkit.

I’m surprised it’s not just bundled by default.

It’s tempting to write APOC calls or application code over pure Cypher

I still haven’t found the dividing line yet where I’m comfortable saying ‘just express that in Cypher’ or ‘just use APOC’. We found two queries that were broadly equivalent, one in some fairly long-winded Cypher and what amounted to a one-liner in APOC. Was one better than the other?

If you’re a developer and you take the dependency on APOC to express a query then you’re buying your future self (and future colleagues) into that ecosystem. You can probably express almost any Cypher concept in a series of APOC calls – should you? Your future self now has to remember exactly how, for example, apoc.path.subgraphAll works to figure out what a query’s up to, while the plain Cypher was easier to figure out from first-principles.

There isn’t just one way to model your domain

Neo4j claims that, and I’m paraphrasing, ‘your whiteboard model is your data model’ and that’s true – we literally did that, replete with a photo of our whiteboard.

However, if you can represent something multiple ways on your whiteboard then that doesn’t save you from the underlying domain complexity. There isn’t just one way to model your domain, and the model you choose has implications on:

  • How easy it is to query the graph
  • Whether you need application code in places where you could have been using Cypher
  • How easy it is to ingest new data and maintain consistency within your model
  • Performance

The list goes on. We only looked at two representations of our data – there were more we could have explored. The accepted guidance seems to be to optimise your data model around the most common query or queries you’ll be performing – i.e. cover the 80% case and worry about the long tail of other queries later.

Graph refactoring and experimentation is easy

Luckily Neo4j makes it super easy to refactor your graph, and either APOC or the Neo4j cypher-shell let you export and re-import data or subsets of your data to try experiments quickly.

In addition Neo4j Desktop is very capable and a whole lot easier to use than the old days where you had to stand up new instances of the database or play with configuration files to swap graphs in and out. Installing APOC takes two clicks, and staying on the latest database version is as easy and choosing from a drop-down box.


The graph model we chose made certain operations more difficult. Still – we overcame the issues we had either with Cypher or application code, and built something that answers interesting questions from the graph.

Our Minecraft example wasn’t a good use-case for Neo4j as it stands, but modelling BOMs in Neo4j is a good fit in general (which is what we started to show in our simplified example back in Part 4).

Being able to answer questions like ‘what’s the most contrived item to build’ may sound frivolous, but in manufacturing translates to:

  • Lead-time estimation – from materials ingest, how long before step X of the process can expect work, and how long is our production pipeline in minutes/hours?
  • Quality and yield modelling – what’s the cost of rework for a defect found at a particular part of the process?
  • Design-for-manufacture decision making support – how many different fastenings are used across the product, can we reduce that?

Minecraft Crafting Tree in Neo4j – Part 5

Last time we found that writing the query that told us which items to gather from the world to build any given item was between tricky and impossible to express in one Cypher query without some application code.

We outlined an algorithm that used a simulated shopping list to keep track of what we needed to run for each Recipe in the tree.

Rather than walk through the build of that here, I’ve had a stab some something poorly approximating Literate Programming and posted the code in a Gist that I’ll also embed here:

Let’s try it out then – our previous Cypher query told us that we needed 1x Wood to build a Wood Axe but that was wrong, so how does the new algorithm work out?

The output of our Node.js app for producing a Wood Axe

That’s looking tidier, we need 2x Wood.

How about producing a Carrot on a Stick?

The output of our Node.js app for producing a Carrot on a Stick

We can see from the indentation in both shots how each Recipe ends up running other Recipes recursively until we finally hit resources that have no Recipe – raw materials.

Next steps

We’re not going to go any further with this application – we’ve tried a few things out, explored how far we can push Cypher for our use case and pulled together a quick Node app to talk to the database to do the heavy-lifting when we couldn’t manage in Cypher.

Next time we’ll do a quick retrospective on the project to wrap things up.

Minecraft Crafting Tree in Neo4j – Part 4

In the last post we found that while there are different ways to represent the Minecraft crafting list in a graph database, just generating a list of materials to be gathered for a given recipe is not straightforward.

In this post we’ll look at what we could do in pure Cypher if Minecraft were a slightly different game. We’ll then rough out an algorithm to walk a requirements tree for a given Recipe and figure out what materials we need to gather that does work with Minecraft.

Note: This post will continue with our ‘Recipe’-based graph from Part 3, where Resource and Recipe nodes are separate. You can download the state of our database so far via this Gist, which is a Cypher script to be run into a clean graph.

What if Recipes produced only one output unit?

If Minecraft were different, and a Recipe only ever produced one unit of output then we could actually create an ingredients list for any item just in Cypher.

Remember earlier we had two ways to produce the ‘path’ from a given Resource to its constituent ingredients, listing all the intermediate Recipes. Let’s look again at the clumsier, non-APOC version for a Carrot on a Stick:

MATCH (r: Resource { name: 'Carrot on a Stick' })<-[:PRODUCES]-(recipe1: Recipe)-[req1:REQUIRES]->(dep1: Resource)
OPTIONAL MATCH (dep1)-[:PRODUCES]-(recipe2: Recipe)-[req2:REQUIRES]->(dep2: Resource)
OPTIONAL MATCH (dep2)-[:PRODUCES]-(recipe3: Recipe)-[req3:REQUIRES]->(dep3: Resource)
OPTIONAL MATCH (dep3)-[:PRODUCES]-(recipe4: Recipe)-[req4:REQUIRES]->(dep4: Resource)
RETURN r, recipe1, dep1, recipe2, dep2, recipe3, dep3, recipe4, dep4

One thing to note is that this gives you three rows of results – one for each full subtree from the Carrot on a Stick node to a terminal raw-ingredient node.

Here’s another, this time for Wood Pickaxe:

This example’s arguably more interesting – the Pickaxe is made of Sticks and Planks. Sticks are themselves made of Planks – that means there’s two sub-trees to terminal nodes:

  • The tree that creates the Sticks for our Pickaxe (Sticks -> Planks -> Wood)
  • The tree that creates the Planks for our Pickaxe (Planks ->Wood)

Both trees end in the same resource, unlike the Carrot on a Stick case.

What we want here is to group the information by the name of the resource in the terminal nodes, and multiply together the input quantities down the tree. Why multiply? Consider the Pickaxe case in our simplified model of Minecraft (where each recipe outputs just one resource), and just look at the ‘Sticks’ component of our tree.

  • A Pickaxe requires 2x Sticks and 3x Planks
  • Each Stick requires 2x Plank
  • Each Plank requires 1x Wood

The total Wood needed in this case is:

  • (Wood per Plank) * (Planks per Stick) * (Sticks per Pickaxe)
  • = 1 * 2 * 2
  • = 4

How does grouping work in Neo4j and Cypher?

If you use any aggregate function (like sum, count, etc) then Neo4j will group by the previous non-aggregate values to compute the value you’re asking for. Let’s see an example by amending the RETURN statement of our previous query to just return the number of times a given Resource appeared in our result set (which will be the number of subtrees that end with that resource):

MATCH (r: Resource { name: 'Wood Pickaxe' })<-[:PRODUCES]-(recipe1: Recipe)-[req1:REQUIRES]->(dep1: Resource)
 OPTIONAL MATCH (dep1)-[:PRODUCES]-(recipe2: Recipe)-[req2:REQUIRES]->(dep2: Resource)
 OPTIONAL MATCH (dep2)-[:PRODUCES]-(recipe3: Recipe)-[req3:REQUIRES]->(dep3: Resource)
 OPTIONAL MATCH (dep3)-[:PRODUCES]-(recipe4: Recipe)-[req4:REQUIRES]->(dep4: Resource)
 RETURN COALESCE(,,,, count(*)

That matches our previous analysis. So how do we get the total number of Wood we need? Well, multiplying the quantities together directly won’t work because sometimes the values will be null. Examine the first row of the query before last:

The variables recipe3, dep3, recipe4 and dep4 are all NULL because that sub-tree doesn’t go deep enough that they’re needed – after two hops we’ve hit Wood and there’s no further to go. And Neo4j, just like every other database, will return NULL on mathematical operations involving NULL as an operand. For instance:

  • 4 * NULL = NULL
  • 2 + NULL = NULL

We need to turn all those NULLs into 1s so that we can safely, blindly multiply them without changing the result of the calcuation. We can use the COALESCE operator for this, giving us this final query:

MATCH (r: Resource { name: 'Wood Pickaxe' })<-[:PRODUCES]-(recipe1: Recipe)-[req1:REQUIRES]->(dep1: Resource)
OPTIONAL MATCH (dep1)-[:PRODUCES]-(recipe2: Recipe)-[req2:REQUIRES]->(dep2: Resource)
OPTIONAL MATCH (dep2)-[:PRODUCES]-(recipe3: Recipe)-[req3:REQUIRES]->(dep3: Resource)
OPTIONAL MATCH (dep3)-[:PRODUCES]-(recipe4: Recipe)-[req4:REQUIRES]->(dep4: Resource)
RETURN COALESCE(,,,, sum(coalesce(req4.qty, 1) * coalesce(req3.qty, 1) * coalesce(req2.qty, 1) * coalesce(req1.qty))

Does it work for Carrot on a Stick?


Can we make it work for the real rules of the game?

Our toy above works because we cheated. But could we make it work if we allow multiple outputs per recipe?

The actual answer for Carrot on a Stick, if we work it out by hand, is:

  • 1x Carrot
  • 2x String
  • 1x Wood

The Carrot and the String are straightforward raw materials needed directly by the Carrot on a Stick recipe itself. The Wood comes in to it because we need to make:

  • 3x Sticks
  • Sticks require 2x Wooden Planks – but we produce 4 Sticks for each run of the recipe so 2x Wooden Planks is enough to cover all the Sticks we need
  • Wooden Planks require 1x Wood – but we produce 4 planks for each run of the recipe so 1x Wood is enough to cover all the Wooden Planks we need

Since our graph has the output quantity of a Recipe encoded on the :PRODUCES relationship qty property, could we divide the input requirement by the output requirement to get to what we want?

Let’s try. We’ll need to add variables for the :PRODUCES relationships now, as we need that qty property for our maths.

Here’s a first pass:

MATCH (r: Resource { name: 'Wood Sword' })<-[out1:PRODUCES]-(recipe1: Recipe)-[req1:REQUIRES]->(dep1: Resource)
OPTIONAL MATCH (dep1)-[out2:PRODUCES]-(recipe2: Recipe)-[req2:REQUIRES]->(dep2: Resource)
OPTIONAL MATCH (dep2)-[out3:PRODUCES]-(recipe3: Recipe)-[req3:REQUIRES]->(dep3: Resource)
OPTIONAL MATCH (dep3)-[out4:PRODUCES]-(recipe4: Recipe)-[req4:REQUIRES]->(dep4: Resource)
       		(coalesce(toFloat(req4.qty), 1.0) / coalesce(toFloat(out4.qty), 1.0)) 
      	  * (coalesce(toFloat(req3.qty), 1.0) / coalesce(toFloat(out3.qty), 1.0)) 
          * (coalesce(toFloat(req2.qty), 1.0) / coalesce(toFloat(out2.qty), 1.0)) 
          * (coalesce(toFloat(req1.qty), 1.0) / coalesce(toFloat(out1.qty), 1.0))

Some notes:

  • We need to case the quantities to floating point integers, as otherwise Neo4j will do integral division, meaning we can’t do maths on the fractions of resources we might need to build a given item
  • We divide the requirement for a given recipe by the output quantity to get in essence the number of times that recipe needs to run to yield the correct amount of output
  • We ceil the result, as a requirement for (for example) 1.5x Wood is really a requirement for 2x Wood since we can’t collect Wood in any smaller unit

This almost certainly doesn’t actually work, because we can’t just directly do maths like this.

Close but not quite

Let’s look at the example of a Wood Axe. To make a Wood Axe we need:

  • 2x Sticks
  • 3x Wood Planks

Let’s do the same analysis as we did before:

  • To make 2x Sticks we need 2x Wood Planks
  • To make 2x Wood Planks we need 1x Wood
  • To make the remaining 3x Wood Plank we need a further Wood for a total of 2 Wood

So why does our query tell is that we only need 1x Wood?

Our query doesn’t discern between the different reasons our recipe tree needs Wood. In its ‘mind’:

  • To make 2x Sticks is 0.5 units of output of the Stick recipe (since it outputs 4x Sticks at a cost of 2x Wood Planks)
  • To make 2x Wood Planks is 0.5 units of output of the Wood Plank recipe (since it outputs 4x Wood Planks at a cost of 1x Wood) – so we need 0.5 * 0.5 = 0.25 units of Wood to produce 2x Sticks
  • To make the remaining 3x Wood Planks is 0.75 units of output of the Wood Plank recipe by the logic above – so we need 0.75 units of Wood to produce the remaining 3x Wood Planks
  • We therefore need 0.25 + 0.75 = 1x Wood in total

The query’s fallen down because it’s assuming we can run Recipes a fractional number of times and not handling the input quantities correctly as a result:

  • To produce Sticks in bundles of 4 requires 2x Wooden Planks – even though we only need 2x Stick we still need to burn 2x Wooden Planks to make them
    • It’s on this step that the earlier query really falls down
  • We need 3x Wooden Planks directly in the Wood Axe recipe for a total of 5x Wood Planks
  • To make 5x Wood Planks requires 2x runs of the Wood Plank recipe, each run of which consumes 1x Wood

An algorithm for application code

Let’s stop thinking about this in Cypher terms for a minute and design a way to walk the Recipe tree that’ll give us correct answers.

Suppose we maintain a table with three columns:

  • The name of a Resource
  • The number of that Resource that we’ve created running Recipes
  • The number of that Resource that we require to run the Recipes so far

Each time we want to obtain some resource, we’ll figure out if it’s produced by Recipes or a raw material.

If it’s a raw material, we just increase the ‘Required’ entry in the table for that Resource by the amount we need – we can’t make more of it via a Recipe.

If it’s produced by a recipe, we need to do a bit more work because there are three cases:

The ‘already in stock’ case

Say we’ve run the Sticks recipe once already for some other step, and consumed 1x Stick. The Sticks recipe outputs 4x Sticks, so our resource table might look like:


We are now trying to make something else requiring a further 2x Stick. While there’s a Recipe for this, we don’t need to run it because we have a surplus of 3x Stick already.

In this case, we just increase the ‘Required’ column by the amount we need and don’t bother running the Recipe again.

The ‘entirely out of stock’ case

Say we’ve not run the Sticks recipe at all yet, but have a need of 6x Sticks for a Recipe upstream. We don’t have any Sticks so we’ll have to run the Sticks recipe. We need to run it twice, because the Recipe outputs 4x Sticks per run and we need 6.

We run the Recipe twice, each time adding its output to the table in the ‘Created’ column. We then add 6 to our ‘Required’ column for Stick to earmark some of that output for the upstream recipe:


The ‘partly in stock’ case

Say we’ve run the Sticks recipe once, used 1x Stick and now need a further 4x Sticks. At the start, our table looks like this:


The table says we have 3x Stick going spare just now, so we have a shortfall of 1x Stick. We need to run the Stick recipe once, and add its output to the table before allocating what we need. We run the recipe:


Now there’s no shortfall, so we take the 4x Stick by increasing the Required column:


We recursively do this until we hit raw-material nodes. Because raw materials cannot be produced but must be collected from the world, we just add them to the Required column. For example, to produce a Wooden Plank, we need 1x Wood. After running the Wooden Plank recipe, our table might look like this:

Wooden Plank41

Once we’ve hit a raw material node we can’t recurse further – it’s the base-case for our recursion.

Then, getting a shopping list is as simple as finding the cases where the required number of a Resource is non-zero but the created number is zero – that is, we need some of that Resource, but we couldn’t create any.

Next steps

Next time we’ll take that algorithm and bake it into a quick Node command-line application where we can query for a given Resource and get back the list of ingredients we need to go and find. We’ll also see how the approach above lets us answer more interesting questions.