
Invalid Request error when creating a Cloudfront response header policy via Cloudformation

I love Cloudformation and CDK, but sometimes neither will show an issue with your template until you actually try to deploy it.

Recently we hit a stumbling block while creating a Cloudfront response header policy for a distribution using CDK. The cdk diff came out looking correct, no issues there – but on deploying we hit an Invalid Request error for the stack.

An error displayed in the Cloudfront 'events' tab, indicating that there was an Invalid Request but giving no further clues
Cloudformation often doesn’t give much additional colour when you hit a stumbling block

The reason? We’d added a temporarily-disabled XSS protection header, but kept in the reporting URL so that when we turned it on it’d be correctly configured. However, Cloudfront rejects the creation of the policy if you specify a reporting URL on a disabled header setup.

The Cloudfront resource policy docs make it pretty clear this isn’t supported, but Cloudformation can’t validate it for us
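For illustration, the sort of CDK that triggers this looks roughly like the following – a trimmed-down TypeScript sketch with made-up values rather than our actual stack:

import { Stack, StackProps } from 'aws-cdk-lib';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import { Construct } from 'constructs';

export class EdgeStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Synthesises cleanly and looks fine in `cdk diff`, but CloudFront rejects it
    // at deploy time with 'Invalid Request': a report URI isn't allowed while
    // protection is disabled. Dropping reportUri (or enabling protection) fixes it.
    new cloudfront.ResponseHeadersPolicy(this, 'SecurityHeaders', {
      securityHeadersBehavior: {
        xssProtection: {
          protection: false,                             // temporarily disabled...
          reportUri: 'https://example.com/xss-reports',  // ...but still configured to report
          override: true,
        },
      },
    });
  }
}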

A screenshot of a validation error message indicating that X-XSS-Protection cannot contain a Report URI when protection is disabled
Just jumping into the console to try creating the resource by hand is often the most effective debugging technique

How to diagnose Invalid Request errors with Cloudformation

A lot of the time the easiest way to diagnose an Invalid Request error when deploying a Cloudformation template is to just do it by hand in the console in a test account, and see what breaks. In this instance, the error was very clear and it was a trivial patch to fix up the Cloudformation template and get ourselves moving.

Unfortunately, Cloudformation often doesn’t give as much context as the console when it comes to validation errors during stack creation – but hand-cranking the affected resource gives you both quicker feedback and a better feel for what the configuration options are and how they hang together.

A rule of thumb is that if you’re getting an Invalid Request back, chances are it’s essentially a validation error on what you’ve asked Cloudformation to deploy. Check the docs, simplify your test case to pinpoint the issue and don’t be afraid to get your hands dirty in the console.

DMARC failures even when AWS SES Custom Mail-From domain used

I was caught out by this, this week, so hopefully future-me will remember quicker how to fix this one.

Scenario

  • You want to get properly configured for DMARC for a domain you’re sending emails from via AWS SES
  • You’ve verified the sender domain as an identity
  • You’ve set up DKIM and SPF
  • You’ve set up a custom MAIL FROM
  • You’re still seeing SPF-related DMARC failures when sending emails

In my case, those failures were caused because I was sending email from a different identity that uses the same domain.

For example, I had ‘example.com’ set up as a verified identity in SES allowing me to send email from any address at that domain, and I configured a sender identity ‘contact@example.com’ to be used by my application to send emails so that I could construct an ARN for use with Cognito or similar.

What isn’t necessarily obvious is that you need to enable the custom MAIL FROM setting for the sender identity, not just for the domain identity, if you have both configured. AWS SES does not fall back to the domain identity’s configuration – you have to enable custom MAIL FROM individually for each sender identity, even if the configuration is identical.

So in my case, the fix was (the same change is sketched with the SDK after this list):

  • Edit the Custom MAIL FROM setting for contact@example.com
  • Enable it to use mail.example.com (which was already configured)
  • Save settings
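If you’d rather script that than click through the console, the equivalent call with the AWS SDK v3 looks roughly like this (a sketch using the example identity names from above):

import {
  SESv2Client,
  PutEmailIdentityMailFromAttributesCommand,
} from '@aws-sdk/client-sesv2';

const ses = new SESv2Client({ region: 'eu-west-1' });

// Enable the custom MAIL FROM on the *sender* identity, not just the domain identity.
// 'contact@example.com' and 'mail.example.com' are the illustrative values from above.
await ses.send(
  new PutEmailIdentityMailFromAttributesCommand({
    EmailIdentity: 'contact@example.com',
    MailFromDomain: 'mail.example.com',
    BehaviorOnMxFailure: 'USE_DEFAULT_VALUE',
  })
);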

Using an AWS role to authenticate with Google Cloud APIs

I recently had a requirement to securely access a couple of Google Cloud APIs as a service account user, where those calls were being made from a Fargate task running on AWS. The until-relatively-recently way to do this was:

  • Create a service account in the Google Cloud developer console
  • Assign it whatever permissions it needs
  • Create a ‘key’ for the account – in essence a long-lived private key used to authenticate as that service account
  • Use that key in your Cloud SDK calls from your AWS Fargate instance

This isn’t ideal because of that long-lived credential: the ‘key’ can’t be scoped to require a particular originator, and while you can revoke it from the developer console, if it leaks you’ve got an infinitely long-lived token usable from anywhere – and you’d need to know it had leaked before you could prevent its use.

Google’s Workload Identity Federation is the new hotness in that regard, and is supported by almost all of the client libraries now. Not the .NET one though, irritatingly, which is why this post from Johannes Passing is, if you need to do this from .NET-land, absolutely the guide to go to.

The new approach is more in line with modern authentication standards, using federation between AWS and Google Cloud to generate short-lived, scoped credentials that are used for the actual work, with no secrets shared between the two environments.

The docs are broadly excellent, but I was pleased at how clever the AWS <-> Google Cloud integration is given that there isn’t any AWS-supported explicit identity federation actually happening, in the sense of established protocols (like OIDC, which both clouds support in some fashion).

How it works

On the Google Cloud side, you set up a ‘Workload identity pool’ – in essence a collection of external identities that can be given some access to Google Cloud services. Aside from some basic metadata, a pool has one or more ‘providers’ associated with it. A provider represents an external source of identities, for our example here AWS.

A provider can be parameterised:

  • Mappings translate between the incoming assertions from the provider and those of Google Cloud’s IAM system
  • Conditions restrict the identities that can use the identity pool via a rich syntax

You can also attach Google service accounts to the pool, allowing those accounts to be impersonated by identities in the pool. You can restrict access to a given service account via conditions, in a very similar way to restricting access to the pool itself.

To get an access token on behalf of the service account, a few things are happening (in the background for most client libraries, and explicitly in the .NET case).

Authenticating with the pool

In AWS land, we authenticate with the Google pool by asking it to exchange a provider-issued token for one that Google’s STS will recognise. For AWS, the required token is (modulo some encoding and formatting) a signed ‘GetCallerIdentity’ request that you might yourself send to the AWS STS.

Our calling code in AWS-land doesn’t finish the call – we don’t need to. Instead, we sign a request and then pass that signed request to Google which makes the call itself. We include in the request (and the fields that are signed over) the URI of the ‘target resource’ on the Google side – the identity pool that we want to authenticate to.
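To make that concrete, here’s roughly what the client libraries assemble under the hood – a TypeScript sketch using the aws4 signing package and Node’s built-in fetch, with a made-up project number, pool and provider. Treat it as an illustration of the shape of the exchange rather than production code:

import aws4 from 'aws4';
import { fromNodeProviderChain } from '@aws-sdk/credential-providers';

// Illustrative identifiers – substitute your own project number, pool and provider.
const audience =
  '//iam.googleapis.com/projects/123456789/locations/global/' +
  'workloadIdentityPools/my-pool/providers/my-aws-provider';

// 1. Sign a GetCallerIdentity request with the ambient AWS credentials (e.g. the
//    Fargate task role). The Google 'target resource' goes into the signed headers,
//    so the resulting token is only usable against that one identity pool.
const creds = await fromNodeProviderChain()();
const signed = aws4.sign(
  {
    host: 'sts.amazonaws.com',
    path: '/?Action=GetCallerIdentity&Version=2011-06-15',
    method: 'POST',
    service: 'sts',
    region: 'us-east-1',
    headers: { 'x-goog-cloud-target-resource': audience },
  },
  {
    accessKeyId: creds.accessKeyId,
    secretAccessKey: creds.secretAccessKey,
    sessionToken: creds.sessionToken,
  }
);

// 2. Serialise the signed request – we never send it to AWS ourselves, Google does.
const subjectToken = encodeURIComponent(
  JSON.stringify({
    url: `https://${signed.host}${signed.path}`,
    method: signed.method,
    headers: Object.entries(signed.headers ?? {}).map(([key, value]) => ({ key, value: String(value) })),
  })
);

// 3. Exchange it at Google's STS for a short-lived federated access token.
const response = await fetch('https://sts.googleapis.com/v1/token', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    audience,
    grantType: 'urn:ietf:params:oauth:grant-type:token-exchange',
    requestedTokenType: 'urn:ietf:params:oauth:token-type:access_token',
    scope: 'https://www.googleapis.com/auth/cloud-platform',
    subjectTokenType: 'urn:ietf:params:aws:token-type:aws4_request',
    subjectToken,
  }),
});
const { access_token: federatedToken } = await response.json();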

The response from AWS to Google’s call to the STS will include the ARN of the identity for whom credentials on the AWS side are available. If you’re running in ECS or EC2, these will represent the IAM role of the executing task.

We share nothing secret with Google to do this, and we can’t fake an identity on AWS that we don’t have access to:

  • The ARN of the identity returned in the response to GetCallerIdentity includes the AWS account ID and the name of any assumed role – the only thing we could ship to Google is proof of an identity that we already have access to on the AWS side.
  • The Google workload identity pool identifier is signed over in the GetCallerIdentity request, so the token we send to Google can only be used for that specific identity pool (and Google can verify that, again with no secrets involved). This means we can’t accidentally ship a token to the wrong pool on the Google side.
  • The signature can be verified without access to any secret information by just making the request to the AWS STS. If the signature is valid, Google will receive an identity ARN, and if the payload has been tampered with or is otherwise invalid then the request will fail.

None of the above requires any cooperation between AWS and Google cloud, save for AWS not changing ARN formats and breaking identity pool conditions and mappings.

What happens next?

All being well, the Google STS returns to us a temporary access token that we can then use to generate a real, scoped access token to use with Google APIs. That token can be nice and short lived, restricting the window over which it can be abused should it be leaked.
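In sketch form, that last hop is a single call to Google’s IAM Credentials API, passing in the federated token from the exchange above and naming the service account to impersonate (the service account email here is illustrative):

// federatedToken is the token returned by the STS exchange sketched earlier.
const saResponse = await fetch(
  'https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/' +
    'my-robot@my-project.iam.gserviceaccount.com:generateAccessToken',
  {
    method: 'POST',
    headers: {
      authorization: `Bearer ${federatedToken}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({
      scope: ['https://www.googleapis.com/auth/cloud-platform'],
      lifetime: '600s', // keep it short – we can always re-run the exchange
    }),
  }
);
const { accessToken, expireTime } = await saResponse.json();
// accessToken can now be sent as a Bearer token to Google Cloud APIs until expireTime.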

What about for long-lived processes?

Our tokens can expire in a couple of directions:

  • Our AWS credentials can and will expire and get rolled over automatically by AWS (when not using explicit access key IDs and just using the profile we’re assuming from the execution role of the environment)
  • Our short-lived Google service account credential can expire

Both are fine and handled the same way – re-run the whole process. Signing a new GetCallerIdentity request is quick, trivial and happens locally on the source machine. And Google just has to make one API call to establish that we’re still who we said we were and offer up a temporary token to exchange for a service account identity.

How to (not) do depth-first search in Neo4j

I found a Stack Overflow question with no answers that seemed like it should be straightforward – how do you traverse a tree-like structure in depth-first order? The problem had a couple of features:

  • Each node had an order property that described the order in which sibling nodes should be traversed
  • Each node was connected to its parent via a PART_OF relationship

A depth-first traversal of a tree is pretty easy to understand.

Whenever we find a node with children, we choose the first and explore as deep into the tree as we can until we can’t go any further. Next we step up one level and choose the next node we haven’t explored yet and go as deep as we can on that one until we’ve traversed the graph.

Neo4j supports a depth-first traversal of a graph by way of the algo.dfs.stream procedure.

Given some tree-like graph where nodes of label ‘Node’ are linked by relationships of type :PART_OF:

// First, some test data to represent a tree with nodes connected by a
// 'PART_OF' relationship:
// N1 { order: 1 }
//  N2 { order: 1 }
//    N4 { order: 1 }
//      N5 { order: 1 }
//      N6 { order: 2 }
//  N3 { order: 2 }
//    N7 { order: 1 }
MERGE (n1: Node { order: 1, name: 'N1' })
MERGE (n2: Node { order: 1, name: 'N2' })
MERGE (n3: Node { order: 2, name: 'N3' })
MERGE (n4: Node { order: 1, name: 'N4' })
MERGE (n5: Node { order: 1, name: 'N5' })
MERGE (n6: Node { order: 2, name: 'N6' })
MERGE (n7: Node { order: 1, name: 'N7' })
MERGE (n2)-[:PART_OF]->(n1)
MERGE (n4)-[:PART_OF]->(n2)
MERGE (n5)-[:PART_OF]->(n4)
MERGE (n6)-[:PART_OF]->(n4)
MERGE (n3)-[:PART_OF]->(n1)
MERGE (n7)-[:PART_OF]->(n3)

We can see which nodes are visited by Neo4j’s DFS algorithm:

MATCH (startNode: Node { name: 'N1' } )
CALL algo.dfs.stream('Node', 'PART_OF', 'BOTH', id(startNode))
YIELD nodeIds 
UNWIND nodeIds as nodeId
WITH algo.asNode(nodeId) as n
RETURN n

The output here will vary – possibly even between runs. While we’ll always see a valid depth-first traversal of the nodes in the tree, there’s no guarantee that we’ll always see nodes visited in the same order. That’s because we’ve not told Neo4j in what order to traverse sibling nodes.

If you need control over the order siblings are expanded, you should use application code to write the DFS yourself.
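That application-side version is only a few lines with the official JavaScript driver – a sketch assuming the same :Node/:PART_OF schema and order property as above:

import neo4j, { Session } from 'neo4j-driver';

// Visit a node's children ordered by their 'order' property, recursing depth-first
// from application code rather than inside Cypher.
async function dfsChildren(session: Session, parentId: number, visit: (name: string) => void): Promise<void> {
  const result = await session.run(
    `MATCH (child:Node)-[:PART_OF]->(parent:Node)
     WHERE id(parent) = $parentId
     RETURN id(child) AS id, child.name AS name
     ORDER BY child.order`,
    { parentId: neo4j.int(parentId) }
  );
  for (const record of result.records) {
    visit(record.get('name'));
    await dfsChildren(session, record.get('id').toNumber(), visit);
  }
}

const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'password'));
const session = driver.session();
try {
  const root = await session.run(`MATCH (n:Node { name: 'N1' }) RETURN id(n) AS id, n.name AS name`);
  console.log(root.records[0].get('name'));                                      // visit the root first...
  await dfsChildren(session, root.records[0].get('id').toNumber(), console.log); // ...then its subtree, in order
} finally {
  await session.close();
  await driver.close();
}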

But: given some constraints and accepting some caveats…

  • That there’s only one relationship type that links nodes in the tree
  • That sibling nodes are sortable by some numeric property – here ‘order’, which is mandatory
  • There are not more than 1,000,000 sibling nodes for any given parent
  • Sibling nodes all have a distinct order property value
  • That this will perform like a dog on large graphs – potentially not completing, given it has some N^2 characteristics…

…you can do this in pure Cypher. Here’s one approach, which we’ll then break down to see how it works:

MATCH (root: Node { name: 'N1' }), pathFromRoot=shortestPath((root)<-[:PART_OF*]-(leaf: Node)) WHERE NOT ()-[:PART_OF]->(leaf)
WITH nodes(pathFromRoot) AS pathFromRootNodes
WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString
WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder
RETURN [x in traversalOrder | x.name]

Finding the deepest traversals

Given some root node, we can find a list of traversals to each leaf node using shortestPath. A leaf node is a node with no children of its own, and shortestPath (so long as we’re looking at a tree) will tell us the series of hops that get us from that leaf back to the root.

Sorting the paths

We’re trying to figure out the order in which these paths would be traversed, then extract the nodes from those paths to find the order in which nodes would be visited.

The magic is happening in this line:

WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString

The reduce is, given a path from root to leaf, building up a string that combines the order property of each node in the path with forward-slashes to separate them. This is much like folder paths in a file system. To make this work, we need each segment of the path to be the same length – therefore we pad out the order property with zeroes to six digits, to get paths like:

/000001/000001/000001
/000001/000001/000002
/000001/000002

These strings now naturally sort in a way that gives us a depth-first traversal of a graph using our order property. If we order by this path string we’ll get the order in which leaf nodes are visited, and the path that took us to them.

Deduplicating nodes

The new problem is extracting the traversal from these paths. Since each path is a complete route from the root node to the leaf node, the same intermediate nodes will appear multiple times across all those paths.

We need a way to look at each of those ordered paths and collect only new nodes – nodes we haven’t seen before – and return them. As we do this we’ll be building up the node traversal order that matches a depth-first search.

WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder

First we collect all the paths (which are now sorted by our traversal ordering) into one big list. The same nodes are going to appear more than once for the reasons above, so we need to remove them.

We can’t just DISTINCT the nodes, because there’s no guarantee that the ordering that we’ve worked hard to create will be maintained.

Instead, we use another reduce and iterate over the list of nodes, only adding a node to our list if we haven’t seen it before. Since the list is ordered, we take only the first of each duplicate and ignore the rest. Our CASE statement is doing the heavy lifting here:

WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder

Equivalently:

  • Create a variable called distinctNodes and set it to be an empty list
  • For each node n in our flattened list of nodes in each path from root to each leaf:
    • If we’ve seen n before (if it’s in our ‘distinctNodes’ list) then set distinctNodes = distinctNodes – effectively a no-op
    • If we haven’t seen n before, set distinctNodes = distinctNodes + n – adding it to the list

This is a horrendously inefficient operation – for a very broad, shallow tree (one where each node has many branches) we’ll be doing on the order of n^2 operations. Still, it’s only for fun.

We’re done! From our original graph, we’re expecting a traversal order of:

N1, N2, N4, N5, N6, N3, N7

And our query?

["N1","N2","N4","N5","N6","N3","N7"]

Another for the annals of ‘Just because you can, doesn’t mean you should’.

Mirroring the NuGet Catalog API locally

TL;DR: You can pull the 3GB clone of the Catalog API from this link, though note that it will unpack to 4.8 million files over 1.6 million folders and weigh in at about 52GB uncompressed.

You can pull the .NET Core console application that produced the clone from GitHub.

NuGet is the .NET package manager, and https://nuget.org hosts almost all publicly-published NuGet packages. A package is essentially a ZIP file with a metadata component and some binaries. The metadata portion details version and author information, descriptions and so on – it also lists the dependencies of the package upon other packages in the ecosystem using SemVer.

Package management data sources like this are interesting for playing around with graphs. They’re big, well-used, well-structured data sets of highly connected data. At time of writing, there are 167,733 unique packages published with over 1.8 million versions of those packages listed.

As part of an upcoming series, I wanted to load that information into a Neo4j graph database to see if there were any interesting insights or visualisations we could create when given access to such a big data set. Unfortunately each call to the API takes between 100ms and 500ms – doesn’t sound like much, but if you’re pulling 4.8 million documents you’re looking at around 23 days of time just sequentially pulling files.

You also have to process the files sequentially – each catalog page has a commit timestamp that gives it a strong ordering, and catalog entries are essentially events that have happened to a package version. It’s possible a single package version has multiple different package metadata pages associated with it spanning an arbitrary period of time as the package is listed, de-listed or metadata amended.

I wanted to have Neo4j load the data via REST API calls, rather than going with a standard file load as that was the point of the exercise. This meant that not only did I have to clone the dataset, but I had to host it locally so that it looked like the live API.

Catalog API

The Catalog API exposes every version of every package published to NuGet. Publishes and updates to published packages are recorded as separate documents, and the catalog is paged into batches of roughly 500 changes each.

There are nearly 9,000 batches reported by the Catalog API, and a total of around 4.8 million documents. Some of those documents relate to the same version of a package – for example, when a package gets de-listed for some reason there will be an updated document in one of the Catalog API pages detailing the new state of the package.

There’s no rate limit on the Catalog API – it’s just files hosted in Azure blob storage – and each document contains navigable URLs to related information. If we wanted to pull a clone of the Catalog API, we could just start from the root document and crawl the links found.

Cloning the documents

Even though Neo4j needs to process the files sequentially, we don’t need to clone them sequentially. To retrieve 4.8 million documents in any sensible amount of time, we need to go multi-threaded. I bashed together a quick .NET console app to do just that.

The app just does a pretty simple breadth-first search of links found in documents. We start off with the catalog root URL, which details the URL of each of the 8,800-ish catalog pages:
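Seeding the queue from that root document looks something like this – a TypeScript sketch of just the first step, rather than the actual C# implementation:

// Fetch the catalog index and pull out the URL of every catalog page to seed the queue.
const CATALOG_INDEX = 'https://api.nuget.org/v3/catalog0/index.json';

const index = await (await fetch(CATALOG_INDEX)).json();
const pageUrls: string[] = index.items.map((page: { '@id': string }) => page['@id']);
console.log(`${pageUrls.length} catalog pages to crawl`);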

We then search for anything that looks like a URI within the file, check we’ve not already processed that URI and then add it to a ConcurrentQueue. Since the .NET BCL doesn’t have a ConcurrentSet, we use a ConcurrentDictionary and just care about the keys within it.

The queue ends up containing 4.8 million items at its peak, and the ‘processed files set’ slowly grows as the queue is drained.

We spin out 64 System.Threading.Tasks.Task objects in an array. Each Task takes a single URI from the queue, quickly validates that we actually have work to do, pulls the contents of the file and parses out any new URIs. Each new URI is added to the processing queue, and the contents of the file are written to disk in a folder structure that mirrors the segments of the URI so that we can easily host it later. The Task then polls the queue again and waits until there’s more work to do, or until the queue is drained of work items.

The method doing the work just spins on the queue until we run out of work to do
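Roughly, each worker does something like the following – again a simplified TypeScript sketch rather than the real C#, with hypothetical helper names:

import * as fs from 'node:fs/promises';
import * as path from 'node:path';

// Map a URI like https://api.nuget.org/v3/catalog0/page0.json to a local path that
// mirrors its segments, so the files can be served again later without renaming.
const savePathFor = (uri: string) => path.join('api.nuget.org', new URL(uri).pathname);

// Crude link extraction – anything that looks like a catalog document URL.
const extractUris = (body: string) => body.match(/https:\/\/api\.nuget\.org\/[^"]+\.json/g) ?? [];

// One of the 64 workers: drain URIs from the shared queue, fetch each document,
// mirror it to disk and enqueue any new URIs found inside it.
async function worker(queue: string[], seen: Set<string>): Promise<void> {
  while (queue.length > 0) {
    const uri = queue.shift();
    if (!uri || seen.has(uri)) continue;
    seen.add(uri);

    const body = await (await fetch(uri)).text();
    const target = savePathFor(uri);
    await fs.mkdir(path.dirname(target), { recursive: true });
    await fs.writeFile(target, body);

    for (const found of extractUris(body)) {
      if (!seen.has(found)) queue.push(found);
    }
  }
}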

Every minute or so a watchdog Task clones the work queue and the ‘processed’ set and persists them to disk. Fun fact – awaiting a WriteLineAsync is incredibly slow relative to just letting it block, especially when you’re calling it millions of times in a loop.

On startup, the app looks for these checkpoint files and reloads its state from them if found – that way we can pause the processing by just killing the process.

At peak the process was using around 2GB of RAM and pulling down files at around 10MBps.


You can find the source for the application on GitHub, as well as a link to the generated output, but the tool itself would need more work before you could, for example, resume from a snapshot tarball such as this.

24 hours later, what I ended up with was… extensive.

Just calculating the summary took twelve minutes
The unnerving feeling of seeing 1.6 million folders in a single folder doesn’t really go away

Now what?

Serving a mirror of the Catalog API with Nginx

Now that we’ve got our archive of files, we can spin out a super simple docker-compose script to host Nginx, serve the files from some URL and then replace our usages of api.nuget.org with localhost:8192 or whatever.

First off, the docker-compose file – we’re assuming that the file lives in the same folder as the api.nuget.org root folder of documents:

web:
  image: nginx
  volumes:
   - ./api.nuget.org:/usr/share/nginx/html:ro
   - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
  ports:
   - "8192:80"

You’ll have to make sure Docker has access to the drive where the mirrored files are stored to pull this off. We just serve that folder directly as Nginx’s default document base, but then also map in a file to configure Nginx to rewrite any URL that looks like ‘api.nuget.org’ to point to localhost:

server {
    listen 80;
    
    location / {
        root /usr/share/nginx/html;
        sub_filter 'https://api.nuget.org/'  'http://$http_host/';
        sub_filter_types 'application/json';
        sub_filter_once off;
    }
}

Because the crawler just saved files to a directory structure that matches the URI components, we can trivially serve the files without needing to translate names.

A quick docker-compose up and we’re off to the races.

Of course, we could just use the .JSON files directly, without getting Nginx in the way – but since the Neo4j post I did this for was about crawling a REST API, it seemed like a bit of a cheat to just index files on disk instead.

Downloads

You can download the archive from this link – it’ll take… some time to decompress, and you’ll end up with around 52GB of space consumed on the drive once you’ve done so, as well as 4.8 million new files and 1.6 million new folders so maybe do this on a drive you don’t care that much about.

The source for the crawler is available on GitHub though it’s super rough-and-ready and provided with absolutely no guarantees, aside from that it’s likely to make your day worse by running it.

Graph data modelling – inferred vs explicit categories and labels

When building graph data models we frequently have to deal with a degree of polymorphism for our entities, just as in the real world.

For instance – I’m a person, but I’m also a parent, a spouse, a sibling, a child, a…

Implicit categorisation

Sometimes the entity categories are entirely defined by relationships to other entities. In most of the above examples, we can categorise me because of how I relate to other people in my family:

  • I’m a husband because I have a ‘married to’ relationship to my wife
  • I’m a sibling because I have a ‘sibling of’ relationship to my brother
  • I’m a child because I have a ‘child of’ relationship to both my mother and father

These categories are all fairly simple one-hop affairs – we can categorise me in different ways by looking at how I’m directly connected to other entities in the graph.

A more involved category is ‘parent’ – in a family tree we could be explicit when dealing with parent/child relationships and add both :PARENT_OF and :CHILD_OF relationships to the graph, or just add one (say, :PARENT_OF) and flip our query around a bit to figure out which nodes should have a ‘child’ category.

These two representations aren’t quite equivalent, but the single-relationship model can answer the same questions as the two-relationship one by phrasing everything in terms of the :PARENT_OF relationship.

There are disadvantages to the two-relationship approach when the relationship is symmetrical (you now have to maintain two relationship types that are logically mutually dependent, there are additional storage requirements, and there’s no longer a canonical way to query the children of a given Person) but it’s still a viable model.

When I say ‘a canonical way to query’, consider the question ‘Who is Tony Stark’s father?’. The following two queries both work on the two-relationship schema, express the same intent and return the same result:

MATCH (father: Person)-[:PARENT_OF]->(tony: Person { name: 'Tony Stark' })
RETURN father

MATCH (tony: Person { name: 'Tony Stark'})-[:CHILD_OF]->(father: Person)
RETURN father

In the single-relationship graph there’s only one way to figure out Tony’s parent, which may be considered an advantage.

Explicit categorisation

There are a couple of situations where we may want to explicitly categorise nodes (via node labels or node properties):

  • The category is not entirely defined by the entity’s labels and how that entity relates to the wider graph
  • On a busy graph we’re frequently interested in just the categories of nodes, without necessarily being interested in how those categories were arrived at (we’ll see an example of this in a bit)

The first is fairly obvious, and we can use an HR example to demonstrate it.

In Big Co, every employee has exactly one manager. Managers can have any number of employees reporting to them, including none. A manager isn’t a specific job title or position – you can manage people as part of your day job, but there are also dedicated staff managers whose only role is line management.

Here it’s easy to think that the ‘manager’ category is dependent on there being ‘REPORTS_TO’ relationships into the node, as follows:

CREATE (gill: Employee { name: 'Gill' })
CREATE (peter: Employee { name: 'Peter' })
CREATE (geoff: Employee { name: 'Geoff' })
MERGE path=(geoff)-[:REPORTS_TO]->(peter)-[:REPORTS_TO]->(gill)
RETURN path

And we can pull a list of managers by just looking for anyone with a :REPORTS_TO into them:

MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN DISTINCT manager.name

manager.name
"Gill"
"Peter"

But we haven’t covered the case where a dedicated staff manager doesn’t have any reports yet. Suppose a new manager called Sandra joins the company reporting to Peter – on their first few days they won’t have any reports as they’re still getting trained up, but they’re still a manager according to our definition.

We now need to explicitly categorise the new node somehow. Either via a label:

CREATE (sandra:Employee:Manager { name: 'Sandra' })
WITH sandra
MATCH (peter:Employee {name: 'Peter' })
MERGE path=(sandra)-[:REPORTS_TO]->(peter)
RETURN path

Or via some property on the Sandra node:

CREATE (sandra:Employee { name: 'Sandra', isManager: true })
WITH sandra
MATCH (peter:Employee {name: 'Peter' })
MERGE path=(sandra)-[:REPORTS_TO]->(peter)
RETURN path

To make sure we get Sandra back in our earlier ‘get all managers’ query we now have a few options. Here’s a couple:

// Assumes we went with maintaining a 'Manager' label
MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN manager.name
// UNION will DISTINCT for us, so we can remove it from the two RETURNs
UNION MATCH (manager:Manager)
RETURN manager.name

// Alternate phrasing of the above without the UNION
MATCH (manager:Employee)
WHERE (:Employee)-[:REPORTS_TO]->(manager: Employee)
   OR (manager:Manager)
RETURN DISTINCT manager.name

// Assumes we went with a node property 'isManager' and that we've indexed it
MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN manager.name
UNION MATCH (manager:Employee { isManager: true })
RETURN manager.name

Dirty third option

There is a fairly dirty third solution here, which is to have a dummy Employee node that represents a placeholder employee – a dummy entry from which we can create a :REPORTS_TO relationship to Sandra. Now our original inference is correct again (if you have inbound :REPORTS_TO relationships then you’re a manager), but our data model no longer matches the business definition, because we may have multiple managers listed for that dummy node (breaking the ‘exactly one manager’ rule). That in turn can be worked around by creating a dummy employee node for each manager who lacks reports.

This option is also problematic because we would need to detach and reattach the dummy node when :REPORTS_TO relationships are created or destroyed, and still have to store the explicit ‘isManager’ flag for that process to reliably work.

It also has shifted the problem somewhere else – we’ve made it easy to get a list of managers, but how do we now get a list of employees while excluding the dummy ones?

Impacts of explicit categorisation

There are several big impacts to explicitly categorising nodes, though ultimately if your business requirement doesn’t allow you to reliably infer categories from relationships you’ve not many choices.

Query complexity and performance

Our really straightforward ‘get all managers’ query above is now hard to express succinctly because:

  • Every OR in a query expands our search space and slows us down – we’re no longer just hitting indexes to look things up, and we need to combine result sets to get us to the right answer
  • Neo4j doesn’t have an efficient mechanism to query with a disjunction of node labels – we can’t for example say ‘MATCH (:Employee | :Manager)’. Our ‘alternate’ query above basically does a label disjunction, but relies on every Manager also being an Employee (which we can specify for this case but not in general). For other cases, label disjunctions essentially scan every node in the graph

However, you may find that some queries become quicker because you’re doing less work for the ‘flag’ cases where you just want to know if someone’s a manager, and not who they manage. If you could maintain a :Manager label semi-automatically, those queries become trivial, but that maintenance itself isn’t free and is extra logic for your application to contain.

Cognitive complexity

Our logical definition of ‘Manager’ is now:

  • Has any inbound ‘reports to’ relationships
  • OR: is explicitly flagged as a Manager (via label or property)

This means in any piece of code where we need a list of managers we have to embed that logic into the search. If our definition changes we’d need to update a lot of places at once, and because the logic is probably hard to make performant in general it might be expressed in very different ways in different queries depending on the situation.

We can’t get around the cognitive complexity of what the business defines as a manager, but if we’re automating the maintenance of a :Manager label and then in our schema only ever use :Manager to determine the list of managers – while still using the :REPORTS_TO relationship to find out who reports to each manager – then we have a clearer delineation and an easier rule to lint for.

Automatically maintaining the category has plenty of failure modes

Setting labels based on relationships is pretty perilous and really only viable if you can infer the intended label from one or two hops (plus whatever manual flag is being set). We now also need to assess the impact if we screw up or the automated label maintenance fails or is delayed:

  • What happens if we forget to set or remove a label?
  • How do we document that we need to fix up the :Manager category each time we amend :REPORTS_TO relationships?

Automating the category tightly couples parts of your application that needn’t be

Back to our family tree example: let’s say that we want to find all the Uncles in the family tree. The maths for being an uncle is pretty easy – X is an uncle of Z if X is the :SIBLING_OF Y and Y is the :PARENT_OF Z.

That transitive relationship causes us a headache though – now when my brother has a child, the corresponding ‘Add Child’ code has to:

  • Create the child node (we had to do this before too)
  • Create a :PARENT_OF relationship between my brother and the new child (same as before)
  • Find all siblings of my brother and label them as :Uncle

The code that adds children to parents shouldn’t care at all about uncles, aunts, nieces and nephews but now it has to (or has to at least know that there might be downstream impacts) to fix up the graph so the categories match the data.

Upshot

I think from having played around with this both in relational and graph databases I’ve roughly come down on:

  • If the category represents a fundamental classification of an entity that broadly doesn’t change and doesn’t depend on the entity’s place in the wider graph, use a node label/explicit category
  • If the category is defined mostly or entirely by its place in the graph, keep its definition to be relative to the graph – i.e. follow the relationships to answer questions, perhaps with disjunctions for explicit classification flags on the node (the ‘isManager’ example above)
  • If performance becomes an issue then consider automating the maintenance of indexable fields or labels based on whatever logic dictates the classification

Graphs aren’t magic

Ultimately, if you have complicated logic in your business domain to classify entities then that complexity has to go somewhere and graph databases aren’t exempt from that. You can cover it off in your application, or you can make your data model more complex/less representative of the real world but chances are you’ll have the same sorts of hurdles to overcome whether using a graph or SQL.

SonarTsPlugin retired and archived

Some time late 2014 I wrote SonarTsPlugin, which for a few years was one of the only ways to get Typescript analysis into SonarQube. It was:

  • The first time I’d used tslint
  • The first time (and last, though for purely coincidental reasons) I’d written an analyser for SonarQube
  • The most popular thing I’ve committed to GitHub

Over 180 stars, over 100 forks and 8.5k downloads of the most recent version. I’m pretty pleased.

Fortunately, the SonarQube guys wrote their own official Typescript plugin a while back and it’s stable, well-supported and covers off almost everything mine did – or it can be supplemented by other existing plugins to cover the remaining functionality. Not only does this make my plugin redundant, it also makes it a source of confusion – people come raising issues thinking it’s the official plugin, and I’m not equipped time-wise to give much help (especially since Elspeth was born).

I’ve not been able to make updates to it at all for at least 18 months, which means it’s drifted behind SonarQube upgrades and tslint rule changes.

So as of today the repository has been archived with a note, and it’ll be left up as a reference only.

Taking a career break

You love developing software – the hands-on sketching, writing, debugging and deploying of it, prototyping out ideas and proving out new approaches to problems in code.

The industry has a well-known problem for people like you though – especially in large companies, progression tends to take you away from the hands-on day-to-day development work and towards management, architecture and governance roles.

What’s the career equivalent of feature creep?

For a lot of folk this is perfect. You take on different responsibilities, you’re forced to stretch into roles you’ve never held before and develop skills that are broadly very useful but probably not part of a university CS degree. There’s a catch though – making the step back to deep technical work becomes harder the longer you’re away from it, and it’s easy to find yourself in a position where your CV no longer looks like that of a developer but looks rather more like that of a manager.

So you read Hacker News on the train, you spend evenings and weekends hacking about with Angular or Azure Event Grid or AWS Lambda and you get to keep up to speed with what’s happening around you a bit. Flex your coding muscles while still getting to do all the other management and less hands-on technical bits your work requires. You mightn’t even notice how much professional development you’re doing off the clock.

Enter a new challenger

This is where I was at until our daughter arrived in 2013 – I just didn’t know it. She’s amazing – shouty and stompy and with a grin that’d turn the sky blue in a storm – but looking after your children is all-consuming and you throw yourself into it entirely.

But there’s no time to tinker with TinkerPop when your little girl’s yelling “SLIDE!” at you and bouncing up and down, there’s sliding to be done! Reading that book on domain-driven design or microservices comes much further down the list than making her favourite dinner. And forget about that Azure Friday backlog you’ve got, you’ve got your own dinner to make once she’s in bed.

All of this has forced me to account for my time a bit more, and realise that I was compensating for spending less time being a developer at work by spending more and more of my spare time doing technical things. That’s fine, obviously – software development’s my hobby as well as my career, but when that time is no longer available to you it’s easy to get resentful or sad about how quickly things are whipping past you. It’s easy to lose your love for what you’re doing.

Taking a step back and some time off

So I decided to take a break, hand my notice in and take three or so months off dedicated to the things I want to do and learn. My last day at work was the end of May, so I’ve been able to focus my time since then on two main areas.

First off – I get to spend a lot more time with my daughter. No commute means I’m there to greet her from nursery, do drop-offs when my wife’s working from home, spend a full extra day a week with her and my wife for the day she’s not at nursery when I’d usually be in the office. If nothing else, seeing that massive grinning loon barrelling toward you from the nursery garden yelling “daddy!” – and with scant regard for her friends as she barges past – has emotionally paid for this time off entirely.

Secondly I get to do focused learning on topics that I have had bookmarked for ages but not gotten around to, where I can take on board a new framework or technology or try to better understand the underpinnings of stuff I’ve already been doing. It’s learning for its own sake, but with a rough direction I’m heading towards.

Learning targets

The rough plan, then, is to learn or improve my knowledge in a few specific areas:

React and Redux

I’ve been a full-stack web developer for the past eight or so years (so well over half my career now), but the front-end work has almost always been in Knockout for reasonably good reasons. It’s stable, it works well enough and I’ve been extremely productive with it but new projects rarely start in it. I chose to learn React for a couple of reasons:

  • So that I’ve got a mental model of how a different front-end framework operates
  • To get more exposure to newer client-side build pipeline standards

I’ve chosen a Udemy online course as the basis for this learning, which I’ll be backing up with practical explorations using some of the other tech on my list.

Neo4j Graph Database

At my previous employer I worked on and was responsible for our bespoke institutional investor CRM and marketing platform, which was all built on Oracle and ASP.NET. When you get down to it though, you’re modelling graph-like structures in a relational database – so I spent some time exploring Neo4j as an alternative back-end but in a fairly unstructured fashion.

In my break I’ve already spent a long while doing the Neo4j self-learning courses, brushed up on my Cypher syntax and built out multi-node clusters for testing various scenarios as part of their operations training. I’ve also become a Neo4j Certified Professional to back that up. I plan to explore what the implications would be of having a large-scale CRM built in Neo4j in some of my remaining learning time.

GraphQL and the GRAND Stack

GraphQL’s having a good run at the minute but it’s fairly early doors in terms of tooling and support, especially on the .NET front (when compared with something like Apollo). That’s good for me – I’ll get to learn it while it’s still a bit of a moving target, but it’ll also force me to look broader than my .NET background for back-ends to do that learning.

More interesting for me is the so-called GRAND stack of GraphQL, React, Apollo and Neo4j for building graph-based applications. A lot of CRM data is natively hierarchical or structured as connections between related entities – perfect for a graph database, ideal for GraphQL. I’m not going to have time to deep dive the whole stack, but a working knowledge of all the moving parts is the aim and that Neo4j CRM prototype is intended to be built against some of this stack.

Azure and Serverless

Having worked almost exclusively on on-premise applications for the past handful of years I’ve watched Azure mature and expand as a platform but not managed to keep as current with it as I’d like. Sure, App Service and Azure SQL and Table Storage and Azure Virtual Machines are all essentially unchanged but I’ve not played with Event Grid, nor tried to break CosmosDB, or really gotten into Azure Functions at all.

This’ll be a mix of Azure online learning resources and just trying things out. Ideally I’d be in a position to do a certification at the end of this but that’s a nice-to-have and not the goal.

Life after the career break

I’ve not yet decided what comes after my break. Obviously just finding another job is high up the list, but I’m already more willing to consider things like remote working, or part-time working, or contracting given how vast the benefits to my home life have been just for spending more time with my family.

What’ll be important though is that I get to keep learning as I go, and that there are technical challenges to overcome. While learning and prototyping for its own sake is a wonderful indulgence, for me nothing beats getting an idea out of someone’s head and into production.

SonarTsPlugin 1.0.0 released

In something of a milestone for the project, SonarTsPlugin 1.0.0 has been released. While the last blog post that mentioned the plugin had it at v0.3, there have been a great many changes since then – to the point that I might as well outline the total feature set:

  • Analyses TypeScript code using tslint, or consumes existing tslint output and reports issues to the SonarQube interface
  • Analyses code coverage information in LCOV format
    • Also supports Angular-CLI output
  • Derives lines-of-code in your TypeScript project
  • Supports user-defined rule breach reporting
  • Supports custom tslint rule specification
  • Compatible with Windows and Linux, supports various CI environments including VSTS
  • Compatible with SonarQube 5.6 LTS and above
  • A demo site exists
  • Sample projects demonstrating setup of the plugin are available

The project readme has fairly detailed information on how to configure the plugin, which I’m shortly to turn into a wiki on GitHub with a little more structure.

The plugin has been downloaded over a thousand times now, and appears to be getting increasing use given the recent trend of issues and activity on the project. Hopefully it’s now in a good place to build upon, with the core functionality done.

The next big milestone is to get the plugin listed on the SonarQube Update Centre, which will require fixing a few code issues before going through a review process and addressing anything that comes out of that. Being on the Update Centre is the easiest way as a developer to consume the plugin and to receive updates, so is a real priority for the next few months.

SonarQube TypeScript plugin 0.3 released and demo site available

I’ve recently made some changes to my SonarQube TypeScript plugin, pithily named ‘SonarTsPlugin’, that:

  • Make it easier to keep up to date with changes to TsLint
  • Fix minor bugs
  • Support custom TsLint rules

Download links

Breaking change

In a breaking change, the plugin no longer generates a configuration file for TsLint based on your configured project settings, but instead requires that you specify the location of a tslint.json file to use for analysis via the sonar.ts.tslintconfigpath project-level setting.

There were several reasons for the change as detailed on the initial GitHub issue:

  • The options for any given TsLint rule are somewhat fluid and change over time as the language evolves – either we model that with constant plugin changes, or we push the onus onto the developer
  • Decouples the TsLint version from the plugin somewhat – so long as rules with the same names remain supported, a TsLint upgrade shouldn’t break anything
  • Means your local build process and SonarQube analysis can use literally the same tslint.json configuration

Custom rule support

New to 0.3 is support for specifying a custom rule directory. TsLint supports user-created rules, and several large open-source projects have good examples – in fact, there’s a whole repository of them. You can now specify a path to find your custom rules via the sonar.ts.tslintrulesdir project property.
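Putting the two settings together, a project’s sonar-project.properties ends up looking something like this (project key, sources and paths here are illustrative):

sonar.projectKey=my-typescript-project
sonar.projectName=My TypeScript Project
sonar.sources=src

# Use the same tslint.json as the local build, rather than a generated one
sonar.ts.tslintconfigpath=tslint.json

# Where to find any custom tslint rules referenced by that configuration
sonar.ts.tslintrulesdir=tslint-rules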

NCLOC accuracy improvements

A minor defect in NCLOC counting was fixed, where the contents of block comments longer than three lines were considered code.

Demo site

To test the plugin against some larger and more interesting code-bases, there’s a SonarQube 5.4 demo installation with the plugin installed available for viewing. Sadly so far none of the projects I’ve analysed have any major issues versus their custom rule setup…

Future

There remains minor work to do on the plugin, and I’ll keep it up to date with TsLint changes where possible.