I answered a Stack Overflow question the other day about why two queries performed very differently. One involved a chain of OPTIONAL MATCHes, the other a single OPTIONAL MATCH with portions of a pattern separated by commas.
There’s a difference between the two that will almost certainly give you different results.
Patterns in Neo4j
A pattern is a description of a structure in a graph we’re trying to match. Patterns can assign nodes and relationships to variables we use in subsequent processing. They express both the nodes we’re looking for and how they must be related. They’re fundamental to a graph query – and the docs aren’t super-clear on how patterns with commas work.
A pattern might be something like ‘Actors connected to Movies via an ‘ACTED_IN’ relationship’.
Patterns with commas
A MATCH statement where the pattern contains a comma is still one MATCH statement. Every component of the pattern must be satisfied for the MATCH to return anything.
Let’s say we’re using the Movies database, and we want to find the directors of movies Tom Cruise has starred in.
One way is to assign a variable to the Movie node, and use commas:
MATCH (tom: Person { name: 'Tom Hanks' })-[:ACTED_IN]->(m: Movie), (director: Person)-[:DIRECTED]->(m)
RETURN DISTINCT director.name
Even though it might look as though we’re specifying two patterns we’re not: it’s all part of the same pattern. If Neo4j can’t satisfy any portion of the MATCH, it’ll return nothing.
Alternatively we could also use two MATCH statements one after the other:
MATCH (tom: Person { name: 'Tom Hanks' })-[:ACTED_IN]->(m: Movie)
MATCH (m)<-[:DIRECTED]-(director: Person)
RETURN DISTINCT director.name
In this example there’s no difference between the two queries – they’re functionally identical. Both return the same data and have identical EXPLAIN and PROFILE results. For this specific query there’s no difference in the two approaches.
But one situation where there is a big difference between the two is when we’re using OPTIONAL MATCH instead of MATCH.
OPTIONAL MATCH chains
Recall that an OPTIONAL MATCH returns a row if our pattern is satisfied, and NULL if not – it’s basically an OUTER JOIN but expressed in a graph query.
Let’s look at a fairly contrived example using the Movies database. Here, we want every Movie and the list of people who produced it, if any are known. We use OPTIONAL MATCH because not all Movies have producers in our database, but we still want to return the Movie title.
MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m)
RETURN m.title, collect(producer.name)
// 38 records returned
38 records are returned, one for each Movie in the database, of which 10 have a non-empty collection of producer names.
What if we now also wanted to list the names of anyone who reviewed those films, alongside the producers. We can chain two OPTIONAL MATCH statements for this:
MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m)
OPTIONAL MATCH (reviewer: Person)-[:REVIEWED]->(m)
RETURN m.title, collect(producer.name), collect(reviewer.name)
We still get 38 records, some of which have a producer (like The Matrix and Jerry Maguire) and some of which have a reviewer (like Jerry Maguire and The Replacements). Most records only have one or the other – a few have both.
You might look at that query and wonder, “can’t we simplify that by doing all the OPTIONAL MATCHes in one go?”. It might look like this:
MATCH (m: Movie)
OPTIONAL MATCH (producer: Person)-[:PRODUCED]->(m),
(reviewer: Person)-[:REVIEWED]->(m)
RETURN m.title, collect(producer.name), collect(reviewer.name)
Well – no, you can’t. The two queries are expressing different things. The commas in the second query’s OPTIONAL MATCH are expressing a single pattern – we need to find a Person who :PRODUCED the movie, and a Person who :REVIEWED the movie to return anything from the optional match at all. If we can’t match any part of the comma-separated portions of the pattern, we’ll return nothing at all.
We no longer return any producer or reviewer information for The Matrix, nor The Replacements. The Matrix has a producer but no reviewer, and the Replacements has a reviewer but no producer.
This is the last in a series of posts where we created a Neo4j model of the Minecraft crafting tree and played around with querying it to build a shopping list for any given item.
We found data quality issues and corrected them, but finding them was made easier because of Neo4j Browser’s easy visualisation of our graph. You can see data issues quicker than you can query for them, and a human will spot anomalies pretty quickly. For example:
Isolated nodes where you expected a fully-connected graph
Cycles in the graph where you expect it to be acyclic
Multiple relationships between nodes where you expect only a single relationship
If you’re working with a large dataset (remember: the Minecraft dataset was just a toy) you could consider sampling your source data and testing your ingest on that sample. You may not find all data issues in the sample, but quickly visualising and proving out your graph model is much easier if you can fit the whole model on a few screens.
Neo4j is radical overkill for this specific problem
We knew this going in to it – we don’t need a proper database for this, we could just as well have represented the whole graph in memory in Javascript and gotten the same answers out.
That said – while we’re not using any large fraction of the power of Neo4j for our example, we did touch upon a good few core concepts:
Data loading via LOAD CSV and MERGE
Constraints and indexes
Visualisation
Querying in Cypher
Aggregation in Cypher
APOC
APOC is huge and very capable
We used a single function from the APOC library which boasts over 450 functions exposed to your Neo4j instance. It addresses a lot of functionality shortfalls in Cypher, and if you start digging into it further features like graph refactoring support, virtual nodes and node grouping all become very attractive tools for your mental toolkit.
I’m surprised it’s not just bundled by default.
It’s tempting to write APOC calls or application code over pure Cypher
I still haven’t found the dividing line yet where I’m comfortable saying ‘just express that in Cypher’ or ‘just use APOC’. We found two queries that were broadly equivalent, one in some fairly long-winded Cypher and what amounted to a one-liner in APOC. Was one better than the other?
If you’re a developer and you take the dependency on APOC to express a query then you’re buying your future self (and future colleagues) into that ecosystem. You can probably express almost any Cypher concept in a series of APOC calls – should you? Your future self now has to remember exactly how, for example, apoc.path.subgraphAll works to figure out what a query’s up to, while the plain Cypher was easier to figure out from first-principles.
There isn’t just one way to model your domain
Neo4j claims that, and I’m paraphrasing, ‘your whiteboard model is your data model’ and that’s true – we literally did that, replete with a photo of our whiteboard.
However, if you can represent something multiple ways on your whiteboard then that doesn’t save you from the underlying domain complexity. There isn’t just one way to model your domain, and the model you choose has implications on:
How easy it is to query the graph
Whether you need application code in places where you could have been using Cypher
How easy it is to ingest new data and maintain consistency within your model
Performance
…
The list goes on. We only looked at two representations of our data – there were more we could have explored. The accepted guidance seems to be to optimise your data model around the most common query or queries you’ll be performing – i.e. cover the 80% case and worry about the long tail of other queries later.
Graph refactoring and experimentation is easy
Luckily Neo4j makes it super easy to refactor your graph, and either APOC or the Neo4j cypher-shell let you export and re-import data or subsets of your data to try experiments quickly.
In addition Neo4j Desktop is very capable and a whole lot easier to use than the old days where you had to stand up new instances of the database or play with configuration files to swap graphs in and out. Installing APOC takes two clicks, and staying on the latest database version is as easy and choosing from a drop-down box.
Conclusions
The graph model we chose made certain operations more difficult. Still – we overcame the issues we had either with Cypher or application code, and built something that answers interesting questions from the graph.
Our Minecraft example wasn’t a good use-case for Neo4j as it stands, but modelling BOMs in Neo4j is a good fit in general (which is what we started to show in our simplified example back in Part 4).
Being able to answer questions like ‘what’s the most contrived item to build’ may sound frivolous, but in manufacturing translates to:
Lead-time estimation – from materials ingest, how long before step X of the process can expect work, and how long is our production pipeline in minutes/hours?
Quality and yield modelling – what’s the cost of rework for a defect found at a particular part of the process?
Design-for-manufacture decision making support – how many different fastenings are used across the product, can we reduce that?
In this series of posts, I’m going to try to represent the Minecraft crafting tree in Neo4j. In Part 2 we imported some data, explored it a little and noticed some inconsistencies which we fixed up. We found that we can write some simple and interesting queries against it. In this post we’ll try building a shopping list for a Wooden Sword and realise that our data model isn’t right yet – then we’ll try and refactor the model to something more useful.
Note: You can download the state of our database so far via this Gist, which is a Cypher script to be run into a clean graph. You’ll be able to download the state of play after all our refactoring work from this post at the end.
A problem lurks – output quantities
So far our model is good for the dependency chain of Minecraft items (i.e. ‘some number of these things makes up item X’) but we can’t produce a list of required materials to gather for any given Minecraft item – the shopping list of items-and-quantities that we’d have to go find or mine to make it.
Our graph falls down because we’ve made a bogus assumption – that some variable number of input items produce one unit of output.
For instance – a Torch is made of 1x Stick and 1x Coal – if we have one each of one of those, we produce one Torch and all is well.
But while making Sticks requires 2x Wood, we actually output 4x Sticks for every 2x Wood that comes into the crafting bench. Our graph didn’t encode this at all, so we can’t reliably figure out how much Wood we need to produce one Stick. Or three Sticks.
Representation is everything
It’s not even clear how best to represent this. At the minute our :REQUIRES relationship has a qty attribute detailing how many of the input item it takes to build one unit of the output item. Our current model has
(stick)-[:REQUIRES { qty: 2 }]->(wood)
which is strictly true (in the sense that we can’t make a Sticks until we have 2x Wood) but obviously doesn’t reflect the actual cost of producing a Stick.
Mathematically more accurate would maybe be to allow fractional quantities – 4 Sticks are produced when we put 2x Wooden Planks together, so the cost of one Stick is in some sense 0.5 units of Wood.
(stick)-[:REQUIRES { qty: 0.5 }]->(wood)
But now we’ve lost the concept of how many input items are needed to build the smallest buildable bundle of Sticks – we can’t bring half a Wood unit to the crafting bench, half-units don’t exist at all and even if they did the recipe requires 2 units to kick off. We can’t infer that from the quantity on the relationship.
Let’s explore a couple of refactorings of our model to help.
Option 1 – add outputSize as a property on our Resource nodes
If we could reliably say that Sticks are only ever produced in bundles of 4 then we could express that on the Stick node in our graph and keep representing input quantities how we are just now (i.e. the actual amount required to kick off the crafting). For example:
Now to figure out how much it’ll cost to make a Torch we could walk the graph and do some maths along the way:
1x Coal is just found by mining
1x Stick is 0.25 of the outputSize of a Stick crafting run. Since we can’t have fractional crafting runs – we either produce 4x Sticks or none at all – we could maybe express the requirement as ceil(1 Stick / outputSize) = 1 Sticks in whatever application logic we have
To make 1x Stick requires 2x Wood
Thus the minimum requirement is 1x Coal + 2x Wood, and we’ll end up with some Sticks left over at the end
If there are multiple multiple ways to produce the same item then this falls down – and while our graph just now has only one way to produce things, Minecraft does support variations. For example, you can use an Anvil to repair any Sword by combining two of that Sword type into one.
Thus we could make a sword by building it from scratch, or by combining two broken swords into one new one. Our graph above doesn’t let us express those as different options – if we try to represent it naively we’ll make it look as though the cost of making a sword is that of all its raw ingredients, plus another sword (giving us cycles, if nothing else, and meaning we’d never be able to make the things).
So far it looks OK as an approach so long as we’re willing to accept that we can’t model there being multiple ways to make a given item (which for some games is entirely fine).
The following few statements will update our original graph with the output quantities:
// Recipes that produce 2 of something
MATCH (r:Resource)
WHERE r.name in ['Trapdoor', 'Fence', 'Blaze Powder']
SET r.outputQty = 2;
// Recipes that produce 3 of something
MATCH (r:Resource)
WHERE r.name in ['Fire Charge', 'Ladder', 'Paper', 'Sign', 'Bottle']
SET r.outputQty = 3;
// Recipes that produce 4 of something
MATCH (r:Resource)
WHERE r.name in ['Wooden Plank', 'Stick', 'Torch', 'Arrow', 'Stone Brick', 'Smooth Sandstone', 'Wood Stair', 'Cobblestone Stair', 'Bowl', 'Pumpkin Seed']
SET r.outputQty = 4;
// Recipes that produce 6 of something
MATCH (r:Resource)
WHERE r.name in ['Wooden Slab', 'Stone Slab', 'Stone Brick Slab', 'Brick Slab', 'Cobblestone Slab', 'Nether Brick Fence', 'Cobblestone Wall', 'Mossy Cobblestone Wall']
SET r.outputQty = 6;
// Recipes that produce 8 of something
MATCH (r:Resource)
WHERE r.name in ['Cookie']
SET r.outputQty = 8;
// Recipes that product 15 of something
MATCH (r:Resource)
WHERE r.name in ['Iron Bar', 'Glass Pane']
SET r.outputQty = 16;
// Give a default output quantity of 1 for everything else
MATCH (r:Resource)
WHERE r.outputQty IS NULL
SET r.outputQty = 1;
Option 2 – represent a Recipe as a first-class citizen of the graph
Option 1 is trying to work around the fact that our graph’s representing two things in one go:
The actual items being crafted – like Sticks and Pickaxes
The way in which those items are crafted – their dependencies
In short, a node in the graph represents both the output item and the recipe to make it, even though they’re two different concerns.
What if Recipes were nodes in our graph?
A Recipe:
:REQUIRES Resource nodes (where the qty attribute expresses the number required to complete the recipe)
:PRODUCES Resource nodes (where the qty attribute expresses how many of that Resource are produced for each run of the Recipe)
We could create Recipe nodes for every Resource node that has any inbound :REQUIRES edges – for example. we won’t end up with a Recipe for Wood (because you find it, you don’t craft it) but will have a Recipe for Wooden Planks.
This feels like a more correct approach somehow, so let’s explore how we might implement it.
Introducing Recipe nodes
First let’s create one Recipe node for each Resource node in the graph that has any inbound :REQUIRES edge (i.e. that isn’t a raw material):
MATCH (out: Resource)-[rel:REQUIRES]->(in: Resource)
MERGE (outRecipe: Recipe { name: out.name })
MERGE (outRecipe)-[:REQUIRES { qty: rel.qty }]->(in)
MERGE (outRecipe)-[:PRODUCES { qty: 1 }]->(out)
RETURN out, in, outRecipe
Added 137 labels, created 137 nodes, set 493 properties, created 356 relationships, started streaming 219 records after 98 ms and completed after 115 ms.
We could also delete the original :REQUIRES relationships:
MATCH (:Resource)-[rel:REQUIRES]->(:Resource)
DELETE rel
Completed after 32 ms.
And let’s throw a unique constraint on the new Recipe node’s name attribute while we’re at it.
CREATE CONSTRAINT ON (recipe: Recipe) ASSERT recipe.name IS UNIQUE
Added 1 constraint, completed after 104 ms.
Querying with Recipe nodes and pure Cypher
Our first stop should be to ask the updated graph how do you make Wooden Planks. But we’ve got a problem with this new approach – by adding in extra hops, we can’t easily use variable length path querying to figure out how to get from Wooden Plank to all its constituent raw materials and the intermediate Recipe nodes.
If we knew ahead of time that there was only one hop, then the following would work:
But if we try that trick on, say – a Stick we’ll only get as far as the Wooden Plank requirement and have to make another query to figure out how to make those:
Before we’d just have said ‘traverse as many :PRODUCES relationship hops as you need’ – recall our Wood Sword example:
MATCH path= (r:Resource { name: 'Wood Sword' })-[:REQUIRES*]->(:Resource)
RETURN path
We can’t use that trick any more – we sort-of want to wrap up the (Resource)<-[:PRODUCES]-(Recipe)-[:REQUIRES]-(Resource) double-hop and say ‘run that double-hop as many times as you need’ but that can’t really be done.
We could use a chain of OPTIONAL MATCHes and some clumsy Cypher to do the job if we knew the most hops we’d ever need. For example, we could support an ‘up to 3-hop’ traversal with two OPTIONAL MATCH clauses.
MATCH (r: Resource { name: 'Wood Sword' })<-[:PRODUCES]-(recipe1: Recipe)-[:REQUIRES]->(dep1: Resource)
OPTIONAL MATCH (dep1)-[:PRODUCES]-(recipe2: Recipe)-[:REQUIRES]->(dep2: Resource)
OPTIONAL MATCH (dep2)-[:PRODUCES]-(recipe3: Recipe)-[:REQUIRES]->(dep3: Resource)
RETURN r, recipe1, dep1, recipe2, dep2, recipe3, dep3
Here we’d just tack in extra OPTIONAL MATCH clauses for each level of depth we want to support.
Ugly but workable, and pure Cypher.
Querying with Recipe nodes and APOC
Instead, we can use APOC – the apoc.path.subgraphAll procedure lets us specify a repeating pattern of relationships to navigate and also limit the traversal to a specific min and max depth.
Here we’re using a map to configure the traversal with two properties:
relationshipFilter is an APOC filter string to express the relationships we should seek and traverse
maxLevel is the maximum depth from r we should traverse to
The documentation’ll do a better job than me, but essentially:
<PRODUCES – find :PRODUCES relationships going into our r node
REQUIRES> – from nodes linked via that :PRODUCES relationship (which we know to be Recipe nodes), find the nodes that are related via an outbound :REQUIRES relationship
The comma between the two patterns asks APOC to apply them in sequence – find a :PRODUCES relationship, then expand out from there using :REQUIRES relationships.
APOC will then iterate that process up to 10 times (our maxLevel parameter), expanding via :PRODUCES, then via :REQUIRES, then via :PRODUCES again until we run out of nodes or hit our limit.
It’s essentially the same as the other query, but now we’re relying on APOC and not just pure Cypher to do the work, and it isn’t particularly easier to read. Still – it’s good to have options.
Fixing up our output quantities
We still need to make sure that our output quantities are right. So far we have each Recipe requiring the correct quantity of materials through its :REQUIRES relationship to them, but we assumed that each Recipe only :PRODUCES one of that item. That’s not true in a number of cases, so first let’s fix that.
// Recipes that produce 2 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Trapdoor', 'Fence', 'Blaze Powder']
SET relProduces.qty = 2;
// Recipes that produce 3 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Fire Charge', 'Ladder', 'Paper', 'Sign', 'Bottle']
SET relProduces.qty = 3;
// Recipes that produce 4 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Wooden Plank', 'Stick', 'Torch', 'Arrow', 'Stone Brick', 'Smooth Sandstone', 'Wood Stair', 'Cobblestone Stair', 'Bowl', 'Pumpkin Seed']
SET relProduces.qty = 4;
// Recipes that produce 6 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Wooden Slab', 'Stone Slab', 'Stone Brick Slab', 'Brick Slab', 'Cobblestone Slab', 'Nether Brick Fence', 'Cobblestone Wall', 'Mossy Cobblestone Wall']
SET relProduces.qty = 6;
// Recipes that produce 8 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Cookie']
SET relProduces.qty = 8;
// Recipes that produce 16 of something
MATCH (:Recipe)-[relProduces:PRODUCES]->(r: Resource)
WHERE r.name in ['Iron Bar', 'Glass Pane']
SET relProduces.qty = 16;
Asking a basic question – how much Wood do two Wooden Planks take to produce?
We know a Wooden Plank recipe requires 1x Wood to kick off, and produces 4x Wooden Plank. In this case then, making 2x Wooden Plank requires we run the Wooden Plank recipe once (as 4 > 2), which requires 1x Wood. We can’t run the Recipe a fractional number of times, we just over-produce.