This is the last in a series of posts where we created a Neo4j model of the Minecraft crafting tree and played around with querying it to build a shopping list for any given item.
Over the course of the series we’ve:
- Scraped data from a webpage and done some cleansing
- Built a Neo4j model by importing that data, and explored its limitations
- Found different representations of the graph that have different characteristics
- Tried using pure Cypher to walk the graph and found we hit a wall
- Built a Node version of a tree walking algorithm to correctly calculate a bill of materials
What have we learned?
Visualising your data to cleanse it works well
We found data quality issues and corrected them, and finding them was easier because Neo4j Browser makes visualising the graph so simple. You can often see data issues more quickly than you can query for them, because a human will spot anomalies pretty quickly. For example:
- Isolated nodes where you expected a fully-connected graph
- Cycles in the graph where you expect it to be acyclic
- Multiple relationships between nodes where you expect only a single relationship
If you’re working with a large dataset (remember: the Minecraft dataset was just a toy) you could consider sampling your source data and testing your ingest on that sample. You may not find all data issues in the sample, but quickly visualising and proving out your graph model is much easier if you can fit the whole model on a few screens.
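For the times when the graph is too big to eyeball, the three anomalies listed above can also be checked in code. What follows is a minimal sketch in plain JavaScript, assuming the sample has already been reduced to a list of node names and a list of [from, to] pairs. The node names, the edge data and the function names are all made up for illustration and are not taken from the scraped dataset.

```javascript
// Hypothetical sample: each edge reads "from REQUIRES to".
const nodes = ["Stick", "Plank", "Log", "Torch", "Coal", "Saddle"];
const edges = [
  ["Stick", "Plank"],
  ["Plank", "Log"],
  ["Torch", "Stick"],
  ["Torch", "Coal"],
  ["Torch", "Coal"], // deliberate duplicate relationship
  ["Log", "Stick"],  // deliberate cycle: Stick, Plank, Log, Stick
];

// Nodes that never appear at either end of an edge.
function isolatedNodes(nodes, edges) {
  const connected = new Set(edges.flat());
  return nodes.filter((n) => !connected.has(n));
}

// Pairs of nodes joined by more than one relationship in the same direction.
function duplicateEdges(edges) {
  const counts = new Map();
  for (const [from, to] of edges) {
    const key = `${from}->${to}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts].filter(([, count]) => count > 1).map(([key]) => key);
}

// Depth-first search; meeting a node that is still "in progress" means a cycle.
function hasCycle(nodes, edges) {
  const adjacency = new Map(nodes.map((n) => [n, []]));
  for (const [from, to] of edges) adjacency.get(from).push(to);

  const IN_PROGRESS = 1;
  const DONE = 2;
  const state = new Map();

  const visit = (n) => {
    if (state.get(n) === IN_PROGRESS) return true;
    if (state.get(n) === DONE) return false;
    state.set(n, IN_PROGRESS);
    const found = adjacency.get(n).some((next) => visit(next));
    state.set(n, DONE);
    return found;
  };

  return nodes.some((n) => visit(n));
}

console.log(isolatedNodes(nodes, edges));
console.log(duplicateEdges(edges));
console.log(hasCycle(nodes, edges));
```

Checks like these complement the visual pass rather than replace it: they only find the anomalies you already thought to look for.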
Neo4j is radical overkill for this specific problem
We knew this going into it – we don’t need a proper database for this. We could just as well have represented the whole graph in memory in JavaScript and gotten the same answers out.
That said – while we’re not using any large fraction of the power of Neo4j for our example, we did touch upon a good few core concepts:
- Data loading via LOAD CSV and MERGE
- Constraints and indexes
- Visualisation
- Querying in Cypher
- Aggregation in Cypher
- APOC
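To make the in-memory alternative concrete, here is a rough sketch of a bill-of-materials walk over a plain JavaScript object. It is not the algorithm from the earlier post. The `recipes` shape, the yields and the `billOfMaterials` function are assumptions made for illustration, and the walk assumes the recipe graph is acyclic.

```javascript
// Hypothetical recipes: item -> units produced per craft, plus inputs per craft.
// Anything without an entry is treated as a raw material.
const recipes = {
  Torch: { makes: 4, inputs: { Stick: 1, Coal: 1 } },
  Stick: { makes: 4, inputs: { Plank: 2 } },
  Plank: { makes: 4, inputs: { Log: 1 } },
};

// Walk the tree depth-first, accumulating raw materials in `raw`.
// `spare` tracks leftovers from earlier crafts so they are reused
// before anything new is crafted.
function billOfMaterials(item, quantity, recipes, spare = new Map(), raw = new Map()) {
  const recipe = recipes[item];
  if (!recipe) {
    raw.set(item, (raw.get(item) ?? 0) + quantity);
    return raw;
  }

  // Use up leftovers first.
  const available = spare.get(item) ?? 0;
  const used = Math.min(available, quantity);
  spare.set(item, available - used);
  const needed = quantity - used;
  if (needed === 0) return raw;

  // Crafting happens in whole batches; keep whatever is left over.
  const crafts = Math.ceil(needed / recipe.makes);
  spare.set(item, (spare.get(item) ?? 0) + crafts * recipe.makes - needed);

  for (const [ingredient, perCraft] of Object.entries(recipe.inputs)) {
    billOfMaterials(ingredient, perCraft * crafts, recipes, spare, raw);
  }
  return raw;
}

console.log(billOfMaterials("Torch", 8, recipes));
```

A few dozen lines and a plain object cover this toy case, which is why the database is overkill here. The trade-off is that you lose the querying, visualisation and constraint support listed above.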
APOC is huge and very capable
We used a single function from the APOC library, which boasts over 450 functions exposed to your Neo4j instance. It addresses a lot of functionality shortfalls in Cypher, and if you start digging into it further, features like graph refactoring support, virtual nodes and node grouping all become very attractive tools for your mental toolkit.
I’m surprised it’s not just bundled by default.
It’s tempting to write APOC calls or application code over pure Cypher
I still haven’t found the dividing line where I’m comfortable saying ‘just express that in Cypher’ or ‘just use APOC’. We found two queries that were broadly equivalent: one in some fairly long-winded Cypher, the other amounting to a one-liner in APOC. Was one better than the other?
If you’re a developer and you take the dependency on APOC to express a query then you’re buying your future self (and future colleagues) into that ecosystem. You can probably express almost any Cypher concept in a series of APOC calls, but should you? Your future self now has to remember exactly how, for example, apoc.path.subgraphAll works to figure out what a query’s up to, while the plain Cypher was easier to figure out from first principles.
There isn’t just one way to model your domain
Neo4j claims that, and I’m paraphrasing, ‘your whiteboard model is your data model’ and that’s true – we literally did that, replete with a photo of our whiteboard.
However, being able to represent something multiple ways on your whiteboard doesn’t save you from the underlying domain complexity. There isn’t just one way to model your domain, and the model you choose has implications for:
- How easy it is to query the graph
- Whether you need application code in places where you could have been using Cypher
- How easy it is to ingest new data and maintain consistency within your model
- Performance
- …
The list goes on. We only looked at two representations of our data – there were more we could have explored. The accepted guidance seems to be to optimise your data model around the most common query or queries you’ll be performing – i.e. cover the 80% case and worry about the long tail of other queries later.
Graph refactoring and experimentation is easy
Luckily Neo4j makes it super easy to refactor your graph, and either APOC or the Neo4j cypher-shell lets you export and re-import data or subsets of your data to try experiments quickly.
In addition, Neo4j Desktop is very capable and a whole lot easier to use than in the old days, when you had to stand up new instances of the database or play with configuration files to swap graphs in and out. Installing APOC takes two clicks, and staying on the latest database version is as easy as choosing from a drop-down box.
Conclusions
The graph model we chose made certain operations more difficult. Still – we overcame the issues we had either with Cypher or application code, and built something that answers interesting questions from the graph.
Our Minecraft example wasn’t a good use-case for Neo4j as it stands, but modelling BOMs in Neo4j is a good fit in general (which is what we started to show in our simplified example back in Part 4).
Being able to answer questions like ‘what’s the most contrived item to build’ may sound frivolous, but in manufacturing it translates to:
- Lead-time estimation – from materials ingest, how long before step X of the process can expect work, and how long is our production pipeline in minutes/hours?
- Quality and yield modelling – what’s the cost of rework for a defect found at a particular part of the process?
- Design-for-manufacture decision making support – how many different fastenings are used across the product, can we reduce that?
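One way to read ‘most contrived’ is ‘longest chain of crafting steps back to raw materials’, which is also a rough stand-in for pipeline length. The sketch below uses that reading. The `ingredients` data and both function names are hypothetical, and it assumes the graph is acyclic.

```javascript
// Hypothetical data: item -> list of direct ingredients.
// Anything without an entry is a raw material with depth 0.
const ingredients = {
  Torch: ["Stick", "Coal"],
  Stick: ["Plank"],
  Plank: ["Log"],
  Chest: ["Plank"],
};

// Number of crafting steps on the longest path from `item` to a raw material.
// Results are memoised so shared sub-trees are only walked once.
function chainDepth(item, ingredients, memo = new Map()) {
  if (memo.has(item)) return memo.get(item);
  const inputs = ingredients[item] ?? [];
  const depth =
    inputs.length === 0
      ? 0
      : 1 + Math.max(...inputs.map((input) => chainDepth(input, ingredients, memo)));
  memo.set(item, depth);
  return depth;
}

// The craftable item with the deepest chain; the first one found wins ties.
function mostContrived(ingredients) {
  let best = { item: null, depth: -1 };
  const memo = new Map();
  for (const item of Object.keys(ingredients)) {
    const depth = chainDepth(item, ingredients, memo);
    if (depth > best.depth) best = { item, depth };
  }
  return best;
}

console.log(mostContrived(ingredients));
```

Swapping the constant 1 per step for a per-step duration would turn the same walk into a crude lead-time estimate.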