Graph data modelling – inferred vs explicit categories and labels

When building graph data models we frequently have to deal with a degree of polymorphism for our entities just like the real world.

For instance – I’m a person, but I’m also a parent, a spouse, a sibling, a child, a…

Implicit categorisation

Sometimes the entity categories are entirely defined by relationships to other entities. In most of the above examples, we can categorise me because of how I relate to other people in my family:

  • I’m a husband because I have a ‘married to’ relationship to my wife
  • I’m a sibling because I have a ‘sibling of’ relationship to my brother
  • I’m a child because I have a ‘child of’ relationship to both my mother and father

These categories are all fairly simple one-hop affairs – we can categorise me in different ways by looking at how I’m directly connected to other entities in the graph.

A more involved category is ‘parent’ – in a family tree we could be explicit when dealing with parent/child relationships and add both of them to the graph or just add one (say, :PARENT_OF) and flipping our query around a bit to figure out what should have a ‘child’ category.

These two representations aren’t quite equivalent, but we can answer the same questions with the first as the second by caging everything in terms of the :PARENT_OF relationship

There are disadvantages to the second approach when the relationship is symmetrical (you now have to maintain two relationship types that are logically mutually dependent, additional storage requirements and there’s no longer a canonical way to query the children of a given Person) but it’s still a viable model.

When I say ‘a canonical way to query’ we can consider the question of ‘Who is Tony Stark’s father’. The following two queries work on the right hand graph schema, express the same intent and return the same result:

MATCH (father: Person)-[:PARENT_OF]->(tony: Person { name: 'Tony Stark' })
RETURN father
MATCH (tony: Person { name: 'Tony Stark'})-[:CHILD_OF]->(father: Person)
RETURN father

In the first graph there’s only one way to figure out Tony’s parent, which may be considered an advantage.

Explicit categorisation

There are a couple of situations where we may want to explicitly categorise nodes (via node labels or node properties):

  • The category is not entirely defined by the entity’s labels and how that entity relates to the wider graph
  • On a busy graph we’re frequently interested in just the categories of nodes, without necessarily being interested in how those categories were arrived at (we’ll see an example of this in a bit)

The first is fairly obvious, and we can use an HR example to demonstrate it.

In Big Co, every employee has exactly one manager. Managers can have any number of employees reporting to them, including none. A manager isn’t a specific job title or position – you can manage people as part of your day job, but there are also dedicated staff managers whose only role is line management.

Here it’s easy to think that the ‘manager’ category is dependent on their being ‘REPORTS_TO’ relationships into the node, as follows:

CREATE (gill: Employee { name: 'Gill' })
CREATE (peter: Employee { name: 'Peter' })
CREATE (geoff: Employee { name: 'Geoff' })
MERGE path=(geoff)-[:REPORTS_TO]->(peter)-[:REPORTS_TO]->(gill)
RETURN path

And we can pull a list of managers by just looking for anyone with a :REPORTS_TO into them:

MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN DISTINCT manager.name

manager.name
"Gill"
"Peter"

But we haven’t covered the case where a dedicated staff manager doesn’t have any reports yet. Suppose a new manager called Sandra joins the company reporting to Peter – on their first few days they won’t have any reports as they’re still getting trained up, but they’re still a manager according to our definition.

We now need to explicitly categorise the new node somehow. Either via a label:

CREATE (sandra:Employee:Manager { name: 'Sandra' })
WITH sandra
MATCH (peter:Employee {name: 'Peter' })
MERGE path=(sandra)-[:REPORTS_TO]->(peter)
RETURN path

Or via some property on the Sandra node:

CREATE (sandra:Employee { name: 'Sandra', isManager: true })
WITH sandra
MATCH (peter:Employee {name: 'Peter' })
MERGE path=(sandra)-[:REPORTS_TO]->(peter)
RETURN path

To make sure we get Sandra back in our earlier ‘get all managers’ query we now have a few options. Here’s a couple:

-- Assumes we went with maintaining a 'Manager' label
MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN manager.name
-- Union will distinct for us, so we can remove it from the two RETURNs
UNION MATCH (manager:Manager)
RETURN manager.name

-- Alternate phrasing of the above without the UNION
MATCH (manager:Employee)
WHERE (:Employee)-[:REPORTS_TO]->(manager: Employee)
   OR (manager:Manager)
RETURN DISTINCT manager.name
-- Assumes we went with a node property 'isManager' and that we've indexed it
MATCH (:Employee)-[:REPORTS_TO]->(manager: Employee)
RETURN manager.name
UNION MATCH (manager:Employee { isManager: true })
RETURN manager.name

Dirty third option

There is a fairly dirty third solution here, which is to have a dummy Employee node that represents a placeholder employee – a dummy entry, from which we can create a :REPORTS_TO relationship to Sandra. Now our original inference is correct again (if you have inbound :REPORTS_TO relationships then you’re a manager), but our data model no longer matches the business definition because we may have multiple managers listed for that dummy node (breaking the ‘exactly one manager’ rule). Again, workable around by creating a dummy employee node for each manager who lacks reports.

This option is also problematic because we would need to detach and reattach the dummy node when :REPORTS_TO relationships are created or destroyed, and still have to store the explicit ‘isManager’ flag for that process to reliably work.

It also has shifted the problem somewhere else – we’ve made it easy to get a list of managers, but how do we now get a list of employees while excluding the dummy ones?

Impacts of explicit categorisation

There are several big impacts to explicitly categorising nodes, though ultimately if your business requirement doesn’t allow you to reliably infer categories from relationships you’ve not many choices.

Query complexity and performance

The above, really straightforward query is hard to express succinctly because:

  • Every OR in a query expands our search space and slows us down – we’re no longer just hitting indexes to look things up, and we need to combine result sets to get us to the right answer
  • Neo4j doesn’t have an efficient mechanism to query with a disjunction of node labels – we can’t for example say ‘MATCH (:Employee | :Manager)’. Our ‘alternate’ query above basically does a label disjunction, but relies on every Manager also being an Employee (which we can specify for this case but not in general). For other cases, label disjunctions essentially scan every node in the graph

However, you may find that some queries become quicker because you’re doing less work for the ‘flag’ cases where you just want to know if someone’s a manager, and not who they manage. If you could maintain a :Manager label semi-automatically, those queries become trivial, but that maintenance itself isn’t free and is extra logic for your application to contain.

Cognitive complexity

Our logical definition of ‘Manager’ is now:

  • Has any inbound ‘reports to’ relationships
  • OR: is explicitly flagged as a Manager (via label or property)

This means in any piece of code where we need a list of managers we have to embed that logic into the search. If our definition changes we’d need to update a lot of places at once, and because the logic is probably hard to make performant in general it might be expressed in very different ways in different queries depending on the situation.

We can’t get around the cognitive complexity of what the business defines as a manager, but if we’re automating the maintenance of a :Manager label and then in our schema only ever use :Manager to determine the list of managers, while still using the :REPORTS_TO relationship to find out who reports to each manager then we have a clearer delineation and an easier rule to lint for.

Automatically maintaining the category has plenty of failure modes

Settings labels based on relationships is pretty perilous and really only viable if you can infer the intended label from one or two hops (plus whatever manual flag is being set). We now also need to assess the impact if we screw up or the automated label maintenance fails/is delayed:

  • What happens if we forget to set or remove a label?
  • How do we document that we need to fix up the :Manager category each time we amend :REPORTS_TO relationships?

Automating the category tightly couples parts of your application that needn’t be

Back to our family tree example: let’s say that we want to find all the Uncles in the family tree. The maths for being an uncle is pretty easy – X is an uncle of Z if X is the :SIBLING_OF Y and Y is the :PARENT_OF Z.

That transitive relationship causes us a headache though – now when my brother has a child, the corresponding ‘Add Child’ code has to:

  • Create the child node (we had to do this before too)
  • Create a :PARENT_OF relationship between my brother and the new child (same as before)
  • Find all siblings of my brother and label them as :Uncle

The code that adds children to parents shouldn’t care at all about uncles, aunts, nieces and nephews but now it has to (or has to at least know that there might be downstream impacts) to fix up the graph so the categories match the data.

Upshot

I think from having played around with this both in relational and graph databases I’ve roughly come down on:

  • If the category represents a fundamental classification of an entity that broadly doesn’t change and doesn’t depend on the entity’s place in the wider graph, use a node label/explicit category
  • If the category is defined mostly or entirely by its place in the graph, keep its definition to be relative to the graph – i.e. follow the relationships to answer questions, perhaps with disjunctions for explicit classification flags on the node (the ‘isManager’ example above)
  • If performance becomes an issue then consider automating the maintenance of indexable fields or labels based on whatever logic dictates the classification

Graphs aren’t magic

Ultimately, if you have complicated logic in your business domain to classify entities then that complexity has to go somewhere and graph databases aren’t exempt from that. You can cover it off in your application, or you can make your data model more complex/less representative of the real world but chances are you’ll have the same sorts of hurdles to overcome whether using a graph or SQL.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.