How to (not) do depth-first search in Neo4j

I found a Stack Overflow question with no answers that seemed like it should be straightforward – how can you traverse a tree-like structure in depth-first order. The problem had a couple of features:

  • Each node had an order property that described the order in which sibling nodes should be traversed
  • Each node was connected to its parent via a PART_OF relationship

A depth-first traversal of a tree is pretty easy to understand.

Whenever we find a node with children, we choose the first and explore as deep into the tree as we can until we can’t go any further. Next we step up one level and choose the next node we haven’t explored yet and go as deep as we can on that one until we’ve traversed the graph.

Neo4j supports a depth-first traversal of a graph by way of the algo.dfs.stream function.

Given some tree-like graph where nodes of label ‘Node’ are linked by relationships of type :PART_OF:

// First, some test data to represent a tree with nodes connected by a
// 'PART_OF' relationship:
// N1 { order: 1 }
//  N2 { order: 1 }
//    N4 { order: 1 }
//      N5 { order: 1 }
//      N6 { order: 2 }
//  N3 { order: 2 }
//    N7 { order: 1 }
MERGE (n1: Node { order: 1, name: 'N1' })
MERGE (n2: Node { order: 1, name: 'N2' })
MERGE (n3: Node { order: 2, name: 'N3' })
MERGE (n4: Node { order: 1, name: 'N4' })
MERGE (n5: Node { order: 1, name: 'N5' })
MERGE (n6: Node { order: 2, name: 'N6' })
MERGE (n7: Node { order: 1, name: 'N7' })
MERGE (n2)-[:PART_OF]->(n1)
MERGE (n4)-[:PART_OF]->(n2)
MERGE (n5)-[:PART_OF]->(n4)
MERGE (n6)-[:PART_OF]->(n4)
MERGE (n3)-[:PART_OF]->(n1)
MERGE (n7)-[:PART_OF]->(n3)

We can see which nodes are visited by Neo4j’s DFS algorithm:

MATCH (startNode: Node { name: 'N1' } )
CALL algo.dfs.stream('Node', 'PART_OF', 'BOTH', id(startNode))
YIELD nodeIds 
UNWIND nodeIds as nodeId
WITH algo.asNode(nodeId) as n
RETURN n

The outupt here will vary – possibly even between runs. While we’ll always see a valid depth-first traversal of the nodes in the tree, there’s no guarantee that we’ll always see nodes visited in the same order. That’s because we’ve not told Neo4j in what order to traverse sibling nodes.

If you need control over the order siblings are expanded, you should use application code to write the DFS yourself.

But: given some constraints and accepting some caveats…

  • That’s there’s only one relationship that links nodes in the tree
  • That sibling nodes are sortable by some numeric property – here ‘order`, which is mandatory
  • There are not more than 1,000,000 sibling nodes for any given parent
  • Sibling nodes all have a distinct order property value
  • That this will perform like a dog on large graphs – potentially not completing, given it has some N^2 characteristics…

…you can do this in pure Cypher. Here’s one approach, which we’ll then break down
to see how it works:

MATCH (root: Node { name: 'N1' }), pathFromRoot=shortestPath((root)<-[:PART_OF*]-(leaf: Node)) WHERE NOT ()-[:PART_OF]->(leaf)
WITH nodes(pathFromRoot) AS pathFromRootNodes
WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString
WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder
RETURN [x in traversalOrder | x.name]

Finding the deepest traversals

Given some root node, we can find a list of traversals to each leaf node using shortestPath. A leaf node is a node with no children of its own, and shortestPath (so long as we’re looking at a tree) will tell us the series of hops that get us from that leaf back to the root.

Sorting the paths

We’re trying to figure out the order in which these paths would be traversed, then extract the nodes from those paths to find the order in which nodes would be visited.

The magic is happening in this line:

WITH pathFromRootNodes, reduce(pathString = "", pathElement IN pathFromRootNodes | pathString + '/' + right("00000" + toString(pathElement.order), 6)) AS orderPathString ORDER BY orderPathString

The reduce is, given a node from root to leaf, building up a string that combines the order property of each node in the path with forward-slashes to separate them. This is much like folder paths in a file system. To make this work, we need each segment of the path to be the same length – therefore we pad out the order property with zeroes to six digits, to get paths like:

/000001/000001/000001
/000001/000001/000002
/000001/000002

These strings now naturally sort in a way that gives us a depth-first traversal of a graph using our order property. If we order by this path string we’ll get the order in which leaf nodes are visited, and the path that took us to them.

Deduplicating nodes

The new problem is extracting the traversal from these paths. Since each path is a complete route from the root node to the leaf node, the same intermediate nodes will appear multiple times across all those paths.

We need a way to look at each of those ordered paths and collect only new nodes – nodes we haven’t seen before – and return them. As we do this we’ll be building up the node traversal order that matches a depth-first search.

WITH reduce(concatPaths = [], p IN collect(pathFromRootNodes) | concatPaths + p) AS allPaths
WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder

First we collect all the paths (which are now sorted by our traversal ordering) into one big list. The same nodes are going to appear more than once for the reasons above, so we need to remove them.

We can’t just DISTINCT the nodes, because there’s no guarantee that the ordering that we’ve worked hard to create will be maintained.

Instead, we use another reduce and iterate over the list of nodes, only adding a node to our list if we haven’t seen it before. Since the list is ordered, we take only the first of each duplicate and ignore the rest. Our CASE statement is doing the heavy lifting here:

WITH reduce(distinctNodes = [], n IN allPaths | CASE WHEN n IN distinctNodes THEN distinctNodes ELSE distinctNodes + n end) AS traversalOrder

Equivalently:

  • Create a variable called distinctNodes and set it to be an empty list
  • For each node n in our flattened list of nodes in each path from root to each leaf:
  • If we’ve seen n before (if it’s in our ‘distinctNodes’ list) then set distinctNodes = distinctNodes – effectively a no-op
  • If we haven’t seen n before, set distinctNodes = distinctNodes + n – adding it to the list

This is a horrendously inefficient operation – for a very broad, shallow tree (one where each node has many branches) we’ll be doing on the order of n^2 operations. Still, it’s only for fun.

We’re done! From our original graph, we’re expecting a traversal order of:

N1, N2, N4, N5, N6, N3, N7

And our query?

["N1","N2","N4","N5","N6","N3","N7"]

Another for the annals of ‘Just because you can, doesn’t mean you should’.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.