Kotan Code 枯淡コード

In search of simple, elegant code


Avoiding Graph Modeling Antipatterns

In my previous blog post, which I just posted this morning, it turns out that I made a pretty large modeling no-no. In terms of consequences to the application and the survival of the zombie apocalypse, I think the issue was pretty minor. However, it does illustrate that no matter how awesome a graph database product is, you still need to put some thought into how you model the database.

In the previous model, I could make assertions like user-[:SPOTTED]-zombie and there were some attributes on the spotted relationship such as the date the sighting took place. As it turns out, this is an antipattern. There are a couple of things I did wrong here.

Firstly, the attributes on the relationship have nothing to do with the weighting or nature of the relationship. The attributes I put on that relationship were informational and arguably belonged on a separate entity. In other words, the sighting was a noun that I had elided from my relationship sentence “a user spotted a zombie.”

The forthcoming Graph Databases book from O’Reilly describes a best practice in graph modeling: when the interaction of two nodes produces something of significance, that interaction should be modeled as its own entity rather than as properties on the relationship between those nodes. In my sample zombie domain, this means that a user produces a sighting when they spot a zombie. The sighting carries metadata about the event itself, which can include the latitude and longitude of where it took place (I know that Neo4j has its own built-in geospatial support, which I will ignore for this example). The user’s confidence level that they saw what they think they saw (aka the “bigfoot factor”) is also part of the picture.
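To make that refactoring concrete, here is a rough Python sketch (plain in-memory data, not Neo4j code) of promoting each property-laden relationship to its own sighting entity. The tuple layout, the sample values, and the `reify_sightings` helper are all my own illustrative assumptions.

```python
# Old shape: (user, zombie, properties that used to sit on the SPOTTED relationship).
# These sample values are made up for illustration.
old_edges = [
    ("Kevin", "Alvin", {"date": "01/01/2013"}),
    ("Bob", "Alvin", {"date": "01/01/2013"}),
]


def reify_sightings(edges):
    """Turn each SPOTTED edge with properties into a sighting node plus two plain edges."""
    sightings = {}  # sighting id -> informational properties
    spotted = []    # (user, sighting id) edges
    target = []     # (sighting id, zombie) edges
    for i, (user, zombie, props) in enumerate(edges, start=1):
        sid = f"sighting{i}"
        sightings[sid] = dict(props)  # the metadata now lives on the entity
        spotted.append((user, sid))
        target.append((sid, zombie))
    return sightings, spotted, target


sightings, spotted, target = reify_sightings(old_edges)
print(sightings)
print(spotted)
print(target)
```

Each old edge becomes one sighting with a SPOTTED edge coming in and a TARGET edge going out, which is the extra hop the new model introduces.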

Now I can go back and remodel my previous database using the following graph model created via Cypher:

	CREATE user1={_label: 'user', firstname: 'Kevin', lastname: 'Hoffman'},
	       user2={_label: 'user', firstname: 'Bob', lastname: 'Bobberson'},
	       alvin={_label: 'zombie', name: 'Alvin', power:20},
	       simon={_label: 'zombie', name: 'Simon', power:30},
	       theo={_label: 'zombie', name: 'Theodore', power:40},
	       sighting1={_label: 'sighting', date:'01/01/2013', lat: 42.12, long: 21.03},
	       sighting2={_label: 'sighting', date:'01/02/2013', lat: 42.00, long: 25.00},
	       sighting3={_label: 'sighting', date:'01/03/2013', lat: 47.32, long: 30.21},
	       sighting4={_label: 'sighting', date:'01/01/2013', lat: 40.22, long: 27.03},
	       sighting5={_label: 'sighting', date:'01/02/2013', lat: 43.00, long: 21.12},
	       user1-[:SPOTTED {confidence: 10}]->sighting1-[:TARGET]->alvin,
	       user1-[:SPOTTED {confidence: 12}]->sighting2-[:TARGET]->simon,
	       user1-[:SPOTTED {confidence: 27}]->sighting3-[:TARGET]->theo,
	       user2-[:SPOTTED {confidence: 50}]->sighting4-[:TARGET]->alvin,
	       user2-[:SPOTTED {confidence: 80}]->sighting5-[:TARGET]->simon

Now I’ve added a confidence factor to the relationship itself, which allows me to do a weighted traversal over the SPOTTED relationships. This is a far better use for attributes belonging to a relationship. Also note that the non-traversal information is now more appropriately located as attributes on an entity. This should all seem pretty familiar to you if you’ve done any work with Domain-Driven Design.
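As a rough illustration of what a confidence-weighted traversal could look like, here is a small Python sketch over the same sample data (in-memory only, not Neo4j code). The `confident_zombies` helper, the threshold of 25, and the choice to rank by best confidence per zombie are my own assumptions for the example.

```python
# (user, sighting, confidence) mirrors the SPOTTED relationships in the CREATE statement.
spotted = [
    ("user1", "sighting1", 10),
    ("user1", "sighting2", 12),
    ("user1", "sighting3", 27),
    ("user2", "sighting4", 50),
    ("user2", "sighting5", 80),
]
# sighting -> zombie mirrors the TARGET relationships.
target = {
    "sighting1": "Alvin",
    "sighting2": "Simon",
    "sighting3": "Theodore",
    "sighting4": "Alvin",
    "sighting5": "Simon",
}


def confident_zombies(spotted, target, min_confidence):
    """Follow only SPOTTED edges at or above min_confidence; keep the best score per zombie."""
    best = {}
    for _user, sighting, confidence in spotted:
        if confidence < min_confidence:
            continue  # low-confidence edges are skipped during the traversal
        zombie = target[sighting]
        best[zombie] = max(best.get(zombie, 0), confidence)
    # Highest-confidence zombies first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


ranked = confident_zombies(spotted, target, min_confidence=25)
print(ranked)
```

The weight sits on the edge and decides whether the edge is followed at all, while the date and coordinates stay on the sighting and never take part in the traversal.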

With this new model in place, I can now visualize my graph like this:

Modified Zombie Sighting Graph


Rest assured, I can still run effective queries against the graph; they just involve one more hop to reach the information I used to get directly. To list the zombies that have two “spotted” paths pointing to them (two different users spotted the same zombie), I just need to extend the shape of the subgraph I’m looking for so that it traverses the “sighting” node. In the query below, I don’t care about the information on the sighting nodes, so I can use empty parentheses for anonymous nodes:

START u=node:node_auto_index(_label='user')
MATCH u-[:SPOTTED]->()-[:TARGET]->(zombie)<-[:TARGET]-()<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie.name

This matches all zombies that have a spotted->(sighting)->target path from a user and also have the reverse <-target-(sighting)<-spotted path coming from another user. As expected, the above query returns "Alvin" and "Simon" but not "Theodore". I dare you to try to jump two join tables going one way and then jump the same two join tables going back out without ripping a hole in your RDBMS and/or brain. Moral of the story: even amazing tools like Neo4j require thoughtful care of the data model, though Neo4j can certainly handle refactoring and upgrading much more easily than a regular SQL database.
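If you want to sanity-check that result without a running database, here is a small Python sketch that walks the same shape over the sample data in memory. It is not how Neo4j evaluates the pattern. The `spotted_by_multiple_users` helper and the data layout are my own assumptions, and the sketch requires two distinct users, as the description above does.

```python
from collections import defaultdict

# (user, sighting) mirrors the SPOTTED relationships in the CREATE statement.
spotted = [
    ("user1", "sighting1"),
    ("user1", "sighting2"),
    ("user1", "sighting3"),
    ("user2", "sighting4"),
    ("user2", "sighting5"),
]
# sighting -> zombie mirrors the TARGET relationships.
target = {
    "sighting1": "Alvin",
    "sighting2": "Simon",
    "sighting3": "Theodore",
    "sighting4": "Alvin",
    "sighting5": "Simon",
}


def spotted_by_multiple_users(spotted, target):
    """Return the zombies reached from at least two distinct users via a sighting."""
    spotters = defaultdict(set)
    for user, sighting in spotted:
        # Two hops: user -[SPOTTED]-> sighting -[TARGET]-> zombie
        spotters[target[sighting]].add(user)
    return sorted(zombie for zombie, users in spotters.items() if len(users) >= 2)


result = spotted_by_multiple_users(spotted, target)
print(result)
```

The sighting is only a stepping stone here, just like the anonymous nodes in the Cypher query: the code looks it up to reach the zombie and then ignores its properties.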