Kotan Code 枯淡コード

In search of simple, elegant code


Hello (post-Apocalyptic) World with the Neo4j Graph Database

Last week I was talking to a colleague about when a graph database might be overkill versus when it might be the right solution. That got me thinking, so I spent several days looking at various data models that I use on a regular basis and re-imagining them as graph models. The results were pretty interesting: it was the rare case where I felt an application could not be improved by the use of a graph model.

So this got me interested in messing around with a graph database for my favorite sample domain: zombies. In many previous posts and sample applications, I have used a sample domain of an application server that receives zombie sighting messages and in turn sends out messages to aid and inform those attempting to survive the zombie apocalypse.

In my domain, users (human survivors of the zombie apocalypse) can spot zombies. A zombie sighting is a record of vital information about a particular zombie so that the zombie can be tracked. Ideally, some higher-level analysis can be done on clusters of zombies (geospatial analysis, which is also well suited to graph databases). We might also want to do other analysis on zombies, like finding correlations between sightings or, better yet, noticing that two people keep reporting the same zombie in similar areas, in which case perhaps those people should team up.

The first thing we need to do is create some sample data. In a graph database, sample sets can be ridiculously big, so I’m going to keep it simple and create a “diamond shape” (you’ll see that in a bit) as well as another relationship dangling off the side. This sample set has two users who are reporting sightings of three different zombies: Alvin, Simon, and Theodore. To get this data into the database I’ll use a Cypher (Neo4j’s query language) statement. It’s worth noting that the description of a relationship looks very much like an ASCII art depiction of what you would draw on a whiteboard. This is a huge benefit and one of the things that drew me to Neo4j to start with:

CREATE user1 = { label: 'user', firstname: 'Kevin', lastname: 'Hoffman' },
       user2 = { label: 'user', firstname: 'Bob', lastname: 'Bobberson' },
       alvin = { label: 'zombie', name: 'Alvin', power: 20 },
       simon = { label: 'zombie', name: 'Simon', power: 50 },
       theodore = { label: 'zombie', name: 'Theodore', power: 80 },
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user1)-[:SPOTTED {date: '01/03/2013'}]->(simon),
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(theodore),
       (user2)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user2)-[:SPOTTED {date: '01/03/2013'}]->(simon)

So here we’re creating two users and three zombies, and we can see that the Kevin user spotted all three of the zombies while the Bob user spotted only alvin and simon. We can use Neo4j’s built-in data visualizer to see what this graph looks like:

Zombie Neo4j Graph


There are a couple of really important points in this graph. The first is that the connections between the nodes are directional. The other is that there is data associated with the relationships as well as with the nodes themselves, which is one area where the use of a graph database truly shines over a traditional RDBMS with “join tables”.
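To make those two points concrete, here is a minimal sketch of the same sample data as plain Python structures. This is not Neo4j's API; the `nodes` and `relationships` containers and the `sightings_by` helper are hypothetical names made up for illustration. Each relationship is stored with an explicit direction (start, then end) and carries its own property map, just like the nodes do.

```python
# Hypothetical in-memory model of the sample graph; not Neo4j's API.
nodes = {
    "user1": {"label": "user", "firstname": "Kevin", "lastname": "Hoffman"},
    "user2": {"label": "user", "firstname": "Bob", "lastname": "Bobberson"},
    "alvin": {"label": "zombie", "name": "Alvin", "power": 20},
    "simon": {"label": "zombie", "name": "Simon", "power": 50},
    "theodore": {"label": "zombie", "name": "Theodore", "power": 80},
}

# Each relationship is directed (start -> end), typed, and has its own properties.
relationships = [
    ("user1", "SPOTTED", "alvin", {"date": "01/02/2013"}),
    ("user1", "SPOTTED", "simon", {"date": "01/03/2013"}),
    ("user1", "SPOTTED", "theodore", {"date": "01/02/2013"}),
    ("user2", "SPOTTED", "alvin", {"date": "01/02/2013"}),
    ("user2", "SPOTTED", "simon", {"date": "01/03/2013"}),
]


def sightings_by(user_id):
    """Return (zombie name, sighting date) for each SPOTTED relationship leaving user_id."""
    return [
        (nodes[end]["name"], props["date"])
        for start, rel_type, end, props in relationships
        if start == user_id and rel_type == "SPOTTED"
    ]


print(sightings_by("user2"))
```

Because direction is part of each relationship, asking for the sightings made by a zombie node returns nothing: zombies only appear at the receiving end of SPOTTED.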

The next thing to notice is that the left side of this graph forms a very familiar “diamond” pattern. The diamond pattern, in social networking circles, can be used to determine mutual friends. In our case, we don’t really have mutual friends but we do have mutual zombie sightings. To get a list of the zombies which have been spotted by multiple users, we can issue the following query:

START u=node:node_auto_index(label='user')
MATCH u-[:SPOTTED]->(zombie)<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie

Here we match every user node to the zombies that user spotted and that were also spotted by somebody else. This can return the same zombie more than once (Bob and Kevin both spotted Alvin), so we remove the duplicate nodes with the DISTINCT clause.
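The duplication is easier to see outside the database. The following is a rough Python sketch, not a description of how Neo4j evaluates the pattern; the `spotted` list and both function names are assumptions made for illustration. It produces one row per (user, zombie, somebody else) combination and then collapses the repeats, which is the role DISTINCT plays.

```python
# One (user, zombie) pair per SPOTTED relationship in the sample data.
spotted = [
    ("Kevin", "Alvin"), ("Kevin", "Simon"), ("Kevin", "Theodore"),
    ("Bob", "Alvin"), ("Bob", "Simon"),
]


def raw_matches(pairs):
    """One row per (u, zombie, somebody_else) combination where two different users saw the zombie."""
    return [
        zombie
        for u, zombie in pairs
        for somebody_else, other_zombie in pairs
        if other_zombie == zombie and somebody_else != u
    ]


def distinct_matches(pairs):
    """Collapse repeated zombies (sorted only to make the output stable)."""
    return sorted(set(raw_matches(pairs)))


print(raw_matches(spotted))       # ['Alvin', 'Simon', 'Alvin', 'Simon']
print(distinct_matches(spotted))  # ['Alvin', 'Simon']
```

Each shared zombie shows up twice in the raw rows, once with Kevin as the starting user and once with Bob, and Theodore never appears because only one user spotted him.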

To be more specific, we can also run a query to list the zombies that were spotted by both Bob and Kevin (i.e. the “mutual friends” query):

START kevin=node:node_auto_index(firstname='Kevin'),
      bob=node:node_auto_index(firstname='Bob')
MATCH (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob)
RETURN zombie.name

When I execute this code, I get the results “Alvin” and “Simon” (the diamond pattern) but I don’t get “Theodore” because only Kevin saw Theodore, not Bob.
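Viewed as sets, the “mutual” part of this query is an intersection. Here is a small Python sketch of that idea; the `sightings` dictionary and the `mutual_sightings` helper are hypothetical and simply restate the sample data rather than reading it from Neo4j.

```python
# Zombies spotted by each user, taken from the sample data.
sightings = {
    "Kevin": {"Alvin", "Simon", "Theodore"},
    "Bob": {"Alvin", "Simon"},
}


def mutual_sightings(first_user, second_user):
    """Zombies spotted by both users (the bottom point of each diamond), sorted by name."""
    return sorted(sightings[first_user] & sightings[second_user])


print(mutual_sightings("Kevin", "Bob"))               # ['Alvin', 'Simon']
print(sorted(sightings["Kevin"] - sightings["Bob"]))  # ['Theodore']
```

The set difference on the last line shows why Theodore is left out: he belongs to Kevin's sightings only, so he never forms a diamond with Bob.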

You can see that with unbelievably simple and remarkably readable syntax like (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob), you can describe intricate relationships and perform absolutely mindblowing queries against a graph database. This is how Facebook gives you the information it gives you. It’s how Amazon gives you product recommendations, and it’s how LinkedIn knows that you have “2nd” and “3rd” degree contacts in your professional network (it’s counting hops across the “connected to” relationships in a graph traversal).
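Counting hops is a breadth-first traversal. As a hedged sketch of the idea only (the `connections` network, its names, and the `degrees_from` helper are all made up for illustration, and this says nothing about how any of those sites actually implement it), a few lines of Python are enough:

```python
from collections import deque

# Hypothetical undirected "connected to" network; the names are made up.
connections = {
    "you": ["ann", "raj"],
    "ann": ["you", "mei"],
    "raj": ["you"],
    "mei": ["ann", "omar"],
    "omar": ["mei"],
}


def degrees_from(start, graph):
    """Breadth-first search returning the hop count from start to every reachable person."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in hops:  # first visit is the shortest path in BFS
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    return hops


print(degrees_from("you", connections))
```

In this made-up network, "mei" comes back at 2 hops (a “2nd” degree contact) and "omar" at 3.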

This is just the tip of the tip of the iceberg, barely scratching the surface of the power of graph databases. They can be used for all kinds of amazing things, most notably social networking and post-apocalyptic zombie sighting management. What could you use one for on your projects?