Kotan Code 枯淡コード

In search of simple, elegant code

Menu Close

Tag: neo4j

First Exposure to OrientDB

So I decided to try and give Medium a shot. As a result, I’ve posted my initial thoughts on the graph/document database OrientDB over there. Click here to read the full blog.

Accessing Neo4j from Play Framework 2.1 with Web Service Futures

As you may have noticed, I’ve recently been doing a little bit of talking about Neo4j and graph databases. A database on its own is a fantastic thing, but it doesn’t do anybody much good if the information contained within stays there, or if you can’t put information into it from some application. In my own proof of concept, I wanted to see how easy it would be to access Neo4j from my favorite web application framework, Play.

Turns out it was a little annoying because I started looking for frameworks to talk to Neo4j. First I looked at AnormCypher, which is a library that provides an “anorm-like” API access to Cypher. I couldn’t use this one because it failed with multiple runtime errors when I attempted to bring it into my Play application. Next, I tried another Cypher wrapper that I found on github and that one failed miserably, too – it had a conflict between the version of a JSON parser it was using versus the one that my Play application was using.

Then I figured, why even bother? It’s just a REST API. So, I decided to try accessing Neo4j using nothing but Play framework’s own Web services API, which is basically the WS object. This turned out to be stupidly easy and, when I figured out future chaining in combination with that WS API, amazing stuff started to happen. You know, like, productivity.

First, I created a teeny little wrapper around the Neo4j REST API URL, I just called it NeoService:

class NeoService(rootUrl: String) {
    def this() = this("http://default/neo/URL/location/db/data")

    val stdHeaders = Seq( ("Accept", "application/json"), ("Content-Type", "application/json") )

    def executeCypher(query: String, params: JsObject) : Future[Response] = {
        WS.url(rootUrl + "/cypher".withHeaders(stdHeaders:_*).post(Json.obj(
            "query" -> query,
            "params" -> params
            ))
       )
    }

    // Convert a Future[Response] into a Future[Int]!
    def findNodeIdByKindAndName(kind:String, name:String) : Future[Option[Int]] = {
        val cypher = """
          START n=node:node_auto_index(kind={theKind})
          WHERE n.name = {theName}
          RETURN id(n) as id
       """.stripMargin
        val params = Json.obj( "theName" -> name, "theKind" -> kind )
        for (r <- executeCyper(cypher, params)) yield {
            val theData = (r.json \ "data").as[JsArray]
            if (theData.value.size == 0)
               None
            else
               // Json: "data" : [ [ ### ] ]
               Some(theData.value(0).as[JsArray].value(0).as[Int])
        }
    }
}

So, in this little tiny class I’ve got a function that I can use to execute Cypher queries and because I know that I have an auto-index’d node property called kind and I have another property called name, I can attempt to find a node’s ID based on the kind and name properties using a Cypher query. Instead of finding them synchronously, I can just convert my Future[Response]  into a Future[Option[Int]] where the Option[Int] is the result of looking through the Neo4j database for that data. I’ve just converted a future with a future, and being able to do so is pretty freaking awesome.

Not only can I do that, but I can chain these method calls from other code, like this:

val neo = new utils.NeoService()
for (nodeId <- neo.findNodeIdByKindAndName("zombie", "bob"))
    for (zombieId <- otherObject.tweakZombie(zombieId))
        yield {
           zombieId.map { id => println("I got a zombie!") }.getOrElse { println("Zombie doesn't exist!") }
        }

In the code here, tweakZombie can be written to be aware of the Option[] so that if it’s empty, it doesn’t tweak anything, allowing me to chain call after call after call and not worry about slapping a crapload of if statements in there – a judicious use of map, and for, and options and futures gives me a ridiculous amount of power.

All of this is made possible by the fact that the Play Framework Web Services API is based on Futures. I was originally skeptical of futures because the old code I had seen before was more confusing than single-threaded, synchronous programming. With the new Futures syntax and well-coded libraries like the Play WS object, you can do ridiculous things in a very small number of lines of code.

Modeling Arrays and Complex, Nested Objects in Neo4j

Today I was messing around with Neo4j, as I am wont to do, and I ran across a modeling scenario that I hadn’t expected. My previous experience with NoSQL databases has been with MongoDB. I’ve used Mongo for multiple projects, including a MUD and an MMORPG server both written in Akka and Scala.

When using MongoDB as a backing store for my enterprise application, I’ve been storing multi-level objects in there without much concern. I’ve got a root object that contains an array of nested objects which, in turn, actually contain even more nested objects. The structure is a few levels deep, but using techniques like Play Framework’s JSON inception, it makes serializing an de-serializing to and from Scala case classes pretty easy. All in all, it’s been working out quite well.

When looking at how to store information that I might otherwise put in a NoSQL store in Neo4j to take advantage of its graph query and traversal capabilities, I noticed that Neo4j, while “speaking JSON”, doesn’t do so at quite the robust level that MongoDB does. In short, you can’t store arrays or nested objects in a Neo4j node or relationship. Now, before you complain about this, there are a pile of good reasons for this that I could really only begin to touch the surface of if I tried to explain them. Bottom line is that nodes and relationships in Neo4j have properties, which are name-value pairs, not recursive name-value maps like you find in Mongo.

So, with that restriction in place, how do we model complex, nested objects and arrays in Neo4j?

First we can take a look at why we need an array. In many cases in my model, the arrays of nested objects on a root object were actually modeling a parent-child type relationship where the children had their own set of properties. So, let’s say you want a node in your graph to be a WellArmedZombie and that particular zombie needs an array of weapons. If that zombie owns or contains or is responsible for those weapons, you can split that single node into many, one node for the core WellArmedZombie, one node for each of its weapons, and a relationship (probably -[:WEAPON]->) between the zombie and each weapon.

In my own exercise attempting to remodel a MongoDB database as a Neo4j database, nested objects were very easy to convert. In fact, in almost all cases, I was able to take the field name that referred to the nested item and turn it into a relationship pointing to a related node containing the properties that used to belong to the nested item.

Let’s say this same WellArmedZombie node, in Mongo, had a property called armor which was a nested JSON object containing properties like strengthmaterial, and color. It is pretty easy to switch that into Neo4j where we have a relationship that looks like (zombie)-[:ARMOR]->(armor).

So, the one lesson I learned today was really that thinking relationally requires some unlearning before you can really take advantage of a NoSQL database model. Further, thinking in a graph requires you to unlearn some pretty standard conventions and best practices that are a part of thinking in documents rather than nodes and edges. I realize that as we discover new tools, we often feel like the shiny new tool will fix all problems. It doesn’t, and I’ve got a couple models/domains I work with that I don’t think are good candidates for Neo4j. I have another model that I’ve used where I think a hybrid Neo4j/Mongo approach might be appropriate – put all the information you need for queries, traversals, and quick summary display into the Neo4j database and put the large, bulky (heavily nested and composed) data into Mongo for supplementary query.

Regardless, I am pretty damn excited to be living and working in a time where the kind of computing power that Neo4j offers is pretty much ubiquitous, free, and readily available on my machine and in the cloud. It’s a great time to be a developer.

To Entity or Not to Entity, That is the Graph DB Design Question

In my last blog post, I offered up a sample zombie domain as an area where refactoring a property-laden relationship into an entity on its own might be beneficial. Unfortunately, I think some of the context was lost in the discussion about zombies. The main point that I thought was important to the refactor is that if two entities interact with each other in a meaningful way at some point in time, that interaction is a likely candidate for producing a third entity as opposed to simply putting information about the interaction on the relationship.

The biggest motivator for this is if you have two entities, and one entity can interact multiple times with the other entity and the name of the relationship describing that interaction is the same then you absolutely need a third entity in the way and you simply can’t model it with a simple A->B relationship. In the comments section of my previous blog post, I discussed a possible scenario where you might have a wildlife monitoring scenario. In this scenario, multiple researchers interact with animals where they take vital statistics (I used temperature because I know nothing about wildlife conservation and figured body temp might be useful…). The key here is that the same researcher can interact with the same animal for the same purpose (containing different metadata) multiple times.

Here’s a visualization of a really simple graph that has 2 researchers, three animals, and multiple recorded “check ups”:

Wildlife Checkin Graph

Wildlife Checkin Graph

Here’s the Cypher that created this graph:

CREATE joe={_label:'researcher', firstname:'Joe', lastname:'Doe'},
  sally={_label:'researcher', firstname:'Sally', lastname:'McSmart'},
  ann={_label:'researcher', firstname:'Ann', lastname:'DuScience'},
  bubba={_label:'animal', name:'Bubba'},
  kong={_label:'animal', name:'Kong'},
  dp1={_label: 'checkpoint', date:'04/13/2013', temp:98, lat:40.7657, long:73.9856},
  (joe)-[:CHECKED]->(dp1)-[:ANIMAL]->(bubba),
  dp2={_label: 'checkpoint', date:'04/14/2013', temp:99, lat:40.7657, long:73.9856},
  (sally)-[:CHECKED]->(dp2)-[:ANIMAL]->(bubba),
  dp3={_label: 'checkpoint', date:'04/14/2013', temp:102, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp3)-[:ANIMAL]->(kong),
  dp4={_label: 'checkpoint', date:'04/15/2013', temp:103, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp4)-[:ANIMAL]->(kong),
  dp5={_label: 'checkpoint', date:'04/15/2013', temp:99, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp5)-[:ANIMAL]->(bubba)

From the graph (and from the Cypher if you’re good at inferring from that) we can see that Ann checked in with the same animal twice and got two different body temperature measurements. Some queries that I can run on this simple dataset:

Bubba’s average body temperature:

START bubba=node:node_auto_index(name='Bubba')
MATCH bubba<-[:ANIMAL]-(checkpoint)
RETURN avg(checkpoint.temp)

Checkins by researcher and the average body temp taken by said researcher (note the lack of a “Group by” statement here):

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN researcher.firstname, count(researcher) as checkCount, avg(checkpoint.temp)
ORDER BY checkCount DESC

Animals, the number of times they were checked, and the highest temperature recorded:

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN animal.name, count(checkpoint), max(checkpoint.temp)

If I were better at graph databases, and my dataset was richer, I might be able to write queries that show me clusters, perhaps showing me that a particular researcher is recording much higher than normal temps, and so there may be a correlation between their locations and sick animals, etc.

No matter which way you want to query this particular type of graph, the one thing that I do know is that this one truly does require the kind of refactoring I mentioned in the previous blog post, where we have to take a direct relationship between two entities and put a new entity in the middle.

Avoiding Graph Modeling Antipatterns

In my previous blog post, which I just posted this morning, it turns out that I made a pretty large modeling no-no. In terms of consequences to the application and the survival of the zombie apocalypse, I think the issue was pretty minor. However, it does illustrate that no matter how awesome a graph database product is, you still need to put some thought into how you model the database.

In the previous model, I could make assertions like user-[:SPOTTED]-zombie and there were some attributes on the spotted relationship such as the date the sighting took place. As it turns out, this is an antipattern. There are a couple of things I did wrong here.

Firstly, the attributes on the relationship have nothing to do with the weighting or nature of the relationship. The attributes I put on that relationship were informational and arguably belonged on a separate entity. In other words, the sighting was a noun that I had elided from my relationship sentence “a user spotted a zombie”

In the forthcoming Graph Databases book being published by O’Reilly, there is a best practice in graph modeling that says that when the interaction of two nodes produces something of significance, that interaction should be modeled as its own entity rather than modeled as properties on the relationship between those nodes.  In my sample zombie domain, this means that a user produces a sighting when they spotzombie. The sighting has metadata about the actual sighting, which can include the latitude and longitude (I know that neo4j has its own built-in geospatial support, which I will ignore for this example) of where the sighting was, and other information about the sighting itself, such as the user’s confidence level that they saw what they think they saw (aka the “bigfoot factor”).

Now I can go back and remodel my previous database using the following graph model created via Cypher:

	CREATE user1={_label: 'user', firstname: 'Kevin', lastname: 'Hoffman'},
		   user2={_label: 'user', firstname: 'Bob', lastname: 'Bobberson'},
	       alvin={_label: 'zombie', name: 'Alvin', power:20},
	       simon={_label: 'zombie', name: 'Simon', power:30},
	       theo={_label: 'zombie', name: 'Theodore', power:40},
	       sighting1={_label: 'sighting', date:'01/01/2013', lat: 42.12, long: 21.03},
	       sighting2={_label: 'sighting', date:'01/02/2013', lat: 42.00, long: 25.00},
	       sighting3={_label: 'sighting', date:'01/03/2013', lat: 47.32, long: 30.21},
	       sighting4={_label: 'sighting', date:'01/01/2013', lat: 40.22, long: 27.03},
	       sighting5={_label: 'sighting', date:'01/02/2013', lat: 43.00, long: 21.12},
	       user1-[:SPOTTED {confidence: 10}]->sighting1-[:TARGET]->alvin,
	       user1-[:SPOTTED {confidence: 12}]->sighting2-[:TARGET]->simon,
	       user1-[:SPOTTED {confidence: 27}]->sighting3-[:TARGET]->theo,
	       user2-[:SPOTTED {confidence: 50}]->sighting4-[:TARGET]->alvin,
	       user2-[:SPOTTED {confidence: 80}]->sighting5-[:TARGET]->simon

Now I’ve added a confidence factor to the relationship itself, which allows me to do a weighted traversal over the relationship nodes. This is a far better user for attributes belonging to a relationship. Also note that the non-traversal information is now more appropriately located as attributes on an entity. This should all actually seem pretty familiar to you if you’ve done any work with Domain-Driven Design.

With this new model in place, I can now visualize my graph like this:

Modified Zombie Sighting Graph

Modified Zombie Sighting Graph

Rest assured, I can still make effective queries against the graph, they just involve another hop to get the information that I used to get before. So, I can write the query to get the list of zombies that have two “spotted” arrows pointing to them (two different users spotted the same zombie), I just need to extend the shape of the subgraph I’m looking for to traverse over the “sighting” node. In the query below, I don’t care about the information on the node, so I can use double-parens for an anonymous node:

START u=node:node_auto_index(_label='user')
MATCH u-[:SPOTTED]->()-[:TARGET]->(zombie)<-[:TARGET]-()<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie.name

This matches all zombies that have a spotted->(sighting)->target path from a user and also have the reverse <-target-(sighting)<-spotted relationship coming from another user. As expected, the above query returns "Alvin" and "Simon" but not "Theodore". I dare you to try and jump two join tables going one way and then jump the same two join tables going back out without ripping a hole in your RDBMS and/or brain. Moral of the story - even amazing tools like Neo4j require thoughtful care of the data model, though Neo4j can certainly handle refactoring and upgrading much more easily than a regular SQL database.

Hello (post-Apocalyptic) World with the Neo4j Graph Database

Last week I was talking to a colleague about when a graph database might be overkill versus when it might be the right solution. That got me to thinking, and so I spent several days taking a look at various data models that I used on a regular basis and re-imagining those models as graph models. The results were pretty interesting in that it was the rare case where I felt an application could not be improved by the use of a graph model.

So this got me interested in messing around with a graph database for my favorite sample domain: zombies. In many previous posts and sample applications, I have used a sample domain of an application server that receives zombie sighting messages and in turn sends out messages to aid and inform those attempting to survive the zombie apocalypse.

In my domain, users (human survivors of the zombie apocalypse) can spot zombies. A zombie sighting is a record of vital information about a particular zombie so that zombie can be tracked. Ideally, some higher-level analysis can be done about clusters of zombies (geospatial analysis … also ideally suited to graph databases) and we might also want to do some other analysis on zombies, like correlations between sightings or, better yet, if two people keep reporting the same zombie in similar areas, then perhaps those people should team up.

First thing we need to do is create some sample data. In a graph database, sample sets can be ridiculously big so I’m going to keep it simple and create a “diamond shape” (you’ll see that in a bit) as well as another relationship dangling off the side. This sample set has two users who are reporting sightings of three different zombies: Alvin, Simon, and Theodore. To get this data into the database I’ll use a Cypher (neo4j’s query language) statement. It’s worth noting how the description of relationships is very much like an ASCII Art depiction of what the relationship would look like if you drew it on a whiteboard. This is a huge benefit and one of the things that drew me to Neo4j to start with:

CREATE user1 = { label: 'user', firstname: 'Kevin', lastname: 'Hoffman' },
       user2 = { label: 'user', firstname: 'Bob', lastname: 'Bobberson' },
       alvin = { label: 'zombie', name: 'Alvin', power: 20 },
       simon = { label: 'zombie', name: 'Simon', power: 50 },
       theodore = { label: 'zombie', name: 'Theodore', power:80 },
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user1)-[:SPOTTED {date: '01/03/2013'}]->(simon),
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(theodore),
       (user2)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user2)-[:SPOTTED {date: '01/03/2013'}]->(simon)

So here we’re creating two users and three zombies, and we can see that the Kevin user spotted alvin and simon while the Bob user spotted all three of the zombies. We can use Neo4j’s built-in data visualizer to see what this graph looks like:

Zombie Neo4j Graph

Zombie Neo4j Graph

There are a couple of really important points in this graph. The first is that the connections between the nodes are directional. The other is that there is data associated with the relationships as well as with the nodes themselves, which is one area where the use of a graph database truly shines over a traditional RDBMS with “join tables”.

The next thing to notice is that the left side of this graph forms a very familiar “diamond” pattern. The diamond pattern, in social networking circles, can be used to determine mutual friends. In our case, we don’t really have mutual friends but we do have mutual zombie sightings. To get a list of the zombies which have been spotted by multiple users, we can issue the following query:

START u=node:node_auto_index(label='user')
MATCH u-[:SPOTTED]-(zombie)<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie

Here we are matching all user nodes with the zombies they spotted, which have also been spotted by another user. This could potentially return multiple duplicates of the zombie (Bob and Kevin both spotted Alvin) so we remove duplicate nodes with the DISTINCT clause.

To be more specific, we can also run a query to list off the zombies that were spotted by Bob and Kevin (e.g. the “mutual friends”) query:

START bob=node:node_auto_index(firstname='Bob'),
      kevin=node:node_auto_index(firstname='Kevin')
MATCH (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob)
RETURN zombie.name

When I execute this code, I get the results “Alvin” and “Simon” (the diamond pattern) but I don’t get “Theodore” because only Kevin saw Theodore, not Bob.

You can see that with unbelievably simple, and remarkably readable syntax like (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob), you can describe intricate relationships and perform absolutely mindblowing queries against a graph database. This is how Facebook gives you the information it gives you, it’s how Amazon gives you product recommendations, and it’s how LinkedIn knows that you have “2nd” and “3rd” degree contacts in your professional network (it’s counting hops across the “connected to” relationships in a graph traversal).

This is just the tip of the tip of the iceberg, barely scratching the surface of the power of graph databases. They can be used for all kinds of amazing things, most notably social networking and post-apocalyptic zombie sighting management. What could you use one for on your projects?