Kotan Code 枯淡コード

In search of simple, elegant code

Menu Close

Tag: graph databases

First Exposure to OrientDB

So I decided to try and give Medium a shot. As a result, I’ve posted my initial thoughts on the graph/document database OrientDB over there. Click here to read the full blog.

What can the NSA do with Verizon’s phone records?

In case you’ve been unplugged and haven’t heard, there’s yet another scandal a-brewing: The NSA obtained, through a special warrant, the ability to pull phone records from a Verizon subsidiary individually and in bulk. When I say phone records I mean a list of phone numbers and the numbers those numbers called and the duration of that call. No subscriber names or other personal information is included in this dump.

For those unfamiliar with this type of operation, it’s basic data mining 101, more specifically, they’re looking for patterns and, at least during pattern discovery, they don’t particularly care about the names of the pattern participants, only in the shape of those patterns. If you take a look at the type of information the NSA requested, it becomes immediately obvious to anyone who has worked with graph databases that they intend to build a network and then run analysis on it.

If you think about making a phone call, you are establishing a link between two participants, the caller and the callee. This establishes what in graph database parlance is called a directed relationship. In other words, you can think of it as a little picture that looks like this:

867-5309 —CALLED—>555-1212

Since we know the duration of the call, we can use that duration to weight the relationship between the two phone numbers, assuming that a longer duration call implies a “heavier” weight on the CALLED relationship:

867-5309—CALLED(5 minutes)—>555-1212

So now that we have this massive database filled with links between seemingly anonymous phone numbers, what would the NSA do with it? The first thing they can do is network detection. They can take a look at the numbers and see if particular shapes show up, the simplest of which is a “ring”:

A–CALLED–>B–CALLED–>C–CALLED–>A

This could be a completely innocent ring, or it could be terrorist information passing where the inbound call comes from someone who was never in direct outbound contact with the original caller. If you’re looking at a straight pull of phone records looking only at a single target phone number then you would never be able to tell that a seemingly innocent outbound call to B resulted in an inbound call from C (unless of course the bad guys were dumb enough to repeat this same pattern often enough so someone would see the same inbound call always happens 10 minutes after the same outbound call).

So, the NSA can detect rings of phone numbers. They can also detect a bunch of other well-known network/subnetwork shapes like diamonds, triangles, etc. Of course, the real power of putting this information in a true graph database is that they can detect networks of unknown shapes. You can ask the database to find recurring patterns over time of a shape that you’ve never seen before. Regular communication patterns would start to appear.

This is just one phase in a data mining process. Once you’ve got a potential list of networks (innocent or criminal, at this point it doesn’t matter, they’re just identifying shapes) then you can run that against the list of calls to or from foreign countries. Now you can see which networks have regular communication with foreign phone numbers. Then you can see if any of those foreign phone numbers are in hot spots or potentially owned by or used by the bad guys.

The list of information the NSA can mine from this data (remember that information and data are not the same thing, information has meaning, data is just data until meaning is gleaned from it in the form of information) goes on and on. When you think about the other resources and databases the NSA has at its disposal that they can use to cross-reference the information from this operation, the conclusions or insights they might gain are staggering.

Should you be upset that your rights have been violated if you happen to have your phone on this Verizon subsidiary? I’m not going to tell you whether or not to feel violated, that’s not the point of this blog post. You should, however, feel pretty confident that nobody at the NSA cares if you are making regular calls to your mistress, or any other activities that might be in your “network”. If, on the other hand, you make regular phone calls to a guy who receives regular phone calls from a terrorist, then you might expect a phone call or a visit from men in suits 😉

Regardless of how you feel about the means by which this information is gathered, it seems obvious to me that potential terrorist acts might be prevented by analyzing this information. Whether our rights were violated in the execution of this data mining operation is a topic for another blog post entirely.

Modeling Arrays and Complex, Nested Objects in Neo4j

Today I was messing around with Neo4j, as I am wont to do, and I ran across a modeling scenario that I hadn’t expected. My previous experience with NoSQL databases has been with MongoDB. I’ve used Mongo for multiple projects, including a MUD and an MMORPG server both written in Akka and Scala.

When using MongoDB as a backing store for my enterprise application, I’ve been storing multi-level objects in there without much concern. I’ve got a root object that contains an array of nested objects which, in turn, actually contain even more nested objects. The structure is a few levels deep, but using techniques like Play Framework’s JSON inception, it makes serializing an de-serializing to and from Scala case classes pretty easy. All in all, it’s been working out quite well.

When looking at how to store information that I might otherwise put in a NoSQL store in Neo4j to take advantage of its graph query and traversal capabilities, I noticed that Neo4j, while “speaking JSON”, doesn’t do so at quite the robust level that MongoDB does. In short, you can’t store arrays or nested objects in a Neo4j node or relationship. Now, before you complain about this, there are a pile of good reasons for this that I could really only begin to touch the surface of if I tried to explain them. Bottom line is that nodes and relationships in Neo4j have properties, which are name-value pairs, not recursive name-value maps like you find in Mongo.

So, with that restriction in place, how do we model complex, nested objects and arrays in Neo4j?

First we can take a look at why we need an array. In many cases in my model, the arrays of nested objects on a root object were actually modeling a parent-child type relationship where the children had their own set of properties. So, let’s say you want a node in your graph to be a WellArmedZombie and that particular zombie needs an array of weapons. If that zombie owns or contains or is responsible for those weapons, you can split that single node into many, one node for the core WellArmedZombie, one node for each of its weapons, and a relationship (probably -[:WEAPON]->) between the zombie and each weapon.

In my own exercise attempting to remodel a MongoDB database as a Neo4j database, nested objects were very easy to convert. In fact, in almost all cases, I was able to take the field name that referred to the nested item and turn it into a relationship pointing to a related node containing the properties that used to belong to the nested item.

Let’s say this same WellArmedZombie node, in Mongo, had a property called armor which was a nested JSON object containing properties like strengthmaterial, and color. It is pretty easy to switch that into Neo4j where we have a relationship that looks like (zombie)-[:ARMOR]->(armor).

So, the one lesson I learned today was really that thinking relationally requires some unlearning before you can really take advantage of a NoSQL database model. Further, thinking in a graph requires you to unlearn some pretty standard conventions and best practices that are a part of thinking in documents rather than nodes and edges. I realize that as we discover new tools, we often feel like the shiny new tool will fix all problems. It doesn’t, and I’ve got a couple models/domains I work with that I don’t think are good candidates for Neo4j. I have another model that I’ve used where I think a hybrid Neo4j/Mongo approach might be appropriate – put all the information you need for queries, traversals, and quick summary display into the Neo4j database and put the large, bulky (heavily nested and composed) data into Mongo for supplementary query.

Regardless, I am pretty damn excited to be living and working in a time where the kind of computing power that Neo4j offers is pretty much ubiquitous, free, and readily available on my machine and in the cloud. It’s a great time to be a developer.

To Entity or Not to Entity, That is the Graph DB Design Question

In my last blog post, I offered up a sample zombie domain as an area where refactoring a property-laden relationship into an entity on its own might be beneficial. Unfortunately, I think some of the context was lost in the discussion about zombies. The main point that I thought was important to the refactor is that if two entities interact with each other in a meaningful way at some point in time, that interaction is a likely candidate for producing a third entity as opposed to simply putting information about the interaction on the relationship.

The biggest motivator for this is if you have two entities, and one entity can interact multiple times with the other entity and the name of the relationship describing that interaction is the same then you absolutely need a third entity in the way and you simply can’t model it with a simple A->B relationship. In the comments section of my previous blog post, I discussed a possible scenario where you might have a wildlife monitoring scenario. In this scenario, multiple researchers interact with animals where they take vital statistics (I used temperature because I know nothing about wildlife conservation and figured body temp might be useful…). The key here is that the same researcher can interact with the same animal for the same purpose (containing different metadata) multiple times.

Here’s a visualization of a really simple graph that has 2 researchers, three animals, and multiple recorded “check ups”:

Wildlife Checkin Graph

Wildlife Checkin Graph

Here’s the Cypher that created this graph:

CREATE joe={_label:'researcher', firstname:'Joe', lastname:'Doe'},
  sally={_label:'researcher', firstname:'Sally', lastname:'McSmart'},
  ann={_label:'researcher', firstname:'Ann', lastname:'DuScience'},
  bubba={_label:'animal', name:'Bubba'},
  kong={_label:'animal', name:'Kong'},
  dp1={_label: 'checkpoint', date:'04/13/2013', temp:98, lat:40.7657, long:73.9856},
  (joe)-[:CHECKED]->(dp1)-[:ANIMAL]->(bubba),
  dp2={_label: 'checkpoint', date:'04/14/2013', temp:99, lat:40.7657, long:73.9856},
  (sally)-[:CHECKED]->(dp2)-[:ANIMAL]->(bubba),
  dp3={_label: 'checkpoint', date:'04/14/2013', temp:102, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp3)-[:ANIMAL]->(kong),
  dp4={_label: 'checkpoint', date:'04/15/2013', temp:103, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp4)-[:ANIMAL]->(kong),
  dp5={_label: 'checkpoint', date:'04/15/2013', temp:99, lat:40.7657, long:73.9856},
  (ann)-[:CHECKED]->(dp5)-[:ANIMAL]->(bubba)

From the graph (and from the Cypher if you’re good at inferring from that) we can see that Ann checked in with the same animal twice and got two different body temperature measurements. Some queries that I can run on this simple dataset:

Bubba’s average body temperature:

START bubba=node:node_auto_index(name='Bubba')
MATCH bubba<-[:ANIMAL]-(checkpoint)
RETURN avg(checkpoint.temp)

Checkins by researcher and the average body temp taken by said researcher (note the lack of a “Group by” statement here):

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN researcher.firstname, count(researcher) as checkCount, avg(checkpoint.temp)
ORDER BY checkCount DESC

Animals, the number of times they were checked, and the highest temperature recorded:

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN animal.name, count(checkpoint), max(checkpoint.temp)

If I were better at graph databases, and my dataset was richer, I might be able to write queries that show me clusters, perhaps showing me that a particular researcher is recording much higher than normal temps, and so there may be a correlation between their locations and sick animals, etc.

No matter which way you want to query this particular type of graph, the one thing that I do know is that this one truly does require the kind of refactoring I mentioned in the previous blog post, where we have to take a direct relationship between two entities and put a new entity in the middle.

Avoiding Graph Modeling Antipatterns

In my previous blog post, which I just posted this morning, it turns out that I made a pretty large modeling no-no. In terms of consequences to the application and the survival of the zombie apocalypse, I think the issue was pretty minor. However, it does illustrate that no matter how awesome a graph database product is, you still need to put some thought into how you model the database.

In the previous model, I could make assertions like user-[:SPOTTED]-zombie and there were some attributes on the spotted relationship such as the date the sighting took place. As it turns out, this is an antipattern. There are a couple of things I did wrong here.

Firstly, the attributes on the relationship have nothing to do with the weighting or nature of the relationship. The attributes I put on that relationship were informational and arguably belonged on a separate entity. In other words, the sighting was a noun that I had elided from my relationship sentence “a user spotted a zombie”

In the forthcoming Graph Databases book being published by O’Reilly, there is a best practice in graph modeling that says that when the interaction of two nodes produces something of significance, that interaction should be modeled as its own entity rather than modeled as properties on the relationship between those nodes.  In my sample zombie domain, this means that a user produces a sighting when they spotzombie. The sighting has metadata about the actual sighting, which can include the latitude and longitude (I know that neo4j has its own built-in geospatial support, which I will ignore for this example) of where the sighting was, and other information about the sighting itself, such as the user’s confidence level that they saw what they think they saw (aka the “bigfoot factor”).

Now I can go back and remodel my previous database using the following graph model created via Cypher:

	CREATE user1={_label: 'user', firstname: 'Kevin', lastname: 'Hoffman'},
		   user2={_label: 'user', firstname: 'Bob', lastname: 'Bobberson'},
	       alvin={_label: 'zombie', name: 'Alvin', power:20},
	       simon={_label: 'zombie', name: 'Simon', power:30},
	       theo={_label: 'zombie', name: 'Theodore', power:40},
	       sighting1={_label: 'sighting', date:'01/01/2013', lat: 42.12, long: 21.03},
	       sighting2={_label: 'sighting', date:'01/02/2013', lat: 42.00, long: 25.00},
	       sighting3={_label: 'sighting', date:'01/03/2013', lat: 47.32, long: 30.21},
	       sighting4={_label: 'sighting', date:'01/01/2013', lat: 40.22, long: 27.03},
	       sighting5={_label: 'sighting', date:'01/02/2013', lat: 43.00, long: 21.12},
	       user1-[:SPOTTED {confidence: 10}]->sighting1-[:TARGET]->alvin,
	       user1-[:SPOTTED {confidence: 12}]->sighting2-[:TARGET]->simon,
	       user1-[:SPOTTED {confidence: 27}]->sighting3-[:TARGET]->theo,
	       user2-[:SPOTTED {confidence: 50}]->sighting4-[:TARGET]->alvin,
	       user2-[:SPOTTED {confidence: 80}]->sighting5-[:TARGET]->simon

Now I’ve added a confidence factor to the relationship itself, which allows me to do a weighted traversal over the relationship nodes. This is a far better user for attributes belonging to a relationship. Also note that the non-traversal information is now more appropriately located as attributes on an entity. This should all actually seem pretty familiar to you if you’ve done any work with Domain-Driven Design.

With this new model in place, I can now visualize my graph like this:

Modified Zombie Sighting Graph

Modified Zombie Sighting Graph

Rest assured, I can still make effective queries against the graph, they just involve another hop to get the information that I used to get before. So, I can write the query to get the list of zombies that have two “spotted” arrows pointing to them (two different users spotted the same zombie), I just need to extend the shape of the subgraph I’m looking for to traverse over the “sighting” node. In the query below, I don’t care about the information on the node, so I can use double-parens for an anonymous node:

START u=node:node_auto_index(_label='user')
MATCH u-[:SPOTTED]->()-[:TARGET]->(zombie)<-[:TARGET]-()<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie.name

This matches all zombies that have a spotted->(sighting)->target path from a user and also have the reverse <-target-(sighting)<-spotted relationship coming from another user. As expected, the above query returns "Alvin" and "Simon" but not "Theodore". I dare you to try and jump two join tables going one way and then jump the same two join tables going back out without ripping a hole in your RDBMS and/or brain. Moral of the story - even amazing tools like Neo4j require thoughtful care of the data model, though Neo4j can certainly handle refactoring and upgrading much more easily than a regular SQL database.

Hello (post-Apocalyptic) World with the Neo4j Graph Database

Last week I was talking to a colleague about when a graph database might be overkill versus when it might be the right solution. That got me to thinking, and so I spent several days taking a look at various data models that I used on a regular basis and re-imagining those models as graph models. The results were pretty interesting in that it was the rare case where I felt an application could not be improved by the use of a graph model.

So this got me interested in messing around with a graph database for my favorite sample domain: zombies. In many previous posts and sample applications, I have used a sample domain of an application server that receives zombie sighting messages and in turn sends out messages to aid and inform those attempting to survive the zombie apocalypse.

In my domain, users (human survivors of the zombie apocalypse) can spot zombies. A zombie sighting is a record of vital information about a particular zombie so that zombie can be tracked. Ideally, some higher-level analysis can be done about clusters of zombies (geospatial analysis … also ideally suited to graph databases) and we might also want to do some other analysis on zombies, like correlations between sightings or, better yet, if two people keep reporting the same zombie in similar areas, then perhaps those people should team up.

First thing we need to do is create some sample data. In a graph database, sample sets can be ridiculously big so I’m going to keep it simple and create a “diamond shape” (you’ll see that in a bit) as well as another relationship dangling off the side. This sample set has two users who are reporting sightings of three different zombies: Alvin, Simon, and Theodore. To get this data into the database I’ll use a Cypher (neo4j’s query language) statement. It’s worth noting how the description of relationships is very much like an ASCII Art depiction of what the relationship would look like if you drew it on a whiteboard. This is a huge benefit and one of the things that drew me to Neo4j to start with:

CREATE user1 = { label: 'user', firstname: 'Kevin', lastname: 'Hoffman' },
       user2 = { label: 'user', firstname: 'Bob', lastname: 'Bobberson' },
       alvin = { label: 'zombie', name: 'Alvin', power: 20 },
       simon = { label: 'zombie', name: 'Simon', power: 50 },
       theodore = { label: 'zombie', name: 'Theodore', power:80 },
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user1)-[:SPOTTED {date: '01/03/2013'}]->(simon),
       (user1)-[:SPOTTED {date: '01/02/2013'}]->(theodore),
       (user2)-[:SPOTTED {date: '01/02/2013'}]->(alvin),
       (user2)-[:SPOTTED {date: '01/03/2013'}]->(simon)

So here we’re creating two users and three zombies, and we can see that the Kevin user spotted alvin and simon while the Bob user spotted all three of the zombies. We can use Neo4j’s built-in data visualizer to see what this graph looks like:

Zombie Neo4j Graph

Zombie Neo4j Graph

There are a couple of really important points in this graph. The first is that the connections between the nodes are directional. The other is that there is data associated with the relationships as well as with the nodes themselves, which is one area where the use of a graph database truly shines over a traditional RDBMS with “join tables”.

The next thing to notice is that the left side of this graph forms a very familiar “diamond” pattern. The diamond pattern, in social networking circles, can be used to determine mutual friends. In our case, we don’t really have mutual friends but we do have mutual zombie sightings. To get a list of the zombies which have been spotted by multiple users, we can issue the following query:

START u=node:node_auto_index(label='user')
MATCH u-[:SPOTTED]-(zombie)<-[:SPOTTED]-(somebodyelse)
RETURN DISTINCT zombie

Here we are matching all user nodes with the zombies they spotted, which have also been spotted by another user. This could potentially return multiple duplicates of the zombie (Bob and Kevin both spotted Alvin) so we remove duplicate nodes with the DISTINCT clause.

To be more specific, we can also run a query to list off the zombies that were spotted by Bob and Kevin (e.g. the “mutual friends”) query:

START bob=node:node_auto_index(firstname='Bob'),
      kevin=node:node_auto_index(firstname='Kevin')
MATCH (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob)
RETURN zombie.name

When I execute this code, I get the results “Alvin” and “Simon” (the diamond pattern) but I don’t get “Theodore” because only Kevin saw Theodore, not Bob.

You can see that with unbelievably simple, and remarkably readable syntax like (kevin)-[:SPOTTED]->(zombie)<-[:SPOTTED]-(bob), you can describe intricate relationships and perform absolutely mindblowing queries against a graph database. This is how Facebook gives you the information it gives you, it’s how Amazon gives you product recommendations, and it’s how LinkedIn knows that you have “2nd” and “3rd” degree contacts in your professional network (it’s counting hops across the “connected to” relationships in a graph traversal).

This is just the tip of the tip of the iceberg, barely scratching the surface of the power of graph databases. They can be used for all kinds of amazing things, most notably social networking and post-apocalyptic zombie sighting management. What could you use one for on your projects?