Kotan Code 枯淡コード

In search of simple, elegant code

Menu Close

To Entity or Not to Entity, That is the Graph DB Design Question

In my last blog post, I offered up a sample zombie domain as an area where refactoring a property-laden relationship into an entity on its own might be beneficial. Unfortunately, I think some of the context was lost in the discussion about zombies. The main point that I thought was important to the refactor is that if two entities interact with each other in a meaningful way at some point in time, that interaction is a likely candidate for producing a third entity as opposed to simply putting information about the interaction on the relationship.

The biggest motivator for this is if you have two entities, and one entity can interact multiple times with the other entity and the name of the relationship describing that interaction is the same then you absolutely need a third entity in the way and you simply can’t model it with a simple A->B relationship. In the comments section of my previous blog post, I discussed a possible scenario where you might have a wildlife monitoring scenario. In this scenario, multiple researchers interact with animals where they take vital statistics (I used temperature because I know nothing about wildlife conservation and figured body temp might be useful…). The key here is that the same researcher can interact with the same animal for the same purpose (containing different metadata) multiple times.

Here’s a visualization of a really simple graph that has 2 researchers, three animals, and multiple recorded “check ups”:

Wildlife Checkin Graph

Wildlife Checkin Graph

Here’s the Cypher that created this graph:

CREATE joe={_label:'researcher', firstname:'Joe', lastname:'Doe'},
  sally={_label:'researcher', firstname:'Sally', lastname:'McSmart'},
  ann={_label:'researcher', firstname:'Ann', lastname:'DuScience'},
  bubba={_label:'animal', name:'Bubba'},
  kong={_label:'animal', name:'Kong'},
  dp1={_label: 'checkpoint', date:'04/13/2013', temp:98, lat:40.7657, long:73.9856},
  dp2={_label: 'checkpoint', date:'04/14/2013', temp:99, lat:40.7657, long:73.9856},
  dp3={_label: 'checkpoint', date:'04/14/2013', temp:102, lat:40.7657, long:73.9856},
  dp4={_label: 'checkpoint', date:'04/15/2013', temp:103, lat:40.7657, long:73.9856},
  dp5={_label: 'checkpoint', date:'04/15/2013', temp:99, lat:40.7657, long:73.9856},

From the graph (and from the Cypher if you’re good at inferring from that) we can see that Ann checked in with the same animal twice and got two different body temperature measurements. Some queries that I can run on this simple dataset:

Bubba’s average body temperature:

START bubba=node:node_auto_index(name='Bubba')
MATCH bubba<-[:ANIMAL]-(checkpoint)
RETURN avg(checkpoint.temp)

Checkins by researcher and the average body temp taken by said researcher (note the lack of a “Group by” statement here):

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN researcher.firstname, count(researcher) as checkCount, avg(checkpoint.temp)
ORDER BY checkCount DESC

Animals, the number of times they were checked, and the highest temperature recorded:

START animal=node:node_auto_index(_label = 'animal')
MATCH animal<-[:ANIMAL]-(checkpoint)<-[:CHECKED]-(researcher)
RETURN animal.name, count(checkpoint), max(checkpoint.temp)

If I were better at graph databases, and my dataset was richer, I might be able to write queries that show me clusters, perhaps showing me that a particular researcher is recording much higher than normal temps, and so there may be a correlation between their locations and sick animals, etc.

No matter which way you want to query this particular type of graph, the one thing that I do know is that this one truly does require the kind of refactoring I mentioned in the previous blog post, where we have to take a direct relationship between two entities and put a new entity in the middle.