Everyman Software

2011-08-25

Getting Neo4jPHP working with CodeIgniter 2

This post is based heavily off of Integrating Doctrine 2 with CodeIgniter 2. I haven't tested it myself, so please use it as a starting point and let me know if I should update anything.

Here are the steps to get Neo4jPHP set up as a CodeIgniter 2 library:

Copy the 'Everyman' folder to application/libraries
Create a file called Everyman.php in application/libraries

Copy the following code into Everyman.php:

<?php
class Everyman
{
    public function __construct()
    {
        spl_autoload_register(array($this,'autoload'));
    }

    public function autoload($sClass)
    {
        $sLibPath = __DIR__.DIRECTORY_SEPARATOR;
        $sClassFile = str_replace('\\',DIRECTORY_SEPARATOR,$sClass).'.php';
        $sClassPath = $sLibPath.$sClassFile;
        if (file_exists($sClassPath)) {
            require($sClassPath);
        }
    }
}

Add the following line to application/config/autoload.php:
```
$autoload['libraries'] = array('everyman');
```

This is actually a generic autoloader that will attempt to autoload any namespaced class under application/library where the class name matches the file path.

2011-07-23

Performance of Graph vs. Relational Databases

A few weeks ago, Emil Eifrem, CEO of Neo Technology gave a webinar introduction to graph databases. I watched it as a lead up to my own presentation on graph databases and Neo4j.

Right around the 54 minute mark, Emil talks about a very interesting experiment showing the performance difference between a relational database and a graph database for a certain type of problem called "arbitrary path query", specifically, given 1,000 users with an average of 50 "friend" relationships each, determine if one person is connected to another in 4 or fewer hops.

Against a popular open-source relational database, the query took around 2,000 ms. For a graph database, the same determination took 2 ms. So the graph database was 1,000 times faster for this particular use case.

Not satisfied with that, they then decided to run the same experiment with 1,000,000 users. The graph database took 2 ms. They stopped the relational database after several days of waiting for results.

I showed this clip at my presentation to a few people who stuck around afterwards, and we tried to figure out why the graph database had the same performance with 1,000 times the data, while the relational database became unusably slow. The answer has to do with the way in which each type of database searches information.

Relational databases search all of the data looking for anything that meets the search criteria. The larger the set of data, the longer it takes to find matches, because the database has to examine everything in the collection.

Here's an example: let's assume there is a table with "friend" relationships:

> SELECT * FROM friends;
+-------------+--------------+
| user_id     | friend_id    |
+-------------+--------------+
| 1           | 2            |
| 1           | 3            |
| 1           | 4            |
| 2           | 5            |
| 2           | 6            |
| 2           | 7            |
| 3           | 8            |
| 3           | 9            |
| 3           | 10           |
+-------------+--------------+

In order to see if user "9" is connected to user "2", the database has to find all the friends of user "9", and see if user "2" is in that list. If not, find all of their friends, and then see if user "2" is in that list. The database has to scan the entire table each time. This means that if you double the number of rows in the table, you've doubled the amount of data to search, and thus doubled the amount of time it takes to find what you are looking for. (Even with indexing, it still has to find the values in the index tree, which involves traversing that tree. The index tree grows larger with each new record, meaning the time it takes to traverse grows larger as well. And for each search, you always start at the root of the tree.)

Each new batch of friends to look at requires an entirely new scan of the table/index. So more records leads to more search time.

Conversely, a graph database looks only at records that are directly connected to other records. If it is given a limit on how many "hops" it is allowed to make, it can ignore everything more than that number of hops away.

Only the blue records are ever seen during the search. And since a graph traversal "remembers" where it is at any time, it never has to start from the beginning, only from its last known position.

You could add another ring of records around the outside, or add a thousand more rings of records outside, but the search would never need to look at them because they are more steps away than the limit. The only way to increase the number of records searched (and thereby decrease performance) is to add more records within the 2-step limit, or add more relationships between the existing records.

So the reason why having 1,000 vs. 1,000,000 records causes such a stark difference between a relational and a graph database is that relational database performance decreases in relation to the number of records in the table, while graph database performance decreases in relation to the number of connections between the records.

2011-07-19

TriPUG Meetup slides

Here are the slides for my presentation on graph databases at the TriPUG meetup tonight: http://www.slideshare.net/JoshAdell/graphing-databases

Overall, I think it was pretty well received. I didn't mean for it to last an hour and a half, but there it is. There were a lot of new folks there; I'm encouraged at the enthusiasm of the developer community in the Triangle region. Thanks for having me!

2011-07-16

agile Adoption on a Non-Technical Team

I'm pretty excited about a new development at my office. Based on my team's success in increasing our productivity, efficiency and transparency, another team has decided to adopt "agile-like" practices. The most interesting thing is that this team isn't a technical team at all.

It's probably fair to say that what they're adopting is less like agile and more like lean. The goal was to increase visibility to their management structure on what the team was doing each day, increase the efficiency of their work, minimize interruptions of team members and raise problems earlier. To accomplish this, they have implemented a stripped-down version of Scrum, with the following practices:

Planning:
Daily Standup:
Scrum Board:
Retrospective:

The team's goals are very different from that of a software development team, so we encouraged them to not follow our process dogmatically, and instead, to come up with methods that work for their team's unique needs (the core of "a"gile vs. "Agile".) For instance, my team operates on a 2 week sprint cadence (a new sprint every two weeks.) Their team receives the team goals on a monthly basis. Rather than adopt 2-week or 4-week sprints, they have chosen to break up the monthly goals into smaller and more transparent 1-week goals. This provides them with earlier warning if things start to go sideways, and provides motivation to not let things stack up until the end of the month.

Also, their work does not fit easily into "delivering feature or product X by the end of the month." Rather, they are tasked with having certain regions of the country covered by certain baselines of recruitment effort (for partner and vendor recruiting.) So the "User Stories" on their backlog are phrased as "15% coverage of NJ/NY metro area" or "Decrease turnover 10% in southern VA." Their tasks become things like calling or emailing specific contacts, setting up conference calls and facilitating the transition of a contact into a partner.

The Scrum board also has little stickers (a frog for their direct manager and Venom for her manager) that they can attach to a specific task to signal that they need additional help with an item. I think this is a neat idea, because it means that their managers don't have to ask them many times a day if everything is going smoothly. A quick glance at the Scrum board gives them the same information.

A few sprints ago, my team decided to start sharing a few minutes about our agile practices at each Sprint Review. I'm really glad to see that not only were people not spacing out (which I had feared they might) but were actually absorbing the information and thinking about how it could apply to their own team. I'm very excited to see how this new process works for our recruitment team.

2011-07-06

Neo4jPHP beta released

I've been working on a PHP client for the Neo4j REST API for little while. I think it's ready for some real-life testing. The beta version is available here: https://github.com/jadell/Neo4jPHP/tarball/0.0.1-beta

Features:

Developed against the Neo4j 1.4 milestone releases
Simple, object-oriented API
Almost complete REST API coverage
Indexing of nodes and relationships, including exact match and query support
Cypher queries (thanks to Jacob Hansson)
Traversal support, including paged traversals
Lazy-loading of node and relationship data

Hopefully coming soon:

Client-side caching
Batch operations

There are some usage examples included.

It's a beta release, so please be gentle (on me, that is; be as rough as you want with the code.) If anyone finds any bugs or has feature requests, please use the GitHub issues page.

Update 2011-07-08:I would be remiss if I didn't mention the other PHP Neo4j REST client available at https://github.com/tchaffee/Neo4J-REST-PHP-API-client. It is built for the current stable release of Neo4j 1.3.

2011-06-27

Path finding with Neo4j

In my previous post I talked about graphing databases (Neo4j in particular) and how they can be applied to certain classes of problems where data may have multiple degrees of separation in their relationships.

The thing that makes graphing databases useful is the ability to find relationship paths from one node to another. There are many algorithms for finding paths efficiently, depending on the use case.

Consider a graph that represents a map of a town. Every node in the graph represents an intersection of two or more streets. Each direction of a street is represented as a different relationship: two-way streets will have two relationships between the same two nodes, one pointing from the first node to the second, and the other pointing from the second node to the first. One-way streets will have a single relationship. Each street relationship will also have a distance property. Here is what one such map might look like:

Every street is two-way, except for Market Alley and Park Drive. The numbers in parentheses represent the street distance in hundreds of yards (i. e. 3 = 300 yards.)

Here is the code to create this graph in the database, using the REST client I've been working on (connection initialization is not shown):

$intersections = array(
    "First & Main",
    "Second & Main",
    "Third & Main",
    "First & Oak",
    "Second & Oak",
    "Third & Market",
);

$streets = array(
    // start, end, [direction, distance, name]
    array(0, 1, array('direction'=>'east', 'distance'=>2, 'name'=>'Main St.')),
    array(0, 3, array('direction'=>'south', 'distance'=>2, 'name'=>'First Ave.')),
    array(0, 4, array('direction'=>'south', 'distance'=>3, 'name'=>'Park Dr.')),

    array(1, 0, array('direction'=>'west', 'distance'=>2, 'name'=>'Main St.')),
    array(1, 2, array('direction'=>'east', 'distance'=>2, 'name'=>'Main St.')),
    array(1, 4, array('direction'=>'south', 'distance'=>2, 'name'=>'Second Ave.')),

    array(2, 1, array('direction'=>'west', 'distance'=>2, 'name'=>'Main St.')),
    array(2, 5, array('direction'=>'south', 'distance'=>1, 'name'=>'Third Ave.')),

    array(3, 0, array('direction'=>'north', 'distance'=>2, 'name'=>'First Ave.')),
    array(3, 4, array('direction'=>'east', 'distance'=>2, 'name'=>'Oak St')),

    array(4, 3, array('direction'=>'west', 'distance'=>2, 'name'=>'Oak St.')),
    array(4, 1, array('direction'=>'north', 'distance'=>2, 'name'=>'Second Ave.')),

    array(5, 2, array('direction'=>'north', 'distance'=>1, 'name'=>'Third Ave.')),
    array(5, 4, array('direction'=>'west', 'distance'=>2, 'name'=>'Market Alley')),
);

$nodes = array();
$intersectionIndex = new Index($client, Index::TypeNode, 'intersections');
foreach ($intersections as $intersection) {
    $node = new Node($client);
    $node->setProperty('name', $intersection)->save();
    $intersectionIndex->add($node, 'name', $node->getProperty('name'));
    $nodes[] = $node;
}

foreach ($streets as $street) {
    $start = $nodes[$street[0]];
    $end = $nodes[$street[1]];
    $properties = $street[2];
    $start->relateTo($end, 'CONNECTS')
        ->setProperties($properties)
        ->save();
}

First, we set up a list of the intersections on the map, and also a list of the streets that connect those intersections. Each street will be a relationship that has data about the streets direction, distance and name.

Then, we create each intersection node. Each node is also added to an index. Indexes allow stored nodes to be found quickly. Nodes and relationships can be indexed, and indexing can occur on any node or relationship field. You can even index on fields that don't exist on the node or relationship. We'll use the index later to find the start and end points of our path by name.

Finally, we create a relationship for every street. Relationships can have arbitrary data attached to them, just like nodes.

Now we create way to find and display driving directions from any intersection to any other intersection:

$turns = array(
    'east' => array('north' => 'left','south' => 'right'),
    'west' => array('north' => 'right','south' => 'left'),
    'north' => array('east' => 'right','west' => 'left'),
    'south' => array('east' => 'left','west' => 'right'),
);

$fromNode = $intersectionIndex->findOne("Second & Oak");
$toNode = $intersectionIndex->findOne("Third & Main");

$paths = $fromNode->findPathsTo($toNode, 'CONNECTS', Relationship::DirectionOut)
    ->setMaxDepth(5)
    ->getPaths();

foreach ($paths as $i => $path) {
    $path->setContext(Path::ContextRelationship);
    $prevDirection = null;
    $totalDistance = 0;

    echo "Path " . ($i+1) .":\n";
    foreach ($path as $j => $rel) {
        $direction = $rel->getProperty('direction');
        $distance = $rel->getProperty('distance') * 100;
        $name = $rel->getProperty('name');

        if (!$prevDirection) {
            $action = 'Head';
        } else if ($prevDirection == $direction) {
            $action = 'Continue';
        } else {
            $turn = $turns[$prevDirection][$direction];
            $action = "Turn $turn, and continue";
        }
        $prevDirection = $direction;
        $step = $j+1;
        $totalDistance += $distance;

        echo "\t{$step}: {$action} {$direction} on {$name} for {$distance} yards.\n";
    }
    echo "\tTravel distance: {$totalDistance} yards\n\n";
}

Note that in the `findPathsTo` call, we specify that we want only outgoing relationships. This is how we prevent getting directions that may send us the wrong way down a one-way street. We also set a max depth, since the default search depth is 1. Also, note the `setContext` call which tells the path that we want the `foreach` to loop through the relationships, not the nodes. Running this produces the following directions:

Path 1:
    1: Head north on Second Ave. for 200 yards.
    2: Turn left, and continue west on Main St. for 200 yards.
    Travel distance: 400 yards

We only got back one path because, by default, the path finding algorithm finds the shortest path, meaning the path with the least number of relationships. If we switch the to and from intersections, we can get directions back to our starting point:

Path 1:
    1: Head west on Main St. for 200 yards.
    2: Turn left, and continue south on Second Ave. for 200 yards.
    Travel distance: 400 yards

Path 2:
    1: Head south on Third Ave. for 100 yards.
    2: Turn right, and continue west on Market Alley for 200 yards.
    Travel distance: 300 yards

We got back two paths this time, but one of them is longer than the other. How can this be if we are using the shortest path algorithm? The answer is that shortest path refers only to the number of relationships in the path. We didn't tell our path finder to pay attention to the cost of one path over another. Fortunately, we can do that quite easily by telling the `findPathsTo` call to use a property of each relationship when determining which path is the least expensive:

$paths = $fromNode->findPathsTo($toNode, 'CONNECTS', Relationship::DirectionOut)
    ->setAlgorithm(PathFinder::AlgoDijkstra)
    ->setCostProperty('distance')
    ->setMaxDepth(5)
    ->getPaths();

We're telling the path finder to use Dijkstra's algorithm, an algorithm to find the lowest cost path, which is different from the shortest path. Since our map uses distance, the cost of a path is the total of all the distance properties of the relationships in that path. Running the script now gives us only one path, the path with the least distance:

Path 1:
    1: Head south on Third Ave. for 100 yards.
    2: Turn right, and continue west on Market Alley for 200 yards.
    Travel distance: 300 yards

If there were another path with the same total distance, it would also be displayed. Path cost ignores the number of relationships in the path. If there were a path with 3 relationships totaling 300 yards, and another with 1 relationship totaling 300 yards, both of them would be displayed.

Neo4j has several path finding algorithms built in. Some of them are:

shortest path - find paths with the fewest relationships
dijkstra - find paths with the lowest cost
simple path - find paths with no repeated nodes
all - find all paths between two nodes

Neo4j also comes with an expressive traversal API that allows you to create your own path finding algorithms for domain specific needs.

2011-06-15

Neo4j for PHP

Update 2011-09-14: I've put a lot of effort into Neo4jPHP since this post was written. It's pretty full-featured and covers almost 100% of the Neo4j REST interface. Anyone interested in playing with Neo4j from PHP should definitely check it out. I would love some feedback!

Lately, I've been playing around with the graph database Neo4j and its application to certain classes of problems. Graph databases are meant to solve problems in domains where data relationships can be multiple levels deep. For example, in a relational database, it's very easy to answer the question "Give me a list of all actors who have been in a movie with Kevin Bacon":

> desc roles;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| actor_name  | varchar(100) | YES  |     | NULL    |                |
| role_name   | varchar(100) | YES  |     | NULL    |                |
| movie_title | varchar(100) | YES  |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

> SELECT actor_name, role_name FROM roles WHERE movie_title IN (SELECT DISTINCT movie_title FROM roles WHERE actor_name='Kevin Bacon')

Excuse my use of sub-queries (re-write it to a JOIN in your head if you wish.)

But suppose you want to get the names of all the actors who have been in a movie with someone who has been in a movie with Kevin Bacon. Suddenly, you have yet another JOIN against the same table. Now add a third degree: someone who has been in a movie with someone who has been in a movie with someone who has been in a movie with Kevin Bacon. As you continue to add degrees, the query becomes increasingly unwieldy, harder to maintain, and less performant.

This is precisely the type of problem graph databases are meant to solve: finding paths between pieces of data that may be one or more relationships removed from each other. They solve it very elegantly by modeling domain objects as graph nodes and edges (relationships in graph db parlance) and then traversing the graph using well-known and efficient algorithms.

The above example can be very easily modeled this way: every actor is a node, every movie is a node, and every role is a relationship going from the actor to the movie they were in:

Now it becomes very easy to find a path from a given actor to Kevin Bacon.

Neo4j is an open source graph database, with both community and enterprise licensing structures. It supports transactions and can handle billions of nodes and relationships in a single instance. It was originally built to be embedded in Java applications, and most of the documentation and examples are evidence of that. Unfortunately, there is no native PHP wrapper for talking to Neo4j.

Luckily, Neo4j also has a built-in REST server and PHP is very good at consuming REST services. There's already a good Neo4j REST PHP library out there, but I decided to write my own to get a better understanding of how the REST interface actually works. You can grab it here and all the code examples below are written using it. The concepts can easily be ported to any Neo4j REST client.

First, we need to initialize a connection to the database. Since this is a REST interface, there is no persistent connection, and in fact, no data communication happens until the first time we need to read or write data:

use Everyman\Neo4j\Client,
    Everyman\Neo4j\Transport,
    Everyman\Neo4j\Node,
    Everyman\Neo4j\Relationship;

$client = new Client(new Transport('localhost', 7474));

Now we need to create a node for each of our actors and movies. This is analogous to the INSERT statements used to enter data into a traditional RDBMS:

$keanu = new Node($client);
$keanu->setProperty('name', 'Keanu Reeves')->save();
$laurence = new Node($client);
$laurence->setProperty('name', 'Laurence Fishburne')->save();
$jennifer = new Node($client);
$jennifer->setProperty('name', 'Jennifer Connelly')->save();
$kevin = new Node($client);
$kevin->setProperty('name', 'Kevin Bacon')->save();

$matrix = new Node($client);
$matrix->setProperty('title', 'The Matrix')->save();
$higherLearning = new Node($client);
$higherLearning->setProperty('title', 'Higher Learning')->save();
$mysticRiver = new Node($client);
$mysticRiver->setProperty('title', 'Mystic River')->save();

Each node has `setProperty` and `getProperty` methods that allow storing arbitrary data on the node. No server communication happens until the `save()` call, which must be called for each node.

Linking an actor to a movie means setting up a relationship between them. In RDBMS terms, the relationship takes the place of a join table or a foreign key column. In the example, the relationship always starts with the actor pointing to the movie, and is tagged as an "acted in" type relationship:

$keanu->relateTo($matrix, 'IN')->save();
$laurence->relateTo($matrix, 'IN')->save();

$laurence->relateTo($higherLearning, 'IN')->save();
$jennifer->relateTo($higherLearning, 'IN')->save();

$laurence->relateTo($mysticRiver, 'IN')->save();
$kevin->relateTo($mysticRiver, 'IN')->save();

The `relateTo` call returns a Relationship object, which is like a node in that it can have arbitrary properties stored on it. Each relationship is also saved to the database.

The direction of the relationship is totally arbitrary; paths can be found regardless of which direction a relationship points. You can use whichever semantics make sense for your problem domain. In the example above, it makes sense that an actor is "in" a movie, but it could just as easily be phrased that a movie "has" an actor. The same two nodes can have multiple relationships to each other, with different directions, types and properties.

The relationships are all set up, and now we are ready to find links between any actor in our system and Kevin Bacon. Note that the maximum length of the path is going to be 12 (6 degrees of separation multiplied by 2 nodes for each degree; an actor node and a movie node.)

$path = $keanu->findPathsTo($kevin)
    ->setMaxDepth(12)
    ->getSinglePath();

foreach ($path as $i => $node) {
    if ($i % 2 == 0) {
        echo $node->getProperty('name');
        if ($i+1 != count($path)) {
            echo " was in\n";
        }
    } else {
        echo "\t" . $node->getProperty('title') . " with\n";
    }
}

A path is an ordered array of nodes. The nodes alternate between being actor and movie nodes.

You can also do the typical "find all the related movies" type queries:

echo $laurence->getProperty('name') . " was in:\n";
$relationships = $laurence->getRelationships('IN');
foreach ($relationships as $relationship) {
    $movie = $relationship->getEndNode();
    echo "\t" . $movie->getProperty('title') . "\n";
}

`getRelationships` returns an array of all relationships that a node has, optionally limiting to only relationships of a given type. It is also possible to specify only incoming or outgoing relationships. We've set up our data so that all 'IN' relationships are from an actor to a movie, so we know that the end node of any 'IN' relationship is a movie.

There is more available in the REST interface, including node and relationship indexing, querying, and traversal (which allows more complicated path finding behaviors.) Transaction/batch operation support over REST is marked "experimental" for now. I'm hoping to add wrappers for more of this functionality soon. I'll also be posting more on the different types of problems that can be solved very elegantly with graphing databases.

The next post dives into a bit more detail about path-finding and some of the different algorithms available.