2011-06-15

Neo4j for PHP

Update 2011-09-14: I've put a lot of effort into Neo4jPHP since this post was written. It's pretty full-featured and covers almost 100% of the Neo4j REST interface. Anyone interested in playing with Neo4j from PHP should definitely check it out. I would love some feedback!

Lately, I've been playing around with the graph database Neo4j and its application to certain classes of problems. Graph databases are meant to solve problems in domains where data relationships can be multiple levels deep. For example, in a relational database, it's very easy to answer the question "Give me a list of all actors who have been in a movie with Kevin Bacon":
> desc roles;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| actor_name  | varchar(100) | YES  |     | NULL    |                |
| role_name   | varchar(100) | YES  |     | NULL    |                |
| movie_title | varchar(100) | YES  |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

> SELECT actor_name, role_name FROM roles WHERE movie_title IN (SELECT DISTINCT movie_title FROM roles WHERE actor_name='Kevin Bacon')
Excuse my use of sub-queries (rewrite it as a JOIN in your head if you wish).

But suppose you want to get the names of all the actors who have been in a movie with someone who has been in a movie with Kevin Bacon. Suddenly, you have yet another JOIN against the same table. Now add a third degree: someone who has been in a movie with someone who has been in a movie with someone who has been in a movie with Kevin Bacon. As you continue to add degrees, the query becomes increasingly unwieldy, harder to maintain, and less performant.
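To make that growth concrete, here is a minimal sketch that generates the self-join query for a given number of degrees against the `roles` table shown earlier. The function name `buildDegreeQuery` and the `:actor` placeholder are illustrative assumptions, not part of any library. The generated query returns everyone within the requested number of degrees, including the starting actor.

```php
<?php
// Hypothetical helper (not from the original post): builds the self-join
// query for actors within $degrees degrees of a given actor, using the
// `roles` table described above.
function buildDegreeQuery(int $degrees): string
{
    if ($degrees < 1) {
        throw new InvalidArgumentException('degrees must be at least 1');
    }

    $lastAlias = 2 * $degrees - 1;
    $sql = "SELECT DISTINCT r{$lastAlias}.actor_name\nFROM roles r0\n";

    for ($i = 1; $i <= $lastAlias; $i++) {
        // Odd joins hop from a role to the other roles in the same movie;
        // even joins hop from a co-star's role to that co-star's other roles.
        $column = ($i % 2 === 1) ? 'movie_title' : 'actor_name';
        $prev = $i - 1;
        $sql .= "JOIN roles r{$i} ON r{$i}.{$column} = r{$prev}.{$column}\n";
    }

    // :actor is a bound parameter, e.g. 'Kevin Bacon'
    return $sql . "WHERE r0.actor_name = :actor";
}

foreach ([1, 2, 3] as $degrees) {
    $query = buildDegreeQuery($degrees);
    echo "-- {$degrees} degree(s): " . substr_count($query, 'JOIN') . " self-joins\n";
    echo $query . "\n\n";
}
```

Each additional degree adds two more joins against the same table (1, 3, 5, and so on), which is where the maintenance and performance problems come from.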

This is precisely the type of problem graph databases are meant to solve: finding paths between pieces of data that may be one or more relationships removed from each other. They solve it very elegantly by modeling domain objects as graph nodes and edges (relationships in graph db parlance) and then traversing the graph using well-known and efficient algorithms.

The above example can be very easily modeled this way: every actor is a node, every movie is a node, and every role is a relationship going from the actor to the movie they were in.
With that model in place, it becomes very easy to find a path from a given actor to Kevin Bacon.
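As a rough illustration of this model, the sketch below keeps a handful of actors and movies in a plain PHP array and finds a path with a breadth-first search. This is only an in-memory toy under my own assumptions (the `shortestPath` helper and the sample data are illustrative), not how Neo4j stores or traverses data.

```php
<?php
// Illustrative only: a tiny in-memory graph, not how Neo4j stores data.
// Each pair is one role: [actor, movie].
$roles = [
    ['Keanu Reeves', 'The Matrix'],
    ['Laurence Fishburne', 'The Matrix'],
    ['Laurence Fishburne', 'Higher Learning'],
    ['Jennifer Connelly', 'Higher Learning'],
    ['Laurence Fishburne', 'Mystic River'],
    ['Kevin Bacon', 'Mystic River'],
];

// Build an adjacency list, following each role in both directions.
$adjacent = [];
foreach ($roles as [$actor, $movie]) {
    $adjacent[$actor][] = $movie;
    $adjacent[$movie][] = $actor;
}

// Hypothetical helper: breadth-first search returning the shortest list of
// nodes from $start to $goal, or null when no path exists.
function shortestPath(array $adjacent, string $start, string $goal): ?array
{
    $cameFrom = [$start => null];
    $queue = new SplQueue();
    $queue->enqueue($start);

    while (!$queue->isEmpty()) {
        $current = $queue->dequeue();
        if ($current === $goal) {
            // Walk the predecessor links back to the start.
            $path = [];
            for ($node = $goal; $node !== null; $node = $cameFrom[$node]) {
                array_unshift($path, $node);
            }
            return $path;
        }
        foreach ($adjacent[$current] ?? [] as $next) {
            if (!array_key_exists($next, $cameFrom)) {
                $cameFrom[$next] = $current;
                $queue->enqueue($next);
            }
        }
    }

    return null;
}

echo implode(' -> ', shortestPath($adjacent, 'Keanu Reeves', 'Kevin Bacon')), "\n";
// Keanu Reeves -> The Matrix -> Laurence Fishburne -> Mystic River -> Kevin Bacon
```

Adding another degree of separation here means the search runs a little longer; the code does not change, unlike the SQL version.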

Neo4j is an open source graph database, with both community and enterprise licensing structures. It supports transactions and can handle billions of nodes and relationships in a single instance. It was originally built to be embedded in Java applications, and most of the documentation and examples are evidence of that. Unfortunately, there is no native PHP wrapper for talking to Neo4j.

Luckily, Neo4j also has a built-in REST server and PHP is very good at consuming REST services. There's already a good Neo4j REST PHP library out there, but I decided to write my own to get a better understanding of how the REST interface actually works. You can grab it here and all the code examples below are written using it. The concepts can easily be ported to any Neo4j REST client.

First, we need to initialize a connection to the database. Since this is a REST interface, there is no persistent connection, and in fact, no data communication happens until the first time we need to read or write data:
use Everyman\Neo4j\Client,
    Everyman\Neo4j\Transport,
    Everyman\Neo4j\Node,
    Everyman\Neo4j\Relationship;

$client = new Client(new Transport('localhost', 7474));
Now we need to create a node for each of our actors and movies. This is analogous to the INSERT statements used to enter data into a traditional RDBMS:
$keanu = new Node($client);
$keanu->setProperty('name', 'Keanu Reeves')->save();
$laurence = new Node($client);
$laurence->setProperty('name', 'Laurence Fishburne')->save();
$jennifer = new Node($client);
$jennifer->setProperty('name', 'Jennifer Connelly')->save();
$kevin = new Node($client);
$kevin->setProperty('name', 'Kevin Bacon')->save();

$matrix = new Node($client);
$matrix->setProperty('title', 'The Matrix')->save();
$higherLearning = new Node($client);
$higherLearning->setProperty('title', 'Higher Learning')->save();
$mysticRiver = new Node($client);
$mysticRiver->setProperty('title', 'Mystic River')->save();
Each node has `setProperty` and `getProperty` methods that allow storing arbitrary data on the node. No server communication happens until the `save()` call, which must be called for each node.

Linking an actor to a movie means setting up a relationship between them. In RDBMS terms, the relationship takes the place of a join table or a foreign key column. In the example, the relationship always starts with the actor pointing to the movie, and is tagged with the relationship type 'IN' (as in "acted in"):
$keanu->relateTo($matrix, 'IN')->save();
$laurence->relateTo($matrix, 'IN')->save();

$laurence->relateTo($higherLearning, 'IN')->save();
$jennifer->relateTo($higherLearning, 'IN')->save();

$laurence->relateTo($mysticRiver, 'IN')->save();
$kevin->relateTo($mysticRiver, 'IN')->save();
The `relateTo` call returns a Relationship object, which is like a node in that it can have arbitrary properties stored on it. Each relationship is also saved to the database.

The direction of the relationship is totally arbitrary; paths can be found regardless of which direction a relationship points. You can use whichever semantics make sense for your problem domain. In the example above, it makes sense that an actor is "in" a movie, but it could just as easily be phrased that a movie "has" an actor. The same two nodes can have multiple relationships to each other, with different directions, types and properties.

The relationships are all set up, and now we are ready to find links between any actor in our system and Kevin Bacon. Note that the maximum length of the path is going to be 12 (6 degrees of separation multiplied by 2 nodes for each degree: an actor node and a movie node).
$path = $keanu->findPathsTo($kevin)
    ->setMaxDepth(12)
    ->getSinglePath();

foreach ($path as $i => $node) {
    if ($i % 2 == 0) {
        echo $node->getProperty('name');
        if ($i+1 != count($path)) {
            echo " was in\n";
        }
    } else {
        echo "\t" . $node->getProperty('title') . " with\n";
    }
}
A path is an ordered array of nodes. The nodes alternate between being actor and movie nodes.
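Because the nodes alternate, the degrees of separation can be read straight off the path length: each degree spans two relationships (actor to movie, then movie to the next actor). The sketch below shows that arithmetic on a path written out as plain strings rather than Node objects; `degreesOfSeparation` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper: converts an alternating actor/movie path (given here
// as plain strings rather than Node objects) into degrees of separation.
function degreesOfSeparation(array $path): int
{
    $nodeCount = count($path);
    if ($nodeCount < 1 || $nodeCount % 2 === 0) {
        throw new InvalidArgumentException('path must start and end on an actor');
    }

    // N nodes are joined by N - 1 relationships, and each degree uses two.
    return intdiv($nodeCount - 1, 2);
}

$path = ['Keanu Reeves', 'The Matrix', 'Laurence Fishburne', 'Mystic River', 'Kevin Bacon'];
echo degreesOfSeparation($path), "\n"; // prints 2
```

By the same arithmetic, six degrees corresponds to 12 relationships, which is why the search above is capped at a depth of 12.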

You can also do the typical "find all the related movies" type queries:
echo $laurence->getProperty('name') . " was in:\n";
$relationships = $laurence->getRelationships('IN');
foreach ($relationships as $relationship) {
    $movie = $relationship->getEndNode();
    echo "\t" . $movie->getProperty('title') . "\n";
}
`getRelationships` returns an array of all relationships that a node has, optionally limiting to only relationships of a given type. It is also possible to specify only incoming or outgoing relationships. We've set up our data so that all 'IN' relationships are from an actor to a movie, so we know that the end node of any 'IN' relationship is a movie.

There is more available in the REST interface, including node and relationship indexing, querying, and traversal (which allows more complicated path-finding behaviors). Transaction/batch operation support over REST is marked "experimental" for now. I'm hoping to add wrappers for more of this functionality soon. I'll also be posting more on the different types of problems that can be solved very elegantly with graph databases.

The next post dives into a bit more detail about path-finding and some of the different algorithms available.

31 comments:

  1. Very interesting!

    Never heard of this kind of database.
    Sounds like it's worth a try.

    Did you check the performance of the PHP connection to the database itself?
    It might be much slower than libraries that are implemented as extensions in C...

  2. @Shay, I didn't do much (any) performance testing. I was more trying to get things working so I could play with the database. At some point it may make sense for someone to make a C extension of the REST client since there are no C client libraries for Neo4j, only Java.

  3. A quick, nice and easy post to understanding Neo4j. :) Thanks!

  4. hey, can you explain the difference between neo4j and hadoop?

  5. I don't know much about Hadoop, so rather than give you information that may possibly be incorrect, here is Peter Neubauer (VP of Product Management at Neo4j) explaining the difference:

    http://lists.neo4j.org/pipermail/user/2009-April/001118.html

    If you have more questions about Neo4j in general, the Neo4j forums are a great source of information: http://neo4j.org/forums/

  6. good and short trip into the graph db world! will give it a try

  7. Hi, it's a nice and very readable piece of code you wrote (and are still writing) there. Why do you think it's still beta? It seems quite complete.

  8. How do I delete a relationship between two nodes?

  9. How do I change the node id?

  10. You can't and you shouldn't. The node id is used internally by Neo4j. If your application needs to have ids that can be altered, you should assign your own ids to an indexed node property, and query using the index.

  11. Thanks Josh.

    Here's one more question.
    I am creating a node with a property "tags" whose value is an array of strings, and I want to index "tags". I tried doing that using this code:
    $tagIndex->add($node, "tags", $node->getProperty("tags"));
    but the string values of tags are not indexed separately.

    I am running a Cypher query like:
    START root=node:tags_index(tags='Yoga')
    RETURN root.name, root.id, root.tags

    but I am getting 0 (zero) results.

    Looking at this URL, http://grokbase.com/t/gg/neo4j/12abdzyeyb/how-are-properties-with-array-values-indexed, it says array values are saved as individual values under the same key.

    But this is not happening with the PHP Everyman client. Is there some issue with the Everyman library?

  12. How can I import data from a MySQL db into Neo4j?

  13. You can write a script for that ;)

  14. Hi Josh,

    I have been following Neo4j for the past year and am quite impressed with the whole idea of a graph database. We have been developing a product using Neo4j with Node.js for quite some time now. Hopefully, we will get it live soon.

    Personally, I am working on another pet project (a simple mobile app). For this one I wanted to use Neo4j with PHP (to create APIs for my app), as I am more comfortable programming in PHP than in Node.js (which my fellow developers prefer). I started experimenting with the Neo4jPHP API a couple of days back, and I have to admit that you have done a great job. The whole API is very easy to understand (especially since I have worked with the core Java library of Neo4j before, and your API is structured in a very object-oriented manner, like the Java library).

    Thanks a lot for creating Neo4jPHP, and more thanks for providing great guidelines for using Neo4jPHP (including the API documentation).

    Will certainly let you know when either of the two projects goes live.

    Regards

    Nickle

  15. Hi Josh,

    How do I ensure the uniqueness of a node?

    For example, I have the following CSV file:

    friend1, friend2
    friend1, friend3

    I read this file line by line and establish relationships. From the second line, it creates friend1 as another node.

    How do I fix it? I would really appreciate any help.

    regards,
    Burhan

  16. Hi Josh,

    How can we use the nodes we create in one script in subsequent PHP scripts?

    Do we have to store the address of one of the nodes in an external file, or is there another, more efficient method?

    regards,
    Virat Kapoor

  17. hey there friends,
    I need to know how to pass a value taken from an HTML form as a parameter to the setProperty function.
    I tried to extract $_POST and assign the created form variable, but it doesn't work. Please help.

  18. Hi Nickle,

    Let me know how to retrieve nodes (index name) in the webpage that I created using Neo4jPHP.

    Thanks in advance.

  19. Hi Josh,

    Thanks for the Neo4j PHP wrapper. I have been writing tests to import our current database into graph db format using your wrapper. It works great. But I have a question. I could create all the nodes, indexes and relationships between them, but I can't understand how to prevent relationship duplication. I could iterate through all the relationships and check them one by one, but it's not a good solution.

    I am trying to use the relationship index, but I can't find any resource on how to use it. How can I add a relationship to the relationship index, and how do I query it? That would be great and would prevent me from duplicating relationships.

    Thanks again.

  20. Documentation for creating and using indexes, including relationship indexes, is available here: https://github.com/jadell/neo4jphp/wiki/Indexes

    Also, if you are using Neo4j 2.0, you might want to look at the documentation on the Cypher query language's "CREATE UNIQUE" syntax. It will only create a relationship if one does not already exist. http://docs.neo4j.org/chunked/stable/query-create-unique.html

    Hope that helps.

  21. Thanks for the pointer, Josh. Now I am getting segmentation faults when including the neo4jphp.phar file. I think it has to do with the PHP version change, but I'm not sure. Still trying to get it to work again.

  22. Try installing the library with Composer. The PHAR file is highly out of date.

  23. I just love this plugin. You did a wonderful job. I was about to work on java api to connect with neo4j and then make calls to java api. you just saved me bro:) , Hats off to you

  24. Thank you! Got everything working with Composer, with Neo4j 2.0. Your PHP wrapper is working great, but I have one question. Currently I am trying to import 20 million rows from a SQL database, but it's too slow using create nodes & relations.

    Are the batch operations of the REST API exposed in the PHP wrapper? That would be great for writing db import operations.

  25. https://github.com/jadell/neo4jphp/wiki/Batches

  26. Hi Josh, the batch operations are working fine, but I get a memory limit error for larger batches, and also once in a while without batch operations. I tried increasing the PHP memory_limit, but for some reason it is not reflected in the error message, and the memory error is always the same.

    Do you know if this is a PHP-specific memory error? Thanks.

    Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 298112365 bytes) in /Applications/MAMP/htdocs/queremos/vendor/everyman/neo4jphp/lib/Everyman/Neo4j/Transport/Curl.php on line 101

  27. In Neo4jPHP, does creating a relationship using "relateTo" make "me-[:knows]->you" the same as "you-[:knows]->me", or does "relateTo" consider the position of the nodes?

    That is, in the sense of social connections, there's a Twitter model and a Facebook model, AKA one-way relationship vs. mutual relationship.

  28. `relateTo` considers the position of the nodes. `$me->relateTo($you, 'knows')` creates a one-way relationship from the $me node to the $you node.

  29. Thanks, Josh.

    By the way, OUTSTANDING work on this driver.

    I'm using it with the Slim framework and mixing in the Neo4j Spatial server plugin. I'll follow up via GitHub if there's anything to contribute on that front.

  30. Kanna!

    Brilliant work, very very intuitive.

    May guruvayorappan be with you!
