2011-12-25

PHP Fog Quickstart

For a while, I've been trying to find a good hosting service to use for testing out PHP apps and ideas I have. My goal was to find a service that would be like Heroku for PHP apps: simple to get started, pre-configured with access to a database, scalable on demand, and no-hassle deployment (able to integrate with continuous deployment.)

Funnily enough, if you do a Google search for "heroku php" the first thing that comes up is PHP Fog. They provide hosting for up to 3 projects for free on shared hosting. They also offer MongoDB and application monitoring as add-ons.

Being it was Christmas and all, I decided to give myself a present and sign up. I was very surprised by how easy it was to get up and running. I managed to build a simple "echo" service in about 8 minutes, following roughly these steps:

After logging in, click "Launch New App". This brings you to a screen where you can choose from several pre-built applications (popular CMS or blogging platforms) or skeleton projects built on the most popular PHP frameworks.

For my purposes, I chose "Custom App". This will build an application containing exactly one "Hello world" PHP file. Enter a password for the MySQL database that will be set up, and a subdomain. The subdomain must be unique among all PHP Fog apps, and the setup won't continue until it is. After your app is set up, it can be given a custom domain name from $5 per month.

Click "Create App". This will bring you to your project's management page. From here, you can find everything you need to start developing your app. The git repository address for the app is displayed, as are the database credentials, helpfully wrapped in a mysql_connect function call.

Eventually, the "Status" light will turn green, indicating that your app is available. Clicking "View Live Site" brings you to your application, live to the world. Not much there. Time to change that.

Follow the instructions to add an SSH key to PHP Fog, which will give you access to write to your application's git repo. Then use the given git address to clone the repo to your development environment. Also, grab the Silex microframework to start building the app:
> git clone git@git01.phpfog.com:myapp.phpfogapp
> cd myapp.phpfogapp
> wget http://silex.sensiolabs.org/get/silex.phar
Edit index.php to provide a simple endpoint which will greet users:
<?php
require_once __DIR__.'/silex.phar';

$app = new Silex\Application();

$app->get('/', function() use($app) {
    return 'Checkout <a href="/hello/world">Hello...</a>';
});

$app->get('/hello/{name}', function($name) use($app) {
    return 'Hello '.$app->escape($name);
});

$app->run();
Also, create an .htaccess file that will route all requests for non-existent files to index.php:
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^ index.php [L]
Add the files to the repository, commit them, then push the master branch to make the changes live:
> git add index.php .htaccess silex.phar
> git commit
> git push origin master
Reload your site to see the changes. Follow the "Hello" link, and play with the URL to see the results.

After the holidays, I hope to actually build something interesting on the platform.

2011-12-17

Decode JSON in Bash with PHP

I recently found myself needing to make cUrl calls from the command-line to an endpoint which returned JSON responses. Rather than parsing through the JSON as a string, or downloading some third party tool to format it for me, I created this handy Bash alias that decodes JSON from the command-line using PHP.

Put the following in your ~/.bashrc file:
alias json-decode="php -r 'print_r(json_decode(file_get_contents(\"php://stdin\")));'"

Usage from any process:
echo '{"foo":"bar","baz":"qux"}' | json-decode

Or from a cUrl call:
curl -s http://example.com/json/endpoint | json-decode
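A small variation I find handy (the alias name here is my own invention): passing `true` as the second argument to `json_decode` gives you associative arrays instead of stdClass objects, which are often easier to scan:

```shell
# Decode to associative arrays instead of stdClass objects by passing
# true as the second argument to json_decode. Goes in ~/.bashrc.
alias json-decode-assoc="php -r 'print_r(json_decode(file_get_contents(\"php://stdin\"), true));'"
```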

2011-12-08

Codeworks '11 Raleigh Rundown

I had a great time at CodeWorks 2011 in Raleigh this week, put on by the great folks at php|architect. Here is a rundown of the presentations, with some of my thoughts.

CI:IRL

Beth Tucker Long started out the day with a talk about continuous integration tools for automating away the pain of building and deploying PHP projects. One of her main themes was the importance of standardization in your code to avoid wasting mental effort on things like variable and class naming, which lines to put curly braces on, or how many spaces are in a single indent (editor: the answer is none; use a tab character.) The same can be said about manual testing and deployment processes.

The tools discussed included PHP_CodeSniffer, the venerable and nigh ubiquitous PHPUnit, and multiple options for automated builds and testing (CruiseControl and Jenkins/Hudson.) Beth also talked about Phing, which is one of my new favorite tools. She also covered tools for automated documentation and report generation. I would have gone with DocBlox instead of phpDocumentor, but that's just a personal preference. The most useful thing about most of these tools is that many of them are aware of each other, directly or through plugins, and can be chained together to build a flexible CI pipeline.

Most of the tools I had already heard of or am actively using. In addition to PHP_CodeSniffer for coding standards compliance, I would also throw in one or both of PHP Mess Detector and PHP Copy-Paste Detector. These three tools together give great insights into the current state of your code base and its future maintainability, which Beth mentioned when discussing "technical debt".

An important thing that Beth stressed at the end of her presentation was that not every project requires the use of every tool. It's important to take into account things like size/complexity of the project, timeline, budgets and team size. I think this was a valuable piece of advice that a lot of people forget. If you're anything like me, when you first learn about a new tool your response is to go out and find as many ways to use it as possible. I'm glad Beth ended with the reminder to keep in mind the constraints of your project when picking which tools to apply.

How Beer Made Me A Better Developer

Next up was Jason Austin, who is one of the coordinators of my local PHP meetup group. Jason's talk was a call for all developers to find their passion. His two passions are beer and PHP, so he decided to combine them into a side project, and Pint Labs was born. His argument: all developers should find a side project related to their passion. It's something he looks for as a hiring manager, and side projects also help prevent burnout in your day job.

Side projects give developers an opportunity to experiment without the consequences of failure that usually accompany employed or contract work. More than that, they allow for greater opportunities to bring new knowledge to a paid job. "The dumbest thing a manager can do is stop their developers from working on their own side projects," was one of my favorite lines from the presentation.

I can't agree more, and I'm lucky to have never worked in an environment where an employer has either forbidden me from having side projects, or claimed that anything I work on in my own time belongs to the company. I learned at lunch that many people aren't so fortunate. I would be very wary of any employment where that was the case.

As for side projects, even if you don't start your own, there are tons of open source projects out there that would love more hands on keyboards helping out. There is no excuse for not getting involved, even to a small degree.

What's New in PHP 5.4

Of all the talks, Cal Evans's about the new features coming in PHP 5.4 was probably the least applicable to my current work, but was one of my favorites. Cal is an engaging speaker, and there is some really cool stuff coming up in PHP.

The bits I was most interested in were improvements in Closures, Traits, and the built-in HTTP server. For those who don't know, in PHP 5.4, closures created in an object method will be able to reference $this from within the closure. That may not seem like a big deal, but if you're working with classes that build themselves or are factories for other objects, having the ability to reference the creating object is incredibly useful. The closure can access protected and private stuff from the instantiated class.

The caveat is that $this will always reference the object in which the closure was created. It would be great to be able to bind $this to whatever object was invoking the closure, like you can with Javascript's .call() and .apply() functions.
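To make the $this binding concrete, here's a small sketch (the class and method names are my own, not from the talk):

```php
<?php
class Greeter
{
    private $greeting = 'Hello';

    public function makeGreeter()
    {
        // In PHP 5.4, $this inside this closure is automatically bound to
        // the Greeter instance that created it, private members included.
        return function ($name) {
            return $this->greeting . ', ' . $name;
        };
    }
}

$greeter = new Greeter();
$greet = $greeter->makeGreeter();
echo $greet('World'); // Hello, World
```

For what it's worth, 5.4 also ships Closure::bindTo(), which returns a copy of a closure with $this (and optionally the scope used for private/protected access) rebound to another object; rebinding before invoking gets you close to what JavaScript's .call() and .apply() offer.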

Lots has been written about Traits. Of all the features in PHP 5.4, this is the one I'm most looking forward to. They will solve a lot of problems that currently can only be solved by having deeply nested class hierarchies. Combining traits with the ability to bind closures to an object paves the way for having completely dynamically created classes, which would be useful in many situations. It also gives more evidence for my theory that every programming language evolves until it can replicate Javascript.
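As a quick illustration (my own toy example, not from the talk), a trait lets two unrelated classes share an implementation without a common parent class:

```php
<?php
// PHP 5.4 traits: horizontal code reuse without a class hierarchy.
trait Sluggable
{
    public function slug()
    {
        return strtolower(str_replace(' ', '-', $this->title));
    }
}

class Article
{
    use Sluggable;
    public $title = 'Hello World';
}

class Page
{
    use Sluggable;
    public $title = 'About Us';
}

$article = new Article();
$page = new Page();
echo $article->slug(); // hello-world
echo $page->slug();    // about-us
```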

One other thing that Cal mentioned was the new JsonSerializable interface. This is a great addition for anyone who deals with sending JSON responses in PHP. I only have one problem with it: there is no complementary JsonUnserializable interface that would allow an object to populate its properties from a JSON string. I don't watch the PHP internals list, so I don't know why it was decided not to include such obvious functionality. Does anyone have any insights?
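Here's roughly how the interface works (my own toy example): any object implementing it gets full control over its json_encode() representation.

```php
<?php
// PHP 5.4's JsonSerializable: json_encode() calls jsonSerialize()
// on any object implementing the interface and encodes its return value.
class Track implements JsonSerializable
{
    private $title;
    private $seconds;

    public function __construct($title, $seconds)
    {
        $this->title = $title;
        $this->seconds = $seconds;
    }

    public function jsonSerialize()
    {
        return array('title' => $this->title, 'length' => $this->seconds);
    }
}

echo json_encode(new Track('Intro', 42));
// {"title":"Intro","length":42}
```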

Refactoring and Other Small Animals

After lunch, we got right into the second half of the day with Marco Tabini and a presentation on refactoring. The presentation centered around a piece of "legacy" code. He wrote a test to assert the existing behavior, then proceeded to follow the refactoring steps of abstract, break (apart) and rename to show how the code could be made more maintainable and testable.

Marco's talk was to the point, with a great example, but it tilted more towards the theoretical side of why you should refactor. It also took a very purist look at what refactoring means (absolutely no changes to interface, even changes that would not affect calling code.) While I agree that this is the technical definition of refactoring, it's hard to start poking at a piece of code without wanting to improve its usage instead of just its implementation. I think there is a fuzzy line that developers cross often, at least mentally, between refactoring and improving the interface.

At the after-party, some of my coworkers had a longer discussion with Marco where he expounded on some of the ideas he presented and gave some more practical advice for our own legacy application. I wasn't there, but I do appreciate speakers and experts taking the time to apply their own knowledge and experience to their audience's issues. So, thank you Marco.

jQuery Mobile and PhoneGap

Terry Ryan was the representative from Adobe, one of the conference sponsors. His talk began with a quick demo of building an app with jQuery Mobile. I'll admit, I'm not a big fan of doing UI or front-end development, but even I could see how working with a library like that could make such development fun. It seems some of the kinks still need to be worked out, but it looks like a promising library and I hope to play with it some day soon.

Following that was an overview of PhoneGap, and more impressively, PhoneGap:Build, where Terry tweaked his application, ran it through the cloud building service, and had the new app deployed to his phone in just a few minutes, all live. Impressive stuff!

Then, Terry showed off some of the things Adobe has in the works with CSS shaders. The effects he demoed were neat to look at, but I don't feel like there was anything there that WebGL doesn't already cover.

For a finale, Terry live demoed Adobe's Proto application, which allows tablet users to create UI wireframes and mock-ups by drawing with their fingers. Imagine a virtual whiteboard that takes your scratchings and turns them into divs, tables, text, headers and image frames live while you draw. Now imagine sharing that view with remote users, also live. Very cool stuff.

REST Best Practices

Keith Casey's presentation was an informative look into the philosophy and reasoning behind REST architecture, and was another one of my favorites. He talked about the myth of "strictly RESTful" and how the vast majority of APIs out there don't even come close (hint #1: if you're passing an authentication token or session ID on every request, your API is not RESTful.) I also liked his use of "-ilities": discoverability, readability, flexibility, maintainability, etc.

He used Twilio as his example API, and showed how a good REST API should follow the 6 constraints of REST: client-server, stateless (where most authentication schemes fail at being REST), cacheable, layered, uniform interface, and code-on-demand. The last constraint is the hardest one for me to get my head around, so I appreciated his use of GMail as an example of providing not just data, but the code that manipulates that data. I think that JSONP might also fall into that category, but I didn't get a chance to ask.

Like with Marco's talk, I appreciated the theoretical nature of the topic, but I would have liked some more examples of real-world practicality trumping academic purity. It's always nice when a speaker acknowledges that sometimes you just need to get things done and points out which portions of a technology are more flexible than others and amenable to not following the letter of the law. Besides that, I thought it was a great look into why REST is the way it is.

After-Party

No development conference is complete without an after-party and no after-party is complete without free beer. I'd like to thank the folks at SugarCRM for sponsoring the after-party at Tir na nOg, one of my favorite pubs in downtown Raleigh.

Besides free booze, I love to talk with speakers and other conference attendees in a less formal setting. It's great to get to know other developers at the top of their craft, and the people in the PHP community are some of the friendliest and easiest going. It's one of the reasons I love being a PHP developer so much. Keith, his coworker at Twilio Devin Rader, a few other people and I had a great conversation that shifted topics basically at the speed of thought. John Mertic from SugarCRM was there and answered some of my questions, and Beth explained the php|architect submission process in a way that didn't sound scary at all.

To all the conference speakers, organizers and attendees, I want to say thank you for a fun and educational experience. Hopefully some of us will meet up again at php|tek 2012!

2011-11-05

Development Setup for Neo4j and PHP: Part 2

Update 2014-02-15: Using Neo4jPHP from a downloaded PHAR file has been deprecated. The preferred and supported way to install the library is via Composer. The examples below have been updated to reflect this. It has also been updated for the 2.0 release of Neo4j.

This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases. In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.

All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.

Also, I won't be using any specific PHP framework. The principles in the code below should apply equally to any framework you decide to use.

Create the Project and Testing Harness

Create a project in a directory on your web server. For this project, mine is "/var/www/neo4play" on my local host. We'll also need a Neo4j client library. I recommend Neo4jPHP (disclaimer: I'm the author) and all the code samples in Part 2 use it.
> cd /var/www/neo4play
> mkdir -p tests/unit
> mkdir lib
> echo '{"require":{"everyman/neo4jphp":"dev-master"}}' > composer.json
> composer install
> echo "<?php phpinfo(); ?>" > index.php
Test the setup by browsing to http://localhost/neo4play. You should see the output of `phpinfo`.

Now we'll create a bootstrap file that we can include to do project-wide and environment specific setup. Call this file "bootstrap.php" in the root project directory.
<?php
require_once(__DIR__.'/vendor/autoload.php');

error_reporting(-1);
ini_set('display_errors', 1);

if (!defined('APPLICATION_ENV')) {
    define('APPLICATION_ENV', 'development');
}

$host = 'localhost';
$port = (APPLICATION_ENV == 'development') ? 7474 : 7475;
$client = new Everyman\Neo4j\Client($host, $port);
The main point of this file at the moment is to differentiate between our development and testing environments, and set up our connection to the correct database. We do this by attaching the database client to the correct port based on an application constant.

We'll build on the bootstrap file to set up a separate, unit-testing-specific bootstrap. Create the following file as "tests/bootstrap-test.php":
<?php
define('APPLICATION_ENV', 'testing');
require_once(__DIR__.'/../bootstrap.php');

// Clean the database
$query = new Everyman\Neo4j\Cypher\Query($client, 'MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n, r');
$query->getResultSet();
The purpose of this file is to tell our application bootstrap that we are in the "testing" environment. Then it cleans out the database so that our tests run from a known state.

Tell PHPUnit to use our test bootstrap with the following config file, called "tests/phpunit.xml":
<phpunit colors="true" bootstrap="./bootstrap-test.php">
    <testsuite name="Neo4j Play Test Results">
        <directory>./unit</directory>
    </testsuite>
</phpunit>
And because we're following TDD, we'll create our first test file, "tests/unit/ActorTest.php":
<?php
class ActorTest extends PHPUnit_Framework_TestCase
{
    public function testCreateActorAndRetrieveByName()
    {
        $actor = new Actor();
        $actor->name = 'Test Guy '.rand();
        Actor::save($actor);

        $actorId = $actor->id;
        self::assertNotNull($actorId);

        $retrievedActor = Actor::getActorByName($actor->name);
        self::assertInstanceOf('Actor', $retrievedActor);
        self::assertEquals($actor->id, $retrievedActor->id);
        self::assertEquals($actor->name, $retrievedActor->name);
    }

    public function testActorDoesNotExist()
    {
        $retrievedActor = Actor::getActorByName('Test Guy '.rand());
        self::assertNull($retrievedActor);
    }
}
So we know we want a domain object called "Actor" (apparently we're building some sort of movie application) and that Actors have names and ids. We also know we want to be able to look up an Actor by their name. If we can't find the Actor by name, we should get a `null` value back.

Run the tests:
> cd tests
> phpunit
Excellent, our tests failed! If you've been playing along, they probably failed because the "Actor" class isn't defined. Our next step is to start creating our domain objects.

Defining the Application Domain

So far, we only have one domain object, and a test that asserts its behavior. In order to make the test pass, we'll need to connect to the database, persist entities to it, and then query it for those entities.

For persisting our entities, we'll need a way to get the client connection in our "Actor" class and any other domain object classes we define. To do this, we'll create an application registry/dependency-injection container/pattern-buzzword-of-the-month class. Put the following in the file "lib/Neo4Play.php":
<?php
class Neo4Play
{
    protected static $client = null;

    public static function client()
    {
        return self::$client;
    }

    public static function setClient(Everyman\Neo4j\Client $client)
    {
        self::$client = $client;
    }
}
Now our domain objects will have access to the client connection through `Neo4Play::client()` when we persist them to the database. It's time to define our actor class, in the file "lib/Actor.php":
<?php
use Everyman\Neo4j\Node,
    Everyman\Neo4j\Index;

class Actor
{
    public $id = null;
    public $name = '';

    public static function save(Actor $actor)
    {
    }

    public static function getActorByName($name)
    {
    }
}
Requiring our classes and setting up the client connection is part of the bootstrapping process of the application, so we'll need to add a few things to "bootstrap.php":
<?php
require_once(__DIR__.'/vendor/autoload.php');
require_once(__DIR__.'/lib/Neo4Play.php');
require_once(__DIR__.'/lib/Actor.php');

// ** set up error reporting, environment and connection... **//

Neo4Play::setClient($client);
We have a stub class for the domain object. The tests will still fail when we run them again, but at least all the classes should be found correctly.

Let's start with finding an Actor by name. With our knowledge of graph databases, we know this will involve an index lookup, and that we will get a Node object in return. If the lookup returns no result, we'll get a `null`. If we do get a Node back, we'll want to hold on to it, for updating the Actor later.

Modify the Actor class with the following contents:
class Actor
{
    //

    protected $node = null;

    //

    public static function getActorByName($name)
    {
        $actorIndex = new Index(Neo4Play::client(), Index::TypeNode, 'actors');
        $node = $actorIndex->findOne('name', $name);
        if (!$node) {
            return null;
        }

        $actor = new Actor();
        $actor->id = $node->getId();
        $actor->name = $node->getProperty('name');
        $actor->node = $node;
        return $actor;
    }
}
The main thing we're trying to accomplish here is keeping our domain classes as Plain-Old PHP Objects that don't require any special class inheritance or interface, and that hide the underlying persistence layer from the outside world.

The tests still fail. We'll finish up our Actor class by saving the Actor to the database.
class Actor
{
    //

    public static function save(Actor $actor)
    {
        if (!$actor->node) {
            $actor->node = new Node(Neo4Play::client());
        }

        $actor->node->setProperty('name', $actor->name);
        $actor->node->save();
        $actor->id = $actor->node->getId();

        $actorIndex = new Index(Neo4Play::client(), Index::TypeNode, 'actors');
        $actorIndex->add($actor->node, 'name', $actor->name);
    }

    //
}
Run the tests again. If you see all green, then everything is working properly. To double check, browse to the testing instance webadmin panel http://localhost:7475/webadmin/#. You should see 1 node and 1 property. (On pre-2.0 versions of Neo4j you would see 2 nodes, because node 0, the reference node, was not deleted when the database was cleaned out. The reference node was removed in Neo4j 2.0.)

Build Something Useful

It's time to start tacking on some user functionality to our application. Thanks to our work on the unit tests, we can create actors in the database and find them again via an exact name match. Let's expose that functionality.

Change the contents of "index.php" to the following:
<?php
require_once('bootstrap.php');

if (!empty($_POST['actorName'])) {
    $actor = new Actor();
    $actor->name = $_POST['actorName'];
    Actor::save($actor);
} else if (!empty($_GET['actorName'])) {
    $actor = Actor::getActorByName($_GET['actorName']);
}

?>
<form action="" method="POST">
Add Actor Name: <input type="text" name="actorName" />
<input type="submit" value="Add" />
</form>

<form action="" method="GET">
Find Actor Name: <input type="text" name="actorName" />
<input type="submit" value="Search" />
</form>

<?php if (!empty($actor)) : ?>
    Name: <?php echo htmlspecialchars($actor->name); ?><br />
    Id: <?php echo $actor->id; ?><br />
<?php elseif (!empty($_GET['actorName'])) : ?>
    No actor found by the name of "<?php echo htmlspecialchars($_GET['actorName']); ?>"<br />
<?php endif; ?>
Browse to your index file. Mine is at http://localhost/neo4play/index.php.

You should see the page you just created. Enter a name in the "Add Actor Name" box and click the "Add" button. If everything went according to plan, you should see the actor name and the id assigned to the actor by the database.

Try finding that actor using the search box. Note the actor's id.

Browse to http://localhost:7474/webadmin/# and click the "Data browser" tab. Enter the actor id in the text box at the top. The node you created when you added the actor should show up.

The interesting thing is that our actual application doesn't know anything about how the Actors are stored. Nothing in "index.php" references graphs or nodes or indexes. This means that, in theory, we could swap out the persistence layer for a SQL database later, or MongoDB, or anything else, and nothing in our application would have to change. If we started with a SQL database, we could easily transition to a graph database.
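To illustrate that claim (a purely hypothetical sketch, not part of the tutorial; it uses an in-memory SQLite database via PDO just to stay self-contained), the same static interface could be backed by a relational table:

```php
<?php
// Hypothetical sketch: the same public interface as Actor, backed by a
// relational table through PDO. index.php would not need to change.
class SqlActor
{
    public $id = null;
    public $name = '';

    protected static $pdo = null;

    protected static function pdo()
    {
        if (!self::$pdo) {
            self::$pdo = new PDO('sqlite::memory:');
            self::$pdo->exec('CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT)');
        }
        return self::$pdo;
    }

    public static function save(SqlActor $actor)
    {
        $stmt = self::pdo()->prepare('INSERT INTO actors (name) VALUES (?)');
        $stmt->execute(array($actor->name));
        $actor->id = (int) self::pdo()->lastInsertId();
    }

    public static function getActorByName($name)
    {
        $stmt = self::pdo()->prepare('SELECT id, name FROM actors WHERE name = ?');
        $stmt->execute(array($name));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        if (!$row) {
            return null;
        }

        $actor = new SqlActor();
        $actor->id = (int) $row['id'];
        $actor->name = $row['name'];
        return $actor;
    }
}
```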

Explore the Wonderful World of Graphs

Your development environment is now set up, and your application is bootstrapped. There's a lot more to add to this application, including creating movies, and linking actors and movies together. Maybe you'll want to add a social aspect, with movie recommendations. Graph databases are powerful tools that enable such functionality to be added easily.

Go ahead and explore the rest of the Neo4jPHP library (wiki and API). Also, be sure to check out the Neo4j documentation, especially the sections about the REST API, Cypher and Gremlin (two powerful graph querying and processing languages.)

All the code for this sample application is available as a gist: http://gist.github.com/1341833.

Happy graphing!

Development Setup for Neo4j and PHP: Part 1

Update 2014-02-15: Using Neo4jPHP from a downloaded PHAR file has been deprecated. The preferred and supported way to install the library is via Composer. The examples below have been updated to reflect this. It has also been updated for the 2.0 release of Neo4j.

I would really love to see more of my fellow PHP developers talking about, playing with, and building awesome applications on top of graph databases. They really are a powerful storage solution that fits well into a wide variety of domains.

In this two part series, I'll detail how to set up a development environment for building a project with a graph database (specifically Neo4j). Part 1 will show how to set up the development and unit testing databases. In Part 2, we'll create a basic application that talks to the database, including unit tests.

All the steps below were performed on Ubuntu 10.10 Maverick, but should be easy to translate to any other OS.

Grab the Components

There are a few libraries and software tools needed for the project. First off, we'll need to install our programming environment, specifically PHP, PHPUnit and Java. How to do this is dependent on your operating system. My setup is PHP 5.3, PHPUnit 3.6 and Sun Java 6.

Next, we need to get Neo4j. Download the latest tarball from http://neo4j.org/download. I usually go with the latest milestone release (2.0 at this time.)
> cd ~/Downloads
> wget http://dist.neo4j.org/neo4j-community-2.0.1-unix.tar.gz

Set Up the Development Instance

It is possible to put test and development data in the same database instance, but I prefer a two database solution, even when using a SQL database. I do this because 1) I don't have to worry about accidentally blowing away any development data I care about, and 2) I don't have to worry about queries and traversals in my unit tests accidentally pulling in development data, and vice versa.

Unlike SQL or many NOSQL database servers, Neo4j cannot have more than one database in a single instance. The only way to get two databases is to run two separate instances on two separate ports or hosts. The instructions below describe how to set up our development and unit testing databases on the same host but listening on different ports.

First, create a directory to hold both Neo4j instances, and unpack the development instance:
> mkdir -p ~/neo4j
> cd ~/neo4j
> tar -xvzf ~/Downloads/neo4j-community-2.0.1-unix.tar.gz
> mv neo4j-community-2.0.1 dev
Next, configure the server by editing the file "~/neo4j/dev/conf/neo4j-server.properties". Make sure that the server is listening on port 7474:
org.neo4j.server.webserver.port=7474
If you will be accessing the database instance from a host other than localhost, uncomment the following line:
org.neo4j.server.webserver.address=0.0.0.0
Save the config and exit. Now it's time to start the instance.
> ~/neo4j/dev/bin/neo4j start
You should be able to browse to the Neo4j Browser panel: http://localhost:7474

Set Up the Testing Instance

Unpack the test instance:
> cd ~/neo4j
> tar -xvzf ~/Downloads/neo4j-community-2.0.1-unix.tar.gz
> mv neo4j-community-2.0.1 test
Next, configure the server by editing the file "~/neo4j/test/conf/neo4j-server.properties". Make sure that the server is on port 7475 (note that this is a different port from the development instance!)
org.neo4j.server.webserver.port=7475
If you will be accessing the database instance from a host other than localhost, uncomment the following line:
org.neo4j.server.webserver.address=0.0.0.0
We need one more change to differentiate the testing from the development instance. Edit the file "~/neo4j/test/conf/neo4j-wrapper.properties" and change the instance name line:
wrapper.name=neo4j-test
Save and exit, then start the instance.
> ~/neo4j/test/bin/neo4j start
You should be able to browse to the Neo4j Browser panel: http://localhost:7475

Congratulations! Everything is set up and ready for us to build our application. In Part 2 of this series, I'll talk about setting up unit tests and building a basic application using a graph database as a backend.

2011-10-04

A Humbling Reference

Sometimes I have an experience that reminds me of how near-sighted I can get when I'm head-down in a problem.

What do you think the output of the following code will be (no cheating by running it first!):
<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $v["result"] .= "a";
        echo $v["result"] ."\n";
    }
}
foreach ($a as $k => $v) echo $v["result"] ."\n";
If the goal is to add the character "a" twice onto each result element of each sub-array, at first glance this seems like a perfectly reasonable way to go about it.

But hold up! Each time through, we seem to have only added one "a" onto the result. And in the second foreach loop, it seems like we didn't modify the value at all! What gives?

Well, it might be obvious to most of you, but when this cost me an hour and a half of my work day, it was buried in a nest of process forking and stream handling, part of which looked something like this:
while ($currentExecution) {
    foreach ($currentExecution as $i => $handle) {
        $handle['output'] .= stream_get_contents($handle['outstream']);
        $status = proc_get_status($handle['process']);
        if (!$status['running']) {
            fclose($handle['outstream']);
            proc_terminate($handle['process']);
            echo $handle['output']."\n";
            unset($currentExecution[$i]);
        }
    }
}
Same concept as before: I have an array of process handles (opened with proc_open) and I'm looping through, gathering the output of each process as it is generated, until the process is finished running, at which point I close the process and display the output. The symptom was that, sometimes, I would only see the end of the output and not the beginning.

This happened whenever a process handle went through the outer while loop more than once, just as in the contrived example above. The reason is that when you work with a value inside a foreach loop, you are actually working with a copy of that value (technically, a copy-on-write). So modifications made to the value do not persist across multiple passes through the while loop.

As it turns out, the solution here is quite simple. Actually, there are several solutions. Here's the easiest one, applied to the original example:
<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => &$v) {
        $v["result"] .= "a";
        echo $v["result"] ."\n";
    }
}
foreach ($a as $k => $v) echo $v["result"] ."\n";
The difference is subtle: it's the "&" prepended to the $v in the foreach loop. This tells the loop to use a reference to the value instead of a copy. One caveat: after the loop finishes, $v still references the last element of the array, so it's good practice to unset($v) before reusing the variable. Here's another possible solution:
<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $a[$j]["result"] .= "a";
        echo $a[$j]["result"] ."\n";
    }
}
foreach ($a as $k => $v) echo $v["result"] ."\n";
In this case, we're not even using the variable $v. We're using the array index to refer to the value inside of the array directly. Because we're modifying the actual array value and not the copy, the changes persist through the loop.

The final solution (and the one I ended up going with) is to make the value an object instead of an array:
<?php
$obj = new stdClass();
$obj->result = "a";
$a = array($obj);
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $v->result .= "a";
        echo $v->result ."\n";
    }
}
foreach ($a as $k => $v) echo $v->result ."\n";
Objects in PHP are passed around by handle: assigning an object, including to the loop variable in a foreach, copies the handle rather than the object itself. So modifying the object's property inside the loop modifies the one real object, not a copy, and the change persists.
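The handle-copy behavior is easy to verify in isolation. This is a standalone sketch (not from the original examples) showing that plain assignment shares the object, while clone makes an actual copy:

```php
<?php
$o = new stdClass();
$o->x = 1;

$copy = $o;         // copies the handle; both names point at the same object
$copy->x = 2;
echo $o->x . "\n";  // 2

$clone = clone $o;  // clone makes an actual copy of the object
$clone->x = 3;
echo $o->x . "\n";  // still 2
```

If you ever do want copy semantics for an object inside a loop, clone is the tool for that.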

The most annoying thing about this whole experience? I had solved this problem for someone else a few weeks ago, telling them what the issue was just from hearing them describe the situation and without even seeing their code run. So the deeper lesson here is this: when you've been staring at code for so long that it's making you cross-eyed and frustrated, walk away. And most importantly, get someone else who hasn't been working on it to look it over.

...something something forest...something something trees...

Sample code available at http://gist.github.com/1263522

2011-09-23

Phar Flung Phing

One of the cooler features of PHP 5.3 is the ability to package up a set of PHP class files and scripts into a single archive, known as a PHAR ("PHp ARchive"). PHAR files are pretty much equivalent to Java's JAR files: they allow you to distribute an entire library or application as a single file. I decided to see how easy it would be to wrap up Neo4jPHP in a PHAR for distribution.

Surprisingly, there isn't much information out there on creating PHAR files outside of the PHP manual and a few older posts. They deal mainly with creating a PHAR with a hand-written script specific to the project being packaged, using PHP's PHAR classes and methods.

Since I also started playing with Phing recently, I decided to see if I could incorporate packaging a project as a PHAR into my build system. It turns out, it's pretty easy, given that Phing has a built-in PharPackage task.

All steps below were performed on Ubuntu 10.10 Maverick

Step 1 - Get Phing

The Phing docs are pretty explicit and easy to follow. If you have PEAR installed, simply do:
> sudo pear channel-discover pear.phing.info
> sudo pear install phing/phing
If all goes well, you should see the location of the phing command line utility when you run
> which phing

Step 2 - Allow PHP to create PHARs

This is one that most of the posts I found on creating PHARs fail to mention. In the default installation of PHP, PHARs are read-only (this closes a nasty security hole, whereby any PHP script on your machine would have the ability to modify other PHARs.) In order to create PHARs, the php.ini setting "phar.readonly" must be turned off. I did this by creating a new file in my PHP conf.d directory:
> sudo sh -c "echo 'phar.readonly=0' > /etc/php5/conf.d/phar.ini"

Step 3 - Create a stub

In a PHAR, the stub file is run when the PHAR is `include`d or `require`d in a script (or when the PHAR is invoked from the command line.) The purpose of the stub is to do any setup necessary for the library contained in the PHAR. Since I'm packaging a library, the stub I created doesn't do anything more than set up autoloading of my library's classes.

Save the following file as "stub.php" in the root of your project:
<?php
Phar::mapPhar('myproject.phar');
spl_autoload_register(function ($className) {
    $libPath = 'phar://myproject.phar/lib/';
    $classFile = str_replace('\\',DIRECTORY_SEPARATOR,$className).'.php';
    $classPath = $libPath.$classFile;
    if (file_exists($classPath)) {
        require($classPath);
    }
});
__HALT_COMPILER();
The "__HALT_COMPILER();" line must be the last thing in the file. The call to `Phar::mapPhar()` registers the PHAR with PHP and allows any of the files inside to be found via the "phar://" stream wrapper. Any files or directories contained in the PHAR can be referenced as "phar://myproject.phar/path/inside/phar/MyClass.php". Since all my project files are namespaced, autoloading is a matter of translating the class name into a path inside the PHAR.
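Once the PHAR has been built (Step 5 below produces it), the phar:// wrapper can be poked at directly. A quick sketch; the lib/MyProject/Thing.php path is a made-up example, so substitute a file that actually exists in your archive:

```php
<?php
// Registers the PHAR, then checks a bundled file via the phar:// wrapper.
// 'lib/MyProject/Thing.php' is a hypothetical path for illustration.
require 'myproject.phar';
var_dump(file_exists('phar://myproject.phar/lib/MyProject/Thing.php'));
```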

Step 4 - Create a build file

Phing's documentation is pretty extensive, so I won't go into too much detail about what each piece of the build file means. The important part is the PharPackage task in the "package" target.

Save the following file as "build.xml" in the root of your project:
<?xml version="1.0" encoding="UTF-8"?>
<project name="MyProject" default="package">

  <!-- Target: build -->
  <target name="build">
    <delete dir="./build" />
    <mkdir dir="./build" />
    <copy todir="./build">
      <fileset dir=".">
        <include name="lib/" />
      </fileset>
    </copy>
  </target>

  <!-- Target: package -->
  <target name="package" depends="build">
    <delete file="./myproject.phar" />
    <pharpackage
      destfile="./myproject.phar"
      basedir="./build"
      compression="gzip"
      stub="./stub.php"
      signature="sha1">
      <fileset dir="./build">
        <include name="**/**" />
      </fileset>
      <metadata>
        <element name="version" value="1.2.3" />
        <element name="authors">
          <element name="Joe Developer">
            <element name="email" value="joe.developer@example.com" />
          </element>
        </element>
      </metadata>
    </pharpackage>
  </target>
</project>
The important part is the "package" target, the default target of this build file. It depends on the "build" target, whose only purpose is to copy out all the files that will be packaged in the PHAR into a separate "build" directory. This strips out any testing, example or other non-library files. Any previously created package is deleted.

The main work of packaging is done via the "pharpackage" task. In this case, I'm specifying that the output package should be called "myproject.phar" and be put in the same directory as the build file, the PHAR should be compressed with "gzip", the stub to load when requiring the file is at "./stub.php" and the PHAR should be signed with a SHA1 hash.

Metadata appears to be optional, but Phing throws up a nasty looking warning if it's not there (it tries to convert an empty value to an array.) Nested metadata is turned into a PHP array and serialized.
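To see what actually ends up in the archive, the Phar class can read the metadata back out. A quick sketch (reading does not require phar.readonly to be off):

```php
<?php
// Open the built archive and dump the metadata Phing serialized into it;
// nested <metadata> elements come back as a PHP array
$phar = new Phar('myproject.phar');
print_r($phar->getMetadata());
```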

Step 5 - Phling your PHAR

Run the following command in the directory with your build file:
> phing
# or, if the "package" target is not the default
> phing package
If all went well, there should be a file called "myproject.phar" in the root of your project. You can test it out by running this test script:
<?php
require("myproject.phar");

$thing = new Class\In\My\Project();
If PHP doesn't complain about a missing class, the PHAR was set up correctly and autoloading is working.

The Neo4jPHP PHAR is available at http://github.com/downloads/jadell/Neo4jPHP/neo4jphp.phar.

2011-08-25

Getting Neo4jPHP working with CodeIgniter 2

This post is based heavily on Integrating Doctrine 2 with CodeIgniter 2. I haven't tested it myself, so please use it as a starting point and let me know if I should update anything.

Here are the steps to get Neo4jPHP set up as a CodeIgniter 2 library:
  1. Copy the 'Everyman' folder to application/libraries
  2. Create a file called Everyman.php in application/libraries
  3. Copy the following code into Everyman.php:
    <?php
    class Everyman
    {
        public function __construct()
        {
            spl_autoload_register(array($this,'autoload'));
        }
    
        public function autoload($sClass)
        {
            $sLibPath = __DIR__.DIRECTORY_SEPARATOR;
            $sClassFile = str_replace('\\',DIRECTORY_SEPARATOR,$sClass).'.php';
            $sClassPath = $sLibPath.$sClassFile;
            if (file_exists($sClassPath)) {
                require($sClassPath);
            }
        }
    }
    
  4. Add the following line to application/config/autoload.php:
    $autoload['libraries'] = array('everyman');
    

This is actually a generic autoloader that will attempt to autoload any namespaced class under application/libraries where the class name matches the file path.
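With the library autoloaded, any controller can use the Everyman classes directly. A hypothetical sketch (the controller name and connection details are placeholders; the Client/Transport/Node calls mirror the Neo4jPHP API used later in this blog):

```php
<?php
class Graph extends CI_Controller
{
    public function index()
    {
        // The 'everyman' library registered the autoloader, so namespaced
        // classes under application/libraries resolve on first use
        $client = new Everyman\Neo4j\Client(
            new Everyman\Neo4j\Transport('localhost', 7474)
        );

        $node = new Everyman\Neo4j\Node($client);
        $node->setProperty('name', 'example')->save();
    }
}
```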

2011-07-23

Performance of Graph vs. Relational Databases

A few weeks ago, Emil Eifrem, CEO of Neo Technology, gave a webinar introduction to graph databases. I watched it as a lead-up to my own presentation on graph databases and Neo4j.

Right around the 54 minute mark, Emil talks about a very interesting experiment showing the performance difference between a relational database and a graph database for a certain type of problem called "arbitrary path query", specifically, given 1,000 users with an average of 50 "friend" relationships each, determine if one person is connected to another in 4 or fewer hops.

Against a popular open-source relational database, the query took around 2,000 ms. For a graph database, the same determination took 2 ms. So the graph database was 1,000 times faster for this particular use case.

Not satisfied with that, they then decided to run the same experiment with 1,000,000 users. The graph database took 2 ms. They stopped the relational database after several days of waiting for results.

I showed this clip at my presentation to a few people who stuck around afterwards, and we tried to figure out why the graph database had the same performance with 1,000 times the data, while the relational database became unusably slow. The answer has to do with the way in which each type of database searches information.

Relational databases search all of the data looking for anything that meets the search criteria. The larger the set of data, the longer it takes to find matches, because the database has to examine everything in the collection.

Here's an example: let's assume there is a table with "friend" relationships:
> SELECT * FROM friends;
+-------------+--------------+
| user_id     | friend_id    |
+-------------+--------------+
| 1           | 2            |
| 1           | 3            |
| 1           | 4            |
| 2           | 5            |
| 2           | 6            |
| 2           | 7            |
| 3           | 8            |
| 3           | 9            |
| 3           | 10           |
+-------------+--------------+
In order to see if user "9" is connected to user "2", the database has to find all the friends of user "9", and see if user "2" is in that list. If not, find all of their friends, and then see if user "2" is in that list. The database has to scan the entire table each time. This means that if you double the number of rows in the table, you've doubled the amount of data to search, and thus doubled the amount of time it takes to find what you are looking for. (Even with indexing, it still has to find the values in the index tree, which involves traversing that tree. The index tree grows larger with each new record, meaning the time it takes to traverse grows larger as well. And for each search, you always start at the root of the tree.)

Each new batch of friends to look at requires an entirely new scan of the table/index. So more records leads to more search time.

Conversely, a graph database looks only at records that are directly connected to other records. If it is given a limit on how many "hops" it is allowed to make, it can ignore everything more than that number of hops away.
During the search, only the records within the hop limit are ever seen. And since a graph traversal "remembers" where it is at any time, it never has to start from the beginning, only from its last known position.

You could add another ring of records around the outside, or a thousand more rings, and the search would never need to look at them because they are more hops away than the limit. The only way to increase the number of records searched (and thereby decrease performance) is to add more records within the hop limit, or add more relationships between the existing records.

So the reason why having 1,000 vs. 1,000,000 records causes such a stark difference between a relational and a graph database is that relational database performance decreases in relation to the number of records in the table, while graph database performance decreases in relation to the number of connections between the records.

2011-07-19

TriPUG Meetup slides

Here are the slides for my presentation on graph databases at the TriPUG meetup tonight: http://www.slideshare.net/JoshAdell/graphing-databases

Overall, I think it was pretty well received. I didn't mean for it to last an hour and a half, but there it is. There were a lot of new folks there; I'm encouraged at the enthusiasm of the developer community in the Triangle region. Thanks for having me!

2011-07-16

agile Adoption on a Non-Technical Team

I'm pretty excited about a new development at my office. Based on my team's success in increasing our productivity, efficiency and transparency, another team has decided to adopt "agile-like" practices. The most interesting thing is that this team isn't a technical team at all.

It's probably fair to say that what they're adopting is less like agile and more like lean. The goal was to increase visibility to their management structure on what the team was doing each day, increase the efficiency of their work, minimize interruptions of team members and raise problems earlier. To accomplish this, they have implemented a stripped-down version of Scrum, with the following practices:
  • Planning: break-down of larger, more ambiguous mid- to long-term goals into shorter, more easily tracked tasks.
  • Daily Standup: a team stand-up meeting every morning, with each team member answering "What did I do yesterday?", "What am I going to do today?" and "Is anything preventing me from getting things done?"
  • Scrum Board: a week-long calendar set up inside the team's area in a highly trafficked place, so that anyone else (including team managers!) can see at a glance how the team is doing without having to ask individual team members.
  • Retrospective: a periodic gathering of the team to discuss the new processes, how the processes are helping/hindering and what to change to make the team more efficient.
The team's goals are very different from that of a software development team, so we encouraged them to not follow our process dogmatically, and instead, to come up with methods that work for their team's unique needs (the core of "a"gile vs. "Agile".) For instance, my team operates on a 2 week sprint cadence (a new sprint every two weeks.) Their team receives the team goals on a monthly basis. Rather than adopt 2-week or 4-week sprints, they have chosen to break up the monthly goals into smaller and more transparent 1-week goals. This provides them with earlier warning if things start to go sideways, and provides motivation to not let things stack up until the end of the month.

Also, their work does not fit easily into "delivering feature or product X by the end of the month." Rather, they are tasked with having certain regions of the country covered by certain baselines of recruitment effort (for partner and vendor recruiting.) So the "User Stories" on their backlog are phrased as "15% coverage of NJ/NY metro area" or "Decrease turnover 10% in southern VA." Their tasks become things like calling or emailing specific contacts, setting up conference calls and facilitating the transition of a contact into a partner.

The Scrum board also has little stickers (a frog for their direct manager and Venom for her manager) that they can attach to a specific task to signal that they need additional help with an item. I think this is a neat idea, because it means that their managers don't have to ask them many times a day if everything is going smoothly. A quick glance at the Scrum board gives them the same information.

A few sprints ago, my team decided to start sharing a few minutes about our agile practices at each Sprint Review. I'm really glad to see that not only were people not spacing out (which I had feared they might) but were actually absorbing the information and thinking about how it could apply to their own team. I'm very excited to see how this new process works for our recruitment team.

2011-07-06

Neo4jPHP beta released

I've been working on a PHP client for the Neo4j REST API for little while. I think it's ready for some real-life testing. The beta version is available here: https://github.com/jadell/Neo4jPHP/tarball/0.0.1-beta

Features:
  • Developed against the Neo4j 1.4 milestone releases
  • Simple, object-oriented API
  • Almost complete REST API coverage
  • Indexing of nodes and relationships, including exact match and query support
  • Cypher queries (thanks to Jacob Hansson)
  • Traversal support, including paged traversals
  • Lazy-loading of node and relationship data

Hopefully coming soon:
  • Client-side caching
  • Batch operations

There are some usage examples included.

It's a beta release, so please be gentle (on me, that is; be as rough as you want with the code.) If anyone finds any bugs or has feature requests, please use the GitHub issues page.


Update 2011-07-08: I would be remiss if I didn't mention the other PHP Neo4j REST client available at https://github.com/tchaffee/Neo4J-REST-PHP-API-client. It is built for the current stable release of Neo4j 1.3.

2011-06-27

Path finding with Neo4j

In my previous post I talked about graphing databases (Neo4j in particular) and how they can be applied to certain classes of problems where data may have multiple degrees of separation in their relationships.

The thing that makes graphing databases useful is the ability to find relationship paths from one node to another. There are many algorithms for finding paths efficiently, depending on the use case.

Consider a graph that represents a map of a town. Every node in the graph represents an intersection of two or more streets. Each direction of a street is represented as a different relationship: two-way streets will have two relationships between the same two nodes, one pointing from the first node to the second, and the other pointing from the second node to the first. One-way streets will have a single relationship. Each street relationship will also have a distance property. Here is what one such map might look like:
Every street is two-way, except for Market Alley and Park Drive. The numbers in parentheses represent the street distance in hundreds of yards (i.e. 3 = 300 yards.)

Here is the code to create this graph in the database, using the REST client I've been working on (connection initialization is not shown):
$intersections = array(
    "First & Main",
    "Second & Main",
    "Third & Main",
    "First & Oak",
    "Second & Oak",
    "Third & Market",
);

$streets = array(
    // start, end, [direction, distance, name]
    array(0, 1, array('direction'=>'east', 'distance'=>2, 'name'=>'Main St.')),
    array(0, 3, array('direction'=>'south', 'distance'=>2, 'name'=>'First Ave.')),
    array(0, 4, array('direction'=>'south', 'distance'=>3, 'name'=>'Park Dr.')),

    array(1, 0, array('direction'=>'west', 'distance'=>2, 'name'=>'Main St.')),
    array(1, 2, array('direction'=>'east', 'distance'=>2, 'name'=>'Main St.')),
    array(1, 4, array('direction'=>'south', 'distance'=>2, 'name'=>'Second Ave.')),

    array(2, 1, array('direction'=>'west', 'distance'=>2, 'name'=>'Main St.')),
    array(2, 5, array('direction'=>'south', 'distance'=>1, 'name'=>'Third Ave.')),

    array(3, 0, array('direction'=>'north', 'distance'=>2, 'name'=>'First Ave.')),
    array(3, 4, array('direction'=>'east', 'distance'=>2, 'name'=>'Oak St')),

    array(4, 3, array('direction'=>'west', 'distance'=>2, 'name'=>'Oak St.')),
    array(4, 1, array('direction'=>'north', 'distance'=>2, 'name'=>'Second Ave.')),

    array(5, 2, array('direction'=>'north', 'distance'=>1, 'name'=>'Third Ave.')),
    array(5, 4, array('direction'=>'west', 'distance'=>2, 'name'=>'Market Alley')),
);

$nodes = array();
$intersectionIndex = new Index($client, Index::TypeNode, 'intersections');
foreach ($intersections as $intersection) {
    $node = new Node($client);
    $node->setProperty('name', $intersection)->save();
    $intersectionIndex->add($node, 'name', $node->getProperty('name'));
    $nodes[] = $node;
}

foreach ($streets as $street) {
    $start = $nodes[$street[0]];
    $end = $nodes[$street[1]];
    $properties = $street[2];
    $start->relateTo($end, 'CONNECTS')
        ->setProperties($properties)
        ->save();
}
First, we set up a list of the intersections on the map, and also a list of the streets that connect those intersections. Each street will be a relationship that has data about the street's direction, distance and name.

Then, we create each intersection node. Each node is also added to an index. Indexes allow stored nodes to be found quickly. Nodes and relationships can be indexed, and indexing can occur on any node or relationship field. You can even index on fields that don't exist on the node or relationship. We'll use the index later to find the start and end points of our path by name.

Finally, we create a relationship for every street. Relationships can have arbitrary data attached to them, just like nodes.

Now we create a way to find and display driving directions from any intersection to any other intersection:
$turns = array(
    'east' => array('north' => 'left','south' => 'right'),
    'west' => array('north' => 'right','south' => 'left'),
    'north' => array('east' => 'right','west' => 'left'),
    'south' => array('east' => 'left','west' => 'right'),
);

$fromNode = $intersectionIndex->findOne("Second & Oak");
$toNode = $intersectionIndex->findOne("Third & Main");

$paths = $fromNode->findPathsTo($toNode, 'CONNECTS', Relationship::DirectionOut)
    ->setMaxDepth(5)
    ->getPaths();

foreach ($paths as $i => $path) {
    $path->setContext(Path::ContextRelationship);
    $prevDirection = null;
    $totalDistance = 0;

    echo "Path " . ($i+1) .":\n";
    foreach ($path as $j => $rel) {
        $direction = $rel->getProperty('direction');
        $distance = $rel->getProperty('distance') * 100;
        $name = $rel->getProperty('name');

        if (!$prevDirection) {
            $action = 'Head';
        } else if ($prevDirection == $direction) {
            $action = 'Continue';
        } else {
            $turn = $turns[$prevDirection][$direction];
            $action = "Turn $turn, and continue";
        }
        $prevDirection = $direction;
        $step = $j+1;
        $totalDistance += $distance;

        echo "\t{$step}: {$action} {$direction} on {$name} for {$distance} yards.\n";
    }
    echo "\tTravel distance: {$totalDistance} yards\n\n";
}
Note that in the `findPathsTo` call, we specify that we want only outgoing relationships. This is how we prevent getting directions that may send us the wrong way down a one-way street. We also set a max depth, since the default search depth is 1. Also, note the `setContext` call which tells the path that we want the `foreach` to loop through the relationships, not the nodes. Running this produces the following directions:
Path 1:
    1: Head north on Second Ave. for 200 yards.
    2: Turn left, and continue west on Main St. for 200 yards.
    Travel distance: 400 yards
We only got back one path because, by default, the path finding algorithm finds the shortest path, meaning the path with the least number of relationships. If we switch the to and from intersections, we can get directions back to our starting point:

Path 1:
    1: Head west on Main St. for 200 yards.
    2: Turn left, and continue south on Second Ave. for 200 yards.
    Travel distance: 400 yards

Path 2:
    1: Head south on Third Ave. for 100 yards.
    2: Turn right, and continue west on Market Alley for 200 yards.
    Travel distance: 300 yards
We got back two paths this time, but one of them is longer than the other. How can this be if we are using the shortest path algorithm? The answer is that shortest path refers only to the number of relationships in the path. We didn't tell our path finder to pay attention to the cost of one path over another. Fortunately, we can do that quite easily by telling the `findPathsTo` call to use a property of each relationship when determining which path is the least expensive:
$paths = $fromNode->findPathsTo($toNode, 'CONNECTS', Relationship::DirectionOut)
    ->setAlgorithm(PathFinder::AlgoDijkstra)
    ->setCostProperty('distance')
    ->setMaxDepth(5)
    ->getPaths();
We're telling the path finder to use Dijkstra's algorithm, an algorithm to find the lowest cost path, which is different from the shortest path. Since our map uses distance, the cost of a path is the total of all the distance properties of the relationships in that path. Running the script now gives us only one path, the path with the least distance:
Path 1:
    1: Head south on Third Ave. for 100 yards.
    2: Turn right, and continue west on Market Alley for 200 yards.
    Travel distance: 300 yards
If there were another path with the same total distance, it would also be displayed. Path cost ignores the number of relationships in the path. If there were a path with 3 relationships totaling 300 yards, and another with 1 relationship totaling 300 yards, both of them would be displayed.

Neo4j has several path finding algorithms built in. Some of them are:
  • shortest path - find paths with the fewest relationships
  • dijkstra - find paths with the lowest cost
  • simple path - find paths with no repeated nodes
  • all - find all paths between two nodes
Neo4j also comes with an expressive traversal API that allows you to create your own path finding algorithms for domain specific needs.

2011-06-15

Neo4j for PHP

Update 2011-09-14: I've put a lot of effort into Neo4jPHP since this post was written. It's pretty full-featured and covers almost 100% of the Neo4j REST interface. Anyone interested in playing with Neo4j from PHP should definitely check it out. I would love some feedback!

Lately, I've been playing around with the graph database Neo4j and its application to certain classes of problems. Graph databases are meant to solve problems in domains where data relationships can be multiple levels deep. For example, in a relational database, it's very easy to answer the question "Give me a list of all actors who have been in a movie with Kevin Bacon":
> desc roles;
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| actor_name  | varchar(100) | YES  |     | NULL    |                |
| role_name   | varchar(100) | YES  |     | NULL    |                |
| movie_title | varchar(100) | YES  |     | NULL    |                |
+-------------+--------------+------+-----+---------+----------------+

> SELECT actor_name, role_name FROM roles WHERE movie_title IN (SELECT DISTINCT movie_title FROM roles WHERE actor_name='Kevin Bacon')
Excuse my use of sub-queries (re-write it to a JOIN in your head if you wish.)

But suppose you want to get the names of all the actors who have been in a movie with someone who has been in a movie with Kevin Bacon. Suddenly, you have yet another JOIN against the same table. Now add a third degree: someone who has been in a movie with someone who has been in a movie with someone who has been in a movie with Kevin Bacon. As you continue to add degrees, the query becomes increasingly unwieldy, harder to maintain, and less performant.
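To make that growth concrete, here is one way the second-degree version might look. This is a sketch against the same roles table (sticking with sub-queries for symmetry with the query above); each further degree of separation wraps on another pair of nested sub-queries:

```sql
-- Second degree: actors who shared a movie with someone who shared
-- a movie with Kevin Bacon
SELECT actor_name, role_name FROM roles WHERE movie_title IN (
    SELECT DISTINCT movie_title FROM roles WHERE actor_name IN (
        SELECT DISTINCT actor_name FROM roles WHERE movie_title IN (
            SELECT DISTINCT movie_title FROM roles
            WHERE actor_name = 'Kevin Bacon'
        )
    )
);
```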

This is precisely the type of problem graph databases are meant to solve: finding paths between pieces of data that may be one or more relationships removed from each other. They solve it very elegantly by modeling domain objects as graph nodes and edges (relationships in graph db parlance) and then traversing the graph using well-known and efficient algorithms.

The above example can be very easily modeled this way: every actor is a node, every movie is a node, and every role is a relationship going from the actor to the movie they were in:
Now it becomes very easy to find a path from a given actor to Kevin Bacon.

Neo4j is an open source graph database, with both community and enterprise licensing structures. It supports transactions and can handle billions of nodes and relationships in a single instance. It was originally built to be embedded in Java applications, and most of the documentation and examples are evidence of that. Unfortunately, there is no native PHP wrapper for talking to Neo4j.

Luckily, Neo4j also has a built-in REST server and PHP is very good at consuming REST services. There's already a good Neo4j REST PHP library out there, but I decided to write my own to get a better understanding of how the REST interface actually works. You can grab it here and all the code examples below are written using it. The concepts can easily be ported to any Neo4j REST client.

First, we need to initialize a connection to the database. Since this is a REST interface, there is no persistent connection, and in fact, no data communication happens until the first time we need to read or write data:
use Everyman\Neo4j\Client,
    Everyman\Neo4j\Transport,
    Everyman\Neo4j\Node,
    Everyman\Neo4j\Relationship;

$client = new Client(new Transport('localhost', 7474));
Now we need to create a node for each of our actors and movies. This is analogous to the INSERT statements used to enter data into a traditional RDBMS:
$keanu = new Node($client);
$keanu->setProperty('name', 'Keanu Reeves')->save();
$laurence = new Node($client);
$laurence->setProperty('name', 'Laurence Fishburne')->save();
$jennifer = new Node($client);
$jennifer->setProperty('name', 'Jennifer Connelly')->save();
$kevin = new Node($client);
$kevin->setProperty('name', 'Kevin Bacon')->save();

$matrix = new Node($client);
$matrix->setProperty('title', 'The Matrix')->save();
$higherLearning = new Node($client);
$higherLearning->setProperty('title', 'Higher Learning')->save();
$mysticRiver = new Node($client);
$mysticRiver->setProperty('title', 'Mystic River')->save();
Each node has `setProperty` and `getProperty` methods that allow storing arbitrary data on the node. No server communication happens until the `save()` call, which must be called for each node.

Linking an actor to a movie means setting up a relationship between them. In RDBMS terms, the relationship takes the place of a join table or a foreign key column. In the example, the relationship always starts with the actor pointing to the movie, and is tagged as an "acted in" type relationship:
$keanu->relateTo($matrix, 'IN')->save();
$laurence->relateTo($matrix, 'IN')->save();

$laurence->relateTo($higherLearning, 'IN')->save();
$jennifer->relateTo($higherLearning, 'IN')->save();

$laurence->relateTo($mysticRiver, 'IN')->save();
$kevin->relateTo($mysticRiver, 'IN')->save();
The `relateTo` call returns a Relationship object, which is like a node in that it can have arbitrary properties stored on it. Each relationship is also saved to the database.

The direction of the relationship is totally arbitrary; paths can be found regardless of which direction a relationship points. You can use whichever semantics make sense for your problem domain. In the example above, it makes sense that an actor is "in" a movie, but it could just as easily be phrased that a movie "has" an actor. The same two nodes can have multiple relationships to each other, with different directions, types and properties.
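Relationships can also carry their own data, for instance, the character an actor played. This is a sketch; the `role` property name is just an illustration:

```php
// Relationships accept properties just like nodes do
$rel = $keanu->relateTo($matrix, 'IN');
$rel->setProperty('role', 'Neo');
$rel->save();

echo $rel->getProperty('role');
```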

The relationships are all set up, and now we are ready to find links between any actor in our system and Kevin Bacon. Note that the maximum length of the path is going to be 12 (6 degrees of separation multiplied by 2 nodes for each degree; an actor node and a movie node.)
$path = $keanu->findPathsTo($kevin)
    ->setMaxDepth(12)
    ->getSinglePath();

foreach ($path as $i => $node) {
    if ($i % 2 == 0) {
        echo $node->getProperty('name');
        if ($i+1 != count($path)) {
            echo " was in\n";
        }
    } else {
        echo "\t" . $node->getProperty('title') . " with\n";
    }
}
A path is an ordered array of nodes. The nodes alternate between being actor and movie nodes.

You can also do the typical "find all the related movies" type queries:
echo $laurence->getProperty('name') . " was in:\n";
$relationships = $laurence->getRelationships('IN');
foreach ($relationships as $relationship) {
    $movie = $relationship->getEndNode();
    echo "\t" . $movie->getProperty('title') . "\n";
}
`getRelationships` returns an array of all relationships that a node has, optionally limiting to only relationships of a given type. It is also possible to specify only incoming or outgoing relationships. We've set up our data so that all 'IN' relationships are from an actor to a movie, so we know that the end node of any 'IN' relationship is a movie.
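For instance, to look at only the relationships pointing out of a node, a direction constant can be passed along with the type. This is a sketch; the constant names follow the client library's `Relationship` class:

```php
use Everyman\Neo4j\Relationship;

// Only 'IN' relationships that start at $laurence
$outgoing = $laurence->getRelationships('IN', Relationship::DirectionOut);
foreach ($outgoing as $relationship) {
    echo "\t" . $relationship->getEndNode()->getProperty('title') . "\n";
}
```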

There is more available in the REST interface, including node and relationship indexing, querying, and traversal (which allows more complicated path-finding behaviors.) Transaction/batch operation support over REST is marked "experimental" for now. I'm hoping to add wrappers for more of this functionality soon. I'll also be posting more on the different types of problems that can be solved very elegantly with graph databases.

The next post dives into a bit more detail about path-finding and some of the different algorithms available.

2011-05-16

Logging User Sessions Across Requests

When tracking down a bug, few things aid the process more than good application logs. And this is especially true when the bug has already escaped to your production systems. In these cases, you may not be able to inject test data into your database, cowboy-code in some `var_dump`s or perform the same action over and over again. Logs can save a lot of time and effort when tracking down issues in a live application.

But logs also have a few shortcomings. In an active multi-user application (like most web applications), the log messages for every user are piled together in the same file, and messages from simultaneous requests get interwoven. This makes it a nightmare to track the log messages for a single request, let alone messages across several requests by the same user.

One way to handle this is to put a request-specific identifier in every log message. But I shouldn't have to remember to append or prepend the identifier to my log messages. I'd rather have it happen automatically, without me or my teammates having to think about it.

Here's a method we've been using to try and untangle the mess and retain the usefulness of our logs. The code uses Zend's logging component, but can easily be adapted to other log systems.

First, we start up a log and tell it how to format each log line.
$format = "%timestamp% [%logId%]: %message%" . PHP_EOL;
$formatter = new Zend_Log_Formatter_Simple($format);

$writer = new Zend_Log_Writer_Stream('/path/to/application.log');
$writer->setFormatter($formatter);

$log = new Zend_Log();
$log->addWriter($writer);
In the format string, names wrapped in "%" signs are log variables that will be filled in when a message is logged. "timestamp" and "message" are provided by Zend and will be filled with the appropriate values when the logging call is made. "logId" is a custom symbol, which means we need to tell the log object what value to give it.

In order to track all the log messages across a single user request, we need to pick a unique value for the log id:
// The last 8 characters of a uniqid are unique enough for our purposes
$logId = substr(uniqid(), -8);
$log->setEventItem('logId', $logId);
Now we wait for a few users to make requests at the same time:
// A web request
$log->info('The user made some request');
$log->info('Here is some output of that request: '.$output);

// A different, simultaneous request from another user
$log->info('The user made some request');
$log->info('Here is some output of that request: '.$output);

// Another request from the first user
$log->info('Another user request');
Here's the log output:
2011-05-16 19:04 [a73fe12a]: The user made some request
2011-05-16 19:04 [24f90ae1]: The user made some request
2011-05-16 19:04 [24f90ae1]: Here is some output of that request: Hello, World
2011-05-16 19:04 [a73fe12a]: Here is some output of that request: Hello, World
2011-05-16 19:04 [feb40a52]: Another user request
Now we can easily track a single request's messages in the log, even if multiple requests happen at the same time.

But we can do better. Oftentimes, bugs are the result not of a single user request, but of a sequence of requests. If we only track logs on a per-request basis, we haven't helped ourselves track request-spanning bugs. Luckily, we already have a value that uniquely identifies a user across multiple requests: their PHP session id. We can use it in addition to the request identifier:
// The first 8 characters of the session id are unique enough
session_start();
$sessId = substr(session_id(), 0, 8);
$requestId = substr(uniqid(), -8);
$logId = $sessId.'-'.$requestId;
$log->setEventItem('logId', $logId);
And here's what our logs look like now:
2011-05-16 19:04 [b74be12f-a73fe12a]: The user made some request
2011-05-16 19:04 [075aef5e-24f90ae1]: The user made some request
2011-05-16 19:04 [075aef5e-24f90ae1]: Here is some output of that request: Hello, World
2011-05-16 19:04 [b74be12f-a73fe12a]: Here is some output of that request: Hello, World
2011-05-16 19:04 [b74be12f-feb40a52]: Another user request
And there it is. We can use the log id in useful ways, such as displaying it to the user with a "contact tech support" message. When a user calls or emails and says their "support id" is "b74be12f-a73fe12a" we not only have the log messages for the request that gave them the error, but all the messages for their entire session:
# All the request log messages
> grep 'b74be12f-a73fe12a' /path/to/application.log
# All the session log messages
> grep 'b74be12f-' /path/to/application.log

Here's all the code together:
$format = "%timestamp% [%logId%]: %message%" . PHP_EOL;
$formatter = new Zend_Log_Formatter_Simple($format);

$writer = new Zend_Log_Writer_Stream('/path/to/application.log');
$writer->setFormatter($formatter);

$log = new Zend_Log();
$log->addWriter($writer);

session_start();
$sessId = substr(session_id(), 0, 8);
$requestId = substr(uniqid(), -8);
$logId = $sessId.'-'.$requestId;
$log->setEventItem('logId', $logId);

$log->info('The user made some request');
$log->info('Here is some output of that request: '.$output);

2011-05-03

Interview with Bulat Shakirzyanov

Today on Cal Evans' "Voices of the ElePHPant" there was an interview with Bulat Shakirzyanov. His comments about mocking PHP's built-in functions reminded me of this PHP built-in mocking library I had made a little while ago, then forgotten about. Bulat's example talks about `mkdir`, which the built-in mock library would handle, but for filesystem-specific functionality, I also highly recommend vfsStream. It allows you to create a virtual filesystem, complete with directory hierarchy and files with contents, all from within a unit test, and all without needing an actual filesystem.
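As a quick sketch of what that looks like (assuming vfsStream's `setup()` and `url()` API; the file names here are illustrative), a test can exercise file-handling code against a purely in-memory filesystem:

```php
// Register a virtual filesystem rooted at vfs://home
vfsStream::setup('home');

// The code under test writes to what it believes is a normal path
$logFile = vfsStream::url('home/app.log');
file_put_contents($logFile, "first entry\n");

// Assertions run against the virtual file; nothing to clean up afterward
echo file_get_contents($logFile);
```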

2011-04-22

Tropo-Turing Collision: Voice Chat Bots with Tropo

On the first evening of PHPComCon this year, Tropo sponsored a beer-and-pizza-fueled hackathon. Having never heard of Tropo before, I decided to check it out.

Tropo provides a platform for bridging voice, SMS, IM and Twitter data to create interactive applications in Javascript, PHP, Python and a couple other languages. An example would be the call-center navigation menus with which many people are familiar, or a voicemail system that sends a text message when a new message is received. But those are only the tip of the iceberg.

To explore a little deeper, I thought it would be neat to create a voice chat-bot. The idea would be that a caller could talk to an automated voice in a natural way, and the voice would respond in a relevant way and push the conversation along. Since I wasn't aiming for Turing test worthiness, a good starting point was ELIZA, one of the first automated chat-bots. I thought it was a pretty ambitious project, but it turns out that Tropo's system handles all of the functionality right out of the gate.

The docs do a great job of explaining setting up an account and creating an application, so I'm going to jump right into the code (in PHP).

I like to keep my code relatively clean and organized, with functions and classes in their own files. So the first thing I did was create a hosted file called "Eliza.php" with the following contents:
class Eliza
{
    public function respondsTo($statement = "")
    {
        $responses = array(
            "0" => "one",
            "1" => "two",
            "2" => "three",
            "3" => "four",
            "4" => "five",
            "5" => "six",
            "6" => "seven",
            "7" => "eight",
            "8" => "nine",
            "9" => "zero",
        );

        $response = "Please pick a number from 0 to 9";
        if (isset($responses[$statement])) {
            $response = $responses[$statement];
        }

        return $response;
    }

    public function hears($prompt)
    {
        $result = ask($prompt, array(
            "choices" => "[1 DIGIT]"
        ));
        $statement = strtolower($result->value);
        return $statement;
    }
}
It's important to note that the opening `<?php` and closing `?>` tags should be left out of this file, or the next bit will fail.

The main functionality of the application is in the `Eliza` class. Eliza will translate a user input string into a response, which it will then use to prompt the user. The `ask()` function in the `hears()` method is functionality provided by Tropo's system that takes care of the text-to-speech and speech-to-text aspect of prompting the caller, and then waits for the caller to respond. The `choices => [1 DIGIT]` option to `ask()` hints that we expect the user to respond to our prompt with a single 0-9 character.

Next, I created a file called "chatbot.php" with the following contents:
<?php
$url = "http://hosting.tropo.com/00001/www/Eliza.php";
$ElizaFile = file_get_contents($url);
eval($ElizaFile);

$eliza = new Eliza();
$statement = "";
do {
  $prompt = $eliza->respondsTo($statement);
  $statement = $eliza->hears($prompt);
  _log("They said ".$statement);
} while (true);
?>
This is the entry script for the application, which can be set on the "Application Settings" page. Unfortunately, it does not look like Tropo supports `require` and `include` in their system. Fortunately, they do provide URLs to download the contents of any hosted file via a simple `file_get_contents`. So we "include" our Eliza.php file by downloading its contents, then `eval`ing them into the running script's scope. Note: yes, I could have just written the contents of Eliza.php into the chatbot.php file, but a) it was more fun to find a way around that limitation :-) and b) many developers separate their code this way to keep it clean, encapsulated and reusable, and this demonstrates a way to accomplish that.

The code is fairly self-explanatory: "include" and instantiate Eliza, then enter a prompt-respond loop which lasts until the caller hangs up. `_log()` outputs to Tropo's built-in application debugger.

I have to say congratulations to Tropo for creating a platform that makes all this easy. I had this up and running (except the file include portion) in about 15 minutes. You can try this out for yourself by calling (919) 500-7747.

So now that prompt-response was working, I could get started on making a real Eliza chat-bot that would parse the caller's speech and provide an Eliza-like response:

Caller: I'm building a voice chat application.
Eliza: Tell me more about voice chat application.
Caller: You're it!
Eliza: How do you feel about that?
Caller: I think it's pretty neat.

There are probably a hundred Eliza implementations on the web, and at least a dozen are written in PHP. I grabbed the first one I found, shoved it into the `Eliza::respondsTo()` method, and called up my application.

And this is where I hit the iceberg. In order to accomplish what I wanted, I needed the `ask()` function to be able to accept and parse any spoken words into a string. This meant getting rid of the `choices => [1 DIGIT]` line. As soon as I called, Eliza started prompting me over and over for input, until Tropo's system killed the loop and hung up.

Luckily, a Tropo guy (who was great to talk to, but whose name I have unfortunately forgotten) informed me that `ask()` works by using the `choices` option as training for the speech-to-text parser. There is a way to do generalized speech-to-text, but it is incredibly processor-intensive and would have to make use of an asynchronous call. Later that evening, @akalsey confirmed that the current state of the technology (everyone's, not just Tropo's) makes generalized real-time speech-to-text processing impossible. So the dream of speaking with Eliza instead of just IMing with her dies unfulfilled, or is at least put on the shelf until technology catches up with ideas.

I did learn a good deal about Tropo's service, though, so I count the hackathon a success. Thanks again, Tropo, for the great new tool!

2011-04-14

Syntactic sugar for Vows.js

We end up writing quite a few full-stack integration tests. As the workflows for our projects get longer and more complex, the corresponding Zombie and Vows tests also get longer and more complex. Soon enough, a single test can be nested 15 or 20 levels deep, with a mix of anonymous functions and commonly-used macros.

Needless to say, these tests can become monsters to try and read through and understand:
var vows = require('vows'),
    assert = require('assert'),
    zombie = require('zombie');

vows.describe("Floozle Workflow").addBatch({
    "Go to login page" : {
        topic : function () {
            zombie.visit("http://widgetfactory.com", this.callback);
        },
        "then login" : {
            topic : function (browser) {
                browser
                  .fill("#username", "testuser")
                  .fill("#password", "testpass")
                  .pressButton("Login", this.callback);
            },
            "then navigate to Floozle listing" : {
                topic : function (browser) {
                    browser.click("Floozles", this.callback);
                },
                "should be on the Floozle listing" : function (browser) {
                    assert.include(browser.text("h3"), "Floozles");
                },
                "should have a 'Create Floozle' link" : function (browser) {
                    assert.include(browser.text("body"), "Create Floozle");
                },
                "then click 'Create Floozle'" : {
                    topic : function (browser) {
                        browser.click("Create Floozle", this.callback);
                    },
                    "then fill out Floozle form" : {
                        topic : function (browser) {
                            browser
                              .fill("#name", "Klaxometer")
                              .fill("#whizzbangs", "27")
                              .choose("#rate", "klaxes per cubic freep")
                              .pressButton("Save", this.callback);
                        },
                        // .... Continue on from there, with more steps
                    }
                }
            }
        }
    }
}).export(module);
So we created prenup, a syntactic-sugar library for easily creating Vows tests. Prenup provides a fluent interface for generating easy-to-read test structures, as well as the ability to reuse and chain testing contexts.

Here is the above test written with prenup:
var vows = require('vows'),
    prenup = require('prenup'),
    assert = require('assert'),
    zombie = require('zombie');

var floozleWorkflow = prenup.createContext(function () {
    zombie.visit("http://widgetfactory.com", this.callback);
})
.sub("then login", function (browser) {
    browser
      .fill("#username", "testuser")
      .fill("#password", "testpass")
      .pressButton("Login", this.callback);
})
.sub("then navigate to Floozle listing", function (browser) {
    browser.click("Floozles", this.callback);
})
.vow("should be on the Floozle listing", function (browser) {
    assert.include(browser.text("h3"), "Floozles");
})
.vow("should have a 'Create Floozle' link", function (browser) {
    assert.include(browser.text("body"), "Create Floozle");
})
.sub("then click 'Create Floozle'", function (browser) {
    browser.click("Create Floozle", this.callback);
})
.sub("then fill out Floozle form", function (browser) {
    browser
      .fill("#name", "Klaxometer")
      .fill("#whizzbangs", "27")
      .choose("#rate", "klaxes per cubic freep")
      .pressButton("Save", this.callback);
})
.root();

vows.describe("Floozle Workflow").addBatch({
    "Go to login page" : floozleWorkflow.seal()
}).export(module);
Every `sub()` call generates a new sub-context, and every `vow()` call attaches a vow assertion to the most recent context. `parent()` can be called to pop back up the chain and attach parallel contexts. If a context is saved, it can be reused in other chains. When `seal()` is called, it will generate the same structure as the original test.

More examples, including branching and reusing contexts, are given in the documentation. Prenup is available on NPM via `npm install prenup`.

2011-04-07

Dynamic Assets: Part III - Entry Forms

Now that we have instantiated our dynamically defined objects, we can get to the meat of our project: displaying a form based on the dynamic object.

My team uses Zend to build our forms, but the technique can easily be adapted to other frameworks. Since Zend_Form allows adding fields one at a time, we can easily build a form by looping through an Asset's field list:
$asset = $assetFactory->build('some_type');

$form = new Zend_Form();
// ... Set up our form action, method, etc. ...

$typeField = new Zend_Form_Element_Hidden('__asset_type');
$typeField->setValue($asset->getType())
    ->setRequired(true);
$form->addElement($typeField);

foreach ($asset->listFields() as $fieldName => $field) {
    if (!$field->isSubAsset()) {
        $fieldElement = new AssetFieldFormElement($field, $asset->$fieldName);
        $form->addElement($fieldElement);
    }
}

// ... Add a submit button, other form housekeeping ...
There are two interesting bits to the above snippet. Firstly, we intentionally skip rendering sub-asset fields as part of the form. We found it was easier to make the user create a parent asset before allowing them to assign sub-assets to it. This simplifies the form generation, display, validation and data persistence.

The second interesting bit is the introduction of a new class: AssetFieldFormElement. We give this class the field object for which to build form element(s), and the value of the field.

This class is where the heavy-lifting is done. The basic idea is to read the field information, instantiate a proper Zend_Form_Element object (or set of objects) and set the properties of that element. The rest of the class is mainly methods to pass through calls to things like "addValidator()", "render()", etc.
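The full class is too long to include here, but its core dispatch looks roughly like this. This is a hypothetical sketch: the getter counterparts to the `AssetField` setters shown in this series are assumed, and the element-selection rules are simplified.

```php
class AssetFieldFormElement
{
    protected $element;

    public function __construct($field, $value = null)
    {
        $name = $field->getName();

        if ($field->getOptions() && !$field->isCollection()) {
            // A plain list of options becomes a drop-down
            $this->element = new Zend_Form_Element_Select($name);
            $options = $field->getOptions();
            $this->element->setMultiOptions(array_combine($options, $options));
        } else {
            // Default case: a basic text box
            $this->element = new Zend_Form_Element_Text($name);
        }

        $this->element->setLabel(ucwords(str_replace('_', ' ', $name)));
        $this->element->setValue($value);
    }

    // Pass everything else (addValidator, render, etc.) through to Zend
    public function __call($method, $args)
    {
        return call_user_func_array(array($this->element, $method), $args);
    }
}
```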

Using this class, we can turn any given AssetField object into a wide variety of HTML form elements. The simplest is a basic text box:
$field = new AssetField();
$field->setName("first_name");
$element = new AssetFieldFormElement($field, "Zaphod");
Rendered HTML:
<label>First Name</label> <input name="first_name" type="text" value="Zaphod" />

If the field has a list of options, it renders as a drop-down. For things like the "date" type, we have a form element which will automatically attach a date-picker to the text box.
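For example, a field with a fixed option list (the option values here are illustrative) would come out as a select element:

```php
$field = new AssetField();
$field->setName("rate")
    ->setOptions(array("klaxes per cubic freep", "freeps per klax"));
$element = new AssetFieldFormElement($field, "freeps per klax");
```

This would render as a `<select name="rate">` with each option listed and "freeps per klax" pre-selected.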

In a lot of cases, a single field will render as multiple form elements. For instance, sometimes a field's value can be an array of strings:
$field = new AssetField();
$field->setName("favorite_restaurants")
    ->setCollection(true);
$element = new AssetFieldFormElement($field, array("Milliways", "Stavro Mueller Beta"));
Rendered HTML:
<label>Favorite Restaurants</label>
<input name="favorite_restaurants[]" type="text" value="Milliways" />
<input name="favorite_restaurants[]" type="text" value="Stavro Mueller Beta" />
<input name="favorite_restaurants[]" type="text" value="" /><a href="#" class="adder" rel="favorite_restaurants">[+]</a>

A bit of Javascript on the "[+]" link allows the user to add more text boxes as needed (we can use the AssetField's "max" property to limit that if need be.)

Hang on, what if we have a list of options for the user to choose from, but they can only choose each option once, and they need to be able to add their own if necessary? Not a problem:
$field = new AssetField();
$field->setName("drinks")
    ->setCollection(true)
    ->setUnique(true)
    ->setOptions(array("jynnan tonnyx", "jinond-o-nicks", "Ouisghian Zodah"))
    ->setOther("Something else?");
$element = new AssetFieldFormElement($field, array("jinond-o-nicks", "Pan Galactic Gargleblaster"));
Rendered HTML:
<label>Drinks</label>
<input name="drinks[]" type="checkbox" value="jynnan tonnyx" /> jynnan tonnyx
<input name="drinks[]" type="checkbox" value="jinond-o-nicks" checked="true" /> jinond-o-nicks
<input name="drinks[]" type="checkbox" value="Ouisghian Zodah" /> Ouisghian Zodah
<label>Something else?</label> <input name="drinks[]" type="text" value="Pan Galactic Gargleblaster" />

A wide variety of form effects can be achieved through combining different asset field flags.

The other function that AssetFieldFormElement performs is adding appropriate validators to each form element. This allows us to use `$form->isValid($params)` just as we do with our application's static forms.

This is the last of the posts on dynamic class definition. I hope it was helpful.