2012-05-14

Neo4jPHP Available as a Composer Package

Neo4jPHP is now available as a Composer package. To use it in your project, create a "composer.json" file in the root of your project, with the following contents:
{
  "require":{
    "everyman/neo4jphp": "dev-master"
  }
}
After installing Composer, run the following command in the root of the project:
> php composer.phar install
The Neo4jPHP library will be downloaded and installed in your project under "vendor/everyman/neo4jphp". An autoloader will be created at "vendor/autoload.php" which you should include in your project, then start using Neo4jPHP as normal:
require("vendor/autoload.php");

$client = new Everyman\Neo4j\Client();
$node = $client->makeNode();
To get the latest version of Neo4jPHP, run:
> php composer.phar update

2012-04-23

Runtime Expectations JavaScript Throwdown

Last week, TriangleJS were the guests on the talk-show podcast Runtime Expectations. The topic was a JavaScript framework "throwdown" which covered a wide range of different frameworks and toolkits. We even discussed if frameworks were necessary at all. Ben and Adrian are great guys and it was a lot of fun being on their show and hanging out for beers afterwards.

If you want to hear me make a total stuttering ass out of myself, I start rambling around 23:34 then fade out so more informed people can talk.

2012-04-02

Data Mapper Injection in PHP Objects

I was trying to figure out a simple way to have data from a database injected into a domain object in PHP. I don't want to enforce that the domain object implements setters and getters for every property, and the burden on the developer for implementing the data injection interface should be as low as possible.

My solution relies on PHP's ability to pass and assign values by reference. The data mapper is injected into the domain object. The domain object tells the mapper to attach to the object's properties. The mapper can then read and write to those properties as if they were its own.
interface Mappable {
    public function injectMapper(Mapper $mapper);
}

class Mapper {
    protected $attached = array();

    public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }

    public function fromDataStore($data) {
        foreach ($data as $fieldName => $value) {
            if (array_key_exists($fieldName, $this->attached)) {
                $this->attached[$fieldName] = $value;
            }
        }
    }

    public function fromDomainObject() {
        $data = array();
        foreach ($this->attached as $fieldName => $value) {
            $data[$fieldName] = $value;
        }
        return $data;
    }
}

class Product implements Mappable {
    protected $id;
    protected $name;

    public function injectMapper(Mapper $mapper) {
        $mapper->attach($this->id, 'id');
        $mapper->attach($this->name, 'name');
    }

    // ...domain/business logic methods go here...
}

// Create a product and manipulate it
$product = new Product();
$product->doFoo();
$product->linkToBar();

// Inject the data mapper
$mapper = new Mapper();
$product->injectMapper($mapper);
// This could be a call to save the array of data in a database
print_r($mapper->fromDomainObject());

// Read a product from the "database"
$product = new Product();
$mapper = new Mapper();
$product->injectMapper($mapper);
$mapper->fromDataStore( array("id"=>123, "name"=>"Widgets") );
So what's happening here? The fields of the Product domain object are passed into the Mapper's `attach` method by reference. This allows that method to modify the variable passed in without passing it back to the caller.
    public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }
If the value were only passed by reference, only the `attach` method would be able to modify it. As soon as the method returned, the reference would be lost. In order to continue having a reference to the Product's properties, the `attach` method holds the reference in an array using "assign by reference."

Note the ampersand after the equals sign. This tells PHP that we don't want to assign a copy of the value in $ref; we want a reference to $ref itself. This allows us to later modify the value of a property in Product by writing through a reference to the same place in memory where that property is held.
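To see the difference in isolation, here is a minimal sketch, separate from the mapper itself. The `Holder` class and its method names are made up for illustration; the only thing it relies on is PHP's standard pass-by-reference and assign-by-reference behavior.

```php
<?php
// Hypothetical helper class, for illustration only.
class Holder {
    public $byValue = array();
    public $byRef = array();

    // $ref arrives by reference, but a plain "=" stores a copy of its value.
    public function keepCopy(&$ref, $name) {
        $this->byValue[$name] = $ref;
    }

    // "= &$ref" stores a reference to the caller's variable.
    public function keepReference(&$ref, $name) {
        $this->byRef[$name] = &$ref;
    }
}

$price = 10;
$holder = new Holder();
$holder->keepCopy($price, 'price');
$holder->keepReference($price, 'price');

// Both methods have returned; now write through the stored entries.
$holder->byValue['price'] = 20;   // changes only the stored copy
echo $price . "\n";               // 10
$holder->byRef['price'] = 30;     // changes the caller's variable
echo $price . "\n";               // 30
```

The second write is the one the Mapper depends on: the array element and the original variable are the same storage, so assigning to one is visible through the other.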

So now we have the beginnings of a basic data mapper that remains loosely coupled to our domain objects, and without having to write an explicit mapping class for each domain object-to-database mapping.

2012-03-04

GetSet Methods vs. Public Properties

I was recently having a debate with a coworker over the utility of writing getter and setter methods for protected properties of classes. On the one hand, having getters and setters seems like additional boilerplate and programming overhead for very little gain. On the other hand, exposing the value properties of a class seems like bad encapsulation and will overall lead to code that is more difficult to maintain.

I come down firmly on the get/set method side of the fence. It takes very little extra code, defines the interface to the class explicitly and makes the class easier to extend and maintain in the long run. Let's try an example (in PHP, though the premise holds for any language with public/private properties).
class PublicValues {
    public $foo;
    public $bar;
}

class GetSet {
    protected $foo;
    protected $bar;

    public function getFoo() {
        return $this->foo;
    }

    public function setFoo($value) {
        $this->foo = $value;
    }

    public function getBar() {
        return $this->bar;
    }

    public function setBar($value) {
        $this->bar = $value;
    }
}

$public = new PublicValues();
$public->foo = array('some'=>'value', 'data'=>'structure');
$public->bar = 123;

$getset = new GetSet();
$getset->setFoo(array('some'=>'value', 'data'=>'structure'));
$getset->setBar(123);

echo "Public is: ".$public->bar." and ".print_r($public->foo, true)."\n";
echo "GetSet is: ".$getset->getBar()." and ".print_r($getset->getFoo(), true)."\n";
Obviously, class `PublicValues` is much shorter than class `GetSet`. Shorter is better, right? Less code to maintain, fewer places for bugs to creep in. And it didn't take me nearly as long to type (though, truthfully, `GetSet` didn't take that long to write, either, as it was mostly cut-and-paste.)

Now let's say that we have decided to wrap the array data structure in a new `Widget` class and we need to ensure that `foo` can only be of type `Widget`. Modify the classes to make this so:
class PublicValues {
    // ... add a whole new method
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
}

class GetSet {
    // ... only change here is type-hinting
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
    // ...
}

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
$getset->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
`PublicValues` had an entirely new method added to it. `GetSet` only had a small modification to an existing method.

Actually, this still isn't perfect; application code can still say `$public->foo = 'whateverIWant';` without going through the `PublicValues::setFoo` method. In fact, while writing this example, I forgot until my third revision to go back and change the sample code to use the set method. Let's make sure that doesn't happen in the future:
class PublicValues {
    // ... change foo so that client code can't set it to the wrong thing
    protected $foo;

    // ... still need to be able to read foo, so now we add a getter
    public function getFoo() {
        return $this->foo;
    }
    // ...
}

// GetSet remains the same ...

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
echo "Public is: ".$public->bar." and ".print_r($public->getFoo(), true)."\n";
Notice that `GetSet` didn't have to change at all in this case, and the only client code that had to change for it was any code calling `GetSet::setFoo`, which now has to provide a `Widget`.

As for `PublicValues`, we'll have to go through all the code making sure that everywhere that was setting `foo` with assignment now calls the set method (with a proper `Widget`), and anywhere that was reading it directly calls the get method.

My future self is screaming at my current self. Now we have two ways of accessing the PublicValues class: one way through direct value reading and writing (bar), and the other through getset methods (foo). As a consumer of this class, I have no way of knowing which properties are openly accessible and which are guarded through getset methods, other than going back and looking through the code.

To solve this problem, I am going to use a method that I see in almost every PHP project of non-trivial size that I have ever worked with or seen the source code of: use magic __get and __set to determine if the property is directly accessible or not:
class PublicValues {
    // ...

    public function __get($name) {
        $method = 'get'.$name;
        if (method_exists($this,$method)) {
            return $this->$method();
        } else {
            return $this->$name;
        }
    }

    public function __set($name, $value) {
        $method = 'set'.$name;
        if (method_exists($this,$method)) {
            return $this->$method($value);
        } else {
            $this->$name = $value;
        }
    }
}
As before, `GetSet` doesn't change one bit. For `PublicValues`, the rest of the code can continue to use what looks like straight assignment, and if I want to know if the assignment is direct modification or a magic set, I need to look in only two places to figure it out: look for a named getset method, or assume I am using the magic __get and __set. Perfect...

...until we have to add something to this class that absolutely must not be modified by client code.
class PublicValues {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}

class GetSet {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}
Now we have a big problem. Clients of `GetSet` are protected, because nothing in `GetSet` can be modified except through a class method. We don't provide a method for accessing the dangerous property, so clients can't hurt themselves.

But clients of `PublicValues` can modify any property they want. The magic __set method makes public any protected value that does not have an explicit set method. We could fix this by adding a set method that throws an exception. Or we could solve it by blacklisting that property in the magic __set. Or we could make sure everyone (including new hires six months from now) knows not to touch this property. Or we could solve it by...

Truthfully, it doesn't matter how you solve it. Encapsulation of `PublicValues` has been broken from the beginning. We haven't even tried adding validation to either class, or making the setting of one value affect another, or adding read- or write-only methods. And over the course of just a few changes, we've had to add get/set methods anyway. We've just done it later in the application lifecycle after client code is already using the class, the time when it is most costly to make those sorts of changes.
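For the curious, the blacklisting option might look roughly like the sketch below. The `$guarded` list and the choice of `LogicException` are my own assumptions for illustration, and the getset method lookup from the earlier example is left out to keep it short.

```php
<?php
// Sketch only: the $guarded list and the exception type are assumptions.
class PublicValues {
    protected $bar;
    protected $modificationToThisCrashesTheApp = 'safe';

    // Properties that client code must never write through the magic __set.
    // The list guards itself, otherwise clients could simply empty it.
    protected $guarded = array('modificationToThisCrashesTheApp', 'guarded');

    public function __get($name) {
        return $this->$name;
    }

    public function __set($name, $value) {
        if (in_array($name, $this->guarded, true)) {
            throw new LogicException("$name cannot be modified by client code");
        }
        $this->$name = $value;
    }
}

$public = new PublicValues();
$public->bar = 123;   // allowed: falls through to direct assignment

try {
    $public->modificationToThisCrashesTheApp = 'oops';
} catch (LogicException $e) {
    echo $e->getMessage() . "\n";
}
```

It works, but it is one more list that somebody has to remember to update every time a sensitive property is added.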

In a blog post, it's easy to see these changes coming in a few paragraphs and account for them mentally. But imagine this in a large system, one that is constantly having features added, bugs fixed and maintenance performed over the course of months or years. Every change in this example could have come months apart. I don't expect myself or my fellow programmers to keep track of the intricacies of a single class over that timeframe, especially if that class is modified rarely, is one class in hundreds and we've been working in thousands of lines of other code in the meantime.

So do yourself a favor and take the 20 seconds to wrap your properties in boilerplate getset methods from the start. Not because it's a best practice, but because future you will be grateful.

Here is the meme-pic that started the debate:



2012-02-23

Similarity-based Recommendation Engines

I am currently participating in the Neo4j-Heroku Challenge. My entry is a -- as yet, unfinished -- beer rating and recommendation service called FrostyMug. All the major functionality is complete, except for the actual recommendations, which I am currently working on. I wanted to share some of my thoughts and methods for building the recommendation engine.

My starting point for the recommendation engine was a blog post by Marko Rodriguez, A Graph-Based Movie Recommender Engine. Marko's approach is pretty straightforward and he does an excellent job of explaining the subtleties of collaborative and content-based filtering. It was also a great way to get acquainted with the Gremlin graph traversal API, which I have been unwisely avoiding for fear of it being too complicated.

Basic collaborative filtering works like this: if I rate a beer, say, Lonerider Sweet Josie Brown Ale as 9 out of 10, and you rate that same beer as 8 out of 10, then I should be able to get recommendations based on other beers that you (and others who have rated that beer similarly to me) have rated highly.

I think we can improve on this. You and I happen to have rated one beer the same way. What happens if that is the only beer we have rated similarly? What if I have rated Fullsteam Carver Sweet Potato Ale at 8 out of 10, but you have rated it 2 out of 10? You have shown that you have poor taste, so why should I get recommendations based on your other ratings, simply because we happened to rate a single beer similarly?

With a small modification, we can overcome this problem. Instead of basing recommendations off of one similar rating, I can calculate how similarly you and I rated all the things we have rated, and only get recommendations from you if I have determined we are similar enough in our tastes. Call this "similarity-based collaborative filtering."

Finding similar users works like this:
  1. Find all users who have rated the same beers as I have.
  2. For each rated beer, calculate the differences between my rating and that user's rating.
  3. For each user, average the differences.
  4. Return all users where the average difference is less than or equal to some tolerance (2 in the example below).
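Stripped of any graph, the four steps can be sketched with plain PHP arrays. This is only an illustration: the users, beer names, ratings and the `similarUsers` function are made up, and it says nothing about how the data is actually stored.

```php
<?php
// Hypothetical data: user => (beer => rating out of 10)
$ratings = array(
    'me'    => array('Sweet Josie' => 9, 'Carver' => 8, 'Stout X' => 6),
    'alice' => array('Sweet Josie' => 8, 'Carver' => 7),
    'bob'   => array('Sweet Josie' => 8, 'Carver' => 2),
    'carol' => array('Pils Y' => 5),
);

// Returns user => average absolute rating difference,
// for every user whose average is within $tolerance of the target.
function similarUsers(array $ratings, $target, $tolerance) {
    $similar = array();
    foreach ($ratings as $user => $theirRatings) {
        if ($user === $target) {
            continue;
        }
        // Step 1: only beers that both users have rated
        $common = array_intersect_key($ratings[$target], $theirRatings);
        if (count($common) === 0) {
            continue;
        }
        // Step 2: difference between the two ratings of each common beer
        $total = 0;
        foreach ($common as $beer => $myRating) {
            $total += abs($myRating - $theirRatings[$beer]);
        }
        // Step 3: average the differences
        $average = $total / count($common);
        // Step 4: keep the user only if the average is within the tolerance
        if ($average <= $tolerance) {
            $similar[$user] = $average;
        }
    }
    return $similar;
}

print_r(similarUsers($ratings, 'me', 2));
```

With this sample data only alice comes back. Bob agrees with me on one beer but is 6 points off on the other, so his average difference of 3.5 is over the tolerance.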

Let's see what this looks like in Gremlin, a domain-specific graph traversal language in Groovy. Gremlin is built on the concept of "pipes" where a single chain of operations runs from beginning to end on each element put into the pipe. The elements in our case are the user nodes, beer nodes and rating edges that connect them in our graph:
// We're going to use this again and again, so make a single "step" for it,
// to which we can pass a "similarity threshold"
Gremlin.defineStep('similarUser', [Vertex,Pipe], {Integer threshold ->

    // Capture the incoming pipe
    target = _();

    // Set up a table to hold ratings for averages
    m=[:].withDefault{[0,0]};

    // Find all beers rated by the target, and store each rating
    target.outE("RATED").sideEffect{w=it.rating}

    // For each rated beer, find all the users who have also rated that beer
    // and who are not the target user, then go back to their ratings
    .inV.inE("RATED").outV.except(target).back(2)

    // At this point in the pipeline, we are iterating over
    // only the ratings of beers that both users have rated.
    // For each rating, calculate the difference between
    // the target user's rating and the compared user's rating
    .sideEffect{diff=Math.abs(it.rating-w)}

    // For each compared user, store the number of beers rated in common
    // and the total difference for all common ratings. 
    .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; }.iterate();

    // Find the average difference for each user and filter
    // out any where the average is greater than the threshold.
    m.findAll{it.value[1]/it.value[0] <= threshold}.collect{g.v(it.key)}._()
});

// Return all the users who are similar to our target
// with an average rating difference of 2 or less
g.v(123).similarUser(2)

The result of this script is a list of all user nodes that are considered "similar" to the target user. Now we can get more accurate recommendations, based on users whose tastes are similar to ours. Given users who are similar to me, recommend beers that they rated highly:
  1. Find all beers rated by similar users.
  2. For each beer, average the similar users' ratings.
  3. Sort the beers by highest to lowest average and return the top 25.
r=[:].withDefault{[0,0]};

// Find similar users 
g.v(123).similarUser(2)

// Find all outgoing beer ratings by similar users
// and the beer they are rating
.outE("RATED").sideEffect{rating=it.rating}.inV

// Store the ratings in a map for averaging
.sideEffect{me=r[it.id]; me[0]++; me[1]+=rating; }.iterate();

// Transform the map into a new map of beer id : average
r.collectEntries{key, value -> [key , value[1]/value[0]]}

// sort by highest average first, then take the top 25
.sort{a,b -> b.value <=> a.value}[0..24]

// Transform the map back into a mapping of beer node and rating
.collect{key,value -> [g.v(key), value]}
Now I only get recommendations from users whose tastes are similar to mine.
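The same three steps, again as a rough plain-PHP sketch outside the graph. The `$similarRatings` data and the `recommend` function are hypothetical; the input is assumed to contain only ratings from users already judged similar.

```php
<?php
// Hypothetical ratings by similar users: user => (beer => rating)
$similarRatings = array(
    'alice' => array('Porter A' => 9, 'IPA B' => 6),
    'dave'  => array('Porter A' => 7, 'IPA B' => 8, 'Saison C' => 10),
);

// Returns beer => average rating, highest first, at most $limit entries.
function recommend(array $similarRatings, $limit) {
    $totals = array();   // beer => array(number of ratings, sum of ratings)
    foreach ($similarRatings as $beers) {
        foreach ($beers as $beer => $rating) {
            if (!isset($totals[$beer])) {
                $totals[$beer] = array(0, 0);
            }
            $totals[$beer][0]++;
            $totals[$beer][1] += $rating;
        }
    }

    // Turn the running totals into beer => average
    $averages = array();
    foreach ($totals as $beer => $pair) {
        $averages[$beer] = $pair[1] / $pair[0];
    }

    arsort($averages);   // highest average first, keys preserved
    return array_slice($averages, 0, $limit, true);
}

print_r(recommend($similarRatings, 25));
```

Like the list above, this does not bother to exclude beers the target user has already rated; a real service would probably want to.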

I can also calculate an estimated rating for any beer that I have not rated. To see how that might be useful, consider Netflix. When you look at any movie on Netflix that you have not already rated, they show you their best guess for how you would like that movie. This can be helpful in determining which of a set of movies you want to watch, or in our case, which beer you might want to try given limited pocket money. Again, we can use our similar users to figure out the estimated rating of a given beer:
  1. Find all ratings of the target beer by similar users.
  2. Return the average of their ratings of that beer.
// Grab our target beer
beer=g.v(987)

// Find similar users 
g.v(123).similarUser(2)

// For each similar user, find all that have rated the target beer
.outE("RATED").inV.filter{it==beer}

// Go back to their ratings, and average them
.back(2).rating.mean()
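For comparison, the two steps in plain PHP. As before, the data and the `estimateRating` function are made up for illustration, and returning null when no similar user has rated the beer is my own choice.

```php
<?php
// Hypothetical ratings by similar users: user => (beer => rating)
$similarRatings = array(
    'alice' => array('Porter A' => 9, 'IPA B' => 6),
    'dave'  => array('Porter A' => 7),
    'erin'  => array('IPA B' => 8),
);

// Average of the similar users' ratings for one beer,
// or null when none of them has rated it.
function estimateRating(array $similarRatings, $beer) {
    $found = array();
    foreach ($similarRatings as $beers) {
        if (isset($beers[$beer])) {
            $found[] = $beers[$beer];
        }
    }
    if (count($found) === 0) {
        return null;
    }
    return array_sum($found) / count($found);
}

var_dump(estimateRating($similarRatings, 'Porter A'));
var_dump(estimateRating($similarRatings, 'Unknown Ale'));
```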
So now we have a simple recommendation engine based on ratings by users with similar tastes. There are a few improvements that could still be made:
  • Don't consider a user "similar" unless they have rated at least 10 beers in common (or some other threshold). This prevents users who have only rated a few beers in common from being falsely labeled as similar.
  • Calculate the similar users ahead of time, and store them with a "SIMILAR" relationship to the target user. Periodically recalculate the similar users to keep the list fresh.
  • Use random sampling or size limiting on the number of users and ratings checked when determining similarity, trading accuracy for performance.
Many thanks to Marko for his help on the mailing list, for giving me some pointers in writing this, and for being a pretty awesome guy in general.

Update: Luanne Misquitta posted about how to do similarity-based recommendations with Cypher. Thanks, Luanne!

2012-01-14

Command Invoker Pattern with the Open/Closed Principle

Over on DZone, Giorgio Sironi demonstrates the "Open/Closed Principle on real world code". The pattern demonstrated is similar to the Command Pattern, and the post does a good job of introducing a class that is open for extension but closed for modification. Giorgio mentions that there is still more work to be done. I thought I might take a stab at creating a flexible and extendable command invocation solution.

One of the issues that should be addressed is that the invoker (the Session class in the post) is only extensible by actually extending the class. If we want to add custom methods to it, we have two options: make our own sub-class, which should be avoided when favoring composition over inheritance; or modify the base class's constructor, which violates the open/closed principle.

Another thing that could be improved is that the invoker violates the Single-Responsibility principle (the "S" in SOLID). It needs to know not just how to execute commands, but also how to construct them. As the list of available commands grows, and the constructors for each command become more complex, the invoker needs to contain more logic to build each command type.

Let's overcome some of these issues, and also make the code even more extensible. I'll use a simplified command invoker to demonstrate.

To start, let's remove the complexity of requiring the invoker to know how to build each command object. A simple way to do this is to move the factory methods into the individual command classes. We'll also define a simple interface to ensure that our commands are executable. Here are two simple commands. Note that the constructor is not part of the interface, which allows the command classes to have completely different constructor method signatures.
interface Command {
 public static function register();

 public function execute();
}

class HelloCommand implements Command {
 protected $name;

 public static function register() {
  return function ($name) {
   return new HelloCommand($name);
  };
 }

 public function __construct($name) {
  $this->name = $name;
 }

 public function execute() {
  echo "Hello, " . $this->name . "\n";
 }
}

class PwdCommand implements Command {

 public static function register() {
  return function () {
   return new PwdCommand();
  };
 }

 public function execute() {
  echo "You are here: " . getcwd() . "\n";
 }
}
Each command class has a static `register` method that returns an anonymous function which knows how to instantiate the class. We'll call this the "factory" function. In the example, the factory function signatures are the same as the constructors, but that is not necessary; as long as the factory function can construct a valid object, it can take whatever parameters it needs.

Our commands can construct themselves, so the invoker only has to know how to map its own method calls to the commands. Here is the simplified Invoker class:
class Invoker {
 protected $commandMap = array();

 public function __construct() {
  $this->register('hello', 'HelloCommand');
  $this->register('pwd', 'PwdCommand');
 }

 public function __call($command, $args) {
  if (!array_key_exists($command, $this->commandMap)) {
   throw new BadMethodCallException("$command is not a registered command");
  }
  $command = call_user_func_array($this->commandMap[$command], $args);
  $command->execute();
 }

 public function register($command, $commandClass) {
  if (!array_key_exists('Command', class_implements($commandClass))) {
   throw new LogicException("$commandClass does not implement the Command interface");
  }
  $this->commandMap[$command] = $commandClass::register();
 }
}
The important bit is the `register` method. It checks that the registered class is indeed a Command, and thus, that it will have both an `execute` and a `register` method. We then invoke the `register` method, and store the factory function for that command.

In the magic `__call` method, we have to use `call_user_func_array` since we don't know the real signature of the called factory function. If we wanted, we could have the `__call` method return the command object instead of executing it. That would turn the Invoker into an Abstract Factory for command objects.
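A rough sketch of that factory variant follows. The `CommandFactory` name is made up for illustration, and the Command interface and HelloCommand are repeated only so the snippet runs on its own.

```php
<?php
interface Command {
    public static function register();
    public function execute();
}

class HelloCommand implements Command {
    protected $name;

    public static function register() {
        return function ($name) {
            return new HelloCommand($name);
        };
    }

    public function __construct($name) {
        $this->name = $name;
    }

    public function execute() {
        echo "Hello, " . $this->name . "\n";
    }
}

// Hypothetical variant of Invoker: it builds commands
// but leaves execution to the caller.
class CommandFactory {
    protected $commandMap = array();

    public function register($command, $commandClass) {
        if (!array_key_exists('Command', class_implements($commandClass))) {
            throw new LogicException("$commandClass does not implement the Command interface");
        }
        $this->commandMap[$command] = $commandClass::register();
    }

    public function __call($command, $args) {
        if (!array_key_exists($command, $this->commandMap)) {
            throw new BadMethodCallException("$command is not a registered command");
        }
        // Return the constructed command instead of calling execute()
        return call_user_func_array($this->commandMap[$command], $args);
    }
}

$factory = new CommandFactory();
$factory->register('hello', 'HelloCommand');
$command = $factory->hello('Josh');
$command->execute();   // the caller decides when (or whether) to run it
```

Returning the object makes it possible to queue, log or decorate commands before running them, at the cost of the caller having to remember to call `execute`.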

The main point is that Invoker never cares exactly how commands are constructed. Invoker now only has one responsibility: invoking commands.

We execute commands by calling Invoker methods with parameters that match the factory function signature of the command being executed:
$invoker = new Invoker();
$invoker->hello('Josh');
$invoker->pwd();
By making the Invoker's `register` method public, we've also opened up the Invoker to allow a client to extend its functionality without having to modify the Invoker itself:
class CustomCommand implements Command {
 protected $foo;
 protected $bar;

 public static function register() {
  return function ($foo, $bar) {
   return new CustomCommand($foo, $bar);
  };
 }

 public function __construct($foo, $bar) {
  $this->foo = $foo;
  $this->bar = $bar;
 }

 public function execute() {
  echo "What does it all mean? " . ($this->foo + $this->bar) . "\n";
 }
}

$invoker->register('custom', 'CustomCommand');
$invoker->custom(40, 2);
We can even let clients overwrite existing commands by re-registering with a custom class:
$invoker->register('pwd', 'CustomCommand');
$invoker->pwd(2, 40);
If we don't want to allow this, we can build more checks into `register` that prevent overwriting built-in commands.
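One possible version of that check, as a sketch: the `$builtIn` list, the rule that "whatever the constructor registers is built in" and the `NoisyPwdCommand` class are all assumptions made for the example, not part of the code above.

```php
<?php
interface Command {
    public static function register();
    public function execute();
}

class PwdCommand implements Command {
    public static function register() {
        return function () {
            return new PwdCommand();
        };
    }

    public function execute() {
        echo "You are here: " . getcwd() . "\n";
    }
}

// Hypothetical replacement a client might try to register over 'pwd'
class NoisyPwdCommand implements Command {
    public static function register() {
        return function () {
            return new NoisyPwdCommand();
        };
    }

    public function execute() {
        echo "YOU ARE HERE: " . strtoupper(getcwd()) . "\n";
    }
}

class Invoker {
    protected $commandMap = array();
    protected $builtIn = array();

    public function __construct() {
        $this->register('pwd', 'PwdCommand');
        // Assumption: whatever the constructor registers counts as built in
        $this->builtIn = array_keys($this->commandMap);
    }

    public function __call($command, $args) {
        if (!array_key_exists($command, $this->commandMap)) {
            throw new BadMethodCallException("$command is not a registered command");
        }
        $command = call_user_func_array($this->commandMap[$command], $args);
        $command->execute();
    }

    public function register($command, $commandClass) {
        if (in_array($command, $this->builtIn, true)) {
            throw new LogicException("$command is a built-in command and cannot be overwritten");
        }
        if (!array_key_exists('Command', class_implements($commandClass))) {
            throw new LogicException("$commandClass does not implement the Command interface");
        }
        $this->commandMap[$command] = $commandClass::register();
    }
}

$invoker = new Invoker();
try {
    $invoker->register('pwd', 'NoisyPwdCommand');
} catch (LogicException $e) {
    echo $e->getMessage() . "\n";
}
$invoker->pwd();   // still runs the original PwdCommand
```

Clients can still register and re-register their own command names; only the names claimed in the constructor are locked.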

So now we have an extendable, flexible command invocation system that requires no modification to the actual command runner. This could also be used as the basis for a plug-in system.

Full sample code is available at http://gist.github.com/1610148





2012-01-03

Funkatron's MicroPHP Manifesto

Just saw this over on @funkatron's blog:

The MicroPHP Manifesto:

I am a PHP developer
  • I am not a Zend Framework or Symfony or CakePHP developer
  • I think PHP is complicated enough

I like building small things
  • I like building small things with simple purposes
  • I like to make things that solve problems
  • I like building small things that work together to solve larger problems

I want less code, not more
  • I want to write less code, not more
  • I want to manage less code, not more
  • I want to support less code, not more
  • I need to justify every piece of code I add to a project

I like simple, readable code
  • I want to write code that is easily understood
  • I want code that is easily verifiable

I agree that there are "and-the-kitchen-sink" frameworks out there that do too much. I also agree that having every developer re-invent the wheel is not the ideal situation. The main problem I see with the macro-frameworks is that they tend to not allow me to strip out their functionality and replace it with another library when I need to. I much prefer smaller libraries and components that I can mix and match if necessary. Isn't that the whole point of code re-use in the first place?

So maybe this isn't so much about huge frameworks being bad as it is that they should be built off small, interchangeable components? Wasn't there a call for more inter-operability a little while back?

Funkatron has made it pretty clear that these are his personal preferences, but that hasn't stopped the haters.