2012-11-21

Migrating to Dependency Injection

Recently, I gave a lunch-and-learn to my team on the topic of Dependency Injection (DI). Instead of showing a bunch of slides explaining what DI is and what it's good for, I created a small project and demonstrated the process of migrating a codebase that does not use DI to one that does.

Each stage of the project is a different tag in the repository. The code can be found on GitHub: http://github.com/jadell/laldi. Check out the code and run composer install. To see the code at each step in the process, run the git checkout command in the header of each section.

git checkout 0-initial


The initial stage of the project is a listing of developers on the team, and with whom each has pair programmed. The project uses Silex, which is both a PHP micro-framework and a simple dependency injection container. For the initial stage, only the framework is used (for HTTP request handling, routing and dispatching.) Each route handler is registered in index.php. A handler instantiates a controller object, then calls a method on the controller to handle the request. For this project, there is only one controller, the HelloController, but there could be many others.

When the HelloController is instantiated, it instantiates a UserRepository to handle access to User domain objects and a View to handle rendering the output. The UserRepository instantiates a Datasource to handle access to the raw data. The path to the data file is hardcoded in the Datasource. User objects instantiate a UserRepository of their own, so they can access the list of paired Users.


The code smell here is the chain of instantiations triggered by constructing a new HelloController, which leads to several problems: it tightly couples every component to every other component; none of the components can be unit-tested in isolation from the others; we can't change the behavior of a class or its dependencies without modifying the class; and if the constructor signature of any component changes, we need to find each instantiation of that component and change it to pass in the new parameters (which might involve threading those parameters through multiple layers which don't require them.) There may also be bugs caused by multiple objects having their own copies of dependencies (for example, each User object should probably share a UserRepository, not have its own.)

git checkout 1-add-di


A simple way to solve these problems is to ensure that no class is responsible for creating its own dependencies. In order for a class to access its dependencies, we pass those dependencies into the class's constructor (another form of DI uses "setter" injection instead of constructor injection.)
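Concretely, constructor injection looks like this. The class names follow the project, but the bodies are simplified sketches of the idea, not the actual laldi code:

```php
<?php
// Each class receives its dependencies instead of creating them.
class Datasource {
    protected $path;
    public function __construct($path) {
        $this->path = $path; // the path is passed in, no longer hardcoded
    }
}

class UserRepository {
    protected $datasource;
    public function __construct(Datasource $datasource) {
        $this->datasource = $datasource;
    }
}

class HelloController {
    protected $users;
    public function __construct(UserRepository $users) {
        $this->users = $users; // received, never instantiated here
    }
}

// The caller (eventually the DIC) is now responsible for the wiring:
$controller = new HelloController(
    new UserRepository(new Datasource('/tmp/pairs.json'))
);
```

Notice that changing the Datasource constructor now only affects the one place that wires things together, not every class in the chain.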


The problem now is that all the dependencies must be wired together. Since no class instantiates its own dependencies, we must have a component that does this for us. This is the job of the Dependency Injection Container (DIC).

Silex is also used as a DIC. In bootstrap.php, all class instantiation is wrapped in a factory method that is registered with the DIC. When the DIC is asked for a registered object, if that object is not already instantiated, it will be created. Additionally, all its dependencies will be created and injected through its constructor.

For cases where a component may need to create multiple objects of the same type (like the UserRepository needs to create many User objects), factory closures are created. The factory makes sure that any User object is created with the correct dependencies, and that shared dependencies are not re-created each time. Since the closure is defined in the DIC, the UserRepository does not need access to any of the external dependencies that might be used by User objects, and the closure can be passed to the UserRepository.
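Here is a toy container in the spirit of Pimple/Silex that shows both ideas: registered services are memoized (shared), and a factory closure for User objects is defined inside the DIC, where the dependencies are visible. Everything below is illustrative, not the project's actual bootstrap.php:

```php
<?php
// A deliberately minimal DIC: factories are registered by name,
// and each service is created once, then shared.
class Container {
    protected $factories = array();
    protected $shared = array();

    public function set($name, $factory) {
        $this->factories[$name] = $factory;
    }

    public function get($name) {
        if (!isset($this->shared[$name])) {
            $this->shared[$name] = call_user_func($this->factories[$name], $this);
        }
        return $this->shared[$name];
    }
}

class UserRepository {
    protected $userFactory;
    public function __construct($userFactory) {
        $this->userFactory = $userFactory; // a closure, not a hard dependency
    }
    public function find($name) {
        return call_user_func($this->userFactory, $name);
    }
}

class User {
    public $name;
    public $repository;
    public function __construct($name, UserRepository $repository) {
        $this->name = $name;
        $this->repository = $repository; // shared, not a fresh copy per User
    }
}

$c = new Container();
$c->set('user.repository', function ($c) {
    // The factory closure lives in the DIC, so the UserRepository never
    // needs to know how Users are wired.
    $userFactory = function ($name) use ($c) {
        return new User($name, $c->get('user.repository'));
    };
    return new UserRepository($userFactory);
});

$alice = $c->get('user.repository')->find('alice');
$bob = $c->get('user.repository')->find('bob');
// Both Users share the single UserRepository instance.
```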



git checkout 2-comments


Our project gets a new requirement to allow comments on users. Following the same pattern as with Users, we create a CommentRepository and a Comment domain object, and factory methods for creating and injecting each. HelloController needs access to the CommentRepository in order to save comments. Also, Users need access to the repository to retrieve all the Comments for that User. Since we are using DI, we change their constructors to receive a CommentRepository.


If we were not using a DIC, we would have to find every place that instantiated a HelloController or User object and change those calls to pass in the CommentRepository. Additionally, we would have to pass the CommentRepository through to any place that instantiated the object. Since we are using a DIC and factory methods, we can be assured that there is only one place that creates each of these object types, and therefore only one place needs to change (the factory methods in the DIC.)

We already get benefits from our refactoring to use a DIC!

git checkout 3-events


Our last requirements are to do some simple tracking of page views and notification of comments (perhaps with email, but for demonstration purposes, to a log file.) Each of these is a good use case for an event-driven system.

One of the challenges with event-driven development is making sure that the event listeners are instantiated before the emitters begin sending events. One option is to instantiate all listeners as part of the application's bootstrapping/initialization process, but this can become cumbersome and have performance implications.

Instead of always instantiating every listener, we can use our DIC and factory methods. Since event emitters are instantiated through a factory method, we use that same factory method to initialize listeners to any events that emitter may emit. By tying the creation of the emitter to the creation of the listeners, we ensure that we only instantiate those listeners for which we have created an emitter.

For example, we know that the HelloController may emit page view events, so we make sure that whenever we instantiate a HelloController, we also create a page view listener and set it to listen for the event.
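A sketch of that wiring, using a simplified, hypothetical dispatcher rather than Silex's actual event system:

```php
<?php
// A minimal event dispatcher: listeners register per event name,
// and emit() calls each registered listener with a payload.
class Dispatcher {
    protected $listeners = array();

    public function on($event, $listener) {
        $this->listeners[$event][] = $listener;
    }

    public function emit($event, $payload = null) {
        $listeners = isset($this->listeners[$event]) ? $this->listeners[$event] : array();
        foreach ($listeners as $listener) {
            call_user_func($listener, $payload);
        }
    }
}

class PageViewListener {
    public $views = 0;
    public function handle($path) {
        $this->views++; // a real listener would write to a log
    }
}

class HelloController {
    protected $dispatcher;
    public function __construct(Dispatcher $dispatcher) {
        $this->dispatcher = $dispatcher;
    }
    public function hello($path) {
        $this->dispatcher->emit('page.view', $path);
        return 'Hello';
    }
}

// The controller's factory method also creates and registers the
// listeners for the events this controller emits:
$listener = new PageViewListener();
$dispatcher = new Dispatcher();
$dispatcher->on('page.view', array($listener, 'handle'));
$controller = new HelloController($dispatcher);

$controller->hello('/');
```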



Potential Issues


Several good questions were asked during my presentation, and I attempted to answer them as best as I could.

Is debugging harder having the definitions of the objects away from their usage?
Since you are explicitly passing an object's dependencies, and you get all the dependencies from the same place, you can minimize external state that may affect an object's behavior. Also, by using factory methods, you can find exactly where an object was created. Without the DIC, an object or its dependencies could come from anywhere in your code. DI also makes unit- and regression-testing easier.

Do the bootstrap/DIC definitions become unwieldy when you have a lot of interconnected objects?
The DIC is kept simple by having each factory method responsible for creating only one type of object. If each method is small and isolated, the file may become very large, but the logic in the file will remain easy to read. You can even break the bootstrap of the DIC into multiple files, each file handling a different component or section of the project's dependencies. Circular dependencies may be a problem with a highly connected object graph, but that is a separate problem regardless of using a DIC or not.

Where is a good place to start when refactoring an existing large codebase to use a DIC?
It's a multi-step process:
  1. Pick a class and create a factory method on a globally accessible DIC that instantiates that class.
  2. For every place in the class that uses the "new" operator, replace it with a call to a factory function/method on the DIC.
  3. Move calls to the factory methods into the class's constructor, and store the results as properties on the object.
  4. Change the constructor to not call the factory methods itself, but receive its dependencies as parameters.
  5. Update the class's factory method to call the other factory methods and pass in the dependencies to the constructor.
  6. Find any place in the code that instantiates the class and replace it with a call to the class's factory method.

Repeat this process for any usage of the "new" operator in the code. By the end, the only place in the code that uses the "new" operator or refers to the DIC should be the DIC itself.
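The end state of those steps might look like this. Logger and Job are hypothetical stand-ins, and the plain-array DIC is a deliberately minimal sketch:

```php
<?php
class Logger {
    public function log($msg) {
        return "logged: $msg";
    }
}

// Before the refactor, Job would have contained:
//     public function run() { $logger = new Logger(); return $logger->log('run'); }
// After steps 1-6, the dependency arrives through the constructor:
class Job {
    protected $logger;
    public function __construct(Logger $logger) {
        $this->logger = $logger;
    }
    public function run() {
        return $this->logger->log('run');
    }
}

// The DIC's factory methods are now the only place "new" appears:
$dic = array();
$dic['logger'] = function () {
    return new Logger();
};
$dic['job'] = function () use (&$dic) {
    return new Job(call_user_func($dic['logger']));
};

$job = call_user_func($dic['job']);
$job->run(); // "logged: run"
```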



2012-09-27

Interfaces and Traits: A Powerful Combo

If you're not using interfaces in PHP, you are missing out on a powerful object-oriented programming feature. An interface defines how to interact with a class. By defining an interface and then implementing it, you can guarantee a "contract" for consumers of a class. Interfaces can be used across unrelated classes. And they become even more useful when combined with the new traits feature in PHP 5.4.

Let's suppose we have a class called User. Users have an Address that our application mails packages to via a PackageShipper:

class Address {
 // ... setters and getters for address fields ...
}

class User {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class PackageShipper {
 public function shipTo(User $user) {
  $address = $user->getAddress();

  // ... do shipping code using $address ...
 }
}

Our application is happily shipping packages, until one day a new requirement comes in that we need to be able to ship packages to Companies as well. We create a new Company class to handle this:

class Company {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

Now we have a problem. The PackageShipper class only knows how to handle Users. What we need is for the PackageShipper class to handle anything that has an Address.

We could make a base class that both Users and Companies inherit from and have the PackageShipper accept any class that descends from the base class. This is an unsatisfying solution. Semantically, Users and Companies are different entities, and there may not be enough common functionality to move to a base class that both inherit. They may not have much in common besides the fact that they both have an address. Also, either class may already extend another class, and we cannot extend more than one class since PHP has a single-inheritance object model.

Instead, we can use an interface to define the common parts of Users and Companies that PackageShipper needs to deal with. Then, Users and Companies can implement that interface, and PackageShipper only has to deal with objects that implement the interface.

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

class User implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class Company implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

A class can implement many different interfaces, and classes with different bases can still implement the same interface.

But there is still a problem. Both Company and User implement the Addressable interface using the same code. It is not required to do this; as long as the constraints of the interface are met, the implementation does not have to be the same for every class that implements the interface. But in this case, they both implement the interface in the same way, and we have duplicated code. If a third Addressable class came along, it is possible that it would also implement the interface the same way, leading to even more duplication.

If you are using PHP 5.3 or less, there is little that can be done about this situation. But if you are using PHP 5.4, there is a new construct that is made to handle exactly these types of code duplication scenarios: traits.

A trait is similar to a class in that they both implement methods and have properties. The difference is that classes can be instantiated, but traits cannot. Instead, traits are added to class definitions, giving that class all the methods and properties that are defined in the trait.

A helpful way to think about traits is like a macro: using a trait in a class is the same as if the code in the trait were copied into the class.

Using traits, we can clean up the code duplication and still meet the contract defined by the Addressable interface:

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

Now, any class can immediately implement the Addressable interface by simply using the AddressAccessor trait. All the duplicated code has been moved to a single place.

The trait itself does not implement Addressable. This is because only classes can implement interfaces. Remember that traits are nothing more than code that gets copied into a class; they are not classes on their own.

Interfaces in PHP are a useful feature for enforcing code correctness. And when interfaces are combined with traits, they form a powerful tool which allows for faster development, decreased code duplication, greater readability, and better maintainability.

Here is all the final code for the package shipping application:

class Address {
 // ... setters and getters for address fields ...
}

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

2012-08-31

Plumbing PHP

When building a project with Neo4j or most other graph databases, it is impossible to avoid learning about Tinkerpop's excellent Gremlin graph processing language. The processing layer of Gremlin is built on top of Pipes, a dataflow programming library.

I was inspired by the syntax and ease-of-use of Gremlin to build a simple processing pipeline library in PHP. The result is Plumber, a library for easily building extensible deferred-processing pipelines.

Plumber is built on top of PHP's native Iterator library. The idea is simple: instantiate a new processing pipeline, attach processing pipes to it, then send an iterator through the pipeline and iterate over the results in a foreach. Each element that comes out the other end of the pipeline has been passed through and processed by each pipe. And because Plumber uses iterators, it natively supports lazy-loading and Just-In-Time evaluation.

A simple example would be reading a set of records from a database, formatting them in some manner, then echo'ing them out to the screen:
$users = // code to retrieve user records from a database as an array or Iterator...

$names = array();
foreach ($users as $user) {
    if (!$user['first_name'] || !$user['last_name']) {
        continue;
    }

    $name = $user['first_name'] . ' ' . $user['last_name'];
    $name = ucwords($name);
    $name = htmlentities($name);
    $names[] = $name;
}

// later on, display the names
foreach ($names as $name) {
    echo "$name<br>";
}
There are a few obvious downsides to doing things this way: the entire set of records is looped through more than once; all the records must be in memory at the same time (twice even, once for $users and once for $names); and the processing steps in the foreach are executed immediately on every record. These may not seem like a big deal if the record set is small and the processing steps are trivial, but they can become big problems if you are not careful.

Here is the same code using Plumber:
$users = // code to retrieve user records from a database as an array or Iterator...

$names = new Everyman\Plumber\Pipeline();
$names->filter(function ($user) {
        return $user['first_name'] && $user['last_name'];
    })
    ->transform(function ($user) {
        return $user['first_name'] . ' ' . $user['last_name'];
    })
    ->transform('ucwords')
    ->transform('htmlentities');

// later on, display the names
foreach ($names($users) as $name) {
    echo "$name<br>";
}
The list of $users is only looped through one time, and there is no need to keep a separate list of $names in sync with the $users list. Each $user is transformed into a $name on-demand, keeping resources free.

This can all be accomplished using Iterators, but there is quite a bit of boilerplate code involved. Plumber is meant to remove most of the boilerplate and let the developer concentrate on writing their business logic.
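For a sense of that boilerplate, here is a sketch of a single transforming iterator built on SPL's IteratorIterator. Plumber's pipes are conceptually similar, though this is not Plumber's actual implementation:

```php
<?php
// One transform step as an iterator: each element is passed through a
// callable on the way out. Chaining requires wrapping iterators by hand,
// which is exactly the boilerplate Plumber hides.
class TransformIterator extends IteratorIterator {
    protected $fn;

    public function __construct(Traversable $inner, $fn) {
        parent::__construct($inner);
        $this->fn = $fn;
    }

    #[\ReturnTypeWillChange]
    public function current() {
        return call_user_func($this->fn, parent::current());
    }
}

$users = new ArrayIterator(array(
    array('first_name' => 'ada', 'last_name' => 'lovelace'),
));

$names = new TransformIterator(
    new TransformIterator($users, function ($user) {
        return $user['first_name'] . ' ' . $user['last_name'];
    }),
    'ucwords'
);

$formatted = array();
foreach ($names as $name) {
    $formatted[] = $name; // "Ada Lovelace"
}
```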

There is more to Plumber, including several built-in pipe types, and the ability to extend the library with your own custom pipes. It is also not necessary to use the fluent interface, if that is not your style. More usage information can be found in the README file in the Plumber GitHub repo. Constructive feedback is always welcome!

2012-07-15

No Best Practices Yet

I'm getting a little tired of the constant onslaught of religious wars in software development these days. Every day, there's a new "PHP sucks vs. PHP is the best evar!", "Nodejs is crap vs Nodejs async FTW!", "OOP is a failure vs. FP doesn't get things done", "Agile vs. anything else" article. It's frustrating, because everyone is wrong.

Mankind has been building bridges for literally tens of thousands of years. From the first moment a primitive monkey-man threw a log across a stream and every other monkey-man tribe copied the idea, we've been building bridges. And guess what? As a species, we're really good at it.

And over the millennia, we've developed the discipline of engineering. At its core, engineering is nothing more than a set of best practices that say, "Hey, if you build your bridge this way or that way (depending on its purpose), it will most likely stay up for a really long time and continue to serve the function that you meant it to serve." If you ignore those best practices of engineering, your bridge may stay up, or it may not. And anyone who does build their bridges according to best practices will be able to look at your bridge and tell you why you have failed.

How does this relate to software? Mankind has been writing computer code for less than a century. Modern programming languages have been around for about 40 years. The current crop of "high level" languages (PHP, Ruby, Java, Python, Javascript, Clojure, etc.) have only been around for about 20 years. The processes by which software gets built have been constantly evolving; we can't even agree on a definition of an "agile methodology." There are new programming languages, new frameworks, new architectures, new hardware, new ideas constantly coming out of the best minds in our field.

So why does any single camp believe they know best when it comes to software "best practices?" Our discipline is so young that I don't believe any tool or method used by any software developer nowadays can be considered a "best practice."

I'm sure back in the prehistoric days, monkey-tribes thought that a log across the stream was a best practice. They couldn't envision how mathematics, physics and materials would change to allow them to build better, stronger bridges. Fast-forward to current day, and building bridges is a pretty rigorous discipline. Even breakthroughs in stronger, lighter materials don't change the underlying principles of bridge-building. Engineers might have personal preferences for the type of bridges they like to build, but they don't debate over the fundamental rules of how to build a given bridge type, or which bridge type might be superior for a given purpose (unless the purpose is so trivial that aesthetics or personal preference can be a factor beyond physical properties.)

And yet, every time a new language or framework pops up in software, entire populations of the development community are willing to abandon what came before in droves and proclaim the new way as self-evidently superior. Ignoring, of course, that everyone said the same thing about whatever they were leaving behind the last time.

Every article I've read on either side of any of these debates can be boiled down to "my brain doesn't work that way, it works this way, therefore, this way is superior." I've yet to see a compelling argument given for any side of any software debate that doesn't have the subtext of simply being the author's personal preference. And that right there tells me that our profession is too young to have decided on what is or is not a best practice.

So yeah, let's keep throwing our logs across the stream and fighting over whether oak is better than pine, ignoring that as soon as we discover steel the debate will be pointless. At least be willing to admit that while the tools and methods we have today might be the best we can come up with, they're certainly not "best practices" yet.

2012-05-14

Neo4jPHP Available as a Composer Package

Neo4jPHP is now available as a Composer package. To use it in your project, create a "composer.json" file in the root of your project, with the following contents:
{
  "require":{
    "everyman/neo4jphp": "dev-master"
  }
}
After installing Composer, run the following command in the root of the project:
> php composer.phar install
The Neo4jPHP library will be downloaded and installed in your project under "vendor/everyman/neo4jphp". An autoloader will be created at "vendor/autoload.php" which you should include in your project, then start using Neo4jPHP as normal:
require("vendor/autoload.php");

$client = new Everyman\Neo4j\Client();
$node = $client->makeNode();
To get the latest version of Neo4jPHP, run:
> php composer.phar update

2012-04-23

Runtime Expectations Javascript Throwdown

Last week, TriangleJS were the guests on the talk-show podcast Runtime Expectations. The topic was a Javascript framework "throwdown" which covered a wide range of different frameworks and toolkits. We even discussed if frameworks were necessary at all. Ben and Adrian are great guys and it was a lot of fun being on their show and hanging out for beers afterwards.

If you want to hear me make a total stuttering ass out of myself, I start rambling around 23:34 then fade out so more informed people can talk.

2012-04-02

Data Mapper Injection in PHP Objects

I was trying to figure out a simple way to have data from a database injected into a domain object in PHP. I don't want to enforce that the domain object implements setters and getters for every property, and the burden on the developer for implementing the data injection interface should be as low as possible.

My solution relies on PHP's ability to pass and assign values by reference. The data mapper is injected into the domain object. The domain object tells the mapper to attach to the object's properties. The mapper can then read and write to those properties as if they were its own.
interface Mappable {
    public function injectMapper(Mapper $mapper);
}

class Mapper {
    protected $attached = array();

    public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }

    public function fromDataStore($data) {
        foreach ($data as $fieldName => $value) {
            if (array_key_exists($fieldName, $this->attached)) {
                $this->attached[$fieldName] = $value;
            }
        }
    }

    public function fromDomainObject() {
        $data = array();
        foreach ($this->attached as $fieldName => $value) {
            $data[$fieldName] = $value;
        }
        return $data;
    }
}

class Product implements Mappable {
    protected $id;
    protected $name;

    public function injectMapper(Mapper $mapper) {
        $mapper->attach($this->id, 'id');
        $mapper->attach($this->name, 'name');
    }

    // ...domain/business logic methods go here...
}

// Create a product and manipulate it
$product = new Product();
$product->doFoo();
$product->linkToBar();

// Inject the data mapper
$mapper = new Mapper();
$product->injectMapper($mapper);
// This could be a call to save the array of data in a database
print_r($mapper->fromDomainObject());

// Read a product from the "database"
$product = new Product();
$mapper = new Mapper();
$product->injectMapper($mapper);
$mapper->fromDataStore( array("id"=>123, "name"=>"Widgets") );
So what's happening here? The fields of the Product domain object are passed into the Mapper's `attach` method by reference. This allows that method to modify the variable passed in without passing it back to the caller.
public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }
If the value were only passed by reference, only the `attach` method would be able to modify it. As soon as the method returned, the reference would be lost. In order to continue having a reference to the Product's properties, the `attach` method holds the reference in an array using "assign by reference."

Note the ampersand after the equals sign. This tells PHP that we don't want to assign a copy of the value in $ref, we want the reference to $ref itself. This allows us to later modify the value of the property in Product by modifying a reference to the same place in memory where the Product's properties are held.
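The two mechanisms can be demonstrated in isolation. Holder below is a hypothetical stand-in for the Mapper, stripped down to just the reference handling:

```php
<?php
// Pass-by-reference (&$var in the signature) lets attach() see the
// caller's variable; assign-by-reference (= &$var) lets the object
// keep pointing at it after attach() returns.
class Holder {
    protected $refs = array();

    public function attach(&$var, $name) {
        $this->refs[$name] = &$var; // keep a live reference, not a copy
    }

    public function write($name, $value) {
        $this->refs[$name] = $value; // writes through to the original variable
    }
}

$property = 'old';
$holder = new Holder();
$holder->attach($property, 'prop');
$holder->write('prop', 'new');
// $property is now 'new', even though Holder never returned anything.
```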

So now we have the beginnings of a basic data mapper that remains loosely coupled to our domain objects, and without having to write an explicit mapping class for each domain object-to-database mapping.

2012-03-04

GetSet Methods vs. Public Properties

I was recently having a debate with a coworker over the utility of writing getter and setter methods for protected properties of classes. On the one hand, having getters and setters seems like additional boilerplate and programming overhead for very little gain. On the other hand, exposing the value properties of a class seems like bad encapsulation and will overall lead to code that is more difficult to maintain.

I come down firmly on the get/set method side of the fence. It takes very little extra code, defines the interface to the class explicitly and makes the class easier to extend and maintain in the long run. Let's try an example (in PHP, though the premise holds for any language with public/private properties.)
class PublicValues {
    public $foo;
    public $bar;
}

class GetSet {
    protected $foo;
    protected $bar;

    public function getFoo() {
        return $this->foo;
    }

    public function setFoo($value) {
        $this->foo = $value;
    }

    public function getBar() {
        return $this->bar;
    }

    public function setBar($value) {
        $this->bar = $value;
    }
}

$public = new PublicValues();
$public->foo = array('some'=>'value', 'data'=>'structure');
$public->bar = 123;

$getset = new GetSet();
$getset->setFoo(array('some'=>'value', 'data'=>'structure'));
$getset->setBar(123);

echo "Public is: ".$public->bar." and ".print_r($public->foo, true)."\n";
echo "GetSet is: ".$getset->getBar()." and ".print_r($getset->getFoo(), true)."\n";
Obviously, class `PublicValues` is much shorter than class `GetSet`. Shorter is better, right? Less code to maintain, fewer places for bugs to creep in. And it didn't take me nearly as long to type (though, truthfully, `GetSet` didn't take that long to write, either, as it was mostly cut-and-paste.)

Now let's say we have decided to wrap the array data structure in a new `Widget` class and we need to ensure that `foo` can only be of type `Widget`. Modify the classes to make this so:
class PublicValues {
    // ... add a whole new method
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
}

class GetSet {
    // ... only change here is type-hinting
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
    // ...
}

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
$getset->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
`PublicValues` had an entirely new method added to it. `GetSet` only had a small modification to an existing method.

Actually, this still isn't perfect; application code can still say `$public->foo = 'whateverIWant';` without going through the `PublicValues::setFoo` method. In fact, while writing this example, I forgot until my third revision to go back and change the sample code to use the set method. Let's make sure that doesn't happen in the future:
class PublicValues {
    // ... change foo so that client code can't set it to the wrong thing
    protected $foo;

    // ... still need to be able to read foo, so now we add a getter
    public function getFoo() {
        return $this->foo;
    }
    // ...
}

// GetSet remains the same ...

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
echo "Public is: ".$public->bar." and ".print_r($public->getFoo(), true)."\n";
Notice that `GetSet` didn't have to change at all in this case, and the only client code that had to change for it was searching for anywhere calling `GetSet::setFoo` and making sure it provided a `Widget`.

As for `PublicValues`, we'll have to go through all the code making sure that everywhere that was setting `foo` with assignment now calls the set method (with a proper `Widget`), and anywhere that was reading it directly calls the get method.

My future self is screaming at my current self. Now we have two ways of accessing the PublicValues class: one way through direct value reading and writing (bar), and the other through getset methods (foo). As a consumer of this class, I have no way of knowing which properties are openly accessible and which are guarded through getset methods, other than going back and looking through the code.

To solve this problem, I am going to use a method that I see in almost every PHP project of non-trivial size that I have ever worked with or seen the source code of: use magic __get and __set to determine if the property is directly accessible or not:
class PublicValues {
    // ...

    public function __get($name) {
        $method = 'get'.$name;
        if (method_exists($this,$method)) {
            return $this->$method();
        } else {
            return $this->$name;
        }
    }

    public function __set($name, $value) {
        $method = 'set'.$name;
        if (method_exists($this,$method)) {
            return $this->$method($value);
        } else {
            $this->$name = $value;
        }
    }
}
As before, `GetSet` doesn't change one bit. For `PublicValues`, the rest of the code can continue to use what looks like straight assignment, and if I want to know if the assignment is direct modification or a magic set, I only need to look in only two places to figure it out: look for a named getset method, or assume I am using the magic __get and __set. Perfect...

...until we have to add something to this class that absolutely must not be modified by client code.
class PublicValues {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}

class GetSet {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}
Now we have a big problem. Clients of `GetSet` are protected, because nothing in `GetSet` can be modified except through a class method. We don't provide a method for accessing the dangerous property, so clients can't hurt themselves.

But clients of `PublicValues` can modify any property they want. The magic __set method makes public any protected value that does not have an explicit set method. We could fix this by adding a set method that throws an exception. Or we could solve it by blacklisting that property in the magic __set. Or we could make sure everyone (including new hires six months from now) knows not to touch this property. Or we could solve it by...
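For illustration only, here is how the blacklist option might look in a Python analogue of `PublicValues` (the names are made up, and PHP's magic __set plays the role of Python's __setattr__):

```python
class PublicValues:
    # Properties that must never be modified by client code.
    _guarded = {"modification_to_this_crashes_the_app"}

    def __init__(self):
        # Bypass the guard for internal initialization.
        object.__setattr__(self, "modification_to_this_crashes_the_app", 0)
        object.__setattr__(self, "bar", 0)

    def __setattr__(self, name, value):
        # The blacklist: refuse writes to guarded properties,
        # pass everything else through like a normal assignment.
        if name in self._guarded:
            raise AttributeError(name + " may not be modified by client code")
        object.__setattr__(self, name, value)
```

Clients can still write `bar` directly, but any write to the guarded property raises instead of silently corrupting state.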

Truthfully, it doesn't matter how you solve it. Encapsulation of `PublicValues` has been broken from the beginning. We haven't even tried adding validation to either class, making the setting of one value affect another, or adding read-only or write-only methods. And over the course of just a few changes, we've had to add get/set methods anyway; we've just done it later in the application lifecycle, after client code is already using the class, which is when those sorts of changes are most costly to make.

In a blog post, it's easy to see these changes coming in a few paragraphs and account for them mentally. But imagine this in a large system, one that is constantly having features added, bugs fixed and maintenance performed over the course of months or years. Every change in this example could have come months apart. I don't expect myself or my fellow programmers to keep track of the intricacies of a single class over that timeframe, especially if that class is modified rarely, is one class in hundreds and we've been working in thousands of lines of other code in the meantime.

So do yourself a favor and take the 20 seconds to wrap your properties in boilerplate getset methods from the start. Not because it's a best practice, but because future you will be grateful.

Here is the meme-pic that started the debate:



2012-02-23

Similarity-based Recommendation Engines

I am currently participating in the Neo4j-Heroku Challenge. My entry is an (as yet unfinished) beer rating and recommendation service called FrostyMug. All the major functionality is complete except for the actual recommendations, which I am currently working on. I wanted to share some of my thoughts and methods for building the recommendation engine.

My starting point for the recommendation engine was a blog post by Marko Rodriguez, A Graph-Based Movie Recommender Engine. Marko's approach is pretty straightforward and he does an excellent job of explaining the subtleties of collaborative and content-based filtering. It was also a great way to get acquainted with the Gremlin graph traversal API, which I have been unwisely avoiding for fear of it being too complicated.

Basic collaborative filtering works like this: if I rate a beer, say, Lonerider Sweet Josie Brown Ale as 9 out of 10, and you rate that same beer as 8 out of 10, then I should be able to get recommendations based on other beers that you (and others who have rated that beer similarly to me) have rated highly.

I think we can improve on this. You and I happen to have rated one beer the same way. What happens if that is the only beer we have rated similarly? What if I have rated Fullsteam Carver Sweet Potato Ale at 8 out of 10, but you have rated it 2 out of 10? You have shown that you have poor taste, so why should I get recommendations based on your other ratings, simply because we happened to rate a single beer similarly?

With a small modification, we can overcome this problem. Instead of basing recommendations off of one similar rating, I can calculate how similarly you and I rated all the things we have rated, and only get recommendations from you if I have determined we are similar enough in our tastes. Call this "similarity-based collaborative filtering."

Finding similar users works like this:
  1. Find all users who have rated the same beers as I have.
  2. For each rated beer, calculate the differences between my rating and that user's rating.
  3. For each user, average the differences.
  4. Return all users where the average difference is less than or equal to some tolerance (2 in the example below.)
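Before translating this to a graph traversal, the four steps can be sketched in plain Python (the `{user: {beer: rating}}` shape is illustrative, not the actual data model):

```python
def similar_users(target, ratings, tolerance=2):
    """Steps 1-4: find users whose average rating difference from
    `target`, over the beers both have rated, is within `tolerance`."""
    mine = ratings[target]
    similar = []
    for user, theirs in ratings.items():
        if user == target:
            continue
        common = set(mine) & set(theirs)   # beers we have both rated
        if not common:
            continue
        diffs = [abs(mine[beer] - theirs[beer]) for beer in common]
        if sum(diffs) / len(diffs) <= tolerance:
            similar.append(user)
    return similar
```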

Let's see what this looks like in Gremlin, a domain-specific graph traversal language built on Groovy. Gremlin is built on the concept of "pipes", where a single chain of operations runs from beginning to end on each element put into the pipe. The elements in our case are the user nodes, beer nodes and the rating edges that connect them in our graph:
// We're going to use this again and again, so make a single "step" for it,
// to which we can pass a "similarity threshold"
Gremlin.defineStep('similarUser', [Vertex,Pipe], {Integer threshold ->

    // Capture the incoming pipe
    target = _();

    // Set up a table to hold ratings for averages
    m=[:].withDefault{[0,0]};

    // Find all beers rated by the target, and store each rating
    target.outE("RATED").sideEffect{w=it.rating}

    // For each rated beer, find all the users who have also rated that beer
    // and who are not the target user, then go back to their ratings
    .inV.inE("RATED").outV.except(target).back(2)

    // At this point in the pipeline, we are iterating over
    // only the ratings of beers that both users have rated.
    // For each rating, calculate the difference between
    // the target user's rating and the compared user's rating
    .sideEffect{diff=Math.abs(it.rating-w)}

    // For each compared user, store the number of beers rated in common
    // and the total difference for all common ratings. 
    .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; }.iterate();

    // Find the average difference for each user and filter
    // out any where the average is greater than the threshold.
    m.findAll{it.value[1]/it.value[0] <= threshold}.collect{g.v(it.key)}._()
});

// Return all the users who are similar to our target,
// with an average rating difference of 2 or less
g.v(123).similarUser(2)

The result of this script is a list of all user nodes that are considered "similar" to the target user. Now we can get more accurate recommendations, based on users whose tastes are similar to ours. Given users who are similar to me, recommend beers that they rated highly:
  1. Find all beers rated by similar users.
  2. For each beer, average the similar users' ratings.
  3. Sort the beers by highest to lowest average and return the top 25.
r=[:].withDefault{[0,0]};

// Find similar users 
g.v(123).similarUser(2)

// Find all outgoing beer ratings by similar users
// and the beer they are rating
.outE("RATED").sideEffect{rating=it.rating}.inV

// Store the ratings in a map for averaging
.sideEffect{me=r[it.id]; me[0]++; me[1]+=rating; }.iterate();

// Transform the map into a new map of beer id : average
r.collectEntries{key, value -> [key, value[1]/value[0]]}

// sort by highest average first, then take the top 25
.sort{a,b -> b.value <=> a.value}[0..24]

// Transform the map back into a mapping of beer node and rating
.collect{key,value -> [g.v(key), value]}
Now I only get recommendations from users whose tastes are similar to mine.
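The same three steps can be sketched in plain Python over the illustrative `{user: {beer: rating}}` shape, assuming a `similar` list produced by the similarity step:

```python
def recommend(similar, ratings, top_n=25):
    """Average the similar users' ratings per beer and return the
    top_n (beer, average) pairs, highest average first."""
    totals = {}                                  # beer -> [count, sum]
    for user in similar:
        for beer, rating in ratings[user].items():
            count_sum = totals.setdefault(beer, [0, 0])
            count_sum[0] += 1
            count_sum[1] += rating
    averages = {beer: s / c for beer, (c, s) in totals.items()}
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```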

I can also calculate an estimated rating for any beer that I have not rated. To see how that might be useful, consider Netflix. When you look at a movie on Netflix that you have not already rated, they show you their best guess for how much you would like that movie. This can be helpful in deciding which of a set of movies to watch, or in our case, which beer to try given limited pocket money. Again, we can use our similar users to figure out the estimated rating of a given beer:
  1. Find all ratings of the target beer by similar users.
  2. Return the average of their ratings of that beer.
// Grab our target beer
beer=g.v(987)

// Find similar users 
g.v(123).similarUser(2)

// For each similar user, find all that have rated the target beer
.outE("RATED").inV.filter{it==beer}

// Go back to their ratings, and average them
.back(2).rating.mean()
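In the plain-Python sketch (same illustrative `{user: {beer: rating}}` shape), the estimated rating is just the mean of the similar users' scores for that one beer:

```python
def estimated_rating(beer, similar, ratings):
    """Average the similar users' ratings of `beer`;
    None if no similar user has rated it."""
    scores = [ratings[user][beer] for user in similar
              if beer in ratings[user]]
    return sum(scores) / len(scores) if scores else None
```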
So now we have a simple recommendation engine based on ratings by users with similar tastes. There are a few improvements that could still be made:
  • Don't consider a user "similar" unless they have rated at least 10 beers in common (or some other threshold). This prevents users who have only rated a few beers in common from being falsely labeled as similar.
  • Calculate the similar users ahead of time, and store them with a "SIMILAR" relationship to the target user. Periodically recalculate the similar users to keep the list fresh.
  • Use random sampling or size limiting on the number of users and ratings checked when determining similarity, trading accuracy for performance.
Many thanks to Marko for his help on the mailing list, for giving me some pointers in writing this, and for being a pretty awesome guy in general.

Update: Luanne Misquitta posted about how to do similarity-based recommendations with Cypher. Thanks, Luanne!

2012-01-14

Command Invoker Pattern with the Open/Closed Principle

Over on DZone, Giorgio Sironi demonstrates the "Open/Closed Principle on real world code". The pattern demonstrated is similar to the Command Pattern, and the post does a good job of introducing a class that is open for extension but closed for modification. Giorgio mentions that there is still more work to be done. I thought I might take a stab at creating a flexible and extendable command invocation solution.

One of the issues that should be addressed is that the invoker (the Session class in the post) is only extensible by actually extending the class. If we want to add custom methods to it, we have two options: make our own sub-class, which should be avoided when favoring composition over inheritance; or modify the base class's constructor, which violates the open/closed principle.

Another thing that could be improved is that the invoker violates the Single Responsibility Principle (the "S" in SOLID.) It needs to know not just how to execute commands, but also how to construct them. As the list of available commands grows and the constructors for each command become more complex, the invoker needs to contain more logic to build each command type.

Let's overcome some of these issues, and also make the code even more extensible. I'll use a simplified command invoker to demonstrate.

To start, let's remove the complexity of requiring the invoker to know how to build each command object. A simple way to do this is to move the factory methods into the individual command classes. We'll also define a simple interface to ensure that our commands are executable. Here are two simple commands. Note that the constructor is not part of the interface, which allows the command classes to have completely different constructor signatures.
interface Command {
 public static function register();

 public function execute();
}

class HelloCommand implements Command {
 protected $name;

 public static function register() {
  return function ($name) {
   return new HelloCommand($name);
  };
 }

 public function __construct($name) {
  $this->name = $name;
 }

 public function execute() {
  echo "Hello, " . $this->name . "\n";
 }
}

class PwdCommand implements Command {

 public static function register() {
  return function () {
   return new PwdCommand();
  };
 }

 public function execute() {
  echo "You are here: " . getcwd() . "\n";
 }
}
Each command class has a static `register` method that returns an anonymous function which knows how to instantiate the class. We'll call this the "factory" function. In the example, the factory function signatures are the same as the constructors, but that is not necessary; as long as the factory function can construct a valid object, it can take whatever parameters it needs.

Our commands can construct themselves, so the invoker only has to know how to map its own method calls to the commands. Here is the simplified Invoker class:
class Invoker {
 protected $commandMap = array();

 public function __construct() {
  $this->register('hello', 'HelloCommand');
  $this->register('pwd', 'PwdCommand');
 }

 public function __call($command, $args) {
  if (!array_key_exists($command, $this->commandMap)) {
   throw new BadMethodCallException("$command is not a registered command");
  }
  $command = call_user_func_array($this->commandMap[$command], $args);
  $command->execute();
 }

 public function register($command, $commandClass) {
  if (!array_key_exists('Command', class_implements($commandClass))) {
   throw new LogicException("$commandClass does not implement the Command interface");
  }
  $this->commandMap[$command] = $commandClass::register();
 }
}
The important bit is the `register` method. It checks that the registered class is indeed a Command, and thus, that it will have both an `execute` and a `register` method. We then invoke the `register` method, and store the factory function for that command.

In the magic `__call` method, we have to use `call_user_func_array` since we don't know the real signature of the called factory function. If we wanted, we could have the `__call` method return the command object instead of executing it. That would turn the Invoker into an Abstract Factory for command objects.

The main point is that Invoker never cares exactly how commands are constructed. Invoker now only has one responsibility: invoking commands.

We execute commands by calling Invoker methods with parameters that match the factory function signature of the command being executed:
$invoker = new Invoker();
$invoker->hello('Josh');
$invoker->pwd();
By making the Invoker's `register` method public, we've also opened up the Invoker to allow a client to extend its functionality without having to modify the Invoker itself:
class CustomCommand implements Command {
 protected $foo;
 protected $bar;

 public static function register() {
  return function ($foo, $bar) {
   return new CustomCommand($foo, $bar);
  };
 }

 public function __construct($foo, $bar) {
  $this->foo = $foo;
  $this->bar = $bar;
 }

 public function execute() {
  echo "What does it all mean? " . ($this->foo + $this->bar) . "\n";
 }
}

$invoker->register('custom', 'CustomCommand');
$invoker->custom(40, 2);
We can even let clients overwrite existing commands by re-registering with a custom class:
$invoker->register('pwd', 'CustomCommand');
$invoker->pwd(2, 40);
If we don't want to allow this, we can build more checks into `register` that prevent overwriting built-in commands.
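One way such a guard might look, sketched in Python rather than PHP (the `builtin` flag and the class names are made up for illustration; __getattr__ plays the role of PHP's magic __call):

```python
class HelloCommand:
    """A minimal command with the same execute() contract as above."""
    def __init__(self, name):
        self.name = name

    def execute(self):
        print("Hello, " + self.name)

class Invoker:
    def __init__(self):
        self._commands = {}    # command name -> factory function
        self._builtins = set()
        self.register("hello", HelloCommand, builtin=True)

    def register(self, name, factory, builtin=False):
        # The guard: built-in commands may not be re-registered.
        if name in self._builtins:
            raise ValueError(name + " is built-in and cannot be overwritten")
        self._commands[name] = factory
        if builtin:
            self._builtins.add(name)

    def __getattr__(self, name):
        # Dynamic dispatch: build the command via its factory, execute it.
        if name not in self._commands:
            raise AttributeError(name + " is not a registered command")
        factory = self._commands[name]
        return lambda *args: factory(*args).execute()
```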

So now we have an extendable, flexible command invocation system that requires no modification to the actual command runner. This could also be used as the basis for a plug-in system.

Full sample code is available at http://gist.github.com/1610148





2012-01-03

Funkatron's MicroPHP Manifesto

Just saw this over on @funkatron's blog:

The MicroPHP Manifesto:

I am a PHP developer
  • I am not a Zend Framework or Symfony or CakePHP developer
  • I think PHP is complicated enough

I like building small things
  • I like building small things with simple purposes
  • I like to make things that solve problems
  • I like building small things that work together to solve larger problems

I want less code, not more
  • I want to write less code, not more
  • I want to manage less code, not more
  • I want to support less code, not more
  • I need to justify every piece of code I add to a project

I like simple, readable code
  • I want to write code that is easily understood
  • I want code that is easily verifiable

I agree that there are "and-the-kitchen-sink" frameworks out there that do too much. I also agree that having every developer re-invent the wheel is not the ideal situation. The main problem I see with the macro-frameworks is that they tend not to allow me to strip out their functionality and replace it with another library when I need to. I much prefer smaller libraries and components that I can mix and match if necessary. Isn't that the whole point of code re-use in the first place?

So maybe this isn't so much about huge frameworks being bad as it is that they should be built from small, interchangeable components? Wasn't there a call for more inter-operability a little while back?

Funkatron has made it pretty clear that these are his personal preferences, but that hasn't stopped the haters.