Everyman Software

Serializing Data Like a PHP Session

2013-05-01T10:17:00.001-04:00

PHP has several built-in ways of serializing/unserializing data. The most cross-platform is json_encode; pretty much every programming stack can JSON decode data that has been encoded by any other stack. There's also PHP's native serialize function, which is not as cross-platform, but has the added benefit of being able to store and restore PHP objects to their original class.
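To make the difference concrete, here is a small illustration of the two formats. The Point class is a hypothetical example, not something from the original project:

```php
<?php
// Hypothetical class used only to illustrate the two formats.
class Point
{
    public $x = 1;
    public $y = 2;
}

$json   = json_encode(new Point()); // portable, but the class name is lost
$native = serialize(new Point());   // PHP-specific, but remembers the class

$fromJson   = json_decode($json);    // comes back as a generic stdClass
$fromNative = unserialize($native);  // comes back as a Point
```

The JSON round trip keeps the data but not the type; the native round trip restores an object of the original class.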

There's a third, and much lesser known PHP serialization format: the format that PHP uses to store session data. If you have ever popped open a PHP session file, or stored session data in a database, you may have noticed that this serialization looks very similar to the serialize function's output, but it is not the same.

Recently, I needed to serialize data so that it looked like PHP session data (don't ask why; I highly suggest not doing this if it can be avoided.) It turns out, PHP has a function that encodes data in this format: session_encode. Great! I'll just pass my array of data to it and...

Oh wait. session_encode doesn't accept any arguments. You can't pass data to it. It just takes whatever is in the $_SESSION superglobal and serializes it. There is no built-in function in all of PHP that will serialize arbitrary data for you the same way that it would be serialized into a session.

There are a few userland implementations of PHP's built-in session serialization, mainly built around string splitting and regexes. All of them handle scalar values; some handle single-level arrays. None of them handle nested arrays and objects, and some have trouble if your data contains certain characters that are used in the encoding.

So I came up with my own functions for reading/writing arbitrary session data without overwriting the existing session. (Edit: I later noticed that a few people suggest a similar method in the comments on the PHP manual pages for session_encode and session_decode.)

Using these functions requires there to be an active session (session_start must have already been called.) Edit: Thanks to Rasmus Schultz for also pointing out that session_encode might be disabled on some systems due to security concerns.

PHP already has a built-in way to serialize and unserialize session data. The problem is that it only serializes from and unserializes into the PHP $_SESSION global. We probably don't want to overwrite the current $_SESSION. We hold a copy of whatever data is already in $_SESSION, then use it to perform our data serialization, then restore it afterwards. And because we're using PHP's built-in session serialization, we get nested array and object serialization for free, and we didn't have to write our own parser.
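A minimal sketch of the swap-and-restore approach described above, assuming an active session. The function names session_serialize_sketch and session_unserialize_sketch are hypothetical and are not necessarily the names used in the original code:

```php
<?php
// Hypothetical helper names; both require session_start() to have been called.

function session_serialize_sketch(array $data)
{
    $backup = $_SESSION;          // hold a copy of the real session data
    $_SESSION = $data;            // temporarily swap in the data to encode
    $encoded = session_encode();  // PHP serializes whatever is in $_SESSION
    $_SESSION = $backup;          // restore the real session
    return $encoded;
}

function session_unserialize_sketch($encoded)
{
    $backup = $_SESSION;
    $_SESSION = array();
    session_decode($encoded);     // populates $_SESSION from the string
    $data = $_SESSION;
    $_SESSION = $backup;
    return $data;
}

session_start();
$_SESSION['existing'] = 'untouched';

$original = array('user' => array('id' => 7, 'tags' => array('a', 'b')));
$encoded  = session_serialize_sketch($original);
$decoded  = session_unserialize_sketch($encoded);
```

After both calls, $_SESSION still holds only its original contents, and $decoded matches $original, nested arrays included. A more defensive version would restore $_SESSION in a finally block in case decoding fails.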

Loggly and Puppet

2013-04-23T09:30:00.000-04:00

Update 2013-10-22: This post refers to Loggly generation 1, and may (most likely) not work with Loggly's new second generation product offering.

As a follow-up to my previous post on pulling data from Loggly using jQuery, this post will show how to use Puppet to automatically register and configure instances to send data to Loggly.

At ServiceTrade, we use Amazon Web Services for almost all of our infrastructure. All our production servers are EC2 instances. The configuration of all the instances is kept in Puppet manifests. Instances go down and come up all the time, and Puppet helps us make sure they are all configured exactly alike out of the box.

A server cannot send data to Loggly unless you have previously told Loggly to accept data from it. Unfortunately, with server instances being created and removed automatically, it would be impossible to keep up with hand-registering each instance with Loggly. Fortunately, we can use Loggly's API and some Puppet manifests to register our instances for us when they come up.

We use rsyslog on our instances for collecting system log data (syslog, kernel, mail logs, etc.). rsyslog can tail log files and forward them to other log files, or even to other servers via TCP or UDP. Loggly has great documentation on setting up rsyslog to forward log files to Loggly.

First, we need to have Puppet manage rsyslog. This ensures that rsyslog will be installed, and gives us control over rsyslog's master configuration file and a directory of instance specific configuration files. Below is the rsyslog module file. All files are relative to the Puppet root directory.

modules/rsyslog/manifests/init.pp

As it says, the main configuration file will be in modules/rsyslog/files/rsyslog.conf. The config file is the standard one installed by our package manager with a few minor alterations, seen here:

modules/rsyslog/files/rsyslog.conf

That last line is important, because all our Loggly-specific configurations will go in /etc/rsyslog.d.

Now that rsyslog is set up, we need to tell each instance where to send its log files. Additionally, we need to register each instance with each log file it will be sending to Loggly. Each log file is sent to a different endpoint, which Loggly refers to as an input. Each input has an ID, and a specific port on the Loggly logging server that maps to that ID. We have already set up a specific Loggly user for API purposes, and we'll use that user to do the registration.

First, we set up a new module that will hold our Loggly API configuration.

modules/loggly/manifests/init.pp

The hash of inputs allows us to easily reference each input we've mapped in Loggly, without having to remember specific ID and port numbers elsewhere in our manifests.

We'll also set up a Puppet template for tailing our log files and forwarding them to Loggly:

modules/loggly/templates/rsyslog.loggly.conf.erb

The template will take the values defined in loggly::$inputs and create one config file per input, which will ultimately end up in /etc/rsyslog.d.

One last Loggly manifest file is needed. This one generates the config file for an input, then registers the server against the input using Loggly's API.

modules/loggly/manifests/device.pp

This manifest uses the $name passed to the definition to gather data from the $inputs hash to build a config file. It also execs a cURL call to Loggly's API to register the device for the input. The response to this call is stored for two reasons: first, if anything goes wrong, we have a record of Loggly's response to our request; and second, if the response file already exists, Puppet will know it does not need to make another call to the API.

All that remains is to use our loggly::device definition in a node definition:

manifests/site.pp

Since our input IDs and ports are bound to specific input names in our $inputs hash, we only need to know the names of the inputs we want to configure this instance to send to, and loggly::device does the rest.

Hopefully, at some point we (or someone else) will get around to releasing a proper Puppet module for this. Until then, I hope this post helps you get set up with the Loggly centralized logging service.

Loggly from Javascript

2013-04-12T23:47:00.001-04:00

Update 2013-10-22: This post refers to Loggly generation 1, and may (most likely) not work with Loggly's new second generation product offering.

For my most recent Dev Days project, I implemented centralized logging for our application, ServiceTrade. I don't want to worry about running our own indexing server, or storing the logs long term, so I investigated several SaaS logging solutions and eventually settled on Loggly. I was impressed with the ease of setting up our account, defining our logging inputs and even integrating with our Puppet configuration management infrastructure. For long term storage, they push raw log files to an S3 bucket of your choosing. Their customer support seemed very eager to help with the one issue I had. All-in-all, I've been pleased with the product.

One thing about Loggly that could use a little work is saved searches. First off, when Loggly gives you a graph of events from a saved search (a very cool feature) the graph sometimes loses information when zooming in and clicking on a section to see specific logs. Visiting the page for a saved search on a specific set of inputs and clicking on the graph to pull up the log lines for that search will give log lines across all inputs, not just the ones the saved search is limited to.

~~Secondly, you are limited to only 5 saved searches at the moment. The saved search feature is in beta, so hopefully they will allow saving of more (ideally unlimited) searches in the future.~~ Apparently, you can have up to 2000 saved searches; the wording on the saved searches list page is out-of-date.

We are using their excellent API to pull down data and do our own visualizations of multiple saved searches. I'm using jQuery on a simple HTML page to query the API. There are a few caveats. The following information will hopefully prevent someone else from spending the half-hour I did trying to figure this out.

First of all, the API uses HTTP basic authentication. jQuery's get call does not handle HTTP authentication, so I had to use the more verbose ajax method.

Also, since the request is cross-domain, I had to use JSONP, which Loggly supports.

Finally, Loggly's API returns a Bad Request response if you send any parameters that it does not recognize. Unfortunately, unless you tell it otherwise, jQuery.ajax() will always add a timestamp query parameter to JSONP requests to prevent response caching. In order to get everything to work, I had to tell the request to allow caching (the cache option set to true) so that jQuery would not send the parameter.

Here is what the final call looks like:

You might also want to check out my upcoming blog post on using Puppet to automatically register servers with Loggly.

Storytellers and Prognosticators: Lessons in Communication Style

2013-02-11T08:30:00.000-05:00

Every other Friday, my team holds a retrospective meeting. One issue that came up in our most recent retrospective was a particularly contentious user story estimation meeting that had occurred a few days prior. Voices were raised, people were interrupted mid-sentence, and sarcasm was liberally deployed. This was a specific isolated incident, and I am very proud of my team that we were able to quickly own, discuss and remedy the situation. We are a better team of communicators as a result.

After the retrospective, I thought about different communication styles, and I tried to come up with different personas for the different types of communicators on my team. So far, I've broken the communication styles down into these general personas:

Storytellers love to talk about the past. Their goal is to remind everyone that experience is the best teacher. They are the archive of a team's experiences, reminding everyone of past obstacles and past triumphs. By trying to couch everything in terms of previously encountered situations, storytellers sometimes have trouble recognizing changed circumstances. You can recognize a storyteller by phrases like "Do you remember when..." and "Last time this happened..."

Prognosticators are in some ways the opposite of storytellers. They love to talk about likely outcomes and visions of the future. Their ideas tend to be expressed as innovative approaches to problems and as warnings about potential pitfalls. Prognosticators can sometimes derail a conversation by making predictions based on erroneous assumptions. They can be recognized by phrases like "Something we need to watch out for..." and "It's possible that..."

Prognosticators and storytellers tend to feed off each other, with the latter talking about how a current situation is similar to the past, and the former talking about how the current situation is different. When they get on a roll together, it may be difficult for the other personas to break into the conversation. Both must be careful not to take the conversation off on tangents and drive it away from productive outcomes.

Inquisitors communicate by asking questions (the Socratic method.) Their strength is getting others to think about what they are saying by asking for clarification, and by steering the tone and direction of the conversation to discover new possibilities. Inquisitors are excellent at keeping storytellers and prognosticators on track. It is important that inquisitors remember to contribute knowledge instead of always asking questions to which they already know the answers, otherwise they risk looking condescending. Inquisitors can be recognized by phrases like "What if..." and "What did you mean by..." and "Have you considered..."

Evaluators are good listeners. Unlike the other personas, which tend to drive conversations, evaluators do not speak often, and when they do, they tend to be soft-spoken. Evaluators are good at judging ideas objectively and combining multiple points of view into one cohesive vision. It is important that evaluators do not let themselves get talked over, and that they do not completely hold back from contributing; inquisitors can help draw an evaluator into the conversation. Evaluators can be recognized by phrases like "That's an interesting thought..." and "I was thinking about..."

Most people are more comfortable in one, or a combination of two, personas. I call this their base communicator. These are not discrete communication styles. Elements of any one may be combined with elements of another, and no person is purely one persona or another. From day to day, even within the course of a single conversation, people flow in and out of these different personas.

In conversation, especially in larger groups, it is important to recognize which persona is speaking. Remember that familiarity breeds complacency! As team cohesion grows, you will soon learn to recognize each team member's base communicator, and when that happens it is even more important to pay attention to what they are saying and how they are saying it. Just because you know someone's general communication style, you cannot assume that they will always communicate that way in every circumstance.

Migrating to Dependency Injection

2012-11-21T09:00:00.000-05:00

Recently, I gave a lunch-and-learn to my team on the topic of Dependency Injection (DI). Instead of showing a bunch of slides explaining what DI is and what it's good for, I created a small project and demonstrated the process of migrating a codebase that does not use DI to one that does.

Each stage of the project is a different tag in the repository. The code can be found on GitHub: http://github.com/jadell/laldi. Check out the code and run composer install. To see the code at each step in the process, run the git checkout command in the header of each section.

`git checkout 0-initial`

The initial stage of the project is a listing of developers on the team, and with whom each has pair programmed. The project uses Silex, which is both a PHP micro-framework and a simple dependency injection container. For the initial stage, only the framework is used (for HTTP request handling, routing and dispatching.) Each route handler is registered in index.php. A handler instantiates a controller object, then calls a method on the controller to handle the request. For this project, there is only one controller, the HelloController, but there could be many others.

When the HelloController is instantiated, it instantiates a UserRepository to handle access to User domain objects and a View to handle rendering the output. The UserRepository instantiates a Datasource to handle access to the raw data. The path to the data file is hardcoded in the Datasource. User objects instantiate a UserRepository of their own, so they can access the list of paired Users.

The code smell here is the chain of initializations that get triggered by constructing a new HelloController, which leads to several problems: it tightly couples every component to every other component; none of the components can be unit-tested in isolation from the others; we can't change the behavior of a class or its dependencies without modifying the class; and if the constructor signature of any component changes, we need to find each instantiation of that component and change it to pass in the new parameters (which might involve threading those parameters through multiple layers which don't require them.) There also may be bugs with multiple objects having their own copies of dependencies (for example, each User object should probably share a UserRepository, not have their own.)

`git checkout 1-add-di`

A simple way to solve these problems is to ensure that no class is responsible for creating its own dependencies. In order for a class to access its dependencies, we pass those dependencies into the class's constructor (another form of DI uses "setter" injection instead of constructor injection.)
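As a rough illustration of constructor injection, here is a simplified sketch. The class names follow the ones mentioned in this post, but the bodies are hypothetical and much smaller than the code in the repository:

```php
<?php
// Hypothetical, simplified versions of the classes named in the post.

class Datasource
{
    private $path;

    public function __construct($path)
    {
        $this->path = $path; // the path is passed in rather than hardcoded
    }

    public function getPath()
    {
        return $this->path;
    }
}

class UserRepository
{
    private $datasource;

    // The repository receives its Datasource; it never calls "new Datasource".
    public function __construct(Datasource $datasource)
    {
        $this->datasource = $datasource;
    }

    public function getDatasource()
    {
        return $this->datasource;
    }
}

// Wiring happens outside the classes, so a test can hand in its own Datasource.
$datasource = new Datasource('/tmp/users.json'); // example path, an assumption
$repository = new UserRepository($datasource);
```

Because UserRepository only depends on what it is given, a unit test can pass a Datasource pointing at fixture data without touching the class.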

The problem now is that all the dependencies must be wired together. Since no class instantiates its own dependencies, we must have a component that does this for us. This is the job of the Dependency Injection Container (DIC).

Silex is also used as a DIC. In bootstrap.php, all class instantiation is wrapped in a factory method that is registered with the DIC. When the DIC is asked for a registered object, if that object is not already instantiated, it will be created. Additionally, all its dependencies will be created and injected through its constructor.

For cases where a component may need to create multiple objects of the same type (like the UserRepository needs to create many User objects), factory closures are created. The factory makes sure that any User object is created with the correct dependencies, and that shared dependencies are not re-created each time. Since the closure is defined in the DIC, the UserRepository does not need access to any of the external dependencies that might be used by User objects, and the closure can be passed to the UserRepository.
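The sketch below shows the idea with a tiny stand-in container instead of Silex, so it runs on its own. The Container class, its register/get methods and the service names are all hypothetical; the repository's bootstrap.php will look different:

```php
<?php
// A tiny stand-in container (hypothetical); the post itself uses Silex for this.
class Container
{
    private $factories = array();
    private $instances = array();

    public function register($name, $factory)
    {
        $this->factories[$name] = $factory;
    }

    // Create the service on first request, then keep returning the same one.
    public function get($name)
    {
        if (!isset($this->instances[$name])) {
            $this->instances[$name] = call_user_func($this->factories[$name], $this);
        }
        return $this->instances[$name];
    }
}

class User
{
    public $name;
    public $repository;

    public function __construct($name, UserRepository $repository)
    {
        $this->name = $name;
        $this->repository = $repository;
    }
}

class UserRepository
{
    private $userFactory;

    // The repository only knows it was given a callable that builds Users.
    public function __construct($userFactory)
    {
        $this->userFactory = $userFactory;
    }

    public function create($name)
    {
        return call_user_func($this->userFactory, $name);
    }
}

$container = new Container();

// Factory closure: every User gets the single shared UserRepository.
$container->register('userFactory', function ($c) {
    return function ($name) use ($c) {
        return new User($name, $c->get('userRepository'));
    };
});

$container->register('userRepository', function ($c) {
    return new UserRepository($c->get('userFactory'));
});

$repository = $container->get('userRepository');
$alice = $repository->create('Alice');
$bob   = $repository->create('Bob');
```

Both users end up holding the same repository instance, and UserRepository never needs to know what a User's constructor requires.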

`git checkout 2-comments`

Our project gets a new requirement to allow comments on users. Following the same pattern as with Users, we create a CommentRepository and a Comment domain object, and factory methods for creating and injecting each. HelloController needs access to the CommentRepository in order to save comments. Also, Users need access to the repository to retrieve all the Comments for that User. Since we are using DI, we change their constructors to receive a CommentRepository.

If we were not using a DIC, we would have to find every place that instantiated a HelloController or User object and change those calls to pass in the CommentRepository. Additionally, we would have to pass the CommentRepository through to any place that instantiated the object. Since we are using a DIC and factory methods, we can be assured that there is only one place that creates each of these object types, and therefore only one place needs to change (the factory methods in the DIC.)

We already get benefits from our refactoring to use a DIC!

`git checkout 3-events`

Our last requirements are to do some simple tracking of page views and notification of comments (perhaps with email, but for demonstration purposes, to a log file.) Each of these is a good use case for an event-driven system.

One of the challenges with event-driven development is making sure that the event listeners are instantiated before the emitters begin sending events. An option for getting around this is to instantiate all listeners as part of the application's bootstrapping/initialization process. But this can become cumbersome and have performance implications.

Instead of always instantiating every listener, we can use our DIC and factory methods. Since event emitters are instantiated through a factory method, we use that same factory method to initialize listeners to any events that emitter may emit. By tying the creation of the emitter to the creation of the listeners, we ensure that we only instantiate those listeners for which we have created an emitter.

For example, we know that the HelloController may emit page view events, so we make sure that whenever we instantiate a HelloController, we also create a page view listener and set it to listen for the event.
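A minimal sketch of that wiring follows. The EventDispatcher and PageViewListener classes and the 'page.view' event name are hypothetical stand-ins, not the ones used in the repository:

```php
<?php
// Hypothetical minimal event dispatcher; the names here are illustrative only.
class EventDispatcher
{
    private $listeners = array();

    public function listen($event, $listener)
    {
        $this->listeners[$event][] = $listener;
    }

    public function emit($event, $payload)
    {
        $listeners = isset($this->listeners[$event]) ? $this->listeners[$event] : array();
        foreach ($listeners as $listener) {
            call_user_func($listener, $payload);
        }
    }
}

class PageViewListener
{
    public $views = array();

    public function onPageView($page)
    {
        $this->views[] = $page; // a real listener might write to a log instead
    }
}

class HelloController
{
    private $events;

    public function __construct(EventDispatcher $events)
    {
        $this->events = $events;
    }

    public function hello()
    {
        $this->events->emit('page.view', 'hello');
        return 'Hello!';
    }
}

$events = new EventDispatcher();
$listener = null; // captured by reference only so it can be inspected afterwards

// Factory for the emitter: creating a HelloController also wires up its listener.
$helloControllerFactory = function () use ($events, &$listener) {
    $listener = new PageViewListener();
    $events->listen('page.view', array($listener, 'onPageView'));
    return new HelloController($events);
};

$controller = $helloControllerFactory();
$controller->hello();
```

If no HelloController is ever requested, the PageViewListener is never created either.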

Potential Issues

Several good questions were asked during my presentation, and I attempted to answer them as best as I could.

Is debugging harder having the definitions of the objects away from their usage?
Since you are explicitly passing an object's dependencies, and you get all the dependencies from the same place, you can minimize external state that may affect an object's behavior. Also, by using factory methods, you can find exactly where an object was created. Without the DIC, an object or its dependencies could come from anywhere in your code. DI also makes unit- and regression-testing easier.

Do the bootstrap/DIC definitions become unwieldy when you have a lot of interconnected objects?
The DIC is kept simple by having each factory method responsible for creating only one type of object. If each method is small and isolated, the file may become very large, but the logic in the file will remain easy to read. You can even break the bootstrap of the DIC into multiple files, each file handling a different component or section of the project's dependencies. Circular dependencies may be a problem with a highly connected object graph, but that is a separate problem regardless of using a DIC or not.

Where is a good place to start with refactoring an existing large codebase to use DIC?
It's a multi-step process:

Pick a class and create a factory method on a globally accessible DIC that instantiates that class.
For every place in the class that uses the "new" operator, replace it with a call to a factory function/method on the DIC.
Move calls to the factory methods into the class's constructor, and store the results as properties on the object.
Change the constructor to not call the factory methods itself, but receive its dependencies as parameters.
Update the class's factory method to call the other factory methods and pass in the dependencies to the constructor.
Find any place in the code that instantiates the class and replace it with a call to the class's factory method.

Repeat this process for any usage of the "new" operator in the code. By the end, the only place in the code that uses the "new" operator or refers to the DIC should be the DIC itself.
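The steps above can be sketched on a single class. The Dic registry and the HelloControllerStep3 name are hypothetical, used only to show the intermediate and final states side by side:

```php
<?php
// Hypothetical globally accessible registry of factory closures (step 1).
class Dic
{
    public static $factories = array();

    public static function make($name)
    {
        return call_user_func(self::$factories[$name]);
    }
}

class View
{
    public function render($text)
    {
        return "<p>$text</p>";
    }
}

// Steps 2-3: "new View()" inside the class becomes a factory call in the constructor.
class HelloControllerStep3
{
    private $view;

    public function __construct()
    {
        $this->view = Dic::make('view');
    }

    public function hello()
    {
        return $this->view->render('hello');
    }
}

// Steps 4-5: the constructor receives the dependency, and the class's own
// factory is the only code that knows how to build it.
class HelloController
{
    private $view;

    public function __construct(View $view)
    {
        $this->view = $view;
    }

    public function hello()
    {
        return $this->view->render('hello');
    }
}

Dic::$factories['view'] = function () {
    return new View();
};
Dic::$factories['helloController'] = function () {
    return new HelloController(Dic::make('view'));
};

$intermediate = new HelloControllerStep3();   // still refers to the DIC internally
$final = Dic::make('helloController');        // step 6: callers use the factory
```

Both versions behave the same, which is the point: each step can be shipped on its own without changing behavior.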

Interfaces and Traits: A Powerful Combo

2012-09-27T12:00:00.000-04:00

If you're not using interfaces in PHP, you are missing out on a powerful object-oriented programming feature. An interface defines how to interact with a class. By defining an interface and then implementing it, you can guarantee a "contract" for consumers of a class. Interfaces can be used across unrelated classes. And they become even more useful when combined with the new traits feature in PHP 5.4.

Let's suppose we have a class called User. Users have an Address that our application mails packages to via a PackageShipper:

class Address {
 // ... setters and getters for address fields ...
}

class User {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class PackageShipper {
 public function shipTo(User $user) {
  $address = $user->getAddress();

  // ... do shipping code using $address ...
 }
}

Our application is happily shipping packages, until one day a new requirement comes in that we need to be able to ship packages to Companies as well. We create a new Company class to handle this:

class Company {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

Now we have a problem. The PackageShipper class only knows how to handle Users. What we need is for the PackageShipper class to handle anything that has an Address.

We could make a base class that both Users and Companies inherit from and have the PackageShipper accept any class that descends from the base class. This is an unsatisfying solution. Semantically, Users and Companies are different entities, and there may not be enough common functionality to move to a base class that both inherit. They may not have much in common besides the fact that they both have an address. Also, either class may already extend another class, and we cannot extend more than one class since PHP has a single-inheritance object model.

Instead, we can use an interface to define the common parts of Users and Companies that PackageShipper needs to deal with. Then, Users and Companies can implement that interface, and PackageShipper only has to deal with objects that implement the interface.

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

class User implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class Company implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

A class can implement many different interfaces, and classes with different bases can still implement the same interface.

But there is still a problem. Both Company and User implement the Addressable interface using the same code. It is not required to do this; as long as the constraints of the interface are met, the implementation does not have to be the same for every class that implements the interface. But in this case, they both implement the interface in the same way, and we have duplicated code. If a third Addressable class came along, it is possible that it would also implement the interface the same way, leading to even more duplication.

If you are using PHP 5.3 or earlier, there is little that can be done about this situation. But if you are using PHP 5.4, there is a new construct that is made to handle exactly these types of code duplication scenarios: traits.

A trait is similar to a class in that they both implement methods and have properties. The difference is that classes can be instantiated, but traits cannot. Instead, traits are added to class definitions, giving that class all the methods and properties that are defined in the trait.

A helpful way to think about traits is like a macro: using a trait in a class is the same as if the code in the trait were copied into the class.

Using traits, we can clean up the code duplication and still meet the contract defined by the Addressable interface:

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

Now, any class can immediately implement the Addressable interface by simply using the AddressAccessor trait. All the duplicated code has been moved to a single place.

The trait itself does not implement Addressable. This is because only classes can implement interfaces. Remember that traits are nothing more than code that gets copied into a class; they are not classes on their own.
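A quick way to see this, using trimmed-down copies of the classes above (a sketch; the method bodies are shortened for space):

```php
<?php
class Address
{
}

interface Addressable
{
    public function setAddress(Address $address);
    public function getAddress();
}

trait AddressAccessor
{
    protected $address;

    public function setAddress(Address $address)
    {
        $this->address = $address;
    }

    public function getAddress()
    {
        return $this->address;
    }
}

class Company implements Addressable
{
    use AddressAccessor;
}

$company = new Company();
$company->setAddress(new Address());

// The class fulfils the contract using the methods copied in from the trait.
$isAddressable = $company instanceof Addressable;

// Traits are reported separately from interfaces.
$traits = class_uses($company);
$interfaces = class_implements($company);
```

The interface shows up in class_implements() and the trait in class_uses(); the contract is declared by the class, and the trait only supplies the code.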

Interfaces in PHP are a useful feature for enforcing code correctness. And when interfaces are combined with traits, they form a powerful tool which allows for faster development, decreased code duplication, greater readability, and better maintainability.

Here is all the final code for the package shipping application:

class Address {
 // ... setters and getters for address fields ...
}

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

Plumbing PHP

2012-08-31T15:24:00.001-04:00

When building a project with Neo4j or most other graph databases, it is impossible to avoid learning about Tinkerpop's excellent Gremlin graph processing language. The processing layer of Gremlin is built on top of Pipes, a dataflow programming library.

I was inspired by the syntax and ease-of-use of Gremlin to build a simple processing pipeline library in PHP. The result is Plumber, a library for easily building extensible deferred-processing pipelines.

Plumber is built on top of PHP's native Iterator library. The idea is simple: instantiate a new processing pipeline, attach processing pipes to it, then send an iterator through the pipeline and iterate over the results in a foreach. Each element that comes out the other end of the pipeline has been passed through and processed by each pipe. And because Plumber uses iterators, it natively supports lazy-loading and Just-In-Time evaluation.

A simple example would be reading a set of records from a database, formatting them in some manner, then echo'ing them out to the screen:

$users = array(); // placeholder: retrieve user records from a database as an array or Iterator

$names = array();
foreach ($users as $user) {
    if (!$user['first_name'] || !$user['last_name']) {
        continue;
    }

    $name = $user['first_name'] . ' ' . $user['last_name'];
    $name = ucwords($name);
    $name = htmlentities($name);
    $names[] = $name;
}

// later on, display the names
foreach ($names as $name) {
    echo "$name<br>";
}

There are a few obvious downsides to doing things this way: the entire set of records is looped through more than once; all the records must be in memory at the same time (twice even, once for $users and once for $names); and the processing steps in the foreach are executed immediately on every record. These may not seem like a big deal if the record set is small and the processing steps are trivial, but they can become big problems if you are not careful.

Here is the same code using Plumber:

$users = array(); // placeholder: retrieve user records from a database as an array or Iterator

$names = new Everyman\Plumber\Pipeline();
$names->filter(function ($user) {
        return $user['first_name'] && $user['last_name'];
    })
    ->transform(function ($user) {
        return $user['first_name'] . ' ' . $user['last_name'];
    })
    ->transform('ucwords')
    ->transform('htmlentities');

// later on, display the names
foreach ($names($users) as $name) {
    echo "$name<br>";
}

The list of $users is only looped through one time, and there is no need to keep a separate list of $names in sync with the $users list. Each $user is transformed into a $name on-demand, keeping resources free.

This can all be accomplished using Iterators, but there is quite a bit of boilerplate code involved. Plumber is meant to remove most of the boilerplate and let the developer concentrate on writing their business logic.
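For comparison, here is roughly what the same pipeline looks like with plain SPL iterators. This is a sketch, not Plumber's internals: CallbackFilterIterator is built into PHP 5.4 and later, while MapIterator is a hypothetical helper written here because SPL has no mapping iterator. The sample user records are made up.

```php
<?php
// A mapping iterator is not built into SPL, so this one is written by hand.
class MapIterator extends IteratorIterator
{
    private $callback;

    public function __construct(Traversable $iterator, $callback)
    {
        parent::__construct($iterator);
        $this->callback = $callback;
    }

    // Apply the callback lazily, only when an element is actually read.
    #[\ReturnTypeWillChange]
    public function current()
    {
        return call_user_func($this->callback, parent::current());
    }
}

// Made-up sample data standing in for database records.
$users = new ArrayIterator(array(
    array('first_name' => 'ada', 'last_name' => 'lovelace'),
    array('first_name' => '', 'last_name' => 'nobody'),
    array('first_name' => 'alan', 'last_name' => 'turing'),
));

$names = new CallbackFilterIterator($users, function ($user) {
    return $user['first_name'] && $user['last_name'];
});
$names = new MapIterator($names, function ($user) {
    return $user['first_name'] . ' ' . $user['last_name'];
});
$names = new MapIterator($names, 'ucwords');
$names = new MapIterator($names, 'htmlentities');

// Nothing has been processed yet; the work happens as the result is consumed.
$result = iterator_to_array($names, false);
```

Each stage needs its own wrapper object and the nesting has to be built by hand, which is the boilerplate the fluent interface hides.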

There is more to Plumber, including several built-in pipe types, and the ability to extend the library with your own custom pipes. It is also not necessary to use the fluent interface, if that is not your style. More usage information can be found in the README file in the Plumber github repo. Constructive feedback is always welcome!

No Best Practices Yet

2012-07-15T16:39:00.000-04:00

I'm getting a little tired of the constant onslaught of religious wars in software development these days. Every day, there's a new "PHP sucks vs. PHP is the best evar!", "Nodejs is crap vs Nodejs async FTW!", "OOP is a failure vs. FP doesn't get things done", "Agile vs. anything else" article. It's frustrating, because everyone is wrong.

Mankind has been building bridges for literally tens of thousands of years. From the first moment a primitive monkey-man threw a log across a stream and every other monkey-man tribe copied the idea, we've been building bridges. And guess what? As a species, we're really good at it.

And over the millennia, we've developed the discipline of engineering. At its core, engineering is nothing more than a set of best practices that say, "Hey, if you build your bridge this way or that way (depending on its purpose), it will most likely stay up for a really long time and continue to serve the function that you meant it to serve." If you ignore those best practices of engineering, your bridge may stay up, or it may not. And anyone who does build their bridges according to best practices will be able to look at your bridge and tell you why you have failed.

How does this relate to software? Mankind has been writing computer code for less than a century. Modern programming languages have been around for about 40 years. The current crop of "high level" languages (PHP, Ruby, Java, Python, Javascript, Clojure, etc.) have only been around for about 20 years. The processes by which software gets built have been constantly evolving; we can't even agree on a definition of an "agile methodology." There are new programming languages, new frameworks, new architectures, new hardware, new ideas constantly coming out of the best minds in our field.

So why does any single camp believe they know best when it comes to software "best practices?" Our discipline is so young that I don't believe any tool or method used by any software developer nowadays can be considered a "best practice."

I'm sure back in the prehistoric days, monkey-tribes thought that a log across the stream was a best practice. They couldn't envision how mathematics, physics and materials would change to allow them to build better, stronger bridges. Fast-forward to current day, and building bridges is a pretty rigorous discipline. Even breakthroughs in stronger, lighter materials don't change the underlying principles of bridge-building. Engineers might have personal preferences for the type of bridges they like to build, but they don't debate over the fundamental rules of how to build a given bridge type, or which bridge type might be superior for a given purpose (unless the purpose is so trivial that aesthetics or personal preference can be a factor beyond physical properties.)

And yet, every time a new language or framework pops up in software, entire populations of the development community are willing to abandon what came before in droves and proclaim the new way as self-evidently superior. Ignoring, of course, that everyone said the same thing about whatever they were leaving behind the last time.

Every article I've read on either side of any of these debates can be boiled down to "my brain doesn't work that way, it works this way, therefore, this way is superior." I've yet to see a compelling argument given for any side of any software debate that doesn't have the subtext of simply being the author's personal preference. And that right there tells me that our profession is too young to have decided on what is or is not a best practice.

So yeah, let's keep throwing our logs across the stream and fighting over whether oak is better than pine, ignoring that as soon as we discover steel the debate will be pointless. At least be willing to admit that while the tools and methods we have today might be the best we can come up with, they're certainly not "best practices" yet.

Neo4jPHP Available as a Composer Package

2012-05-14T00:29:00.002-04:00

Neo4jPHP is now available as a Composer package. To use it in your project, create a "composer.json" file in the root of your project, with the following contents:

{
  "require":{
    "everyman/neo4jphp": "dev-master"
  }
}

After installing Composer, run the following command in the root of the project:

> php composer.phar install

The Neo4jPHP library will be downloaded and installed in your project under "vendor/everyman/neo4jphp". An autoloader will be created at "vendor/autoload.php" which you should include in your project, then start using Neo4jPHP as normal:

require("vendor/autoload.php");

$client = new Everyman\Neo4j\Client();
$node = $client->makeNode();

To get the latest version of Neo4jPHP, run:

> php composer.phar update

Runtime Expectations Javascript Throwdown

2012-04-23T21:22:00.001-04:00

Last week, TriangleJS were the guests on the talk-show podcast Runtime Expectations. The topic was a Javascript framework "throwdown" which covered a wide range of different frameworks and toolkits. We even discussed if frameworks were necessary at all. Ben and Adrian are great guys and it was a lot of fun being on their show and hanging out for beers afterwards.

If you want to hear me make a total stuttering ass out of myself, I start rambling around 23:34 then fade out so more informed people can talk.

Data Mapper Injection in PHP Objects

2012-04-02T08:00:00.000-04:00

I was trying to figure out a simple way to have data from a database injected into a domain object in PHP. I don't want to enforce that the domain object implements setters and getters for every property, and the burden on the developer for implementing the data injection interface should be as low as possible.

My solution relies on PHP's ability to pass and assign values by reference. The data mapper is injected into the domain object. The domain object tells the mapper to attach to the object's properties. The mapper can then read and write to those properties as if they were its own.

interface Mappable {
    public function injectMapper(Mapper $mapper);
}

class Mapper {
    protected $attached = array();

    public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }

    public function fromDataStore($data) {
        foreach ($data as $fieldName => $value) {
            if (array_key_exists($fieldName, $this->attached)) {
                $this->attached[$fieldName] = $value;
            }
        }
    }

    public function fromDomainObject() {
        $data = array();
        foreach ($this->attached as $fieldName => $value) {
            $data[$fieldName] = $value;
        }
        return $data;
    }
}

class Product implements Mappable {
    protected $id;
    protected $name;

    public function injectMapper(Mapper $mapper) {
        $mapper->attach($this->id, 'id');
        $mapper->attach($this->name, 'name');
    }

    // ...domain/business logic methods go here...
}

// Create a product and manipulate it
$product = new Product();
$product->doFoo();
$product->linkToBar();

// Inject the data mapper
$mapper = new Mapper();
$product->injectMapper($mapper);
// This could be a call to save the array of data in a database
print_r($mapper->fromDomainObject());

// Read a product from the "database"
$product = new Product();
$mapper = new Mapper();
$product->injectMapper($mapper);
$mapper->fromDataStore( array("id"=>123, "name"=>"Widgets") );

So what's happening here? The fields of the Product domain object are passed into the Mapper's `attach` method by reference. This allows that method to modify the variable passed in without passing it back to the caller.

    public function attach(&$ref, $fieldName) {
        $this->attached[$fieldName] = &$ref;
    }

If the value were only passed by reference, only the `attach` method would be able to modify it. As soon as the method returned, the reference would be lost. In order to continue having a reference to the Product's properties, the `attach` method holds the reference in an array using "assign by reference."

Note the ampersand after the equals sign. This tells PHP that we don't want to assign a copy of the value in $ref, we want the reference to $ref itself. This allows us to later modify the value of the property in Product by modifying the value of a reference to the same place in memory where the Product's properties are held.
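
To see both halves of the trick in isolation, here is a small stand-alone sketch. The function name `attachTo` and the `$holder` array are made up for this example; they only mimic what `Mapper::attach` does with its `$attached` array.

```php
<?php
// Stand-alone illustration of "pass by reference" plus "assign by reference".
// attachTo() and $holder are hypothetical names, not part of the Mapper above.
function attachTo(array &$holder, &$ref, $key) {
    // Without the "&" after "=", $holder[$key] would only get a copy of the value
    $holder[$key] = &$ref;
}

$name = 'Widget';
$holder = array();
attachTo($holder, $name, 'name');

// Writing through the stored reference changes the original variable
$holder['name'] = 'Gadget';
$afterWrite = $name;            // 'Gadget'

// And the reverse: changing the original is visible through the reference
$name = 'Gizmo';
$afterRead = $holder['name'];   // 'Gizmo'
```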

So now we have the beginnings of a basic data mapper that remains loosely coupled to our domain objects, and without having to write an explicit mapping class for each domain object-to-database mapping.

GetSet Methods vs. Public Properties

2012-03-04T22:06:00.001-05:00

I was recently having a debate with a coworker over the utility of writing getter and setter methods for protected properties of classes. On the one hand, having getters and setters seems like additional boilerplate and programming overhead for very little gain. On the other hand, exposing the value properties of a class seems like bad encapsulation and will overall lead to code that is more difficult to maintain.

I come down firmly on the get/set method side of the fence. It takes very little extra code, defines the interface to the class explicitly and makes the class easier to extend and maintain in the long run. Let's try an example (in PHP, though the premise holds for any language with public/private properties.)

class PublicValues {
    public $foo;
    public $bar;
}

class GetSet {
    protected $foo;
    protected $bar;

    public function getFoo() {
        return $this->foo;
    }

    public function setFoo($value) {
        $this->foo = $value;
    }

    public function getBar() {
        return $this->bar;
    }

    public function setBar($value) {
        $this->bar = $value;
    }
}

$public = new PublicValues();
$public->foo = array('some'=>'value', 'data'=>'structure');
$public->bar = 123;

$getset = new GetSet();
$getset->setFoo(array('some'=>'value', 'data'=>'structure'));
$getset->setBar(123);

echo "Public is: ".$public->bar." and ".print_r($public->foo, true)."\n";
echo "GetSet is: ".$getset->getBar()." and ".print_r($getset->getFoo(), true)."\n";

Obviously, class `PublicValues` is much shorter than class `GetSet`. Shorter is better, right? Less code to maintain, fewer places for bugs to creep in. And it didn't take me nearly as long to type (though, truthfully, `GetSet` didn't take that long to write, either, as it was mostly cut-and-paste.)

Now let's say that we have decided to wrap the array data structure in a new `Widget` class and we need to ensure that `foo` can only be of type `Widget`. Modify the classes to make this so:

class PublicValues {
    // ... add a whole new method
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
}

class GetSet {
    // ... only change here is type-hinting
    public function setFoo(Widget $value) {
        $this->foo = $value;
    }
    // ...
}

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
$getset->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...

`PublicValues` had an entirely new method added to it. `GetSet` only had a small modification to an existing method.

Actually, this still isn't perfect; application code can still say `$public->foo = 'whateverIWant';` without going through the `PublicValues::setFoo` method. In fact, while writing this example, I forgot until my third revision to go back and change the sample code to use the set method. Let's make sure that doesn't happen in the future:

class PublicValues {
    // ... change foo so that client code can't set it to the wrong thing
    protected $foo;

    // ... still need to be able to read foo, so now we add a getter
    public function getFoo() {
        return $this->foo;
    }
    // ...
}

// GetSet remains the same ...

// ...
$public->setFoo(new Widget(array('some'=>'value', 'data'=>'structure')));
// ...
echo "Public is: ".$public->bar." and ".print_r($public->getFoo(), true)."\n";

Notice that `GetSet` didn't have to change at all in this case. The only client-code change it ever needed was finding every call to `GetSet::setFoo` and making sure it passed a `Widget`.

As for `PublicValues`, we'll have to go through all the code making sure that everywhere that was setting `foo` with assignment now calls the set method (with a proper `Widget`), and anywhere that was reading it directly calls the get method.

My future self is screaming at my current self. Now we have two ways of accessing the PublicValues class: one way through direct value reading and writing (bar), and the other through getset methods (foo). As a consumer of this class, I have no way of knowing which properties are openly accessible and which are guarded through getset methods, other than going back and looking through the code.

To solve this problem, I am going to use a method that I see in almost every PHP project of non-trivial size that I have ever worked with or seen the source code of: use magic __get and __set to determine if the property is directly accessible or not:

class PublicValues {
    // ...

    public function __get($name) {
        $method = 'get'.$name;
        if (method_exists($this,$method)) {
            return $this->$method();
        } else {
            return $this->$name;
        }
    }

    public function __set($name, $value) {
        $method = 'set'.$name;
        if (method_exists($this,$method)) {
            return $this->$method($value);
        } else {
            $this->$name = $value;
        }
    }
}

As before, `GetSet` doesn't change one bit. For `PublicValues`, the rest of the code can continue to use what looks like straight assignment, and if I want to know if the assignment is direct modification or a magic set, I only need to look in two places to figure it out: look for a named getset method, or assume I am using the magic __get and __set. Perfect...

...until we have to add something to this class that absolutely must not be modified by client code.

class PublicValues {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}

class GetSet {
    // ...
    protected $modificationToThisCrashesTheApp;
    // ...
}

Now we have a big problem. Clients of `GetSet` are protected, because nothing in `GetSet` can be modified except through a class method. We don't provide a method for accessing the dangerous property, so clients can't hurt themselves.

But clients of `PublicValues` can modify any property they want. The magic __set method makes public any protected value that does not have an explicit set method. We could fix this by adding a set method that throws an exception. Or we could solve it by blacklisting that property in the magic __set. Or we could make sure everyone (including new hires six months from now) knows not to touch this property. Or we could solve it by...

Truthfully, it doesn't matter how you solve it. Encapsulation of `PublicValues` has been broken from the beginning. We haven't even tried adding validation to either class, or making the setting of one value affect another, or adding read- or write-only methods. And over the course of just a few changes, we've had to add get/set methods anyway. We've just done it later in the application lifecycle after client code is already using the class, the time when it is most costly to make those sorts of changes.
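
If the leak seems abstract, here is a trimmed-down sketch of it. The class name `Leaky` and the `peek` helper are invented for the demonstration; the `__set` body is the same as the one added to `PublicValues` above.

```php
<?php
// Trimmed-down stand-in for PublicValues with only the magic __set
class Leaky {
    protected $modificationToThisCrashesTheApp = 'safe';

    public function __set($name, $value) {
        $method = 'set' . $name;
        if (method_exists($this, $method)) {
            return $this->$method($value);
        }
        // No explicit setter, so the protected property is written directly
        $this->$name = $value;
    }

    // Hypothetical helper so we can observe the protected value from outside
    public function peek() {
        return $this->modificationToThisCrashesTheApp;
    }
}

$leaky = new Leaky();

// Looks like plain assignment, but PHP routes it through __set because
// the property is not accessible from this scope
$leaky->modificationToThisCrashesTheApp = 'oops';
$leaked = $leaky->peek();
```

After the assignment, `$leaked` holds 'oops': the "protected" keyword did nothing to stop client code.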

In a blog post, it's easy to see these changes coming in a few paragraphs and account for them mentally. But imagine this in a large system, one that is constantly having features added, bugs fixed and maintenance performed over the course of months or years. Every change in this example could have come months apart. I don't expect myself or my fellow programmers to keep track of the intricacies of a single class over that timeframe, especially if that class is modified rarely, is one class in hundreds and we've been working in thousands of lines of other code in the meantime.

So do yourself a favor and take the 20 seconds to wrap your properties in boilerplate getset methods from the start. Not because it's a best practice, but because future you will be grateful.

Here is the meme-pic that started the debate:

Similarity-based Recommendation Engines

2012-02-23T23:42:00.001-05:00

I am currently participating in the Neo4j-Heroku Challenge. My entry is a -- as yet, unfinished -- beer rating and recommendation service called FrostyMug. All the major functionality is complete, except for the actual recommendations, which I am currently working on. I wanted to share some of my thoughts and methods for building the recommendation engine.

My starting point for the recommendation engine was a blog post by Marko Rodriguez, A Graph-Based Movie Recommender Engine. Marko's approach is pretty straightforward and he does an excellent job of explaining the subtleties of collaborative and content-based filtering. It was also a great way to get acquainted with the Gremlin graph traversal API, which I have been unwisely avoiding for fear of it being too complicated.

Basic collaborative filtering works like this: if I rate a beer, say, Lonerider Sweet Josie Brown Ale as 9 out of 10, and you rate that same beer as 8 out of 10, then I should be able to get recommendations based on other beers that you (and others who have rated that beer similarly to me) have rated highly.

I think we can improve on this. You and I happen to have rated one beer the same way. What happens if that is the only beer we have rated similarly? What if I have rated Fullsteam Carver Sweet Potato Ale at 8 out of 10, but you have rated it 2 out of 10? You have shown that you have poor taste, so why should I get recommendations based on your other ratings, simply because we happened to rate a single beer similarly?

With a small modification, we can overcome this problem. Instead of basing recommendations off of one similar rating, I can calculate how similarly you and I rated all the things we have rated, and only get recommendations from you if I have determined we are similar enough in our tastes. Call this "similarity-based collaborative filtering."

Finding similar users works like this:

Find all users who have rated the same beers as I have.
For each rated beer, calculate the differences between my rating and that user's rating.
For each user, average the differences.
Return all users where the average difference is less than or equal to some tolerance (2 in the below example.)
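
Before getting to the graph version, the four steps above can be sketched in plain PHP over an in-memory array. The `$ratings` data, the user names and the `similarUsers` function are all invented for illustration; the real service walks a graph rather than nested arrays.

```php
<?php
// Hypothetical data: user => (beer => rating out of 10)
$ratings = array(
    'me'    => array('Sweet Josie' => 9, 'Carver' => 8, 'Some Stout' => 6),
    'alice' => array('Sweet Josie' => 8, 'Carver' => 7, 'Some Pale Ale' => 9),
    'bob'   => array('Sweet Josie' => 8, 'Carver' => 2),
    'carol' => array('Some Pale Ale' => 5),
);

function similarUsers(array $ratings, $target, $threshold) {
    $similar = array();
    foreach ($ratings as $user => $theirRatings) {
        if ($user === $target) {
            continue;
        }
        // Step 1: only beers that both users have rated count
        $common = array_intersect_key($ratings[$target], $theirRatings);
        if (count($common) === 0) {
            continue;
        }
        // Step 2: total up the differences between the two sets of ratings
        $total = 0;
        foreach ($common as $beer => $myRating) {
            $total += abs($myRating - $theirRatings[$beer]);
        }
        // Steps 3 and 4: average the differences and apply the tolerance
        $average = $total / count($common);
        if ($average <= $threshold) {
            $similar[$user] = $average;
        }
    }
    return $similar;
}

$similar = similarUsers($ratings, 'me', 2);
```

With this sample data only alice comes back: bob agrees with me on one beer but is 6 points off on the other, which pushes his average difference to 3.5.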

Let's see what this looks like in Gremlin, a domain specific graph traversal language in Groovy. Gremlin is built on the concept of "pipes" where a single chain of operations runs from beginning to end on each element put into the pipe. The elements in our case are the user nodes, beer nodes and rating edges that connect them in our graph:

// We're going to use this again and again, so make a single "step" for it,
// to which we can pass a "similarity threshold"
Gremlin.defineStep('similarUser', [Vertex,Pipe], {Integer threshold ->

    // Capture the incoming pipe
    target = _();

    // Set up a table to hold ratings for averages
    m=[:].withDefault{[0,0]};

    // Find all beers rated by the target, and store each rating
    target.outE("RATED").sideEffect{w=it.rating}

    // For each rated beer, find all the users who have also rated that beer
    // and who are not the target user, then go back to their ratings
    .inV.inE("RATED").outV.except(target).back(2)

    // At this point in the pipeline, we are iterating over
    // only the ratings of beers that both users have rated.
    // For each rating, calculate the difference between
    // the target user's rating and the compared user's rating
    .sideEffect{diff=Math.abs(it.rating-w)}

    // For each compared user, store the number of beers rated in common
    // and the total difference for all common ratings. 
    .outV.sideEffect{ me=m[it.id]; me[0]++; me[1]+=diff; }.iterate();

    // Find the average difference for each user and filter
    // out any where the average is greater than the threshold.
    m.findAll{it.value[1]/it.value[0] <= threshold}.collect{g.v(it.key)}._()
});

// Return all the users who are similar to our target,
// with an average rating difference of 2 or less
g.v(123).similarUser(2)

The result of this script is a list of all user nodes that are considered "similar" to the target user. Now we can get more accurate recommendations, based on users whose tastes are similar to ours. Given users who are similar to me, recommend beers that they rated highly:

Find all beers rated by similar users.
For each beer, average the similar users' ratings.
Sort the beers by highest to lowest average and return the top 25.

r=[:].withDefault{[0,0]};

// Find similar users 
g.v(123).similarUser(2)

// Find all outgoing beer ratings by similar users
// and the beer they are rating
.outE("RATED").sideEffect{rating=it.rating}.inV

// Store the ratings in a map for averaging
.sideEffect{me=r[it.id]; me[0]++; me[1]+=rating; }.iterate();

// Transform the map into a new map of beer id : average
r.collectEntries{key, value -> [key , value[1]/value[0]]}

// sort by highest average first, then take the top 25
.sort{a,b -> b.value <=> a.value}[0..24]

// Transform the map back into a mapping of beer node and rating
.collect{key,value -> [g.v(key), value]}

Now I only get recommendations from users whose tastes are similar to mine.
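
The same three steps, sketched in plain PHP for readers who don't speak Gremlin. The `$similarRatings` array and the `recommend` function are hypothetical; the input is assumed to contain only users who have already passed the similarity check.

```php
<?php
// Hypothetical ratings by users already judged "similar": user => (beer => rating)
$similarRatings = array(
    'alice' => array('Pale Ale' => 9, 'Stout' => 6),
    'dave'  => array('Pale Ale' => 7, 'Porter' => 10, 'Stout' => 4),
);

function recommend(array $similarRatings, $limit) {
    // beer => array(count, total), like the r map in the Gremlin script
    $totals = array();
    foreach ($similarRatings as $beers) {
        foreach ($beers as $beer => $rating) {
            if (!isset($totals[$beer])) {
                $totals[$beer] = array(0, 0);
            }
            $totals[$beer][0]++;
            $totals[$beer][1] += $rating;
        }
    }

    // Turn the running totals into beer => average
    $averages = array();
    foreach ($totals as $beer => $pair) {
        $averages[$beer] = $pair[1] / $pair[0];
    }

    // Highest average first, then keep the top $limit (preserving beer names as keys)
    arsort($averages);
    return array_slice($averages, 0, $limit, true);
}

$top = recommend($similarRatings, 25);
```

Like the Gremlin script, this sketch makes no attempt to leave out beers the target user has already rated.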

I can also calculate an estimated rating for any beer that I have not rated. To see how that might be useful, consider Netflix. When you look at any movie on Netflix that you have not already rated, they show you their best guess for how you would like that movie. This can be helpful in determining which of a set of movies you want to watch, or in our case, which beer you might want to try given limited pocket money. Again, we can use our similar users to figure out the estimated rating of a given beer:

Find all ratings of the target beer by similar users.
Return the average of their ratings of that beer.

// Grab our target beer
beer=g.v(987)

// Find similar users 
g.v(123).similarUser(2)

// For each similar user, find all that have rated the target beer
.outE("RATED").inV.filter{it==beer}

// Go back to their ratings, and average them
.back(2).rating.mean()
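
As a plain-PHP sketch, the estimate is just an average over whichever similar users have rated the beer. The data and the `estimateRating` function are made up for illustration, and returning null when nobody has rated the beer is my own assumption about how to handle that case.

```php
<?php
// Hypothetical ratings by similar users: user => (beer => rating)
$similarRatings = array(
    'alice' => array('Pale Ale' => 9, 'Stout' => 6),
    'dave'  => array('Pale Ale' => 7, 'Porter' => 10),
    'erin'  => array('Porter' => 3),
);

function estimateRating(array $similarRatings, $beer) {
    // Collect every similar user's rating of the target beer
    $found = array();
    foreach ($similarRatings as $beers) {
        if (isset($beers[$beer])) {
            $found[] = $beers[$beer];
        }
    }
    // No similar user has rated this beer, so there is nothing to estimate from
    if (count($found) === 0) {
        return null;
    }
    return array_sum($found) / count($found);
}

$estimate = estimateRating($similarRatings, 'Pale Ale');
$unknown = estimateRating($similarRatings, 'Lager');
```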

So now we have a simple recommendation engine based on ratings by users with similar tastes. There are a few improvements that could still be made:

Don't consider a user "similar" unless they have rated at least 10 beers in common (or some other threshold). This prevents users who have only rated a few beers in common from being falsely labeled as similar.
Calculate the similar users ahead of time, and store them with a "SIMILAR" relationship to the target user. Periodically recalculate the similar users to keep the list fresh.
Use random sampling or size limiting on the number of users and ratings checked when determining similarity, trading accuracy for performance.

Many thanks to Marko for his help on the mailing list, and with giving me some pointers in writing this, and for being a pretty awesome guy in general.

Update: Luanne Misquitta posted about how to do similarity-based recommendations with Cypher. Thanks, Luanne!

Command Invoker Pattern with the Open/Closed Principle

2012-01-14T00:21:00.004-05:00

Over on DZone, Giorgio Sironi demonstrates the "Open/Closed Principle on real world code". The pattern demonstrated is similar to the Command Pattern, and the post does a good job of introducing a class that is open for extension but closed for modification. Giorgio mentions that there is still more work to be done. I thought I might take a stab at creating a flexible and extendable command invocation solution.

One of the issues that should be addressed is that the invoker (the Session class in the post) is only extensible by actually extending the class. If we want to add custom methods to it, we have two options: make our own sub-class, which should be avoided when favoring composition over inheritance; or modify the base class's constructor, which violates the open/closed principle.

Another thing that could be improved is that the invoker violates the Single-Responsibility Principle (the "S" in SOLID.) It needs to know not just how to execute commands, but also how to construct them. As the list of available commands grows, and the constructors for each command become more complex, the invoker needs to contain more logic to build each command type.

Let's overcome some of these issues, and also make the code even more extensible. I'll use a simplified command invoker to demonstrate.

To start, let's remove the complexity of requiring the invoker to know how to build each command object. A simple way to do this is to move the factory methods into the individual command classes. We'll also define a simple interface to ensure that our commands are executable. Here are two simple commands. Note that the constructor is not part of the interface, which allows the command classes to have completely different constructor method signatures.

interface Command {
 public static function register();

 public function execute();
}

class HelloCommand implements Command {
 protected $name;

 public static function register() {
  return function ($name) {
   return new HelloCommand($name);
  };
 }

 public function __construct($name) {
  $this->name = $name;
 }

 public function execute() {
  echo "Hello, " . $this->name . "\n";
 }
}

class PwdCommand implements Command {

 public static function register() {
  return function () {
   return new PwdCommand();
  };
 }

 public function execute() {
  echo "You are here: " . getcwd() . "\n";
 }
}

Each command class has a static `register` method that returns an anonymous function which knows how to instantiate the class. We'll call this the "factory" function. In the example, the factory function signatures are the same as the constructors, but that is not necessary; as long as the factory function can construct a valid object, it can take whatever parameters it needs.

Our commands can construct themselves, so the invoker only has to know how to map its own method calls to the commands. Here is the simplified Invoker class:

class Invoker {
 protected $commandMap = array();

 public function __construct() {
  $this->register('hello', 'HelloCommand');
  $this->register('pwd', 'PwdCommand');
 }

 public function __call($command, $args) {
  if (!array_key_exists($command, $this->commandMap)) {
   throw new BadMethodCallException("$command is not a registered command");
  }
  $command = call_user_func_array($this->commandMap[$command], $args);
  $command->execute();
 }

 public function register($command, $commandClass) {
  if (!array_key_exists('Command', class_implements($commandClass))) {
   throw new LogicException("$commandClass does not implement the Command interface");
  }
  $this->commandMap[$command] = $commandClass::register();
 }
}

The important bit is the `register` method. It checks that the registered class is indeed a Command, and thus, that it will have both an `execute` and a `register` method. We then invoke the `register` method, and store the factory function for that command.

In the magic `__call` method, we have to use `call_user_func_array` since we don't know the real signature of the called factory function. If we wanted, we could have the `__call` method return the command object instead of executing it. That would turn the Invoker into an Abstract Factory for command objects.
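
Here is a rough sketch of that Abstract Factory variant. The class name `CommandFactory` is made up, and the interface and `HelloCommand` are repeated only so the snippet runs on its own; the single real difference from `Invoker` is that `__call` returns the command instead of executing it.

```php
<?php
interface Command {
    public static function register();
    public function execute();
}

class HelloCommand implements Command {
    protected $name;

    public static function register() {
        return function ($name) {
            return new HelloCommand($name);
        };
    }

    public function __construct($name) {
        $this->name = $name;
    }

    public function execute() {
        echo "Hello, " . $this->name . "\n";
    }
}

// Hypothetical variant of Invoker: same registration logic, but __call
// hands the built command back to the caller instead of running it
class CommandFactory {
    protected $commandMap = array();

    public function register($command, $commandClass) {
        if (!array_key_exists('Command', class_implements($commandClass))) {
            throw new LogicException("$commandClass does not implement the Command interface");
        }
        $this->commandMap[$command] = $commandClass::register();
    }

    public function __call($command, $args) {
        if (!array_key_exists($command, $this->commandMap)) {
            throw new BadMethodCallException("$command is not a registered command");
        }
        return call_user_func_array($this->commandMap[$command], $args);
    }
}

$factory = new CommandFactory();
$factory->register('hello', 'HelloCommand');

$command = $factory->hello('Josh'); // nothing is printed yet
$command->execute();                // the caller decides when to run it
```

The caller now gets to decide when (or whether) to run the command, which would be handy for queueing or logging.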

The main point is that Invoker never cares exactly how commands are constructed. Invoker now only has one responsibility: invoking commands.

We execute commands by calling Invoker methods with parameters that match the factory function signature of the command being executed:

$invoker = new Invoker();
$invoker->hello('Josh');
$invoker->pwd();

By making the Invoker's `register` method public, we've also opened up the Invoker to allow a client to extend its functionality without having to modify the Invoker itself:

class CustomCommand implements Command {
 protected $foo;
 protected $bar;

 public static function register() {
  return function ($foo, $bar) {
   return new CustomCommand($foo, $bar);
  };
 }

 public function __construct($foo, $bar) {
  $this->foo = $foo;
  $this->bar = $bar;
 }

 public function execute() {
  echo "What does it all mean? " . ($this->foo + $this->bar) . "\n";
 }
}

$invoker->register('custom', 'CustomCommand');
$invoker->custom(40, 2);

We can even let clients overwrite existing commands by re-registering with a custom class:

$invoker->register('pwd', 'CustomCommand');
$invoker->pwd(2, 40);

If we don't want to allow this, we can build more checks into `register` that prevent overwriting built-in commands.
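
One possible shape for that check, as a sketch: the class name `StrictInvoker` and the exception message are invented, and this version refuses to replace any existing command rather than only the built-in ones. The interface and `PwdCommand` are repeated so the snippet runs on its own.

```php
<?php
interface Command {
    public static function register();
    public function execute();
}

class PwdCommand implements Command {
    public static function register() {
        return function () {
            return new PwdCommand();
        };
    }

    public function execute() {
        echo "You are here: " . getcwd() . "\n";
    }
}

// Hypothetical Invoker variant that refuses to replace an existing command
class StrictInvoker {
    protected $commandMap = array();

    public function register($command, $commandClass) {
        // The extra check: bail out if the name is already taken
        if (array_key_exists($command, $this->commandMap)) {
            throw new LogicException("$command is already registered");
        }
        if (!array_key_exists('Command', class_implements($commandClass))) {
            throw new LogicException("$commandClass does not implement the Command interface");
        }
        $this->commandMap[$command] = $commandClass::register();
    }

    public function __call($command, $args) {
        if (!array_key_exists($command, $this->commandMap)) {
            throw new BadMethodCallException("$command is not a registered command");
        }
        $command = call_user_func_array($this->commandMap[$command], $args);
        $command->execute();
    }
}

$invoker = new StrictInvoker();
$invoker->register('pwd', 'PwdCommand');

// A second registration under the same name is rejected
try {
    $invoker->register('pwd', 'PwdCommand');
    $message = 'overwritten';
} catch (LogicException $e) {
    $message = $e->getMessage();
}
```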

So now we have an extendable, flexible command invocation system that requires no modification to the actual command runner. This could also be used as the basis for a plug-in system.

Full sample code is available at http://gist.github.com/1610148

Funkatron's MicroPHP Manifesto

2012-01-03T16:11:00.000-05:00

Just saw this over on @funkatron's blog:

The MicroPHP Manifesto:

I am a PHP developer

I am not a Zend Framework or Symfony or CakePHP developer
I think PHP is complicated enough

I like building small things

I like building small things with simple purposes
I like to make things that solve problems
I like building small things that work together to solve larger problems

I want less code, not more

I want to write less code, not more
I want to manage less code, not more
I want to support less code, not more
I need to justify every piece of code I add to a project

I like simple, readable code

I want to write code that is easily understood
I want code that is easily verifiable

I agree that there are "and-the-kitchen sink" frameworks out there that do too much. I also agree that having every developer re-invent the wheel is not the ideal situation. The main problem I see with the macro-frameworks is that they tend to not allow me to strip out their functionality and replace it with another library when I need to. I much prefer smaller libraries and components that I can mix and match if necessary. Isn't that the whole point of code re-use in the first place?

So maybe this isn't so much about huge frameworks being bad as it is that they should be built off small, interchangeable components? Wasn't there a call for more inter-operability a little while back?

Funkatron has made it pretty clear that these are his personal preferences, but that hasn't stopped the haters.

PHP Fog Quickstart

2011-12-25T14:58:00.000-05:00

For a while, I've been trying to find a good hosting service to use for testing out PHP apps and ideas I have. My criteria were to find a service that would be like Heroku for PHP apps: simple to get started, pre-configured with access to a database, scalable on demand, and no-hassle deployment (able to integrate with continuous deployment.)

Funnily enough, if you do a Google search for "heroku php" the first thing that comes up is PHP Fog. They provide hosting for up to 3 projects for free on shared hosting. They also offer MongoDB and application monitoring as add-ons.

Being it was Christmas and all, I decided to give myself a present and sign up. I was very surprised by how easy it was to get up and running. I managed to build a simple "echo" service in about 8 minutes, following roughly these steps:

After logging in, click "Launch New App". This brings you to a screen where you can choose from several pre-built applications (popular CMS or blogging platforms) or skeleton projects built on the most popular PHP frameworks.

For my purposes, I chose "Custom App". This will build an application containing exactly one "Hello world" PHP file. Enter a password for the MySQL database that will be set up, and a subdomain. The subdomain must be unique among all PHP Fog apps, and the setup won't continue until it is. After your app is set up, it can be given a custom domain name from $5 per month.

Click "Create App". This will bring you to your project's management page. From here, you can find everything you need to start developing your app. The git repository address for the app is displayed, as are the database credentials, helpfully wrapped in a mysql_connect function call.

Eventually, the "Status" light will turn green, indicating that your app is available. Clicking "View Live Site" brings you to your application, live to the world. Not much there. Time to change that.

Follow the instructions to add an SSH key to PHP Fog, which will give you access to write to your application's git repo. Then use the given git address to clone the repo to your development environment. Also, grab the Silex microframework to start building the app:

> git clone git@git01.phpfog.com:myapp.phpfogapp
> cd myapp.phpfogapp
> wget http://silex.sensiolabs.org/get/silex.phar

Edit index.php to provide a simple endpoint which will greet users:

<?php
require_once __DIR__.'/silex.phar';

$app = new Silex\Application();

$app->get('/', function() use($app) {
    return 'Checkout <a href="/hello/world">Hello...</a>';
});

$app->get('/hello/{name}', function($name) use($app) {
    return 'Hello '.$app->escape($name);
});

$app->run();

Also, create an .htaccess file that will redirect all traffic to index.php:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^ index.php [L]

Add the files to the repository, commit them, then push the master branch to make the changes live:

> git add index.php .htaccess silex.phar
> git commit
> git push origin master

Reload your site to see the changes. Follow the "Hello" link, and play with the URL to see the results.

After the holidays, I hope to actually build something interesting on the platform.

Decode JSON in Bash with PHP

2011-12-17T18:56:00.000-05:00

I recently found myself needing to make cURL calls from the command-line to an endpoint which returned JSON responses. Rather than parsing through the JSON as a string, or downloading some third-party tool to format it for me, I created this handy Bash alias that decodes JSON from the command-line using PHP.

Put the following in your ~/.bashrc file:

alias json-decode="php -r 'print_r(json_decode(file_get_contents(\"php://stdin\")));'"

Usage from any process:

echo '{"foo":"bar","baz":"qux"}' | json-decode

Or from a cURL call:

curl -s http://example.com/json/endpoint | json-decode
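If the one-liner outgrows an alias, the same idea can live in a small PHP function. This is only a minimal sketch, not part of the original alias: the function name decode_json_or_fail is made up for illustration, and it assumes PHP 5.3 or later for json_last_error(). The one thing it adds is a clear failure when the input is not valid JSON, instead of printing nothing.

```php
<?php
// Hypothetical helper: decode a JSON string and return the print_r() dump,
// or throw if the input is not valid JSON.
function decode_json_or_fail($raw)
{
    $decoded = json_decode($raw);
    // json_decode() returns null both for the JSON literal "null" and on
    // failure, so check the error code to tell the two apart
    if ($decoded === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new InvalidArgumentException('Input is not valid JSON');
    }
    // Passing true makes print_r() return the dump instead of printing it
    return print_r($decoded, true);
}

// Usage from a script fed by a pipe:
//   echo decode_json_or_fail(file_get_contents('php://stdin'));
echo decode_json_or_fail('{"foo":"bar","baz":"qux"}');
```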

CodeWorks '11 Raleigh Rundown

2011-12-08T17:59:00.001-05:00

I had a great time at CodeWorks 2011 in Raleigh this week, put on by the great folks at php|architect. Here is a rundown of the presentations, with some of my thoughts.

CI:IRL

Beth Tucker Long started out the day with a talk about continuous integration tools for automating away the pain of building and deploying PHP projects. One of her main themes was the importance of standardization in your code to avoid wasting mental effort on things like variable and class naming, which lines to put curly braces on, or how many spaces are in a single indent (editor: the answer is none; use a tab character.) The same can be said about manual testing and deployment processes.

The tools discussed included PHP_CodeSniffer, the venerable and nigh ubiquitous PHPUnit, and multiple options for automated builds and testing (CruiseControl and Jenkins/Hudson.) Beth also talked about Phing, which is one of my new favorite tools. She also covered tools for automated documentation and report generation. I would have gone with DocBlox instead of phpDocumentor, but that's just a personal preference. The most useful thing about most of these tools is that many of them are aware of each other, directly or through plugins, and can be chained together to build a flexible CI pipeline.

Most of the tools I had already heard of or am actively using. In addition to PHP_CodeSniffer for coding standards compliance, I would also throw in one or both of PHP Mess Detector and PHP Copy-Paste Detector. These three tools together give great insights into the current state of your code base and its future maintainability, which Beth mentioned when discussing "technical debt".

An important thing that Beth stressed at the end of her presentation was that not every project requires the use of every tool. It's important to take into account things like size/complexity of the project, timeline, budgets and team size. I think this was a valuable piece of advice that a lot of people forget. If you're anything like me, when you first learn about a new tool your response is to go out and find as many ways to use it as possible. I'm glad Beth ended with the reminder to keep in mind the constraints of your project when picking which tools to apply.

How Beer Made Me A Better Developer

Next up was Jason Austin, who is one of the coordinators of my local PHP meetup group. Jason's talk was a call for all developers to find their passion. His two passions are beer and PHP, so he decided to combine them into a side project, and Pint Labs was born. All developers should find a side project related to their passion. Side projects are something he looks for as a hiring manager, and they also help prevent burnout in your day job.

Side projects give developers an opportunity to experiment without the consequences of failure that usually accompany employed or contract work. More than that, they allow for greater opportunities to bring new knowledge to a paid job. "The dumbest thing a manager can do is stop their developers from working on their own side projects," was one of my favorite lines from the presentation.

I can't agree more, and I'm lucky to have never worked in an environment where an employer has either forbidden me from having side projects, or claimed that anything I work on in my own time belongs to the company. I learned at lunch that many people aren't so fortunate. I would be very wary of any employment where that was the case.

As for side projects, even if you don't start your own, there are tons of open source projects out there that would love more hands on keyboards helping out. There is no excuse for not getting involved, even to a small degree.

What's New in PHP 5.4

Of all the talks, Cal Evans's about the new features coming in PHP 5.4 was probably the least applicable to my current work, but was one of my favorites. Cal is an engaging speaker, and there is some really cool stuff coming up in PHP.

The bits I was most interested in were improvements in Closures, Traits, and the built-in HTTP server. For those who don't know, in PHP 5.4, closures created in an object method will be able to reference $this from within the closure. That may not seem like a big deal, but if you're working with classes that build themselves or are factories for other objects, having the ability to reference the creating object is incredibly useful. The closure can access protected and private stuff from the instantiated class.

The caveat is that $this will always reference the object in which the closure was created. It would be great to be able to bind $this to whatever object was invoking the closure, like you can with JavaScript's .call() and .apply() functions.
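For what it's worth, PHP 5.4 as released also includes Closure::bind() and Closure::bindTo(), which return a copy of a closure bound to a different object. Here is a minimal sketch of both behaviors, assuming PHP 5.4 or later; the Counter and OtherCounter classes are made up purely for illustration:

```php
<?php
class Counter
{
    private $count = 0;

    // Returns a closure that uses $this, which PHP 5.4 binds automatically
    public function incrementer()
    {
        return function () {
            return ++$this->count;
        };
    }
}

class OtherCounter
{
    private $count = 100;
}

$first = new Counter();
$increment = $first->incrementer();
$increment();
echo $increment(), "\n"; // the private $count of $first is now 2

// bindTo() returns a copy of the closure with a different $this.
// The second argument sets the class scope so the private property is visible.
$other = new OtherCounter();
$rebound = $increment->bindTo($other, 'OtherCounter');
echo $rebound(), "\n"; // operates on $other (101), $first is untouched
```

This is not quite JavaScript's .call(), since the binding happens ahead of time rather than at the call site, but it covers many of the same cases.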

Lots has been written about Traits. Of all the features in PHP 5.4, this is the one I'm most looking forward to. They will solve a lot of problems that currently can only be solved by having deeply nested class hierarchies. Combining traits with the ability to bind closures to an object paves the way for having completely dynamically created classes, which would be useful in many situations. It also gives more evidence for my theory that every programming language evolves until it can replicate JavaScript.

One other thing that Cal mentioned was the new JsonSerializable interface. This is a great addition for anyone who deals with sending JSON responses in PHP. I only have one problem with it: there is no complementary JsonUnserializable interface that would allow an object to populate its properties from a JSON string. I don't watch the PHP internals list, so I don't know why it was decided not to include such obvious functionality. Does anyone have any insights?
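To make the asymmetry concrete, here is a minimal sketch, assuming PHP 5.4 or later. The Point class and its fromJson() method are hypothetical: json_encode() calls jsonSerialize() for you, but the reverse direction has to be written by hand.

```php
<?php
// Hypothetical value object, used only for illustration
class Point implements JsonSerializable
{
    private $x;
    private $y;

    public function __construct($x, $y)
    {
        $this->x = $x;
        $this->y = $y;
    }

    // Called by json_encode() to decide what gets serialized.
    // The attribute below is ignored as a comment before PHP 8 and
    // silences the return type deprecation notice on PHP 8.1+.
    #[\ReturnTypeWillChange]
    public function jsonSerialize()
    {
        return array('x' => $this->x, 'y' => $this->y);
    }

    // Hand-written counterpart; PHP has no interface for this direction
    public static function fromJson($json)
    {
        $data = json_decode($json, true);
        return new self($data['x'], $data['y']);
    }
}

$json = json_encode(new Point(1, 2));
echo $json, "\n"; // {"x":1,"y":2}

$copy = Point::fromJson($json);
echo json_encode($copy), "\n"; // same JSON again
```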

Refactoring and Other Small Animals

After lunch, we got right into the second half of the day with Marco Tabini and a presentation on refactoring. The presentation centered around a piece of "legacy" code. He wrote a test to assert the existing behavior, then proceeded to follow the refactoring steps of abstract, break (apart) and rename to show how the code could be made more maintainable and testable.

Marco's talk was to the point, with a great example, but it tilted more towards the theoretical side of why you should refactor. It also took a very purist look at what refactoring means (absolutely no changes to interface, even changes that would not affect calling code.) While I agree that this is the technical definition of refactoring, it's hard to start poking at a piece of code without wanting to improve its usage instead of just its implementation. I think there is a fuzzy line that developers cross often, at least mentally, between refactoring and improving the interface.

At the after-party, some of my coworkers had a longer discussion with Marco where he expounded on some of the ideas he presented and gave some more practical advice for our own legacy application. I wasn't there, but I do appreciate speakers and experts taking the time to apply their own knowledge and experience to their audience's issues. So, thank you Marco.

jQuery Mobile and PhoneGap

Terry Ryan was the representative from Adobe, one of the conference sponsors. His talk began with a quick demo of building an app with jQuery Mobile. I'll admit, I'm not a big fan of doing UI or front-end development, but even I could see how working with a library like that could make such development fun. It seems some of the kinks still need to be worked out, but it looks like a promising library and I hope to play with it some day soon.

Following that was an overview of PhoneGap, and more impressively, PhoneGap:Build, where Terry tweaked his application, ran it through the cloud building service, and had the new app deployed to his phone in just a few minutes, all live. Impressive stuff!

Then, Terry showed off some of the things Adobe has in the works with CSS shaders. The effects he demoed were neat to look at, but I don't feel like there was anything there that WebGL doesn't already cover.

For a finale, Terry live demoed Adobe's Proto application, which allows tablet users to create UI wireframes and mock-ups by drawing with their fingers. Imagine a virtual whiteboard that takes your scratchings and turns them into divs, tables, text, headers and image frames live while you draw. Now imagine sharing that view with remote users, also live. Very cool stuff.

REST Best Practices

Keith Casey's presentation was an informative look into the philosophy and reasoning behind REST architecture, and was another one of my favorites. He talked about the myth of "strictly RESTful" and how the vast majority of APIs out there don't even come close (hint #1: if you're passing an authentication token or session ID on every request, your API is not RESTful.) I also liked his use of "-ilities": discoverability, readability, flexibility, maintainability, etc.

He used Twilio as his example API, and showed how a good REST API should follow the 6 constraints of REST: client-server, stateless (where most authentication schemes fail at being REST), cacheable, layered, uniform interface, and code-on-demand. The last constraint is the hardest one for me to get my head around, so I appreciated his use of GMail as an example of providing not just data, but the code that manipulates that data. I think that JSONP might also fall into that category, but I didn't get a chance to ask.

Like with Marco's talk, I appreciated the theoretical nature of the topic, but I would have liked some more examples of real-world practicality trumping academic purity. It's always nice when a speaker acknowledges that sometimes you just need to get things done and points out which portions of a technology are more flexible than others and amenable to not following the letter of the law. Besides that, I thought it was a great look into why REST is the way it is.

After-Party

No development conference is complete without an after-party and no after-party is complete without free beer. I'd like to thank the folks at SugarCRM for sponsoring the after-party at Tir na nOg, one of my favorite pubs in downtown Raleigh.

Besides free booze, I love to talk with speakers and other conference attendees in a less formal setting. It's great to get to know other developers at the top of their craft, and the people in the PHP community are some of the friendliest and easiest going. It's one of the reasons I love being a PHP developer so much. Keith, his coworker at Twilio Devin Rader, a few other people and I had a great conversation that shifted topics basically at the speed of thought. John Mertic from SugarCRM was there and answered some of my questions, and Beth explained the php|architect submission process in a way that didn't sound scary at all.

To all the conference speakers, organizers and attendees, I want to say thank you for a fun and educational experience. Hopefully some of us will meet up again at phptek 2012!

Development Setup for Neo4j and PHP: Part 2

2011-11-05T14:28:00.002-04:00

Update 2014-02-15: Using Neo4jPHP from a downloaded PHAR file has been deprecated. The preferred and supported way to install the library is via Composer. The examples below have been updated to reflect this. It has also been updated for the 2.0 release of Neo4j.

This is Part 2 of a series on setting up a development environment for building projects using the graph database Neo4j and PHP. In Part 1 of this series, we set up unit test and development databases. In this part, we'll build a skeleton project that includes unit tests, and a minimalistic user interface.

All the files will live under a directory on our web server. In a real project, you'll probably want only the user interface files under the web server directory and your testing and library files somewhere more protected.

Also, I won't be using any specific PHP framework. The principles in the code below should apply equally to any framework you decide to use.

Create the Project and Testing Harness

Create a project in a directory on your web server. For this project, mine is "/var/www/neo4play" on my local host. We'll also need a Neo4j client library. I recommend Neo4jPHP (disclaimer: I'm the author) and all the code samples in Part 2 use it.

> cd /var/www/neo4play
> mkdir -p tests/unit
> mkdir lib
> echo '{"require":{"everyman/neo4jphp":"dev-master"}}' > composer.json
> composer install
> echo "<?php phpinfo(); ?>" > index.php

Test the setup by browsing to http://localhost/neo4play. You should see the output of `phpinfo`.

Now we'll create a bootstrap file that we can include to do project-wide and environment specific setup. Call this file "bootstrap.php" in the root project directory.

<?php
require_once(__DIR__.'/vendor/autoload.php');

error_reporting(-1);
ini_set('display_errors', 1);

if (!defined('APPLICATION_ENV')) {
    define('APPLICATION_ENV', 'development');
}

$host = 'localhost';
$port = (APPLICATION_ENV == 'development') ? 7474 : 7475;
$client = new Everyman\Neo4j\Client($host, $port);

The main point of this file at the moment is to differentiate between our development and testing environments, and set up our connection to the correct database. We do this by attaching the database client to the correct port based on an application constant.

We'll use the bootstrap file to set up a different, unit testing specific bootstrap file. Create the following file as "tests/bootstrap-test.php":

<?php
define('APPLICATION_ENV', 'testing');
require_once(__DIR__.'/../bootstrap.php');

// Clean the database
$query = new Everyman\Neo4j\Cypher\Query($client, 'MATCH n-[r]-m DELETE n, r, m');
$query->getResultSet();

The purpose of this file is to tell our application bootstrap that we are in the "testing" environment. Then it cleans out the database so that our tests run from a known state.

Tell PHPUnit to use our test bootstrap with the following config file, called "tests/phpunit.xml":

<phpunit colors="true" bootstrap="./bootstrap-test.php">
    <testsuite name="Neo4j Play Test Results">
        <directory>./unit</directory>
    </testsuite>
</phpunit>

And because we're following TDD, we'll create our first test file, "tests/unit/ActorTest.php":

<?php
class ActorTest extends PHPUnit_Framework_TestCase
{
    public function testCreateActorAndRetrieveByName()
    {
        $actor = new Actor();
        $actor->name = 'Test Guy '.rand();
        Actor::save($actor);

        $actorId = $actor->id;
        self::assertNotNull($actorId);

        $retrievedActor = Actor::getActorByName($actor->name);
        self::assertInstanceOf('Actor', $retrievedActor);
        self::assertEquals($actor->id, $retrievedActor->id);
        self::assertEquals($actor->name, $retrievedActor->name);
    }

    public function testActorDoesNotExist()
    {
        $retrievedActor = Actor::getActorByName('Test Guy '.rand());
        self::assertNull($retrievedActor);
    }
}

So we know we want a domain object called "Actor" (apparently we're building some sort of movie application) and that Actors have names and ids. We also know we want to be able to look up an Actor by their name. If we can't find the Actor by name, we should get a `null` value back.

Run the tests:

> cd tests
> phpunit

Excellent, our tests failed! If you've been playing along, they probably failed because the "Actor" class isn't defined. Our next step is to start creating our domain objects.

Defining the Application Domain

So far, we only have one domain object, and a test that asserts its behavior. In order to make the test pass, we'll need to connect to the database, persist entities to it, and then query it for those entities.

For persisting our entities, we'll need a way to get the client connection in our "Actor" class and any other domain object classes we define. To do this, we'll create an application registry/dependency-injection container/pattern-buzzword-of-the-month class. Put the following in the file "lib/Neo4Play.php":

<?php
class Neo4Play
{
    protected static $client = null;

    public static function client()
    {
        return self::$client;
    }

    public static function setClient(Everyman\Neo4j\Client $client)
    {
        self::$client = $client;
    }
}

Now our domain objects will have access to the client connection through `Neo4Play::client()` when we persist them to the database. It's time to define our actor class, in the file "lib/Actor.php":

<?php
use Everyman\Neo4j\Node,
    Everyman\Neo4j\Index;

class Actor
{
    public $id = null;
    public $name = '';

    public static function save(Actor $actor)
    {
    }

    public static function getActorByName($name)
    {
    }
}

Requiring our classes and setting up the client connection is part of the bootstrapping process of the application, so we'll need to add a few things to "bootstrap.php":

<?php
require_once(__DIR__.'/vendor/autoload.php');
require_once(__DIR__.'/lib/Neo4Play.php');
require_once(__DIR__.'/lib/Actor.php');

// ** set up error reporting, environment and connection... **//

Neo4Play::setClient($client);

We have a stub class for the domain object. The tests will still fail when we run them again, but at least all the classes should be found correctly.

Let's start with finding an Actor by name. With our knowledge of graph databases, we know this will involve an index lookup, and that we will get a Node object in return. If the lookup returns no result, we'll get a `null`. If we do get a Node back, we'll want to hold on to it, for updating the Actor later.

Modify the Actor class with the following contents:

class Actor
{
    //

    protected $node = null;

    //

    public static function getActorByName($name)
    {
        $actorIndex = new Index(Neo4Play::client(), Index::TypeNode, 'actors');
        $node = $actorIndex->findOne('name', $name);
        if (!$node) {
            return null;
        }

        $actor = new Actor();
        $actor->id = $node->getId();
        $actor->name = $node->getProperty('name');
        $actor->node = $node;
        return $actor;
    }
}

The main thing we're trying to accomplish here is keeping our domain classes as Plain-Old PHP Objects, that don't require any special class inheritance or interface, and that hide the underlying persistence layer from the outside world.

The tests still fail. We'll finish up our Actor class by saving the Actor to the database.

class Actor
{
    //

    public static function save(Actor $actor)
    {
        if (!$actor->node) {
            $actor->node = new Node(Neo4Play::client());
        }

        $actor->node->setProperty('name', $actor->name);
        $actor->node->save();
        $actor->id = $actor->node->getId();

        $actorIndex = new Index(Neo4Play::client(), Index::TypeNode, 'actors');
        $actorIndex->add($actor->node, 'name', $actor->name);
    }

    //
}

Run the tests again. If you see all green, then everything is working properly. To double check, browse to the testing instance webadmin panel http://localhost:7475/webadmin/#. You should see 2 nodes and 1 property (Why 2 nodes? Because there is a node 0 -- the reference node -- that is not deleted when the database is cleaned out.)

Build Something Useful

It's time to start tacking on some user functionality to our application. Thanks to our work on the unit tests, we can create actors in the database and find them again via an exact name match. Let's expose that functionality.

Change the contents of "index.php" to the following:

<?php
require_once('bootstrap.php');

if (!empty($_POST['actorName'])) {
    $actor = new Actor();
    $actor->name = $_POST['actorName'];
    Actor::save($actor);
} else if (!empty($_GET['actorName'])) {
    $actor = Actor::getActorByName($_GET['actorName']);
}

?>
<form action="" method="POST">
Add Actor Name: <input type="text" name="actorName" />
<input type="submit" value="Add" />
</form>

<form action="" method="GET">
Find Actor Name: <input type="text" name="actorName" />
<input type="submit" value="Search" />
</form>

<?php if (!empty($actor)) : ?>
    Name: <?php echo htmlspecialchars($actor->name); ?><br />
    Id: <?php echo $actor->id; ?><br />
<?php elseif (!empty($_GET['actorName'])) : ?>
    No actor found by the name of "<?php echo htmlspecialchars($_GET['actorName']); ?>"<br />
<?php endif; ?>

Browse to your index file. Mine is at http://localhost/neo4play/index.php.

You should see the page you just created. Enter a name in the "Add Actor Name" box and click the "Add" button. If everything went according to plan, you should see the actor name and the id assigned to the actor by the database.

Try finding that actor using the search box. Note the actor's id.

Browse to http://localhost:7474/webadmin/# and click the "Data browser" tab. Enter the actor id in the text box at the top. The node you created when you added the actor should show up.

The interesting thing is that our actual application doesn't know anything about how the Actors are stored. Nothing in "index.php" references graphs or nodes or indexes. This means that, in theory, we could swap out the persistence layer for a SQL database later, or MongoDB, or anything else, and nothing in our application would have to change. If we started with a SQL database, we could easily transition to a graph database.
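To illustrate the point, here is a rough sketch of a stand-in Actor class that keeps the same public surface (the public $id and $name properties, plus Actor::save() and Actor::getActorByName()) but stores everything in a plain PHP array. It is not part of the sample application, the $store property is made up for illustration, and the data vanishes at the end of the request, so treat it purely as a demonstration that the calling code would not need to change.

```php
<?php
// Hypothetical stand-in for lib/Actor.php: same public surface, but backed
// by a plain array instead of Neo4j. Useful only as an illustration.
class Actor
{
    public $id = null;
    public $name = '';

    protected static $store = array();

    public static function save(Actor $actor)
    {
        if ($actor->id === null) {
            $actor->id = count(self::$store) + 1;
        }
        // Store a copy so later changes to $actor need another save()
        self::$store[$actor->id] = clone $actor;
    }

    public static function getActorByName($name)
    {
        foreach (self::$store as $stored) {
            if ($stored->name === $name) {
                return clone $stored;
            }
        }
        return null;
    }
}

$actor = new Actor();
$actor->name = 'Test Guy';
Actor::save($actor);

$found = Actor::getActorByName('Test Guy');
var_dump($found->id === $actor->id);                // bool(true)
var_dump(Actor::getActorByName('Nobody') === null); // bool(true)
```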

Explore the Wonderful World of Graphs

Your development environment is now set up, and your application is bootstrapped. There's a lot more to add to this application, including creating movies, and linking actors and movies together. Maybe you'll want to add a social aspect, with movie recommendations. Graph databases are powerful tools that enable such functionality to be added easily.

Go ahead and explore the rest of the Neo4jPHP library (wiki and API). Also, be sure to check out the Neo4j documentation, especially the sections about the REST API, Cypher and Gremlin (two powerful graph querying and processing languages.)

All the code for this sample application is available as a gist: http://gist.github.com/1341833.

Happy graphing!

Development Setup for Neo4j and PHP: Part 1

2011-11-05T02:47:00.000-04:00

Update 2014-02-15: Using Neo4jPHP from a downloaded PHAR file has been deprecated. The preferred and supported way to install the library is via Composer. The examples below have been updated to reflect this. It has also been updated for the 2.0 release of Neo4j.

I would really love to see more of my fellow PHP developers talking about, playing with, and building awesome applications on top of graph databases. They really are a powerful storage solution that fits well into a wide variety of domains.

In this two part series, I'll detail how to set up a development environment for building a project with a graph database (specifically Neo4j). Part 1 will show how to set up the development and unit testing databases. In Part 2, we'll create a basic application that talks to the database, including unit tests.

All the steps below were performed on Ubuntu 10.10 Maverick, but should be easy to translate to any other OS.

Grab the Components

There are a few libraries and software tools needed for the project. First off, we'll need to install our programming environment, specifically PHP, PHPUnit and Java. How to do this is dependent on your operating system. My setup is PHP 5.3, PHPUnit 3.6 and Sun Java 6.

Next, we need to get Neo4j. Download the latest tarball from http://neo4j.org/download. I usually go with the latest milestone release (2.0 at this time.)

> cd ~/Downloads
> wget http://dist.neo4j.org/neo4j-community-2.0.1-unix.tar.gz

Set Up the Development Instance

It is possible to put test and development data in the same database instance, but I prefer a two database solution, even when using a SQL database. I do this because 1) I don't have to worry about accidentally blowing away any development data I care about, and 2) I don't have to worry about queries and traversals in my unit tests accidentally pulling in development data, and vice versa.

Unlike SQL or many NOSQL database servers, Neo4j cannot have more than one database in a single instance. The only way to get two databases is to run two separate instances on two separate ports or hosts. The instructions below describe how to set up our development and unit testing databases on the same host but listening on different ports.

First, create a directory to hold both Neo4j instances, and unpack the development instance:

> mkdir -p ~/neo4j
> cd ~/neo4j
> tar -xvzf ~/Downloads/neo4j-community-2.0.1-unix.tar.gz
> mv neo4j-community-2.0.1 dev

Next, configure the server by editing the file "~/neo4j/dev/conf/neo4j-server.properties". Make sure that the server is on port 7474:

org.neo4j.server.webserver.port=7474

If you will be accessing the database instance from a host other than localhost, uncomment the following line:

org.neo4j.server.webserver.address=0.0.0.0

Save the config and exit. Now it's time to start the instance.

> ~/neo4j/dev/bin/neo4j start

You should be able to browse to the Neo4j Browser panel: http://localhost:7474

Set Up the Testing Instance

Unpack the test instance:

> cd ~/neo4j
> tar -xvzf ~/Downloads/neo4j-community-2.0.1-unix.tar.gz
> mv neo4j-community-2.0.1 test

Next, configure the server by editing the file "~/neo4j/test/conf/neo4j-server.properties". Make sure that the server is on port 7475 (note that this is a different port from the development instance!)

org.neo4j.server.webserver.port=7475

If you will be accessing the database instance from a host other than localhost, uncomment the following line:

org.neo4j.server.webserver.address=0.0.0.0

We need one more change to differentiate the testing from the development instance. Edit the file "~/neo4j/test/conf/neo4j-wrapper.properties" and change the instance name line:

wrapper.name=neo4j-test

Save and exit, then start the instance.

> ~/neo4j/test/bin/neo4j start

You should be able to browse to the Neo4j Browser panel: http://localhost:7475

Congratulations! Everything is set up and ready for us to build our application. In Part 2 of this series, I'll talk about setting up unit tests and building a basic application using a graph database as a backend.

A Humbling Reference

2011-10-04T23:10:00.003-04:00

Sometimes I have an experience that reminds me of how near-sighted I can get when I'm head-down in a problem.

What do you think the output of the following code will be (no cheating by running it first!):

<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $v["result"] .= "a";
        echo $v["result"] ."\n";
    }
}
foreach ($a as $k => $v) echo $v["result"] ."\n";

If the goal is to add the character "a" twice onto each result element of each sub-array, at first glance this seems like a perfectly reasonable way to go about it.

But hold up! Each time through, we seem to have only added one "a" onto the result. And in the second foreach loop, it seems like we didn't modify the value at all! What gives?

Well, it might be obvious to most of you, but when it was costing me an hour and a half of my work day it was buried in a nest of process forking and stream handling, part of which looked something like this:

while ($currentExecution) {
    foreach ($currentExecution as $i => $handle) {
        $handle['output'] .= stream_get_contents($handle['outstream']);
        $status = proc_get_status($handle['process']);
        if (!$status['running']) {
            fclose($handle['outstream']);
            proc_terminate($handle['process']);
            echo $handle['output']."\n";
            unset($currentExecution[$i]);
        }
    }
}

Same concept as before: I have an array of process handles (opened with proc_open) and I'm looping through, gathering the output of each process as it is generated, until the process is finished running, at which point I close the process and display the output. The symptom was that, sometimes, I would only see the end of the output and not the beginning.

The times when this happened were the times when a process handle went through the outer while loop more than once, the same as in the contrived example above. This happens because the value you work with inside a foreach loop is actually a copy of the array element (technically a copy-on-write), and a fresh copy is made on every pass over the array. So modifications made to the value do not persist through multiple passes through the while loop.

As it turns out, the solution here is quite simple. Actually, there are several solutions. Here's the easiest one, applied to the original example:

<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => &$v) {
        $v["result"] .= "a";
        echo $v["result"] ."\n";
    }
    unset($v); // break the reference so reusing $v below cannot overwrite the last element
}
foreach ($a as $k => $v) echo $v["result"] ."\n";

The difference is subtle: it's the "&" prepended to the $v in the foreach loop. This tells the loop to use a reference to the value instead of a copy. Here's another possible solution:

<?php
$a = array(array("result" => "a"));
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $a[$j]["result"] .= "a";
        echo $a[$j]["result"] ."\n";
    }
}
foreach ($a as $k => $v) echo $v["result"] ."\n";

In this case, we're not even using the variable $v. We're using the array index to refer to the value inside of the array directly. Because we're modifying the actual array value and not the copy, the changes persist through the loop.

The final solution (and the one I ended up going with) is to make the value an object instead of an array:

<?php
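// Create the object explicitly; assigning a property on an undefined variable fails on modern PHP
$obj = new stdClass();
$obj->result = "a";
$a = array($obj);
for ($i=0; $i < 2; $i++) {
    foreach ($a as $j => $v) {
        $v->result .= "a";
        echo $v->result ."\n";
    }
}
foreach ($a as $k => $v) echo $v->result ."\n";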

Object variables in PHP hold a handle to the object rather than the object itself, so copying the variable (including in a foreach loop) still points at the same object. Modifying the object's property inside the loop therefore modifies the actual object, not a copy.
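
To see the difference outside of a loop, here is a minimal sketch that passes an array and an object to a function. The helper names `appendToArray` and `appendToObject` are made up for illustration; the point is only that the array arrives as a copy while the object does not, and that `clone` is needed if you actually want a separate object.

```php
<?php
// Hypothetical helpers, for illustration only
function appendToArray(array $item)
{
    $item['result'] .= 'a';   // modifies the function's own copy of the array
}

function appendToObject($item)
{
    $item->result .= 'a';     // modifies the one object both variables point to
}

$asArray = array('result' => 'a');
$asObject = new stdClass();
$asObject->result = 'a';

appendToArray($asArray);
appendToObject($asObject);

echo $asArray['result'] . "\n";   // a
echo $asObject->result . "\n";    // aa

// An explicit clone is an independent object
$copy = clone $asObject;
$copy->result .= 'a';
echo $asObject->result . "\n";    // aa
echo $copy->result . "\n";        // aaa
```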

The most annoying thing about this whole experience? I had solved this problem for someone else a few weeks ago, telling them what the issue was just from hearing them describe the situation and without even seeing their code run. So the deeper lesson here is this: when you've been staring at code for so long that it's making you cross-eyed and frustrated, walk away. And most importantly, get someone else who hasn't been working on it to look it over.

...something something forest...something something trees...

Sample code available at http://gist.github.com/1263522

Phar Flung Phing

2011-09-23T20:33:00.001-04:00

One of the cooler features of PHP 5.3 is the ability to package up a set of PHP class files and scripts into a single archive, known as a PHAR ("PHp ARchive"). PHAR files are pretty much equivalent to Java's JAR files: they allow you to distribute an entire library or application as a single file. I decided to see how easy it would be to wrap up Neo4jPHP in a PHAR for distribution.

Surprisingly, there isn't much information out there on creating PHAR files outside of the PHP manual and a few older posts. They deal mainly with creating a PHAR with a hand-written script specific to the project being packaged, using PHP's PHAR classes and methods.

Since I also started playing with Phing recently, I decided to see if I could incorporate packaging a project as a PHAR into my build system. It turns out, it's pretty easy, given that Phing has a built-in PharPackage task.

All steps below were performed on Ubuntu 10.10 Maverick

Step 1 - Get Phing

The Phing docs are pretty explicit and easy to follow. If you have PEAR installed, simply do:

> sudo pear channel-discover pear.phing.info
> sudo pear install phing/phing

If all goes well, you should see the location of the phing command line utility when you run

> which phing

Step 2 - Allow PHP to create PHARs

This is a step that most of the posts I found on creating PHARs fail to mention. In the default installation of PHP, PHARs are read-only (this prevents a nasty security hole, whereby any PHP script on your machine has the ability to modify other PHARs.) In order to create PHARs, the php.ini setting "phar.readonly" must be turned off. I did this by creating a new file in my PHP conf.d directory:

> sudo sh -c "echo 'phar.readonly=0' > /etc/php5/conf.d/phar.ini"

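A quick way to confirm the setting took effect is to ask PHP itself. The sketch below uses `Phar::canWrite()`, which reports whether the phar extension is allowed to create and modify archives; the wrapper function `canCreatePhars` is just a name chosen for this example.

```php
<?php
// Hypothetical helper: true when the phar extension is loaded and phar.readonly is off
function canCreatePhars()
{
    return extension_loaded('phar') && Phar::canWrite();
}

if (canCreatePhars()) {
    echo "PHAR creation is enabled\n";
} else {
    echo "PHAR creation is disabled; check the phar.readonly setting\n";
}
```

Run it with the same PHP binary and configuration that Phing will use, since the command line and web server configurations can differ.
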
Step 3 - Create a stub

In a PHAR, the stub file is run when the PHAR is `include`d or `require`d in a script (or when the PHAR is invoked from the command line.) The purpose of the stub is to do any setup necessary for the library contained in the PHAR. Since I'm packaging a library, the stub I created doesn't do anything more than set up autoloading of my library's classes.

Save the following file as "stub.php" in the root of your project:

<?php
Phar::mapPhar('myproject.phar');
spl_autoload_register(function ($className) {
	$libPath = 'phar://myproject.phar/lib/';
	$classFile = str_replace('\\',DIRECTORY_SEPARATOR,$className).'.php';
	$classPath = $libPath.$classFile;
	if (file_exists($classPath)) {
		require($classPath);
	}
});
__HALT_COMPILER();

The "__HALT_COMPILER();" line must be the last thing in the file. The call to `Phar::mapPhar()` registers the PHAR with PHP and allows any of the files inside to be found via the "phar://" stream wrapper. Any files or directories contained in the PHAR can be referenced as "phar://myproject.phar/path/inside/phar/MyClass.php". Since all my project files are namespaced, autoloading is a matter of translating the class name into a path inside the PHAR.
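
The class-name-to-path translation can be tried on its own, without building a PHAR. This is a sketch of the same logic as the stub's autoloader, pulled into a hypothetical function `classToPharPath`. One assumption that differs from the stub: it always joins with a forward slash instead of DIRECTORY_SEPARATOR, so the result is the same on every platform.

```php
<?php
// Hypothetical helper mirroring the stub's autoloader logic
function classToPharPath($className, $pharAlias = 'myproject.phar')
{
    // Drop any leading namespace separator, then turn the rest into a path
    $relative = str_replace('\\', '/', ltrim($className, '\\'));
    return 'phar://' . $pharAlias . '/lib/' . $relative . '.php';
}

echo classToPharPath('Class\In\My\Project') . "\n";
// phar://myproject.phar/lib/Class/In/My/Project.php
```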

Step 4 - Create a build file

Phing's documentation is pretty extensive, so I won't go into too much detail about what each piece of the build file means. The important part is the PharPackage task in the "package" target.

Save the following file as "build.xml" in the root of your project:

<?xml version="1.0" encoding="UTF-8"?>
<project name="MyProject" default="package">

  <!-- Target: build -->
  <target name="build">
    <delete dir="./build" />
    <mkdir dir="./build" />
    <copy todir="./build">
      <fileset dir=".">
        <include name="lib/" />
      </fileset>
    </copy>
  </target>

  <!-- Target: package -->
  <target name="package" depends="build">
    <delete file="./myproject.phar" />
    <pharpackage
      destfile="./myproject.phar"
      basedir="./build"
      compression="gzip"
      stub="./stub.php"
      signature="sha1">
      <fileset dir="./build">
        <include name="**/**" />
      </fileset>
      <metadata>
        <element name="version" value="1.2.3" />
        <element name="authors">
          <element name="Joe Developer">
            <element name="email" value="joe.developer@example.com" />
          </element>
        </element>
      </metadata>
    </pharpackage>
  </target>
</project>

The important part is the "package" target, the default target of this build file. It depends on the "build" target, whose only purpose is to copy out all the files that will be packaged in the PHAR into a separate "build" directory. This strips out any testing, example or other non-library files. Any previously created package is deleted.

The main work of packaging is done via the "pharpackage" task. In this case, I'm specifying that the output package should be called "myproject.phar" and be put in the same directory as the build file, the PHAR should be compressed with "gzip", the stub to load when requiring the file is at "./stub.php" and the PHAR should be signed with a SHA1 hash.
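
For comparison, here is roughly what a hand-written script doing the same job with PHP's Phar class might look like. This is a sketch under assumptions, not Phing's actual implementation: the function name `buildPhar` is invented, the paths match the build file above, gzip compression needs the zlib extension, and the metadata is reduced to the version number.

```php
<?php
// Hypothetical helper: package $sourceDir into $pharFile, using $stubFile as the stub
function buildPhar($sourceDir, $pharFile, $stubFile)
{
    if (!Phar::canWrite()) {
        throw new RuntimeException('phar.readonly must be turned off to create PHARs');
    }
    if (file_exists($pharFile)) {
        unlink($pharFile);                          // like <delete file="./myproject.phar" />
    }

    $phar = new Phar($pharFile);
    $phar->buildFromDirectory($sourceDir);          // like the nested <fileset>
    $phar->setStub(file_get_contents($stubFile));   // like stub="./stub.php"
    $phar->compressFiles(Phar::GZ);                 // like compression="gzip"
    $phar->setSignatureAlgorithm(Phar::SHA1);       // like signature="sha1"
    $phar->setMetadata(array('version' => '1.2.3'));

    return $pharFile;
}

// Only attempt the build when the expected project layout is present
if (Phar::canWrite() && is_dir('build') && is_file('stub.php')) {
    echo "Created " . buildPhar('build', 'myproject.phar', 'stub.php') . "\n";
}
```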

Metadata appears to be optional, but Phing throws up a nasty looking warning if it's not there (it tries to convert an empty value to an array.) Nested metadata is turned into a PHP array and serialized.
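
Once the PHAR exists, the stored metadata can be read back with `Phar::getMetadata()`. A minimal sketch, assuming the archive is named "myproject.phar" as in the build file; the function name `readPharMetadata` is made up, and the exact shape of the array Phing produces from nested elements is something to check by running it rather than something this example guarantees.

```php
<?php
// Hypothetical helper: returns the metadata stored in a PHAR, or null if there is none
function readPharMetadata($pharPath)
{
    if (!file_exists($pharPath)) {
        return null;
    }
    $phar = new Phar($pharPath);
    return $phar->hasMetadata() ? $phar->getMetadata() : null;
}

$metadata = readPharMetadata('myproject.phar');
if ($metadata === null) {
    echo "No metadata found\n";
} else {
    print_r($metadata);
}
```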

Step 5 - Phling your PHAR

Run the following command in the directory with your build file:

> phing
# or, if the "package" target is not the default
> phing package

If all went well, there should be a file called "myproject.phar" in the root of your project. You can test it out by running this test script:

<?php
require("myproject.phar");

$thing = new Class\In\My\Project();

If PHP doesn't complain about a missing class, the PHAR is set up correctly and autoloading is working.

The Neo4jPHP PHAR is available at http://github.com/downloads/jadell/Neo4jPHP/neo4jphp.phar.

Getting Neo4jPHP working with CodeIgniter 2

2011-08-25T10:26:00.003-04:00

This post is based heavily off of Integrating Doctrine 2 with CodeIgniter 2. I haven't tested it myself, so please use it as a starting point and let me know if I should update anything.

Here are the steps to get Neo4jPHP set up as a CodeIgniter 2 library:

Copy the 'Everyman' folder to application/libraries
Create a file called Everyman.php in application/libraries

Copy the following code into Everyman.php:

<?php
class Everyman
{
    public function __construct()
    {
        spl_autoload_register(array($this,'autoload'));
    }

    public function autoload($sClass)
    {
        $sLibPath = __DIR__.DIRECTORY_SEPARATOR;
        $sClassFile = str_replace('\\',DIRECTORY_SEPARATOR,$sClass).'.php';
        $sClassPath = $sLibPath.$sClassFile;
        if (file_exists($sClassPath)) {
            require($sClassPath);
        }
    }
}

Add the following line to application/config/autoload.php:
```
$autoload['libraries'] = array('everyman');
```

This is actually a generic autoloader that will attempt to autoload any namespaced class under application/libraries where the class name matches the file path.

Performance of Graph vs. Relational Databases

2011-07-23T16:06:00.001-04:00

A few weeks ago, Emil Eifrem, CEO of Neo Technology, gave a webinar introduction to graph databases. I watched it as a lead up to my own presentation on graph databases and Neo4j.

Right around the 54 minute mark, Emil talks about a very interesting experiment showing the performance difference between a relational database and a graph database for a certain type of problem called an "arbitrary path query". Specifically: given 1,000 users with an average of 50 "friend" relationships each, determine if one person is connected to another in 4 or fewer hops.

Against a popular open-source relational database, the query took around 2,000 ms. For a graph database, the same determination took 2 ms. So the graph database was 1,000 times faster for this particular use case.

Not satisfied with that, they then decided to run the same experiment with 1,000,000 users. The graph database took 2 ms. They stopped the relational database after several days of waiting for results.

I showed this clip at my presentation to a few people who stuck around afterwards, and we tried to figure out why the graph database had the same performance with 1,000 times the data, while the relational database became unusably slow. The answer has to do with the way in which each type of database searches information.

Relational databases search all of the data looking for anything that meets the search criteria. The larger the set of data, the longer it takes to find matches, because the database has to examine everything in the collection.

Here's an example: let's assume there is a table with "friend" relationships:

> SELECT * FROM friends;
+-------------+--------------+
| user_id     | friend_id    |
+-------------+--------------+
| 1           | 2            |
| 1           | 3            |
| 1           | 4            |
| 2           | 5            |
| 2           | 6            |
| 2           | 7            |
| 3           | 8            |
| 3           | 9            |
| 3           | 10           |
+-------------+--------------+

In order to see if user "9" is connected to user "2", the database has to find all the friends of user "9", and see if user "2" is in that list. If not, find all of their friends, and then see if user "2" is in that list. The database has to scan the entire table each time. This means that if you double the number of rows in the table, you've doubled the amount of data to search, and thus doubled the amount of time it takes to find what you are looking for. (Even with indexing, it still has to find the values in the index tree, which involves traversing that tree. The index tree grows larger with each new record, meaning the time it takes to traverse grows larger as well. And for each search, you always start at the root of the tree.)

Each new batch of friends to look at requires an entirely new scan of the table/index. So more records leads to more search time.
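
To make the cost concrete, here is a small PHP sketch of the scan-per-hop search described above, run against the sample rows. It is not how any particular database engine works internally. The function name `connectedByScanning` is invented for illustration, and it assumes a friendship counts in both directions.

```php
<?php
// Rows copied from the sample table: array(user_id, friend_id)
$friends = array(
    array(1, 2), array(1, 3), array(1, 4),
    array(2, 5), array(2, 6), array(2, 7),
    array(3, 8), array(3, 9), array(3, 10),
);

// Hypothetical helper: every hop scans every row.
// Returns array(connected, rowsExamined).
function connectedByScanning(array $rows, $from, $to, $maxHops)
{
    $frontier = array($from => true);   // users found on the previous hop
    $seen = array($from => true);
    $examined = 0;
    for ($hop = 1; $hop <= $maxHops; $hop++) {
        $next = array();
        foreach ($rows as $row) {
            $examined++;
            list($a, $b) = $row;
            if (isset($frontier[$a]) && !isset($seen[$b])) { $next[$b] = true; }
            if (isset($frontier[$b]) && !isset($seen[$a])) { $next[$a] = true; }
        }
        if (isset($next[$to])) { return array(true, $examined); }
        if (!$next) { return array(false, $examined); }
        $seen += $next;
        $frontier = $next;
    }
    return array(false, $examined);
}

list($connected, $examined) = connectedByScanning($friends, 9, 2, 4);
echo ($connected ? 'connected' : 'not connected') . " after examining $examined rows\n";
// connected after examining 27 rows
```

Users 9 and 2 are three hops apart, so the 9-row table is scanned three times. Appending nine rows about completely unrelated users raises the count to 54, even though none of them can affect the answer.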

Conversely, a graph database looks only at records that are directly connected to other records. If it is given a limit on how many "hops" it is allowed to make, it can ignore everything more than that number of hops away.

Only the records within the hop limit (the blue records in the diagram) are ever seen during the search. And since a graph traversal "remembers" where it is at any time, it never has to start from the beginning, only from its last known position.

You could add another ring of records around the outside, or add a thousand more rings of records outside, but the search would never need to look at them because they are more steps away than the limit. The only way to increase the number of records searched (and thereby decrease performance) is to add more records within the 2-step limit, or add more relationships between the existing records.
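
The same idea can be sketched in PHP with an adjacency list and a breadth-first traversal that stops at the hop limit. This is an illustration of the general technique, not of Neo4j's internals. The function names `buildAdjacency` and `connectedByTraversal` are invented, the rows are the sample "friend" rows from earlier, and friendships are assumed to count in both directions.

```php
<?php
$friends = array(
    array(1, 2), array(1, 3), array(1, 4),
    array(2, 5), array(2, 6), array(2, 7),
    array(3, 8), array(3, 9), array(3, 10),
);

// Hypothetical helper: each user gets a direct list of their neighbors
function buildAdjacency(array $rows)
{
    $adjacency = array();
    foreach ($rows as $row) {
        list($a, $b) = $row;
        $adjacency[$a][] = $b;
        $adjacency[$b][] = $a;
    }
    return $adjacency;
}

// Hypothetical helper: breadth-first traversal limited to $maxHops.
// Returns array(connected, usersSeen).
function connectedByTraversal(array $adjacency, $from, $to, $maxHops)
{
    $visited = array($from => true);
    $frontier = array($from);
    for ($hop = 1; $hop <= $maxHops && $frontier; $hop++) {
        $next = array();
        foreach ($frontier as $user) {
            $neighbors = isset($adjacency[$user]) ? $adjacency[$user] : array();
            foreach ($neighbors as $neighbor) {
                if (isset($visited[$neighbor])) { continue; }
                if ($neighbor === $to) { return array(true, count($visited) + 1); }
                $visited[$neighbor] = true;
                $next[] = $neighbor;
            }
        }
        $frontier = $next;   // continue from where the traversal left off
    }
    return array(false, count($visited));
}

list($connected, $seen) = connectedByTraversal(buildAdjacency($friends), 9, 2, 2);
echo ($connected ? 'connected' : 'not connected') . " after seeing $seen users\n";
// not connected after seeing 5 users
```

With a 2-hop limit starting from user 9, only five users are ever looked at. Hanging more users off of user 5, who is four hops away, leaves that number unchanged.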

So the reason why having 1,000 vs. 1,000,000 records causes such a stark difference between a relational and a graph database is that relational database performance decreases in relation to the number of records in the table, while graph database performance decreases in relation to the number of connections between the records.

TriPUG Meetup slides

2011-07-19T22:58:00.001-04:00

Here are the slides for my presentation on graph databases at the TriPUG meetup tonight: http://www.slideshare.net/JoshAdell/graphing-databases

Overall, I think it was pretty well received. I didn't mean for it to last an hour and a half, but there it is. There were a lot of new folks there; I'm encouraged at the enthusiasm of the developer community in the Triangle region. Thanks for having me!