Everyman Software

2013-05-01

Serializing Data Like a PHP Session

PHP has several built-in ways of serializing/unserializing data. The most cross-platform is json_encode; pretty much every programming stack can JSON decode data that has been encoded by any other stack. There's also PHP's native serialize function, which is not as cross-platform, but has the added benefit of being able to store and restore PHP objects to their original class.
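To make the difference concrete, here is a minimal sketch. The Point class is a made-up example, used only to show what each format preserves.

```php
<?php
// Hypothetical example class, used only to show what each format preserves.
class Point
{
    public $x = 1;
    public $y = 2;
}

$point = new Point();

// JSON keeps the public data but not the class name.
$json = json_encode($point);        // {"x":1,"y":2}
$fromJson = json_decode($json);     // comes back as a generic stdClass

// serialize() records the class name, so unserialize() can restore it.
$native = serialize($point);        // O:5:"Point":2:{s:1:"x";i:1;s:1:"y";i:2;}
$fromNative = unserialize($native); // comes back as a Point
```

The JSON round trip loses the class, which is usually fine when the consumer is another stack; the native round trip keeps it, as long as the Point class is loaded when unserialize is called.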

There's a third, and much lesser known PHP serialization format: the format that PHP uses to store session data. If you have ever popped open a PHP session file, or stored session data in a database, you may have noticed that this serialization looks very similar to the serialize function's output, but it is not the same.

Recently, I needed to serialize data so that it looked like PHP session data (don't ask why; I highly suggest not doing this if it can be avoided). It turns out that PHP has a function that encodes data in this format: session_encode. Great! I'll just pass my array of data to it and...

Oh wait. session_encode doesn't accept any arguments. You can't pass data to it. It just takes whatever is in the $_SESSION superglobal and serializes it. There is no built-in function in all of PHP that will serialize arbitrary data for you the same way that it would be serialized into a session.

There are a few userland implementations of PHP's built-in session serialization, mainly built around string splitting and regexes. All of them handle scalar values; some handle single-level arrays. None of them handle nested arrays and objects, and some have trouble if your data contains certain characters that are used in the encoding.

So I came up with my own functions for reading/writing arbitrary session data without overwriting the existing session. (Edit: I later noticed that a few people suggest a similar method in the comments on the PHP manual pages for session_encode and session_decode.)

Using these functions requires an active session (session_start must have already been called). Edit: Thanks to Rasmus Schultz for also pointing out that session_encode might be disabled on some systems due to security concerns.

PHP already has a built-in way to serialize and unserialize session data. The problem is that it only serializes from and unserializes into the PHP $_SESSION global. We probably don't want to overwrite the current $_SESSION. We hold a copy of whatever data is already in $_SESSION, then use it to perform our data serialization, then restore it afterwards. And because we're using PHP's built-in session serialization, we get nested array and object serialization for free, and we didn't have to write our own parser.
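A minimal sketch of that swap-and-restore approach follows. The function names encode_like_session and decode_like_session are placeholders of my choosing, and the sample data is invented; only session_start, session_encode and session_decode are PHP built-ins.

```php
<?php
// Hypothetical function names. Both assume session_start() has already been called.

function encode_like_session(array $data)
{
    $backup = $_SESSION;         // hold a copy of the live session data
    $_SESSION = $data;           // temporarily swap in the data to encode
    $encoded = session_encode(); // let PHP serialize it in the session format
    $_SESSION = $backup;         // put the real session data back
    return $encoded;
}

function decode_like_session($encoded)
{
    $backup = $_SESSION;         // hold a copy of the live session data
    $_SESSION = array();         // start from an empty session
    session_decode($encoded);    // fills $_SESSION from the encoded string
    $data = $_SESSION;           // grab the decoded data
    $_SESSION = $backup;         // put the real session data back
    return $data;
}

session_start();
$_SESSION['existing'] = 'untouched';

// Invented sample data with a nested array.
$data = array('user' => array('id' => 42, 'roles' => array('admin', 'editor')));
$encoded = encode_like_session($data);
$decoded = decode_like_session($encoded);
```

After both calls, $decoded should equal $data and $_SESSION should still hold only the 'existing' key. The sketch sticks to string keys at the top level, since those are what a session normally holds.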

2013-04-23

Loggly and Puppet

Update 2013-10-22: This post refers to Loggly generation 1, and may (most likely) not work with Loggly's new second generation product offering.

As a follow-up to my previous post on pulling data from Loggly using jQuery, this post will show how to use Puppet to automatically register and configure instances to send data to Loggly.

At ServiceTrade, we use Amazon Web Services for almost all of our infrastructure. All our production servers are EC2 instances. The configuration of all the instances is kept in Puppet manifests. Instances go down and come up all the time, and Puppet helps us make sure they are all configured exactly alike out of the box.

A server cannot send data to Loggly unless you have previously told Loggly to accept data from it. Unfortunately, with server instances being created and removed automatically, it would be impossible to keep up with hand-registering each instance with Loggly. Fortunately, we can use Loggly's API and some Puppet manifests to register our instances for us when they come up.

We use rsyslog on our instances for collecting system log data (syslog, kernel, mail logs, etc.). rsyslog can tail log files and forward them to other log files, or even to other servers via TCP or UDP. Loggly has great documentation on setting up rsyslog to forward log files to Loggly.

First, we need to have Puppet manage rsyslog. This ensures that rsyslog will be installed, and gives us control over rsyslog's master configuration file and a directory of instance-specific configuration files. Below is the rsyslog module file. All files are relative to the Puppet root directory.

modules/rsyslog/manifests/init.pp

As it says, the main configuration file will be in modules/rsyslog/files/rsyslog.conf. The config file is the standard one installed by our package manager with a few minor alterations, seen here:

modules/rsyslog/files/rsyslog.conf

That last line is important, because all our Loggly-specific configurations will go in /etc/rsyslog.d.

Now that rsyslog is set up, we need to tell each instance where to send its log files. Additionally, we need to register each instance with each log file it will be sending to Loggly. Each log file is sent to a different endpoint, which Loggly refers to as an input. Each input has an ID, and a specific port on the Loggly logging server that maps to that ID. We have already set up a specific Loggly user for API purposes, and we'll use that user to do the registration.

First, we set up a new module that will hold our Loggly API configuration.

modules/loggly/manifests/init.pp

The hash of inputs allows us to easily reference each input we've mapped in Loggly, without having to remember specific ID and port numbers elsewhere in our manifests.

We'll also set up a Puppet template for tailing our log files and forwarding them to Loggly:

modules/loggly/templates/rsyslog.loggly.conf.erb

The template will take the values defined in loggly::$inputs and create one config file per input, which will ultimately end up in /etc/rsyslog.d.

One last Loggly manifest file is needed. This one generates the config file for an input, then registers the server against the input using Loggly's API.

modules/loggly/manifests/device.pp

This manifest uses the $name passed to the definition to gather data from the $inputs hash to build a config file. It also execs a cURL call to Loggly's API to register the device for the input. The response to this call is stored for two reasons: first, if anything goes wrong, we have a record of Loggly's response to our request; and second, if the response file already exists, Puppet will know it does not need to make another call to the API.

All that remains is to use our loggly::device definition in a node definition:

manifests/site.pp

Since our input IDs and ports are bound to specific input names in our $inputs hash, we only need to know the names of the inputs we want to configure this instance to send to, and loggly::device does the rest.

Hopefully, at some point we (or someone else) will get around to releasing a proper Puppet module for this. Until then, I hope this post helps you get set up with the Loggly centralized logging service.

2013-04-12

Loggly from Javascript

Update 2013-10-22: This post refers to Loggly generation 1, and may (most likely) not work with Loggly's new second generation product offering.

For my most recent Dev Days project, I implemented centralized logging for our application, ServiceTrade. I don't want to worry about running our own indexing server, or storing the logs long term, so I investigated several SaaS logging solutions and eventually settled on Loggly. I was impressed with the ease of setting up our account, defining our logging inputs and even integrating with our Puppet configuration management infrastructure. For long term storage, they push raw log files to an S3 bucket of your choosing. Their customer support seemed very eager to help with the one issue I had. All-in-all, I've been pleased with the product.

One thing about Loggly that could use a little work is saved searches. First off, when Loggly gives you a graph of events from a saved search (a very cool feature), the graph sometimes loses information when zooming in and clicking on a section to see specific logs. Visiting the page for a saved search on a specific set of inputs and clicking on the graph to pull up the log lines for that search will give log lines across all inputs, not just the ones the saved search is limited to.

~~Secondly, you are limited to only 5 saved searches at the moment. The saved search feature is in beta, so hopefully they will allow saving of more (ideally unlimited) searches in the future.~~ Apparently, you can have up to 2000 saved searches; the wording on the saved searches list page is out-of-date.

We are using their excellent API to pull down data and do our own visualizations of multiple saved searches. I'm using jQuery on a simple HTML page to query the API. There are a few caveats. The following information will hopefully prevent someone else from spending the half-hour I did trying to figure this out.

First of all, the API uses HTTP basic authentication. jQuery's get call does not handle HTTP authentication, so I had to use the more verbose ajax method.

Also, since the request is cross-domain, I had to use JSONP, which Loggly supports.

Finally, Loggly's API returns a Bad Request response if you send any parameters that it does not recognize. Unfortunately, unless you tell it otherwise, jQuery.ajax() will always send a timestamp query parameter with a JSONP request to prevent response caching. In order to get everything to work, I had to tell the request not to send the parameter, which in jQuery means setting the cache option to true.

Here is what the final call looks like:

You might also want to check out my upcoming blog post on using Puppet to automatically register servers with Loggly.

2013-02-11

Storytellers and Prognosticators: Lessons in Communication Style

Every other Friday, my team holds a retrospective meeting. One issue that came up in our most recent retrospective was a particularly contentious user story estimation meeting that had occurred a few days prior. Voices were raised, people were interrupted mid-sentence, and sarcasm was liberally deployed. This was a specific isolated incident, and I am very proud of my team that we were able to quickly own, discuss and remedy the situation. We are a better team of communicators as a result.

After the retrospective, I thought about different communication styles, and I tried to come up with different personas for the different types of communicators on my team. So far, I've broken the communication styles down into these general personas:

Storytellers love to talk about the past. Their goal is to remind everyone that experience is the best teacher. They are the archive of a team's experiences, reminding everyone of past obstacles and past triumphs. By trying to couch everything in terms of previously encountered situations, storytellers sometimes have trouble recognizing changed circumstances. You can recognize a storyteller by phrases like "Do you remember when..." and "Last time this happened..."

Prognosticators are in some ways the opposite of storytellers. They love to talk about likely outcomes and visions of the future. Their ideas tend to be expressed as innovative approaches to problems and as warnings about potential pitfalls. Prognosticators can sometimes derail a conversation by making predictions based on erroneous assumptions. They can be recognized by phrases like "Something we need to watch out for..." and "It's possible that..."

Prognosticators and storytellers tend to feed off each other, with the latter talking about how a current situation is similar to the past, and the former talking about how the current situation is different. When they get on a roll together, it may be difficult for the other personas to break into the conversation. Both must be careful not to take the conversation off on tangents and drive it away from productive outcomes.

Inquisitors communicate by asking questions (the Socratic method.) Their strength is getting others to think about what they are saying by asking for clarification, and by steering the tone and direction of the conversation to discover new possibilities. Inquisitors are excellent at keeping storytellers and prognosticators on track. It is important that inquisitors remember to contribute knowledge instead of always asking questions to which they already know the answers, otherwise they risk looking condescending. Inquisitors can be recognized by phrases like "What if..." and "What did you mean by..." and "Have you considered..."

Evaluators are good listeners. Unlike the other personas, which tend to drive conversations, evaluators do not speak often, and when they do, they tend to be soft-spoken. Evaluators are good at judging ideas objectively and combining multiple points of view into one cohesive vision. It is important that evaluators do not let themselves get talked over, and that they do not completely hold back from contributing; inquisitors can help draw an evaluator into the conversation. Evaluators can be recognized by phrases like "That's an interesting thought..." and "I was thinking about..."

Most people are more comfortable in one, or a combination of two, personas. I call this their base communicator. These are not discrete communication styles. Elements of any one may be combined with elements of another, and no person is purely one persona or another. From day to day, even within the course of a single conversation, people flow in and out of these different personas.

In conversation, especially in larger groups, it is important to recognize which persona is speaking. Remember that familiarity breeds complacency! As team cohesion grows, you will soon learn to recognize each team member's base communicator, and when that happens it is even more important to pay attention to what they are saying and how they are saying it. Just because you know someone's general communication style, you cannot assume that they will always communicate that way in every circumstance.

2012-11-21

Migrating to Dependency Injection

Recently, I gave a lunch-and-learn to my team on the topic of Dependency Injection (DI). Instead of showing a bunch of slides explaining what DI is and what it's good for, I created a small project and demonstrated the process of migrating a codebase that does not use DI to one that does.

Each stage of the project is a different tag in the repository. The code can be found on GitHub: http://github.com/jadell/laldi. Check out the code, and run composer install. To see the code at each step in the process, run the git checkout command in the header of each section.

`git checkout 0-initial`

The initial stage of the project is a listing of developers on the team, and with whom each has pair programmed. The project uses Silex, which is both a PHP micro-framework and a simple dependency injection container. For the initial stage, only the framework is used (for HTTP request handling, routing and dispatching). Each route handler is registered in index.php. A handler instantiates a controller object, then calls a method on the controller to handle the request. For this project, there is only one controller, the HelloController, but there could be many others.

When the HelloController is instantiated, it instantiates a UserRepository to handle access to User domain objects and a View to handle rendering the output. The UserRepository instantiates a Datasource to handle access to the raw data. The path to the data file is hardcoded in the Datasource. User objects instantiate a UserRepository of their own, so they can access the list of paired Users.

The code smell here is the chain of initializations that get triggered by constructing a new HelloController, which leads to several problems: it tightly couples every component to every other component; none of the components can be unit-tested in isolation from the others; we can't change the behavior of a class or its dependencies without modifying the class; and if the constructor signature of any component changes, we need to find each instantiation of that component and change it to pass in the new parameters (which might involve threading those parameters through multiple layers which don't require them.) There also may be bugs with multiple objects having their own copies of dependencies (for example, each User object should probably share a UserRepository, not have their own.)

`git checkout 1-add-di`

A simple way to solve these problems is to ensure that no class is responsible for creating its own dependencies. In order for a class to access its dependencies, we pass those dependencies into the class's constructor (another form of DI uses "setter" injection instead of constructor injection.)
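As a rough illustration of constructor injection, here is a sketch using simplified stand-ins for two of the classes described above. The method names and the data file path are assumptions for the example, not the code in the repository.

```php
<?php
// Simplified, hypothetical stand-ins for the classes described in the post.
class Datasource
{
    private $path;

    // The path is passed in instead of being hardcoded in the class.
    public function __construct($path)
    {
        $this->path = $path;
    }

    public function getPath()
    {
        return $this->path;
    }
}

class UserRepository
{
    private $datasource;

    // The dependency is received here; the class never calls "new Datasource".
    public function __construct(Datasource $datasource)
    {
        $this->datasource = $datasource;
    }

    public function getDatasource()
    {
        return $this->datasource;
    }
}

// Wiring happens outside the classes (the path is an invented example).
$datasource = new Datasource('/path/to/data.json');
$repository = new UserRepository($datasource);
```

Because the repository only receives a Datasource, a unit test can hand it a stub or a Datasource pointed at fixture data without touching the class.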

The problem now is that all the dependencies must be wired together. Since no class instantiates its own dependencies, we must have a component that does this for us. This is the job of the Dependency Injection Container (DIC).

Silex is also used as a DIC. In bootstrap.php, all class instantiation is wrapped in a factory method that is registered with the DIC. When the DIC is asked for a registered object, if that object is not already instantiated, it will be created. Additionally, all its dependencies will be created and injected through its constructor.

For cases where a component may need to create multiple objects of the same type (like the UserRepository needs to create many User objects), factory closures are created. The factory makes sure that any User object is created with the correct dependencies, and that shared dependencies are not re-created each time. Since the closure is defined in the DIC, the UserRepository does not need access to any of the external dependencies that might be used by User objects, and the closure can be passed to the UserRepository.
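The sketch below shows one way shared services and a factory closure can fit together. The Container class is a tiny hypothetical container written for this example; it is not Silex's API, and the class and service names are assumptions.

```php
<?php
// Minimal hypothetical container written for this example; NOT Silex's API.
class Container
{
    private $factories = array();
    private $instances = array();

    // Register a factory whose result is created once and then shared.
    public function share($name, callable $factory)
    {
        $this->factories[$name] = $factory;
    }

    // Create the service on first request, then return the same instance.
    public function get($name)
    {
        if (!array_key_exists($name, $this->instances)) {
            if (!isset($this->factories[$name])) {
                throw new InvalidArgumentException("Nothing registered as '$name'");
            }
            $this->instances[$name] = call_user_func($this->factories[$name], $this);
        }
        return $this->instances[$name];
    }
}

class User
{
    private $name;
    private $repository;

    public function __construct($name, UserRepository $repository)
    {
        $this->name = $name;
        $this->repository = $repository;
    }

    public function getName() { return $this->name; }
    public function getRepository() { return $this->repository; }
}

class UserRepository
{
    private $userFactory;

    public function __construct(callable $userFactory)
    {
        $this->userFactory = $userFactory;
    }

    public function find($name)
    {
        // The repository never uses "new User" itself; it calls the closure.
        return call_user_func($this->userFactory, $name);
    }
}

$dic = new Container();

// The factory closure knows how to build a User with its shared dependency.
$dic->share('userFactory', function (Container $dic) {
    return function ($name) use ($dic) {
        return new User($name, $dic->get('userRepository'));
    };
});

$dic->share('userRepository', function (Container $dic) {
    return new UserRepository($dic->get('userFactory'));
});

$repository = $dic->get('userRepository');
$alice = $repository->find('alice');
$bob = $repository->find('bob');
```

Both users end up holding the same UserRepository instance, and the repository itself knows nothing about what a User needs beyond a name.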

`git checkout 2-comments`

Our project gets a new requirement to allow comments on users. Following the same pattern as with Users, we create a CommentRepository and a Comment domain object, and factory methods for creating and injecting each. HelloController needs access to the CommentRepository in order to save comments. Also, Users need access to the repository to retrieve all the Comments for that User. Since we are using DI, we change their constructors to receive a CommentRepository.

If we were not using a DIC, we would have to find every place that instantiated a HelloController or User object and change those calls to pass in the CommentRepository. Additionally, we would have to pass the CommentRepository through to any place that instantiated the object. Since we are using a DIC and factory methods, we can be assured that there is only one place that creates each of these object types, and therefore only one place needs to change (the factory methods in the DIC.)

We already get benefits from our refactoring to use a DIC!

`git checkout 3-events`

Our last requirements are to do some simple tracking of page views and notification of comments (perhaps with email, but for demonstration purposes, to a log file.) Each of these is a good use case for an event-driven system.

One of the challenges with event-driven development is making sure that the event listeners are instantiated before the emitters begin sending events. An option for getting around this is to instantiate all listeners as part of the application's bootstrapping/initialization process. But this can become cumbersome and have performance implications.

Instead of always instantiating every listener, we can use our DIC and factory methods. Since event emitters are instantiated through a factory method, we use that same factory method to initialize listeners for any events that emitter may emit. By tying the creation of the emitter to the creation of the listeners, we ensure that we only instantiate those listeners for which we have created an emitter.

For example, we know that the HelloController may emit page view events, so we make sure that whenever we instantiate a HelloController, we also create a page view listener and set it to listen for the event.
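A sketch of that idea, under stated assumptions: the EventEmitter and PageViewListener classes, the 'page.view' event name, and the makeHelloController function are all invented for illustration and are not taken from the repository.

```php
<?php
// Hypothetical minimal event emitter, for illustration only.
class EventEmitter
{
    private $listeners = array();

    public function on($event, callable $listener)
    {
        $this->listeners[$event][] = $listener;
    }

    public function emit($event, $payload = null)
    {
        $listeners = isset($this->listeners[$event]) ? $this->listeners[$event] : array();
        foreach ($listeners as $listener) {
            call_user_func($listener, $payload);
        }
    }
}

// Hypothetical listener that records page views in memory.
class PageViewListener
{
    public $views = array();

    public function onPageView($page)
    {
        $this->views[] = $page;
    }
}

class HelloController
{
    private $events;

    public function __construct(EventEmitter $events)
    {
        $this->events = $events;
    }

    public function hello()
    {
        $this->events->emit('page.view', 'hello');
        return 'Hello';
    }
}

// Factory for the controller: the listener is attached at the same moment
// the emitter-using controller is created, and not before.
function makeHelloController(EventEmitter $events, PageViewListener $listener)
{
    $events->on('page.view', array($listener, 'onPageView'));
    return new HelloController($events);
}

$listener = new PageViewListener();
$controller = makeHelloController(new EventEmitter(), $listener);
$greeting = $controller->hello();
```

If no HelloController is ever requested, the factory never runs and the listener is never attached, which is the point of tying the two together.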

Potential Issues

Several good questions were asked during my presentation, and I attempted to answer them as best as I could.

Is debugging harder having the definitions of the objects away from their usage?
Since you are explicitly passing an object's dependencies, and you get all the dependencies from the same place, you can minimize external state that may affect an object's behavior. Also, by using factory methods, you can find exactly where an object was created. Without the DIC, an object or its dependencies could come from anywhere in your code. DI also makes unit- and regression-testing easier.

Do the bootstrap/DIC definitions become unwieldy when you have a lot of interconnected objects?
The DIC is kept simple by having each factory method responsible for creating only one type of object. If each method is small and isolated, the file may become very large, but the logic in the file will remain easy to read. You can even break the bootstrap of the DIC into multiple files, each file handling a different component or section of the project's dependencies. Circular dependencies may be a problem with a highly connected object graph, but that is a separate problem regardless of using a DIC or not.

Where is a good place to start with refactoring an existing large codebase to use DIC?
It's a multi-step process:

1. Pick a class and create a factory method on a globally accessible DIC that instantiates that class.
2. For every place in the class that uses the "new" operator, replace it with a call to a factory function/method on the DIC.
3. Move calls to the factory methods into the class's constructor, and store the results as properties on the object.
4. Change the constructor to not call the factory methods itself, but receive its dependencies as parameters.
5. Update the class's factory method to call the other factory methods and pass in the dependencies to the constructor.
6. Find any place in the code that instantiates the class and replace it with a call to the class's factory method.

Repeat this process for any usage of the "new" operator in the code. By the end, the only place in the code that uses the "new" operator or refers to the DIC should be the DIC itself.
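The end state of the steps above might look like the following sketch. The Mailer and Notifier classes and the static Dic class are hypothetical names chosen for the example; a real project would more likely use its framework's container.

```php
<?php
// Hypothetical dependency, for illustration only.
class Mailer
{
}

// A globally accessible DIC with one factory method per class (step 1).
class Dic
{
    private static $mailer;

    public static function mailer()
    {
        if (self::$mailer === null) {
            self::$mailer = new Mailer();
        }
        return self::$mailer;
    }

    public static function notifier()
    {
        // Step 5: the factory method gathers the dependencies
        // and passes them to the constructor.
        return new Notifier(self::mailer());
    }
}

// After steps 2 and 3, the constructor would have read:
//     $this->mailer = Dic::mailer();
// After step 4, it only receives the dependency as a parameter.
class Notifier
{
    private $mailer;

    public function __construct(Mailer $mailer)
    {
        $this->mailer = $mailer;
    }

    public function getMailer()
    {
        return $this->mailer;
    }
}

// Step 6: call sites ask the DIC instead of using "new Notifier".
$notifier = Dic::notifier();
```

The only "new" operators left are inside Dic, so a later change to Notifier's constructor is absorbed by a single factory method.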

2012-09-27

Interfaces and Traits: A Powerful Combo

If you're not using interfaces in PHP, you are missing out on a powerful object-oriented programming feature. An interface defines how to interact with a class. By defining an interface and then implementing it, you can guarantee a "contract" for consumers of a class. Interfaces can be used across unrelated classes. And they become even more useful when combined with the new traits feature in PHP 5.4.

Let's suppose we have a class called User. Users have an Address that our application mails packages to via a PackageShipper:

class Address {
 // ... setters and getters for address fields ...
}

class User {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class PackageShipper {
 public function shipTo(User $user) {
  $address = $user->getAddress();

  // ... do shipping code using $address ...
 }
}

Our application is happily shipping packages, until one day a new requirement comes in that we need to be able to ship packages to Companies as well. We create a new Company class to handle this:

class Company {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

Now we have a problem. The PackageShipper class only knows how to handle Users. What we need is for the PackageShipper class to handle anything that has an Address.

We could make a base class that both Users and Companies inherit from and have the PackageShipper accept any class that descends from the base class. This is an unsatisfying solution. Semantically, Users and Companies are different entities, and there may not be enough common functionality to move to a base class that both inherit. They may not have much in common besides the fact that they both have an address. Also, either class may already extend another class, and we cannot extend more than one class since PHP has a single-inheritance object model.

Instead, we can use an interface to define the common parts of Users and Companies that PackageShipper needs to deal with. Then, Users and Companies can implement that interface, and PackageShipper only has to deal with objects that implement the interface.

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

class User implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other user logic ...
}

class Company implements Addressable {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

A class can implement many different interfaces, and classes with different bases can still implement the same interface.

But there is still a problem. Both Company and User implement the Addressable interface using the same code. It is not required to do this; as long as the constraints of the interface are met, the implementation does not have to be the same for every class that implements the interface. But in this case, they both implement the interface in the same way, and we have duplicated code. If a third Addressable class came along, it is possible that it would also implement the interface the same way, leading to even more duplication.

If you are using PHP 5.3 or earlier, there is little that can be done about this situation. But if you are using PHP 5.4, there is a new construct that is made to handle exactly these types of code duplication scenarios: traits.

A trait is similar to a class in that they both implement methods and have properties. The difference is that classes can be instantiated, but traits cannot. Instead, traits are added to class definitions, giving that class all the methods and properties that are defined in the trait.

A helpful way to think about traits is like a macro: using a trait in a class is the same as if the code in the trait were copied into the class.

Using traits, we can clean up the code duplication and still meet the contract defined by the Addressable interface:

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

Now, any class can immediately implement the Addressable interface by simply using the AddressAccessor trait. All the duplicated code has been moved to a single place.

The trait itself does not implement Addressable. This is because only classes can implement interfaces. Remember that traits are nothing more than code that gets copied into a class; they are not classes on their own.
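A small sketch of what that means in practice: using the trait gives a class the methods, but only the implements clause makes it an Addressable. The Warehouse class is a hypothetical addition for the example.

```php
<?php
class Address
{
}

interface Addressable
{
    public function setAddress(Address $address);
    public function getAddress();
}

trait AddressAccessor
{
    protected $address;

    public function setAddress(Address $address)
    {
        $this->address = $address;
    }

    public function getAddress()
    {
        return $this->address;
    }
}

class User implements Addressable
{
    use AddressAccessor;
}

// Hypothetical class: uses the trait but does not declare the interface.
class Warehouse
{
    use AddressAccessor;
}

$user = new User();
$warehouse = new Warehouse();

$userIsAddressable = $user instanceof Addressable;           // true
$warehouseIsAddressable = $warehouse instanceof Addressable; // false
$warehouseUsesTrait = in_array('AddressAccessor', class_uses($warehouse)); // true
```

A shipTo(Addressable $entity) method would accept the User and reject the Warehouse, even though both have identical setAddress and getAddress methods.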

Interfaces in PHP are a useful feature for enforcing code correctness. And when interfaces are combined with traits, they form a powerful tool which allows for faster development, decreased code duplication, greater readability, and better maintainability.

Here is all the final code for the package shipping application:

class Address {
 // ... setters and getters for address fields ...
}

interface Addressable {
 public function setAddress(Address $address);
 public function getAddress();
}

trait AddressAccessor {
 protected $address;

 public function setAddress(Address $address) {
  $this->address = $address;
 }

 public function getAddress() {
  return $this->address;
 }
}

class User implements Addressable {
 use AddressAccessor;

 // ... other user logic ...
}

class Company implements Addressable {
 use AddressAccessor;

 // ... other company logic ...
}

class PackageShipper {
 public function shipTo(Addressable $entity) {
  $address = $entity->getAddress();

  // ... do shipping code using $address ...
 }
}

2012-08-31

Plumbing PHP

When building a project with Neo4j or most other graph databases, it is impossible to avoid learning about Tinkerpop's excellent Gremlin graph processing language. The processing layer of Gremlin is built on top of Pipes, a dataflow programming library.

I was inspired by the syntax and ease-of-use of Gremlin to build a simple processing pipeline library in PHP. The result is Plumber, a library for easily building extensible deferred-processing pipelines.

Plumber is built on top of PHP's native Iterator library. The idea is simple: instantiate a new processing pipeline, attach processing pipes to it, then send an iterator through the pipeline and iterate over the results in a foreach. Each element that comes out the other end of the pipeline has been passed through and processed by each pipe. And because Plumber uses iterators, it natively supports lazy-loading and Just-In-Time evaluation.

A simple example would be reading a set of records from a database, formatting them in some manner, then echo'ing them out to the screen:

// Stand-in for code that retrieves user records from a database as an array or Iterator.
$users = array(
    array('first_name' => 'ada', 'last_name' => 'lovelace'),
);

$names = array();
foreach ($users as $user) {
    if (!$user['first_name'] || !$user['last_name']) {
        continue;
    }

    $name = $user['first_name'] . ' ' . $user['last_name'];
    $name = ucwords($name);
    $name = htmlentities($name);
    $names[] = $name;
}

// later on, display the names
foreach ($names as $name) {
    echo "$name<br>";
}

There are a few obvious downsides to doing things this way: the entire set of records is looped through more than once; all the records must be in memory at the same time (twice even, once for $users and once for $names); and the processing steps in the foreach are executed immediately on every record. These may not seem like a big deal if the record set is small and the processing steps are trivial, but they can become big problems if you are not careful.

Here is the same code using Plumber:

// Stand-in for code that retrieves user records from a database as an array or Iterator.
$users = array(
    array('first_name' => 'ada', 'last_name' => 'lovelace'),
);

$names = new Everyman\Plumber\Pipeline();
$names->filter(function ($user) {
        return $user['first_name'] && $user['last_name'];
    })
    ->transform(function ($user) {
        return $user['first_name'] . ' ' . $user['last_name'];
    })
    ->transform('ucwords')
    ->transform('htmlentities');

// later on, display the names
foreach ($names($users) as $name) {
    echo "$name<br>";
}

The list of $users is only looped through one time, and there is no need to keep a separate list of $names in sync with the $users list. Each $user is transformed into a $name on-demand, keeping resources free.

This can all be accomplished using Iterators, but there is quite a bit of boilerplate code involved. Plumber is meant to remove most of the boilerplate and let the developer concentrate on writing their business logic.
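For comparison, here is roughly what the same pipeline looks like with only PHP's built-in iterators. CallbackFilterIterator and IteratorIterator are part of PHP's standard library; the MapIterator class is a hypothetical helper written for this sketch, and the user records are invented sample data.

```php
<?php
// Hypothetical helper: lazily applies a callback to each element.
class MapIterator extends IteratorIterator
{
    private $callback;

    public function __construct(Traversable $iterator, callable $callback)
    {
        parent::__construct($iterator);
        $this->callback = $callback;
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        return call_user_func($this->callback, parent::current());
    }
}

// Invented sample records standing in for a database result.
$users = new ArrayIterator(array(
    array('first_name' => 'ada', 'last_name' => 'lovelace'),
    array('first_name' => '', 'last_name' => 'nobody'),
    array('first_name' => 'grace', 'last_name' => 'hopper'),
));

// Filter step: skip records missing either name.
$valid = new CallbackFilterIterator($users, function ($user) {
    return $user['first_name'] && $user['last_name'];
});

// Transform step: build, capitalize and escape the full name.
$names = new MapIterator($valid, function ($user) {
    return htmlentities(ucwords($user['first_name'] . ' ' . $user['last_name']));
});

// Nothing has been processed yet; elements are pulled through on demand here.
$result = iterator_to_array($names, false);
```

Every extra step means another wrapper class or another nested constructor call, which is the boilerplate a fluent pipeline hides.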

There is more to Plumber, including several built-in pipe types, and the ability to extend the library with your own custom pipes. It is also not necessary to use the fluent interface, if that is not your style. More usage information can be found in the README file in the Plumber GitHub repo. Constructive feedback is always welcome!