2010-07-11

Parallel Testing in the Cloud with Feature Branches

At iContact, we run unit and functional tests in a cloud of testing machines (we call it a "pool," but it's really a cloud). Each test suite is wrapped in a job and farmed out to one of several Gearman job queues. Worker processes grab a job from the queue, reserve an isolated local database, run the job, and report the results back. Suites run in parallel.
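To make that loop concrete, here is a minimal sketch of the grab, reserve, run, report cycle. It is not the real worker code: Python's standard `queue` and `threading` modules stand in for Gearman, an in-memory SQLite database stands in for each worker's local database, and `SCHEMA`, `run_suite`, `worker`, and `run_all` are hypothetical names made up for the illustration.

```python
import queue
import sqlite3
import threading

# Assumed one-table schema, purely for illustration.
SCHEMA = "CREATE TABLE contacts (id INTEGER PRIMARY KEY, email TEXT);"


def run_suite(suite_name, db):
    """Hypothetical stand-in for a real test suite; returns True on pass."""
    db.execute("INSERT INTO contacts (email) VALUES (?)",
               (suite_name + "@example.com",))
    (count,) = db.execute("SELECT COUNT(*) FROM contacts").fetchone()
    # Passes only if no other suite's rows are visible in this database.
    return count == 1


def worker(jobs, results):
    # Each worker reserves its own isolated database.
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    while True:
        try:
            suite_name = jobs.get_nowait()  # grab a job from the queue
        except queue.Empty:
            break
        db.execute("DELETE FROM contacts")  # reset data between suites
        results.put((suite_name, run_suite(suite_name, db)))  # report back
    db.close()


def run_all(suite_names, worker_count=3):
    """Run every suite on a small pool of workers; return {suite: passed}."""
    jobs, results = queue.Queue(), queue.Queue()
    for name in suite_names:
        jobs.put(name)
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(suite_names)))
```

Calling `run_all(["unit", "functional"], worker_count=2)` returns a dict mapping each suite name to its pass/fail result. Because every worker owns its database, suites running at the same time never see each other's rows.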

Multiple developers use the cloud at the same time, and their tests remain isolated from each other. As we add developers (and tests), we scale the system by setting up a new test server, putting a new set of databases on it, and running more worker processes on it to talk to those databases. (From now on, I'll refer to a combination of process and database as a "worker.")

Each worker's schema is an identical copy of the schema we write our code against. When we write code that requires schema alterations, we have to make those changes to every worker in the cloud, because we don't know which worker will pick up the tests of the new schema. Since our development process involves running all test suites before checking in to source control, the workers must have the updated schema before the code that uses it is available to other developers. The result: broken tests for every developer until the code is checked in.

To compound this problem, we have adopted a new branching strategy for our source control repository. All new features are developed in separate branches, then merged into a central trunk before deployment. Each branch gets its own development environment, complete with web servers and databases that are only accessible by that branch. Tests for every branch, however, still run in the same test cloud on the same workers. So even though code changes are isolated from one another, the schema change problem remains.

I've been thinking about how we might solve, or at least alleviate, this problem. Here are some thoughts:

Separate Clouds

Each branch environment gets its own cloud of test workers. The developers have free access to alter the schema they are developing against, and they are responsible for making the same alterations to their set of workers. The alterations do not need to be pushed to the other test clouds until the branch is merged back into trunk.

Pros:
  • Solves the current problem: tests of new schema do not step on tests that do not expect that schema.
  • Encourages experimenting with schema without bothering the team responsible for maintaining the main test cloud.

Cons:
  • Complicates the automated branch environment setup process.
  • Burdens the developers with deploying every schema change to their branch and all workers.
  • De-commoditizes the workers; we want to allocate workers to one branch or another as needed. In this setup, we would have to know ahead of time that one branch will need more workers than another.

Dynamic Schema

When a new worker starts running, it reads a schema and uses it to build its own database. This happens once when the worker is created. This could be modified, such that a worker always refreshes its schema before each test run. The test job would indicate where its expected schema lives, and the worker would use that schema.
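A minimal sketch of what "refresh before each run" could look like, assuming the job carries a path to a SQL schema file. SQLite stands in for the worker's real database, and the job layout (`suite`, `schema_path`) and the function names are hypothetical, not part of the current system.

```python
import sqlite3
from pathlib import Path


def build_database(schema_path):
    """Create a fresh database from the schema file the job points at."""
    db = sqlite3.connect(":memory:")
    db.executescript(Path(schema_path).read_text())
    return db


def table_names(db):
    """Example 'suite': report which tables the worker's schema contains."""
    rows = db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [name for (name,) in rows]


def run_job(job):
    """Refresh the schema, then run the suite against it.

    `job` is a hypothetical dict: {"suite": callable, "schema_path": str}.
    """
    db = build_database(job["schema_path"])  # rebuilt before every run
    try:
        return job["suite"](db)
    finally:
        db.close()
```

Two jobs that point at different schema files can then land on the same worker back to back, and each one sees only the tables its own branch expects.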

Pros:
  • Maintains test workers as a commodity.
  • Simplifies "experimentation"; changes are made in the canonical location/database, and the worker automatically picks them up.
  • Easily implemented; needs the fewest process and infrastructure changes from the current system.

Cons:
  • Complicates worker process, which is already pretty complicated.
  • Increases test run time; each of the 600+ suites rebuilds the schema before it runs.
  • Disguises test failures; schema thrash means the schema may have already been altered by another branch's tests by the time a failure is reported.

Worker Dispatcher

A combination of the two ideas above. Test runs are submitted to a central dispatcher, which counts the number of test suites being run, reserves an appropriate number of free workers from the cloud, gives them their schema (once, at reservation time), then creates the test jobs in a queue that only those workers listen to. Workers run until there are no more jobs in the queue, then return to the cloud.
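The dispatcher's steps can be sketched roughly as follows. This is an illustration under stated assumptions, not a design: threads and a standard-library `queue.Queue` stand in for worker processes and a private Gearman queue, SQLite stands in for each worker's database, and `Worker`, `Dispatcher`, and the `suites_per_worker` sizing rule are all hypothetical.

```python
import queue
import sqlite3
import threading


class Worker:
    """A process-plus-database pair; the database is in-memory SQLite here."""

    def __init__(self, name):
        self.name = name
        self.db = None

    def load_schema(self, schema_sql):
        # The schema is loaded by the dispatcher thread and used by the
        # worker thread, so SQLite's same-thread check is turned off.
        self.db = sqlite3.connect(":memory:", check_same_thread=False)
        self.db.executescript(schema_sql)

    def drain(self, jobs, results):
        # Run until the private queue is empty.
        while True:
            try:
                suite = jobs.get_nowait()
            except queue.Empty:
                return
            results.put((suite.__name__, self.name, suite(self.db)))


class Dispatcher:
    def __init__(self, workers, suites_per_worker=2):
        self.free = list(workers)
        self.suites_per_worker = suites_per_worker
        self.lock = threading.Lock()

    def run(self, suites, schema_sql):
        if not suites:
            return []
        # Count the suites and reserve a proportional number of free workers.
        wanted = -(-len(suites) // self.suites_per_worker)  # ceiling division
        with self.lock:
            reserved, self.free = self.free[:wanted], self.free[wanted:]
        if not reserved:
            raise RuntimeError("no free workers")
        # A queue that only the reserved workers listen to.
        jobs, results = queue.Queue(), queue.Queue()
        for suite in suites:
            jobs.put(suite)
        try:
            for w in reserved:
                w.load_schema(schema_sql)  # once, at reservation time
            threads = [threading.Thread(target=w.drain, args=(jobs, results))
                       for w in reserved]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        finally:
            for w in reserved:
                if w.db is not None:
                    w.db.close()
                    w.db = None
            with self.lock:
                self.free.extend(reserved)  # workers return to the cloud
        return [results.get() for _ in range(results.qsize())]
```

Each call to `run` gets its own queue and its own set of workers, so two branches with different schemas can use the cloud at the same time without touching each other's databases. The cost is the one noted below: the dispatcher is a single component that every test run depends on.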

Pros:
  • All the pros listed for "Separate Clouds" and "Dynamic Schema" except ease of implementation.

Cons:
  • Most complex solution.
  • Adds a new component to maintain, and a new central point of failure.
  • Doesn't solve the thrash problem.

I think that the "Worker Dispatcher" idea is my favorite. I'm not too concerned with the schema thrash problem. Every developer has their own non-cloud copy of the database against which individual failing tests can be run for investigative purposes. A developer shouldn't be examining the worker database's contents directly anyway, since by the time a failure is reported, that database has probably already been used by another test suite.

I'm not a fan of the complexity of the solution, though.
