2011-07-23

Performance of Graph vs. Relational Databases

A few weeks ago, Emil Eifrem, CEO of Neo Technology gave a webinar introduction to graph databases. I watched it as a lead up to my own presentation on graph databases and Neo4j.

Right around the 54 minute mark, Emil talks about a very interesting experiment showing the performance difference between a relational database and a graph database for a certain type of problem called "arbitrary path query", specifically, given 1,000 users with an average of 50 "friend" relationships each, determine if one person is connected to another in 4 or fewer hops.

Against a popular open-source relational database, the query took around 2,000 ms. For a graph database, the same determination took 2 ms. So the graph database was 1,000 times faster for this particular use case.

Not satisfied with that, they then decided to run the same experiment with 1,000,000 users. The graph database took 2 ms. They stopped the relational database after several days of waiting for results.

I showed this clip at my presentation to a few people who stuck around afterwards, and we tried to figure out why the graph database had the same performance with 1,000 times the data, while the relational database became unusably slow. The answer has to do with the way in which each type of database searches information.

Relational databases search all of the data looking for anything that meets the search criteria. The larger the set of data, the longer it takes to find matches, because the database has to examine everything in the collection.

Here's an example: let's assume there is a table with "friend" relationships:
> SELECT * FROM friends;
+-------------+--------------+
| user_id     | friend_id    |
+-------------+--------------+
| 1           | 2            |
| 1           | 3            |
| 1           | 4            |
| 2           | 5            |
| 2           | 6            |
| 2           | 7            |
| 3           | 8            |
| 3           | 9            |
| 3           | 10           |
+-------------+--------------+
In order to see if user "9" is connected to user "2", the database has to find all the friends of user "9", and see if user "2" is in that list. If not, find all of their friends, and then see if user "2" is in that list. The database has to scan the entire table each time. This means that if you double the number of rows in the table, you've doubled the amount of data to search, and thus doubled the amount of time it takes to find what you are looking for. (Even with indexing, it still has to find the values in the index tree, which involves traversing that tree. The index tree grows larger with each new record, meaning the time it takes to traverse grows larger as well. And for each search, you always start at the root of the tree.)

Each new batch of friends to look at requires an entirely new scan of the table/index. So more records leads to more search time.

Conversely, a graph database looks only at records that are directly connected to other records. If it is given a limit on how many "hops" it is allowed to make, it can ignore everything more than that number of hops away.
Only the blue records are ever seen during the search. And since a graph traversal "remembers" where it is at any time, it never has to start from the beginning, only from its last known position.

You could add another ring of records around the outside, or add a thousand more rings of records outside, but the search would never need to look at them because they are more steps away than the limit. The only way to increase the number of records searched (and thereby decrease performance) is to add more records within the 2-step limit, or add more relationships between the existing records.

So the reason why having 1,000 vs. 1,000,000 records causes such a stark difference between a relational and a graph database is that relational database performance decreases in relation to the number of records in the table, while graph database performance decreases in relation to the number of connections between the records.

2011-07-19

TriPUG Meetup slides

Here are the slides for my presentation on graph databases at the TriPUG meetup tonight: http://www.slideshare.net/JoshAdell/graphing-databases

Overall, I think it was pretty well received. I didn't mean for it to last an hour and a half, but there it is. There were a lot of new folks there; I'm encouraged at the enthusiasm of the developer community in the Triangle region. Thanks for having me!

2011-07-16

agile Adoption on a Non-Technical Team

I'm pretty excited about a new development at my office. Based on my team's success in increasing our productivity, efficiency and transparency, another team has decided to adopt "agile-like" practices. The most interesting thing is that this team isn't a technical team at all.

It's probably fair to say that what they're adopting is less like agile and more like lean. The goal was to increase visibility to their management structure on what the team was doing each day, increase the efficiency of their work, minimize interruptions of team members and raise problems earlier. To accomplish this, they have implemented a stripped-down version of Scrum, with the following practices:
  • Planning:
  • break-down of larger, more ambiguous mid- to long-term goals into shorter, more easily tracked tasks.
  • Daily Standup:
  • team stand-up meeting every day in the morning, with each team member answering "What did I do yesterday?", "What am I going to do today?" and "Is anything preventing me from getting things done?"
  • Scrum Board:
  • a week-long calendar set up inside the team's area in a highly trafficked place so that anyone else (including team managers!) can see at a glance how the team is doing without having to ask individual team members.
  • Retrospective:
  • periodic gathering of the team to discuss the new processes, how the processes are helping/hindering and what to change to make the team more efficient.
The team's goals are very different from that of a software development team, so we encouraged them to not follow our process dogmatically, and instead, to come up with methods that work for their team's unique needs (the core of "a"gile vs. "Agile".) For instance, my team operates on a 2 week sprint cadence (a new sprint every two weeks.) Their team receives the team goals on a monthly basis. Rather than adopt 2-week or 4-week sprints, they have chosen to break up the monthly goals into smaller and more transparent 1-week goals. This provides them with earlier warning if things start to go sideways, and provides motivation to not let things stack up until the end of the month.

Also, their work does not fit easily into "delivering feature or product X by the end of the month." Rather, they are tasked with having certain regions of the country covered by certain baselines of recruitment effort (for partner and vendor recruiting.) So the "User Stories" on their backlog are phrased as "15% coverage of NJ/NY metro area" or "Decrease turnover 10% in southern VA." Their tasks become things like calling or emailing specific contacts, setting up conference calls and facilitating the transition of a contact into a partner.

The Scrum board also has little stickers (a frog for their direct manager and Venom for her manager) that they can attach to a specific task to signal that they need additional help with an item. I think this is a neat idea, because it means that their managers don't have to ask them many times a day if everything is going smoothly. A quick glance at the Scrum board gives them the same information.

A few sprints ago, my team decided to start sharing a few minutes about our agile practices at each Sprint Review. I'm really glad to see that not only were people not spacing out (which I had feared they might) but were actually absorbing the information and thinking about how it could apply to their own team. I'm very excited to see how this new process works for our recruitment team.

2011-07-06

Neo4jPHP beta released

I've been working on a PHP client for the Neo4j REST API for little while. I think it's ready for some real-life testing. The beta version is available here: https://github.com/jadell/Neo4jPHP/tarball/0.0.1-beta

Features:
  • Developed against the Neo4j 1.4 milestone releases
  • Simple, object-oriented API
  • Almost complete REST API coverage
  • Indexing of nodes and relationships, including exact match and query support
  • Cypher queries (thanks to Jacob Hansson)
  • Traversal support, including paged traversals
  • Lazy-loading of node and relationship data

Hopefully coming soon:
  • Client-side caching
  • Batch operations

There are some usage examples included.

It's a beta release, so please be gentle (on me, that is; be as rough as you want with the code.) If anyone finds any bugs or has feature requests, please use the GitHub issues page.


Update 2011-07-08:I would be remiss if I didn't mention the other PHP Neo4j REST client available at https://github.com/tchaffee/Neo4J-REST-PHP-API-client. It is built for the current stable release of Neo4j 1.3.