The Rebuilding and Scaling of YellowPages.com
John Straw, chief software architect at YellowPages.com, gave an excellent talk at RailsConf about YellowPages’ conversion to Rails. We’ve pointed to YellowPages in the past as being one of the highest-traffic Rails sites, proving that Rails can scale.
John’s talk covered the scaling issues, but the talk was just as much about the process of successfully doing a big rewrite of a critical application at a large company. (YellowPages.com is part of AT&T.)
You can download John’s presentation from the RailsConf site.
With 2.5 million searches per day and more than 170 million page views per month, YellowPages.com is now the biggest site AT&T runs, generating more traffic than att.com.
Why the rewrite?
The previous version was written in Java in 2004 and 2005 by consultants, and had fundamental design problems. This version had 125K lines of code and no automated tests, making new features hard to implement. The Rails version ended up with fewer than 20K lines of code, including tests.
The Java site also had lots of usability issues and SEO problems. As an example of the difficulty of adding features, John noted that it took three months to add a rating feature.
The requirements for the new version included:
- Absolute control of URLs
- No sessions
- Be agile
- Develop easy-to-leverage core business services
The new product was built by a cross-functional team of 20 people, including project managers, user experience experts, advertising people, search experts, and content and community managers. There were never more than 5 developers. The entire team sat together for the duration of the project.
The architecture they designed has three tiers: a web tier that delivers the front-end web experience; a services tier that responds to requests from the web tier; and a search cluster that performs the actual searches.
The team built an initial Rails prototype in early 2007 that looked like the existing site, just to get some experience with Rails. (Only one of the team members had any prior Rails experience.) They also built prototype search code in Python. They then started a new Rails prototype with designs from the user experience team. They also built a Django prototype to explore that option, and evaluated EJB3/JBoss as a service tier platform.
The team rejected Java for the front end in part because of its inadequate URL control. In terms of Rails vs. Django, John called the decision a near toss-up. Platform maturity was an important attribute that led them to choose Rails. They also felt that it had better automated testing integration and a clearer path to moving parts of it to C if necessary for performance. Ultimately, the development team simply felt more comfortable with it.
John had initially expected they would use Java for the service tier, but after an evaluation of EJB3 the team felt there were no real advantages over Ruby or Python for their application. All the reasons for choosing Rails for the web tier applied equally to the service tier, and having a single software environment has obvious advantages.
Keeping the project moving
The success of this rewrite had as much to do with adeptly managing the process as with technology. One key to keeping it moving was giving one person decision-making authority. Another important factor was freezing development on the existing site, so they wouldn't have a moving target. An ad hoc rule they adopted: if you couldn't quickly figure out how you'd like to change something about the existing site, don't change it.
Another important process was early and frequent communication with the sales team. A previous site redesign had nearly failed because of lack of support from sales. The project lead, who came from the user experience team, met with 20 to 40 salespeople a week to review the proposed changes. This gave them the buy-in they needed for the site to be successful from a business perspective.
The site was made available to friends and family as a closed beta on 4/26/07, an open beta on 5/17/07, and then went live at the end of June, less than six months after beginning work on their first Rails prototype. This was the first time AT&T had released a beta of any kind.
A few things were farmed out to other teams, including HTML/CSS coding, rewrite rules for legacy URL translation, and performance evaluation for the production deployment configuration.
The web tier, service tier, and search cluster are each fronted by an F5 load balancer. The web tier initially ran Apache with 16 Mongrels per server; the service tier runs Apache with 30 Mongrels plus memcached per server.
The communication between tiers uses stateless HTTP transactions. The service tier provides a set of RESTful services that return JSON to the web tier. The search cluster uses the FAST ESP search engine.
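As a rough illustration of this stateless tier-to-tier pattern, here is a minimal Ruby sketch of how the web tier might call a listings service and parse its JSON reply. The host name, endpoint path, parameter names, and response shape are all assumptions for illustration, not details from the talk.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical service-tier host; not from the talk.
SERVICE_HOST = 'service.internal.example'

# Build the URL for a listing search; every parameter travels in the
# query string, so no session state is needed on either side.
def listing_search_uri(term, city)
  URI("http://#{SERVICE_HOST}/listings" \
      "?term=#{URI.encode_www_form_component(term)}" \
      "&city=#{URI.encode_www_form_component(city)}")
end

# One stateless HTTP GET per request; the body comes back as JSON.
def fetch_listings(term, city)
  response = Net::HTTP.get_response(listing_search_uri(term, city))
  parse_listings(response.body)
end

# Decode the JSON body into a plain Ruby hash for the views.
def parse_listings(json_body)
  JSON.parse(json_body)
end
```

Because every call is a self-contained GET, any Mongrel in the service tier can answer any request, which is what lets the F5 balance freely across them.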
The entire system consists of 25 machines, each with two dual-core, 64-bit processors. Two of the 25 are used for the database. This compares to 21 servers for the previous Java-based solution. There are two data centers with identical setups; each is designed to handle the entire load, so they provide geographically distributed redundancy.
They run Solaris on the database machines and CentOS 5 on all the others. John called Solaris “a mistake they wouldn’t repeat,” not because of any particular problems with it but because of a lack of system administrators in their organization who had experience with it.
They use Oracle for the database engine, which had trouble with the many connections opened by the large number of Mongrels. As a result, memory usage on the database servers was high, and those machines were upgraded to 12 GB of RAM.
The performance goals, which were achieved, were:
- Sub-second home page load time
- 4-second average search time
- Never dies
They wrote custom Mongrel handlers that pass certain requests straight from the web tier to the service tier without involving Rails. They also wrote a C library for parsing search-cluster responses and turning them into Ruby hashes.
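The talk doesn't describe the search cluster's wire format, so as a sketch of the response-parsing step, assume a simple "key=value" line protocol. Production used a C extension for speed; this pure-Ruby version just shows the shape of the transformation into a hash.

```ruby
# Turn a raw search-cluster response into a Ruby hash.
# The "key=value" line format here is an invented stand-in for
# whatever FAST ESP actually returns.
def parse_search_response(raw)
  raw.each_line.with_object({}) do |line, hash|
    key, value = line.chomp.split('=', 2)
    # Skip malformed lines that lack a key/value pair.
    hash[key] = value if key && value
  end
end
```

Pushing this kind of hot-path parsing down into C is a common Rails optimization: the framework stays in Ruby while the per-request string crunching runs at native speed.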
They used the asset_packager plug-in (now made obsolete by Rails 2) and moved image serving to an Akamai edge cache.
They found that Apache was slow serving the 42-byte single-pixel GIFs that they use as analytics tags, which drove a shift to Nginx for the web tier.
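A hedged sketch of what such an analytics-pixel location might look like in Nginx; the talk doesn't show their configuration, and the path and log file below are assumptions. Nginx's stock `empty_gif` module answers with a tiny in-memory 1x1 transparent GIF, so each hit costs no disk or upstream work beyond the access-log write.

```nginx
# Hypothetical fragment: serve the analytics tag from memory
# and record the hit for later analysis.
location = /tracking.gif {
    empty_gif;
    access_log /var/log/nginx/analytics.log;
}
```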
After this tuning, performance was better than the Java site's. In terms of availability, all site availability issues in the first six months were due to database problems.
All things considered, it was a very successful rewrite. While the problems with the prior Java site cannot all be blamed on the technology, it is clear that Rails gave the team significant advantages with no loss of performance.