In today’s regulatory environment, there are two constants about compliance: requirements are changing and the volume of data to manage is immense. For Craigslist, the popular classifieds and job posting community that serves 570 cities in 50 countries, this means having to archive years of accumulated data where the structure of that data has changed numerous times. With 1.5 million new classified ads posted every day, Craigslist must archive billions of records in many different formats, and must be able to query and report on these archives at runtime.
Historically, Craigslist stored the information in a MySQL cluster. But the lack of flexibility and management costs became barriers for continued use of MySQL. A simple schema change on their vast archive took months to complete, preventing them from pushing new features. Craigslist realized that a NoSQL database would provide them the flexibility they needed, and they ultimately settled on MongoDB for its scalability and flexible schema. In 2011, Craigslist migrated over two billion documents to MongoDB from their archive of classified ads.
For much of the history of Craigslist, MySQL was the only option for data storage, including the archive. The original Craigslist archive application took the existing live database data and copied it to the archive system. But using a relational database system limited flexibility -- if the live database schema changed, those changes needed to be propagated to the archive system.
This led to a hodgepodge schema and lengthy delays in making changes. For example, each ALTER TABLE statement took months to complete on the MySQL archive. When making changes to billions of rows in their MySQL cluster, Craigslist could not move data to the archive. Archive-ready data would pile up in the production database. During these periods, performance on the live database deteriorated.
In addition to a more flexible schema, Craigslist wanted to reduce operational complexity. Craigslist needed a solution that would allow them to add new machines without downtime and route around dead machines without clients failing. To overcome these challenges the team began looking for an alternative solution.
After evaluating several NoSQL options, Craigslist settled upon MongoDB for its flexible document-based storage and built-in scalability. Each post and its metadata can be stored as a single document. As the schema changes on the live database, MongoDB can accommodate these changes without costly schema migrations.
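To make the flexible-schema point concrete, here is a minimal sketch of how postings with different shapes can sit side by side in one collection. It uses plain Python dictionaries as a stand-in for a document store rather than a real MongoDB driver. The field names (`title`, `category`, `price`, `compensation`, `images`) and the helper `find` are hypothetical illustrations, not Craigslist's actual schema or code.

```python
# Stand-in for a document collection: a list of dicts.
# All field names are hypothetical, chosen only for illustration.
archive = [
    # An older posting written under an earlier schema.
    {"_id": 1, "title": "Used bike", "category": "for sale", "price": 120},
    # A newer posting with fields the older schema never had.
    {"_id": 2, "title": "Line cook wanted", "category": "jobs",
     "compensation": "hourly", "images": ["kitchen.jpg"]},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    Documents that lack a requested field simply do not match, so
    old and new document shapes can be queried together.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# A posting with yet another new field is appended as-is;
# the existing documents are not rewritten.
archive.append({"_id": 3, "title": "Studio sublet", "category": "housing",
                "sqft": 450})

jobs = find(archive, category="jobs")
large_listings = find(archive, sqft=450)
```

The design point is that adding the `sqft` field touches only the new document, which is the property that lets an archive absorb schema changes without a bulk rewrite of existing records.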
Scalability and Availability
MongoDB’s support for auto-sharding and high availability eased several operational pain points for Craigslist. MongoDB enabled Craigslist to scale horizontally across commodity hardware without having to write and maintain complex, custom sharding code. Using auto-sharding, Craigslist’s initial MongoDB deployment was designed to hold over 5 billion documents and 10TB of data. MongoDB’s support for automated failover of nodes via replica sets was another big win, providing high availability within the Craigslist cluster. In their previous architecture, failover was a manual process requiring significant effort from the Craigslist operations team.
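As a rough illustration of what sharding does on an application's behalf, the toy sketch below routes a document to a shard by comparing its shard key against a sorted list of range boundaries. This is a simplified model of range-based partitioning, not MongoDB's implementation. The shard names, split points, and the choice of a numeric posting ID as the shard key are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical split points on a numeric posting ID used as the shard key.
# Keys below 1,000,000 go to the first shard, keys from 1,000,000 up to
# (but not including) 2,000,000 go to the second, and so on.
SPLIT_POINTS = [1_000_000, 2_000_000, 3_000_000]
SHARDS = ["shard-a", "shard-b", "shard-c", "shard-d"]


def route(posting_id):
    """Return the name of the shard whose key range contains posting_id."""
    # bisect_right counts how many split points are <= posting_id,
    # which is exactly the index of the owning range.
    return SHARDS[bisect_right(SPLIT_POINTS, posting_id)]
```

With custom sharding, the application has to own and maintain this routing table itself, along with rebalancing it as data grows. Auto-sharding moves that bookkeeping into the database.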
Ease of Use
Because MongoDB concepts and features are similar, in many respects, to relational databases, Craigslist's developers were able to hit the ground running to develop their archiving solution. For lead developer Jeremy Zawodny, the author of High Performance MySQL, the transition was easy. “It’s friendly,” Zawodny explains. “By friendly, I mean that coming from a relational background, specifically a MySQL background, a lot of the concepts carry over.... It makes it very easy to get started.”
Proven, Supported Technology
Compared to the other NoSQL options, MongoDB is a broadly used technology with many major deployments. At the time of the evaluation, MongoDB was one of the few NoSQL options with a well-supported Perl interface. During development, Craigslist ran into a bug in the Perl driver, which was fixed by 10gen within hours. This gave the team added confidence in the database and the team behind it.
Industry: Consumer Web
Location: San Francisco, CA
- Ability to archive years of accumulated, ever-changing data without costly schema migrations
- Horizontally scalable solution
- Easy to use, especially for a team with relational database experience
- Well-supported language drivers