How to Survive a 1000% Traffic Increase (and Live to Blog About It)

Here at Demandbase, we’re justifiably proud of our software and systems. Hundreds of large companies depend on us for B2B data, delivered in real time, to drive the analytics and personalization that in turn drive their business. The process goes a little something like this:

    1. Visitor loads a page on a Demandbase customer’s site
    2. Page sends the visitor’s IP address to Demandbase via a REST API call (sketched just after this list)
    3. Demandbase returns 40 different attributes of the visitor’s company in 2–400 ms
    4. Repeat. A lot.
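
For the curious, here’s roughly what that lookup looks like from the caller’s side. This is a minimal Python sketch under assumed names: the endpoint URL, parameter names, and response fields are illustrative, not our actual API contract.

    import requests

    def identify_company(visitor_ip: str, api_key: str) -> dict:
        """Resolve a visitor IP to firmographic attributes with one REST call."""
        resp = requests.get(
            "https://api.example.com/v1/ip",            # hypothetical endpoint
            params={"ip": visitor_ip, "key": api_key},  # assumed parameter names
            timeout=0.5,  # responses come back in 2-400 ms, so keep the budget tight
        )
        resp.raise_for_status()
        return resp.json()  # ~40 attributes, e.g. company name, industry, revenue band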

In the first few months of 2014, we averaged about 20,000 of these transactions per minute, distributed across our east and west coast availability zones. But as our business continued to grow, our customers were sending us more and more data, and by July 1 our transaction volume had nearly tripled to almost 60,000 per minute. Since we can easily scale horizontally by adding more servers, life was good. Until we woke up one Monday morning to find 200,000 requests per minute flooding in. (That’s more than a tripling of traffic overnight, if you’re counting.) After verifying that the traffic was in fact valid and we weren’t under a DoS attack, we relaxed a bit. The front end scaled very nicely and API response times remained rock solid, but, as in life, stress tends to reveal hidden weaknesses.

In this case, we were fortunate: the weaknesses we discovered were confined to the reporting of real-time data. See, we log every API transaction in a high-throughput, asynchronous way: front-end API servers send transactions to a series of brokers that write to Amazon S3, and another process then picks those records up and writes them to HDFS. Buffering can occur at several points in this chain, so we’re OK when traffic spikes. But we also have lower-throughput services running on infrastructure sized for their lower traffic volumes. Unfortunately, our real-time visit ticker was one of those services. It couldn’t scale to meet the new traffic levels, so the team had to work quickly to divert ticker traffic to a more robust system.
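
To make the buffering idea concrete, here’s a rough sketch of the pattern (not our production code, and the bucket and key names are invented): the request path only enqueues a record, while a background worker batches records and flushes them to S3, so a traffic spike fills a buffer instead of slowing down API responses.

    import json
    import queue
    import threading
    import time

    import boto3

    class TransactionLogger:
        """Buffers API transactions in memory and flushes them to S3 in batches."""

        def __init__(self, bucket: str, flush_size: int = 1000, flush_secs: float = 5.0):
            self._q = queue.Queue()
            self._bucket = bucket
            self._flush_size = flush_size
            self._flush_secs = flush_secs
            self._s3 = boto3.client("s3")
            threading.Thread(target=self._drain, daemon=True).start()

        def log(self, txn: dict) -> None:
            # Called on the request path; enqueuing is cheap and never waits on S3.
            self._q.put(txn)

        def _drain(self) -> None:
            batch, deadline = [], time.time() + self._flush_secs
            while True:
                try:
                    batch.append(self._q.get(timeout=0.1))
                except queue.Empty:
                    pass
                if len(batch) >= self._flush_size or (batch and time.time() >= deadline):
                    body = "\n".join(json.dumps(t) for t in batch)
                    key = f"api-logs/{int(time.time())}.jsonl"  # invented key scheme
                    self._s3.put_object(Bucket=self._bucket, Key=key, Body=body.encode())
                    batch, deadline = [], time.time() + self._flush_secs

In our pipeline, the brokers and S3 play that buffering role; the ticker ran without this kind of slack, which is why it was the piece that couldn’t keep up.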

We’re happy to report that everything is humming along nicely now and we’re nearly done with a new, scalable ticker architecture. But what did we learn? It turns out the lessons here aren’t exactly about architecture. We can handle massive traffic spikes – cool, but we knew that already. The lesson for us is about when to scale an MVP. See, we implemented the ticker on low-throughput infrastructure because it was a simple, straightforward way to get the functionality in front of customers. We got early feedback, worked it into the UX, and congratulated ourselves on a job well done. But with Sales and CSM signing up customers left and right, it turns out we needed to scale a lot sooner than we thought.

So Product folks, heed my warning – and follow these steps:

    1. Get your experiment out and iterate quickly
    2. Keep an eye on leading indicators of success, such as customer engagement and sales forecasts
    3. Work with Engineering to get a realistic estimate of how long it will take to scale that experiment
    4. Armed with those critical data points, decide when (or if) to sink your company’s precious resources into scaling.

In conclusion, premature scaling is the root of all evil (h/t Donald Knuth), but overdue scaling can be darn uncomfortable when your success is raining down on you at 3,000 transactions per second.