Reducing MongoDB traffic by 78% with Redis

TL;DR: 31 lines of Rack middleware leverage Redis for highly performant, flexible response caching.

As Crashlytics has scaled, we’ve always been on the lookout for ways to drastically reduce the load on our systems. We recently brought production Redis servers online for some basic analytics tracking and we’ve been extremely pleased with their performance and stability. This weekend, it was time to give them something a bit more load-intensive to chew on.

The vast majority – roughly 90% – of inbound traffic to our servers is destined for the same place. Our client-side SDK, embedded in apps on hundreds of millions of devices worldwide, periodically loads configuration settings that power many of our advanced features. These settings vary by app and app version, but are otherwise identical across devices – a prime candidate for caching.

There are countless built-in and third-party techniques for Rails caching, but we sought something simple that could leverage the infrastructure we already had. Wouldn’t it be great if we could specify a cache duration in any Rails action and it would “just work”?

    cache_response_for 10.minutes

Rack Middleware to the Rescue

One of the most powerful features of Rack-based Rails is middleware – functionality you can inject into the request-processing pipeline to adjust how each request is handled. Middleware lets us check Redis for a cached response and fall through to the standard Rails action only on a cache miss.

    class RackRedisCache
      def initialize(rails)
        # Hold a reference to the next app in the Rack stack (ultimately, Rails).
        @rails = rails
      end

      def call(env)
        # Key the cache on the full request path, query string included.
        cache_key = "rack::redis-cache::#{env['ORIGINAL_FULLPATH']}"

        # HGETALL returns an empty hash when the key does not exist.
        data = REDIS.hgetall(cache_key)
        if data['status'] && data['body']
          # Cache hit: serve the stored response without touching Rails.
          Rails.logger.info "Completed #{data['status'].to_i} from Redis cache"
          [data['status'].to_i, JSON.parse(data['headers']), [data['body']]]
        else
          # Cache miss: fall through to Rails, then inspect its response.
          @rails.call(env).tap do |response|
            response_status, response_headers, response_body = *response
            # Delete our internal header so it never reaches the client;
            # #to_i yields 0 when the header is absent.
            response_cache_duration = response_headers.delete('Rack-Cache-Response-For').to_i

            if response_cache_duration > 0
              # Store the three response components under a single Redis hash.
              # (In Rails, the third element of the response tuple exposes #body.)
              REDIS.hmset(cache_key,
                'status', response_status,
                'headers', response_headers.to_json,
                'body', response_body.body
              )

              # Let Redis evict the entry once the duration elapses.
              REDIS.expire(cache_key, response_cache_duration)
              Rails.logger.info "Cached response to Redis for #{response_cache_duration} seconds."
            end
          end
        end
      end
    end
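
For completeness, here is a minimal sketch of the wiring the middleware assumes. The initializer path and connection settings below are illustrative, not from our production setup:

    # config/initializers/redis.rb (hypothetical) – defines the REDIS constant
    # the middleware uses; host and port are placeholders.
    require 'redis'
    REDIS = Redis.new(host: 'localhost', port: 6379)

    # config/application.rb – append the middleware to the Rack stack.
    config.middleware.use RackRedisCache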

A response in Rails consists of three components – the HTTP status, the HTTP headers, and of course, the response body. For clarity, we store these under separate keys within a single Redis hash, JSON-encoding the headers to convert them into a string.

If the cache key is not present, the middleware falls through to calling the action, then checks an internal header value to determine whether the action wants its response cached. The final critical line leverages Redis' key-expiration functionality to ensure the cached entry is only valid for a given amount of time. It couldn't get much simpler.
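
To make this concrete, here is what a cached entry might look like from the Redis client's point of view. The path, header, and payload shown are purely illustrative:

    # Hypothetical entry for a settings request (values shortened):
    REDIS.hgetall("rack::redis-cache::/settings")
    # => {
    #      "status"  => "200",
    #      "headers" => "{\"Content-Type\":\"application/json\"}",
    #      "body"    => "{\"features\":{\"advanced\":true}}"
    #    }

    # Seconds remaining until Redis evicts the entry:
    REDIS.ttl("rack::redis-cache::/settings")
    # => 583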

Implementing our DSL

To tie it all together, the ApplicationController needs a simple implementation of cache_response_for that sets the header appropriately:

    def cache_response_for(duration)
      # Rack header values must be strings; Duration#to_i gives seconds.
      headers['Rack-Cache-Response-For'] = duration.to_i.to_s
    end
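
For example, a hypothetical settings action could opt in like this (the controller and model here are illustrative, not our actual code):

    class SettingsController < ApplicationController
      def show
        # Cache the rendered response in Redis for the next ten minutes.
        cache_response_for 10.minutes
        render json: Settings.for_app(params[:app_id])  # hypothetical lookup
      end
    end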

Boom. It was really that easy.

Impact?

This implementation took us only about an hour to develop and deploy, and the effects were immediate. Only 4% of these requests now fall through to Rails, CPU usage on our API servers has plummeted, and total queries to our MongoDB cluster are down 78%. An hour well spent. Our Redis cluster also doesn't sweat its increased responsibility: its CPU usage is up only marginally!

Join Us!

Interested in working on these and other high-scale challenges? We're hiring! Give us a shout at jobs@crashlytics.com. You can stay up to date with all our progress on Twitter and Facebook.

Comments
  • http://twitter.com/justizin Justizin

    Far better would be to influence legit cache-control headers with your DSL, and put Varnish up front – avoid using Rack / Ruby slots at all! :)

    • Jonathan Matthews

      Seconded. Why write code that'll only get tested in your infra, instead of taking advantage of the huge body of battle-tested work that is Varnish (or nginx, or squid (yuk!), etc., etc.)? cf. http://en.wikipedia.org/wiki/Not_invented_here ;-)

      • http://twitter.com/cheapRoc cheapRoc

        ^– this… it should be the EXACT same testing as far as I’m concerned.

    • http://twitter.com/jeffseibert Jeff Seibert

      Fair point, but we’d need new software in production (Varnish), and far more testing. Using our approach, we’ve proven it out in just an hour of work, and paved the way for a more infrastructure-level solution down the road. Consider this our caching MVP :)

      • http://twitter.com/plukevdh Lukas

        You’re going to need new software in production in the future, without question. Every new feature you write is new software in production. This post translates in my mind to “We didn’t want to take the time to do the right thing, so we did the fast thing instead” which is the ethos of almost all startups and why so many in recent history have had public and embarrassing (and often costly) downtimes.

        What happens when your infrastructure starts piling more and more into Redis, more and more actions start caching, and you start to hit memory limits on your servers? Any idea what Redis does in the case that it runs out of memory? Did you test your own solution for hardware edge cases? I can tell you right now, Redis gets ugly very fast when it runs out of memory. You’ve moved software solutions to rely on hardware and once Redis stops fitting your scale, you will have a hardware AND a software problem.

        • Zen

          You are reaching, at best. Clearly he made a huge improvement on what he had before, with not a lot of work. How about he just monitors his hardware usage and tackles the next bottleneck when it is close to becoming a real issue?

          • http://twitter.com/plukevdh Lukas

            Because then you're either going to have to spend time monitoring your machines more closely or spend time building tools to monitor. And then eventually you STILL have to fix the actual problem. So you've only moved the problem to a later time, when most likely it will be a much harder problem to solve and you will have much less time to do things right. In the end, you've not actually saved yourself any time and in fact added more potential for your product to fail. Doing something “new” may look cool, but 24/48/72 hours of downtime because “caching” broke and suddenly Mongo can't handle the request load that used to be cached is not cool. People have tried these kinds of things before, and the reason so many reliable services use things like Varnish over middleware caching is because they've broken in the past and people decided to implement sane solutions to common problems rather than reinventing the wheel each time. Know your industry. If you do not learn from its history, you will repeat the mistakes of the past.

          • http://twitter.com/plukevdh Lukas

            From this discussion I can only assume that you don't know how easy it is to set up Varnish with ZERO code changes. I'm shocked that you would take the hour or two to write and test something like this when, in the same time, you could set up something just as simple. The argument “we didn't do X because X requires more software in production” is crap, because your changes are changing software on the server. That still takes a deploy. And if your infrastructure is so fragile that adding code to an application vs. setting up an application around your application makes that much of a difference to uptime or stability, your platform is not fault-tolerant enough.

        • http://twitter.com/jeffseibert Jeff Seibert

          Lukas, thanks for the thoughts – we're quite familiar with Redis and well aware of its behavior when it runs out of memory.

          I’d encourage you to check out my presentation at RedisConf a few weeks ago for an overview of our primary Redis utilization (http://vimeo.com/52564618).

          We’re currently running a pre-sharded 4-node cluster on HighMEM 2XL EC2 instances, so we’re in good shape as far as memory is concerned. Utilization is hovering in the single-digit percent range.

          • http://twitter.com/plukevdh Lukas

            That link 404s. Otherwise I will look.

          • http://twitter.com/jeffseibert Jeff Seibert

            Doh, Vimeo moved it. Here: http://vimeo.com/52564618

  • Chris Eigner

    Can you talk about why you chose Redis over memcache? I go back and forth a lot between these two when choosing a caching strategy.

    • http://twitter.com/jeffseibert Jeff Seibert

      Great question – they are both fantastic tools. It really came down to the fact that we already had Redis running in production, so there was little need to add memcached to the mix. You could do the same thing with memcached.

      • Chris Eigner

        Gotchya. Completely understandable and often the reason I choose redis as well.

      • http://twitter.com/blindman2k blindman2k

        1/ Memcache drops old/unused cache items when memory runs low; Redis just fills up.
        2/ When your Memcache dies you lose caching; when your Redis dies you lose everything.
        3/ Memcache clusters slightly better than Redis.
        4/ Sometimes practicalities dictate the choice, but you really should review the choice soon.

  • http://twitter.com/tommychheng tommy chheng

    Could you also use rack-cache and the standard HTTP cache headers? rack-cache lets you use configurable stores, including memcached or Redis.
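
    A minimal sketch of that approach, assuming the rack-cache and redis-rack-cache gems (the store URIs are illustrative):

        # config/application.rb
        config.middleware.use Rack::Cache,
          metastore:   'redis://localhost:6379/0/metastore',
          entitystore: 'redis://localhost:6379/0/entitystore'

        # In a controller, the standard HTTP cache headers drive it:
        expires_in 10.minutes, public: true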

  • Amit Kumar

    Why would you use Redis over Varnish, which is purpose-built as an HTTP cache? It's like choosing to use a fork to cut butter.

    • wedtm

      Why get a knife dirty when you can do a “good enough” job with the fork, saving yourself the knife entirely?

  • http://twitter.com/cheapRoc cheapRoc

    This is definitely inventive and I like the use of Rack… I still feel that traditional HTTP caching is much more battle-tested, as numerous commenters have already stated.

    • http://twitter.com/jeffseibert Jeff Seibert

      This is a great point that I should have mentioned in the original post. There are a number of obscure HTTP caching bugs within the network layers of iOS 3, 4, and 6 that we need to avoid. (iOS 5 seems to have been a good release!) By performing all caching server-side, we can predict our client-side SDK behavior vastly more accurately.

  • Jason Nochlin

    Awesome – implementing features as Rack middleware is a great and very underutilized design tool!

    Curious what your experiences have been so far with your Redis cluster? Have you run into any problems?

    • http://twitter.com/jeffseibert Jeff Seibert

      Hey Jason! Not using Redis Cluster, per se – we're running a pre-sharded master-slave setup, so it's the best approximation there is at the moment. Runs flawlessly – the only material risk is hardware failure.

      • Jason Nochlin

        OK, got it. Looks like “Redis Cluster” is still in development; it will be interesting to see what it turns into.