A few months ago, I migrated an enterprise website from AWS to Heroku, to reduce the maintenance burden of AWS EC2 instances and make the website more reliable. Here's how that went and what I learned from it.

Planning for failure

Before migrating the servers, I made sure there was an easy way to switch back to the old infrastructure, just in case something went wrong. As we'll see, this would turn out to be useful.

The website's domain pointed to an AWS ELB via a CNAME record. Heroku also uses CNAME records for custom domains.

So, to switch back from the new, potentially buggy, infrastructure to the old one, all I had to do was keep the old servers up and update the CNAME record.

Having a low "Time To Live" (TTL) for the CNAME record was important to enable quick reverts.

[Image: DNS record TTL]
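
To check the record and its TTL before and after a switch, a quick DNS lookup is enough (the domain here is a placeholder, not the real one):

dig +noall +answer www.example.com CNAME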

First try and failure to handle the load

Before moving to Heroku, I actually tried another PaaS. But that attempt failed miserably: I hadn't load tested properly, and the new infrastructure didn't work correctly for close to 4 hours.

At that point, it was decided to point the CNAME back at AWS until a better solution was found.

Learning from this first experience, I now knew I'd need the ability to:

  • customize the Nginx configuration at the location level
  • tweak php.ini
  • easily scale the number of servers up and down

Heroku seemed like a perfect fit because it supports all of this (and a bit more).
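
For reference, here is roughly what that looks like with Heroku's PHP support; the file and app names below are placeholders rather than the exact production setup. A custom Nginx config is passed to the buildpack from the Procfile:

web: vendor/bin/heroku-php-nginx -C nginx_app.conf web/

php.ini directives can be overridden with a .user.ini file in the document root, and scaling is a single CLI command:

heroku ps:scale web=4 --app my-app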

But before moving production traffic to Heroku, I wanted to make sure that, this time around, the new infrastructure was going to work under production conditions.

Load testing

Figuring out what to load test

The website I was migrating had different kinds of workloads: asynchronous tasks run by workers, static content via WordPress, APIs which themselves call out to multiple other APIs, etc.

I wasn't sure how to go about testing all these things. My good friend Kim was of great help. He gave me a whole bunch of tips about log analysis.

And indeed, his advice was on point: AWS provides lots of useful graphs that can help you understand how many concurrent requests you must be able to handle.

[Image: EC2 monitoring graphs]

Looking at the number of concurrent database connections and transactions also helps define more precisely what tests are required.
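
As a concrete example, assuming a MySQL database (which WordPress implies) and a placeholder hostname, the current connection count is a single query away:

mysql -h mydb.example.com -u admin -p -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"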

Deeper analysis should be done if the website is particularly complex. For inspiration, I recommend the article about how GitHub tested a change to the way merges were processed.

How to load test

Once I had a reasonably good idea of what to load test, the plan was to run tests with far more traffic than production would ever see, for each kind of workload the website runs.

If the servers held up, that would be a good indication that they would also remain available under production traffic.

The "Apache HTTP benchmarking tool" was recommended to me for URL based load testing. It allowed me to run a specific number of requests to a specific URL with a set concurrency:

ab -n 1000 -c 200 https://mywebsite.herokudns.com/some-url-to-load-test

Which gave me this kind of report:

[Image: ab benchmark report]

The load tests yielded good results, so I went ahead and moved production to the new Heroku-based setup.

That wasn't the end of it, though: it was important to make sure error rates stayed low after the infrastructure change.

Keeping an eye on logs

Once the servers were live on Heroku, keeping an eye on logs was key to ironing out the last few bugs that hadn't appeared during testing.

Heroku has a very helpful "Metrics" view, which displays the number of 4xx and 5xx responses. That's a good first indicator of how many requests are failing.

[Image: Heroku logs]
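
For a closer look than the Metrics view offers, the router logs can be filtered for failing requests; the app name here is a placeholder:

heroku logs --tail --app my-app | grep "status=5"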

Looking at Google Analytics performance metrics and PageSpeed Insights also helped track down some bugs.

Some of the unexpected slowdowns that were discovered were due to:

  • Heroku not enabling GZIP by default, which our old configuration did (see the Nginx snippet after this list),
  • fixed memory limits on Heroku, which exposed high memory usage on some routes and led to "out of memory" errors,
  • moving from one server to a multi-server setup, which broke file-based caching.
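
For the GZIP issue, a few directives in the custom Nginx configuration were enough. This is a minimal sketch of the kind of configuration involved, not the exact production file:

gzip on;
gzip_min_length 1024;
gzip_types text/css application/javascript application/json image/svg+xml;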

The important lesson for me here was that it took much longer than expected to iron out all of these bugs.

But, a few weeks in, bugs are becoming less frequent and the website is much more reliable. Where before it had regular 5-minute outages, it hasn't had a single one since the migration.