A few months ago, I migrated an enterprise website from AWS to Heroku, to reduce the maintenance burden of AWS EC2 instances and make the website more reliable. Here’s how that went and what I learned from it.
Planning for failure
Before migrating the servers, I made sure there was an easy way to switch back to the old infrastructure, just in case something went wrong. As we’ll see, this would turn out to be useful.
So, to switch back from the new, potentially buggy infrastructure to the old one, all I had to do was keep the old servers up and update the CNAME record.
Having a low “Time To Live” (TTL) for the CNAME record was important to enable quick reverts.
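As a rough sketch of how the TTL can be checked from a terminal, here is a small helper built on dig. The function names are my own and www.example.com is a placeholder hostname. Keep in mind that a caching resolver reports the remaining TTL, not the configured one.

```shell
# Extract the TTL (in seconds) of the first CNAME record
# from dig's answer section, read from stdin.
# Answer lines look like: <name> <ttl> IN CNAME <target>
parse_cname_ttl() {
  awk '$4 == "CNAME" { print $2; exit }'
}

# Query the CNAME record of a hostname and print its TTL.
cname_ttl() {
  dig +noall +answer "$1" CNAME | parse_cname_ttl
}

# Usage (placeholder hostname):
#   cname_ttl www.example.com
```

The TTL is roughly the worst-case delay before resolvers that honor it pick up a revert, so a value of a few minutes keeps the window short.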
First try and failure to handle the load
Before moving to Heroku, I actually tried using another PaaS we’ll call Cocorico. It seemed very simple to use. No configuration whatsoever. I was very enthusiastic about giving this platform a shot.
In my initial tests, the website I was working on worked nicely. No lag. No crashes. No timeouts. I was off to a great start.
But then my enthusiasm caught up to me. I flipped the DNS switch over to the new infrastructure. Almost instantly, everything started crashing. We had hundreds of timeouts. The website was completely unusable.
Stress started growing. Requests started coming in to ask what was going on. At this point, I thought I could call Cocorico’s support team for help. But they were unable to assist me.
So I looked for possible bottlenecks. That, too, was difficult, because Cocorico provided almost no monitoring tools. That’s when I realised how important monitoring tools are, and how much more thoroughly I should have load tested Cocorico’s infrastructure before putting it on the front line.
I had to come to terms with the fact that moving to Cocorico had been a mistake on my part. So I switched back to the old AWS infrastructure. Instantly, the website started delivering quick responses again.
Learning from this first experience, I now knew I’d need the ability to:
- customize the Nginx & PHP configuration
- easily scale the number of servers up and down
- monitor incoming traffic in case issues arose
Heroku seemed like a perfect fit because it supports all of this (and a bit more). But before blindly trusting Heroku’s feature-set like I had done with Cocorico, I wanted to make sure that, this time around, the new infrastructure was going to work properly under production conditions.
Figuring out what to load test
The website I was migrating had different kinds of workloads: asynchronous tasks run by workers, static content via WordPress, APIs which themselves call out to multiple APIs, etc.
All of these things had to be at least as fast and stable as they were with AWS. But I didn’t really know how to find out what the current load was on the AWS servers.
My good friend Kim was of great help. He gave me a whole bunch of tips about log analysis.
He walked me through AWS monitoring dashboards amongst others. They provide lots of useful graphs that can help understand the number of concurrent requests you must be able to handle.
This is just one example of the data available for an EC2 instance.
Kim also recommended I take a close look at the number of concurrent database connections and transactions, which would help more accurately define what tests were required.
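To give an idea of what this kind of log analysis can look like, here is a minimal sketch that lists the busiest seconds in an access log. It assumes the common "combined" log format, where the fourth field holds the timestamp. The function name and the log path in the usage line are placeholders of mine, not something from the actual setup.

```shell
# Print the N busiest seconds of an access log as "<count> <timestamp>".
# Assumes the combined log format, where field 4 looks like
# [10/Oct/2023:13:55:36 (the leading bracket is stripped).
peak_requests_per_second() {
  log_file=$1
  top_n=${2:-5}
  awk '{ print substr($4, 2) }' "$log_file" \
    | sort | uniq -c | sort -rn | head -n "$top_n" \
    | awk '{ print $1, $2 }'
}

# Usage (placeholder path):
#   peak_requests_per_second /var/log/nginx/access.log 10
```

Requests per second is only a proxy for concurrency, since slow requests overlap more than fast ones, but it gives a baseline to size the load tests against.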
For inspiration, I highly recommend the article about how GitHub tested a change to the way merges are processed. Not everyone needs to go as far as GitHub does in this case, but if GitHub is as stable as it is, it is probably because of all the work that goes into testing changes.
How to load test
After I had a somewhat good idea of what to load test, I ran tests with way more traffic than what would happen in production for the different kinds of workloads that the website runs.
If the servers held up, that would be a good indication that they would also remain available under production traffic.
My colleague Romain pointed me to a list of load testing tools, one of which is the “Apache HTTP server benchmarking tool”, also known as ab. It sends a given number of requests to a given URL, with a predefined number of requests in flight at the same time.
Here is what it looks like for 1000 requests with a concurrency of 200:
ab -n 1000 -c 200 https://mywebsite.herokudns.com/some-url-to-load-test
ab outputs this kind of report:
The load tests of all workloads yielded good results. Under high-load scenarios, increasing the number of servers was painless and did not interrupt live traffic. Everything looked good.
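For repeated runs, a small wrapper can step through several concurrency levels and keep only the two figures I cared about most. This is a sketch under assumptions rather than the exact script I used: the function names and the concurrency levels are illustrative, and the parsing relies on the "Failed requests" and "Requests per second" lines of ab's report.

```shell
# Summarize an ab report read from stdin:
# number of failed requests and mean requests per second.
summarize_ab_report() {
  awk -F: '
    /^Failed requests/     { gsub(/ /, "", $2); failed = $2 }
    /^Requests per second/ { split($2, parts, " "); rps = parts[1] }
    END { print "failed=" failed " rps=" rps }
  '
}

# Run ab against one URL at increasing concurrency levels.
# The levels below are illustrative; -q silences progress messages.
ramp_up() {
  url=$1
  for concurrency in 50 100 200; do
    printf 'concurrency=%s ' "$concurrency"
    ab -q -n 1000 -c "$concurrency" "$url" | summarize_ab_report
  done
}

# Usage:
#   ramp_up https://mywebsite.herokudns.com/some-url-to-load-test
```

Watching how the failure count and throughput evolve as concurrency rises shows where the setup starts to degrade, which a single run cannot tell you.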
So I went ahead and flipped the DNS switch again. We were now in production with the new Heroku-based setup. Now it was time to iron out edge cases through monitoring.
Edge case detection through monitoring
Heroku has a very helpful “Metrics” view, which displays the number of 4xx and 5xx responses. That’s a good first indicator when it comes to the number of failed requests.
The automated alerts that Heroku sends out can be a bit out of place, but they are generally helpful and help avoid downtime. At the slightest rise in failed requests, a notification is sent out, which can be used to take action (increase the number of servers, look for ways to optimize CPU/memory use, etc).
Looking at Google Analytics performance metrics and PageSpeed Insights also helped uncover some bugs.
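The same indicator can be approximated from an access log when no dashboard is at hand. The sketch below is not how Heroku computes its metrics. It assumes a log in the combined format, where the ninth field is the HTTP status code, and the function name and log path are placeholders.

```shell
# Print the percentage of 5xx responses in an access log.
# Assumes the combined log format, where field 9 is the status code.
error_rate() {
  awk '
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END { if (NR > 0) printf "%.1f\n", 100 * errors / NR }
  ' "$1"
}

# Usage (placeholder path):
#   error_rate /var/log/nginx/access.log
```

A sudden jump in this percentage is the kind of signal that would justify adding servers or digging into CPU and memory use.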
Some of the unexpected slowdowns that were discovered were due to:
- Heroku having hard memory limits per request, which broke requests that used more than 32 megabytes of memory,
- Moving from one server to a multi-server setup, which broke file-based caching.
The important lesson for me here was that it took much longer than expected to iron out all of these bugs.
A few weeks in, bugs are getting more infrequent though. The website has become much more reliable than it was on AWS. And giving differentiated access to team members was made easier because of Heroku’s easy-to-use ACL model.