After choosing the tech stack for DM, the next step was to figure out hosting. A standard bit of wisdom for new web startups is to use Heroku instead of building and maintaining your own infrastructure. The argument is that, despite its higher cost relative to other hosting solutions, Heroku will still be substantially cheaper than the time you (or someone you hire) spend managing infrastructure.
Choosing Heroku is a polarizing decision. There are claims that Heroku is too expensive, too buggy, or no longer innovative. Our decision wasn’t really about any of those. We needed the flexibility to take our product in any direction, and for that reason, we chose AWS. (Google Cloud Platform was a close second choice.)
Despite choosing AWS, I still wanted the high development velocity you get when using Heroku. Pushing to the `main` branch on GitHub should run tests and deploy the site to a staging environment. Then, if everything looks good, the build is manually promoted to production. Importantly, all of this should happen within minutes. I was able to recreate this process using AWS Fargate and GitHub Actions and, so far, it’s been great.
The main goals of this architecture were to keep everything simple and minimize moving parts as much as possible. Maybe this is laughable since even the simplest of AWS infrastructure is a web of tiny configurations, but I think we managed to do it.
We’re using GitHub Actions to build a Docker image for every push to our project’s `main` branch. If the build succeeds, we push it to Elastic Container Registry (ECR), update the current task definition on Elastic Container Service (ECS) with that image, and tell the `staging` service to update itself. ECS then does a rolling deployment where old instances are kept around until the new instances are healthy. (Staging has only one instance, but production has two or more.)
Once the staging deployment finishes, I do some manual testing on the staging site, and if I’m satisfied, I run a small script to update the production task definition with the same image and update the `prod` cluster. Once the production deployment finishes, the changes are live. The entire process takes less than 13 minutes, with most of the time spent on GitHub building the Docker image.
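The "same image" promotion step can be sketched as a pure transformation on task definition JSON. This is a minimal sketch, not our actual script: the `TaskDefinition` shape is trimmed to the two fields it touches, and the function names `withImage` and `promote` are made up for illustration.

```typescript
// Trimmed-down task definition: only the fields this sketch needs.
interface ContainerDefinition {
  name: string;
  image: string;
}

interface TaskDefinition {
  family: string;
  containerDefinitions: ContainerDefinition[];
}

// Return a copy of the task definition with one container's image replaced.
// Throws when the container name is missing so a typo fails loudly.
function withImage(
  taskDef: TaskDefinition,
  containerName: string,
  image: string
): TaskDefinition {
  const exists = taskDef.containerDefinitions.some(
    (c) => c.name === containerName
  );
  if (!exists) {
    throw new Error(`container "${containerName}" not in ${taskDef.family}`);
  }
  const containerDefinitions = taskDef.containerDefinitions.map((c) =>
    c.name === containerName ? { ...c, image } : c
  );
  return { ...taskDef, containerDefinitions };
}

// Promotion: take the image staging is running and render it into the
// production task definition, leaving both inputs untouched.
function promote(
  staging: TaskDefinition,
  prod: TaskDefinition,
  containerName: string
): TaskDefinition {
  const matches = staging.containerDefinitions.filter(
    (c) => c.name === containerName
  );
  const source = matches.length > 0 ? matches[0] : undefined;
  if (!source) {
    throw new Error(`container "${containerName}" not in ${staging.family}`);
  }
  return withImage(prod, containerName, source.image);
}
```

A real script would then register the rendered definition and point the service at it, for example with `aws ecs register-task-definition` followed by `aws ecs update-service`.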
We have a single Postgres database on RDS. When the application container starts, any new migrations get applied using Knex.js. The ECS cluster runs behind an Application Load Balancer, which is behind a CloudFront proxy that caches all content-addressed static assets generated by Next.js.
We monitor deploy progress through ECS status notifications in Slack, sent by a small Lambda function with a little SNS and CloudWatch plumbing. We use another small Lambda function to post CloudWatch alerts to Slack, such as when CloudFront serves a 500 error. Since all logs go to CloudWatch, reading logs is easy with the `aws logs tail` command. And if you need a shell in a production container to debug something, you can use ECS Exec.
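The core of the deploy notification function is turning an ECS event into a Slack message. Here is a rough sketch of that step. It assumes the Lambda receives "ECS Deployment State Change" events wrapped in an SNS envelope and that Slack gets a simple `{ text }` webhook payload. The interfaces are trimmed to the fields used, and the label wording and function names are invented for this example.

```typescript
// Subset of an "ECS Deployment State Change" event (assumed input shape).
interface EcsDeploymentEvent {
  resources: string[]; // service ARNs
  detail: {
    eventName: string; // e.g. SERVICE_DEPLOYMENT_COMPLETED
    deploymentId: string;
    reason: string;
  };
}

// Subset of the SNS envelope a Lambda subscriber receives.
interface SnsEvent {
  Records: { Sns: { Message: string } }[];
}

interface SlackMessage {
  text: string;
}

// Friendlier wording for the deployment event names.
const LABELS: { [eventName: string]: string } = {
  SERVICE_DEPLOYMENT_IN_PROGRESS: "in progress",
  SERVICE_DEPLOYMENT_COMPLETED: "completed",
  SERVICE_DEPLOYMENT_FAILED: "FAILED",
};

function formatDeployment(e: EcsDeploymentEvent): SlackMessage {
  // Pull "cluster/service" out of an ARN like
  // arn:aws:ecs:region:account:service/cluster/service
  const arn = e.resources.length > 0 ? String(e.resources[0]) : "";
  const marker = "service/";
  const idx = arn.indexOf(marker);
  const service = idx >= 0 ? arn.slice(idx + marker.length) : "unknown service";
  const label = LABELS[e.detail.eventName] || e.detail.eventName;
  return { text: `[${service}] deployment ${label}: ${e.detail.reason}` };
}

// One Slack message per SNS record; posting to the webhook is left out.
function slackMessagesFromSns(event: SnsEvent): SlackMessage[] {
  return event.Records.map((r) =>
    formatDeployment(JSON.parse(r.Sns.Message) as EcsDeploymentEvent)
  );
}
```

Keeping the formatting separate from the HTTP call makes this part easy to test without touching Slack or AWS.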
There are a few dozen ways to host your application on AWS, but the reason we chose ECS Fargate is that it’s the easiest way on AWS to run a web application container without worrying about the guts of EC2, and it gives us the option of auto-scaling later. Here’s a list of steps I used to create our environment — though I’m sure someone will happily point out alternatives:
- I created a new Task Definition in ECS called `staging`, chose Fargate, and kept the rest of the default options. I chose a memory and vCPU configuration that seemed suitable for our application.
- I added a container to the task definition and gave it a name, `app`, which is probably too generic and will bite us later. (Sorry, future me.)
- I added the correct TCP port number to forward from the container and all required environment variables. For sensitive env vars such as API keys and the Postgres URL, I used the `ValueFrom` option and an ARN for each value from a `SecureString` entry in the Parameter Store of AWS Systems Manager.
- I used a tip to speed up deployments substantially by setting the container start timeout to 600s and stop timeout to 2s.
- I finished creating the task definition and repeated the process for production.
- I created two clusters, `staging` and `prod`, on Elastic Container Service (ECS).
- Inside each cluster, I created a new service called `app` using the appropriate task definition for each environment. (A minimum of 2 tasks for production seemed like a good idea in case one explodes.)
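For readers who prefer JSON to console clicks, the task definition steps above roughly correspond to the following shape. Treat it as a sketch under assumptions: the account ID, region, image URI, parameter name, and CPU/memory sizes are placeholders, `buildTaskDefinition` is a helper invented for this example, and the timeout values are simply the ones mentioned in the steps above.

```typescript
// Hypothetical options for this sketch; the names are not from our setup.
interface AppTaskOptions {
  family: string;
  image: string;
  containerPort: number;
  cpu: string;
  memory: string;
  environment: { [name: string]: string };
  secretArns: { [name: string]: string }; // env var name -> Parameter Store ARN
  startTimeout: number;
  stopTimeout: number;
}

function buildTaskDefinition(o: AppTaskOptions) {
  // Plain environment variables are stored inline in the definition.
  const environment = Object.keys(o.environment).map((name) => ({
    name,
    value: String(o.environment[name]),
  }));
  // Secrets are referenced by ARN and resolved when the task starts.
  const secrets = Object.keys(o.secretArns).map((name) => ({
    name,
    valueFrom: String(o.secretArns[name]),
  }));
  return {
    family: o.family,
    requiresCompatibilities: ["FARGATE"],
    networkMode: "awsvpc",
    cpu: o.cpu,
    memory: o.memory,
    containerDefinitions: [
      {
        name: "app",
        image: o.image,
        essential: true,
        portMappings: [{ containerPort: o.containerPort, protocol: "tcp" }],
        environment,
        secrets,
        startTimeout: o.startTimeout,
        stopTimeout: o.stopTimeout,
      },
    ],
  };
}

// Placeholder values throughout; only the port and timeouts echo the post.
const stagingTask = buildTaskDefinition({
  family: "staging",
  image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
  containerPort: 3000,
  cpu: "512",
  memory: "1024",
  environment: { NODE_ENV: "production" },
  secretArns: {
    DATABASE_URL:
      "arn:aws:ssm:us-east-1:123456789012:parameter/staging/DATABASE_URL",
  },
  startTimeout: 600,
  stopTimeout: 2,
});
```

A real definition also needs a task execution role that is allowed to read those parameters, which this sketch leaves out.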
DNS, Load Balancing & Caching
Each Fargate cluster needs a load balancer, so I created an Application Load Balancer for each environment and assigned them the load balancer security group (described later). Each load balancer forwards HTTP and HTTPS to its appropriate cluster, and HTTPS is assigned a certificate from AWS Certificate Manager. The certificate isn’t strictly necessary, since CloudFront handles SSL termination, but it has stayed there because I initially set everything up without CloudFront. Also, I wanted to keep the possibility of turning CloudFront off in case there were problems with caching.
CloudFront is set up as a reverse proxy for staging and production. Each CloudFront distribution has a single origin that is the load balancer for that environment, and each one has two behaviors enabled: a `Managed-CachingOptimized` policy for the path `/_next/static/*` so that all content-addressed static assets get cached, and a `Managed-AllViewer` policy as the default so that all other requests go right to the load balancer.
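To make the two behaviors concrete, here is a small model of how a request path ends up with one policy or the other. It is only an illustration: the policy names are the ones from the paragraph above, `policyFor` is a made-up helper, and the matcher handles just a trailing `*`, which is a simplification of CloudFront's real path pattern matching.

```typescript
interface Behavior {
  pathPattern: string;
  policy: string;
}

// Path-specific behaviors are checked in order before the default applies.
const behaviors: Behavior[] = [
  { pathPattern: "/_next/static/*", policy: "Managed-CachingOptimized" },
];
const defaultPolicy = "Managed-AllViewer";

// Simplified matcher: supports an exact path or a single trailing "*".
function matchesPattern(pattern: string, path: string): boolean {
  if (pattern.charAt(pattern.length - 1) === "*") {
    const prefix = pattern.slice(0, -1);
    return path.slice(0, prefix.length) === prefix;
  }
  return pattern === path;
}

function policyFor(path: string): string {
  for (const b of behaviors) {
    if (matchesPattern(b.pathPattern, path)) {
      return b.policy;
    }
  }
  return defaultPolicy;
}
```

With this model, a hashed asset such as `/_next/static/chunks/main-abc123.js` gets the caching policy, while a path like `/api/users` falls through to the default and goes to the load balancer.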
DNS for `dm.app` is handled through Route 53, which is easy. The domain `dm.app` is an alias to the production CloudFront distribution.
I was surprised to learn that our Fargate task instances were initially publicly accessible, so I needed to create at least two security groups. Unfortunately, I can’t remember exactly when in the setup process I created them, but I’m mentioning them here for completeness. There are a few:
- A group for the EC2 Application Load Balancers, which allows inbound traffic on TCP ports 80 and 443 for HTTP and HTTPS, respectively.
- A group for the container, which allows inbound traffic on our application’s TCP port (3000) from the load balancer security group.
- A group for a single, hardened EC2 `t2.micro` SSH-only bastion server so that we can proxy through it to access the production and staging databases using local tools like Postico.
- A group for our RDS Postgres database that limits inbound access to our application containers and the bastion server.
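The way these groups reference each other is easier to see as data. The sketch below models the four groups and checks which group may reach which port. Several details are assumptions and not taken from our actual setup: the group names, the `0.0.0.0/0` source on the load balancer, the placeholder CIDR for SSH access, and the default Postgres port 5432.

```typescript
// A rule's source is either an IP range or another security group.
type Source = { cidr: string } | { group: string };

interface IngressRule {
  port: number;
  source: Source;
}

interface SecurityGroup {
  name: string;
  ingress: IngressRule[];
}

const loadBalancerGroup: SecurityGroup = {
  name: "load-balancer",
  ingress: [
    { port: 80, source: { cidr: "0.0.0.0/0" } }, // assumed open to the internet
    { port: 443, source: { cidr: "0.0.0.0/0" } },
  ],
};

const containerGroup: SecurityGroup = {
  name: "container",
  ingress: [{ port: 3000, source: { group: "load-balancer" } }],
};

const bastionGroup: SecurityGroup = {
  name: "bastion",
  // Placeholder range; a real rule would list trusted addresses only.
  ingress: [{ port: 22, source: { cidr: "203.0.113.0/24" } }],
};

const databaseGroup: SecurityGroup = {
  name: "database",
  ingress: [
    { port: 5432, source: { group: "container" } }, // assumed Postgres port
    { port: 5432, source: { group: "bastion" } },
  ],
};

// True when `target` accepts traffic on `port` from members of `fromGroup`.
function allowsFromGroup(
  target: SecurityGroup,
  port: number,
  fromGroup: string
): boolean {
  return target.ingress.some(
    (r) => r.port === port && "group" in r.source && r.source.group === fromGroup
  );
}
```

In this model the database accepts connections from the containers and the bastion but not from the load balancer, and the containers accept traffic only from the load balancer group.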
The CI/CD Pipeline
A build-and-deploy process using Docker and GitHub Actions was easy to set up, but it took us a lot of tweaking to find the correct cache settings. You can view the two GitHub Actions files we’ve created here. One builds and deploys the application when the `main` branch is updated, and the other only builds the application when there’s a pull request made against the `main` branch. This allows us to get that lovely green checkmark on pull requests indicating that the build and tests were successful. We had a few snags, such as adding `libc6-compat` to our Alpine Node.js image and giving Node.js more memory during Next.js builds with `NODE_OPTIONS="--max-old-space-size=8192"`. We still have to update our `docker/build-push-action` cache settings from time to time when the build starts failing.
Logging and Alerts
As mentioned before, all production logs are easily accessible using the CloudWatch interface or using the `aws` command line tool, such as `aws logs tail /ecs/prod/app --follow --format short --since 1d`.
For client-side errors we’re using Sentry, but I couldn’t get Sentry working on the backend after multiple attempts. So instead, I set up CloudWatch subscription filters on the two log groups that send any log line matching `failed` to a small Lambda function that posts them in a dedicated alerting channel on Slack. This has worked well for catching errors and notifying us.
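The formatting half of that alerting function might look something like this. It is a sketch with some assumptions: it starts from the already-decoded subscription payload (CloudWatch Logs delivers it base64-encoded and gzipped, and that decoding step is omitted here), the interface is trimmed to the fields used, and the truncation limit and function name are invented for the example.

```typescript
// Subset of a decoded CloudWatch Logs subscription payload.
interface LogsPayload {
  logGroup: string;
  logStream: string;
  logEvents: { id: string; timestamp: number; message: string }[];
}

interface SlackMessage {
  text: string;
}

// One Slack message per matched log line, with long lines truncated.
// The subscription filter has already selected lines matching the pattern,
// so no additional filtering happens here.
function alertsFromLogs(payload: LogsPayload, maxLength: number): SlackMessage[] {
  return payload.logEvents.map((e) => {
    const when = new Date(e.timestamp).toISOString();
    const line =
      e.message.length > maxLength
        ? e.message.slice(0, maxLength) + "..."
        : e.message;
    return { text: `[${payload.logGroup}] ${when} ${line}` };
  });
}
```

Including the log group in each message makes it obvious in Slack whether an error came from staging or production.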
Was this more work than using Heroku? Absolutely. And it will probably be a while before the time we invested in building this pays for itself. However, I’m confident that this solution gives us the flexibility we need no matter where the product takes us.
I still have a TODO list of tasks, such as creating separate VPCs for each environment, enabling auto-scaling, and spreading the service across multiple availability zones. If you have thoughts, I’d love to hear them!