Mistakes I've Made in AWS


September 10th, 2021

I've been using AWS "professionally" since about 2015. In that time, I've made lots of mistakes.

Other than occasionally deleting production data, the mistakes all arose from ignorance - there's so much to know about AWS that it's easy to miss something important.

Here's a collection of the most commonly missed things when using AWS with Laravel Forge!

Being Clueless about CPU Credits

The first mistake many of us hit is not knowing about the CPU credit system on the T2/T3/T4g instance classes.

Most of us have probably used the T2 or T3 instance classes. They're cheaper!

They're cheaper because they work on a CPU credit system.

Knowing how this works is very important, especially when you're running a database on the same server as your application.

The T3 server class is a newer generation than the T2. You should prefer the T3 class, as it is more performant and generally cheaper.

Each server size in the T2/T3/T4g classes has a CPU threshold. If your CPU usage goes above that threshold, you start using CPU credits. If the credits reach zero, the server is capped at the threshold. If CPU usage is under the threshold, CPU credits are gained (up to a certain number).

For example, the t3.small server size gains 24 credits per hour for a max of 576 credits. However, if you go above 20% CPU usage, you start using CPU credits. If your credits go down to zero, the server is capped at 20% CPU.
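To make the t3.small numbers concrete, here is a rough sketch of the credit arithmetic. It assumes AWS's definition that one CPU credit is one vCPU running at 100% for one minute, and that a t3.small has 2 vCPUs. The function names are only illustrative, and the model ignores Unlimited Mode.

```python
# Rough model of CPU credits on a t3.small (illustrative names, not an AWS API).
# Assumption: 1 CPU credit = 1 vCPU at 100% utilization for 1 minute.

VCPUS = 2                      # a t3.small has 2 vCPUs
CREDITS_EARNED_PER_HOUR = 24   # from the example above
MAX_CREDITS = 576              # from the example above


def credits_spent_per_hour(cpu_percent: float) -> float:
    """Credits burned per hour at an average CPU usage given in percent (0-100)."""
    return cpu_percent * VCPUS * 60 / 100


def hours_until_empty(cpu_percent: float, balance: float = MAX_CREDITS) -> float:
    """Hours until the balance reaches zero; infinity at or below the threshold."""
    net_drain = credits_spent_per_hour(cpu_percent) - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        return float("inf")
    return balance / net_drain


if __name__ == "__main__":
    for percent in (10, 20, 50, 100):
        print(f"{percent}% CPU: {hours_until_empty(percent)} hours of credits")
```

Under this model, a t3.small that starts with a full balance lasts about 16 hours at a sustained 50% CPU and about 6 hours at 100%. At 20% it spends exactly the 24 credits per hour it earns, which is why that is the threshold.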

T3 and T4g instances come with a feature called "Unlimited Mode". This is enabled by default. If you have no CPU credits remaining, your CPU is allowed to go above the threshold at additional cost.

Luckily, this cost is generally low, so you may not even notice the increase on your bill. However, it's still best not to leave your server running at zero CPU credits.

You can monitor your CPU credits and burst usage in the CloudWatch metrics for your instances, or within the EC2 control panel under the "Monitoring" tab for any given server instance.
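If you'd rather check this from a script than the console, something like the following sketch could work. It assumes boto3 is installed and AWS credentials and a region are configured. The helper names and the instance ID are hypothetical; `CPUCreditBalance` is the CloudWatch metric name in the `AWS/EC2` namespace.

```python
# Sketch: fetch the recent CPUCreditBalance for one instance from CloudWatch.
# Assumes boto3 is installed and AWS credentials/region are configured.
from datetime import datetime, timedelta, timezone


def credit_metric_query(instance_id: str,
                        metric_name: str = "CPUCreditBalance",
                        hours: int = 3) -> dict:
    """Build the arguments for CloudWatch's get_metric_statistics call."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,             # seconds, i.e. 5-minute datapoints
        "Statistics": ["Average"],
    }


def print_credit_metric(instance_id: str) -> None:
    """Print the metric's datapoints, oldest first."""
    import boto3  # imported here so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**credit_metric_query(instance_id))
    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"].isoformat(), round(point["Average"], 1))


# Example call (hypothetical instance ID):
# print_credit_metric("i-0123456789abcdef0")
```

Passing `metric_name="CPUSurplusCreditsCharged"` to the query builder should show the credits you were billed for in Unlimited Mode.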

Not Using Cheaper Servers

Have you ever wondered what the T3a instances are? The "a" is for AMD. While T3 instances use Intel CPUs, T3a instances use AMD CPUs. Technically they are a smidgen slower than Intel for certain workloads. However, for web applications (PHP, MySQL, etc), this usually doesn't make a difference.

Use the cheaper instance type and save ~10% on costs!

There's also the newest T4g instance type. These use ARM processors, and are the cheapest. The CPUs are fast, but ARM CPUs are cheaper in general - and so AWS passes those savings on to you. These work great on Forge and I highly recommend them.

Ignoring IOPS

The next, more pernicious mistake is not checking IOPS usage. Hitting disk volume limits is not at all obvious when it happens, but it can lead to performance issues.

Similar to CPU bursting, hitting disk limits is easy when running your database on the same server as your application.

Your EC2 servers have Volumes attached to them (at least one, maybe more). These disk volumes are probably either gp2 or gp3 types.

Despite the newer gp3 volumes being (generally) better, they are not yet the default volume type.

Both volume types have a maximum number of IOPS and Throughput.

  1. IOPS are IO Operations Per Second
  2. Throughput is data measured in Megabytes per Second (MB/s)

The brief overview below covers how they work, but there are more details on how IOPS work at CloudCasts.

GP2 Volumes

GP2 volumes give you 3 IOPS per GB of storage. There's a minimum of 100 IOPS - you get additional IOPS when you provision over 33.333 GB.

GP2 throughput caps out at 250 MB/s, but the calculation for what your disk gets is complex.

GP2 IOPS work on a burst system similar to T3 instances. You can burst up to 3000 IOPS until you run out of credits. Once you run out of credits, you are capped at the baseline IOPS determined by the size of the volume.

Once you reach 1000 GB of storage, your baseline is 3000 IOPS and there's no more bursting available. The max IOPS you can get is 16,000 at a pricey ~5334 GB of storage.
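The gp2 numbers above can be sketched as a small calculation. The helper names are illustrative. The burst bucket size of 5.4 million I/O credits, refilled at the volume's baseline rate, is an assumption taken from AWS's EBS documentation rather than from this article.

```python
# Sketch of the gp2 arithmetic (illustrative helper names, not an AWS API).
# Assumption from AWS's EBS docs: the burst bucket holds 5.4 million I/O
# credits and refills at the volume's baseline IOPS rate.

GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
GP2_BURST_BUCKET = 5_400_000


def gp2_baseline_iops(size_gb: int) -> int:
    """3 IOPS per GB, with a floor of 100 and a cap of 16,000."""
    return max(GP2_MIN_IOPS, min(3 * size_gb, GP2_MAX_IOPS))


def gp2_full_burst_minutes(size_gb: int) -> float:
    """Minutes a full bucket lasts at a sustained 3,000 IOPS."""
    baseline = gp2_baseline_iops(size_gb)
    if baseline >= GP2_BURST_IOPS:
        return float("inf")  # baseline already meets the burst rate
    return GP2_BURST_BUCKET / (GP2_BURST_IOPS - baseline) / 60


if __name__ == "__main__":
    for size in (20, 100, 500, 1000, 5334):
        print(size, "GB:", gp2_baseline_iops(size), "baseline IOPS")
```

For example, a 100 GB volume has a baseline of 300 IOPS and, under these assumptions, can sustain 3,000 IOPS for roughly 33 minutes before falling back to that baseline.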

You can check your CloudWatch metrics for each volume to see IOPS usage and Queue Depth (higher is bad, Queue Depth should be really low, below 1 or 2).

GP3 Volumes

GP3 volumes have a set number of IOPS and throughput. They start at 3000 IOPS and 125 MB/s. This is generally better than gp2 volumes, especially for smaller disk sizes (which most of us likely use).

There's no credit system for IOPS - you can use up to the amount provisioned for the volume. You can provision more IOPS and throughput separately as needed!

GP3 volumes should generally be your default volume type. However, you should be aware that RDS databases only support gp2 volumes currently.

The exception is Aurora databases, which have essentially unlimited IOPS due to how the database is architected.

Metrics to Watch

To know if you're using your disk too much, you can watch the following metrics on any given volume. These are found within CloudWatch Metrics or within the "Monitoring" tab when clicking on a Volume in the EC2 web console.

  1. Burst Balance - if this reaches zero, your volumes will be slow, as you can no longer burst up to 3000 IOPS. This only applies to gp2 volumes under 1000 GB in size.
  2. Queue Depth - This is the count of pending IO operations. If you have pending operations, it's a bad sign. This number should generally be very low, below 1 or 2 (although it depends on your usage).
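As a rough illustration of how you might act on these two metrics, the sketch below flags worrying values. In CloudWatch they appear in the `AWS/EBS` namespace as `BurstBalance` (a percentage) and `VolumeQueueLength`. The function name and the 20% warning level are assumptions of mine. The queue depth limit follows the "below 1 or 2" guidance above.

```python
# Sketch: flag worrying values for the two volume metrics.
# Thresholds are assumptions; tune them to your own workload.
from typing import List, Optional

LOW_BURST_BALANCE_PERCENT = 20.0   # hypothetical warning level
HIGH_QUEUE_DEPTH = 2.0             # "below 1 or 2" is considered healthy


def volume_warnings(burst_balance: Optional[float],
                    queue_depth: float) -> List[str]:
    """Return human-readable warnings for one volume.

    burst_balance is the BurstBalance percentage, or None for volumes
    that have no burst bucket (gp3, or gp2 at 1000 GB and above).
    queue_depth is the average VolumeQueueLength.
    """
    warnings = []
    if burst_balance is not None and burst_balance <= LOW_BURST_BALANCE_PERCENT:
        warnings.append(f"Burst balance is low ({burst_balance:.0f}%)")
    if queue_depth > HIGH_QUEUE_DEPTH:
        warnings.append(f"Queue depth is high ({queue_depth:.1f})")
    return warnings


if __name__ == "__main__":
    print(volume_warnings(burst_balance=5.0, queue_depth=4.2))
```

A check like this could run on a schedule against recent datapoints, although a CloudWatch alarm on the same metrics would do the job without any code.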

Chris Fidao

Teaching coding and servers at CloudCasts and Servers for Hackers. Co-founder of Chipper CI.