AWS S3 503 Slow Down Exceptions

Sun, Mar 26, 2023

Read in 4 minutes

Overview

Recently, we have been experiencing S3 503 Slow Down exceptions in our Lambda logs. We read and write S3 objects from Lambda, and it is during these GET/PUT requests that we encounter the error. In this blog post, we will delve into the reasons behind this S3 503 exception and explore ways to resolve it.

Why did we get a 503 exception?

A 503 is an HTTP status code returned when the requested service is currently unavailable.

One common cause is a rapid increase in the client application's request rate to an API or service.

In our case, we received 503 errors because we sent more than 20,000 requests per second for a specific prefix in S3. S3 attempted to handle this request rate by partitioning the prefix further, but each partition takes 30 to 60 minutes to complete. During that time, S3 returns a "503 Slow Down" exception.
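When S3 throttles like this, the standard remedy is to retry with exponential backoff. Below is a minimal sketch (not our exact handler) of how a Lambda could detect the Slow Down error code with boto3; the bucket and key names are hypothetical:

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def get_with_backoff(bucket, key, max_attempts=5):
    # Retry GETs that fail with S3's "SlowDown" (HTTP 503) error code,
    # doubling the wait between attempts: 1s, 2s, 4s, ...
    for attempt in range(max_attempts):
        try:
            return s3.get_object(Bucket=bucket, Key=key)
        except ClientError as err:
            if err.response["Error"]["Code"] != "SlowDown":
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("still throttled after {} attempts".format(max_attempts))

Backoff smooths out transient throttling, but it cannot fix a sustained hot prefix, which is why we also had to change our key naming (see below).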

What is an S3 prefix?

In simple terms, an S3 prefix is the path portion of an S3 object key. For example:

Take TestBucket/Photos/photo1.jpg: here TestBucket is the bucket, “Photos/” is the prefix, and “Photos/photo1.jpg” is the object key.
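One way to see a prefix in action is to list objects by prefix. A small boto3 sketch, using the example bucket above:

import boto3

s3 = boto3.client("s3")

# List every object whose key starts with "Photos/", i.e. everything
# "under" that prefix in the bucket.
response = s3.list_objects_v2(Bucket="TestBucket", Prefix="Photos/")
for obj in response.get("Contents", []):
    print(obj["Key"])  # e.g. Photos/photo1.jpg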

What is a partition, and how does S3 automatically scale the request rate?

A partition is an AWS S3 mechanism to distribute performance across a common set of object key names. In the absence of any shared prefixes, all objects share a single partition based on the bucket name. Performance scaling is adaptive: as traffic rates increase or decrease, S3 automatically adjusts. S3 monitors the request rates on objects sharing common prefixes and scales its data partitioning to accommodate higher traffic. Before automatically splitting the performance within a prefix, S3 needs some time to make sure the increase is not just a short spike; this typically takes 30 to 60 minutes of monitoring the request traffic.

In the example below, the initial partition is the bucket name, TestAWSbucket. If the request load on the bucket TestAWSbucket stays above the per-partition maximums for 30 to 60 minutes, S3 automatically creates another partition; for example purposes, we will assume it splits at the shared prefix TestAWSbucket/Log. Note that during this time, customers may receive 503 errors indicating that S3 is in the process of provisioning more resources for them.

BucketName/Prefix:

TestAWSbucket/LogFiles/

TestAWSbucket/LogErrors/

TestAWSbucket/…

Initial partition - TestAWSbucket

Next partition - TestAWSbucket/Log and so on.

If the request load on the prefix “Log” stays above the per-partition maximums for 30 to 60 minutes, S3 automatically creates another partition for “Log”. Once this is done, two partitions exist, each receiving its own per-partition request rate: one partition for “Log” and one partition for all other objects, including all objects at the root level of the bucket.

How was our S3 prefix partitioned?

TestBucket/AA_CODE_0001

TestBucket/AA_CODE_0002

TestBucket/AA_CODE_0003

Here, AA_CODE is common to all our objects.

So, our first partition was TestBucket.

When S3 saw the increased request rate, it created a second partition: AA_CODE. Since all our objects start with AA_CODE, that single partition held practically all our objects, and the S3 requests from our Lambda climbed to 20,000 per second for that one prefix. So we got Slow Down exceptions on our reads and writes to S3.
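A quick illustration of why this is a problem: every one of our keys shares the same leading characters, so they all collapse onto a single hot prefix.

import os

keys = ["AA_CODE_0001", "AA_CODE_0002", "AA_CODE_0003"]

# All keys share the same leading characters, so after S3 splits the
# bucket-level partition, the traffic still lands on one partition.
print(os.path.commonprefix(keys))  # -> AA_CODE_000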

How are read/write (GET/PUT) S3 requests related to S3 prefix partitions?

As per the AWS documentation (https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html), an application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned prefix.
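To put our numbers in context: at 20,000 requests per second against a single partitioned prefix, we were roughly 3.6 times over the GET limit (20,000 / 5,500 ≈ 3.6) and 5.7 times over the PUT limit (20,000 / 3,500 ≈ 5.7). Because the limits apply per prefix, the same traffic spread evenly across, say, ten distinct prefixes would fit comfortably: 10 × 5,500 = 55,000 GET/s and 10 × 3,500 = 35,000 PUT/s.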

What solution did we implement to solve the 503 error?

We reversed our object key names, from AA_CODE_0001 to 1000_EDOC_AA, so that each object's leading characters are distinct and the S3 request rate for any single prefix stays low.
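The change itself is a simple string reversal, so the high-cardinality counter leads the key. A minimal sketch of the idea (the helper and bucket name are hypothetical, not our production code):

import boto3

s3 = boto3.client("s3")

def reversed_key(key):
    # Reverse the key so neighbouring objects no longer share
    # one hot leading prefix.
    return key[::-1]

def put_reversed(bucket, key, body):
    s3.put_object(Bucket=bucket, Key=reversed_key(key), Body=body)

print(reversed_key("AA_CODE_0001"))  # -> 1000_EDOC_AA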

What happened when we deployed with the modified prefix?

For the first half hour, S3 threw “Slow Down” exceptions because it was still learning the new prefixes and creating partitions for them, so our requests were throttled. After that, the S3 throttling errors subsided.

Changing an S3 prefix is like changing a schema, so existing buckets (buckets already filled with objects) need extra care when changing the prefix.