Introduction

S3 multipart uploads fail in production for several reasons: network interruptions during part uploads, part checksum mismatches, timeouts on large parts, or S3 throttling when many parts are uploaded concurrently. Unlike a single PUT upload, a multipart upload requires managing upload state across multiple HTTP requests. If any part fails, the upload must either be retried from that part or aborted and restarted. Without proper retry configuration, boto3 gives up after the default retry limit, leaving incomplete multipart uploads that consume S3 storage and incur costs indefinitely.

Symptoms

```bash
botocore.exceptions.ClientError: An error occurred (RequestTimeout) when calling the UploadPart operation
  (reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period.
```

Or:

```bash
botocore.exceptions.ClientError: An error occurred (SlowDown) when calling the UploadPart operation
  (reached max retries: 4): Please reduce your request rate.
```

Orphaned multipart uploads detected by lifecycle policy:

```bash
$ aws s3api list-multipart-uploads --bucket my-bucket
{
    "Uploads": [
        {
            "Key": "data/export-2024-03-15.csv",
            "UploadId": "abc123",
            "Initiated": "2024-03-15T10:00:00.000Z",
            "StorageClass": "STANDARD"
        }
    ]
}
```

Common Causes

  • Default retry configuration too conservative: botocore's legacy retry mode makes few attempts and retries a narrower set of error codes than the standard or adaptive modes
  • Part size too large: 100MB+ parts take too long to upload and hit the socket timeout
  • Too many concurrent part uploads: Exceeding S3 per-prefix rate limits triggers SlowDown errors
  • Incomplete upload not aborted: Failed uploads are never cleaned up, accumulating storage costs
  • Network instability on EC2: Instance network throughput fluctuation causes part upload timeouts
  • Missing idempotency: retries re-upload the same file under a newly generated key, creating duplicate objects
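The part-size and concurrency causes above interact with hard S3 limits: an upload may have at most 10,000 parts, and every part except the last must be at least 5 MiB. A quick sanity check, written as a standalone sketch (the helper names are illustrative, not a boto3 API):

```python
MAX_PARTS = 10_000                 # S3 hard limit on parts per upload
MIN_PART_SIZE = 5 * 1024 * 1024    # 5 MiB minimum for all parts except the last

def part_count(file_size: int, chunk_size: int) -> int:
    """Number of parts a multipart upload will use (ceiling division)."""
    return -(-file_size // chunk_size)

def validate_chunk_size(file_size: int, chunk_size: int) -> None:
    """Raise if the chosen chunk size violates S3 multipart limits."""
    if chunk_size < MIN_PART_SIZE:
        raise ValueError(f"chunk size {chunk_size} is below the 5 MiB minimum")
    parts = part_count(file_size, chunk_size)
    if parts > MAX_PARTS:
        raise ValueError(
            f"{parts} parts exceeds the 10,000-part limit; "
            "increase multipart_chunksize"
        )
```

For example, a 5 GiB file with 25 MiB parts uses 205 parts, comfortably under the limit, while 25 MiB parts on a 300 GiB file would exceed it.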

Step-by-Step Fix

Step 1: Use TransferConfig with proper retry and part settings

```python
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# Configure botocore retries with custom policy
config = Config(
    retries={
        "max_attempts": 10,
        "mode": "adaptive",  # "standard" or "adaptive"
    }
)

s3_client = boto3.client("s3", config=config)

# Configure multipart transfer
transfer_config = TransferConfig(
    multipart_threshold=50 * 1024 * 1024,   # 50MB - use multipart above this
    multipart_chunksize=25 * 1024 * 1024,   # 25MB per part (smaller = more resumable)
    max_concurrency=10,                     # Concurrent part uploads
    use_threads=True,
)
```

The adaptive retry mode adds client-side throttling in addition to exponential backoff, which is essential for S3 SlowDown responses.
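For intuition, both standard and adaptive modes sleep between attempts using exponential backoff with random jitter, capped at around 20 seconds. A simplified "full jitter" model of that behavior (not botocore's exact implementation) looks like:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 20.0) -> float:
    """Simplified full-jitter exponential backoff: a random delay in
    [0, min(cap, base * 2**attempt)]. Botocore caps its delays similarly."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Jitter matters here: when many workers hit SlowDown at once, randomized delays spread their retries out instead of letting them retry in lockstep and get throttled again.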

Step 2: Implement resumable upload with abort cleanup

```python
import logging

logger = logging.getLogger(__name__)

def upload_file_with_cleanup(bucket, key, filepath, transfer_config):
    """Upload file and abort multipart upload on failure."""
    try:
        s3_client.upload_file(
            Filename=filepath,
            Bucket=bucket,
            Key=key,
            Config=transfer_config,
            ExtraArgs={"ServerSideEncryption": "aws:kms"},
        )
        logger.info("Successfully uploaded %s to s3://%s/%s", filepath, bucket, key)
    except Exception as exc:
        logger.error("Upload failed for %s: %s", filepath, exc)
        abort_incomplete_uploads(bucket, key, max_age_hours=1)
        raise

def abort_incomplete_uploads(bucket, key, max_age_hours=1):
    """Abort multipart uploads older than max_age_hours for this key."""
    from datetime import datetime, timedelta, timezone

    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)

    response = s3_client.list_multipart_uploads(Bucket=bucket, Prefix=key)
    for upload in response.get("Uploads", []):
        if upload["Initiated"] < cutoff:
            s3_client.abort_multipart_upload(
                Bucket=bucket,
                Key=upload["Key"],
                UploadId=upload["UploadId"],
            )
            logger.info(
                "Aborted stale multipart upload: %s (id: %s)",
                upload["Key"],
                upload["UploadId"],
            )
```

Step 3: Configure S3 lifecycle rule to auto-abort incomplete uploads

Set up a bucket lifecycle rule to automatically abort multipart uploads older than a threshold:

```bash
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket \
    --lifecycle-configuration '{
        "Rules": [
            {
                "ID": "AbortIncompleteMultipartUploads",
                "Filter": {},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": 1
                }
            }
        ]
    }'
```

This ensures that even if your application fails to clean up, S3 automatically aborts uploads after 1 day.
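The same rule can be applied from Python: the dictionary mirrors the JSON above and, assuming a configured `s3_client` as in Step 1, would be passed to `put_bucket_lifecycle_configuration` (call shown commented out so the fragment stands alone):

```python
lifecycle_config = {
    "Rules": [
        {
            "ID": "AbortIncompleteMultipartUploads",
            "Filter": {},
            "Status": "Enabled",
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
        }
    ]
}

# With a configured client (see Step 1):
# s3_client.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config
# )
```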

Prevention

  • Use multipart_chunksize=25MB for files up to 5GB; increase to 100MB for very large files
  • Set max_concurrency based on available network bandwidth (each concurrent part uses bandwidth)
  • Enable S3 Transfer Acceleration for cross-region uploads
  • Add CloudWatch alarms on 4xx and 5xx error rates for the S3 bucket
  • Run a daily cron job to list and abort any multipart uploads older than 24 hours
  • Use boto3.set_stream_logger("botocore", logging.DEBUG) to debug retry behavior in staging
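The first bullet's sizing guidance can be automated. A sketch of a chunk-size chooser (the function and its `target_parts` parameter are illustrative, not part of boto3) that keeps uploads well under the 10,000-part limit:

```python
def choose_chunk_size(file_size: int, target_parts: int = 1000) -> int:
    """Pick a multipart chunk size: 25 MiB by default, scaled up so the
    upload stays at or under roughly target_parts parts."""
    mib = 1024 * 1024
    chunk = max(25 * mib, -(-file_size // target_parts))  # ceiling division
    # Round up to a whole MiB for readability
    return -(-chunk // mib) * mib
```

For a 5 GiB file this returns the 25 MiB default; for a 1 TiB file it scales the chunk up so the part count stays near 1,000 instead of blowing past the limit.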