Introduction
S3 multipart uploads fail in production for several reasons: network interruptions during part upload, part checksum mismatches, timeout on large parts, or S3 throttling when uploading many parts concurrently. Unlike single PUT uploads, multipart uploads require managing upload state across multiple HTTP requests. If any part fails, the entire upload must be either retried from that part or aborted and restarted. Without proper retry configuration, boto3 gives up after the default retry limit, leaving incomplete multipart uploads that consume S3 storage and incur costs indefinitely.
Symptoms
```
botocore.exceptions.ClientError: An error occurred (RequestTimeout) when calling the UploadPart operation
(reached max retries: 4): Your socket connection to the server was not read from or written to within the timeout period.
```

Or:

```
botocore.exceptions.ClientError: An error occurred (SlowDown) when calling the UploadPart operation
(reached max retries: 4): Please reduce your request rate.
```

Orphaned multipart uploads detected by lifecycle policy:
```
$ aws s3api list-multipart-uploads --bucket my-bucket
{
    "Uploads": [
        {
            "Key": "data/export-2024-03-15.csv",
            "UploadId": "abc123",
            "Initiated": "2024-03-15T10:00:00.000Z",
            "StorageClass": "STANDARD"
        }
    ]
}
```

Common Causes
- Default retry configuration too conservative: botocore's default (legacy) retry mode covers fewer transient S3 error codes than the standard or adaptive modes and does no client-side throttling
- Part size too large: 100MB+ parts take too long to upload and hit the socket timeout
- Too many concurrent part uploads: Exceeding S3 per-prefix rate limits triggers SlowDown errors
- Incomplete upload not aborted: Failed uploads are never cleaned up, accumulating storage costs
- Network instability on EC2: Instance network throughput fluctuation causes part upload timeouts
- Missing idempotency: retrying re-uploads the same file, creating duplicate keys
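Because every uploaded part of an abandoned upload is billed until the upload is aborted, it helps to measure how much storage orphans are actually holding. A sketch using boto3's `list_multipart_uploads` and `list_parts` paginators; the `human` formatting helper and the lazy boto3 import are illustrative choices so the formatting logic works without AWS access:

```python
def human(n: float) -> str:
    """Format a byte count for logs, e.g. 26214400 -> '25.0 MiB'."""
    for unit in ("B", "KiB", "MiB", "GiB"):
        if n < 1024 or unit == "GiB":
            return f"{n:.1f} {unit}"
        n /= 1024

def incomplete_upload_bytes(bucket: str) -> int:
    """Sum the bytes held by all in-progress multipart uploads in a bucket."""
    import boto3  # imported lazily so human() above stays usable without AWS
    s3 = boto3.client("s3")
    uploads = s3.get_paginator("list_multipart_uploads")
    parts = s3.get_paginator("list_parts")
    total = 0
    for page in uploads.paginate(Bucket=bucket):
        for up in page.get("Uploads", []):
            # Each in-progress upload lists its already-uploaded parts with sizes.
            for part_page in parts.paginate(
                Bucket=bucket, Key=up["Key"], UploadId=up["UploadId"]
            ):
                total += sum(p["Size"] for p in part_page.get("Parts", []))
    return total

if __name__ == "__main__":
    print(human(incomplete_upload_bytes("my-bucket")))  # bucket name is a placeholder
```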
Step-by-Step Fix
Step 1: Use TransferConfig with proper retry and part settings
```python
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.config import Config

# Configure botocore retries with a custom policy
config = Config(
    retries={
        "max_attempts": 10,
        "mode": "adaptive",  # "standard" or "adaptive"
    }
)

s3_client = boto3.client("s3", config=config)

# Configure multipart transfer
transfer_config = TransferConfig(
    multipart_threshold=50 * 1024 * 1024,  # 50MB - use multipart above this
    multipart_chunksize=25 * 1024 * 1024,  # 25MB per part (smaller = more resumable)
    max_concurrency=10,                    # Concurrent part uploads
    use_threads=True,
)
```
The adaptive retry mode adds client-side throttling in addition to exponential backoff, which is essential for S3 SlowDown responses.
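Chunk size also determines how many parts a file needs and caps the largest object you can upload, since S3 allows at most 10,000 parts per multipart upload. A quick sanity check (pure arithmetic, no AWS calls):

```python
import math

MAX_PARTS = 10_000  # S3 hard limit on parts per multipart upload

def parts_needed(file_size: int, chunk_size: int) -> int:
    """How many parts boto3 will upload for a file of this size."""
    return math.ceil(file_size / chunk_size)

def max_object_size(chunk_size: int) -> int:
    """Largest object uploadable at this chunk size, given the 10,000-part cap."""
    return MAX_PARTS * chunk_size

chunk = 25 * 1024 * 1024                   # 25 MiB, matching the TransferConfig above
print(parts_needed(5 * 1024**3, chunk))    # a 5 GiB file needs 205 parts
print(max_object_size(chunk) // 1024**3)   # ~244 GiB ceiling at 25 MiB parts
```

If your files approach that ceiling, raise `multipart_chunksize` before raising anything else.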
Step 2: Implement resumable upload with abort cleanup
```python import logging
logger = logging.getLogger(__name__)
```python
def upload_file_with_cleanup(bucket, key, filepath, transfer_config):
    """Upload file and abort multipart upload on failure."""
    try:
        s3_client.upload_file(
            Filename=filepath,
            Bucket=bucket,
            Key=key,
            Config=transfer_config,
            ExtraArgs={"ServerSideEncryption": "aws:kms"},
        )
        logger.info("Successfully uploaded %s to s3://%s/%s", filepath, bucket, key)
    except Exception as exc:
        logger.error("Upload failed for %s: %s", filepath, exc)
        abort_incomplete_uploads(bucket, key, max_age_hours=1)
        raise


def abort_incomplete_uploads(bucket, key, max_age_hours=1):
    """Abort multipart uploads older than max_age_hours for this key."""
    from datetime import datetime, timedelta, timezone

    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    response = s3_client.list_multipart_uploads(Bucket=bucket, Prefix=key)
    for upload in response.get("Uploads", []):
        if upload["Initiated"] < cutoff:
            s3_client.abort_multipart_upload(
                Bucket=bucket,
                Key=upload["Key"],
                UploadId=upload["UploadId"],
            )
            logger.info(
                "Aborted stale multipart upload: %s (id: %s)",
                upload["Key"],
                upload["UploadId"],
            )
```
Step 3: Configure S3 lifecycle rule to auto-abort incomplete uploads
Set up a bucket lifecycle rule to automatically abort multipart uploads older than a threshold:
```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "AbortIncompleteMultipartUploads",
        "Filter": {},
        "Status": "Enabled",
        "AbortIncompleteMultipartUpload": {
          "DaysAfterInitiation": 1
        }
      }
    ]
  }'
```

This ensures that even if your application fails to clean up, S3 automatically aborts uploads after 1 day.
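The same rule can be applied from Python, which is convenient when bucket setup is managed in code. A minimal sketch; the bucket name is a placeholder, and the lazy boto3 import is just so the rule dictionary can be reused without AWS access:

```python
# Lifecycle rule mirroring the CLI example above.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "AbortIncompleteMultipartUploads",
            "Filter": {},
            "Status": "Enabled",
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
        }
    ]
}

def apply_lifecycle(bucket: str) -> None:
    """Apply the abort-incomplete-uploads rule to a bucket."""
    import boto3  # lazy import so LIFECYCLE_CONFIG is importable without AWS
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=LIFECYCLE_CONFIG
    )

if __name__ == "__main__":
    apply_lifecycle("my-bucket")
```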
Prevention
- Use `multipart_chunksize=25MB` for files up to 5GB; increase to 100MB for very large files
- Set `max_concurrency` based on available network bandwidth (each concurrent part uses bandwidth)
- Enable S3 Transfer Acceleration for cross-region uploads
- Add CloudWatch alarms on `4xx` and `5xx` error rates for the S3 bucket
- Run a daily cron job to list and abort any multipart uploads older than 24 hours
- Use `boto3.set_stream_logger("botocore", logging.DEBUG)` to debug retry behavior in staging
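The daily cleanup job from the prevention list can be a short script. A sketch assuming credentials come from the environment; the staleness check is split into a pure helper (an illustrative design choice) so it is testable without AWS:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_stale(initiated: datetime, max_age_hours: float = 24,
             now: Optional[datetime] = None) -> bool:
    """True if a multipart upload started more than max_age_hours ago."""
    now = now or datetime.now(timezone.utc)
    return initiated < now - timedelta(hours=max_age_hours)

def abort_stale_uploads(bucket: str, max_age_hours: float = 24) -> int:
    """Abort every in-progress multipart upload older than the cutoff."""
    import boto3  # lazy import so is_stale stays testable offline
    s3 = boto3.client("s3")
    aborted = 0
    for page in s3.get_paginator("list_multipart_uploads").paginate(Bucket=bucket):
        for up in page.get("Uploads", []):
            if is_stale(up["Initiated"], max_age_hours):
                s3.abort_multipart_upload(
                    Bucket=bucket, Key=up["Key"], UploadId=up["UploadId"]
                )
                aborted += 1
    return aborted

if __name__ == "__main__":
    print(abort_stale_uploads("my-bucket"))  # bucket name is a placeholder
```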