Introduction
Python's pickle module is commonly used for caching computed results, session data, and ML model artifacts. When a pickle file becomes corrupted -- due to interrupted writes, disk errors, or version incompatibility -- pickle.load() raises EOFError, pickle.UnpicklingError, or AttributeError.
Unlike JSON, pickle has no built-in validation, so corrupted files cause immediate crashes with cryptic error messages.
Symptoms
- pickle.load raises "EOFError: Ran out of input" on an empty or truncated file
- pickle.load raises "pickle.UnpicklingError: invalid load key"
- ModuleNotFoundError when loading a pickle saved with a different module structure
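The first two symptoms are easy to reproduce in isolation, which can help confirm that a crash really is pickle corruption rather than an application bug:

```python
import pickle

# An empty or truncated stream -> "EOFError: Ran out of input"
try:
    pickle.loads(b"")
except EOFError as e:
    print("EOFError:", e)

# Bytes that are not valid pickle opcodes -> "invalid load key"
try:
    pickle.loads(b"\x00corrupted")
except pickle.UnpicklingError as e:
    print("UnpicklingError:", e)
```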
Common Causes
- Process crashed or was killed while writing the pickle file
- File was partially read while another process was writing it (no file locking)
- Pickle was created with a different class definition or module path than current code
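The second cause (a reader observing a half-written file) can be mitigated with advisory file locks so readers block until the writer finishes. A minimal POSIX-only sketch using `fcntl.flock`; the `locked_save_pickle`/`locked_load_pickle` helper names are illustrative, and the writer deliberately avoids mode `'wb'`, which would truncate the file before the lock is acquired:

```python
import fcntl
import os
import pickle

def locked_save_pickle(data, filepath):
    """Write a pickle under an exclusive advisory lock (POSIX only)."""
    # O_RDWR | O_CREAT opens without truncating, so readers never see
    # an empty file before we hold the lock.
    fd = os.open(filepath, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, 'r+b') as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # readers using LOCK_SH will wait
        try:
            f.truncate(0)
            f.seek(0)
            pickle.dump(data, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def locked_load_pickle(filepath):
    """Read a pickle under a shared advisory lock (POSIX only)."""
    with open(filepath, 'rb') as f:
        fcntl.flock(f, fcntl.LOCK_SH)   # blocks while a writer holds LOCK_EX
        try:
            return pickle.load(f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Advisory locks only protect cooperating processes on the same host; the atomic-write pattern in the fix below is the more general safeguard.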
Step-by-Step Fix
1. Wrap unpickling in a try/except with a fallback: gracefully handle corrupted cache files.

```python
import pickle
import logging

logger = logging.getLogger(__name__)

def safe_load_pickle(filepath):
    try:
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
        return data
    except EOFError:
        logger.warning(f"Empty pickle file: {filepath}")
        return None
    except (pickle.UnpicklingError, AttributeError) as e:
        logger.warning(f"Corrupted pickle file {filepath}: {e}")
        return None
    except FileNotFoundError:
        return None

cache = safe_load_pickle('cache.pkl')
if cache is None:
    cache = compute_fresh_data()
```
2. Use atomic writes to prevent partial pickle files: write to a temp file, then rename atomically.

```python
import pickle
import os
import tempfile

def safe_save_pickle(data, filepath):
    """Write pickle atomically to prevent corruption."""
    # Temp file must live in the same directory (same filesystem)
    # for os.replace to be atomic.
    dir_name = os.path.dirname(filepath) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.pkl.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(data, f)
        os.replace(tmp_path, filepath)  # Atomic on POSIX
    except BaseException:
        os.unlink(tmp_path)  # Clean up on failure
        raise
```
3. Add cache version checking: include a version marker to detect incompatible pickles.

```python
import pickle

CACHE_VERSION = 3

def save_cache(data, filepath):
    payload = {'version': CACHE_VERSION, 'data': data}
    with open(filepath, 'wb') as f:
        pickle.dump(payload, f)

def load_cache(filepath):
    try:
        with open(filepath, 'rb') as f:
            payload = pickle.load(f)
        if payload.get('version') != CACHE_VERSION:
            return None  # Version mismatch, recompute
        return payload['data']
    except (EOFError, pickle.UnpicklingError, FileNotFoundError):
        return None
```
4. Switch to safer serialization for simple data: use JSON or msgpack for non-object data.

```python
import json

data = {'user': 'alice', 'count': 2}  # example payload

# Instead of pickle for simple data structures:
# BAD:  pickle.dump(data, f)  # Can execute arbitrary code on load
# GOOD: JSON is safe, human-readable, and cross-language
with open('cache.json', 'w') as f:
    json.dump(data, f)

with open('cache.json', 'r') as f:
    data = json.load(f)
```
Prevention
- Always use atomic writes (temp file + rename) for pickle files
- Include version markers in cached data to detect schema changes
- Avoid pickle for untrusted data -- it can execute arbitrary code during unpickling
- Prefer JSON or msgpack for simple data structures that don't need object serialization
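The first two prevention measures combine naturally into one small caching pattern. A sketch under the assumptions above; `CACHE_VERSION` and the `save_cache_atomic`/`load_cache_safe` names are illustrative:

```python
import os
import pickle
import tempfile

CACHE_VERSION = 3

def save_cache_atomic(data, filepath):
    """Atomic write plus version marker, in one step."""
    payload = {'version': CACHE_VERSION, 'data': data}
    dir_name = os.path.dirname(filepath) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.pkl.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(payload, f)
        os.replace(tmp_path, filepath)  # Atomic on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise

def load_cache_safe(filepath):
    """Return cached data, or None if missing, corrupted, or stale."""
    try:
        with open(filepath, 'rb') as f:
            payload = pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError, AttributeError):
        return None
    if not isinstance(payload, dict) or payload.get('version') != CACHE_VERSION:
        return None
    return payload['data']
```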