Introduction
A `MemoryError` when loading a large CSV file with `pd.read_csv()` occurs because pandas loads the entire file into RAM as a DataFrame, and the memory footprint of a DataFrame can be 5-10x the raw CSV file size. A 2GB CSV file can easily require 10-20GB of RAM when loaded as a DataFrame. This error commonly hits data engineering pipelines, ETL processes, and analytics scripts that worked fine on small samples but fail in production with full-size datasets.
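The overhead is easy to measure on a small scale. A minimal sketch, using a synthetic in-memory CSV as a stand-in for a real file (the column names and values are made up for illustration):

```python
import io

import pandas as pd

# Synthetic stand-in for a CSV on disk (names and values are made up)
csv_bytes = ("user_id,status,amount\n" + "\n".join(
    f"{i},active,{i * 1.5}" for i in range(10_000)
)).encode()

df = pd.read_csv(io.BytesIO(csv_bytes))

raw_size = len(csv_bytes)
df_size = df.memory_usage(deep=True).sum()
print(f"CSV: {raw_size / 1024:.0f} KiB, "
      f"DataFrame: {df_size / 1024:.0f} KiB, "
      f"ratio: {df_size / raw_size:.1f}x")
```

Even on this tiny example the DataFrame is several times larger than the raw bytes, mostly because the string column is stored as Python objects.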
Symptoms
```
MemoryError: Unable to allocate 14.2 GiB for an array with shape (50000000, 38) and data type object
```
Or:
```
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
```
Or the process is killed by the OS:
```
$ dmesg | tail -5
[98765.432101] Out of memory: Killed process 12345 (python) total-vm:18432100kB, anon-rss:16789012kB
Killed
```
Common Causes
- Default dtype inference uses object type: string columns default to `object` dtype, which has high memory overhead
- Loading entire file at once: `pd.read_csv()` has no streaming mode by default
- Integer columns with NaN become float64: pandas uses float64 for integer columns with missing values, doubling memory
- Date parsing creates full datetime objects: parsing dates creates 64-bit objects instead of keeping strings
- Insufficient system RAM: the machine simply does not have enough memory for the dataset
- Multiple copies during processing: operations like `df.copy()`, `merge()`, or `groupby()` create temporary copies
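The float64 promotion is easy to reproduce, and the nullable extension dtypes avoid it. A small sketch with made-up data:

```python
import io

import pandas as pd

# Made-up data with a missing value in an integer column
csv_text = "id,quantity\n1,10\n2,\n3,30\n"

# Default inference: the missing value forces quantity to float64
default = pd.read_csv(io.StringIO(csv_text))
print(default["quantity"].dtype)  # float64

# Nullable Int16 keeps integers (2 bytes each plus a mask) and supports NaN
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"quantity": "Int16"})
print(nullable["quantity"].dtype)  # Int16
print(nullable["quantity"].isna().sum())  # 1
```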
Step-by-Step Fix
Step 1: Optimize dtypes before loading
Inspect memory usage and specify dtypes explicitly:
```python
import pandas as pd

# First, load a sample to understand the data
sample = pd.read_csv("data/large_file.csv", nrows=1000)
print(sample.dtypes)
print(sample.memory_usage(deep=True))

# Then specify optimal dtypes for the full load
dtypes = {
    "user_id": "int32",      # int64 -> int32 saves 50%
    "status": "category",    # low-cardinality string -> category
    "country": "category",
    "amount": "float32",     # float64 -> float32 saves 50%
    "quantity": "Int16",     # nullable int16 for columns with NaN
    "is_active": "boolean",  # nullable boolean
}

df = pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
    usecols=["user_id", "status", "amount", "created_at"],  # only needed columns
)

print(df.memory_usage(deep=True).sum() / 1024**3, "GB")
```
A real-world example: a 3.2GB CSV with 50 million rows and 38 columns went from 18GB RAM to 2.4GB after dtype optimization.
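For low-cardinality string columns, the `category` conversion usually accounts for most of that saving: the strings are stored once and each row holds only a small integer code. A quick sketch with synthetic data:

```python
import pandas as pd

# One million synthetic rows drawn from three distinct status strings
statuses = pd.Series(["active", "inactive", "pending"] * 333_334)

as_object = statuses.memory_usage(deep=True)
as_category = statuses.astype("category").memory_usage(deep=True)
print(f"object: {as_object / 1e6:.1f} MB, "
      f"category: {as_category / 1e6:.1f} MB")
```

With only three distinct values, the categorical codes fit in a single byte per row, so the saving here is well over an order of magnitude.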
Step 2: Use chunked processing
When the full dataset still does not fit in memory:
```python
chunk_size = 500_000  # rows per chunk
results = []

for chunk in pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    chunksize=chunk_size,
):
    # Process each chunk independently
    filtered = chunk[chunk["amount"] > 100]
    grouped = filtered.groupby("status")["amount"].sum()
    results.append(grouped)

# Combine results from all chunks
final_result = pd.concat(results).groupby(level=0).sum()
```
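This pattern works because sums are associative: per-chunk partial sums can be re-aggregated into the exact full-data result. A toy check with made-up data and a deliberately tiny chunk size:

```python
import io

import pandas as pd

# Tiny made-up dataset; chunksize=2 forces multiple chunks
csv_text = "status,amount\nA,150\nB,200\nA,50\nA,300\nB,120\n"

# Reference answer: filter and aggregate in one pass
full = pd.read_csv(io.StringIO(csv_text))
expected = full[full["amount"] > 100].groupby("status")["amount"].sum()

# Chunked answer: partial sums per chunk, then re-aggregate
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    filtered = chunk[chunk["amount"] > 100]
    partials.append(filtered.groupby("status")["amount"].sum())
combined = pd.concat(partials).groupby(level=0).sum()

print(combined.to_dict())  # {'A': 450, 'B': 320}
```

Note that not every aggregation decomposes this simply; a mean, for example, requires carrying both per-chunk sums and counts.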
Step 3: Use Dask for out-of-core processing
When chunked processing becomes too complex, use Dask:
```python
import dask.dataframe as dd

# Dask reads the CSV lazily - no data loaded yet
ddf = dd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
)

# Operations are lazy - builds a task graph
result = (
    ddf[ddf["amount"] > 100]
    .groupby("status")["amount"]
    .sum()
    .compute()  # This is when data is actually loaded
)
```
Dask automatically partitions the data and processes chunks in parallel, using only as much memory as needed per partition.
Prevention
- Always check `df.memory_usage(deep=True)` after loading to understand actual memory consumption
- Use `pd.read_csv()` with `usecols` to load only the columns you need
- Store processed data in Parquet format, which is typically 5-10x smaller than CSV and preserves dtypes
- Use the `pyarrow` engine, `pd.read_csv("file.csv", engine="pyarrow")`, for better memory efficiency
- Monitor RSS memory with `psutil.Process().memory_info().rss` in long-running data pipelines
- Consider Polars as an alternative: `pl.scan_csv("file.csv")` uses lazy evaluation and is significantly more memory-efficient than pandas