Introduction

The MemoryError when loading a large CSV file with pd.read_csv() occurs because pandas loads the entire file into RAM as a DataFrame, and the memory footprint of a DataFrame can be 5-10x larger than the raw CSV file size. A 2GB CSV file can easily require 10-20GB of RAM when loaded as a DataFrame. This error commonly hits data engineering pipelines, ETL processes, and analytics scripts that worked fine on small samples but fail in production with full-size datasets.
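The blow-up factor is easy to measure yourself. A minimal sketch using a small synthetic in-memory CSV (the column names here are made up for illustration):

```python
import io

import pandas as pd

# Build a small synthetic CSV in memory (hypothetical columns)
csv_text = "user_id,status\n" + "\n".join(
    f"{i},active" for i in range(10_000)
)
raw_bytes = len(csv_text.encode("utf-8"))

df = pd.read_csv(io.StringIO(csv_text))

# deep=True counts the actual Python string objects, not just the
# 8-byte pointers stored in the object-dtype column
df_bytes = df.memory_usage(deep=True).sum()

print(f"CSV size:       {raw_bytes / 1024:.0f} KiB")
print(f"DataFrame size: {df_bytes / 1024:.0f} KiB")
print(f"Blow-up factor: {df_bytes / raw_bytes:.1f}x")
```

The short string column is what drives the overhead: every cell in an object column is a full Python string object, typically 50+ bytes even for a six-character value.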

Symptoms

```bash
MemoryError: Unable to allocate 14.2 GiB for an array with shape (50000000, 38) and data type object
```

Or:

```bash
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
```

Or the process is killed by the OS:

```bash
$ dmesg | tail -5
[98765.432101] Out of memory: Killed process 12345 (python) total-vm:18432100kB, anon-rss:16789012kB
Killed
```

Common Causes

  • Default dtype inference uses object type: String columns default to object dtype which has high memory overhead
  • Loading entire file at once: pd.read_csv() has no streaming mode by default
  • Integer columns with NaN become float64: Pandas uses float64 for integers with missing values, doubling memory
  • Date parsing creates full datetime objects: Parsing dates creates 64-bit objects instead of keeping strings
  • Insufficient system RAM: The machine simply does not have enough memory for the dataset
  • Multiple copies during processing: Operations like df.copy(), merge(), or groupby() create temporary copies
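Two of these causes are easy to demonstrate directly: a single NaN promoting an integer column to float64, and the fix via pandas' nullable extension dtypes. A minimal sketch with a tiny hypothetical dataset:

```python
import io

import pandas as pd

# A hypothetical integer column with one missing value
csv_text = "quantity,country\n1,DE\n,FR\n3,DE\n"
df = pd.read_csv(io.StringIO(csv_text))

# The single NaN forces pandas to promote the integers to float64,
# doubling the per-value size compared with int16/int32
print(df["quantity"].dtype)  # float64

# The nullable Int16 extension dtype keeps integer storage plus a
# validity mask, so the missing value survives without float promotion
df["quantity"] = df["quantity"].astype("Int16")
print(df["quantity"].dtype)  # Int16

# Low-cardinality strings stored as category keep one copy of each
# unique value plus small integer codes per row - at real scale this
# is a large saving over per-row Python string objects
df["country"] = df["country"].astype("category")
print(df["country"].dtype)  # category
```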

Step-by-Step Fix

Step 1: Optimize dtypes before loading

Inspect memory usage and specify dtypes explicitly:

```python
import pandas as pd

# First, load a sample to understand the data
sample = pd.read_csv("data/large_file.csv", nrows=1000)
print(sample.dtypes)
print(sample.memory_usage(deep=True))

# Then specify optimal dtypes for the full load
dtypes = {
    "user_id": "int32",      # int64 -> int32 saves 50%
    "status": "category",    # Low-cardinality string -> category
    "country": "category",
    "amount": "float32",     # float64 -> float32 saves 50%
    "quantity": "Int16",     # Nullable int16 for columns with NaN
    "is_active": "boolean",  # Nullable boolean
}

df = pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
    usecols=["user_id", "status", "amount", "created_at"],  # Only needed columns
)

print(df.memory_usage(deep=True).sum() / 1024**3, "GB")
```

A real-world example: a 3.2GB CSV with 50 million rows and 38 columns went from 18GB RAM to 2.4GB after dtype optimization.

Step 2: Use chunked processing

When the full dataset still does not fit in memory:

```python
chunk_size = 500_000  # rows per chunk
results = []

for chunk in pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    chunksize=chunk_size,
):
    # Process each chunk independently
    filtered = chunk[chunk["amount"] > 100]
    grouped = filtered.groupby("status")["amount"].sum()
    results.append(grouped)

# Combine results from all chunks
final_result = pd.concat(results).groupby(level=0).sum()
```

Step 3: Use Dask for out-of-core processing

When chunked processing becomes too complex, use Dask:

```python
import dask.dataframe as dd

# Dask reads the CSV lazily - no data loaded yet
ddf = dd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
)

# Operations are lazy - builds a task graph
result = (
    ddf[ddf["amount"] > 100]
    .groupby("status")["amount"]
    .sum()
    .compute()  # This is when data is actually loaded
)
```

Dask automatically partitions the data and processes chunks in parallel, using only as much memory as needed per partition.

Prevention

  • Always check df.memory_usage(deep=True) after loading to understand actual memory consumption
  • Use pd.read_csv() with usecols to load only needed columns
  • Store processed data in Parquet format which is 5-10x smaller than CSV and preserves dtypes
  • Use pyarrow engine: pd.read_csv("file.csv", engine="pyarrow") for better memory efficiency
  • Monitor RSS memory with psutil.Process().memory_info().rss in long-running data pipelines
  • Consider Polars as an alternative: pl.scan_csv("file.csv") uses lazy evaluation and is significantly more memory-efficient than pandas, though its API differs from pandas