Introduction
A `MemoryError` when loading a large CSV file with `pd.read_csv()` occurs because pandas loads the entire file into RAM as a DataFrame, and the memory footprint of a DataFrame can be 5-10x the raw CSV file size. A 2GB CSV file can easily require 10-20GB of RAM when loaded as a DataFrame. This error commonly hits data engineering pipelines, ETL processes, and analytics scripts that worked fine on small samples but fail in production with full-size datasets.
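The overhead is easy to measure on a small scale. A minimal sketch, using a synthetic in-memory CSV as a stand-in for a real file (the column names and values are made up for illustration):

```python
import io

import pandas as pd

# Synthetic stand-in for a CSV on disk (names and values are made up)
csv_bytes = ("user_id,status,amount\n" + "\n".join(
    f"{i},active,{i * 1.5}" for i in range(10_000)
)).encode()

df = pd.read_csv(io.BytesIO(csv_bytes))

raw_size = len(csv_bytes)
df_size = df.memory_usage(deep=True).sum()
print(f"CSV: {raw_size / 1024:.0f} KiB, "
      f"DataFrame: {df_size / 1024:.0f} KiB, "
      f"ratio: {df_size / raw_size:.1f}x")
```

Even on this tiny example the DataFrame is several times larger than the raw bytes, mostly because the string column is stored as Python objects.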
Symptoms
```
MemoryError: Unable to allocate 14.2 GiB for an array with shape (50000000, 38) and data type object
```
Or:
```
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
```
Or the process is killed by the OS:
```
$ dmesg | tail -5
[98765.432101] Out of memory: Killed process 12345 (python) total-vm:18432100kB, anon-rss:16789012kB
Killed
```
Common Causes
- Default dtype inference uses object type: string columns default to `object` dtype, which has high memory overhead
- Loading entire file at once: `pd.read_csv()` has no streaming mode by default
- Integer columns with NaN become float64: pandas uses float64 for integer columns with missing values, doubling memory
- Date parsing creates full datetime objects: parsing dates creates 64-bit objects instead of keeping strings
- Insufficient system RAM: the machine simply does not have enough memory for the dataset
- Multiple copies during processing: operations like `df.copy()`, `merge()`, or `groupby()` create temporary copies
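The float64 promotion is easy to reproduce, and the nullable extension dtypes avoid it. A small sketch with made-up data:

```python
import io

import pandas as pd

# Made-up data with a missing value in an integer column
csv_text = "id,quantity\n1,10\n2,\n3,30\n"

# Default inference: the missing value forces quantity to float64
default = pd.read_csv(io.StringIO(csv_text))
print(default["quantity"].dtype)  # float64

# Nullable Int16 keeps integers (2 bytes each plus a mask) and supports NaN
nullable = pd.read_csv(io.StringIO(csv_text), dtype={"quantity": "Int16"})
print(nullable["quantity"].dtype)  # Int16
print(nullable["quantity"].isna().sum())  # 1
```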
Step-by-Step Fix
Step 1: Optimize dtypes before loading
Inspect memory usage and specify dtypes explicitly:
```python
import pandas as pd

# First, load a sample to understand the data
sample = pd.read_csv("data/large_file.csv", nrows=1000)
print(sample.dtypes)
print(sample.memory_usage(deep=True))

# Then specify optimal dtypes for the full load
dtypes = {
    "user_id": "int32",      # int64 -> int32 saves 50%
    "status": "category",    # low-cardinality string -> category
    "country": "category",
    "amount": "float32",     # float64 -> float32 saves 50%
    "quantity": "Int16",     # nullable int16 for columns with NaN
    "is_active": "boolean",  # nullable boolean
}

df = pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
    usecols=["user_id", "status", "amount", "created_at"],  # only needed columns
)

print(df.memory_usage(deep=True).sum() / 1024**3, "GB")
```
A real-world example: a 3.2GB CSV with 50 million rows and 38 columns went from 18GB RAM to 2.4GB after dtype optimization.
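For low-cardinality string columns, the `category` conversion usually accounts for most of that saving: the strings are stored once and each row holds only a small integer code. A quick sketch with synthetic data:

```python
import pandas as pd

# One million synthetic rows drawn from three distinct status strings
statuses = pd.Series(["active", "inactive", "pending"] * 333_334)

as_object = statuses.memory_usage(deep=True)
as_category = statuses.astype("category").memory_usage(deep=True)
print(f"object: {as_object / 1e6:.1f} MB, "
      f"category: {as_category / 1e6:.1f} MB")
```

With only three distinct values, the categorical codes fit in a single byte per row, so the saving here is well over an order of magnitude.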
Step 2: Use chunked processing
When the full dataset still does not fit in memory:
```python
chunk_size = 500_000  # rows per chunk
results = []

for chunk in pd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    chunksize=chunk_size,
):
    # Process each chunk independently
    filtered = chunk[chunk["amount"] > 100]
    grouped = filtered.groupby("status")["amount"].sum()
    results.append(grouped)

# Combine results from all chunks
final_result = pd.concat(results).groupby(level=0).sum()
```
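This pattern works because sums are associative: per-chunk partial sums can be re-aggregated into the exact full-data result. A toy check with made-up data and a deliberately tiny chunk size:

```python
import io

import pandas as pd

# Tiny made-up dataset; chunksize=2 forces multiple chunks
csv_text = "status,amount\nA,150\nB,200\nA,50\nA,300\nB,120\n"

# Reference answer: filter and aggregate in one pass
full = pd.read_csv(io.StringIO(csv_text))
expected = full[full["amount"] > 100].groupby("status")["amount"].sum()

# Chunked answer: partial sums per chunk, then re-aggregate
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    filtered = chunk[chunk["amount"] > 100]
    partials.append(filtered.groupby("status")["amount"].sum())
combined = pd.concat(partials).groupby(level=0).sum()

print(combined.to_dict())  # {'A': 450, 'B': 320}
```

Note that not every aggregation decomposes this simply; a mean, for example, requires carrying both per-chunk sums and counts.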
Step 3: Use Dask for out-of-core processing
When chunked processing becomes too complex, use Dask:
```python
import dask.dataframe as dd

# Dask reads the CSV lazily - no data loaded yet
ddf = dd.read_csv(
    "data/large_file.csv",
    dtype=dtypes,
    parse_dates=["created_at"],
)

# Operations are lazy - builds a task graph
result = (
    ddf[ddf["amount"] > 100]
    .groupby("status")["amount"]
    .sum()
    .compute()  # This is when data is actually loaded
)
```
Dask automatically partitions the data and processes chunks in parallel, using only as much memory as needed per partition.
Prevention
- Always check `df.memory_usage(deep=True)` after loading to understand actual memory consumption
- Use `pd.read_csv()` with `usecols` to load only the columns you need
- Store processed data in Parquet format, which is typically 5-10x smaller than CSV and preserves dtypes
- Use the `pyarrow` engine, `pd.read_csv("file.csv", engine="pyarrow")`, for better memory efficiency
- Monitor RSS memory with `psutil.Process().memory_info().rss` in long-running data pipelines
- Consider Polars as an alternative: `pl.scan_csv("file.csv")` uses lazy evaluation and is significantly more memory-efficient than pandas