Introduction
NumPy loads entire arrays into memory by default. When a CSV or binary file contains more data than available RAM, Python raises MemoryError and the process crashes. This is common when working with datasets in the gigabyte range on machines with limited memory.
A 10GB CSV file with float64 values can easily require 20-40GB of RAM after parsing, due to temporary objects during the load process.
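A quick back-of-the-envelope check makes this concrete: the in-memory size of a dense array is rows × columns × itemsize, before any parsing overhead. A minimal sketch (the helper name and the example shape are illustrative, not from NumPy):

```python
import numpy as np

def estimated_array_bytes(rows, cols, dtype=np.float64):
    # Lower bound on the RAM a dense array of this shape will occupy;
    # temporary copies made during parsing can multiply this several times.
    return rows * cols * np.dtype(dtype).itemsize

# 10 million rows x 50 float64 columns:
gib = estimated_array_bytes(10_000_000, 50) / 1024**3
print(f"{gib:.1f} GiB")  # ~3.7 GiB before any parsing overhead
```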
Symptoms
- numpy.load or numpy.genfromtxt raises "MemoryError: Unable to allocate X GiB"
- Python process is killed by OS OOM killer (Linux) or runs out of page file (Windows)
- pandas.read_csv consumes all available RAM and fails mid-load
Common Causes
- Dataset size exceeds available physical RAM plus swap/page file
- NumPy creates temporary copies during dtype conversion
- Loading with float64 (8 bytes) when float32 (4 bytes) would suffice
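The second cause above is easy to reproduce: converting dtype after loading allocates a second full-size array while the first is still alive, so peak memory is the sum of both. A small sketch of that effect:

```python
import numpy as np

arr = np.ones(1_000_000, dtype=np.float64)   # ~8 MB on disk-backed RAM
converted = arr.astype(np.float32)           # new ~4 MB copy; both alive -> ~12 MB peak

# Allocating (or reading) directly in the target dtype avoids the
# float64 intermediate entirely:
direct = np.ones(1_000_000, dtype=np.float32)
```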
Step-by-Step Fix
1. Use memory-mapped files with numpy.memmap: access disk-backed arrays without loading everything into RAM.

```python
import numpy as np

# Create a memory-mapped array (data stays on disk)
mmap = np.memmap('large_data.dat', dtype='float32', mode='r',
                 shape=(1000000, 500))

# Access slices without loading the full array
chunk = mmap[0:10000, :]
print(chunk.shape)  # (10000, 500)

# For writing:
write_mmap = np.memmap('output.dat', dtype='float32', mode='w+',
                       shape=(1000000, 500))
write_mmap[0:1000] = np.random.rand(1000, 500).astype('float32')
write_mmap.flush()
```
2. Process data in chunks with generators: load and process one chunk at a time.

```python
import pandas as pd

def process_in_chunks(filepath, chunksize=100000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        # Process each chunk independently
        result = chunk.select_dtypes(include='number').mean()
        yield result

results = list(process_in_chunks('large_dataset.csv'))
final_result = pd.DataFrame(results).mean()
```
3. Downcast dtypes to reduce memory footprint: use smaller numeric types where precision allows.

```python
import numpy as np

# Before: float64 uses 8 bytes per value
data = np.array([1.0, 2.0, 3.0], dtype=np.float64)  # 24 bytes
# After: float32 uses 4 bytes per value
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # 12 bytes

# For integers, use the smallest sufficient type:
data = np.array([1, 2, 3], dtype=np.int16)  # 6 bytes vs 24 bytes with int64
```
4. Use Dask for out-of-core computation: Dask provides a NumPy-like API with chunked computation.

```python
import numpy as np
import dask.array as da

# Works on arrays larger than RAM
array = da.from_array(
    np.memmap('large_data.dat', dtype='float32', mode='r',
              shape=(1000000, 500)),
    chunks=(10000, 500),
)
mean = array.mean(axis=0).compute()  # Computes chunk by chunk
```
Prevention
- Always check dataset size against available RAM before loading
- Use memory profiling (memory_profiler package) to track allocation patterns
- Prefer chunked processing or memory-mapped files for datasets over 1GB
- Use appropriate dtypes: float32 instead of float64, int32 instead of int64
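The first and third prevention points can be combined into a simple guard before loading. This is a sketch under the 1 GB rule of thumb above; the helper name and threshold constant are illustrative, not from any library:

```python
import os

# Hypothetical threshold: the 1 GiB rule of thumb from the list above.
CHUNKED_THRESHOLD_BYTES = 1 * 1024**3

def should_chunk(filepath):
    # Decide up front whether to use chunked/memmap loading
    # instead of a plain full-file load.
    return os.path.getsize(filepath) > CHUNKED_THRESHOLD_BYTES
```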