Introduction
NumPy loads entire arrays into memory by default. When a CSV or binary file contains more data than available RAM, Python raises MemoryError and the process crashes. This is common when working with datasets in the gigabyte range on machines with limited memory.
A 10GB CSV file with float64 values can easily require 20-40GB of RAM after parsing, due to temporary objects during the load process.
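A quick back-of-the-envelope check makes this concrete: the in-memory size of a dense array is rows × columns × itemsize, before any parsing overhead. A minimal sketch (the helper name and the example shape are illustrative, not from NumPy):

```python
import numpy as np

def estimated_array_bytes(rows, cols, dtype=np.float64):
    # Lower bound on the RAM a dense array of this shape will occupy;
    # temporary copies made during parsing can multiply this several times.
    return rows * cols * np.dtype(dtype).itemsize

# 10 million rows x 50 float64 columns:
gib = estimated_array_bytes(10_000_000, 50) / 1024**3
print(f"{gib:.1f} GiB")  # ~3.7 GiB before any parsing overhead
```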
Symptoms
- numpy.load or numpy.genfromtxt raises "MemoryError: Unable to allocate X GiB"
- Python process is killed by OS OOM killer (Linux) or runs out of page file (Windows)
- pandas.read_csv consumes all available RAM and fails mid-load
Common Causes
- Dataset size exceeds available physical RAM plus swap/page file
- NumPy creates temporary copies during dtype conversion
- Loading with float64 (8 bytes) when float32 (4 bytes) would suffice
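The second cause above is easy to reproduce: converting dtype after loading allocates a second full-size array while the first is still alive, so peak memory is the sum of both. A small sketch of that effect:

```python
import numpy as np

arr = np.ones(1_000_000, dtype=np.float64)   # ~8 MB on disk-backed RAM
converted = arr.astype(np.float32)           # new ~4 MB copy; both alive -> ~12 MB peak

# Allocating (or reading) directly in the target dtype avoids the
# float64 intermediate entirely:
direct = np.ones(1_000_000, dtype=np.float32)
```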
Step-by-Step Fix
1. Use memory-mapped files with numpy.memmap: access disk-backed arrays without loading everything into RAM.

```python
import numpy as np

# Create a memory-mapped array (data stays on disk)
mmap = np.memmap('large_data.dat', dtype='float32', mode='r',
                 shape=(1000000, 500))

# Access slices without loading the full array
chunk = mmap[0:10000, :]
print(chunk.shape)  # (10000, 500)

# For writing:
write_mmap = np.memmap('output.dat', dtype='float32', mode='w+',
                       shape=(1000000, 500))
write_mmap[0:1000] = np.random.rand(1000, 500).astype('float32')
write_mmap.flush()
```
2. Process data in chunks with generators: load and process one chunk at a time.

```python
import pandas as pd

def process_in_chunks(filepath, chunksize=100000):
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        # Process each chunk independently
        result = chunk.select_dtypes(include='number').mean()
        yield result

results = list(process_in_chunks('large_dataset.csv'))
final_result = pd.DataFrame(results).mean()
```
3. Downcast dtypes to reduce memory footprint: use smaller numeric types where precision allows.

```python
import numpy as np

# Before: float64 uses 8 bytes per value
data = np.array([1.0, 2.0, 3.0], dtype=np.float64)  # 24 bytes
# After: float32 uses 4 bytes per value
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # 12 bytes

# For integers, use the smallest sufficient type:
data = np.array([1, 2, 3], dtype=np.int16)  # 6 bytes vs 24 bytes with int64
```
4. Use Dask for out-of-core computation: Dask provides a NumPy-like API with chunked computation.

```python
import numpy as np
import dask.array as da

# Works on arrays larger than RAM
array = da.from_array(
    np.memmap('large_data.dat', dtype='float32', mode='r',
              shape=(1000000, 500)),
    chunks=(10000, 500),
)
mean = array.mean(axis=0).compute()  # Computes chunk by chunk
```
Prevention
- Always check dataset size against available RAM before loading
- Use memory profiling (memory_profiler package) to track allocation patterns
- Prefer chunked processing or memory-mapped files for datasets over 1GB
- Use appropriate dtypes: float32 instead of float64, int32 instead of int64
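The first and third prevention points can be combined into a simple guard before loading. This is a sketch under the 1 GB rule of thumb above; the helper name and threshold constant are illustrative, not from any library:

```python
import os

# Hypothetical threshold: the 1 GiB rule of thumb from the list above.
CHUNKED_THRESHOLD_BYTES = 1 * 1024**3

def should_chunk(filepath):
    # Decide up front whether to use chunked/memmap loading
    # instead of a plain full-file load.
    return os.path.getsize(filepath) > CHUNKED_THRESHOLD_BYTES
```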