Introduction
scikit-learn models saved with pickle or joblib are tightly coupled to the specific version of scikit-learn (and often numpy) used to train them. When a model trained with scikit-learn 1.2 is loaded in an environment running scikit-learn 1.4, the deserialization can fail with attribute errors, wrong predictions, or silent corruption. This is a common production issue when model training happens in a data science environment with newer packages, while the production inference service runs older, stable versions. The problem is compounded because pickle errors are sometimes cryptic and do not clearly indicate a version mismatch.
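Because the mismatch is often invisible until load time, it helps to record the training environment the moment the model is saved, so a later failure is at least diagnosable. A minimal sketch using only the standard library (the helper name `environment_fingerprint` is illustrative, not a standard API):

```python
import json
import sys
from importlib import metadata

def environment_fingerprint(packages=("scikit-learn", "numpy", "scipy", "joblib")):
    """Record the package versions that affect pickle compatibility.

    Store this JSON next to the model artifact so a later mismatch
    can be diagnosed. Packages not installed are recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    versions["python"] = sys.version.split()[0]
    return versions

print(json.dumps(environment_fingerprint(), indent=2))
```

Writing this dict to a file alongside `model.pkl` turns a cryptic `AttributeError` into a quick diff between two version lists.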
Symptoms
```
AttributeError: Can't get attribute '_criterion_gini' on <module 'sklearn.tree._criterion' from '/usr/local/lib/python3.11/site-packages/sklearn/tree/_criterion.py'>
```

Or:

```
ValueError: buffer size mismatch
```

Or silent corruption -- the model loads but produces wrong predictions:

```python
model = joblib.load("model.pkl")
prediction = model.predict(X_test)
# Predictions are wrong but no error raised - the most dangerous scenario
```

Or during loading:

```
ModuleNotFoundError: No module named 'sklearn.ensemble._forest'
```

Common Causes
- Version mismatch between training and inference: Model trained with sklearn 1.3, loaded with sklearn 1.1
- NumPy version incompatibility: sklearn models embed numpy arrays that may not deserialize correctly across numpy major versions
- Model trained with different sklearn sub-features: Using optional dependencies like scipy or threadpoolctl that differ between environments
- Pickle protocol version: Models pickled with protocol 5 (Python 3.8+) cannot be loaded on Python 3.6
- Silent API changes: sklearn internal class structures change between versions, breaking pickle deserialization
- Model file corruption: Incomplete upload or storage corruption of the pickle file
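The corruption cause in particular is cheap to rule out before blaming version skew: store a checksum with the artifact and verify it before unpickling. A sketch using only the standard library (the helper names are illustrative):

```python
import hashlib
from pathlib import Path

def sha256_of(path) -> str:
    """Stream the file in chunks so large model artifacts are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path, expected_sha256: str) -> None:
    """Raise before joblib.load ever sees a truncated or corrupted file."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"Checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )

# Demo with a throwaway file standing in for model.pkl
demo = Path("demo_model.bin")
demo.write_bytes(b"not a real model")
digest = sha256_of(demo)
verify_artifact(demo, digest)  # silent when the file is intact
demo.unlink()
```

If the checksum matches and loading still fails, the cause is almost certainly one of the version mismatches above.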
Step-by-Step Fix
Step 1: Pin identical versions in training and inference
The safest approach:
```toml
# pyproject.toml for BOTH training and inference environments
[project]
dependencies = [
    "scikit-learn==1.3.2",
    "numpy==1.24.3",
    "scipy==1.11.2",
    "joblib==1.3.2",
]
```

Verify at runtime:
```python
import sklearn
import joblib

EXPECTED_SKLEARN_VERSION = "1.3.2"

def load_model(path: str):
    if sklearn.__version__ != EXPECTED_SKLEARN_VERSION:
        raise RuntimeError(
            f"Model requires scikit-learn {EXPECTED_SKLEARN_VERSION}, "
            f"but found {sklearn.__version__}. "
            f"Please update your environment."
        )
    return joblib.load(path)
```
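A variant that avoids hardcoding the expected version in the inference code: write a sidecar JSON next to the artifact at dump time and check it at load time. This is a sketch, not a joblib feature -- the `.meta.json` sidecar convention and helper names are assumptions, and stdlib `pickle` stands in for `joblib` so the demo is self-contained:

```python
import json
import pickle
from importlib import metadata
from pathlib import Path

def _sklearn_version():
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None

def dump_with_sidecar(obj, path: str) -> None:
    """Save the artifact plus a sidecar JSON recording the training version."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    Path(path + ".meta.json").write_text(json.dumps({"scikit-learn": _sklearn_version()}))

def load_with_sidecar(path: str):
    """Refuse to unpickle if the recorded version disagrees with the runtime."""
    meta = json.loads(Path(path + ".meta.json").read_text())
    if meta["scikit-learn"] != _sklearn_version():
        raise RuntimeError(
            f"Artifact was written under scikit-learn {meta['scikit-learn']}, "
            f"but the runtime has {_sklearn_version()}."
        )
    with open(path, "rb") as f:
        return pickle.load(f)

# Round-trip demo with a plain dict standing in for a fitted model
dump_with_sidecar({"weights": [1, 2, 3]}, "demo.pkl")
restored = load_with_sidecar("demo.pkl")
Path("demo.pkl").unlink()
Path("demo.pkl.meta.json").unlink()
```

The advantage over a hardcoded constant: each model carries its own requirement, so several models trained under different versions can coexist in one registry.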
Step 2: Use ONNX for cross-version model serialization
Export the model to ONNX format, which is version-agnostic:
```python
# In training environment
from sklearn.ensemble import RandomForestClassifier
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Export to ONNX
initial_type = [("float_input", FloatTensorType([None, X_train.shape[1]]))]
onnx_model = skl2onnx.convert_sklearn(model, initial_types=initial_type)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```
In the inference environment (any sklearn version -- only onnxruntime is required):
```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X_test.astype("float32")})[0]
```
Step 3: Use sklearn-porter for cross-language deployment
```python
from sklearn_porter import Porter
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Export to Java, C, JavaScript, etc.
porter = Porter(model, language="java")
output = porter.export(embedded=True)

with open("RandomForestClassifier.java", "w") as f:
    f.write(output)
```
The exported code has zero dependencies and runs in any JVM without sklearn. Note that sklearn-porter supports only a subset of estimators and has lagged behind recent scikit-learn releases, so verify compatibility with your model type and sklearn version before relying on it.
Prevention
- Always record the exact package versions used for training in a requirements.txt or MLflow run metadata
- Add version verification at model load time in the inference service
- Use ONNX format for models that need to run across different sklearn versions
- Implement model validation tests that compare predictions between old and new sklearn versions before upgrading
- Store model artifacts alongside their environment specification (Docker image hash, requirements.txt)
- Use MLflow or similar model registry that tracks environment dependencies per model version
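The prediction-comparison bullet can be made concrete as a regression test that runs in CI before any sklearn upgrade is merged: freeze a small input batch and the predictions recorded under the current version, then assert that the upgraded environment reproduces them. A runnable sketch, with a stub standing in for `joblib.load("model.pkl")` and illustrative golden values:

```python
import numpy as np

class _StubModel:
    """Stand-in for the loaded model so this sketch runs without an artifact."""
    def predict(self, X):
        return X @ np.array([0.5, -1.0])

def test_predictions_match_golden():
    model = _StubModel()  # in real CI: joblib.load("model.pkl")
    X_frozen = np.array([[2.0, 1.0], [4.0, 0.5]])
    golden = np.array([0.0, 1.5])  # recorded under the pre-upgrade version
    np.testing.assert_allclose(model.predict(X_frozen), golden, atol=1e-6)

test_predictions_match_golden()  # raises AssertionError on any drift
```

If this test fails after bumping scikit-learn, retrain or re-export the model under the new version instead of shipping the old pickle.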