
I wrote a sample program to train an SVM using sklearn. Here is the code:

from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

print(clf.predict(X))
joblib.dump(clf, 'clf.pkl') 

When I dump the model, I get this many files:

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I am not sure whether I did something wrong or this is normal. What are the *.npy files, and why are there 11 of them?

kcc__
  • Presumably those are `numpy` arrays for your data; when loading the `.pkl` back, `joblib` will look for those `np` arrays and load the model data back – EdChum Nov 03 '15 at 10:58
  • I just realized that if I use joblib.dump(clf, 'clf.pkl', compress=9) I get only one clf.pkl file. So I assume, as you stated, those are numpy arrays. During loading, do I have to load them all manually, or are they loaded automatically? – kcc__ Nov 03 '15 at 11:00
  • I expect them to be loaded automatically, just try it – EdChum Nov 03 '15 at 11:00
  • Yup, that is true. I don't load the *.npy files, just the .pkl. Do you know whether using the compress argument affects the arrays for very large datasets? – kcc__ Nov 03 '15 at 11:02
  • Basically it affects the pickled data size at the expense of read/write speed, so it depends on what your priorities are – EdChum Nov 03 '15 at 11:05

1 Answer


To save everything into one file, you should set the compress argument to True or to any number (1, for example).

But you should know that the separated representation of np arrays is necessary for the main features of joblib dump/load: joblib can save and load objects containing np arrays faster than Pickle due to this separated representation, and, in contrast to Pickle, joblib can correctly save and load objects with memmap numpy arrays. If you want a one-file serialization of the whole object (and don't want to save memmap np arrays), I think it would be better to use Pickle; AFAIK, in that case joblib's dump/load functionality will work at the same speed as Pickle.
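
For the memmap case, here is a minimal sketch (assuming the standalone joblib package, whose dump/load API matches sklearn.externals.joblib; the file name big.pkl is just illustrative) of loading a dumped array back memory-mapped instead of reading it fully into RAM:

import numpy as np
import joblib  # standalone package; sklearn.externals.joblib exposes the same functions

big = np.arange(0, 10**7)
joblib.dump(big, 'big.pkl')  # uncompressed dump; compression would disable memory-mapping

# mmap_mode='r' maps the stored array read-only instead of copying it into memory
big_mapped = joblib.load('big.pkl', mmap_mode='r')
print(type(big_mapped), big_mapped[:5])  # a numpy.memmap and its first few elements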

import numpy as np
import pickle
from sklearn.externals import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit                                    
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop
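
Relatedly, for the classifier in the question: only the main clf.pkl path needs to be passed to joblib.load; the side .npy files are picked up automatically, as noted in the comments. A minimal sketch, reusing the question's imports (note that sklearn.externals.joblib was removed from recent scikit-learn versions, where you would import joblib directly):

from sklearn import svm, datasets
from sklearn.externals import joblib  # in recent scikit-learn: import joblib

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC().fit(X, y)

joblib.dump(clf, 'clf.pkl')          # with old joblib this writes clf.pkl plus clf.pkl_*.npy side files
clf_loaded = joblib.load('clf.pkl')  # the side files are loaded automatically
print((clf_loaded.predict(X) == clf.predict(X)).all())  # True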
Ibraim Ganiev