
I wrote a sample program to train an SVM using sklearn. Here is the code:

from sklearn import svm
from sklearn import datasets
from sklearn.externals import joblib

clf = svm.SVC()
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf.fit(X, y)

print(clf.predict(X))
joblib.dump(clf, 'clf.pkl') 

When I dump the model, I get this many files:

['clf.pkl', 'clf.pkl_01.npy', 'clf.pkl_02.npy', 'clf.pkl_03.npy', 'clf.pkl_04.npy', 'clf.pkl_05.npy', 'clf.pkl_06.npy', 'clf.pkl_07.npy', 'clf.pkl_08.npy', 'clf.pkl_09.npy', 'clf.pkl_10.npy', 'clf.pkl_11.npy']

I am not sure whether I did something wrong or this is normal. What are the *.npy files, and why are there 11 of them?

kcc__
  • Presumably those are `numpy` arrays for your data; when loading the `.pkl` back, `joblib` will look for those `np` arrays and load the model data back – EdChum Nov 03 '15 at 10:58
  • I just realized that if I use joblib.dump(clf, 'clf.pkl', compress=9) I get only one clf.pkl file. So I assume, as you stated, those are numpy arrays. During loading, do I have to load them all manually, or are they loaded automatically? – kcc__ Nov 03 '15 at 11:00
  • I expect them to be loaded automatically, just try it – EdChum Nov 03 '15 at 11:00
  • Yup, that is true. I don't load the *.npy files, just the .pkl. Do you know whether using the compress argument affects the arrays for very large datasets? – kcc__ Nov 03 '15 at 11:02
  • Basically it affects the pickled data size at the expense of read/write speed, so it depends on what your priorities are – EdChum Nov 03 '15 at 11:05

1 Answer


To save everything into one file, you should set the compress argument to True or to any number (1, for example).

But you should know that the separated representation of np arrays is necessary for the main features of joblib dump/load: joblib can save and load objects containing np arrays faster than Pickle due to this separated representation, and, in contrast to Pickle, joblib can correctly save and load objects with memmap numpy arrays. If you want a one-file serialization of the whole object (and don't want to save memmap np arrays), I think it would be better to use Pickle; AFAIK, in that case joblib's dump/load functionality will work at the same speed as Pickle.
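
For the memmap case, here is a minimal sketch (assuming the standalone joblib package, whose dump/load API matches sklearn.externals.joblib; the file name big.pkl is just illustrative) of loading a dumped array back memory-mapped instead of reading it fully into RAM:

import numpy as np
import joblib  # standalone package; sklearn.externals.joblib exposes the same functions

big = np.arange(0, 10**7)
joblib.dump(big, 'big.pkl')  # uncompressed dump; compression would disable memory-mapping

# mmap_mode='r' maps the stored array read-only instead of copying it into memory
big_mapped = joblib.load('big.pkl', mmap_mode='r')
print(type(big_mapped), big_mapped[:5])  # a numpy.memmap and its first few elements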

import numpy as np
import pickle
from sklearn.externals import joblib

vector = np.arange(0, 10**7)

%timeit joblib.dump(vector, 'vector.pkl')
# 1 loops, best of 3: 818 ms per loop
# file size ~ 80 MB
%timeit vector_load = joblib.load('vector.pkl')
# 10 loops, best of 3: 47.6 ms per loop

# Compressed
%timeit joblib.dump(vector, 'vector.pkl', compress=1)
# 1 loops, best of 3: 1.58 s per loop
# file size ~ 15.1 MB
%timeit vector_load = joblib.load('vector.pkl')
# 1 loops, best of 3: 442 ms per loop

# Pickle
%%timeit
with open('vector.pkl', 'wb') as f:
    pickle.dump(vector, f)
# 1 loops, best of 3: 927 ms per loop
%%timeit                                    
with open('vector.pkl', 'rb') as f:
    vector_load = pickle.load(f)
# 10 loops, best of 3: 94.1 ms per loop
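
Relatedly, for the classifier in the question: only the main clf.pkl path needs to be passed to joblib.load; the side .npy files are picked up automatically, as noted in the comments. A minimal sketch, reusing the question's imports (note that sklearn.externals.joblib was removed from recent scikit-learn versions, where you would import joblib directly):

from sklearn import svm, datasets
from sklearn.externals import joblib  # in recent scikit-learn: import joblib

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC().fit(X, y)

joblib.dump(clf, 'clf.pkl')          # with old joblib this writes clf.pkl plus clf.pkl_*.npy side files
clf_loaded = joblib.load('clf.pkl')  # the side files are loaded automatically
print((clf_loaded.predict(X) == clf.predict(X)).all())  # True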
Ibraim Ganiev