
I have a set of labelled spectrograms that I want to classify with different machine learning methods. I wanted to use "sklearn.datasets.load_files()" to try out my dataset with some decision trees, but unfortunately I'm at a loss on how to achieve that.

I have a directory set up in which each class of png files sits in its own subfolder, so loading the images with "sklearn.datasets.load_files('path')" worked. But

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(training.data, training.target)

gave me the error "ValueError: could not convert string to float". I then did some more reading in the documentation of load_files and saw the possibility of adding the "encoding" parameter. After opening one of the files with Notepad++, I saw that the file apparently uses 'ANSI' as its encoding, so I tried "sklearn.datasets.load_files('path', encoding='ANSI')". I still get a "ValueError: could not convert string to float: " message, though.
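
For completeness, the full attempt looks roughly like this ('path' stands in for my actual image directory):

import sklearn.datasets
from sklearn.tree import DecisionTreeRegressor

# each subfolder of 'path' holds the png spectrograms of one class
training = sklearn.datasets.load_files('path', encoding='ANSI')

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(training.data, training.target)  # ValueError: could not convert string to float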

There is a very similar question on this forum already (see here), but the GitHub link in its answer results in a 404 error and I'm at a loss as to what to do with the rest of the answer.

I'm under the impression that load_files with the correct encoding parameter should work, since the documentation linked above mentions images. Maybe I'm wrong, though, or there is some in-between step that I'm missing. Thank you for reading!


1 Answer


So, I've been able to fix it, albeit not with the load_files function. I read the images directly instead:

from pathlib import Path
import os
import skimage.io

training_path = Path('Data/Training')  # or whatever your path is
training = dict()
training['data'] = []
training['label'] = []

# read all jpg/png images in training_path; the subfolder name is the class label
for subdir in os.listdir(training_path):
    current_path = os.path.join(training_path, subdir)
    for file in os.listdir(current_path):
        if file[-3:] in {'jpg', 'png'}:
            im = skimage.io.imread(os.path.join(current_path, file))
            training['label'].append(subdir)
            training['data'].append(im)

which I got from here. Then I used

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

onehot_encoder = OneHotEncoder(sparse=False)

X_train = np.array(training['data'])
Y_train = np.array(training['label'])

# flatten each image (height x width x channels) into one feature vector per sample
a, b, c, d = np.shape(X_train)
X_train = X_train.reshape(a, b*c*d)

# turn the string labels into integers, then one-hot encode them
int_encoded = LabelEncoder().fit_transform(Y_train)
int_encoded = int_encoded.reshape(len(int_encoded), 1)
Y_train_ohe = onehot_encoder.fit_transform(int_encoded)

which I puzzled together after reading this.
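
With the arrays in that shape, fitting a tree then works. A minimal sketch of that last step, assuming a DecisionTreeClassifier since this is a classification task (note that sklearn's tree classifiers also accept the string labels in Y_train directly, so the one-hot targets are mainly needed for estimators that require numeric targets):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, Y_train_ohe)  # one-hot targets are handled as a multi-output problem
# tree.fit(X_train, Y_train) would work as well, since string labels are accepted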
