2

I am trying to train a YOLO network on my custom dataset. I have some images (*.jpg) and the labels/annotations in YOLO format as .txt files.

Now I want to split the data into a training set and a validation set. As a result, I want a train folder and a validation folder, each with its own images and annotations.

I tried something like this:

from sklearn.model_selection import train_test_split
import glob


# Get all paths to your images files and text files
PATH = '../TrainingsData/'
img_paths = glob.glob(PATH+'*.jpg')
txt_paths = glob.glob(PATH+'*.txt')
    
X_train, X_test, y_train, y_test = train_test_split(img_paths, txt_paths, test_size=0.3, random_state=42)

After saving the sets to new folders, the images and annotations got mixed up. For example, in the train folder some images had no annotation (their annotations were in the validation folder), and there were some annotations whose image was missing.
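A likely cause of the mix-up: train_test_split keeps the two lists aligned only by position, and glob.glob does not guarantee that the image list and the label list come back in matching order. A minimal sketch of one way around this is to split the image paths only and derive each label path from its image name. The function name split_pairs and its parameters are hypothetical, and random.Random stands in for train_test_split here only to keep the sketch free of dependencies. It assumes each label sits next to its image with the same name and a .txt extension.

```python
import random
from pathlib import Path


def split_pairs(image_paths, test_size=0.3, seed=42):
    """Split image paths and return (image, label) pairs for train and validation."""
    images = sorted(Path(p) for p in image_paths)  # fixed order before shuffling
    random.Random(seed).shuffle(images)            # reproducible shuffle
    n_val = round(len(images) * test_size)
    val_images, train_images = images[:n_val], images[n_val:]

    def with_labels(imgs):
        # The label path is derived from the image path, so a pair cannot be separated
        return [(img, img.with_suffix(".txt")) for img in imgs]

    return with_labels(train_images), with_labels(val_images)


# Example with made-up file names
train_pairs, val_pairs = split_pairs([f"data/img_{i}.jpg" for i in range(10)])
```

Each pair can then be copied with shutil.copy into its train or validation folder.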

Can you help me to split my dataset?

Basti

5 Answers

3

OK, you can do it like this.

Function to split the images and labels

import os
import shutil
from tqdm import tqdm


def split_img_label(data_train, data_test, folder_train, folder_test):

    # Create the output folders (no error if they already exist)
    os.makedirs(folder_train, exist_ok=True)
    os.makedirs(folder_test, exist_ok=True)

    # First pass fills the train folder, second pass the test folder
    for data, folder in ((data_train, folder_train), (data_test, folder_test)):
        for img_path in tqdm(list(data)):
            # The label has the same name as the image, with a .txt extension
            txt_path = os.path.splitext(img_path)[0] + '.txt'
            # shutil.copy replaces the shell 'cp' call, so this also runs on Windows
            shutil.copy(img_path, folder)
            shutil.copy(txt_path, folder)

CODE


import os
import pandas as pd
from sklearn.model_selection import train_test_split

PATH = './TrainingsData/'
# Only the images are listed; each label is expected next to its image as <name>.txt
list_img = [img for img in os.listdir(PATH) if img.endswith('.jpg')]

# Full path of every image
path_img = [PATH + img for img in list_img]

df = pd.DataFrame(path_img)

# split the image paths only; the labels follow their images
data_train, data_test, labels_train, labels_test = train_test_split(df[0], df.index, test_size=0.20, random_state=42)

# Output folder names (example values, choose your own)
folder_train_name = 'train'
folder_test_name = 'test'

# Function split
split_img_label(data_train, data_test, folder_train_name, folder_test_name)

OUTPUT

len(list_img)
583

100%|████████████████████████████████████████████████████████████████████████████████| 466/466 [00:26<00:00, 17.42it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 117/117 [00:07<00:00, 16.61it/s]

Finally, you will have two folders (folder_train_name and folder_test_name), each containing images together with their matching labels.
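To confirm that every image really ended up next to its label, a quick check along the following lines may help. This is only a sketch: the function name find_unpaired is hypothetical, and the demo runs on a temporary folder with made-up file names rather than on the real dataset.

```python
import tempfile
from pathlib import Path


def find_unpaired(folder, img_ext=".jpg", label_ext=".txt"):
    """Return (images without a label, labels without an image) as sorted name stems."""
    folder = Path(folder)
    image_stems = {p.stem for p in folder.glob(f"*{img_ext}")}
    label_stems = {p.stem for p in folder.glob(f"*{label_ext}")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Demo: 'a' is a complete pair, 'b' has no label, 'c' has no image
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "a.txt", "b.jpg", "c.txt"):
        (Path(tmp) / name).touch()
    missing_labels, missing_images = find_unpaired(tmp)
```

Both lists should be empty for a correctly split folder. Note that a non-annotation text file such as classes.txt would be reported as a label without an image.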

Datexland
  • Thank you very much :) This is what I was looking for. – Basti Mar 17 '21 at 14:28
  • @Alexbonella, I am running this code but my output test and train folders only have one image and label each. – Rashik Mar 27 '21 at 02:26
  • @Alexbonella, I have 40 folders and I need to split all of the images and their labels. Could you help me? – N.white Apr 17 '23 at 12:04
  • @Datexland when calling split_img_label, a NameError appeared saying that name 'folder_train_name' is not defined. Please tell me how to solve this issue. – N.white May 03 '23 at 05:50
  • @Datexland how do I write the 'cp' command on Windows to copy the files? I tried copy but it didn't work. I appreciate your help. – N.white May 14 '23 at 11:15
2
import glob
import os
import random
import shutil

filelist = glob.glob('train/*.txt')
# Randomly pick 15% of the label files for the test set
test = random.sample(filelist, int(len(filelist) * 0.15))
output_path = 'test/'
os.makedirs(output_path, exist_ok=True)

for txtpath in test:
    # The image has the same name as its label file, with a .jpg extension
    impath = txtpath[:-4] + '.jpg'
    out_text = os.path.join(output_path, os.path.basename(txtpath))
    out_image = os.path.join(output_path, os.path.basename(impath))
    print(txtpath, impath, out_text, out_image)
    # shutil.move needs no shell command and also handles paths with spaces
    shutil.move(txtpath, out_text)
    shutil.move(impath, out_image)

Set the train and test folder paths, and set the fraction of images to be moved to test (0.15 here).

shutil.move replaces the original shell mv call (powershell mv on Windows, mv elsewhere), so the script runs unchanged on any platform.

1

@N.white You can use the same code as above; you just have to add and change the following lines:


import shutil


def split_img_label_2(data_train, data_test, folder_train, folder_test):

    # The output folders are no longer created here, see below
    #os.mkdir(folder_train)
    #os.mkdir(folder_test)

    # First pass fills the train folder, second pass the test folder
    for data, folder in ((data_train, folder_train), (data_test, folder_test)):
        for img_path in tqdm(list(data)):
            # Copy the image and its same-named .txt label
            shutil.copy(img_path, folder)
            shutil.copy(os.path.splitext(img_path)[0] + '.txt', folder)


# Create the two output folders once, before the loop
# (folder_train_name and folder_test_name are defined as in the code above)
os.makedirs(folder_train_name, exist_ok=True)
os.makedirs(folder_test_name, exist_ok=True)
list_folder = [folder1,folder2,.........folder40]  # placeholder: the paths of your 40 source folders

for folder_name in list_folder:

    PATH = folder_name  # pass the right path

    list_img = [img for img in os.listdir(PATH) if img.endswith('.jpg')]
    path_img = [os.path.join(PATH, img) for img in list_img]

    df = pd.DataFrame(path_img)

    # split
    data_train, data_test, labels_train, labels_test = train_test_split(df[0], df.index, test_size=0.20, random_state=42)

    # Function split
    split_img_label_2(data_train, data_test, folder_train_name, folder_test_name)

NOTE: Keep in mind that folder_train_name and folder_test_name stay the same for every source folder, so that you end up with a single train folder and a single test folder containing all the images.
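One thing to watch when merging many folders into one: if two source folders contain files with the same name, a plain copy silently overwrites the earlier file. A minimal sketch of one way to avoid that is to prefix each copied file with the name of its source folder. The helper name copy_pair_prefixed is hypothetical, and the demo uses temporary folders with made-up names.

```python
import shutil
import tempfile
from pathlib import Path


def copy_pair_prefixed(img_path, dest_folder):
    """Copy an image and its .txt label, prefixing both with the source folder name."""
    img_path = Path(img_path)
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    prefix = img_path.parent.name  # e.g. 'folder1' for 'folder1/img.jpg'
    copied = []
    for src in (img_path, img_path.with_suffix(".txt")):
        dst = dest_folder / f"{prefix}_{src.name}"
        shutil.copy(src, dst)
        copied.append(dst)
    return copied


# Demo: two source folders that both contain img.jpg and img.txt
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("folder1", "folder2"):
        (root / folder).mkdir()
        (root / folder / "img.jpg").touch()
        (root / folder / "img.txt").touch()
        copy_pair_prefixed(root / folder / "img.jpg", root / "train")
    merged = sorted(p.name for p in (root / "train").iterdir())
```

Because image and label receive the same prefix, they still share a name and stay paired in the merged folder.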

Datexland
0

If you want to split your images and labels in order to train your custom model, I recommend the following steps:

  1. Create an obj folder with the images and labels.
  2. Create and run the generate_train.py script:
#generate_train.py
import os

image_files = []
os.chdir(os.path.join("data", "obj"))
for filename in os.listdir(os.getcwd()):
    if filename.endswith(".jpg"):
        image_files.append("data/obj/" + filename)
os.chdir("..")
with open("train.txt", "w") as outfile:
    for image in image_files:
        outfile.write(image)
        outfile.write("\n")
    outfile.close()
os.chdir("..")
  3. Finally, once you have the train.txt file, you can run the code below:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('PATH/data/train.txt', header=None)

# sklearn split: 80% train, 20% test
data_train, data_test, labels_train, labels_test = train_test_split(df[0], df.index, test_size=0.20, random_state=42)

# train.txt contains the paths of the images (and thus labels) used for training
with open("train.txt", "w") as outfile:
    for image_path in data_train:
        outfile.write(image_path)
        outfile.write("\n")

# test.txt contains the paths of the images (and thus labels) used for testing
with open("test.txt", "w") as outfile:
    for image_path in data_test:
        outfile.write(image_path)
        outfile.write("\n")

Now you are ready to train your model:

YOLO

!./darknet detector train data/obj.data cfg/yolov4-FENO.cfg yolov4.conv.137 -dont_show -map

TINY

!./darknet detector train data/obj.data cfg/yolov4_tiny.cfg yolov4-tiny.conv.29 -dont_show -map

Datexland
  • Thank you very much for the answer, but it is not exactly what my problem is. I want to generate folders which contain the images and the labels (.txt file). So as a result I have one folder with some images and their label for training and another folder with different images and their label for validation. – Basti Mar 15 '21 at 07:56
0

Splitting image datasets is tricky because, as you discovered, if you don't do it right the annotations and images end up in separate folders.

I had this same issue when working with YOLO annotations and ended up creating a Python package called PyLabel to do it as a school project.

I have a sample notebook to demonstrate how to split a data set into 2 or 3 groups here https://github.com/pylabel-project/samples/blob/main/dataset_splitting.ipynb.

Using PyLabel the code would be something like this:

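from pylabel import importer

# path_to_annotations: folder that contains the YOLO .txt annotation files
dataset = importer.ImportYoloV5(path_to_annotations)
dataset.splitter.StratifiedGroupShuffleSplit(train_pct=.6, val_pct=.2, test_pct=.2, batch_size=1)
dataset.analyze.ShowClassSplits()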

ShowClassSplits reports how the classes are distributed across the split groups, so you can inspect whether the splits are balanced.

PyLabel has 2 methods for splitting data:

  1. GroupShuffleSplit, which uses the GroupShuffleSplit command from sklearn.
  2. StratifiedGroupShuffleSplit, which attempts to balance the distribution of classes evenly across the split groups.
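Independently of PyLabel, the class balance of a split can also be checked by hand, since every line of a YOLO annotation file starts with the class id. The following is only a sketch of that idea, not part of PyLabel: the function name class_counts is hypothetical, the demo uses a temporary folder with made-up annotations, and the folder is assumed to hold annotation files only.

```python
import tempfile
from collections import Counter
from pathlib import Path


def class_counts(label_folder):
    """Count annotated objects per class id across all YOLO .txt files in a folder."""
    counts = Counter()
    for label_file in Path(label_folder).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1  # first field is the class id
    return counts


# Demo: two annotation files with three objects in total
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("0 0.5 0.5 0.2 0.2\n1 0.3 0.3 0.1 0.1\n")
    (Path(tmp) / "b.txt").write_text("0 0.7 0.7 0.2 0.2\n")
    counts = class_counts(tmp)
```

Running it once on the train folder and once on the validation folder and comparing the proportions gives a rough picture of how balanced the split is.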

Hope this helps anyone reading this. If you run into any issues feel free to contact me and I can help.

alexheat