I have two datasets of images, each with subjects 1-200 and c (e.g. c=8) images per subject. I want to divide each of these datasets into training and testing sets for my algorithm. I need to do this for the following cases:
CASES REQUIRED
- CASE 1 Randomly select k1 images (k1<c) of each subject for training and k2 images (k2<c and k1+k2<=c) of each subject for testing. So the training set has k1*200 images and the testing set has k2*200 images. The subjects are completely overlapping in the training and testing sets.
Please Note Since the same subjects are used in both the training and test sets, the k1 training images and the k2 testing images of a subject must not overlap. For example, with k1=3 and k2=3, pick any 3 images per subject for training and any other 3 from the remaining images of that subject for testing. This is why the constraint k1+k2<=c is necessary.
- CASE 2 The training set consists of t randomly chosen subjects and the testing set consists of the remaining 200-t subjects, so the subjects in the training and testing sets are completely non-overlapping. Randomly select k1 images (k1<c) of each of the t training subjects and k2 images of each of the 200-t testing subjects. So the training set has k1*t images and the testing set has k2*(200-t) images. Here k1+k2 need not be equal to c, and k1=k2 is possible.
Please Note Since different subjects are used in the training and test sets, k1 and k2 may overlap and the constraint k1+k2<=c is not necessary.
- CASE 3 The training and testing sets consist of images from all the subjects, i.e. the subjects are completely overlapping in both sets. Randomly select m (e.g. m=470) images from the database for the training set such that at least i (e.g. i=2) images per subject are present (i<c). The training set then has m images and the testing set consists of the remaining 200*c-m images.
I want to code this in MATLAB. Any help will be greatly appreciated. Thanks in advance.
EDIT I have tried to implement it in MATLAB. I am giving the code here:
%% Read the data
%% My data reads as follows:
  Name            Size    Bytes  Class     Attributes
  a_data         99x1     12672  cell
  a_labels        1x99      792  double
  c               1x1         8  double
  card_a         11x2       176  double
  unq_a_lab       1x11       88  double
% where a_data is my total dataset.
% Assume that it contains total 99 images.
% a_labels is the labels associated with the images.
% c is the minimum number of images available for any subject (class),
% calculated as min(card(subj1),card(subj2),.....)
% card_a is the cardinality of each class present in the database
% card_a = [1,2,3,4,......;10,9,11,9,.....] i.e. card of subj 1 = 10
% card of subj 2 = 9 ,...etc
% unq_a_lab : the unique subject labels present in the database.
% Assume there are 11 of them (as given).
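The bookkeeping variables described above can be derived from the label vector alone. A minimal sketch on toy labels (the values below are made up for illustration and are not the real data); the layout of card_a follows the 11x2 size shown in the workspace listing, which is an assumption about how it is stored:

```matlab
% Toy labels standing in for a_labels (hypothetical values).
a_labels = [1 1 1 2 2 3 3 3 3];

unq_a_lab = unique(a_labels);                    % distinct subject ids, sorted
counts = zeros(1, numel(unq_a_lab));
for s = 1:numel(unq_a_lab)
    counts(s) = sum(a_labels == unq_a_lab(s));   % images belonging to subject s
end

card_a = [unq_a_lab(:), counts(:)];   % one row per subject: [id, image count]
c = min(counts);                      % smallest per-subject image count
```

With these toy labels the subjects have 3, 2 and 4 images, so c is 2.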
CASE 1
%% CASE 1 COMPLETELY OVERLAPPING DATASET EQUAL SIZED PARTITIONS
% Split the dataset randomly into training and testing subsets
% trainset - each subject k1 images
% testset - each subject k2 images
% bear in mind constraint : k1+k2<=c
% Total training set = k1*no. of subjects
% Total testing set = k2*no. of subjects
% Both training and testing sets (subjects) are completely overlapping
%split 1
k1 = 3;
%split 2
k2 = 3;
Train_data_a = cell(length(unq_a_lab)*k1,1);
Test_data_a = cell(length(unq_a_lab)*k2,1);
tr_a_labels = zeros(1,length(unq_a_lab)*k1);
tst_a_labels = zeros(1,length(unq_a_lab)*k2);
t1=0; t2=0;
for i=1:length(unq_a_lab)
id = randperm(c);
% split it into 1:k1 and k1+1:k1+k2 points
for j=1:k1
Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));
end
for j=1:k2
Test_data_a{t2+j} = a_data{c*(i-1)+id(j+k1)};
tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j+k1));
end
t1 = t1+k1; t2 = t2+k2;
end
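The loop above locates subject i's images with c*(i-1), which only lines up if every subject has exactly c images stored contiguously. Since card_a lists different counts per subject (10, 9, 11, ...), a variant that looks the positions up through the labels may be safer. A minimal sketch on toy data (the labels and the stand-in a_data are hypothetical):

```matlab
% Toy data (hypothetical): 3 subjects with 4, 3 and 5 images.
a_labels = [1 1 1 1 2 2 2 3 3 3 3 3];
a_data   = num2cell(1:numel(a_labels))';   % stand-in for the image cell array
k1 = 2;  k2 = 1;                           % k1+k2 must not exceed the smallest class

subj    = unique(a_labels);
tr_idx  = zeros(1, numel(subj)*k1);
tst_idx = zeros(1, numel(subj)*k2);
for s = 1:numel(subj)
    pos = find(a_labels == subj(s));       % actual positions of this subject's images
    pos = pos(randperm(numel(pos)));       % shuffle them
    tr_idx((s-1)*k1 + (1:k1))  = pos(1:k1);        % first k1 go to training
    tst_idx((s-1)*k2 + (1:k2)) = pos(k1+1:k1+k2);  % next k2 go to testing
end

Train_data_a = a_data(tr_idx);   tr_a_labels  = a_labels(tr_idx);
Test_data_a  = a_data(tst_idx);  tst_a_labels = a_labels(tst_idx);
```

Because both index sets are cut from the same shuffled list, a subject's training and testing images cannot coincide.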
CASE 2 (a)
%% CASE 2 COMPLETELY NON-OVERLAPPING DATASETS EQUAL SIZED PARTITIONS
% Split the dataset randomly into training and testing subsets
% trainset - each subject k1 images
% testset - each subject k2 images
% Total training set = k1* cardinality of Train Set
% Total testing set = k2* cardinality of Test Set
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping
% p1 = number of subjects in training set
% p2 = number of subjects in testing set
%split 1
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1); % between 1 and size_p-1, so neither set is empty
p2 = size_p-p1;
Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
t1=0; t2=0;
for i=1:length(unq_a_lab)
id = randperm(c);
% split it into 1:k1 and 1:k2 points
if i<=p1
for j=1:k1
Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));
end
t1 = t1+k1;
end
if i>p1
for j=1:k2
Test_data_a{t2+j} = a_data{c*(i-1)+id(j)};
tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j));
end
t2 = t2+k2;
end
end
CASE 2 (b)
Here the randomization is done so that p1 subjects are chosen at random out of all the subjects and the rest form the p2 subjects.
%split 1
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1); % between 1 and size_p-1, so neither set is empty
p2 = size_p-p1;
Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
t1=0; t2=0;
for i=1:length(unq_a_lab)
id = randperm(c);
% split it into 1:k1 and 1:k2 points
if i<=p1
for j=1:k1
Train_data_a{t1+j} = a_data{c*(x(i)-1)+id(j)};
tr_a_labels(1,t1+j) = a_labels(c*(x(i)-1)+id(j));
end
t1 = t1+k1;
end
if i>p1
for j=1:k2
Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
end
t2 = t2+k2;
end
end
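The same subject-disjoint split can also be written with label lookups instead of the c*(x(i)-1) offsets, which again avoids assuming that every subject has exactly c contiguous images. A minimal sketch on toy data; the labels, the stand-in a_data and the fixed number of training subjects t are assumptions for illustration:

```matlab
% Toy data (hypothetical): 4 subjects with 4, 3, 5 and 3 images.
a_labels = [1 1 1 1 2 2 2 3 3 3 3 3 4 4 4];
a_data   = num2cell(1:numel(a_labels))';   % stand-in for the image cell array
t  = 2;            % number of training subjects
k1 = 2;  k2 = 3;   % images per training / testing subject

subj = unique(a_labels);
subj = subj(randperm(numel(subj)));   % random subject order
train_subj = subj(1:t);               % first t subjects train
test_subj  = subj(t+1:end);           % the rest test

tr_idx = [];
for s = train_subj
    pos = find(a_labels == s);
    pos = pos(randperm(numel(pos)));
    tr_idx = [tr_idx, pos(1:k1)];     % k1 random images of this subject
end
tst_idx = [];
for s = test_subj
    pos = find(a_labels == s);
    pos = pos(randperm(numel(pos)));
    tst_idx = [tst_idx, pos(1:k2)];   % k2 random images of this subject
end

Train_data_a = a_data(tr_idx);   tr_a_labels  = a_labels(tr_idx);
Test_data_a  = a_data(tst_idx);  tst_a_labels = a_labels(tst_idx);
```

Here k1 and k2 only need to stay within the smallest per-subject count, since no subject contributes to both sets.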
CASE 3
%% CASE 3 COMPLETELY NON OVERLAPPING DATASETS UNEQUAL SIZED PARTITIONS
%% Split the dataset randomly into training and testing subsets
% trainset - Total m images and each subject at least having i=floor(m/p1) images
% testset - each subject k2 images
% Total training set = m images
% Total testing set = k2*p2 images
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
% p1 = round((size_p-1)*rand);
p1 = 6;
p2 = size_p-p1;
%split 1
m = 29;
min_reqd = floor(m/p1);
%split 2
k2 = 3;
Train_data_a = cell(m,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,m);
dummy_labels = tr_a_labels;
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
% filling up the first min_reqd for each class
t1=1;
for j=1:p1
idx = randperm(c);
idx = idx(1:min_reqd);
for k=1:min_reqd
dummy_labels(t1) = c*(x(j)-1)+idx(k);
t1 = t1+1;
end
end
% form the numberset
num_pack = zeros(1,c*p1);
t2=1;
for j=1:p1
for k=1:c
num_pack(1,t2) = c*(x(j)-1)+k;
t2 = t2+1;
end
end
% getting the indices that have not been already selected previously
% using the set difference operation
% setdiff(A,B) is the values of A that are not in B
new_a_labels = setdiff(num_pack,dummy_labels);
idx = randperm(length(new_a_labels));
% randomly selecting the remaining number of values from the set difference
% subset
idx = new_a_labels(idx(1:m-(min_reqd*p1)));
% inserting the values into the matrix
dummy_labels(t1:t1+length(idx)-1) = idx;
% sorting the matrix
[val,idx] = sort(dummy_labels);
% rearranging the matrix
dummy_labels = dummy_labels(idx);
% using the indices of the dummy variables to get the training set and
% their corresponding labels
for i=1:m
Train_data_a{i} = a_data{dummy_labels(i)};
tr_a_labels(1,i) = a_labels(dummy_labels(i));
end
% getting the testing set as previously done in case 2
t2=0;
for i=1:length(unq_a_lab)
% Random selection of k2 points for the testing set
id = randperm(c);
if i>p1
for j=1:k2
Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
end
t2 = t2+k2;
end
end
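For comparison, CASE 3 as stated in the list at the top (all subjects in both sets, m training images with at least i per subject, everything else for testing) might be sketched as follows. This is only one possible approach on toy labels: the data values are hypothetical, and the minimum is named i_min here rather than i to avoid clashing with a loop variable or the imaginary unit.

```matlab
% Toy labels (hypothetical): 3 subjects with 4, 3 and 5 images.
a_labels = [1 1 1 1 2 2 2 3 3 3 3 3];
m     = 7;    % total number of training images
i_min = 2;    % minimum number of training images per subject

subj = unique(a_labels);
% m must cover the per-subject minimum and cannot exceed the database size
assert(m >= i_min*numel(subj) && m <= numel(a_labels));

% Step 1: guarantee i_min random images of every subject.
tr_idx = [];
for s = subj
    pos = find(a_labels == s);
    pos = pos(randperm(numel(pos)));
    tr_idx = [tr_idx, pos(1:i_min)];
end

% Step 2: fill up to m with random images not chosen yet, from any subject.
rest   = setdiff(1:numel(a_labels), tr_idx);
rest   = rest(randperm(numel(rest)));
extra  = m - numel(tr_idx);
tr_idx = sort([tr_idx, rest(1:extra)]);

% Step 3: everything that is not in the training set goes to testing.
tst_idx = setdiff(1:numel(a_labels), tr_idx);

tr_a_labels  = a_labels(tr_idx);
tst_a_labels = a_labels(tst_idx);
```

Note that this sketch does not force every subject to appear in the testing set: a subject with few images could end up entirely in training if m is large.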
NOTE
I believe my CASE 1 and CASE 2 are correct; if they are wrong, please point it out. I mainly need help with CASE 3: I have implemented it but am not at all sure about it.