Divide a dataset into training and testing sets

Question

I have two datasets of images. Each contains subjects 1-200, with c images per subject (e.g. c=8). I want to divide these datasets into training and testing sets for my algorithm, for the following cases:

CASES REQUIRED

CASE 1 Randomly select k1 images (k1<c) of each subject for training and k2 images (k2<c and k1+k2<=c) of each subject for testing. So the training set has k1*200 images and the testing set has k2*200 images. The subjects completely overlap between the training and testing sets.

Please Note Since the same subjects appear in both the training and testing sets, the k1 training images and the k2 testing images of a subject must not overlap. For example, with k1=3 and k2=3, pick any 3 images per subject for training and any other 3 from the remaining images of that subject for testing. Hence the constraint k1+k2<=c is necessary.
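To make the constraint concrete, here is a minimal sketch of the Case 1 split. It is only an illustration under assumptions: the toy `labels` vector and the names `trainIdx`, `testIdx` and `pool` are hypothetical, and images are looked up by label rather than by assuming every subject occupies exactly c consecutive positions.

```matlab
% Toy data (hypothetical): 4 subjects with unequal image counts.
labels = [1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 4 4 4 4];
k1 = 2;  k2 = 2;                 % images per subject for training / testing

subjects = unique(labels);       % row vector of distinct subject labels
trainIdx = [];  testIdx = [];
for s = subjects
    pool = find(labels == s);    % positions of this subject's images
    assert(k1 + k2 <= numel(pool), 'k1 + k2 exceeds the images of subject %d', s);
    pool = pool(randperm(numel(pool)));        % shuffle within the subject
    trainIdx = [trainIdx, pool(1:k1)];         %#ok<AGROW> first k1 go to training
    testIdx  = [testIdx,  pool(k1+1:k1+k2)];   %#ok<AGROW> next k2 go to testing
end
```

With variables like those in this question, the sets would then be `a_data(trainIdx)` and `a_data(testIdx)`, with labels `a_labels(trainIdx)` and `a_labels(testIdx)`.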

CASE 2 The training set consists of t randomly chosen subjects and the testing set consists of the remaining 200-t subjects, so the subjects in the two sets are completely non-overlapping. Randomly select k1 images (k1<c) of each of the t training subjects and k2 images of each of the 200-t testing subjects. So the training set has k1*t images and the testing set has k2*(200-t) images. Here k1+k2 need not equal c, and k1=k2 is allowed.

Please Note Since the training and testing sets use different subjects, the image positions chosen for k1 and k2 may overlap, and the constraint k1+k2<=c is not necessary.
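A minimal sketch of the Case 2 idea, again with a hypothetical toy `labels` vector and illustrative names (`trainSubj`, `testSubj`, `trainIdx`, `testIdx`): the subjects are shuffled first, then images are drawn independently within each subject.

```matlab
% Toy data (hypothetical): 4 subjects, at least 4 images each.
labels = [1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 4 4 4 4];
t = 3;  k1 = 3;  k2 = 2;         % t training subjects; k1 / k2 images per subject

subjects  = unique(labels);
subjects  = subjects(randperm(numel(subjects)));   % random subject order
trainSubj = subjects(1:t);                         % t subjects for training
testSubj  = subjects(t+1:end);                     % the rest for testing

trainIdx = [];  testIdx = [];
for s = trainSubj
    pool = find(labels == s);
    pool = pool(randperm(numel(pool)));
    trainIdx = [trainIdx, pool(1:k1)];   %#ok<AGROW>
end
for s = testSubj
    pool = find(labels == s);
    pool = pool(randperm(numel(pool)));
    testIdx = [testIdx, pool(1:k2)];     %#ok<AGROW>
end
```

Because no subject appears in both sets, k1 and k2 only need to be at most the number of images of each subject; they are not tied together by k1+k2<=c.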

CASE 3 The training and testing sets both consist of images from all the subjects, i.e. the subjects completely overlap between the two sets. Randomly select m images (e.g. m=470) from the database for the training set such that at least i images (e.g. i=2) per subject are present (i<c). The training set then has m images and the testing set consists of the remaining 200*c-m images.
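One possible way to read Case 3 is sketched below: first reserve the minimum number of images per subject, then top up the training set at random from everything that is left. The toy `labels` vector and the names `minPer`, `rest`, `trainIdx` and `testIdx` are hypothetical.

```matlab
% Toy data (hypothetical): 4 subjects, 19 images in total.
labels = [1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 4 4 4 4];
m = 11;  minPer = 2;             % training set size; minimum images per subject

subjects = unique(labels);
N = numel(labels);
assert(m >= minPer * numel(subjects) && m <= N, 'm is out of range');

% Step 1: guarantee minPer random images of every subject.
trainIdx = [];
for s = subjects
    pool = find(labels == s);
    pool = pool(randperm(numel(pool)));
    trainIdx = [trainIdx, pool(1:minPer)];   %#ok<AGROW>
end

% Step 2: fill the remaining slots at random from the unused images.
rest = setdiff(1:N, trainIdx);
rest = rest(randperm(numel(rest)));
trainIdx = sort([trainIdx, rest(1:m - numel(trainIdx))]);

% Step 3: everything not used for training is the testing set.
testIdx = setdiff(1:N, trainIdx);
```

This sketch does not guarantee that every subject still has an image left for testing; if that is required, the top-up in step 2 would need a cap per subject.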

I want to code this in MATLAB. Any help will be greatly appreciated. Thanks in advance.

EDIT I have tried to implement it in MATLAB. I am giving the code here:

%% Read the data
%% My data reads as follows:
Name            Size            Bytes  Class     Attributes

a_data         99x1             12672  cell                
a_labels        1x99              792  double              
c               1x1                 8  double              
card_a         11x2               176  double              
unq_a_lab       1x11               88  double             

% where a_data is my total dataset. 
% Assume that it contains total 99 images. 
% a_labels is the labels associated with the images. 
% c is the minimum number of images per subject (class)
% c is calculated as min (card(subj1),card(subj2),.....)
% card_a is the cardinality of each class present in the database
% card_a = [1,2,3,4,......;10,9,11,9,.....] i.e. card of subj 1 = 10
% card of subj 2 = 9 ,...etc
% unq_a_lab : the unique subject labels present in the database. 
% Assume there are 11 of them (as given).
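For reference, the bookkeeping variables described above could be derived from the labels roughly as follows. This is a sketch on a hypothetical toy `a_labels` vector; the S-by-2 layout of `card_a` is an assumption based on the 11x2 size shown in the listing.

```matlab
a_labels  = [1 1 1 2 2 3 3 3 3];          % toy labels (hypothetical)
unq_a_lab = unique(a_labels);             % distinct subject labels, 1-by-S
counts    = arrayfun(@(s) sum(a_labels == s), unq_a_lab);   % images per subject
card_a    = [unq_a_lab(:), counts(:)];    % S-by-2: [label, cardinality]
c         = min(counts);                  % smallest number of images of any subject
```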

CASE 1

%% CASE 1 COMPLETELY OVERLAPPING DATASET EQUAL SIZED PARTITIONS
% Randomly split the dataset into training and testing subsets 
% trainset - each subject k1 images
% testset - each subject k2 images
% bear in mind constraint : k1+k2<=c
% Total training set = k1*no. of subjects
% Total testing set = k2*no. of subjects
% Both training and testing sets (subjects) are completely overlapping

%split 1 
k1 = 3;
%split 2
k2 = 3;

Train_data_a = cell(length(unq_a_lab)*k1,1);
Test_data_a = cell(length(unq_a_lab)*k2,1);
tr_a_labels = zeros(1,length(unq_a_lab)*k1);
tst_a_labels = zeros(1,length(unq_a_lab)*k2);

t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and k1+1:k1+k2 points
    for j=1:k1
        Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
        tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));
    end
    for j=1:k2
        Test_data_a{t2+j} = a_data{c*(i-1)+id(j+k1)};
        tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j+k1));        
    end
    t1 = t1+k1; t2 = t2+k2;
end

CASE 2 (a)

%% CASE 2 COMPLETELY NON-OVERLAPPING DATASETS EQUAL SIZED PARTITIONS
% Randomly split the dataset into training and testing subsets 
% trainset - each subject k1 images
% testset - each subject k2 images
% Total training set = k1* cardinality of Train Set
% Total testing set = k2* cardinality of Test Set
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping
% p1 = number of subjects in training set
% p2 = number of subjects in testing set

%split 1 
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1);   % random integer in 1..size_p-1, so neither set is empty
p2 = size_p-p1;

Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and 1:k2 points
    if i<=p1
        for j=1:k1
            Train_data_a{t1+j} = a_data{c*(i-1)+id(j)};
            tr_a_labels(1,t1+j) = a_labels(c*(i-1)+id(j));            
        end
        t1 = t1+k1;
    end
    
    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(i-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(i-1)+id(j));                    
        end
        t2 = t2+k2;
    end
end

CASE 2 (b)

Here the subjects are also randomized: p1 subjects are chosen at random out of all the subjects and the rest form the p2 testing subjects.

%split 1
k1 = 3;
%split 2
k2 = 3;
% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
p1 = randi(size_p-1);   % random integer in 1..size_p-1, so neither set is empty
p2 = size_p-p1;

Train_data_a = cell(p1*k1,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,p1*k1);
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
t1=0; t2=0;
for i=1:length(unq_a_lab)
    id = randperm(c);
    % split it into 1:k1 and 1:k2 points
    if i<=p1
        for j=1:k1
            Train_data_a{t1+j} = a_data{c*(x(i)-1)+id(j)};
            tr_a_labels(1,t1+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t1 = t1+k1;
    end    
    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t2 = t2+k2;
    end
end

CASE 3

%% CASE 3 COMPLETELY NON OVERLAPPING DATASETS UNEQUAL SIZED PARTITIONS
%% Randomly split the dataset into training and testing subsets
% trainset - Total m images and each subject at least having i=floor(m/p1) images
% testset - each subject k2 images
% Total training set = m images
% Total testing set = k2*p2 images
% cardinality of Train Set + cardinality of Test Set = Total cardinality of
% the database
% Both training and testing sets (subjects) are non-overlapping

% size of the partitions
% p1 = number of classes in the training sets
% p2 = number of classes in the testing sets
size_p = length(unq_a_lab);
% p1 = round((size_p-1)*rand);
p1 = 6;
p2 = size_p-p1;

%split 1
m = 29;
min_reqd = floor(m/p1);
%split 2
k2 = 3;

Train_data_a = cell(m,1);
Test_data_a = cell(p2*k2,1);
tr_a_labels = zeros(1,m);
dummy_labels = tr_a_labels;
tst_a_labels = zeros(1,p2*k2);
x = randperm(length(unq_a_lab));
% filling up the first min_reqd for each class
t1=1;
for j=1:p1
    idx = randperm(c);
    idx = idx(1:min_reqd);
    for k=1:min_reqd
        dummy_labels(t1) = c*(x(j)-1)+idx(k);
        t1 = t1+1;
    end
end
% form the numberset
num_pack = zeros(1,c*p1);
t2=1;
for j=1:p1
    for k=1:c
        num_pack(1,t2) = c*(x(j)-1)+k;
        t2 = t2+1;
    end
end
% getting the indices that have not been already selected previously
% using the set difference operation
% setdiff(A,B) is the values of A that are not in B
new_a_labels = setdiff(num_pack,dummy_labels);
idx = randperm(length(new_a_labels));
% randomly selecting the left amount of values from the set difference
% subset
idx = new_a_labels(idx(1:m-(min_reqd*p1)));
% inserting the values into the matrix
dummy_labels(t1:t1+length(idx)-1) = idx;
% sorting the matrix
[val,idx] = sort(dummy_labels);
% rearranging the matrix
dummy_labels = dummy_labels(idx);

% using the indices of the dummy variables to get the training set and 
% their corresponding labels
for i=1:m
    Train_data_a{i} = a_data{dummy_labels(i)};
    tr_a_labels(1,i) = a_labels(dummy_labels(i));
end

% getting the testing set as previously done in case 2
t2=0;
for i=1:length(unq_a_lab)
    % Random selection of k2 points for the testing set
    id = randperm(c);
    if i>p1
        for j=1:k2
            Test_data_a{t2+j} = a_data{c*(x(i)-1)+id(j)};
            tst_a_labels(1,t2+j) = a_labels(c*(x(i)-1)+id(j));
        end
        t2 = t2+k2;
    end
end

NOTE

I believe my CASE 1 and CASE 2 are correct; if they are wrong, please point it out. I need help with CASE 3: I have attempted it but am not at all sure about it.

You seem to be missing an important piece of information. **How** is your data represented? Also, how are your ground truth labels represented? Are they cell arrays? 2D or 3D matrices? We can't suggest something until we know how your data is structured. Also, saying *"I want this code in MATLAB"* suggests that you want us to write this code for you and that you haven't shown any effort. I think this is an interesting problem, but other people may not be willing to invest any effort in solving your problem. - rayryeng, Mar 16 '15 at 15:00
@rayryeng I don't understand your question regarding ground truth. Please clarify. I have greatly expanded the question and the cases required, and I have posted a code snippet for case 1. Is the question understandable now? Please let me know if any other changes are required. - roni, Mar 17 '15 at 10:55
@rayryeng I have added code for Case 2. Could you please tell me whether my approach is correct? I can't quite figure out case 3. - roni, Mar 17 '15 at 11:21