
I'll describe my problem as a list to make it easier to follow:

  • I have a MATLAB table T of size 1000x30.
  • The last column of the table, called 'Class', holds integer values ranging from 1 to 20.
  • So some rows have the value 1, which means these rows belong to "Class 1", some have the value 2, some have the value 20, and so on.
  • The number of rows differs from class to class, so there may be 100 rows of class 1 but only 10 rows of class 2, 500 rows of class 3, and so on.

This is what I want to do:

  • I want to find the class with the fewest rows and get its row count. Say Class 10 has the fewest rows, with count == 3, while every other class has more than 3 rows.
  • I will then add a new column called YesNo that holds only the values 0 or 1.
  • All rows of the class with the lowest count (e.g. Class 10 in this example) will have the value 1.
  • From every other class, I want to randomly select as many rows as the smallest class has (3 in this example).
  • These randomly selected rows get the value 1 in the new column YesNo, and the rows that were not chosen get 0.
  • So in this example, the new column ends up with 1000 values, of which 3*20 are 1 (3 is the number of rows in the smallest class and 20 is the number of classes) and the rest are 0.

How can this be done in MATLAB R2015b? I know that I can create a new column in the table using T.YesNo = newArr; where newArr is a 1000x1 double containing 0 and 1 values.

As a small example, if T is 10x3 and has only 3 classes (1,2,3), below is how T looks:

ID  Name    Class   
0   'a'     3
1   'b'     2
2   'a'     2
3   'b'     2
4   'a'     3
5   'a'     1
6   'a'     1
7   'b'     2
8   'b'     1
9   'a'     2

As shown above, Class 3 has the lowest count, with only 2 rows. So I want to randomly select two rows from each of Class 1 and Class 2 and set the new column to 1 for these randomly selected rows and for all rows of Class 3, while the rest get 0, as shown below:

ID  Name    Class   YesNo
0   'a'     3       1
1   'b'     2       0
2   'a'     2       1
3   'b'     2       0
4   'a'     3       1
5   'a'     1       0
6   'a'     1       1
7   'b'     2       0
8   'b'     1       1
9   'a'     2       1
Dev-iL
Tak
  • Which step(s) are you having difficulty with? – Dev-iL Mar 22 '17 at 06:40
  • @Dev-iL sorry, I've updated my question to make it clearer. In the first part I explain the structure of my table, and in the second part what I want to do. – Tak Mar 22 '17 at 06:41
  • Any chance you could show an example with a small version of `T` and the expected output...? Oh, and please also mention the version of MATLAB you're using. – Dev-iL Mar 22 '17 at 06:43
  • Yeah, I saw. You already got my +1 ;) – Dev-iL Mar 22 '17 at 07:00
  • @Dev-iL thank you. I hope someone will be able to help :) – Tak Mar 22 '17 at 07:00

1 Answer


See code below. It should be self-explanatory. If something is unclear - please ask.

function q42944288
%% Definitions
MAX_CLASS = 20;
%% Setup
tmp = struct;
tmp.Data = rand(1000,1);
tmp.Class = uint8(randi(MAX_CLASS,1000,1)); % uint8 for efficiency
T = table(tmp.Data,tmp.Class,'VariableNames',{'Data','Class'});
%% Solution:
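% Step 1: count rows per class. Explicit bin edges keep bin k aligned with
% class k even if the lowest class labels happen to be absent from the data.
[count,minVal] = min(histcounts(double(T.Class),0.5:(MAX_CLASS+0.5)));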
% Steps 2+3:
T.YesNo = T.Class == minVal;
% Steps 4+5+6:
whichClass = bsxfun(@eq,T.Class,1:MAX_CLASS); % >=R2007a syntax
% whichClass = T.Class == 1:MAX_CLASS; % This is a logical array, >=R2016b syntax.
for indC = setdiff(1:MAX_CLASS,minVal) 
  inds = find(whichClass(:,indC));
  T.YesNo(inds(randperm(numel(inds),count))) = true; 
end
%% Test:
fprintf(1,'\nThe total number of classes is %d', numel(unique(T.Class)));
fprintf(1,'\nThe minimal count is %d',count);
fprintf(1,'\nThe total number of 1''s in T.YesNo is %d', sum(T.YesNo));
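As a rough check, the same idea can be run on the 10-row example from the question. This is only a sketch, not the code above: it swaps `histcounts` and `bsxfun` for `unique` and `accumarray`, so it does not assume the class labels are exactly 1..MAX_CLASS. The names `classes`, `idx`, `counts` and `minCount` are my own.

```matlab
% Small example table from the question (10 rows, classes 1..3).
ID    = (0:9)';
Name  = {'a';'b';'a';'b';'a';'a';'a';'b';'b';'a'};
Class = [3;2;2;2;3;1;1;2;1;2];
T = table(ID, Name, Class);

% Count the rows of every class that actually occurs in the table.
[classes, ~, idx] = unique(T.Class);   % idx maps each row to its class
counts   = accumarray(idx, 1);         % counts(k) = number of rows in classes(k)
minCount = min(counts);                % size of the smallest class

% Pick minCount random rows from every class; the smallest class is
% therefore selected in full.
T.YesNo = false(height(T), 1);
for k = 1:numel(classes)
    rows = find(idx == k);
    pick = rows(randperm(numel(rows), minCount));
    T.YesNo(pick) = true;
end

disp(T)
```

Here `minCount` is 2 (Class 3), so `T.YesNo` ends up with 2 ones per class, 6 in total; which rows of Class 1 and Class 2 are chosen changes from run to run. `T.YesNo` is logical here; use `T.YesNo = double(T.YesNo);` if a double column is needed.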
Dev-iL
  • What if I wanted the class with the second or third lowest count to set the number of rows selected? For example, if I choose the third lowest class, all rows of the first and second lowest classes would get 1s, and so on. – Tak Mar 23 '17 at 13:48
  • @Tak I'm not sure exactly what you're trying to do, but instead of applying `min()` on the output(s) of `histcounts`, you should `sort` and then take whichever value you need. Please don't change the question after it was answered. If you have another problem, please post a new question and provide a link here for reference. – Dev-iL Mar 23 '17 at 13:56
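A minimal sketch of the `sort` suggestion in the last comment, under my reading of the follow-up: the k-th smallest class count sets how many rows are drawn per class, and classes with fewer rows than that are taken whole. The variable `k` and the names `target` and `nPick` are hypothetical, and the class vector is the small example from the question.

```matlab
Class = [3;2;2;2;3;1;1;2;1;2];     % example classes from the question
k = 2;                              % hypothetical choice: use the 2nd smallest count

[classes, ~, idx] = unique(Class);  % idx maps each row to its class
counts       = accumarray(idx, 1);  % rows per class
sortedCounts = sort(counts);        % ascending order
target       = sortedCounts(k);     % number of rows to draw from each class

YesNo = false(numel(Class), 1);
for c = 1:numel(classes)
    rows  = find(idx == c);
    nPick = min(numel(rows), target);   % smaller classes are selected in full
    YesNo(rows(randperm(numel(rows), nPick))) = true;
end
```

With the example data the counts are 3, 5 and 2, so `target` is 3: Class 3 contributes both of its rows and Class 1 and Class 2 contribute three rows each.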