
I have two distinct double variables with one column and 30000 rows each. For instance:

A=[53
76
41
74
34
237
43…]

B= [1985
1985
1985
…
1986
1986
…
2013…]

If I do:

size(unique(A),1)
ans = 261
size(unique(B),1)
ans = 27

But when I do:

D1=dummyvar(A)

I get a double matrix of 1s and 0s with 30000 rows and 355 columns, meaning that MATLAB has created 355 dummies instead of 261.

and,

D2=dummyvar(B)

I likewise get a double matrix with 2012 columns, which is also incorrect.

MATLAB is identifying more dummies in my categorical columns than expected, so I must be doing something wrong, but I don't know what, because this function worked for me previously. Can someone help me please? Thank you.

user3557054

2 Answers


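The number of columns of dummyvar(A) is max(A), not the number of distinct values in A: dummyvar treats each positive integer as a category index, so it creates one column for every integer from 1 to max(A), including values that never occur. This example should clarify: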

>> A = [1;2;2;5]
A =
     1
     2
     2
     5
>> unique(A)
ans =
     1
     2
     5
>> dummyvar(A)
ans =
     1     0     0     0     0
     0     1     0     0     0
     0     1     0     0     0
     0     0     0     0     1

If you want to avoid those all-zero columns, use the third output of unique to "remove the gaps" in A, and then apply dummyvar:

>> A = [1;2;2;5]
A =
     1
     2
     2
     5
>> [~, ~, uA] = unique(A)
uA =
     1
     2
     2
     3
>> dummyvar(uA)
ans =
     1     0     0
     0     1     0
     0     1     0
     0     0     1
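
The same idea carries over to year-like data such as B in the question. The following is a minimal sketch, assuming a small made-up vector (the values are illustrative, not the asker's data). It also shows an alternative that is not part of the answer above: encoding directly and then dropping the all-zero columns with any. Both routes should give the same matrix.

```matlab
% Hypothetical year vector standing in for B (illustrative values only).
B = [1985; 1985; 1986; 1986; 1988; 2013];

% Option 1: relabel the values as consecutive indices 1..k, then encode.
[years, ~, idx] = unique(B);   % years: sorted distinct values; idx: position of each B(i) in years
D1 = dummyvar(idx);            % one column per distinct year

% Option 2: encode directly, then drop the columns that contain no 1s.
Dfull = dummyvar(B);           % max(B) columns, almost all of them empty
D2 = Dfull(:, any(Dfull, 1));  % keep only columns with at least one observation

disp(size(D1))                 % 6 rows, 4 columns
disp(isequal(D1, D2))          % the two options agree
```

Option 1 avoids building the wide intermediate matrix, which matters when max(B) is large relative to the number of distinct values.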
Luis Mendo
  • I understand your answer and maybe I have formulated my question in the wrong way. But my point is, if I have a variable that (let's see the example of B) has 1 column and 30000 rows with values from 1983 to 2012, why would I get 2012 dummies instead of 30 dummies, which is the number of different values we observe in B? – user3557054 Aug 21 '14 at 15:08
  • @user3557054 You don't understand the answer, or else you wouldn't still be confused. The `dummyvar` function assumes that you have variables labeled from `1` to `max(B)` - but that there aren't any observations of the variables `1` to `1982` (the first 1982 columns of the result will all be set to zero). – Chris Taylor Aug 21 '14 at 15:16
  • @user3557054 See edited answer. Maybe second part is what you want – Luis Mendo Aug 21 '14 at 15:22
  • @ChrisTaylor I understand the answer, but it does not solve my problem. What is the point of having so many dummies if I could have only 30? Do I have to delete them manually? As it stands, I will get an error in my regression because I have too many dummies. So I just thought there would be a way for MATLAB to recognize 'useless' dummies, so to say. – user3557054 Aug 21 '14 at 15:23
  • @user3557054 That's what my updated answer does. Have you checked it out? – Luis Mendo Aug 21 '14 at 15:27
  • @LuisMendo yes yes it solves exactly my problem, I just tried it. Thank you. – user3557054 Aug 21 '14 at 15:33

Maybe this function will be useful:

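function [result, columnNames] = dummyvarSmart(x)
    % Map each value of x to its index among the sorted distinct values,
    % so the indices run from 1 to the number of distinct values with no gaps.
    [columnNames, ~, indices] = unique(x);
    % One dummy column per distinct value.
    result = dummyvar(indices);
    % Return the distinct values as a row, aligned with the columns of result.
    columnNames = transpose(columnNames);
end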

You can use it like this:

>> B = sort(1983 + randi(30, 1000, 1));
>> min(B)
ans =
        1984
>> max(B)
ans =
        2013
>> [result, names] = dummyvarSmart(B);
>> size(result)
ans =
        1000        30
>> names(1:5)
ans =
        1984        1985        1986        1987        1988
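
The returned names make it possible to map a dummy column back to the value it represents. Here is a minimal sketch, assuming a small made-up vector B and inlining the two calls from dummyvarSmart so that it runs on its own; the variable names col and count are introduced here for illustration only.

```matlab
% Made-up data for illustration; not the asker's actual B.
B = [1985; 1985; 1986; 1988; 1986; 2013];

% Same steps as dummyvarSmart, inlined so the snippet is self-contained.
[names, ~, indices] = unique(B);
result = dummyvar(indices);
names = transpose(names);        % row vector, aligned with the columns of result

% Locate the dummy column for a given year and count its observations.
col = find(names == 1986);       % column index of the year 1986
count = sum(result(:, col));     % number of rows where B equals 1986
```

Looking columns up by value this way avoids relying on the position of a year within the matrix, which changes whenever the set of observed values changes.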
Chris Taylor