Not sure why dcast() on this data set results in dropping variables

Question

I have a data frame that looks like:

   id fromuserid touserid from_country to_country length
1   1   54525953 47195889           US         US      2
2   2   54525953 54361607           US         US      1
3   3   54525953 53571081           US         US      2
4   4   41943048 55379244           US         US      1
5   5   47185938 53140304           US         PR      1
6   6   47185938 54121387           US         US      1
7   7   54525974 50928645           GB         GB      1
8   8   54525974 53495302           GB         GB      1
9   9   51380247 45214216           SG         SG      2
10 10   51380247 43972484           SG         US      2

Each row describes a number of messages (length) sent from one user to another user.

What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.

There are almost 200 countries. I use the function dcast as follows:

countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)

This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.

At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.

So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?

Are the variables "from_country" and "to_country" factor or character variables? Do they both have the same levels? — A5C1D2H2I1M1N2O1R2T1, Mar 13 '13 at 17:43
Good question. There are apparently more levels (199) in the "to_country" column than the "from_country" column (189), meaning that a few countries receive messages but have not sent them. If I set the levels for both columns to be the same, then this should work? — Evan Zamir, Mar 13 '13 at 17:51
I was posting an answer as you were commenting. Yes, I believe that it should work if both those columns share the same factor levels. — A5C1D2H2I1M1N2O1R2T1, Mar 13 '13 at 17:53
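A quick way to confirm the mismatch described in these comments is to compare the distinct values of the two columns. This is only a sketch in base R using the ten sample rows from the question; the names `from_vals`, `to_vals`, `only_to` and `only_from` are made up for illustration.

```r
# Sample rows from the question (country columns only).
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  stringsAsFactors = FALSE
)

# as.character() makes the comparison work for both character and factor columns.
from_vals <- unique(as.character(chats$from_country))
to_vals   <- unique(as.character(chats$to_country))

# Countries that appear on only one side are the ones that make the
# cast result non-square.
only_to   <- setdiff(to_vals, from_vals)   # receive but never send
only_from <- setdiff(from_vals, to_vals)   # send but never receive

print(only_to)
print(only_from)
```

On the full data set, a non-empty result on either side would point to the rows or columns that go missing.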

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-03-13T18:24:47.583

3

Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:

chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(chats$from_country, 
                                               chats$to_country)))
chats$to_country <- factor(chats$to_country, 
                           levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
#   from_country US GB SG PR
# 1           US  5  0  0  1
# 2           GB  0  2  0  0
# 3           SG  1  0  1  0
# 4           PR  0  0  0  0
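Note that the output above reports "defaulting to length", so each cell counts rows rather than adding up the `length` column. If the goal is the total number of messages per country pair, one option is to sum that column explicitly. The sketch below swaps in base R's `xtabs()` instead of `dcast()` so that it runs without extra packages; `all_levels` and `msg_matrix` are illustrative names. With reshape2, the analogous call would presumably pass `value.var = "length"` and `fun.aggregate = sum` to `dcast()`.

```r
# Sample rows from the question.
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Give both columns the same set of levels so the result is square.
all_levels <- unique(c(chats$from_country, chats$to_country))
chats$from_country <- factor(chats$from_country, levels = all_levels)
chats$to_country   <- factor(chats$to_country,   levels = all_levels)

# Sum the message counts for every (from, to) pair; unused levels are kept,
# so PR still gets a row even though it never sends anything.
msg_matrix <- xtabs(length ~ from_country + to_country, data = chats)
print(msg_matrix)
```

For the sample data the US to US cell becomes 7 (the sum of the `length` values) rather than 5 (the number of rows).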

If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:

chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(levels(chats$from_country), 
                                               levels(chats$to_country))))

Why is this necessary? If the columns are already factors, then c(chats$from_country, chats$to_country) coerces them to their underlying integer codes (this was the behavior of c() before R 4.1.0; later versions combine factors into a factor). Since those codes don't match any of the character labels of the factor levels, every value becomes <NA>.
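The effect can be reproduced on a toy example. Because the result of `c()` on factors depends on the R version, this sketch builds the integer codes explicitly with `as.integer()` to mimic the older behavior; `from`, `to`, `codes`, `bad` and `good` are illustrative names only.

```r
from <- factor(c("US", "GB"))
to   <- factor(c("US", "PR"))

# Integer codes of the two factors, i.e. what combining factors with c()
# used to return in older R versions.
codes <- c(as.integer(from), as.integer(to))

# Using those codes as levels: none of them matches "US" or "GB",
# so every element turns into NA.
bad <- factor(from, levels = unique(codes))
print(bad)

# Using the level labels instead keeps the values intact and adds "PR"
# as an extra (unused) level.
good <- factor(from, levels = union(levels(from), levels(to)))
print(good)
```

This matches the symptom reported in the comments, where the first two commands filled both columns with `NA` values.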

edited Mar 13 '13 at 18:24

answered Mar 13 '13 at 17:52

A5C1D2H2I1M1N2O1R2T1


Was just going to post a solution but it's essentially the same. Nice job. – Dason Mar 13 '13 at 17:53
It's very strange. When I run those first two commands, I just get a bunch of values in those two columns. Any idea why? It's pretty clear you got it to work with my data, so why can't I? – Evan Zamir Mar 13 '13 at 18:09
@EvanZamir, try changing the first set of levels using `unique(c(levels(chats$from_country), levels(chats$to_country)))` since it appears your columns are already factors (hence the `NA` values). – A5C1D2H2I1M1N2O1R2T1 Mar 13 '13 at 18:16
I did that, and I now definitely have the same levels for both columns, but when I do dcast, I get the error: "Error in split_indices(.group, .n) : n smaller than largest index". Any idea? – Evan Zamir Mar 13 '13 at 18:19
@EvanZamir, I've never encountered that error in my experience, but I've seen it mentioned somewhere before (perhaps even SO). Let me look for a couple of minutes. – A5C1D2H2I1M1N2O1R2T1 Mar 13 '13 at 18:25
Even after doing all your suggestions, it still doesn't work. It makes sense, though, so I'm going to check the check button. But there must be something wrong with my data set. – Evan Zamir Mar 13 '13 at 19:38
@EvanZamir, did you ever figure out this `split_indices` error? I found the question I had remembered seeing [here](http://stackoverflow.com/questions/14548570/plyr-split-indices-function-crashes-for-long-vectors), but unfortunately I don't see it definitively answered. – A5C1D2H2I1M1N2O1R2T1 Mar 24 '13 at 18:09
