
I have a data frame as below:

group   sex    age
    A     M     15
    A     F     17
    A     M     12
    A     F      2
    A     F      6
    A     M      3
    A     M     10
    A     M     18
    B     F     16
    B     M      6
    B     M     18
    B     M     15
    B     F      8
    B     F     17
    B     M     18
    B     M     16
    B     F     13
    B     F      5
    B     F     13
    B     F      4
    B     M     15
    B     M      8
    B     M     18
    C     F      7
    C     M     12
    C     M      3
    C     F      1
    C     F      9
    C     F      2

Expected result for this data frame:

      A B C
    A 0 4 3
    B 4 0 0
    C 3 0 0

I would like to generate a matrix showing the similarity among the groups in the input data, based on "age". For example, if group A and group B have 2 ages in common, then the entry for the pair A, B will be 2.
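For anyone who wants to reproduce the answers below, the data frame above can be rebuilt like this (a sketch; the variable name `df` is what both answers assume):

```r
# Rebuild the question's data frame
df <- data.frame(
  group = rep(c("A", "B", "C"), times = c(8, 15, 6)),
  sex   = c("M", "F", "M", "F", "F", "M", "M", "M",
            "F", "M", "M", "M", "F", "F", "M", "M", "F", "F", "F", "F", "M", "M", "M",
            "F", "M", "M", "F", "F", "F"),
  age   = c(15, 17, 12, 2, 6, 3, 10, 18,
            16, 6, 18, 15, 8, 17, 18, 16, 13, 5, 13, 4, 15, 8, 18,
            7, 12, 3, 1, 9, 2),
  stringsAsFactors = FALSE
)
str(df)
```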

JIT

2 Answers


One solution with outer:

library(magrittr)

# Count the distinct ages shared by groups u and v (0 on the diagonal);
# Vectorize() lets outer() call the function elementwise
func = Vectorize(function(u, v)
{
    if (all(u == v)) return(0)
    intersect(subset(df, group == u)$age, subset(df, group == v)$age) %>% unique %>% length
})

x = df$group %>% unique
m = outer(x, x, func)
row.names(m) = colnames(m) = x

#>m
#  A B C
#A 0 4 3
#B 4 0 0
#C 3 0 0
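The same pairwise count can be sketched without magrittr using `combn` over the unordered group pairs (base R only; `intersect` already de-duplicates, so each shared age is counted once). The data frame here is rebuilt from the question's table so the snippet runs on its own:

```r
# Sample data from the question (sex column omitted; only group/age matter here)
df <- data.frame(
  group = rep(c("A", "B", "C"), times = c(8, 15, 6)),
  age   = c(15, 17, 12, 2, 6, 3, 10, 18,
            16, 6, 18, 15, 8, 17, 18, 16, 13, 5, 13, 4, 15, 8, 18,
            7, 12, 3, 1, 9, 2),
  stringsAsFactors = FALSE
)

# Fill a symmetric group-by-group matrix with counts of shared ages
groups <- unique(df$group)
m <- matrix(0L, length(groups), length(groups), dimnames = list(groups, groups))
for (p in combn(groups, 2, simplify = FALSE)) {
  shared <- length(intersect(df$age[df$group == p[1]], df$age[df$group == p[2]]))
  m[p[1], p[2]] <- m[p[2], p[1]] <- shared
}
m
#   A B C
# A 0 4 3
# B 4 0 0
# C 3 0 0
```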
Colonel Beauvel
  • thanks Colonel Beauvel, but this isn't my expected result. You can see my expected result in the answer below. – JIT Feb 03 '15 at 09:03
  • @JIT, please add the expected result to your original post, not as an answer. –  Feb 03 '15 at 09:07
  • ok, @Pascal, I will try to do better. This is my first post, so I still make some mistakes. – JIT Feb 03 '15 at 09:29

We could merge the dataset ("df") with itself by "age" on a subset of the dataset ("df[-2]", i.e. with the second column removed), remove the rows where "group.x" and "group.y" are the same, and reshape the unique dataset ("df1") from "long" to "wide" using acast.

 df1 <- subset(merge(df[-2], df[-2], by.x='age',
                          by.y='age'), group.x!=group.y)

 library(reshape2)
 acast(unique(df1), group.x~group.y, value.var='age')
 #   A B C
 #A 0 4 3
 #B 4 0 0
 #C 3 0 0

Or use xtabs from base R

 xtabs(~group.x+group.y, unique(df1))
 #     group.y
 #group.x A B C
 #      A 0 4 3
 #      B 4 0 0
 #      C 3 0 0
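For the question's data, the same matrix can also be produced in one step with `crossprod` on the incidence table of unique age/group pairs (a base-R sketch; the `unique` step matters here because group B contains repeated ages, which would otherwise inflate the counts):

```r
# Sample data from the question (sex column omitted; only group/age matter here)
df <- data.frame(
  group = rep(c("A", "B", "C"), times = c(8, 15, 6)),
  age   = c(15, 17, 12, 2, 6, 3, 10, 18,
            16, 6, 18, 15, 8, 17, 18, 16, 13, 5, 13, 4, 15, 8, 18,
            7, 12, 3, 1, 9, 2),
  stringsAsFactors = FALSE
)

# Incidence table of unique (age, group) pairs: rows = ages, columns = groups;
# crossprod() then counts, for each pair of groups, how many ages they share
tbl <- crossprod(table(unique(df[c("age", "group")])))
diag(tbl) <- 0  # a group trivially shares all of its ages with itself
tbl
# entries: A-B = 4, A-C = 3, B-C = 0
```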

Update

Regarding the new dataset/expected result, it is not clear which column should be included in the relationship with "re". Here, I used "pro_id" to get the expected result.

 tbl <- crossprod(table(df[c(3,1)]))
 diag(tbl) <- 0
 tbl
 #     re
 #re    144 205 209 222 235 250
 # 144   0   1   2   0   0   0
 # 205   1   0   1   0   0   0
 # 209   2   1   0   0   0   0
 # 222   0   0   0   0   0   1
 # 235   0   0   0   0   0   0
 # 250   0   0   0   1   0   0
akrun
  • can you tell me what `df[-2]` and `by.x`, `by.y` mean? thank you very much. – JIT Feb 03 '15 at 09:16
  • @JIT `df[-2]` I am subsetting the `df` as the 2nd column is not needed further. Regarding `by.x/by.y`, these are arguments in the `merge` where we specify the common columns between the two identical datasets. Otherwise, all the columns will be used. – akrun Feb 03 '15 at 09:20
  • does the function still work if the column type is character? – JIT Feb 03 '15 at 09:24
  • @JIT Based on the `str(df)`, group column is `character` and age is numeric/integer. Sorry, I didn't get your question. – akrun Feb 03 '15 at 09:26
  • that is the way to do it in base R. How do I do it if the data is a file with 400,000 rows or more? I use RevoScaleR and `rxReadNext()` to read each data chunk from the data source, and apply your solution to each chunk, but it is very slow, taking about 30 min... :(( thank you. @akrun – JIT Feb 04 '15 at 06:15
  • @JIT Please check if you convert the dataset to data.table, and use `dcast.data.table` makes it faster or not. ie. `DT <- setDT(df[c(3,1)]);m1 <- as.matrix( dcast.data.table(DT, pro_id~re, value.var='re', length)[,-1, with=FALSE]);crossprod(m1)` – akrun Feb 04 '15 at 06:25
  • it is not faster. My idea is to apply your solution in Revolution R. If I load the whole dataset into memory, it uses a lot of memory, so I use `rxReadNext()` to read the data source chunk by chunk and process each chunk. But it is still slow with your solution. thank you! @akrun – JIT Feb 04 '15 at 06:35
  • @JIT Sorry, I don't use Revolution R. – akrun Feb 04 '15 at 08:44