
I am processing a large dataset (already cleaned). The data set is then used to build an adjacency matrix, which is passed through a logical evaluation (logEval) to identify the observations that contain each unique ID.

When running the code snippet to create the adjacency matrix, the process takes a very long time (and sometimes it just freezes).

Obviously, this is because the function checks each of the unique elements (n = 10901) and marks TRUE/FALSE depending on whether it appears in the observation. An example (greatly reduced):

  |Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
  |:-----|:-----|:-----|:-----|:-----|-------:|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |TRUE  |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |TRUE  |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |TRUE  |FALSE |FALSE |       1|
  |TRUE  |FALSE |FALSE |FALSE |FALSE |       1|
  |FALSE |FALSE |FALSE |FALSE |TRUE  |       1|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|
  |FALSE |FALSE |FALSE |FALSE |FALSE |       0|

In actuality, Obs = 43 and there are >100,000 comparisons.
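(With the sample code below, roughly 900 unique ids, 700 rows, and 43 values per row, that is on the order of 900 × 700 × 43 ≈ 27 million element comparisons.)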

Problem: R crashes. Is there a better way to run this without having it crash due to size?

Code snippet:

df1<-data.table(col1=sample(500000:500900,700,replace = T),
                col2=sample(500000:500900,700,replace = T),
                col3=sample(500000:500900,700,replace = T),
                col4=sample(500000:500900,700,replace = T),
                col5 = sample(500000:500900,700,replace = T),
                col6 = sample(500000:500900,700,replace = T),
                col7 = sample(500000:500900,700,replace = T),
                col8 = sample(500000:500900,700,replace = T),
                col9 = sample(500000:500900,700,replace = T),
                col10 = sample(500000:500900,700,replace = T),
                col11 = sample(500000:500900,700,replace = T),
                col12 = sample(500000:500900,700,replace = T),
                col13 = sample(500000:500900,700,replace = T),
                col14 = sample(500000:500900,700,replace = T),
                col15 = sample(500000:500900,700,replace = T),
                col16 = sample(500000:500900,700,replace = T),
                col17 = sample(500000:500900,700,replace = T),
                col18 = sample(500000:500900,700,replace = T),
                col19 = sample(500000:500900,700,replace = T),
                col20 = sample(500000:500900,700,replace = T),
                col21 = sample(500000:500900,700,replace = T),
                col22 = sample(500000:500900,700,replace = T),
                col23 = sample(500000:500900,700,replace = T),
                col24 = sample(500000:500900,700,replace = T),
                col25 = sample(500000:500900,700,replace = T),
                col26 = sample(500000:500900,700,replace = T),
                col27 = sample(500000:500900,700,replace = T),
                col28 = sample(500000:500900,700,replace = T),
                col29 = sample(500000:500900,700,replace = T),
                col30 = sample(500000:500900,700,replace = T),
                col31 = sample(500000:500900,700,replace = T),
                col32 = sample(500000:500900,700,replace = T),
                col33 = sample(500000:500900,700,replace = T),
                col34 = sample(500000:500900,700,replace = T),
                col35 = sample(500000:500900,700,replace = T),
                col36 = sample(500000:500900,700,replace = T),
                col37 = sample(500000:500900,700,replace = T),
                col38 = sample(500000:500900,700,replace = T),
                col39 = sample(500000:500900,700,replace = T),
                col40 = sample(500000:500900,700,replace = T),
                col41 = sample(500000:500900,700,replace = T),
                col42 = sample(500000:500900,700,replace = T),
                col43 = sample(500000:500900,700,replace = T))


#find all ids via table
uniqueIDs<-as.character(unique(unlist(df1)))

df1<-data.table(df1)

#creating adjacency matrix
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))

#clean-up 
colnames(mat) <- uniqueIDs

rownames(mat) <- paste0("row", seq(nrow(df1)))

mat<-data.table(mat)

mat<-data.table(t(mat))

#apply logical evaluation to count number of TRUE
mat$logEval<-rowSums(mat==TRUE)

I want to make a small update to ensure my overall goal is clear:

  • The dataset has x (43) obs and each obs has y (200) nbrids.

  • The goal of running the above code is to create an adjacency matrix to identify the nbrids (y) that appear per column. [For example, from the unique nbrids: does y(1) appear in x(i); does y(2); ... does y(900)?]

  • I am not concerned with x, per se. The end goal is:

From the unique IDs throughout the matrix, which unique IDs appear together and how often [this is why I create the logic test to count .n(i)==TRUE]? For those with a count > 2, I can filter, as it is likely that such rows share nbrids.

Sample end matrix:

    From      To        Weight
    50012     50056     5
    50012     50032     3
    …
    50063     50090     9

Man, that's a mouthful.

OctoCatKnows
  • I see data.table littered through your code but you are not actually using the package where it counts. If you want to profit from its efficiency I suggest you study the package vignettes and fix that. – Roland Jun 28 '19 at 13:02
  • The issue isn't lack of packages, it's using them correctly. In addition to what Roland said, you can also do the same efficiently in base R by harnessing vectorisation. The nested `*apply` in your code is a warning sign that this will be inefficient, since it's a quadratic operation (probably cubic, given the `%in%` operator being used). What's more, `apply` operates *on a matrix*, so every one of your `apply` calls will copy the whole data.table each time to convert it to a matrix first (see the sketch after these comments). – Konrad Rudolph Jun 28 '19 at 13:24
  • @KonradRudolph wow I didn't realize the table was being copied each time (no wonder it is slow; that's 10k copies) – OctoCatKnows Jun 28 '19 at 13:39
  • @Roland you're right. Looking at CRAN, I see that I can optimize the initial read by utilizing fread vice read.csv (in the beginning of all this, I import ~700 xlsx files) – OctoCatKnows Jun 28 '19 at 13:49
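A minimal base-R sketch of the point in Konrad's comment, reusing df1 and the ids from the question (illustrative only, not a drop-in replacement): coerce the table to a matrix once, then compare each unique id against it with vectorised operations instead of a nested apply.

m <- as.matrix(df1)                        # coerce to a matrix once, outside any *apply loop
ids <- unique(as.vector(m))
# one row per unique id, one column per observation: TRUE if that id occurs in the row of df1
member <- t(vapply(ids, function(id) rowSums(m == id) > 0, logical(nrow(m))))
dimnames(member) <- list(as.character(ids), paste0("row", seq_len(nrow(m))))
logEval <- rowSums(member)                 # how many observations each unique id appears in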

2 Answers


If I understand your requirement correctly then the following should work:

df1 = …
tdf1 = as.data.frame(t(df1))
unique_ids = as.character(unique(unlist(df1)))
# mat = sapply(tdf1, `%in%`, x = unique_ids)
mat = vapply(tdf1, `%in%`, logical(length(unique_ids)), x = unique_ids)
rownames(mat) = unique_ids
colnames(mat) = paste0('row', seq_len(ncol(mat))) # ??? Really?!
log_eval = rowSums(mat)

Note in particular how mat in my code doesn’t need to be transposed because it’s already in the correct orientation. The commented-out sapply line is equivalent to the vapply line but the latter is more explicit and performs stricter type checking, and is thus less error-prone if the data changes unexpectedly. vapply is probably also more efficient but with your example data the difference isn’t noticeable.
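Concretely, mat[i, j] records whether unique_ids[i] occurs anywhere in row j of df1. A quick sanity check of that reading (a sketch reusing the names above):

unique_ids[1]                                        # the id being tested
unlist(df1[1, ])                                     # the values of the first observation
mat[1, 1] == (unique_ids[1] %in% unlist(df1[1, ]))   # TRUE if the interpretation above holds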

Incidentally, to generate a random df1 you can shorten your 43-line code to

df1 = as.data.frame(replicate(43, sample(500000 : 500900, 700, replace = TRUE)))
Konrad Rudolph
  • Thank you! Btw what does your comment mean (‘really’?) – OctoCatKnows Jun 28 '19 at 13:51
  • @BuffsGrad16 You are assigning *column names* but call them “row1”, “row2”, …. Your original code makes this less obvious because there you’re assigning row names — but then you’re immediately transposing the table! – Konrad Rudolph Jun 28 '19 at 13:52
  • That is because of the transpose. From the output I was getting, observing T/F for each unique number for each observation lets me know that [this obs (i of 43) has nbr1 (j of uniqueids) but not nbr2] – OctoCatKnows Jun 28 '19 at 13:55
  • Hi @Konrad. I had a chance to run this. A little confused on the output. It appears that the rownames are uniqueids; what information are the rows providing? The length of each row is 700 (same as # of observations) but the observations are unique. What then does, for example, mat[1,1] tell us? (I set seed to 1234) – OctoCatKnows Jun 28 '19 at 17:37

2nd Edit:

These options seem to get to the expected output in your edit. Both rely on self-joins to find which combinations occur. The first option uses lapply() to do the self-join one column at a time, while the second melt()s the dataset and then self-joins it all at once. For smaller datasets, lapply() is slower, but when trying 7,000 rows it still came through, whereas the melt-and-self-join created too large a data frame.

One additional note: this dataset doesn't really have many unique values. If I knew it was sparse, I'd likely add a line to filter out values that are not duplicated across the entire dataset (see the sketch below).
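A sketch of what such a pre-filter could look like (my assumption: an id that occurs in only one column can never form a pair with weight above 1, so it can be dropped before the self-join; dt1 is the table generated below):

library(data.table)

# drop ids that appear in a single column of dt1 before building pairs (illustrative)
id_cols   <- melt(dt1, measure.vars = names(dt1))[, .(n_cols = uniqueN(variable)), by = value]
rare_ids  <- id_cols[n_cols < 2L, value]
molten_dt <- unique(melt(dt1, measure.vars = names(dt1)))[!value %in% rare_ids]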

library(data.table)

# generate data -----------------------------------------------------------
set.seed(1234)
dt1<- data.table(replicate(43, sample(500000:500900,700, replace = TRUE)))

rbindlist(
  lapply(dt1
       , function(x) {
         nbrid_dt = data.table(nbrid = unique(x))

         nbrid_dt[nbrid_dt
                  , on = .(nbrid < nbrid)
                  , j = .(From = x.nbrid, To = i.nbrid)
                  , nomatch = 0L
                  , allow.cartesian = T]
       }
       )
  )[, .N, keyby = .(From, To)]

          From     To  N
     1: 500000 500001 11
     2: 500000 500002 11
     3: 500000 500003  7
     4: 500000 500004  9
     5: 500000 500005 13
    ---                 
405446: 500897 500899 12
405447: 500897 500900 10
405448: 500898 500899 13
405449: 500898 500900 12
405450: 500899 500900  9

#all at once

molten_dt <- unique(melt(dt1))
setkey(molten_dt, variable)

molten_dt[molten_dt
          , on = .(value < value
                   ,variable = variable
                    )
          , .(From = x.value, To = i.value)
          , allow.cartesian = TRUE
          , nomatch = 0L
          ][!is.na(From), .N, keyby = .(From, To)]
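Either option ends up in the From / To / weight shape the question asks for. Assuming the aggregated result is stored as res (an assumption; neither snippet above assigns it), a sketch of handing it to igraph, the package the OP mentions in the comments, could be:

library(igraph)

# res: the From / To / N table produced by either option above
edges <- res[N > 2]                                            # keep pairs seen together more than twice
g <- graph_from_data_frame(edges[, .(From, To, weight = N)], directed = FALSE)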

Original: I'm not fully following, but if you are mainly after the counts across your 43 columns, it may be beneficial to gather / melt the data.

molten_dt <- melt(dt1)

molten_dt[, N := length(unique(variable)), by = value]

      variable  value  N
   1:       V1 500102  9
   2:       V1 500560  8
   3:       V1 500548  9
   4:       V1 500561 12
   5:       V1 500775  9
  ---                                    
8596:      V43 500096  7
8597:      V43 500320  6
8598:      V43 500205 14
8599:      V43 500711  7
8600:      V43 500413 11

#or you can aggregate instead of mutate-in-place

molten_dt[, .(N = length(unique(variable))), by = value]

      value  N
  1: 500102  9
  2: 500560  8
  3: 500548  9
  4: 500561 12
  5: 500775  9
 ---          
897: 500753  4
898: 500759  4
899: 500816  6
900: 500772  4
901: 500446  2

Also, my answer doesn't 100% agree with @Konrad. When there are duplicated values, there seems to be one additional count for @Konrad's solution.

Data:

set.seed(1234)
dt1<- as.data.table(replicate(43, sample(500000 : 500900, 200, replace = TRUE)))
#h/t to @Konrad for the quick way to make 43 columns

1st Edit: If you are only interested in the count of each value, you can do the following:

mat_data <- matrix(replicate(43, sample(500000 : 500900, 700, replace = TRUE)), ncol = 43)

table(unlist(apply(mat_data, 2, unique)))

It's the fastest approach, but the problem is that you lose information about which column supplied the value.

Unit: milliseconds
           expr     min      lq     mean   median       uq      max neval
 melt_and_count 53.3914 53.8926 57.38576 55.95545 58.55605  79.2055    20
  table_version 11.0566 11.1814 12.24900 11.56760 12.82110  16.4351    20
 vapply_version 63.1623 64.8274 69.86041 67.84505 71.40635 108.2279    20
Cole
  • Thanks for the alternative, however, the goal is to see connections between nbrids. For example, if 50012 AND 500711 occur in V1; V5; V6, that would give a count (weight) of [3]… similarly, if 500102 AND 500561 occur in V1; V6; V10; V22, that would give the connection a weight of 4. The idea is to use graph theory and create nodes/edges from the connections (e.g. From 50012 | To 500711 | weight 3) – OctoCatKnows Jun 29 '19 at 12:59
  • See the edit at the top of my post. Also, regarding your edit with the example: can you adjust it to ```set.seed(1234)```? My results are different. – Cole Jun 29 '19 at 20:24
  • Cole this is great! And exactly what I have been looking for. This actually gets to end goal faster than making the adjacency matrix and grouping logic counts. I went further and built it in to igraph. Thanks! – OctoCatKnows Jun 30 '19 at 15:08
  • I tried to utilize this with another data set; however, I am getting an error in setkey(molten_dt, variable): some columns are not in data.table: variable. The str() is almost identical, with the difference being the values are chr vice int. – OctoCatKnows Jul 01 '19 at 13:10
  • I'm not sure my solution would work on ```chr``` values. Both options rely on non-equi joins in which comparisons are made. However, for your direct error, you can skip that ```setkey(...)``` statement. That's just for performance. – Cole Jul 03 '19 at 02:44
  • I ended up converting to integer and it worked just dandy (a sketch of the conversion follows). Results (res) = 250 000 000 [2.6 GB]. Suffice to say... some filtering occurred (based on frequency) – OctoCatKnows Jul 03 '19 at 16:16
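For reference, a sketch of that character-to-integer conversion in data.table (assuming the ids are numeric strings; the column selection is illustrative):

library(data.table)

# convert every character column to integer so the non-equi self-join can compare values
chr_cols <- names(dt1)[vapply(dt1, is.character, logical(1))]
dt1[, (chr_cols) := lapply(.SD, as.integer), .SDcols = chr_cols]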