I am processing a large, already-cleaned dataset. The data is used to create an adjacency matrix, which is then passed to a logical evaluation (logEval) to identify the observations that contain each uniqueID.
When I run the code snippet below to create the adjacency matrix, it takes a huge amount of time (and sometimes it just freezes).
This is because the function checks each of the unique elements (n = 10901) against every observation and marks TRUE/FALSE for each appearance. A greatly reduced example:
|Obs_1 |Obs_2 |Obs_3 |Obs_4 |Obs_5 | logEval|
|:-----|:-----|:-----|:-----|:-----|-------:|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |TRUE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |TRUE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |TRUE |FALSE |FALSE | 1|
|TRUE |FALSE |FALSE |FALSE |FALSE | 1|
|FALSE |FALSE |FALSE |FALSE |TRUE | 1|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
|FALSE |FALSE |FALSE |FALSE |FALSE | 0|
In actuality there are 43 observation columns and more than 100,000 comparisons.
Problem: R crashes. Is there a better way to run this so it doesn't crash due to size?
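For scale, a back-of-envelope count of what the nested `sapply`/`apply` approach below actually does (my numbers; the real row count may differ from the 700-row snippet):

```r
# one %in% scan over 43 columns per (unique id, row) pair
901 * 700 * 43     # snippet: at most 901 unique ids -> ~27 million comparisons
10901 * 700 * 43   # real id count (n = 10901) -> ~328 million per 700 rows
```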
Code snippet:

```r
library(data.table)

# 700 rows x 43 columns of IDs drawn from 500000:500900
# (identical to writing out 43 separate sample() calls)
df1 <- as.data.table(replicate(43, sample(500000:500900, 700, replace = TRUE)))
setnames(df1, paste0("col", 1:43))
# find all unique ids across the table
uniqueIDs <- as.character(unique(unlist(df1)))

# create the adjacency matrix: one row per observation, one column per
# unique id, TRUE where that id appears in the observation
mat <- sapply(uniqueIDs, function(s) apply(df1, 1, function(x) s %in% x))

# clean-up
colnames(mat) <- uniqueIDs
rownames(mat) <- paste0("row", seq_len(nrow(df1)))

# transpose so each row is a unique id
mat <- data.table(t(mat))

# apply logical evaluation to count the number of TRUE per unique id
mat$logEval <- rowSums(mat)
```
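One direction I have been sketching to avoid the per-element `%in%` scans (untested at full scale, and assuming the `Matrix` package is acceptable): melt the table to long form and build the same row-by-id information as a sparse matrix.

```r
library(data.table)
library(Matrix)

# long form: one (row, id) pair per cell; copy() leaves df1 untouched
long <- melt(copy(df1)[, row := .I], id.vars = "row",
             variable.name = "col", value.name = "id")
long <- unique(long[, .(row, id)])   # ignore duplicate ids within a row

# sparse incidence matrix: rows of df1 x unique ids, 1 where the id occurs
ids <- sort(unique(long$id))
inc <- sparseMatrix(i = long$row, j = match(long$id, ids), x = 1,
                    dims = c(nrow(df1), length(ids)),
                    dimnames = list(NULL, as.character(ids)))

# same counts as the logEval column: number of rows containing each id
idCounts <- colSums(inc)
```

Memory then scales with the number of (row, id) pairs rather than rows × unique ids, which should avoid the blow-up.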
A small update to make my overall goal clear:
- The dataset has x (43) obs, and each obs has y (200) nbrIDs.
- The goal of the code above is to create an adjacency matrix that identifies which nbrIDs (y) appear per column. [For example, from the unique nbrIDs: does y(1) appear in x(i); does y(2); ... does y(900)?]
- I am not concerned with x, per se. The end goal is: from the unique IDs throughout the matrix, which uniqueIDs appear together and how often [this is why I create the logical test to count n(i) == TRUE]. For those > 2, I can filter, since such rows likely share nbrIDs (see the sketch after the sample matrix below).
Sample end matrix:

```
From    To      Weight
50012   50056   5
50012   50032   3
…
50063   50090   9
```
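If the sparse `inc` matrix from the sketch above works out, that end matrix should fall out of its cross-product as an edge list (again just a sketch, not tested at full scale):

```r
# co-occurrence counts: co[i, j] = number of rows containing both id i and id j
co <- crossprod(inc)

# keep each unordered pair once (strict upper triangle), then list the edges
co <- as(triu(co, k = 1), "TsparseMatrix")
edges <- data.table(From   = rownames(co)[co@i + 1],
                    To     = colnames(co)[co@j + 1],
                    Weight = co@x)
edges <- edges[Weight > 2]   # filter for pairs appearing together > 2 times
```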
Man, that's a mouthful!