Can someone explain, for my better understanding, why the elapsed time of my R code is not linear? :)

# Result vectors, started at length 1 and grown one element at a time
url <- c(NA)
id <- c(NA)
time <- c(NA)
j <- 1      # next free slot in the result vectors
id_p <- ""  # user id of the previous row
for(i in 1:nrow(cookies_history)){
  # progress report every 50,000 rows
  if(i%%50000==0){
    print(i)
    print(Sys.time())
  }
  # extract the host: drop the leading "http://", keep up to the first "/"
  url_p <- substring(as.character(cookies_history$V3[i]), first=8)
  url_p <- substring(url_p, first=1, last=regexpr("/",url_p)[1]-1)
  if(cookies_history$V1[i]!=id_p){
    # new user id: always keep this row
    id_p <- cookies_history$V1[i]
    id[j] <- cookies_history$V1[i]
    url[j] <- url_p
    time[j] <- cookies_history$V2[i]
    j <- j+1
    url_p2 <- url_p
  }else{
    # same user: keep the row only if the host changed
    if(url_p!=url_p2){
      id[j] <- cookies_history$V1[i]
      url[j] <- url_p
      time[j] <- cookies_history$V2[i]
      j <- j+1
      url_p2 <- url_p
    }
  }
}

This is cookies data, where V1 is the user id, V2 the datetime, and V3 the full URL (a sample of `cookies_history` is given below). Here is what the print calls output:

[1] 50000
[1] "2016-01-19 19:42:28 EET"
[1] 100000
[1] "2016-01-19 19:42:58 EET"
[1] 150000
[1] "2016-01-19 19:43:31 EET"
[1] 200000
[1] "2016-01-19 19:44:23 EET"
[1] 250000
[1] "2016-01-19 19:45:20 EET"
[1] 300000
[1] "2016-01-19 19:46:24 EET"
[1] 350000
[1] "2016-01-19 19:47:37 EET"
[1] 400000
[1] "2016-01-19 19:48:53 EET"
[1] 450000
[1] "2016-01-19 19:51:00 EET"
[1] 500000
[1] "2016-01-19 19:53:22 EET"
[1] 550000
[1] "2016-01-19 19:56:18 EET"
[1] 600000
[1] "2016-01-19 19:58:50 EET"
[1] 650000
[1] "2016-01-19 20:02:04 EET"
[1] 700000
[1] "2016-01-19 20:05:14 EET"
[1] 750000
[1] "2016-01-19 20:09:17 EET"
[1] 800000
[1] "2016-01-19 20:13:14 EET"
[1] 850000
[1] "2016-01-19 20:17:18 EET"
[1] 900000
[1] "2016-01-19 20:21:59 EET"
[1] 950000
[1] "2016-01-19 20:26:33 EET"
[1] 1000000
[1] "2016-01-19 20:31:52 EET"
[1] 1050000
[1] "2016-01-19 20:36:50 EET"
[1] 1100000
[1] "2016-01-19 20:42:21 EET"
[1] 1150000
[1] "2016-01-19 20:47:33 EET"
[1] 1200000
[1] "2016-01-19 20:53:21 EET"
[1] 1250000
[1] "2016-01-19 20:59:49 EET"
[1] 1300000
[1] "2016-01-19 21:07:10 EET"
[1] 1350000
[1] "2016-01-19 21:16:30 EET"
[1] 1400000
[1] "2016-01-19 21:25:56 EET"
[1] 1450000
[1] "2016-01-19 21:34:50 EET"
[1] 1500000
[1] "2016-01-19 21:46:01 EET"
[1] 1550000

And I guess this kind of extraction is not worth doing in R? (For my research I want to extract 10 GB of CSV data files.) Sample:

structure(list(V1 = c(-2138197066L, -2138197066L, -2138197066L, 
-2138197066L, -2138197066L, -2138197066L, -2138197066L, -2138197066L, 
-2138197066L, -2138197066L), V2 = structure(c(8L, 9L, 10L, 7L, 
3L, 12L, 1L, 13L, 14L, 2L), .Label = c("2013-07-03 18:48:57", 
"2013-07-03 18:50:30", "2013-07-08 00:02:23", "2013-07-08 00:04:37", 
"2013-07-08 00:04:39", "2013-07-08 00:06:33", "2013-07-08 00:13:28", 
"2013-07-15 15:06:33", "2013-07-15 15:08:18", "2013-07-15 15:08:21", 
"2013-07-16 10:31:20", "2013-07-21 13:02:50", "2013-07-22 08:37:54", 
"2013-07-22 08:39:02", "2013-07-22 23:34:27", "2013-07-23 00:17:36", 
"2013-07-23 00:17:37", "2013-07-23 09:45:59", "2013-07-23 10:59:28"
), class = "factor"), V3 = structure(c(2L, 5L, 5L, 7L, 2L, 3L, 
2L, 8L, 11L, 4L), .Label = c("http://aka-cdn-ns.adtech.de/apps/415/Ad9253791St3Sz16Sq104537573V0Id1/iframe.html?adclick=http://adserver.adtech.de/adlink%7C323%7C4233738%7C0%7C16%7CAdId=9253791;BnId=1;itime=234786428;key=key1+key2+key3+key4;nodecode=yes;link=&adclickesc=http%3A//adserver.adtech.de/adlink%7C323%7C4233738%7C0%7C16%7CAdId%3D9253791%3BB", 
"http://ekstrabladet.dk/", "http://ekstrabladet.dk/biler/bil_anmeldelser/article2045233.ece", 
"http://ekstrabladet.dk/flash/dkkendte/article2030174.ece", "http://ekstrabladet.dk/flash/udlandkendte/article2038591.ece", 
"http://ekstrabladet.dk/flash/udlandkendte/article2047659.ece", 
"http://ekstrabladet.dk/musik/koncert_anmeldelser/article2034295.ece", 
"http://ekstrabladet.dk/nyheder/samfund/article2046966.ece", 
"http://ekstrabladet.dk/nyheder/samfund/article2048290.ece", 
"http://ekstrabladet.dk/sport/anden_sport/motorsport/formel_et/article2035091.ece", 
"http://ekstrabladet.dk/vrangen/article2046810.ece", "http://newz.dk/"
), class = "factor")), .Names = c("V1", "V2", "V3"), row.names = c(NA, 
10L), class = "data.frame")
    Could you please 1) Provide a small data sample (we don't need all of `cookies_history` but it would be useful to get the top 20 rows or so), and 2) explain what the code here is doing. – josliber Jan 19 '16 at 20:48
    By the way, the likely culprits here are the calls to `id[j] <- ...` and similarly for `url[j]` and `time[j]`. You are growing enormous vectors one element at a time, which is painfully inefficient in R. To learn more, please see the second circle of [The R Inferno](http://www.burns-stat.com/pages/Tutor/R_inferno.pdf). A quick adjustment to see if this is the issue would be replacing `url <- c(NA)` at the top with `url <- rep(NA, nrow(cookies_history))` and similarly for `id` and `time` and see if this speeds up the computation. – josliber Jan 19 '16 at 20:50
  • I just cut the URL down (string work) because I want to skip neighboring rows with the same id and URL, for preliminary analysis. Thanks josliber for the tip, I'll try it. – G dar Jan 19 '16 at 21:14

1 Answer

Without profiling it's hard to say, but as the earlier comment by josliber points out, R struggles with growing large data structures on the fly.

Re-run after replacing the first three lines with the following alternative, which pre-allocates blanks of the right class:

url <- rep(as.character(NA),nrow(cookies_history))
id <- rep(as.integer(NA),nrow(cookies_history))
time <- rep(as.character(NA),nrow(cookies_history))

(I'm assuming the classes of the url, id, and time variables; if they differ, substitute the correct as.xxxx() function.)
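
Equivalently, base R provides typed NA constants, which avoid the as.xxxx() coercion calls:

url <- rep(NA_character_, nrow(cookies_history))
id <- rep(NA_integer_, nrow(cookies_history))
time <- rep(NA_character_, nrow(cookies_history))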

There's no need to pre-fill each url[i] with meaningful dummy data, because under the hood each url[i] is actually a pointer to a character vector; changing the contents of a particular url[i] element does not modify the overall structure of the url vector, so the processing overhead is no different from overwriting a single variable (see http://adv-r.had.co.nz/C-interface.html#c-data-structures for the particulars).

Re-sizing the url (and id, and time) vectors by adding one element at a time, by contrast, involves repeatedly re-building the entire vector.
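
A minimal timing sketch of the difference (the size n is arbitrary, and absolute timings will vary by machine and R version):

n <- 1e5
# Growing: writing one past the end repeatedly re-allocates the vector
system.time({
  grown <- c()
  for (i in 1:n) grown[i] <- NA_character_
})
# Pre-allocated: each assignment writes into existing storage
system.time({
  prealloc <- rep(NA_character_, n)
  for (i in 1:n) prealloc[i] <- NA_character_
})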

tl;dr On-the-fly building or re-sizing of vectors is expensive.
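
Beyond pre-allocation, the loop can be replaced entirely: substring() and regexpr() are vectorized, and the loop effectively keeps a row exactly when its id or host differs from the immediately preceding row. A sketch of that approach (untested against the full data, and assuming the column meanings from the question):

# Vectorized host extraction, mirroring the two substring() calls in the loop
stripped <- substring(as.character(cookies_history$V3), first=8)
host <- substring(stripped, first=1, last=regexpr("/", stripped)-1)
n <- nrow(cookies_history)
# Keep row 1, plus every row whose id or host differs from the previous row
keep <- c(TRUE, cookies_history$V1[-1] != cookies_history$V1[-n] | host[-1] != host[-n])
result <- data.frame(id=cookies_history$V1[keep], url=host[keep], time=cookies_history$V2[keep])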

  • Also, since R does everything in RAM, it might pay dividends to first pre-process the 10 GB of data into a database, and work on portions or samples. – Jason Jan 19 '16 at 23:04
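
A minimal sketch of that idea using the DBI and RSQLite packages (the package choice, file names, and query are illustrative assumptions):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "cookies.sqlite")  # hypothetical database file
# Load each CSV file into the database once...
part <- read.csv("cookies_part1.csv", header=FALSE)  # hypothetical file name
dbWriteTable(con, "cookies", part, append=TRUE)
# ...then pull manageable chunks instead of holding 10 GB in RAM
sample_rows <- dbGetQuery(con, "SELECT * FROM cookies LIMIT 50000")
dbDisconnect(con)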