0

I have a large data frame that is taking to long to compute a for loop, I've tried removing all computations to time the for loop but I still have an inefficient code. I'm new to R but I think there should be a better way of coding my for loop. If you could provide some guidance it would be appreciated.

My dataFrame has 2,772,807 obs of 6 variables. enter image description here

Simplified code (Still takes long):

    library("tictoc")
    tic()

    dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
                              "\t", escape_double = FALSE, trim_ws = TRUE)

    dataFlights["Connections"] = ""

    pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)

    for (row in 1:nrow(dataFlights)) {
      dataFlights[row,7] <- 1
      setTxtProgressBar(pb, row)
    }
    close(pb)
    toc()

Original Code:


    #Reads DOT public flight information for 2017 & 2018, 
    #and computes the number of connections 
    #per route (Cp#1 or Cp#2) into a new column. Possible results 0,1, or 2 connections. 

    library("tictoc")
    tic()

    dataFlights <- read_delim("U.S._DOT_O&D_Monthly_Traffic_Report.tsv",
                              "\t", escape_double = FALSE, trim_ws = TRUE)

    dataFlights["Connections"] = ""  

    pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)

    for (row in 1:nrow(dataFlights)) {
      if(is.na(dataFlights[row,2]) & is.na(dataFlights[row,3])){
        dataFlights[row,7] <- 0
      } else if (is.na(dataFlights[row,2]) | is.na(dataFlights[row,3])) {
        dataFlights[row,7] <- 1
      } else {
        dataFlights[row,7] <- 2
      }
      setTxtProgressBar(pb, row)
    }
    close(pb)
    toc()

Raul Gonzalez
  • 875
  • 1
  • 10
  • 23
  • 2
    The progress bar eats a lot of performance. Anyway, you can do this in milliseconds if you just don't use a `for` loop. Make yourself familiar with vectorization and then use, e.g., `ifelse`. – Roland Apr 05 '19 at 05:15
  • 1
    Looks like you can just replace the loop in your original code with `rowSums(!is.na(dataFlights[,2:3]))` – Ritchie Sacramento Apr 05 '19 at 05:34
  • And use `data.table` package for extra speed `setDT(dataFlights); dataFlights[, c("Connections") := rowSums(!is.na(dataFlights[,2:3])]` – Tung Apr 05 '19 at 05:41
  • https://stackoverflow.com/questions/52513666/counting-the-nas-in-a-part-of-a-row-in-data-table – Tung Apr 05 '19 at 05:42

2 Answers2

4

As indicated in the comments, this can be done effortlessly with ifelse

# data
set.seed(123)
n <- 1e+6
dataFlights <- data.frame(x1 = runif(n), 
                          x2 = sample(c(runif(n/2), rep(NA, n/2)), n),
                          x3 = sample(c(runif(n/2), rep(NA, n/2)), n),
                          stringsAsFactors = FALSE
                          )

# conditions
na_2 <- is.na(.subset2(dataFlights, 2)) 
na_3 <- is.na(.subset2(dataFlights, 3))
na_sum <- na_2 + na_3

# ifelse
dataFlights$x4 <- ifelse(na_sum == 2, 0, ifelse(na_sum == 1, 1, 2))
head(dataFlights)
#          x1        x2        x3 x4
# 1 0.2875775        NA        NA  0
# 2 0.7883051 0.4415287        NA  1
# 3 0.4089769        NA 0.3130298  1
# 4 0.8830174 0.3077688        NA  1
# 5 0.9404673        NA        NA  0
# 6 0.0455565 0.5718788        NA  1

where for simplicity I set column 4 as opposed to column 7.

niko
  • 5,253
  • 1
  • 12
  • 32
3

Few suggestions:

dataFlights["Connections"] = ""

In this piece, if you use NA instead of "", it will keep the data size smaller. For comparison, I created a 3,000,000 x 3 matrix to see size. With only one column different, the one with "" had size 268Mb but the one with NA was only about 60Mb. Smaller the size, faster it will be to index.

pb <- txtProgressBar(min = 0, max = nrow(dataFlights), style = 3)

for (row in 1:nrow(dataFlights)) {
  dataFlights[row,7] <- 1
  setTxtProgressBar(pb, row)
}

In each iteration, you are assigning 1 to a matrix / data.frame cell. This assignment is a computationally expensive step. For your example, this can be completely vectorized. Here are few ways to get 7th column to replace your for loop

rowSums

col7.rowSums = rowSums(!is.na(dataFlights[, 2:3]))

sapply

col7.sapply = sapply(1:nrow(dataFlights), function(x) sum(!is.na(dataFlights[x, 2:3])))

apply

col7.apply = apply(!is.na(dataFlights[, 2:3]), 1, sum)

Microbenchmark

Unit: microseconds
     expr      min         lq        mean    median        uq        max neval
 for.loop 52604.86 56768.5590 58810.55595 58137.651 60064.056  81958.717   100
  rowSums    35.87    49.2225    61.23889    53.845    72.010    139.409   100
   sapply 49756.32 53131.1065 55778.95541 54414.455 56154.496 102558.473   100
    apply   997.21  1060.5380  1225.48577  1135.066  1254.936   3864.779   100
cropgen
  • 1,920
  • 15
  • 24