I am currently working with a series of large datasets and I'm trying to improve how I write scripts in R. I mostly use for loops, which I know can be cumbersome and slow, especially with very large datasets.

I have heard a lot of people recommending the apply() family to avoid complex for loops, but I am struggling to get my head around using them to apply multiple functions in one go.

Here is some simple example data:

A <- data.frame('Area' = c(4, 6, 5),
                'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
                'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')

What I want to do is sort the data by the 'flow' variable, then add columns for the portion of total 'flow' and 'area' each data point represents, before finally adding a further two columns of the cumulative percentage of each variable.

Currently I use this for loop:

sort_files <- list()
n <- 1
for(i in files){
  name <- frames[n]
  nom <- paste(name,'_sorted', sep = '')
  data <- i[order(-i$flow),]
  area <- sum(i$Area)
  total <- sum(i$flow)
  data$area_portion <- (data$Area/area)*100
  data$flow_portion <- (data$flow/total)*100
  data$cum_area <- cumsum(data$area_portion)
  data$cum_flow <- cumsum(data$flow_portion)
  assign(nom, data)
  df <- get(paste(name,'_sorted', sep = ''))
  sort_files[[nom]] <- df
  n <- n + 1
}

This works, but it seems overly complex and ugly, and I'm sure it runs far slower than a better-written script would.

How can I simplify and neaten up the above code?

This is the expected output:

sort_files

$`A_sorted`
  Area flow area_portion flow_portion  cum_area  cum_flow
1    4    1     26.66667     33.33333  26.66667  33.33333
2    6    1     40.00000     33.33333  66.66667  66.66667
3    5    1     33.33333     33.33333 100.00000 100.00000

$B_sorted
  Area flow area_portion flow_portion  cum_area cum_flow
2    8    2     44.44444           50  44.44444       50
1    6    1     33.33333           25  77.77778       75
3    4    1     22.22222           25 100.00000      100
double-beep
tom91
  • `av_portion` is also missing, although I understand it's the mean of `Area`. `files` is also an R function. – patL Jan 31 '19 at 09:03
  • @tom91: can you add the expected output too? – Tung Jan 31 '19 at 09:09
  • @markus and patL Sorry! I just realised I copied over the script with the actual variable names and not the test one. I have updated it now. – tom91 Jan 31 '19 at 09:12
  • @Tung Expected output has been added to the bottom – tom91 Jan 31 '19 at 09:14

2 Answers

Use `lapply` to loop over `files` and dplyr's `mutate` to add the new columns:

library(dplyr)

setNames(
  lapply(files, function(x)
    x %>%
      arrange(desc(flow)) %>%
      mutate(area_portion = Area / sum(Area) * 100,
             flow_portion = flow / sum(flow) * 100,
             cum_area = cumsum(area_portion),
             cum_flow = cumsum(flow_portion))),
  paste0(frames, "_sorted")
)


#$A_sorted
#  Area flow area_portion flow_portion  cum_area  cum_flow
#1    4    1     26.66667     33.33333  26.66667  33.33333
#2    6    1     40.00000     33.33333  66.66667  66.66667
#3    5    1     33.33333     33.33333 100.00000 100.00000

#$B_sorted
#  Area flow area_portion flow_portion  cum_area cum_flow
#1    8    2     44.44444           50  44.44444       50
#2    6    1     33.33333           25  77.77778       75
#3    4    1     22.22222           25 100.00000      100

Or, going fully the tidyverse way, we can replace `lapply` with `map` and `setNames` with `set_names`:

library(tidyverse)

map(set_names(files, str_c(frames, "_sorted")), 
  . %>% arrange(desc(flow)) %>%
  mutate(area_portion = Area/sum(Area)*100, 
         flow_portion = flow/sum(flow) * 100, 
         cum_area = cumsum(area_portion),
         cum_flow = cumsum(flow_portion)))

Updated the tidyverse approach following some pointers from @Moody_Mudskipper.
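
One of those pointers was that the naming step could instead go after a pipe at the end, which avoids wrapping the whole call in parentheses. A minimal sketch of that variant, assuming dplyr and purrr are installed; the name `sorted` is only illustrative, and `unlist(frames)` is used here simply to hand `paste0` a plain character vector:

```r
library(dplyr)
library(purrr)

# Example data from the question
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
files <- list(A, B)
frames <- list("A", "B")

# Transform every data frame first, then name the resulting list last
sorted <- files %>%
  map(~ .x %>%
        arrange(desc(flow)) %>%
        mutate(area_portion = Area / sum(Area) * 100,
               flow_portion = flow / sum(flow) * 100,
               cum_area = cumsum(area_portion),
               cum_flow = cumsum(flow_portion))) %>%
  set_names(paste0(unlist(frames), "_sorted"))
```

Note that `arrange()` resets the row names to 1, 2, 3, which is why the row labels differ from the expected output in the question while the values are the same.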

Ronak Shah
  • This is excellent, and exactly the kind of thing I was after. Out of interest, what are the benefits of going the tidyverse route? – tom91 Jan 31 '19 at 09:34
  • @tom91 In this case not much benefit, I would say. But some people find `tidyverse` more readable and easier to understand. – Ronak Shah Jan 31 '19 at 09:37
  • Some very minor points, forgive me for scratching that itch: (1) if you really want to go full tidyverse you can use `str_c` (it's almost the same but has a few differences: https://stackoverflow.com/questions/53118271/difference-between-paste-str-c-str-join-stri-join-stri-c-stri-pa ). (2) You don't need to unlist `frames`. (3) To avoid these embedded parentheses over several lines you could put the `set_names` after a pipe at the end OR (and this is what I'd do) rename `files` instead, so you get the naming done ASAP. (4) `function(x) x %>%` can be replaced by a functional chain `. %>%`. – moodymudskipper Jan 31 '19 at 12:34
  • You would end up with something starting with `map(set_names(files, str_c(frames, "_sorted")), . %>% arrange(...` – moodymudskipper Jan 31 '19 at 12:35
  • @Moody_Mudskipper Cool, thanks. Updated the answer. Hope I covered all the points you mentioned, and in the right way :) – Ronak Shah Jan 31 '19 at 13:04

You could also define a function first ..

f <- function(data) {

  # sort data by flow
  data <- data[order(data['flow'], decreasing = TRUE), ]

  # apply your functions
  data["area_portion"] <- data['Area'] / sum(data['Area']) * 100
  data["flow_portion"] <- data['flow'] / sum(data['flow']) * 100
  data["cum_area"] <- cumsum(data['area_portion'])
  data["cum_flow"] <- cumsum(data['flow_portion'])
  data
}

.. and use lapply to, ahhm, apply f to your list

out <- lapply(files, f)
out
#[[1]]
#  Area flow area_portion flow_portion  cum_area  cum_flow
#1    4    1     26.66667     33.33333  26.66667  33.33333
#2    6    1     40.00000     33.33333  66.66667  66.66667
#3    5    1     33.33333     33.33333 100.00000 100.00000

#[[2]]
#  Area flow area_portion flow_portion  cum_area cum_flow
#2    8    2     44.44444           50  44.44444       50
#1    6    1     33.33333           25  77.77778       75
#3    4    1     22.22222           25 100.00000      100

If you want to change the names of out you can use setNames

out <- setNames(lapply(files, f), paste0(c("A", "B"), "_sorted"))
# or
# out <- setNames(lapply(files, f), paste0(unlist(frames), "_sorted"))
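
Since `lapply` keeps the names of its input, another option is to name the list first and then apply the function. A rough base-R sketch along those lines, using `$` column access so each step works on plain vectors; `add_portions` is a made-up name for illustration, not something from the answers here:

```r
# Example data from the question
A <- data.frame(Area = c(4, 6, 5), flow = c(1, 1, 1))
B <- data.frame(Area = c(6, 8, 4), flow = c(1, 2, 1))
files <- list(A, B)
frames <- list("A", "B")

# Hypothetical helper: sort by flow (descending), then add the
# percentage and cumulative-percentage columns
add_portions <- function(data) {
  data <- data[order(data$flow, decreasing = TRUE), ]
  data$area_portion <- data$Area / sum(data$Area) * 100
  data$flow_portion <- data$flow / sum(data$flow) * 100
  data$cum_area <- cumsum(data$area_portion)
  data$cum_flow <- cumsum(data$flow_portion)
  data
}

# Name the inputs up front; lapply() carries the names to the result
names(files) <- paste0(unlist(frames), "_sorted")
sort_files <- lapply(files, add_portions)
```

Base subsetting keeps the original row names, so `sort_files$B_sorted` should show rows 2, 1, 3 as in the expected output of the question.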
markus
  • Create a function, of course! I should have thought of that, far simpler than a complex for loop! Thanks! – tom91 Jan 31 '19 at 09:36