I am currently working with a series of large datasets and I'm trying to improve how I write scripts in R. I tend to mostly make use of for loops which I know can be cumbersome and slow, espeically with very large datasets.
I have heard a lot of people recommending the apply() family to avoid complex for loops, but I am struggling to get my head around using them to apply multiple functions in one go.
Here is some simple example data:
A <- data.frame('Area' = c(4, 6, 5),
'flow' = c(1, 1, 1))
B <- data.frame('Area' = c(6, 8, 4),
'flow' = c(1, 2, 1))
files <- list(A, B)
frames <- list('A', 'B')
What I want to do is sort the data by the 'flow' variable, then add columns for the portion of total 'flow' and 'area' each data point represents, before finally adding a further two columns of the cumulative percentage of each variable.
Currently I use this for loop:
sort_files <- list()
n <- 1
for(i in files){
name <- frames[n]
nom <- paste(name,'_sorted', sep = '')
data <- i[order(-i$flow),]
area <- sum(i$Area)
total <- sum(i$flow)
data$area_portion <- (data$Area/area)*100
data$flow_portion <- (data$flow/total)*100
data$cum_area <- cumsum(data$area_portion)
data$cum_flow <- cumsum(data$flow_portion)
assign(nom, data)
df <- get(paste(name,'_sorted', sep = ''))
sort_files[[nom]] <- df
n <- n + 1
}
Which works, but seems overly complex and ugly, and I'm sure it will run far slower than better scripts.
How can I simplify and neaten up the above code?
This is the expected output:
sort_files
$`A_sorted`
Area flow area_portion flow_portion cum_area cum_flow
1 4 1 26.66667 33.33333 26.66667 33.33333
2 6 1 40.00000 33.33333 66.66667 66.66667
3 5 1 33.33333 33.33333 100.00000 100.00000
$B_sorted
Area flow area_portion flow_portion cum_area cum_flow
2 8 2 44.44444 50 44.44444 50
1 6 1 33.33333 25 77.77778 75
3 4 1 22.22222 25 100.00000 100