1

I have a dataframe that has multiple columns named as "avg_metric", "wkday_avg_metric", "event_avg_metric" and "monthly_avg_metric", in which "metric" consists of multiple metrics with these calculations (orders, revenue, etc). I have to check for multiple columns if their rows have NAs and replace them with a row from another column. For that, I created a function that does the same verification for the column "metric" I specify. The thing is that I'm getting the same value for the entire new column that I'm creating, which should not be the case.

I added below an example_fixed on what should be the outcome.

Is there an easier way of doing that? Or am I lacking some logic in the function?

Tks.

Edit: I got the errors on my function, but I'm sure there's a better solution to mine. I tried your solutions, but couldn't apply them for my dataframe. I updated the reprex so you can help me better.

library(tidyverse)

(example <- tibble(country = c("A", "B", "C", "D"),
                   brand = c("A", "A", "B", "B"),
                   event = c(1:4),
                   month = c(1:4),
                   weekday = c(1:4),
                   avg_visits = c(5028, NA, NA, NA),
                   avg_revenue = c(12345, NA, NA, NA), 
                   wkday_avg_visits = c(1234, 4355, NA, NA),
                   wkday_avg_revenue = c(12345, 54321, NA, NA),
                   event_avg_visits = c(51271, 59212, 98773, NA),
                   event_avg_revenue = c(98764, 56435, 35634, NA),
                   monthly_avg_visits = c(5028, 5263, 6950, 8902),
                   monthly_avg_revenue = c(63457, 34536, 34574, 23426))) %>% 
  print(width = Inf)
#> # A tibble: 4 x 13
#>   country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#>   <chr>   <chr> <int> <int>   <int>      <dbl>       <dbl>            <dbl>
#> 1 A       A         1     1       1       5028       12345             1234
#> 2 B       A         2     2       2         NA          NA             4355
#> 3 C       B         3     3       3         NA          NA               NA
#> 4 D       B         4     4       4         NA          NA               NA
#>   wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#>               <dbl>            <dbl>             <dbl>              <dbl>
#> 1             12345            51271             98764               5028
#> 2             54321            59212             56435               5263
#> 3                NA            98773             35634               6950
#> 4                NA               NA                NA               8902
#>   monthly_avg_revenue
#>                 <dbl>
#> 1               63457
#> 2               34536
#> 3               34574
#> 4               23426

subs_metric <- function(data, metric) {
  
  avg <- paste0("avg_", metric)
  wkday_avg <- paste0("wkday_avg_", metric)
  event_avg <- paste0("event_avg_", metric)
  monthly_avg <- paste0("monthly_avg_", metric)
  
  for (i in nrow(data)) {
    
    value <- if (is.na(data[[avg]][i]) & is.na(data[[wkday_avg]][i]) & is.na(data[[event_avg]][i])) {
      data[[monthly_avg]][i]
    } else if (is.na(data[[avg]][i]) & is.na(data[[wkday_avg]][i])) {
      data[[event_avg]][i]
    } else if (is.na(data[[avg]][i])) {
      data[[wkday_avg]][i]
    } else {
      data[[avg]][i]
    }
    
    return(value) 
  }
}
  

example %>% 
  mutate(avg_visits_new = subs_metric(., "visits"),
         avg_revenue_new = subs_metric(., "revenue")) %>% 
  print(width = Inf)
#> # A tibble: 4 x 15
#>   country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#>   <chr>   <chr> <int> <int>   <int>      <dbl>       <dbl>            <dbl>
#> 1 A       A         1     1       1       5028       12345             1234
#> 2 B       A         2     2       2         NA          NA             4355
#> 3 C       B         3     3       3         NA          NA               NA
#> 4 D       B         4     4       4         NA          NA               NA
#>   wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#>               <dbl>            <dbl>             <dbl>              <dbl>
#> 1             12345            51271             98764               5028
#> 2             54321            59212             56435               5263
#> 3                NA            98773             35634               6950
#> 4                NA               NA                NA               8902
#>   monthly_avg_revenue avg_visits_new avg_revenue_new
#>                 <dbl>          <dbl>           <dbl>
#> 1               63457           8902           23426
#> 2               34536           8902           23426
#> 3               34574           8902           23426
#> 4               23426           8902           23426

(example_fixed <- tibble(country = c("A", "B", "C", "D"),
                         brand = c("A", "A", "B", "B"),
                         event = c(1:4),
                         month = c(1:4),
                         weekday = c(1:4),
                         avg_visits = c(5028, NA, NA, NA),
                         avg_revenue = c(12345, NA, NA, NA), 
                         wkday_avg_visits = c(1234, 4355, NA, NA),
                         wkday_avg_revenue = c(12345, 54321, NA, NA),
                         event_avg_visits = c(51271, 59212, 98773, NA),
                         event_avg_revenue = c(98764, 56435, 35634, NA),
                         monthly_avg_visits = c(5028, 5263, 6950, 8902),
                         monthly_avg_revenue = c(63457, 34536, 34574, 23426),
                         avg_visits_new = c(5028, 4355, 98773, 8902),
                         avg_revenue_new = c(12345, 54321, 35634, 23426))) %>% 
  print(width = Inf)
#> # A tibble: 4 x 15
#>   country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#>   <chr>   <chr> <int> <int>   <int>      <dbl>       <dbl>            <dbl>
#> 1 A       A         1     1       1       5028       12345             1234
#> 2 B       A         2     2       2         NA          NA             4355
#> 3 C       B         3     3       3         NA          NA               NA
#> 4 D       B         4     4       4         NA          NA               NA
#>   wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#>               <dbl>            <dbl>             <dbl>              <dbl>
#> 1             12345            51271             98764               5028
#> 2             54321            59212             56435               5263
#> 3                NA            98773             35634               6950
#> 4                NA               NA                NA               8902
#>   monthly_avg_revenue avg_visits_new avg_revenue_new
#>                 <dbl>          <dbl>           <dbl>
#> 1               63457           5028           12345
#> 2               34536           4355           54321
#> 3               34574          98773           35634
#> 4               23426           8902           23426

Created on 2020-07-07 by the reprex package (v0.3.0)

2 Answers2

2

We could use the following

example$avg_visits_new <- apply(example,1,function(x) x[!is.na(x)][1])


# A tibble: 4 x 5
  avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
       <dbl>            <dbl>            <dbl>              <dbl>          <dbl>
1       5028             1234            51271               5028           5028
2         NA             4355            59212               5263           4355
3         NA               NA            98773               6950          98773
4         NA               NA               NA               8902           8902

This just goes row-by-row and uses the first non-NA value it finds


Edit: here is a loop that will add recycle the above code on all the metrics.

metric <- unique(sub(".*_(.*)","\\1",colnames(example)[-(1:5)]))

for(i in metric){
    example <- cbind(example, print(apply(example[,grepl(i,colnames(example))],1,function(x) x[!is.na(x)][1])))
}

colnames(example)[(ncol(example)-length(metric)+1):ncol(example)] <- paste0("avg_",metric,"_new")



> example

  country brand event month weekday avg_visits avg_revenue wkday_avg_visits wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits monthly_avg_revenue avg_visits_new avg_revenue_new
1       A     A     1     1       1       5028       12345             1234             12345            51271             98764               5028               63457           5028           12345
2       B     A     2     2       2         NA          NA             4355             54321            59212             56435               5263               34536           4355           54321
3       C     B     3     3       3         NA          NA               NA                NA            98773             35634               6950               34574          98773           35634
4       D     B     4     4       4         NA          NA               NA                NA               NA                NA               8902               23426           8902           23426
Daniel O
  • 4,258
  • 6
  • 20
  • Nice solution for the example given in the question Daniel, though I think the OP was looking for a function that could be applied according to the `metric` suffix, as there are many such columns in the actual data frame – Allan Cameron Jul 07 '20 at 17:00
  • @AllanCameron, you may be right. If OP see's this they should update the example to be representative of the actual dateset. – Daniel O Jul 07 '20 at 17:23
  • That's a nice solution Daniel, tks! Though I have many metrics (9 different metrics to be more precise) and they are kinda messy in the dataframe (as in the example I updated) – Fernando Finotto Jul 09 '20 at 22:25
  • I've updated my answer to include a solution. Still in Base-R – Daniel O Jul 10 '20 at 11:23
0

There are better ways of doing this, for example you can replace the whole function with:

subs_metric <- function(data, metric)
{
  data.table::fcoalesce(data[grep(metric, names(data)), ])
}

Which gives the correct result:

example %>% 
  mutate(avg_visits_new = subs_metric(., "visits"))
#> # A tibble: 4 x 5
#>   avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
#>        <dbl>            <dbl>            <dbl>              <dbl>          <dbl>
#> 1       5028             1234            51271               5028           5028
#> 2         NA             4355            59212               5263           4355
#> 3         NA               NA            98773               6950          98773
#> 4         NA               NA               NA               8902           8902

However, I'm sure you would like to know where the flaws in your code were that stopped the loop working as expected.

Firstly, your loop starts with for (i in nrow(data)). Since there are 4 rows in your data frame, this means for (i in 4). That means the loop only runs once with i set to 4. I think you meant for (i in 1:nrow(data))

Secondly, you are returning value inside the loop. That means that any time the loop runs, it will only run once and the function will return value. I think this was just a misplaced curly bracket.

Thirdly, you are overwriting value in each iteration of the loop, where you want value to be the vector that will form your new column, so you need to declare value in advance and write to value[i] for each iteration of the loop.

Incorporating these changes, we have:

subs_metric <- function(data, metric) {
  
  avg         <- paste0("avg_", metric)
  wkday_avg   <- paste0("wkday_avg_", metric)
  event_avg   <- paste0("event_avg_", metric)
  monthly_avg <- paste0("monthly_avg_", metric)
  value       <- numeric(nrow(data))
  
  for (i in 1:nrow(data)) {
    
      value[i] <- if (is.na(data[[avg]][i]) & 
                      is.na(data[[wkday_avg]][i]) & 
                      is.na(data[[event_avg]][i])) {
        data[[monthly_avg]][i]
      } else if (is.na(data[[avg]][i]) & 
                 is.na(data[[wkday_avg]][i])) {
        data[[event_avg]][i]
      } else if (is.na(data[[avg]][i])) {
        data[[wkday_avg]][i]
      } else {
        data[[avg]][i]
      }
  }
  return(value) 
}

Which now gives the correct result:

example %>% 
  mutate(avg_visits_new = subs_metric(., "visits"))
#> # A tibble: 4 x 5
#>   avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
#>        <dbl>            <dbl>            <dbl>              <dbl>          <dbl>
#> 1       5028             1234            51271               5028           5028
#> 2         NA             4355            59212               5263           4355
#> 3         NA               NA            98773               6950          98773
#> 4         NA               NA               NA               8902           8902

However, I'd probably stick to one of the other solutions offered, since they are considerably shorter and more efficient than a row-wise loop.

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • tks for adjusting my function. I updated the example in case it helps to understand my case. I also tried your approach, but it keeps giving me this error: Error: Problem with mutate() input avg_visits_new. x Input avg_visits_new can't be recycled to size 6048. i Input avg_visits_new is test_fun(., "visits"). i Input avg_visits_new must be size 6048 or 1, not 4 – Fernando Finotto Jul 09 '20 at 17:03