0

I have a dataset containing the number of a products sold per day. However, the data is recorded in a strange way. So if the number of items sold passes 100, the recorder is reset and the number of items sold starts from zero. I have provided an example of how the recorder works.

df <- data.frame(date = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"),
items_sold = c(10,40,90, 10, 20), items_sold_real = c(10,40,90, 110, 130), items_sold_differene_day = c(NA, 30, 50, 20, 20)
)

So, I am trying to make a new variable that looks like variable items_sold_real. I was wondering if you have any idea what would be a good solution in here.

Thanks, S

Abdessabour Mtk
  • 3,895
  • 2
  • 14
  • 21
H Hosseini
  • 43
  • 6
  • I noticed that my previous code didn't give the expected output that you tagged in your question, that's why I edited the answer – Abdessabour Mtk Nov 04 '20 at 00:04
  • 1
    are you sure the last value for items_sold is 20 note 30? Because if the counter starts over at 100 than the invrementation from 10 to 20 should not result in 110 to 130 and diference of 20 (becuase 10 was the counter value the day before). – DPH Nov 04 '20 at 00:13
  • @DPH I actually changed the whole logic of my code because of that, I don't think he's mistaken. it's kinda strange tho – Abdessabour Mtk Nov 04 '20 at 00:18
  • 1
    @AbdessabourMtk The last value of 20 seems logically incorrect, if it is a restarting and accumulating counter running over various days. Once it passed over 100 it starts at 0. so the difference between 20 and 10 can not be 20 – DPH Nov 04 '20 at 00:22
  • @DPH yeah I agree – Abdessabour Mtk Nov 04 '20 at 00:23
  • @DPH and @@AbdessabourMtk first of all thank you both for your times. I believe it was a mistake I made and you are both right. – H Hosseini Nov 04 '20 at 00:29

2 Answers2

2

Logic :

  • we detect that items sold passed a have reset if the number of items sold in day i is inferior to i-1. (mind you that this can be correct only if the number of sales per day can reach at maximum 99 units).
  • if we still hasn't passed a 100 then we just put the items sold.
  • if we passed a hundred we got to add the cumulative sum of the items sold with 100 multiplied by the number of times we passed 100.

Code :

  • items_sold < lag(items_sold, default=F) logical as to if a value is inferior to the previous value.
  • cumsum to get the cumulative sum of this logical vector (TRUE=1,FALSE=0).
  • we group by this new variable (id) to ensure that cumsum will only be applied on elements of the same group.
  • apply the formula mentioned above ( id > 0 ) * cumsum(items_sold) + id * 100 + items_sold * ( id == 0 ): if id > 0 add the cumsum to 100 multiplied by id, if not put the values of item_sold.
  • difference can be calculated using the diff function.
library(tidyverse)

#### if you're tagged data.frame is wrong
## basically there is no longer a need for the cumsums
## so no need to group thus we drop the id column and directly
df %>%
 mutate(items_sold_real= cumsum(items_sold<lag(items_sold, default=F))*100 + items_sold, items_sold_differene_day= c(NA,diff(items_sold_real))) -> df

#### if you want to have a result just like the tagged data.frame
df %>%
 mutate(id = cumsum(items_sold < lag(items_sold, default=F))) %>%
 group_by(id) %>%
 mutate(items_sold_real= ( id > 0 ) * cumsum(items_sold) + id * 100 + items_sold * ( id == 0 ) ) %>% 
 ungroup %>% select(-id) %>%
  mutate(items_sold_differene_day= c(NA, diff(items_sold_real))) -> df
identical(df, tibble(df.result))
#> TRUE
df
# A tibble: 5 x 4
  date       items_sold items_sold_real items_sold_differene_day
  <fct>           <dbl>           <dbl>                    <dbl>
1 2020-01-01         10              10                       NA
2 2020-01-02         40              40                       30
3 2020-01-03         90              90                       50
4 2020-01-04         10             110                       20
5 2020-01-05         20             130                       20

DATA

df <- data.frame(date = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"),
 items_sold = c(10,40,90, 10, 20))

df.result <- data.frame(date = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05"), 
 items_sold = c(10,40,90, 10, 20),
 items_sold_real = c(10,40,90, 110, 130),
 items_sold_differene_day = c(NA, 30, 50, 20, 20)
)
Abdessabour Mtk
  • 3,895
  • 2
  • 14
  • 21
  • your solution works well for the example. However, when I applied it to my primary dataset whose first observation is missing and it has some missing values in other rows, I found out that all I get is NA which happens to be the case because of the way id var is made. – H Hosseini Nov 04 '20 at 00:46
  • @SeyedHosseini can add the head(20) of your df into the question? – Abdessabour Mtk Nov 04 '20 at 00:53
  • @SeyedHosseini I'd suggest replacing the NAs with zeros using this `df %>% mutate(items_sold = if_else(is.na(items_sold), 0, items_sold))` – Abdessabour Mtk Nov 04 '20 at 00:58
0

We can use diff to do the difference between adjacent elements of 'items_sold_real'

df <- transform(df, items_sold_dfifference = c(NA, diff(items_sold_real)))

data

df <- structure(list(date = c("2020-01-01", "2020-01-02", "2020-01-03", 
"2020-01-04", "2020-01-05"), items_sold = c(10, 40, 90, 10, 20
), items_sold_real = c(10, 40, 90, 110, 130), 
items_sold_differene_day = c(NA, 
30, 50, 20, 20)), class = "data.frame", row.names = c(NA, -5L
))
akrun
  • 874,273
  • 37
  • 540
  • 662