How to Create Required Matrix Using Dataframe in R

Question

I have one dataframe which looks like:

DF_1>

T_id  D1             D2                   Num     type    type_2     fig
xt-1  2017-05-01     2017-03-25 12:11:45  10      A       X          25.20
xt-2  2017-05-01     2017-03-25 21:05:25  20      A       Y          20.15
xt-3  2017-05-01     2017-03-25 08:10:55  25      B       X          15.11
xt-4  2017-05-03     2017-03-25 07:19:35  30      B       Y          22.56
xt-5  2017-05-03     2017-03-25 13:12:56  45      C       Z          35.45
xt-6  2017-05-03     2017-03-25 18:14:44  20      D       Z          27.21
xt-7  2017-04-06     2017-03-25 19:21:35  15      A       Z          23.20
xt-8  2017-04-06     2017-03-25 21:11:15  40      C       X          21.40
xt-9  2017-04-08     2017-02-25 22:25:04  20      A       A          27.50
xt-10 2017-04-06     2017-02-25 16:04:08  30      A       Y          32.20
xt-11 2017-04-05     2017-02-25 18:15:25  20      C       Z          30.20
xt-12 2017-04-01     2017-01-25 19:22:25  50      A       Z          33.15
xt-13 2017-04-02     2017-01-25 23:19:05  15      A       A          30.12
xt-14 2017-03-03     2017-01-25 14:25:09  15      D       Y          31.25
xt-15 2017-03-10     2017-01-25 23:25:36  40      A       X          25.45

From above dataframe i want below mentioned two matrix:

1. Date (Last Three Date from `sys.date()`)

    D1    count  sum  mean_num total_sum count_A sum_A count_other sum_other mean_fig   mean_TAT

    2017-05-03 3 95  31.66     6         0       0     3           95        28.40
    2017-05-02 0 0   0         3         0       0     0           0         0.00
    2017-05-01 3 55  18.33     3         2       30    1           25        20.15

For calculation of mean_TAT: substract D2-D1 and than take mean for that day based on count value for the same date.
total_sum would be cumulative from the first date of the month.
count_A and sum_A based on type as A for the particular day.
count_other and sum_other for those where type is other than A.

2.Based of Month (Last Three month as per the dataframe)

For based on the Month the format would be same only calculation would be month basis.

There are 5 additional rows and 2 columns for each month where first three would be the top 3 type_2 based on count for the particular month.
increase_% would be calculated on previous month (i.e if count for May-17 is 50 than Apr-17 100 than there would be -50% and same for the other 5 rows based on their previous month count and sum.
Fourth A would be constant for each month for the value where type_2 is "A".
Fifth Other would be the other than those 4 type_2 as mentioned above.
Total would be as per the column for count and sum there would be addition and for mean there would be mean.

It seems i couldn't explained properly, hope the matrix are understandable by dataframe.

Looking forward for some help.

You might consider trying [dplyr](http://genomicsclass.github.io/book/pages/dplyr_tutorial.html#group-operations-using-group_by), especially the `group_by()` function. — Kamil Slowikowski, Mar 05 '18 at 17:32
@KamilSlowikowski Thanks but i have never code for this much complex matrix. — Rahul shah, Mar 05 '18 at 17:35
I don't understand your second table, could you provide an expected output ? — moodymudskipper, Mar 07 '18 at 09:10
@Moody_Mudskipper My second table would be same as first table only that would be based on Month basis. (For example Count would be for whole month based on `D1` and same for `sum`, `mean` etc. only here one new column is introduced which is `% increase` which give me the percentage increase or decrease based on previous month `count` and `sum` for month as well as for other 5 variables. — Rahul shah, Mar 07 '18 at 12:53
what abour rows `X Y Z A other` ? are they the same but filtered respectively on type_2 and type ? — moodymudskipper, Mar 07 '18 at 13:09
@Moody_Mudskipper Let's keep `X Y Z A Other` constant for all the month, this should be calculated on count of `X Y Z A Other` for particular month, where it should be in sequence for every month based on the count of `X Y Z A Other` for that month, if Count of `Z` is higher for particular month than it should be on than my series would be like `Z X Y A Other`. Further `count_A` is number of `X Y Z A Other` in `type_2` Having value `A` in `type` and `count_other` is number of `X Y Z A Other` having value other than `A` in `type`. Same logic for `sum_A` and `sum_Other`. — Rahul shah, Mar 07 '18 at 14:09
@Moody_Mudskipper For `Other` the value would be other than `X Y Z A` in `type_2` for both `count` and `sum`. — Rahul shah, Mar 07 '18 at 14:11
@Moody_Mudskipper Have you understand the logic for second part?? — Rahul shah, Mar 09 '18 at 06:02
I think I do, but I don't have access to a computer in the next 4 days. You're going to have to build 3 dfs, each a variant of what we have on step one. One where you group only by ym instead of ym and D1, not mention D1 after summarize. One where you group by ym and type_2 instead of ym and D1, arrange by count and keep top 3. And one where you group by ym and type and start the chain with a `mutate(type=ifelse(type=='A', 'A', 'Other')`. In each of these dfs, you get your increase by using the `lag` function in the end, after grouping by relevant type col. — moodymudskipper, Mar 09 '18 at 09:05
In your sample data, Type_2 is never equal to "A". So it is impossible to achieve your fourth condition: "A would be constant for each month for the value where type_2 is "A". — GGAnderson, Mar 11 '18 at 09:37
Can I suggest you manually populate one month of your table 2? Based on your description alone, it is impossible to determine what should be in the row labeled "A", And even more challenging to determine whether the column "count_A" contains the total for the month from the first matrix, or should be stratified by the corresponding type_2. More clarification? — GGAnderson, Mar 11 '18 at 09:41
@GGAnderson Thank for highlighting the issue, i have changed some `type_2` as `A`. — Rahul shah, Mar 11 '18 at 10:56
you changed your data without changing your expected output for second step, so you have some W on your first column that is not anymore on your input data — moodymudskipper, Mar 13 '18 at 18:36
`total_sum would be cumulative (count ?) from the first date of the month.` , so in second step, as it's group per month, it's not a useful column as it would just be equal to count. — moodymudskipper, Mar 13 '18 at 18:38
`There are 5 additional rows and 2 columns for each month where first three would be the top 3 type_2 based on count for the particular month.` And what if A is part of the top 3, should it be repeated ? or should it be the top 3 without A ? And Other is everything except for this top 3 AND A ? — moodymudskipper, Mar 13 '18 at 18:42
I edited my answer but as mentioned, without a clearer question I can't do miracles. — moodymudskipper, Mar 13 '18 at 19:20

moodymudskipper · Accepted Answer · 2018-03-13T19:18:40.073

Here's already the first part:

library(lubridate)
library(dplyr)

df2 <- df1 %>%
  mutate(ym = year(D1)*100+month(D1)) %>%
  arrange(D1) %>%
  group_by(D1,ym) %>%
  summarize(count = n(),
            sum=sum(Num),
            mean_num=mean(Num),
            count_A=sum(type=='A'),
            sum_A=sum(Num * (type=='A')),
            count_other=sum(type!='A'),
            sum_other=sum(Num * (type!='A')),
            mean_fig = mean(fig),
            mean_TAT = mean(D2-D1)) %>%
  group_by(ym) %>%
  mutate(total_sum=cumsum(count)) %>%
  ungroup %>%
  arrange(desc(D1)) %>%
  select(D1,count,sum,mean_num,total_sum,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)


# # A tibble: 9 x 11
# D1 count   sum mean_num total_sum count_A sum_A count_other sum_other mean_fig       mean_TAT
# <date> <int> <int>    <dbl>     <int>   <int> <int>       <int>     <int>    <dbl>         <time>
# 1 2017-05-03     3    95 31.66667         6       0     0           3        95 28.40667 -39.00000 days
# 2 2017-05-01     3    55 18.33333         3       2    30           1        25 20.15333 -37.00000 days
# 3 2017-04-08     1    20 20.00000         7       1    20           0         0 27.50000 -42.00000 days
# 4 2017-04-06     3    85 28.33333         6       2    45           1        40 25.60000 -21.33333 days
# 5 2017-04-05     1    20 20.00000         3       0     0           1        20 30.20000 -39.00000 days
# 6 2017-04-02     1    15 15.00000         2       1    15           0         0 30.12000 -67.00000 days
# 7 2017-04-01     1    50 50.00000         1       1    50           0         0 33.15000 -66.00000 days
# 8 2017-03-10     1    40 40.00000         2       1    40           0         0 25.45000 -44.00000 days
# 9 2017-03-03     1    15 15.00000         1       0     0           1        15 31.25000 -37.00000 days

data

df1 <- read.table(text="T_id  D1             D2                   Num     type    type_2     fig
                  xt-1  2017-05-01     '2017-03-25 12:11:45'  10      A       X          25.20
                  xt-2  2017-05-01     '2017-03-25 21:05:25'  20      A       Y          20.15
                  xt-3  2017-05-01     '2017-03-25 08:10:55'  25      B       X          15.11
                  xt-4  2017-05-03     '2017-03-25 07:19:35'  30      B       Y          22.56
                  xt-5  2017-05-03     '2017-03-25 13:12:56'  45      C       Z          35.45
                  xt-6  2017-05-03     '2017-03-25 18:14:44'  20      D       Z          27.21
                  xt-7  2017-04-06     '2017-03-25 19:21:35'  15      A       Z          23.20
                  xt-8  2017-04-06     '2017-03-25 21:11:15'  40      C       W          21.40
                  xt-9  2017-04-08     '2017-02-25 22:25:04'  20      A       Q          27.50
                  xt-10 2017-04-06     '2017-02-25 16:04:08'  30      A       W          32.20
                  xt-11 2017-04-05     '2017-02-25 18:15:25'  20      C       V          30.20
                  xt-12 2017-04-01     '2017-01-25 19:22:25'  50      A       Z          33.15
                  xt-13 2017-04-02     '2017-01-25 23:19:05'  15      A       Z          30.12
                  xt-14 2017-03-03     '2017-01-25 14:25:09'  15      D       Y          31.25
                  xt-15 2017-03-10     '2017-01-25 23:25:36'  40      A       X          25.45",h=T,strin=F)

df1$D1 <- as.Date(df1$D1,"%Y-%m-%d")
df1$D2 <- as.Date(df1$D2,"%Y-%m-%d")

expected_output <- read.table(text="D1    count  sum  mean_num total_sum count_A sum_A count_other sum_other mean_fig
                     2017-05-03 3 95  31.66     6         0       0     3           95        28.40
                     2017-05-02 0 0   0         3         0       0     0           0         0.00
                     2017-05-01 3 55  18.33     3         2       30    1           25        20.15")

Some hints for part 2:

I can't do miracles without you reworking your questions (giving an accurate reproducible output is more than necessary here). But here's a way to get close, hopefully:

df_month <- df1 %>%
  mutate(ym = year(D1)*100+month(D1)) %>%
  arrange(D1) %>%
  group_by(ym) %>%
  summarize(count = n(),
            sum=sum(Num),
            mean_num=mean(Num),
            count_A=sum(type=='A'),
            sum_A=sum(Num * (type=='A')),
            count_other=sum(type!='A'),
            sum_other=sum(Num * (type!='A')),
            mean_fig = mean(fig),
            mean_TAT = mean(D2-D1)) %>%
  mutate(type_2=paste0(month.abb[ym%% 100],"-",ym %/% 100 -2000)) %>%
  select(ym,type_2,count,sum,mean_num,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)


df_top3 <- df1 %>%
  filter(type_2 !="A") %>%
  mutate(ym = year(D1)*100+month(D1)) %>%
  arrange(desc(ym)) %>%
  group_by(ym,type_2) %>%
  summarize(count = n(),
            sum=sum(Num),
            mean_num=mean(Num),
            count_A=sum(type=='A'),
            sum_A=sum(Num * (type=='A')),
            count_other=sum(type!='A'),
            sum_other=sum(Num * (type!='A')),
            mean_fig = mean(fig),
            mean_TAT = mean(D2-D1)) %>%
  group_by(ym) %>%
  arrange(desc(count)) %>%
  slice(1:3) %>%
  ungroup %>%
  select(ym,type_2,count,sum,mean_num,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)


df_A <- df1 %>%
  filter(type_2 == "A") %>%
  mutate(ym = year(D1)*100+month(D1)) %>%
  arrange(desc(ym)) %>%
  group_by(ym,type_2) %>%
  summarize(count = n(),
            sum=sum(Num),
            mean_num=mean(Num),
            count_A=sum(type=='A'),
            sum_A=sum(Num * (type=='A')),
            count_other=sum(type!='A'),
            sum_other=sum(Num * (type!='A')),
            mean_fig = mean(fig),
            mean_TAT = mean(D2-D1)) %>%
  select(ym,type_2,count,sum,mean_num,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)



df_other <- df1 %>%
  mutate(ym = year(D1)*100+month(D1)) %>%
  anti_join(bind_rows(df_top3,df_A),by = c("ym","type_2")) %>%
  mutate(type_2="Other") %>%
  arrange(desc(ym)) %>%
  group_by(ym,type_2) %>%
  summarize(count = n(),
            sum=sum(Num),
            mean_num=mean(Num),
            count_A=sum(type=='A'),
            sum_A=sum(Num * (type=='A')),
            count_other=sum(type!='A'),
            sum_other=sum(Num * (type!='A')),
            mean_fig = mean(fig),
            mean_TAT = mean(D2-D1)) %>%
  select(ym,type_2,count,sum,mean_num,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)
# it's empty with your example data


bind_rows(df_month,df_top3,df_A,df_other) %>%
  arrange(ym) %>%
  select(-ym) %>%
  rename(Month = type_2)

I am getting "Incorrect size at position 1" error while running the code for first part. — Rahul shah, Mar 19 '18 at 12:01
I've just rerun it on my side and it worked fine, can you try in a clean session, making sure to load the data properly beforehand ? — moodymudskipper, Mar 19 '18 at 12:07
I have tried it many time, but this error is also new for me. Not sure how to resolve this. — Rahul shah, Mar 19 '18 at 12:11
can you run it line by line, i.e. `df1 %>% mutate(ym = year(D1)*100+month(D1))`, then `df1 %>% mutate(ym = year(D1)*100+month(D1)) %>% arrange(D1)` etc, and tell me where it fails ? — moodymudskipper, Mar 19 '18 at 12:13
I am getting error at this line: `select(D1,count,sum,mean_num,total_sum,count_A,sum_A,count_other,sum_other,mean_fig,mean_TAT)` — Rahul shah, Mar 19 '18 at 13:13
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/167126/discussion-between-moody-mudskipper-and-rahul-shah). — moodymudskipper, Mar 19 '18 at 23:50
@Moody_Mudskipper Just a little help require, i need `bind_rows` in descending form like in April-18, March-18, Feb-18, Jan-18 sequence. — Roy1245, May 02 '18 at 11:15
I haven't dived back in the question but to order rows you can use `arrange`. To order descending use `desc` as done on `count` in my answer — moodymudskipper, May 02 '18 at 11:41
@Moody_Mudskipper In my case i just want last three months data including current month (`Sys.date()`) in `bind_rows`. How to filter that? — Roy1245, May 02 '18 at 14:14
@Moody_Mudskipper How to calculate `increase_%` in second part?? — Roy1245, May 05 '18 at 17:14
@Moody_Mudskipper The question is not properly asked, what if for some month if we don't have any row for any value of `Type_2` (i.e Rows for X,Y,Z or A are missing) but still one want to show the particular Row in final output of `bind_row` with 0 value?? — Roy1245, May 06 '18 at 11:54
Roy as stated in my answer i wish i could have given you a complete answer but here we really miss a detailed and exhaustive expected output in place of the empty table you proposed, it's a bit too much of a headache to sort out and makes it less likely to benefit to other readers. I encourage you to isolate the specific points that are problematic, build very simple examples around it and ask them as new questions. — moodymudskipper, May 06 '18 at 17:39

How to Create Required Matrix Using Dataframe in R

1 Answers1