
I have the following data frame:

    structure(
      list(ID = c("P-1", " P-1", "P-1", "P-2", "P-3", "P-4", "P-5", "P-6", "P-7",
                  "P-8"),
           Date = c("2020-03-16 12:11:33", "2020-03-16 13:16:04",
                    "2020-03-16 06:13:55", "2020-03-16 10:03:43",
                    "2020-03-16 12:37:09", "2020-03-16 06:40:24",
                    "2020-03-16 09:46:45", "2020-03-16 12:07:44",
                    "2020-03-16 14:09:51", "2020-03-16 09:19:23"),
           Status = c("SA", "SA", "SA", "RE", "RE", "RE", "RE", "XA", "XA", "XA"),
           Flag = c("L", "L", "L", NA, "K", "J", NA, NA, "H", "G"),
           Value = c(5929.81, 5929.81, 5929.81, NA, 6969.33, 740.08, NA, NA, 1524.8,
                     NA),
           Flag2 = c("CL", "CL", "CL", NA, "RY", "", NA, NA, "", NA),
           Flag3 = c(NA, NA, NA, NA, "RI", "PO", NA, "SS", "DDP", NA)),
      .Names=c("ID", "Date", "Status", "Flag", "Value", "Flag2", "Flag3"),
      row.names=c(NA, 10L), class="data.frame")

I am using the following code:

    df %>% mutate(L = ifelse(Flag == "L", 1, 0),
                  K = ifelse(Flag == "K", 1, 0)
                  # etc. for the other Flag values
                  ) %>%
      mutate(sub_status = NA) %>%
      mutate(sub_status = ifelse(!is.na(Flag2) & Flag3 == 0, "a", sub_status),
             sub_status = ifelse(is.na(Flag2) & Flag3 != 0, "b", sub_status)
             # etc. for the other sub-statuses
             ) %>%
      mutate(value_class = ifelse(0 <= Value & Value <= 15000, "0-15000",
                                  "15000-50000")) %>%
      group_by(Date, Status, sub_status, value_class) %>%
      summarise(L = sum(L),
                K = sum(K),
                # etc.
                count = n())

This gives the following output:

    Date         Status  sub_status   value_class G H I J K L NA Count
    2020-03-20   SA      a            0-15000     0 0 0 0 1 1 0  2
    2020-03-20   SA      b            0-15000     0 0 0 0 1 0 0  1
    ................
    ................

I want to get the output shown below from this data frame. The Status column has 3 distinct values; Flag2 holds either a value, [null], or NA; and Flag3 has 7 distinct values plus [null] or NA. A single ID can have multiple Flag3 entries.

I need to build the following data frame by splitting Value into 3 groups, such as 0-15000 and 15000-50000.
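A minimal sketch of the value bucketing, assuming base R's `cut()` and a third open-ended bucket labelled `50000+` (the label and the upper break are assumptions; the question only names the first two ranges):

```r
# Bucket Value into ranges. The first bucket is closed on both ends,
# [0, 15000], matching the `0 <= Value & Value <= 15000` test above.
# The "50000+" bucket is an assumed third group.
value_class <- function(value) {
  cut(value,
      breaks = c(0, 15000, 50000, Inf),
      labels = c("0-15000", "15000-50000", "50000+"),
      include.lowest = TRUE)
}

buckets <- as.character(value_class(c(5929.81, 15000, 20000, 60000, NA)))
buckets
```

Missing values stay `NA` here, so they would still need an explicit decision (drop them, or assign them to a bucket) before aggregating.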

  • If, for a distinct ID, Flag2 has a value other than 0 or [null]/NA but Flag3 is 0 or [null]/NA, the sub_status is a.
  • If, for a distinct ID, Flag3 has a value other than 0 or [null]/NA but Flag2 is 0 or [null]/NA, the sub_status is b.
  • If, for a distinct ID, both Flag2 and Flag3 have a value other than 0 or [null]/NA, the sub_status is c.
  • If, for a distinct ID, both Flag2 and Flag3 are 0 or [null]/NA, the sub_status is d.
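The four rules above can be sketched in base R as follows. The helper names `has_value()` and `classify_sub_status()` are hypothetical, and the sketch assumes that NA, an empty string, and "0" all count as "no value":

```r
# Hypothetical helper: TRUE when a flag carries a real value,
# i.e. it is not NA, not an empty string, and not "0".
has_value <- function(x) {
  x <- trimws(as.character(x))
  !is.na(x) & x != "" & x != "0"
}

# Map each (Flag2, Flag3) pair to one of the sub-statuses a, b, c, d.
classify_sub_status <- function(flag2, flag3) {
  f2 <- has_value(flag2)
  f3 <- has_value(flag3)
  ifelse(f2 & !f3, "a",
  ifelse(!f2 & f3, "b",
  ifelse(f2 & f3, "c", "d")))
}

flag2 <- c("CL", NA, "RY", "", NA)
flag3 <- c(NA, NA, "RI", "PO", "SS")
sub_status <- classify_sub_status(flag2, flag3)
sub_status
```

Note that the pair `("", "PO")` lands in b under this reading, because the empty string is treated as missing.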

I want to arrange the data frame described above in the following structure, with percent and Total columns.

I have written some percentages as fractions, such as 2/5, to show that each Status count is divided by the Total, whereas each sub_status count is divided by the count of its own Status.

16/03/2020         0 - 15000                    15000 - 50000
Status  count   percent  L K J H G [Null]    count   percent  L K J H G [Null]   Total
SA        1 1/8 (12.50%) 1 0 0 0 0   0         0       -      0 0 0 0 0    0       1
a         1 1/1(100.00%) 1 0 0 0 0   0         0       -      0 0 0 0 0    0       1
b         0       -      0 0 0 0 0   0         0       -      0 0 0 0 0    0       0
c         0       -      1 0 0 0 0   0         0       -      0 0 0 0 0    0       0
d         0       -      0 0 0 0 0   0         0       -      0 0 0 0 0    0       0
RE        4      50.00%  0 1 1 0 0   2         0       -      0 0 0 0 0    0       4
a         0        -     0 0 0 0 0   0         0       -      0 0 0 0 0    0       0
b         1      25.00%  0 0 1 0 0   1         0       -      0 0 0 0 0    0       1
c         1      25.00%  0 1 0 0 0   1         0       -      0 0 0 0 0    0       1
d         2      50.00%  0 0 0 0 0   2         0       -      0 0 0 0 0    0       2
XA        3      37.50%  0 0 0 1 1   1         0       -      0 0 0 0 0    0       3
a         0        -     0 0 0 0 0   0         0       -      0 0 0 0 0    0       0
b         2      66.67%  0 0 0 1 0   1         0       -      0 0 0 0 0    0       2
c         0        -     0 0 0 0 0   0         0       -      0 0 0 0 0    0       0
d         1      33.33%  0 0 0 0 1   0         0       -      0 0 0 0 0    0       1
Total     8     100.00%  1 1 0 0 1   3         0       -      0 0 0 0 0    0       8
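The percentage rule (Status over the grand total, sub_status over its own Status) could be computed along these lines. This is only a sketch: `fmt_percent()` is a hypothetical helper, and the counts are copied from the table above.

```r
# Format count / total as a percentage with two decimals,
# or "-" when there is nothing to divide.
fmt_percent <- function(count, total) {
  ifelse(count == 0 | total == 0, "-",
         sprintf("%.2f%%", 100 * count / total))
}

# Status rows are divided by the grand total.
status_counts <- c(SA = 1, RE = 4, XA = 3)
status_percent <- fmt_percent(status_counts, sum(status_counts))

# sub_status rows are divided by the count of their own Status (here RE).
re_sub_counts <- c(a = 0, b = 1, c = 1, d = 2)
re_sub_percent <- fmt_percent(re_sub_counts, status_counts[["RE"]])

status_percent
re_sub_percent
```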

The required output above is based on the latest date, 16/03/2020. If the data frame has no rows for the latest date as per startdate, keep all values at 0 in the output data frame. The fractions in the percent column are only for reference; the actual output will contain calculated percentage values.
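Selecting the rows for the latest date might look like the sketch below. It assumes the timestamps are strings in the `YYYY-MM-DD HH:MM:SS` form used in the sample data; the example values are made up for illustration.

```r
# Illustrative timestamps (not taken from the data frame above).
timestamps <- c("2020-03-15 23:59:59",
                "2020-03-16 12:11:33",
                "2020-03-16 06:13:55")

# Parse only the date part; the trailing time is ignored.
days <- as.Date(timestamps, format = "%Y-%m-%d")

latest_day <- max(days)
is_latest  <- days == latest_day   # logical index of rows to keep

format(latest_day, "%d/%m/%Y")
```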

Also, I want to keep the structure static: if any of the parameters is not present for a day, the output structure should stay the same, with 0 values.

For example, suppose 17/03/2020 doesn't have any row with status SA or sub_status c; the placeholders for them should still be in the output, with a value of 0.
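One way to keep the structure static, sketched in base R, is to declare every expected level up front so that `table()` reports absent combinations as 0. The `day` data frame below is a made-up example with no SA rows and no sub_status c:

```r
status_levels <- c("SA", "RE", "XA")
sub_levels    <- c("a", "b", "c", "d")

# Hypothetical day: only three rows, SA and c never occur.
day <- data.frame(Status     = c("RE", "RE", "XA"),
                  sub_status = c("b", "d", "b"))

# Factors with fixed levels make table() count every combination,
# including those that are absent from the data.
counts <- table(
  Status     = factor(day$Status, levels = status_levels),
  sub_status = factor(day$sub_status, levels = sub_levels)
)

skeleton <- as.data.frame(counts, responseName = "count")
skeleton
```

The result always has 3 x 4 = 12 rows, whatever the input contains. In a tidyverse pipeline, `tidyr::complete()` serves a similar purpose.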

user9211845
  • @akrun: I wrote the percent column as `2/5` only for illustration. The real output would contain just the percentage value, with 2 decimal places and a percent sign. – user9211845 Apr 10 '20 at 18:14
  • @akrun: Please suggest if the required output is possible through R:( – user9211845 Apr 10 '20 at 20:52
  • Your input data has 10 rows, but the expected output has more. Is the expected output based on the input example? – akrun Apr 10 '20 at 20:59
  • @akrun: I'm sorry but the output is just for the visual representation only. I need to understand the approach to get such output. – user9211845 Apr 10 '20 at 21:06
  • @akrun: All the counts are distinct group by `ID`. – user9211845 Apr 10 '20 at 22:59
  • @akrun: Did you check, need help to understand possible approach (if possible). – user9211845 Apr 11 '20 at 14:24
  • @akrun: Please help with the possible approach. – user9211845 Apr 12 '20 at 20:12
  • As I mentioned earlier, I check against the expected output to frame the logic. From your input and output, I cannot cross-check. – akrun Apr 12 '20 at 22:50
  • @akrun: Updated the expected output. – user9211845 Apr 13 '20 at 00:44
  • @akrun: Need to keep all the variables static. – user9211845 Apr 13 '20 at 00:48
  • Could you start with the `dput` of the dataset you like - it's the third code block. The previous code does not appear relevant as you seem content with the output. – Cole Apr 25 '20 at 21:12
  • your preferred output DF interleaves rows with aggregates of the next 4 rows (SA: a,b,c,d etc). This is an unusual format. Do you need this? A more straightforward approach would be to create a DF aggregated by sub_status (a, b, c, d) and then calculate a second aggregate to work out the sums and percentages per group. – Paul van Oppen Apr 26 '20 at 23:33
  • @PaulvanOppen: Yes Paul, I need the output in the required format. I'm not sure about what will be the correct approach. – user9211845 Apr 27 '20 at 01:57
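Since the comments state that all counts are per distinct `ID`, one possible first step is to de-duplicate on `ID`. The sketch below uses the IDs from the sample data; note that one of them, `" P-1"`, has a leading space, so it is trimmed first. Keeping the first occurrence of each ID is an assumption, as the question does not say which duplicate should win.

```r
ids <- c("P-1", " P-1", "P-1", "P-2", "P-3", "P-4", "P-5", "P-6", "P-7", "P-8")

# Strip stray whitespace so " P-1" and "P-1" compare equal.
clean_ids <- trimws(ids)

# TRUE for the first occurrence of each ID, FALSE for later repeats.
keep <- !duplicated(clean_ids)

sum(keep)   # number of distinct IDs
```

Applied to the data frame, `df[keep, ]` would leave 8 rows, which matches the Total in the expected output.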

1 Answer


Hopefully this will be enough to get you started. To go further, I'll need an expected output that looks like it comes from R, and further explanation of how the variables are computed.

library(tidyverse)
df <- structure(
  list(ID = c("P-1", " P-1", "P-1", "P-2", "P-3", "P-4", "P-5", "P-6", "P-7",
              "P-8"),
       Date = c("2020-03-16 12:11:33", "2020-03-16 13:16:04",
                "2020-03-16 06:13:55", "2020-03-16 10:03:43",
                "2020-03-16 12:37:09", "2020-03-16 06:40:24",
                "2020-03-16 09:46:45", "2020-03-16 12:07:44",
                "2020-03-16 14:09:51", "2020-03-16 09:19:23"),
       Status = c("SA", "SA", "SA", "RE", "RE", "RE", "RE", "XA", "XA", "XA"),
       Flag = c("L", "L", "L", NA, "K", "J", NA, NA, "H", "G"),
       Value = c(5929.81, 5929.81, 5929.81, NA, 6969.33, 740.08, NA, NA, 1524.8,
                 NA),
       Flag2 = c("CL", "CL", "CL", NA, "RY", "", NA, NA, "", NA),
       Flag3 = c(NA, NA, NA, NA, "RI", "PO", NA, "SS", "DDP", NA)),
  .Names=c("ID", "Date", "Status", "Flag", "Value", "Flag2", "Flag3"),
  row.names=c(NA, 10L), class="data.frame")

df2 <- df %>%
  mutate(
    # add variables
    Value = ifelse(0 <= Value & Value <= 15000, "0-15000", "15000-50000"),
    substatus = case_when(
      !is.na(Flag2) & is.na(Flag3) ~ "a",
      !is.na(Flag3) & is.na(Flag2) ~ "b",
      !is.na(Flag3) & !is.na(Flag2) ~ "c",
      TRUE ~ "d"),
    # make Date an actual date rather than a timestamp
    Date = as.Date(Date),
    # remove obsolete columns
    Flag2 = NULL,
    Flag3 = NULL,
    ID = NULL,
    # renames NAs into the name of the desired column
    Flag = ifelse(is.na(Flag), "[Null]", Flag),
    # create column of 1 for pivot
    temp = 1,
    # and row id
    id = row_number()
    ) %>%
  # create new columns L K etc, this also drops the Flag col
  pivot_wider(names_from = "Flag", values_from = "temp", values_fill = list(temp=0)) %>%
  # move `[Null]` column to the end
  select(everything(), -`[Null]`, `[Null]`) %>%
  mutate(
    id = NULL,
    count = 1,
    Total = rowSums(select(., L:`[Null]`))) 
df2
#> # A tibble: 10 x 12
#>    Date       Status Value substatus     L     K     J     H     G `[Null]`
#>    <date>     <chr>  <chr> <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
#>  1 2020-03-16 SA     0-15~ a             1     0     0     0     0        0
#>  2 2020-03-16 SA     0-15~ a             1     0     0     0     0        0
#>  3 2020-03-16 SA     0-15~ a             1     0     0     0     0        0
#>  4 2020-03-16 RE     <NA>  d             0     0     0     0     0        1
#>  5 2020-03-16 RE     0-15~ c             0     1     0     0     0        0
#>  6 2020-03-16 RE     0-15~ c             0     0     1     0     0        0
#>  7 2020-03-16 RE     <NA>  d             0     0     0     0     0        1
#>  8 2020-03-16 XA     <NA>  b             0     0     0     0     0        1
#>  9 2020-03-16 XA     0-15~ c             0     0     0     1     0        0
#> 10 2020-03-16 XA     <NA>  d             0     0     0     0     1        0
#> # ... with 2 more variables: count <dbl>, Total <dbl>

# You didn't say what to do with NA values, so I left them as NA

bind_rows(
  df2 %>%
    # add missing combinations of abcd
    complete(nesting(Date, Status, Value), substatus) %>%
    group_by(Date, Value, Status, substatus) %>% 
    summarize_all(~sum(., na.rm=TRUE)) %>%
    group_by(Status, Value) %>%
    mutate(percent = paste(round(100 * Total / sum(Total), 2), "%")) %>%
    ungroup(),
  df2 %>% 
    mutate(substatus = Status, Status = paste0(Status, "_")) %>%
    group_by(Date, Value, Status, substatus) %>% 
    mutate(count = n()) %>%
    group_by(count, add = TRUE) %>%
    summarize_all(~sum(., na.rm=TRUE)) %>%
    group_by(Value) %>%
    mutate(percent = paste(round(100 * Total / sum(Total), 2), "%"))
) %>%
  arrange(Date, Value, desc(Status)) %>%
  mutate(Status = NULL) %>%
  rename(Status = substatus) %>%
  print(n=Inf)
#> # A tibble: 25 x 12
#>    Date       Value Status     L     K     J     H     G `[Null]` count Total
#>    <date>     <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
#>  1 2020-03-16 0-15~ XA         0     0     0     1     0        0     1     1
#>  2 2020-03-16 0-15~ a          0     0     0     0     0        0     0     0
#>  3 2020-03-16 0-15~ b          0     0     0     0     0        0     0     0
#>  4 2020-03-16 0-15~ c          0     0     0     1     0        0     1     1
#>  5 2020-03-16 0-15~ d          0     0     0     0     0        0     0     0
#>  6 2020-03-16 0-15~ SA         3     0     0     0     0        0     3     3
#>  7 2020-03-16 0-15~ a          3     0     0     0     0        0     3     3
#>  8 2020-03-16 0-15~ b          0     0     0     0     0        0     0     0
#>  9 2020-03-16 0-15~ c          0     0     0     0     0        0     0     0
#> 10 2020-03-16 0-15~ d          0     0     0     0     0        0     0     0
#> 11 2020-03-16 0-15~ RE         0     1     1     0     0        0     2     2
#> 12 2020-03-16 0-15~ a          0     0     0     0     0        0     0     0
#> 13 2020-03-16 0-15~ b          0     0     0     0     0        0     0     0
#> 14 2020-03-16 0-15~ c          0     1     1     0     0        0     2     2
#> 15 2020-03-16 0-15~ d          0     0     0     0     0        0     0     0
#> 16 2020-03-16 <NA>  XA         0     0     0     0     1        1     2     2
#> 17 2020-03-16 <NA>  a          0     0     0     0     0        0     0     0
#> 18 2020-03-16 <NA>  b          0     0     0     0     0        1     1     1
#> 19 2020-03-16 <NA>  c          0     0     0     0     0        0     0     0
#> 20 2020-03-16 <NA>  d          0     0     0     0     1        0     1     1
#> 21 2020-03-16 <NA>  RE         0     0     0     0     0        2     2     2
#> 22 2020-03-16 <NA>  a          0     0     0     0     0        0     0     0
#> 23 2020-03-16 <NA>  b          0     0     0     0     0        0     0     0
#> 24 2020-03-16 <NA>  c          0     0     0     0     0        0     0     0
#> 25 2020-03-16 <NA>  d          0     0     0     0     0        2     2     2
#> # ... with 1 more variable: percent <chr>
moodymudskipper
  • Thanks a lot. Can you help classify the output based on value (i.e. `0-15`, `15-50` and `50+`)? Also, how can I get the required percentage column? – user9211845 May 08 '20 at 13:46