Organizing a dataframe - splitting one column into three

Question

I have a dataset that looks like this:

Ord_ID      Supplier     Trans_Type     Date
1           A            PO             2/3/18
1           A            Receipt        2/15/18
2           B            PO             2/4/18
2           B            Receipt        3/13/18
3           C            PO             2/7/18
3           C            Receipt        3/1/18
3           C            Receipt        3/5/18
3           C            Receipt        3/29/18
4           B            PO             2/9/18
4           B            Receipt        2/20/18
4           B            Receipt        2/27/18
5           D            PO             2/18/18
5           D            Receipt        4/2/18

Basically, I need to separate the Date column into 3 different columns. I need a PO_Date column, a column that lists the earliest receipt date for each order, and the last receipt date for each order. Because some orders only have one receipt date, the 2nd and 3rd columns should be the same. I've tried using spread(), but I guess because there are varying numbers of Receipt dates for each order it didn't work. How can I make this happen?

Desired result:

Ord_ID     Supplier    PO_Date   First_Receipt_Date    Last_Receipt_Date
1          A           2/3/18    2/15/18               2/15/18
2          B           2/4/18    3/13/18               3/13/18
3          C           2/7/18    3/1/18                3/29/18
4          B           2/9/18    2/20/18               2/27/18
5          D           2/18/18   4/2/18                4/2/18

score 3 · Accepted Answer · answered Mar 21 '19 at 22:06

Using dplyr. First, make sure column Date is in date format. Assume dataframe is named mydata:

library(dplyr)
mydata <- mydata %>% 
  mutate(Date = as.Date(Date, "%m/%d/%y")

Now you can filter for Receipt, calculate max/min dates, then filter the original data for PO and join them together:

mydata %>% 
  filter(Trans_Type == "Receipt") %>% 
  group_by(Ord_ID, Supplier) %>% 
  summarise(First_Receipt_Date = min(Date), 
            Last_Receipt_Date = max(Date)) %>% 
  ungroup() %>%
  left_join(filter(mydata, Trans_Type == "PO")) %>% 
  select(Ord_ID, Supplier, PO_Date = Date, First_Receipt_Date, Last_Receipt_Date)

Result:

  Ord_ID Supplier PO_Date    First_Receipt_Date Last_Receipt_Date
   <int> <chr>    <date>     <date>             <date>           
1      1 A        2018-02-03 2018-02-15         2018-02-15       
2      2 B        2018-02-04 2018-03-13         2018-03-13       
3      3 C        2018-02-07 2018-03-01         2018-03-29       
4      4 B        2018-02-09 2018-02-20         2018-02-27       
5      5 D        2018-02-18 2018-04-02         2018-04-02

When I run this I get this error: "Error: `by` required, because the data sources have no common variables" — Millie, Mar 22 '19 at 18:36
Works for me using the example data in the question: the join is on Ord_ID and Supplier. — neilfws, Mar 23 '19 at 03:06
Got it to work this time. Not sure what was wrong last week; I updated packages earlier which might have done the trick. — Millie, Mar 25 '19 at 22:05

moodymudskipper · Answer 2 · 2019-03-22T17:25:58.103

2

With tidyverse, borrowing @divibisan's sample data :

library(tidyverse)

df %>%
  group_by(Ord_ID, Supplier) %>%
  slice(c(1:2, n())) %>%
  mutate(Trans_Type = c("PO_Date","First_Receipt_Date","Last_Receipt_Date")) %>%
  spread(Trans_Type, Date) %>%
  ungroup()

# # A tibble: 5 x 5
#   Ord_ID Supplier First_Receipt_Date Last_Receipt_Date PO_Date   
#    <int> <fct>    <date>             <date>            <date>    
# 1      1 A        2018-02-15         2018-02-15        2018-02-03
# 2      2 B        2018-03-13         2018-03-13        2018-02-04
# 3      3 C        2018-03-01         2018-03-29        2018-02-07
# 4      4 B        2018-02-20         2018-02-27        2018-02-09
# 5      5 D        2018-04-02         2018-04-02        2018-02-18

If the data is not sorted as in the sample data, add %>% arrange(Trans_Type, Date) as a first step.

edited Mar 22 '19 at 17:25

answered Mar 22 '19 at 09:58

moodymudskipper

46,417
11
121
167

I would recommend not to use `slice`, it is not reproducible as you do not know the order of data. At least, use `arrange` before. A better way would be a combination of `filter` with `first` and `last` I guess. – Sébastien Rochette Mar 22 '19 at 14:32
If data is not sorted, you can add `%>% arrange(Trans_Type, Date)` as a first step. given the shape of the sample data I assumed it was fair to assume it is sorted. Other assumptions are that there is always a `"PO"` value, that there are no other values than `"PO"` and `"Receipt"`, that order of the output columns wasn't important etc... – moodymudskipper Mar 22 '19 at 17:15
I don't understand the point on `filter` with `first` and `last`, I use `slice` in its precise intended use case IMO. – moodymudskipper Mar 22 '19 at 17:18
I know this is correct with your assumptions in this specific case. But because you never know how a dataset is built, nor can you know how it will be later updated, I do not recommend the use of indices for the selection/filtering of datasets. There must be a better explanation, included in the data, as why you chose these specific lines. Here, these are the smallest and the biggest values. I try to always think about the future use of my scripts. This is a personal recommendation. – Sébastien Rochette Mar 22 '19 at 17:22
1

That's a legitimate point, but there's also value in concise code, and on SO you're rarely 100% explicit about assumptions anyway so it's a gray area. Actually my first answer had the `arrange` part but I edited it out (not showing in edit history as i edited right away). I added a note at the end of my post as a compromise :). – moodymudskipper Mar 22 '19 at 17:29
This works on the sample data, but for some reason when I use the entire dataset it doesn't. I get an error that says Error in `[[<-.data.frame`(`*tmp*`, col, value = c("PO_Date", "First_Receipt_Date", : replacement has 3 rows, data has 1094 Do you know what might be causing this? – Millie Mar 22 '19 at 19:07
Did you forget to `group` or to `slice` ? Or did you subgroup before the `mutate` call ? Did you make sure to run it in a fresh session ? – moodymudskipper Mar 22 '19 at 19:21
Restarting the session and adjusting something in my earlier code worked. However I'm just realizing that there are in fact some orders that have POs but not Receipts, which does cause problems for those rows... shame on me for not knowing my data well enough :/ – Millie Mar 22 '19 at 19:48
Nothing to be ashamed about, it's always an iterative process, do you want NAs in Receipt columns in this case ? Or drop the observations with orphan POs? I can adapt my answer. – moodymudskipper Mar 22 '19 at 21:53
NAs in the Receipt columns would be preferred. If you could adjust your answer to work that way, that would be phenomenal! – Millie Mar 26 '19 at 19:30

score 1 · Answer 3 · answered Mar 21 '19 at 21:55

1

I would start with something like this:

data %>%
  group_by(Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
    max_date = max(Date)
)  %>%
  ungroup()

Then, you can play with gatherand spread to retrieve the columns you need.

answered Mar 21 '19 at 21:55

Sébastien Rochette

6,536
2
22
43

score 0 · Answer 4 · answered Mar 21 '19 at 22:24

Here's another tidyverse based solution that avoids the left_join. I have no idea which approach would be faster on a large dataset, but it's always good to have more options:

df <- structure(list(Ord_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 
4L, 4L, 5L, 5L), Supplier = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 
3L, 3L, 2L, 2L, 2L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", 
    "Receipt", "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"
    ), Date = structure(c(17565, 17577, 17566, 17603, 17569, 
    17591, 17595, 17619, 17571, 17582, 17589, 17580, 17623), class = "Date")), row.names = c(NA, 
-13L), class = "data.frame")



df %>%
    group_by(Ord_ID, Supplier, Trans_Type) %>%
    # Keep only min and max date values
    filter(Date == min(Date) | Date == max(Date) | Trans_Type != 'Receipt') %>%
    # Rename 2nd Receipt value Receipt_2 so there are no duplicated values
    mutate(Trans_Type2 = if_else(Trans_Type == 'Receipt' & row_number() == 2,
                                 'Receipt_2', Trans_Type)) %>%
    # Drop Trans_Type variable (we can't replace in mutate since it's a grouping var)
    ungroup(Trans_Type) %>%
    select(-Trans_Type) %>%
    # Spread the now unduplicated Trans_Type values
    spread(Trans_Type2, Date) %>%
    # Fill in Receipt_2 values where they're missing
    mutate(Receipt_2 = if_else(is.na(Receipt_2), Receipt, Receipt_2))

# A tibble: 5 x 5
  Ord_ID Supplier PO         Receipt    Receipt_2 
   <int> <fct>    <date>     <date>     <date>    
1      1 A        2018-02-03 2018-02-15 2018-02-15
2      2 B        2018-02-04 2018-03-13 2018-03-13
3      3 C        2018-02-07 2018-03-01 2018-03-29
4      4 B        2018-02-09 2018-02-20 2018-02-27
5      5 D        2018-02-18 2018-04-02 2018-04-02

score 0 · Answer 5 · answered Mar 21 '19 at 23:21

You can just use dplyr to mutate new columns for PO date, and first and last receipt dates:

test1<-test %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Ord_ID) %>%
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         Receipt_Date_First = min(Date[Trans_Type=="Receipt"]),
         Receipt_Date_Last = max(Date[Trans_Type=="Receipt"])) %>%
  filter(!is.na(PO_Date)) %>%
  mutate(PO_Date = as.Date(as.numeric(PO_Date)))

A breakdown:

test1<-test %>%

  #convert format of "Date" column to as.Date to identify min and max dates
  mutate(Date = mdy(Date)) %>%

  #group by the Order ID
  group_by(Ord_ID) %>%

  #PO_Date will be where the "Trans_Type" is "PO" --> since the column is in date format,
  #dplyr will convert this to numeric, but can be fixed later
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),

     #first receipt date is the minimum date of a receipt transaction
     Receipt_Date_First = min(Date[Trans_Type=="Receipt"]),

     #last receipt date is the maximum date of a receipt transaction
     Receipt_Date_Last = max(Date[Trans_Type=="Receipt"])) %>%

  #to remove duplicates
  filter(!is.na(PO_Date)) %>%

  #convert "PO_Date" column back to as.Date from numeric
  mutate(PO_Date = as.Date(as.numeric(PO_Date)))

Organizing a dataframe - splitting one column into three

5 Answers5