There are a lot of packages for Sankey diagrams, but they assume the data is already structured as flows. I have a transaction dataset from which I would like to pull out the sequence of products each client moves through over time. Assume the rows are already ordered by time.

Here is the dataset:

structure(list(date = structure(c(1546300800, 1546646400, 1547510400, 1547596800, 1546387200, 1546646400, 1546732800), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
               client = c("a", "a", "a", "a", "b", "b", "b"),
               product = c("butter", "cheese", "cheese", "butter", "milk", "garbage bag", "candy"),
               qty = c(2, 3, 4, 1, 3, 4, 6)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame")) 

[image: dataset]

Here is the desired output:

[image: desired output]

LocoGris
  • 4,432
  • 3
  • 15
  • 30

1 Answer

Here is my proposal:

dt <- structure(list(date = structure(c(1546300800, 1546646400, 1547510400, 1547596800, 1546387200, 1546646400, 1546732800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
               client = c("a", "a", "a", "a", "b", "b", "b"),
               product = c("butter", "cheese", "cheese", "butter", "milk", "garbage bag", "candy"),
               qty = c(2, 3, 4, 1, 3, 4, 6)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

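library(data.table)
dt <- as.data.table(dt)

# Previous product bought by the same client (NA for each client's first row)
dt[, From := shift(product, type = "lag"), by = client]
dt <- dt[!is.na(From)]

# The current product is the destination of the flow
setnames(dt, "product", "To")
# Drop repeats of the same product (e.g. cheese -> cheese)
dt <- dt[From != To]
setcolorder(dt, c("client", "From", "To", "qty"))

# Direction-independent key: cheese -> butter gets the same key as butter -> cheese
dt[, comp := paste0(sort(c(From, To)), collapse = "_"), by = seq_len(nrow(dt))]
# Keep only the first occurrence of each pair
dt <- unique(dt, by = "comp")

# Remove helper columns
dt[, date := NULL]
dt[, comp := NULL]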

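One caveat: the cheese-to-cheese row is deleted because I assumed you are looking for sequences of different products. For the same reason, cheese to butter is treated as a repeat of butter to cheese and dropped. If you need something else, my code might need some tweaks.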

#  client        From          To qty       
#      a      butter      cheese   3 
#      b        milk garbage bag   4 
#      b garbage bag       candy   6
LocoGris
  • Thanks! Correct, I'm looking for sequences of different products, which should feed well into the Sankey. – kakashi hatake Mar 24 '19 at 03:08
  • I'm looking for unique flows for each client. In this new example, a "butter to cheese" flow should be added for client b: structure(list(date = structure(c(1546300800, 1546646400, 1547510400, 1547596800, 1546387200, 1546646400, 1546732800, 1546819200, 1546992000 ), class = c("POSIXct", "POSIXt"), tzone = "UTC"), client = c("a", "a", "a", "a", "b", "b", "b", "b", "b"), product = c("butter", "cheese", "cheese", "butter", "milk", "garbage bag", "candy", "butter", "cheese"), qty = c(2, 3, 4, 1, 3, 4, 6, 2, 3)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame")) – kakashi hatake Mar 24 '19 at 03:35
  • I changed the unique statement so that it also deduplicates by client: dt <- unique(dt, by = c("client", "comp")) – kakashi hatake Mar 24 '19 at 03:43
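
The per-client deduplication discussed in these comments can be sketched as follows. This is a minimal sketch rather than the answer's original code: it uses the extended dataset from the comment, passes both columns to unique() as a single character vector, and (as an assumption made for brevity) builds the direction-independent key with pmin()/pmax() instead of a row-wise sort(). The name flows is only illustrative.

```r
library(data.table)

# Extended example from the comment: client b also buys butter, then cheese.
dt <- data.table(
  date = as.POSIXct(c(1546300800, 1546646400, 1547510400, 1547596800,
                      1546387200, 1546646400, 1546732800, 1546819200,
                      1546992000), origin = "1970-01-01", tz = "UTC"),
  client = c("a", "a", "a", "a", "b", "b", "b", "b", "b"),
  product = c("butter", "cheese", "cheese", "butter", "milk",
              "garbage bag", "candy", "butter", "cheese"),
  qty = c(2, 3, 4, 1, 3, 4, 6, 2, 3)
)

# Previous product within each client (rows are assumed to be in time order).
dt[, From := shift(product, type = "lag"), by = client]

# Keep real transitions only: no first rows, no same-product repeats.
flows <- dt[!is.na(From) & From != product]
setnames(flows, "product", "To")

# Direction-independent key, so cheese -> butter matches butter -> cheese.
flows[, comp := paste(pmin(From, To), pmax(From, To), sep = "_")]

# Deduplicate within each client rather than across the whole table.
flows <- unique(flows, by = c("client", "comp"))
flows <- flows[, .(client, From, To, qty)]
print(flows)
```

With this key, client b keeps its own butter-to-cheese row even though client a already has one. Note that the candy-to-butter transition for client b also appears, since it is a distinct pair.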