
I'm looking for help preparing multiple time series of varying length (and sampling frequency) for clustering with the dtwclust package in R. My series are sampled every 15 or 30 minutes and the actual date/time isn't relevant, so I'm storing elapsed minutes as the index instead.

From looking at the ?tslist constructor, I need to create a list of time series.

I currently have the data in long format in an R data.table; toy example below:

toy <- structure(list(new_id = structure(c("4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "4f755b0b0ba498f2d2e00c7951eaaeb9", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312", "a73d5cf68dee23eb83d5a8a59ac22312", 
"a73d5cf68dee23eb83d5a8a59ac22312"), class = c("hash", "md5")), 
    value = c(500, 2400, 500, 2200, 500, 1400, 400, 600, 300, 
    900, 200, 800, 200, 800, 175, 800, 900, 600, 1700, 1700, 
    800, 700, 850, 750, 600, 500, 400, 350), elapsed_time = c(15, 
    10, 30, 20, 45, 30, 60, 40, 75, 50, 90, 60, 105, 70, 120, 
    80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 
    200)), row.names = c(NA, -28L), class = c("data.table", "data.frame"
))

I've tried a few ideas but haven't managed to create the correct data structure. I started looking at tsibble, but it doesn't look like a tsibble object can be coerced into a tslist. Is there a way to do this with data.table or the *apply family?

Really appreciate any guidance - I have searched but not found a close enough solution to my problem. Thanks.

Meep

1 Answer


Fortunately, I found a related question and could work from that: "How to generate a list-column holding named-vectors, when grouping by other data frame variables?"

There may be a more elegant solution than this, but I'll put it here for anyone in the same predicament as me.

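    library(tidyverse)

    # One row per series: a numeric vector of values named by elapsed_time
    toy_res <- toy %>%
      group_by(new_id) %>%
      summarise(named_vec = map2(list(value), list(elapsed_time),
                                 ~ set_names(.x, .y)),
                .groups = "drop")

    # Pull the list-column out and name each series by its id
    t_list <- toy_res$named_vec
    names(t_list) <- toy_res$new_id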

t_list can now be used in dtwclust. With this toy example it will error because I only provided two series, but it works on my larger dataset.
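Since the question asked about data.table or the *apply family: the same kind of named list can probably be built in base R with split(). This is only a minimal sketch, using a small made-up data frame that has the same column names as the toy data (the ids "a" and "b" and the values are placeholders, not the real series):

```r
# Hypothetical long-format data with the same columns as the question
toy_small <- data.frame(
  new_id       = c("a", "b", "a", "b", "a", "b", "b"),
  value        = c(500, 2400, 500, 2200, 400, 1400, 600),
  elapsed_time = c(15, 10, 30, 20, 45, 30, 40)
)

# Sort by series and time so each vector ends up in chronological order
toy_small <- toy_small[order(toy_small$new_id, toy_small$elapsed_time), ]

# Name each value by its elapsed time, then split into one vector per series
t_list_base <- split(
  setNames(toy_small$value, toy_small$elapsed_time),
  toy_small$new_id
)

str(t_list_base)
```

The result is a plain named list of named numeric vectors of different lengths, the same shape as t_list from the tidyverse version.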

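Actually, it 'works' but does not honor the timestamp names in each vector: the values are clustered as plain ordered sequences, so the different sampling intervals are ignored. This answer is still a WIP.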
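One possible workaround, which I have not validated on the real data, is to resample every series onto the same regular time step before clustering, so that position in the vector corresponds to elapsed time. The sketch below uses base R's approx() for linear interpolation; the helper name resample, the 5-minute step, and the two example series are all assumptions made for illustration, and linear interpolation may not suit every kind of measurement.

```r
# Hypothetical named-vector series, shaped like t_list:
# names hold elapsed minutes, values hold the measurements
t_example <- list(
  a = c("15" = 500, "30" = 500, "45" = 400, "60" = 300),
  b = c("10" = 2400, "20" = 2200, "30" = 1400, "40" = 600, "50" = 900)
)

# Linearly interpolate one series onto a regular grid with spacing `step`
# (step = 5 minutes is an assumed choice, not something from the question)
resample <- function(x, step = 5) {
  t <- as.numeric(names(x))
  grid <- seq(min(t), max(t), by = step)
  out <- approx(x = t, y = unname(x), xout = grid)$y
  setNames(out, grid)
}

# Apply to every series; lengths still differ, but the time step is now shared
t_regular <- lapply(t_example, resample)
lengths(t_regular)
```

The series keep their different lengths, which DTW can handle, but one step now means the same number of minutes in every series.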

Meep