Selecting a unique index for tsibble in R when no unique column exists

Question

I have a dataset for aircraft flight crashes. It looks like this

date,time,location,operator,flight,route,type,registration,cn_in,aboard,fatalities,ground,summary
1908-09-17,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1,2,1,0,"During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge who was a passenger. This was the first recorded airplane fatality in history.  One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft.  Orville Wright suffered broken ribs, pelvis and a leg.  Selfridge suffered a crushed skull and died a short time later."
1912-07-12,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5,5,0,"First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight."
1913-08-06,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1,1,0,"The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed."
1913-09-09,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20,14,0,The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.

I want to import it into a tsibble for analysis and forecasting. The first issue that I notice is this. There are no duplicate rows in dataset but there is no column that has only distinct rows. By this I mean, there are duplicate dates and there are duplicate times etc for different rows.

I have tried combining the date and time columns into a date_time column but even the combination of two still has duplicate values (i.e. same date and time but differences in other columns like location etc)

I only got to unique values if I combine date, time, location and summary. However, this kind of complicates the ensuing time-series analysis and forecasting since a combination of those four is no longer numerical. I have also tried introducing an id column which I don't think helps much for the analysis although it does also present a unique key.

Does the id column I have added suffice or is there a better index I can choose? The full dataset is here. https://gist.github.com/vaMuchenje/1f7a009b4214d5a78021fb632830bdc5

Here is my code:

library(tidyverse)
library(anytime)
data <- readr::read_csv("aircraft_data.csv")
data <- data %>%
  mutate(date_time = with(data, anytime(paste(date, time)))) %>%
  mutate(id = row_number()) %>%
  as_tsibble(index=id)

score 0 · Answer 1 · answered Nov 30 '22 at 21:39

0

Depending on what you want to achieve with the tsibble, you maybe should rather go with either:

data %>% 
    mutate(date_time = anytime(paste(date, time))) %>%
    mutate(id = interaction(location, type)) %>%
    as_tsibble(index=date_time, key=id)

but then it's not clear what you need the tsibble for, as you will have mostly one, sometime two records for each key value.

Or if you want to treat the all the data together, you could distinguish them by adding a second for duplicate date_times:

data %>% 
    mutate(date_time = anytime(paste(date, time))) %>%
    group_by(date_time) %>% mutate(date_time=date_time+row_number()-1) %>%
    as_tsibble(index=date_time)

In any case, index should rather be a date or datetime

answered Nov 30 '22 at 21:39

Jan Z

171
3

i want to forecast number of accidents and fatalities for future years – Tendekai Muchenje Dec 01 '22 at 09:49
@TendekaiMuchenje OK. So probably the second option would fit best, but still not clear why you need tsibble instead of siple tibble. – Jan Z Dec 01 '22 at 10:17
would I not run into the same issue? How would I do it with a tibble if it is simpler? As in, how would I address the issues finding an index and dealing with irregular intervals within the dates? – Tendekai Muchenje Dec 01 '22 at 10:26
@TendekaiMuchenje Not clear what are your next steps. Can you clarify what precisely your next steps are? – Jan Z Dec 01 '22 at 11:00

Selecting a unique index for tsibble in R when no unique column exists

1 Answers1