I have a dataset for aircraft flight crashes. It looks like this
date,time,location,operator,flight,route,type,registration,cn_in,aboard,fatalities,ground,summary
1908-09-17,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1,2,1,0,"During a demonstration flight, a U.S. Army flyer flown by Orville Wright nose-dived into the ground from a height of approximately 75 feet, killing Lt. Thomas E. Selfridge who was a passenger. This was the first recorded airplane fatality in history. One of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft. Orville Wright suffered broken ribs, pelvis and a leg. Selfridge suffered a crushed skull and died a short time later."
1912-07-12,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5,5,0,"First U.S. dirigible Akron exploded just offshore at an altitude of 1,000 ft. during a test flight."
1913-08-06,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1,1,0,"The first fatal airplane accident in Canada occurred when American barnstormer, John M. Bryant, California aviator was killed."
1913-09-09,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20,14,0,The airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of Helgoland Island into the sea. The ship broke in two and the control car immediately sank drowning its occupants.
I want to import it into a tsibble for analysis and forecasting. The first issue that I notice is this. There are no duplicate rows in dataset but there is no column that has only distinct rows. By this I mean, there are duplicate dates and there are duplicate times etc for different rows.
I have tried combining the date
and time
columns into a date_time
column but even the combination of two still has duplicate values (i.e. same date and time but differences in other columns like location etc)
I only got to unique values if I combine date
, time
, location
and summary
. However, this kind of complicates the ensuing time-series analysis and forecasting since a combination of those four is no longer numerical. I have also tried introducing an id
column which I don't think helps much for the analysis although it does also present a unique key.
Does the id
column I have added suffice or is there a better index I can choose?
The full dataset is here. https://gist.github.com/vaMuchenje/1f7a009b4214d5a78021fb632830bdc5
Here is my code:
library(tidyverse)
library(anytime)
data <- readr::read_csv("aircraft_data.csv")
data <- data %>%
mutate(date_time = with(data, anytime(paste(date, time)))) %>%
mutate(id = row_number()) %>%
as_tsibble(index=id)