Finding all flights that have at least three years of data in R

Question

I am using the flight dataset that is freely available in R.

library(readr)
flights <- read_csv("http://ucl.ac.uk/~uctqiax/data/flights.csv")

Now, let's say I want to find all flights that have been flying for at least three consecutive years, i.e. the date column contains dates in three consecutive years. Basically, I am only interested in the year part of the date.

I was thinking of the following approach: create a unique list of all plane names, then for each plane get all its dates and check whether they cover three consecutive years.

I started as follows:

NOyears = 3
planes <- unique(flights$plane) 

# at least 3 consecutive years 
for (plane in planes){
  plane = "N576AA"
  allyears <- which(flights$plane == plane)
}

But I am stuck here. This whole approach is starting to look too complicated to me. Is there an easier/faster way, considering that I am working on a very large dataset?

Note: I want to be able to specify the number of years later on, which is why I included NOyears = 3 in the first place.

EDIT:

I have just noticed this question on SO. It makes very interesting use of diff and cumsum, which are both new to me. Maybe a similar approach is possible here using data.table?

ila · Answer 1 · 2020-05-14T15:25:24.167

dplyr will do the trick here

library(dplyr)
library(lubridate)

flights %>%
  mutate(year = year(date)) %>%
  group_by(plane) %>%
  summarise(range = max(year) - min(year)) %>%
  filter(range >= 2)

Though I'm not seeing any planes that meet criteria!

Edit: Per mnist's comment, consecutive years are a little more tricky, but here's a working example with consecutive months (the data you supplied only has one year) - just swap out for years!

nMonths <- 6
flights %>%
  mutate(month = month(date)) %>% #Calculate month
  count(plane, month) %>% #Summarize to one row for each plane/month combo
  arrange(plane, month) %>% #Arrange by plane, month so we can look at consecutive months
  group_by(plane) %>% #Within each plane...
  mutate(run = cumsum(c(TRUE, diff(month) != 1))) %>% #...start a new run whenever the gap to the previous month is not exactly 1
  count(run, name = "consecutiveMonths") %>% #Count the months in each run (still grouped by plane)
  group_by(plane) %>% #Then, for each plane...
  summarise(maxConsecutiveMonths = max(consecutiveMonths)) %>% #...return the maximum number of consecutive months
  filter(maxConsecutiveMonths >= nMonths) #And keep only those planes that meet criteria!
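
For the years version that the question actually asks about, the same idea can be sketched on a small made-up table, since the supplied file only covers 2011. The `toy` data frame, its plane IDs, and the name `min_years` are assumptions for illustration only; `min_years` plays the role of `NOyears` from the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (not from the flights file, which only covers 2011)
toy <- data.frame(
  plane = c("A", "A", "A", "A", "B", "B", "B", "C"),
  date  = as.Date(c("2014-03-01", "2015-06-01", "2015-07-01", "2016-01-01",
                    "2014-01-01", "2015-01-01", "2018-01-01", "2016-05-05"))
)

min_years <- 3  # assumed name; same meaning as NOyears in the question

long_runs <- toy %>%
  mutate(year = year(date)) %>%
  distinct(plane, year) %>%   # one row per plane/year combination
  arrange(plane, year) %>%
  group_by(plane) %>%
  summarise(
    # the run ID changes whenever the gap to the previous year is not 1;
    # rle() on the run IDs then gives the length of each run
    max_consecutive_years = max(rle(cumsum(c(TRUE, diff(year) != 1)))$lengths)
  ) %>%
  filter(max_consecutive_years >= min_years)

long_runs
```

In this toy table only plane A qualifies (2014 to 2016). Plane B is dropped because of the gap between 2015 and 2018, which is exactly the case a simple max(year) - min(year) check would miss.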

What about planes that have been in use in 2014, 2015, and 2018? How do you check for **consecutive** years? — mnist, May 14 '20 at 15:14
I was looking for more of a "data.table" solution, as my data.table is very large — Nneka, May 14 '20 at 16:00

score 0 · Answer 2 · answered May 14 '20 at 20:41

Here is a data.table approach. It uses months, since that file contains only one year, and keeps the planes that operated in 12 consecutive months:

library(data.table)
flights <- fread("http://ucl.ac.uk/~uctqiax/data/flights.csv")
flights[, month:=month(date)]
setkey(flights, plane, date)
flights[, max_run:=lapply(.SD, function(x) max(rle(cumsum(c(0, diff(unique(x))) > 1))$lengths)), 
.SDcols="month", by="plane"][max_run > 11][]
#>                        date hour minute  dep  arr dep_delay arr_delay carrier
#>      1: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      2: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      3: 2011-01-01 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      4: 2011-01-02 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>      5: 2011-01-02 12:00:00   NA     NA   NA   NA        NA        NA      XE
#>     ---                                                                      
#> 151636: 2011-11-21 12:00:00   10     56 1056 1359        25        37      FL
#> 151637: 2011-12-09 12:00:00   18     36 1836 2126        -5        -4      FL
#> 151638: 2011-12-13 12:00:00   17     27 1727 2013        -3        -7      FL
#> 151639: 2011-12-14 12:00:00    6     28  628  914        -2        -8      FL
#> 151640: 2011-12-14 12:00:00   11     57 1157 1438        -3       -14      FL
#>         flight dest  plane cancelled time dist month max_run
#>      1:   2174  PNS                1   NA  489     1      12
#>      2:   2277  BRO                1   NA  308     1      12
#>      3:   2811  MOB                1   NA  427     1      12
#>      4:   2204  OKC                1   NA  395     1      12
#>      5:   2570  BTR                1   NA  253     1      12
#>     ---                                                     
#> 151636:    298  ATL N983AT         0   98  696    11      12
#> 151637:    296  ATL N983AT         0   89  696    12      12
#> 151638:    292  ATL N983AT         0   87  696    12      12
#> 151639:    290  ATL N983AT         0   86  696    12      12
#> 151640:    286  ATL N983AT         0   87  696    12      12

Created on 2020-05-14 by the reprex package (v0.3.0)
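
The core of the expression above is the diff/cumsum/rle idiom. A minimal base-R sketch on a made-up vector of months (the vector `m` and the intermediate names are hypothetical) shows what each step produces:

```r
# Hypothetical sorted, unique months for one plane
m <- c(1, 2, 3, 6, 7, 11)

gaps    <- c(0, diff(m))        # distance to the previous month: 0 1 1 3 1 4
breaks  <- gaps > 1             # TRUE wherever a new run starts
run_id  <- cumsum(breaks)       # run IDs: 0 0 0 1 1 2
run_len <- rle(run_id)$lengths  # length of each run: 3 2 1
max_run <- max(run_len)         # longest run of consecutive months: 3
```

Because every break increments the running sum, all months that belong to the same uninterrupted stretch share one run ID, and rle() simply counts how many times each ID repeats.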

chinsoon12 · Accepted Answer · 2020-05-15T01:25:46.533

Here is another option using data.table:

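library(data.table)
flights <- fread("http://ucl.ac.uk/~uctqiax/data/flights.csv")

#summarize into a smaller dataset; assuming that we are not counting days to check for consecutive years
yearly <- flights[, .(year=unique(year(date))), .(carrier, flight)]

#add a dummy flight to demonstrate consecutive years
yearly <- rbindlist(list(yearly, data.table(carrier="ZZ", flight=111L, year=2011:2014)))

setkey(yearly, carrier, flight, year)
yearly[, c("rl", "rw") := {
    #start a new run when the year does not directly follow the previous row, or when the carrier/flight changes
    iscons <- cumsum(c(0L, diff(year)!=1L | diff(rleid(carrier, flight))!=0L))
    #rl = run id, rw = position of the row within its run
    .(iscons, rowid(carrier, flight, iscons))
}]

#keep every run that reaches at least 3 consecutive years
yearly[rl %in% yearly[rw>=3L]$rl]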

output:

   carrier flight year   rl rw
1:      ZZ    111 2011 5117  1
2:      ZZ    111 2012 5117  2
3:      ZZ    111 2013 5117  3
4:      ZZ    111 2014 5117  4
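
To make the number of years configurable, as the question asks with NOyears, the same logic could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `consecutive_years`, its arguments `n` and `grp`, and the toy table are made up for illustration and are not part of the answer above.

```r
library(data.table)

# Hypothetical helper: return the runs of at least `n` consecutive years
# within each group defined by the columns named in `grp`
consecutive_years <- function(dt, n = 3L, grp = c("carrier", "flight")) {
  yearly <- unique(dt[, c(grp, "year"), with = FALSE])  # one row per group/year
  setkeyv(yearly, c(grp, "year"))                       # sort by group, then year
  # the run ID restarts in every group and increments at every gap
  yearly[, run := cumsum(c(0L, diff(year) != 1L)), by = grp]
  run_by <- c(grp, "run")
  # keep only runs with at least n rows, i.e. n consecutive years
  yearly[, if (.N >= n) .SD, by = run_by]
}

# Made-up data: ZZ/111 flies 2011-2014, AA/1 skips 2013, BB/2 flies only in 2012
toy <- data.table(
  carrier = c(rep("ZZ", 4), rep("AA", 3), "BB"),
  flight  = c(rep(111L, 4), rep(1L, 3), 2L),
  year    = c(2011:2014, 2011L, 2012L, 2014L, 2012L)
)

res <- consecutive_years(toy, n = 3L)
res
```

On this toy table only the four ZZ/111 rows are returned. Computing the run ID with `by` keeps runs from leaking across carrier/flight boundaries, and collapsing to one row per group and year first keeps the working table small even when the raw data is very large.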
