0

I need to create a sequence of numbers based on another column

I have this data frame:

head(df)
        id                date      lc     lon     lat   gap_days  gap
1 20162.03 2003-10-19 14:33:00 Tagging -39.370 -18.480         NA <NA>
2 20162.03 2003-10-21 12:19:00       1 -38.517 -18.253 1.90694444  gap
3 20162.03 2003-10-21 13:33:00       1 -38.464 -18.302 0.05138889   no
4 20162.03 2003-10-21 16:38:00       A -38.461 -18.425 0.12847222   no
5 20162.03 2003-10-21 18:50:00       A -38.322 -18.512 0.09166667   no
6 20162.03 2003-10-23 10:33:00       B -38.674 -19.824 1.65486111  gap

I indicated the gaps of more than one day in column "gap", basead on the column gap_days. Now, I need to split my data. Each sequence of gaps it will be a new individual frame.

So, if I have the ID 20162.03, and this id have one or more gaps, this sequence will be split according the number of gaps. For this, I will use the package move and the fuctions burst" and split

But, for this I need to create a new column with a sequence of numbers indicate the new separations of ids, as (seq column):

id                date              lc     lon     lat    gap_days  gap seq
1 20162.03 2003-10-19 14:33:00 Tagging -39.370 -18.480         NA <NA>   1
2 20162.03 2003-10-21 12:19:00       1 -38.517 -18.253 1.90694444  gap   1
3 20162.03 2003-10-21 13:33:00       1 -38.464 -18.302 0.05138889   no   1
4 20162.03 2003-10-21 16:38:00       A -38.461 -18.425 0.12847222   no   1
5 20162.03 2003-10-21 18:50:00       A -38.322 -18.512 0.09166667   no   1
6 20162.03 2003-10-23 10:33:00       B -38.674 -19.824 1.65486111  gap   2
7  20162.03 2003-10-23 17:52:00       B -38.957 -19.511 0.30486111   no  2
8  20162.03 2003-11-02 08:14:00       B -42.084 -24.071 9.59861111  gap  3
9  20162.03 2003-11-02 09:36:00       A -41.999 -24.114 0.05694444   no  3
10 20687.03 2003-10-27 17:02:00 Tagging -39.320 -18.460         NA <NA>  4
11 20687.03 2003-10-27 19:44:00       2 -39.306 -18.454 0.11250000   no  4
12 20687.03 2003-10-27 21:05:00       1 -39.301 -18.458 0.05625000   no  4

But, as can you see I have a sequencie of "gaps" and "no", but also NA's. I can't find a solution. does anyone have a solution?

EDIT:

structure(list(id = c("20162.03", "20162.03", "20162.03", "20162.03", 
"20162.03", "20162.03", "20162.03", "20162.03", "20162.03", "20687.03", 
"20687.03", "20687.03"), date = structure(c(1066573980, 1066738740, 
1066743180, 1066754280, 1066762200, 1066905180, 1066931520, 1067760840, 
1067765760, 1067274120, 1067283840, 1067288700), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), lc = structure(c(4L, 1L, 1L, 2L, 2L, 
3L, 3L, 3L, 2L, 4L, 6L, 1L), .Label = c("1", "A", "B", "Tagging", 
"0", "2", "3", "N", "P", "Z"), class = "factor"), lon = c(-39.37, 
-38.517, -38.464, -38.461, -38.322, -38.674, -38.957, -42.084, 
-41.999, -39.32, -39.306, -39.301), lat = c(-18.48, -18.253, 
-18.302, -18.425, -18.512, -19.824, -19.511, -24.071, -24.114, 
-18.46, -18.454, -18.458), gap_days = c(NA, 1.90694444444444, 
0.0513888888888889, 0.128472222222222, 0.0916666666666667, 1.65486111111111, 
0.304861111111111, 9.59861111111111, 0.0569444444444444, NA, 
0.1125, 0.05625), gap = c(NA, "gap", "no", "no", "no", "gap", 
"no", "gap", "no", NA, "no", "no")), row.names = c(NA, 12L), class = "data.frame")
Anne Elise
  • 133
  • 2
  • 9
  • Does this answer your question? [Split a large dataframe into a list of data frames based on common value in column](https://stackoverflow.com/questions/18527051/split-a-large-dataframe-into-a-list-of-data-frames-based-on-common-value-in-colu) – jolii Oct 29 '20 at 12:47

1 Answers1

1

A simple solution with Base R:

df$seq <- ave(sapply(df$gap, identical, "gap"), df$id, FUN = cumsum)
df
#>          id                date      lc     lon     lat   gap_days  gap seq
#> 1  20162.03 2003-10-19 14:33:00 Tagging -39.370 -18.480         NA <NA>   0
#> 2  20162.03 2003-10-21 12:19:00       1 -38.517 -18.253 1.90694444  gap   1
#> 3  20162.03 2003-10-21 13:33:00       1 -38.464 -18.302 0.05138889   no   1
#> 4  20162.03 2003-10-21 16:38:00       A -38.461 -18.425 0.12847222   no   1
#> 5  20162.03 2003-10-21 18:50:00       A -38.322 -18.512 0.09166667   no   1
#> 6  20162.03 2003-10-23 10:33:00       B -38.674 -19.824 1.65486111  gap   2
#> 7  20162.03 2003-10-23 17:52:00       B -38.957 -19.511 0.30486111   no   2
#> 8  20162.03 2003-11-02 08:14:00       B -42.084 -24.071 9.59861111  gap   3
#> 9  20162.03 2003-11-02 09:36:00       A -41.999 -24.114 0.05694444   no   3
#> 10 20687.03 2003-10-27 17:02:00 Tagging -39.320 -18.460         NA <NA>   0
#> 11 20687.03 2003-10-27 19:44:00       2 -39.306 -18.454 0.11250000   no   0
#> 12 20687.03 2003-10-27 21:05:00       1 -39.301 -18.458 0.05625000   no   0

And then split it:

split(df, list(df$id, df$seq), drop = TRUE)
#> $`20162.03.0`
#>         id                date      lc    lon    lat gap_days  gap seq
#> 1 20162.03 2003-10-19 14:33:00 Tagging -39.37 -18.48       NA <NA>   0
#> 
#> $`20687.03.0`
#>          id                date      lc     lon     lat gap_days  gap seq
#> 10 20687.03 2003-10-27 17:02:00 Tagging -39.320 -18.460       NA <NA>   0
#> 11 20687.03 2003-10-27 19:44:00       2 -39.306 -18.454  0.11250   no   0
#> 12 20687.03 2003-10-27 21:05:00       1 -39.301 -18.458  0.05625   no   0
#> 
#> $`20162.03.1`
#>         id                date lc     lon     lat   gap_days gap seq
#> 2 20162.03 2003-10-21 12:19:00  1 -38.517 -18.253 1.90694444 gap   1
#> 3 20162.03 2003-10-21 13:33:00  1 -38.464 -18.302 0.05138889  no   1
#> 4 20162.03 2003-10-21 16:38:00  A -38.461 -18.425 0.12847222  no   1
#> 5 20162.03 2003-10-21 18:50:00  A -38.322 -18.512 0.09166667  no   1
#> 
#> $`20162.03.2`
#>         id                date lc     lon     lat  gap_days gap seq
#> 6 20162.03 2003-10-23 10:33:00  B -38.674 -19.824 1.6548611 gap   2
#> 7 20162.03 2003-10-23 17:52:00  B -38.957 -19.511 0.3048611  no   2
#> 
#> $`20162.03.3`
#>         id                date lc     lon     lat   gap_days gap seq
#> 8 20162.03 2003-11-02 08:14:00  B -42.084 -24.071 9.59861111 gap   3
#> 9 20162.03 2003-11-02 09:36:00  A -41.999 -24.114 0.05694444  no   3
Edo
  • 7,567
  • 2
  • 9
  • 19
  • Hi, @Edo. Thanks for the message, but I still have the problem with NA's. – Anne Elise Oct 29 '20 at 12:57
  • How do you want to handle NAs? – Edo Oct 29 '20 at 12:58
  • I tried ```df %>% group_split(gap)``` But shows the message that i have NA's – Anne Elise Oct 29 '20 at 13:00
  • 1
    @Anne, NA in the gap column should be "no"? It looks like you have NA at the beginning of each `id` sequence (which makes sense if you calculated `gap_days` from `date`). I think the split I did in my edit is what you are looking for. – Edo Oct 29 '20 at 13:03
  • the NA's are shown because is the first location that I have for each individuals, so I can have the difference time. NA's isn't a new track, but is a new sequence, like the gaps. I demonstrated in the exemple of the column **seq**. – Anne Elise Oct 29 '20 at 13:04
  • 1
    Yes but in your example the second row is more than 1 day after the first row. According to your own rules, that implies the first and the second row must be in two different groups. It is not the same for the first and the second row of the second ID, which are indeed in the same group because they are closer than a day. – Edo Oct 29 '20 at 13:06
  • I tried to run your suggestion but I only got "NA's" in the "seq" column. Do you know why? – Anne Elise Oct 29 '20 at 13:13
  • As you see from my reproducible example, it works to me.. update your question by sharing the output of: `dput(head(df, 12))`. And we will see why that happens. – Edo Oct 29 '20 at 13:15
  • is it what you're looking for? – Edo Oct 29 '20 at 13:29