
I am new to the R language. My problems are:

  1. I want to manage origin-destination locations with multiple ids. Here is an example of my raw data (CSV file):
vehicle_id location time
0111111111 13_100.27 13.58
0111111111 13_140.87 17.38
0222222222 12_445.78 02.15

P.S. There are many vehicle_id rows; some share the same id and others differ.

  2. I would like to group the rows by vehicle_id and derive the start and finish location for each vehicle based on time, like this:
vehicle_id  location of origin  start time  location of destination  finish time
0111111111  13_100.27           13.58       13_140.87                17.38
0222222222  13_140.87           17.38       13_102.99                23.57
0333333333  12_445.78           02.15       10.589.58                05.69

How can I code it to get the result like above? Thanks for your help in advance.

Yasumin
    Hi! If you provide a small sample of your dataset that readers can directly work with, you might get helpful answers more quickly. Try, for example, `dput(head(your_data_set, 10))` and append the output to your question. – Marcelo Avila Mar 25 '21 at 15:38
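For reference, the `dput` call the comment suggests would look like this. The `your_data_set` frame below is a hypothetical sample built to match the question's columns; the real data would be read from the CSV instead.

```r
# Hypothetical sample matching the question's columns (the real data
# would come from read.csv on the original file)
your_data_set <- data.frame(
  vehicle_id = c("0111111111", "0111111111", "0222222222"),
  location   = c("13_100.27", "13_140.87", "12_445.78"),
  time       = c(13.58, 17.38, 2.15),
  stringsAsFactors = FALSE
)

# Emit the first rows in a form readers can paste into their own session
dput(head(your_data_set, 10))
```

The printed `structure(...)` output can then be pasted directly into the question.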

2 Answers


Does this work?

library(dplyr)

df %>%
  group_by(vehicle_id) %>%
  summarise(
    `location origin`      = location,
    `start time`           = time,
    `location destination` = lead(location),
    `finish time`          = lead(time)
  ) %>%
  na.omit()
`summarise()` regrouping output by 'vehicle_id' (override with `.groups` argument)
# A tibble: 2 x 5
# Groups:   vehicle_id [2]
  vehicle_id `location origin` `start time` `location destination` `finish time`
       <dbl> <chr>                    <dbl> <chr>                          <dbl>
1  111111111 13_100.27                13.6  13_140.87                       17.4
2  222222222 12_445.78                 2.15 13_102.99                       23.6

Data used:

df
# A tibble: 4 x 3
  vehicle_id location   time
       <dbl> <chr>     <dbl>
1  111111111 13_100.27 13.6 
2  111111111 13_140.87 17.4 
3  222222222 12_445.78  2.15
4  222222222 13_102.99 23.6 
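As an aside: in newer dplyr versions, `summarise()` warns (and in dplyr ≥ 1.1 is actively discouraged) when it returns more than one row per group, as the code above does before `na.omit()`. A sketch of the same idea that stays at one row per group, using `first()`/`last()` after sorting by time, might look like this (the `df` below reproduces the answer's data, with a second row added for the second vehicle so both have an origin and a destination):

```r
library(dplyr)

# Sample data matching the answer's df, extended so each vehicle has two rows
df <- data.frame(
  vehicle_id = c(111111111, 111111111, 222222222, 222222222),
  location   = c("13_100.27", "13_140.87", "12_445.78", "13_102.99"),
  time       = c(13.58, 17.38, 2.15, 23.57),
  stringsAsFactors = FALSE
)

result <- df %>%
  arrange(vehicle_id, time) %>%          # order trips within each vehicle
  group_by(vehicle_id) %>%
  summarise(
    `location origin`      = first(location),  # earliest row per vehicle
    `start time`           = first(time),
    `location destination` = last(location),   # latest row per vehicle
    `finish time`          = last(time),
    .groups = "drop"
  )

result
```

Note that a vehicle with only one row would get the same origin and destination here, rather than being dropped as `na.omit()` does above.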
Karthik S

Using SQL we can extract the start time and location for each trip into start_data, and similarly the end time and location into end_data, and then join them. This makes use of the fact that when min() or max() is used, SQLite automatically picks the value of a bare column (here location) from the same row that minimizes or maximizes the aggregate.

library(sqldf)

sqldf("
 with start_data as (
   select vehicle_id, location start_location, min(time) start_time 
   from Trips 
   group by vehicle_id),
 end_data as (
   select vehicle_id, location end_location, max(time) end_time
   from Trips 
   group by vehicle_id)
 select * from start_data left join end_data using(vehicle_id)")

giving:

  vehicle_id start_location start_time end_location end_time
1  111111111      13_100.27      13.58    13_140.87    17.38
2  222222222      12_445.78       2.15    12_445.78     2.15

Note

The input in reproducible form:

Trips <- structure(list(vehicle_id = c(111111111L, 111111111L, 222222222L
), location = c("13_100.27", "13_140.87", "12_445.78"), time = c(13.58, 
17.38, 2.15)), class = "data.frame", row.names = c(NA, -3L))
G. Grothendieck
  • Thank you for your help, Grothendieck! However, after I applied your code, my start time and end time are very weird; they aren't times anymore. For example, the value of the start time is 36 and the end time is 86312, but the rest of the result is satisfactory. – Yasumin Mar 25 '21 at 14:35
  • The data in the question needs to be provided in a reproducible form. Since that was not done, I did it for you in the Note, and with that you can see from the output in the answer that it worked correctly. Note the instructions at the top of the [tag:r] tag page, which ask you to provide data showing the output of `dput(X)`, where X is your input cut down suitably. – G. Grothendieck Mar 25 '21 at 15:05
  • Also note that the solutions at this link: https://stackoverflow.com/questions/66787354/dplyracross-performance-and-dplyrsummarise-to-data-table-efficiency/66819758#66819758 may be applicable. In particular the solution using the collapse package runs very fast. – G. Grothendieck Mar 27 '21 at 09:07
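For comparison, the same min/max-and-join idea from this answer can be sketched in base R without any extra packages, using the `Trips` data frame from the Note. This is an editor's sketch, not part of either answer:

```r
# Trips as given in the answer's Note
Trips <- data.frame(
  vehicle_id = c(111111111L, 111111111L, 222222222L),
  location   = c("13_100.27", "13_140.87", "12_445.78"),
  stringsAsFactors = FALSE
)
Trips$time <- c(13.58, 17.38, 2.15)

# Row index of the earliest and latest time within each vehicle_id
start_idx <- tapply(seq_len(nrow(Trips)), Trips$vehicle_id,
                    function(i) i[which.min(Trips$time[i])])
end_idx   <- tapply(seq_len(nrow(Trips)), Trips$vehicle_id,
                    function(i) i[which.max(Trips$time[i])])

start_data <- Trips[start_idx, ]
names(start_data) <- c("vehicle_id", "start_location", "start_time")
end_data <- Trips[end_idx, ]
names(end_data) <- c("vehicle_id", "end_location", "end_time")

# Equivalent of the SQL join on vehicle_id
od <- merge(start_data, end_data, by = "vehicle_id")
od
```

As with the sqldf answer, a vehicle with a single row (222222222 here) gets identical start and end values.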