How to iterate through the columns of a data frame efficiently?

Question

So, data is a data frame consists of many columns, and one of which called lpep_pickup_datetime has date and time in the format of "01/01/2016 12:39:36 AM"

I want to analyze these data by date and time, so I am trying to create a new column named pickup_date and one named pickup_time with AM or PM information.

I have used the strsplit function to split the string into the following form: c("01/01/2016", "12:29:24", "AM"), and I am trying to create the aforementioned columns out of this data.

I have wrote the following code:

data$lpep_pickup_datetime=strsplit(data$lpep_pickup_datetime, " ")

data$pickup_date=data$lpep_pickup_datetime[[1]][1]


for (i in seq(1,90181))
{
  data$pickup_time[i]=data$lpep_pickup_datetime[[i]][2]  
}

This is gravely inefficient, as it takes too long to iterate through 90181 rows of data. Is there a better way to accomplish this task?

Thanks.

Please show a small reproducible example and expected output? — akrun, Feb 17 '17 at 05:22
It would be a lot more useful to parse the datetime to a datetime class like POSIXct, after which you can extract parts if you really need. — alistaire, Feb 17 '17 at 05:27
Advisable to not play with strings. Instead convert them to Date-time object and extract Date and time separately as shown in the example. — Ronak Shah, Feb 17 '17 at 05:28

akrun · Answer 1 · 2017-02-17T17:27:47.340

1

In base R, we can use sub to create a delimiter and then with read.csv create two columns

data[paste0("pickup_", c("date", "time"))] <- read.csv(text=sub("\\s+", 
         ",", data$lpep_pickup_datetime),  header=FALSE, stringsAsFactors=FALSE)

edited Feb 17 '17 at 17:27

answered Feb 17 '17 at 05:24

akrun

874,273
37
540
662

I think there is a mismatched parenthesis for paste0. Also, it is giving me this error message:"Error in `[.data.table`(x, i, which = TRUE) : When i is a data.table (or character vector), the columns to join by must be specified either using 'on=' argument (see ?data.table) or by keying x (i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM." – user101998 Feb 17 '17 at 17:27
@user101998 Yes, you are right. Forgot about the `)`.. Fixed it. If your dataset is `data.table`, the syntax is different – akrun Feb 17 '17 at 17:28

Jean · Accepted Answer · 2017-02-17T05:42:41.747

1

?apply(df, 2, function(...) ) is the usual way of iterating through columns. But you don't need to do that here

> df<-data.frame("datetime" = format(seq(c(ISOdate(2000,3,20)), by = "day", length.out = 100000), "%Y-%m-%d %r"), stringsAsFactors=FALSE)
> str(df)
'data.frame':   100000 obs. of  1 variable:
 $ datetime: chr  "2000-03-20 08:00:00 PM" "2000-03-21 08:00:00 PM" "2000-03-22 08:00:00 PM" "2000-03-23 08:00:00 PM" ...
> df$dateonly<-format(as.Date(df$datetime, format="%Y-%m-%d %r"),"%Y-%m-%d")
> head(df)
                datetime   dateonly
1 2000-03-20 08:00:00 PM 2000-03-20
2 2000-03-21 08:00:00 PM 2000-03-21
3 2000-03-22 08:00:00 PM 2000-03-22
4 2000-03-23 08:00:00 PM 2000-03-23
5 2000-03-24 08:00:00 PM 2000-03-24
6 2000-03-25 08:00:00 PM 2000-03-25

edited Feb 17 '17 at 05:42

answered Feb 17 '17 at 05:28

Jean

1,480
15
27

Since my format is of the form `01/01/2016 12:39:36 AM`, I modified your code to `data$pickup_date<-format(as.Date(data$lpep_pickup_datetime, format="%m-%d-%y %r"),"%m-%d-%y")`, but the result is a bunch of NA's. – user101998 Feb 17 '17 at 17:34
You were careless. Your format would be `%m/%d/%Y %r`. – Jean Feb 17 '17 at 17:42

How to iterate through the columns of a data frame efficiently?

2 Answers2