0

I am reading a big data file from a database (test1). Millions of rows that I cannot read and processed directly in R.

I would like to create sub-files from this big file based on the column "horodatage". I gave an example below to extract one file from the big file, but now I want to do it for all the file not only between these two dates.

The split must start at this date "23/03/2005 11:00" and go until the end of the big file (approximatively around "31/12/2005 23:59" (test1 from the data base) and the duration of one sub file must be 30min (in other words: exactly 36000 rows per sub files).

Each sub file must then be saved with a name like (A200503231100.dat, A200503231130.dat,A200503231200.dat, A200503231230.dat etc...)

The format of the column horodatage is already

> class(montableau$horodatage)
[1] "POSIXct" "POSIXt" 

The code I started with is:

heuredebut = "23/03/2005 11:00"
heurefin = "23/03/2005 11:30"
query = paste("select * from test1 where horodatage >= ",heuredebut," and horodatage < ",heurefin," order by horodatage;",sep="'")
montableau <- dbGetQuery (connection_db,query)

If you have any insights of the loop to do on this big file, it would be very helpful.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
virginie
  • 35
  • 7

1 Answers1

1

Dates in R are sort of notoriously annoying.

The key trick here is to use the strptime function to format dates in the way you need them.

# Quick function to go from string to time
cleanDate <- function(x){
    as.POSIXct(strptime(x, '%d/%m/%Y %H:%M'))
}

# Function to print time in format of SQL database
printDate <- function(x){
    as.character(x, '%d/%m/%Y %H:%M')
}


# Create sequence of times
times <- seq(
    cleanDate('23/03/2005 11:00'), 
    cleanDate('01/01/2006 00:00'), 
    by = 60 * 30) # adding 30 minutes

for( i in 1:(length(times) - 1) ){

    # Generate SQL
    sql <- paste("select * from test1 where horodatage >= ", 
        printDate(times[i]),
        " and horodatage < ",
        printDate(times[i+1]),
        " order by horodatage;",sep="'")

    # Query
    montableau <- dbGetQuery (connection_db, sql)

    # Write table
    write.table(montableau, 
        file= as.character(times[i], 'A%Y%m%d%H%M.dat'), 
        row.names=FALSE, sep="\t", quote=FALSE)

}
  • Thank you very much user2133017. This is really interesting. The fact is that the extraction for i, i+1, i+2...gives the same outputfile. The name is good (A200503231100.dat, A200503231130.dat, A200503231200.dat....) but the content is the same (the data from the first file actually between 11h and 11h30). I am not sure why it doesn't go through the database correctly while the code seems very good to me...any idea? – virginie Feb 18 '16 at 15:37
  • Got it!! Actually, I forgot to change "query" into the new name of the "paste command : sql", in the dbGetQuery command! That is working perfectly now! Many thanks for such an helpful answer! – virginie Feb 18 '16 at 15:51