0

I am importing a very large fixed-width dataset into R and wish to use vroom for much better speed. However, the dates in this dataset are in numeric format with either 7 or 8 digits, depending on whether the day of the month has 1 or 2 digits (examples below).

#8 digit date (1985-03-21):
#  21031985
#7 digit date (1985-03-01):
#  1031985

I cannot see any way to specify this type of format using col_date(format = ) as one normally would. It is easy to make a function that converts these 7/8 digit numbers into dates, but doing that means materialising the imported data and removes the speed advantage that vroom provides.

I am looking for a way to have vroom interpret these numbers on its own, or a workaround that does not sacrifice vroom's speed.

Thanks very much for any help here.

Richard Berry
  • 396
  • 1
  • 9
  • Try `format = "%d%m%Y"` – Chamkrai May 03 '22 at 15:12
  • 2
    That is a nightmare of a data format. If you can't fix the input to make it more normalized, you could try just importing as character values and then doing the transformation to date as a separate step. It's not clear how you've currently written your code that's triggering the slower behavior. – MrFlick May 03 '22 at 15:12
  • 2
    @TomHoel I suggest you try it, that doesn't work with the second example (without zero-padding, that is). – r2evans May 03 '22 at 15:12
  • The slower behaviour is a result of how vroom imports data- it doesn't import everything at once, it retrieves it from the location specified on demand when needed. So If I do the transformation afterwards, it must retrieve that whole column and imports the entire data set like a much slower function would normally. – Richard Berry May 05 '22 at 07:55

3 Answers3

2

Those formats are horrible in general, but regardless I expect nothing in readr is going to work right for you because of the 1 or 2 digit day-of-month. I suggest importing reading that column in as col_character, then post-processing them with

vec <- c("21031985", "1031985")
as.Date(paste0(strrep("0", pmax(8 - nchar(vec), 0)), vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"

Quick walk-through:

  • 8 - nchar(vec) tells us how many 0s need to be padded to the left of each string. In this case, it should be 0 and 1, respectively. This might be a problem if you have length 6 strings, only you know if that's an issue.

  • strrep("0", ..) repeats the 0 string as many times as we need, including strrep("0", 0) producing "" (no zeroes).

  • pmax(.., 0) is the defensive programmer, if there's a length-9 string in there, we cannot do strrep("0", -1), we want to keep it from going negative.

  • paste0(.., vec) to do the actual padding.

From there, all strings should be normalized and able to be converted using "%d%m%Y".

r2evans
  • 141,215
  • 6
  • 77
  • 149
1

Vroom can use a pipe as input. That means you can use a tool like awk to fix the format (e.g. make it always 8 digit, which is eaasy with sprintf). That way you can still benefit from vroom streaming the file. You could even use R - but if you are after performance, you need something that can process the file streaming and better be lightweight.

I used a test file test.csv:

id,date,text
1,1022020,some
2,12042020,more
3,2012020,text

I could read it via (of course the awk call needs to be adjusted for your data - but essentially if you need to just adjust the column $2 means 2nd column, the ',' specifies the separator):

vroom(pipe("awk -F ',' 'BEGIN{OFS=\",\"}; NR==1{print}; NR!=1 {$2=sprintf(\"%08d\",$2);print;}' test.csv"),
      col_types=cols(date=col_date(format='%d%m%Y'))
)

giving

# A tibble: 3 × 3
     id date       text
  <int> <date>     <chr>
1     1 2020-02-01 some
2     2 2020-04-12 more
3     3 2020-01-02 text
bdecaf
  • 4,652
  • 23
  • 44
  • This is perfect, thanks so much! Other responses seemed to ignore that I didn't want to post-process the data due to vroom's streaming – Richard Berry May 05 '22 at 07:52
0

If you have integer data you can left pad the lost 0s back on.

as.Date(sprintf("%08d", vec), format = "%d%m%Y")
# [1] "1985-03-21" "1985-03-01"
s_baldur
  • 29,441
  • 4
  • 36
  • 69