
I have a large CSV with hundreds of columns that I want to ingest into SQL Server. I'm using pandas to read it as a dataframe and invoking a stored procedure through pyodbc to write the data. I've done this many times before with smaller CSVs. In this case, I have several columns that represent datetime values. The problem is that, within a single column, rows can use different formats for the date:

'2020-07-08T09:00:01+00:00'
'2020-07-08T05:00:07Z'
'2020-07-08T06:40:23.000Z'

I can parse all of these with

from dateutil import parser
z1 = parser.isoparse(z)

where z is any of those strings. Here's my dilemma: if I just write these out as strings, the SQL call accepts the format '2020-07-08T06:40:23.000Z' but not the others. So I want to write everything in that format.

So I try

date_format = "%Y-%m-%dT%H:%M:%S.%fZ"
parser.isoparse(z).strftime(date_format)

BUT, that format writes the string '2012-11-29T16:33:00.000000Z', not '2012-11-29T16:33:00.000Z'. There must be exactly three zeros between the last period and the Z. I tried using "%3f", but strftime() does not like that. I could just "fix it" with my own function to pull out the extra "0"s, but that seems tacky. It seems like there should be some way to get it to write directly in the format SQL seems to expect. What is the incantation?
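For reference, the "tacky" fix I have in mind is just a slice, since %f always emits exactly six digits:

from dateutil import parser

z = '2012-11-29T16:33:00+00:00'
s = parser.isoparse(z).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
# '2012-11-29T16:33:00.000000Z' -- drop the three extra digits before the 'Z':
s = s[:-4] + 'Z'
# '2012-11-29T16:33:00.000Z'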

elbillaf

[This question](https://stackoverflow.com/questions/11040177/datetime-round-trim-number-of-digits-in-microseconds) is related. The best answer is simply to do what you mention and slice the string. – Dodge Aug 26 '20 at 03:35

You could also consider nested try/except blocks, each with a different format string for strptime. Admittedly not elegant, but any unmatched format would raise a ValueError. – jsmart Aug 26 '20 at 03:44
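A sketch of that strptime approach, flattened into a loop over candidate formats rather than nested try/except blocks (note that %z accepts both '+00:00' offsets and a bare 'Z' on Python 3.7+):

from datetime import datetime

def parse_mixed(z):
    # try each candidate format in turn; unmatched strings raise ValueError
    formats = ("%Y-%m-%dT%H:%M:%S%z",     # '+00:00' offsets and bare 'Z'
               "%Y-%m-%dT%H:%M:%S.%f%z")  # fractional seconds, e.g. '.000Z'
    for fmt in formats:
        try:
            return datetime.strptime(z, fmt)
        except ValueError:
            continue
    raise ValueError(f"no candidate format matches {z!r}")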

1 Answer


It seems that parsing to datetime is not the issue here; pd.to_datetime does a good job with all of these formats (it essentially uses dateutil.parser.parse). To get the desired format with 3-digit fractional seconds, you can use a regex search & replace, as described in the pandas docs for str.replace. Ex:

import pandas as pd

df = pd.DataFrame({'date0': ['2020-07-08T09:00:01+00:00'], 
                   'date1': ['2020-07-08T05:00:07Z'], 
                   'date2': ['2020-07-08T06:40:23.000Z']})

# parsing to datetime is no issue:
df = df.applymap(pd.to_datetime)

# now replace the ".ffffffUTC" suffix (produced by %f and %Z) with ".fffZ":
pat = r'\.[0-9]{6}UTC'
repl = lambda m: m.group(0)[:4] + 'Z'  # keep '.' + first three digits, append 'Z'

# e.g. like
df['date0'].dt.strftime("%Y-%m-%dT%H:%M:%S.%f%Z").str.replace(pat, repl)
# 0    2020-07-08T09:00:01.000Z
# Name: date0, dtype: object

Note that this truncates to milliseconds. If you wish to round instead, do so beforehand, e.g. using Series.dt.round with the millisecond frequency alias 'L'.
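A minimal example of that (dt.floor('L') would truncate explicitly instead):

# round (rather than truncate) to milliseconds before formatting;
# 'L' is pandas' millisecond frequency alias:
df['date0'] = df['date0'].dt.round('L')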


If you know all datetimes are in UTC, you can of course also simply cut the string and append the 'Z':

df['date0'].dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + 'Z'
# 0    2020-07-08T09:00:01.000Z
# Name: date0, dtype: object

Interestingly, the slicing method doesn't seem to bring a performance increase in a quick call to timeit:

%timeit df['date0'].dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + 'Z'
615 µs ± 38.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['date0'].dt.strftime("%Y-%m-%dT%H:%M:%S.%f%Z").str.replace(pat, repl)
468 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
FObersteiner

Probably both faster and clearer to use `[:-4] + 'Z'` rather than `.str.replace(...)`. – Jiří Baum Aug 26 '20 at 07:06

If speed is an issue and rounding is needed, it can also be done by adding 0.0005 s and then truncating; however, do measure to check whether it is in fact faster. – Jiří Baum Aug 26 '20 at 07:07

@sabik: yes, explicit slicing will be faster. But that is not always the issue; here, the regex replace is specific to 'UTC' and skips strings with another tzinfo. – FObersteiner Aug 26 '20 at 07:17

@sabik: just did a small benchmark; the `replace` method seems to perform pretty well. – FObersteiner Aug 26 '20 at 12:45
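For completeness, a sketch of the add-then-truncate rounding from the comment above (not benchmarked here; assumes tz-aware pandas datetimes):

import pandas as pd

s = pd.Series(pd.to_datetime(['2020-07-08T06:40:23.4567Z']))
# add half a millisecond, then floor to millisecond precision ('L'):
rounded = (s + pd.Timedelta(microseconds=500)).dt.floor('L')
# 0   2020-07-08 06:40:23.457000+00:00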