I am working with a CSV file that has a column of dates, but the dtype of this column is just object, so I converted it to datetime. That step went without a flaw: nothing changed except the dtype. But when I write this dataframe to a parquet file, every single row turns into the same date, and it is not even in the previous date format.

The date format in the CSV is "%Y-%m-%d", e.g. 2011-01-29. These are the last few steps after working on the dataframe:

df_merged_CA["date"] = pd.to_datetime(df_merged_CA["date"], format = "%Y-%m-%d")

df_merged_CA.to_parquet("merged1.parquet", compression = "gzip", engine = "pyarrow")

After the conversion to datetime I checked whether the dates were still correct, and they were: still in the form 2011-01-29. Then I checked the parquet file and saw that every date had turned into something like 43060-07-05.03:00:00.000. I read that the problem might be caused by the timezone, so I changed the datetime conversion to the following, but nothing changed.

df_merged_CA["date"] = pd.to_datetime(df_merged_CA["date"],
 format = "%Y-%m-%d").dt.tz_localize('UTC').dt.tz_convert('Europe/Berlin')
  • I recently found a similar problem after going a little deeper. This post managed to solve my problem. I will keep this here in case someone comes across the same problem: https://stackoverflow.com/questions/57798479/editing-parquet-files-with-python-causes-errors-to-datetime-format – miraakbutnotded Aug 24 '23 at 20:33
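
A possible explanation for the odd year, offered as an assumption since nothing in this thread confirms it: pandas stores datetimes as nanoseconds since 1970-01-01, and a reader that interprets that integer as microseconds would land tens of thousands of years in the future. The sketch below uses only the standard library to show the arithmetic. `misread_year` is a hypothetical helper, and the result is approximate because it uses an average year length.

```python
from datetime import datetime, timezone

# Average length of a Gregorian year in seconds (an approximation).
SECONDS_PER_YEAR = 365.2425 * 24 * 3600


def misread_year(date_str: str) -> int:
    """Approximate year displayed if nanoseconds since the epoch
    are wrongly interpreted as microseconds (assumed failure mode)."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    epoch_seconds = int(dt.timestamp())
    epoch_ns = epoch_seconds * 1_000_000_000  # what a nanosecond timestamp stores

    # A reader that assumes microseconds divides by 1e6 instead of 1e9,
    # so the elapsed time comes out 1000 times too large.
    misread_seconds = epoch_ns / 1_000_000
    return 1970 + int(misread_seconds // SECONDS_PER_YEAR)


print(misread_year("2011-01-29"))
```

For 2011-01-29 this gives a year in the 43,000s, the same range as the 43060 shown above, which is why a unit mismatch looks like a plausible cause.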

1 Answer

Your approach of assigning a time zone was already correct. You just have to convert the date column to a string before you write the data to the parquet file. You can do this with .dt.strftime() from pandas, passing your desired format as the argument.

Here is the complete code:

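# Parse the CSV strings into datetime values
df_merged_CA["date"] = pd.to_datetime(df_merged_CA["date"], format="%Y-%m-%d")

# Attach a time zone to the naive datetimes
df_merged_CA["date"] = df_merged_CA["date"].dt.tz_localize('Europe/Berlin')

# Store the date as a plain string and drop the datetime column
df_merged_CA["date_str"] = df_merged_CA["date"].dt.strftime('%Y-%m-%d')
df_merged_CA.drop(["date"], axis=1, inplace=True)

# Write the dataframe, now without any datetime column, to parquet
df_merged_CA.to_parquet("merged1.parquet", compression="gzip", engine="pyarrow")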
dominikfey
    Thank you! I already found a way to solve my problem. But I will keep this in mind in case I need it. If you are curious about the solution I found, I added a comment leading to the post including the solution. – miraakbutnotded Aug 24 '23 at 20:37