1

I'm using pandas to load data from Excel with the resulting DataFrame containing both strings and dates. The columns containing strings are of dtype "object" while the date-columns are of dtype "datetime64[ns]". At some point in my code I need to convert one column from datetime to string for writing back to Excel, but pandas will not let me do that if I try it the way that would be most obvious to me and seems to be the recommended way of doing it according to documentation: using .loc to get the column to be changed and assign them with the same column converted to strings.

I have found ways to circumvent the problem and get pandas to do what I need, but either this is a bug or I do not understand some underlying mechanic which could come back to bite me in the longer run, hence my question.

The code to reproduce this (occurs in both pandas 2.0.0 and 2.0.1, this might cause the problem) is as follows (in the actual DataFrame I am using there are many more than a single column):

import pandas as pd

not_yet_datetime_df = pd.DataFrame([["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06"]]).T
datetime_df = not_yet_datetime_df.astype("datetime64[ns]")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].dt.strftime("%d.%m.%Y")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].astype("object")  # neither of these two will work for me
print(datetime_df.dtypes)  # will return "datetime64[ns]" for this single column

There are multiple ways to circumvent this that work for me, including simply replacing line 5 with datetime_df[0] = datetime_df.loc[:, 0].dt.strftime("%d.%m.%Y") (omitting the .loc left of the equals sign) and I at least can get the column to being "object" dtype with datetime_df = datetime_df.astype({0:"object"}), but I don't quite understand why especially the first solution works and what I misunderstood about .loc - or about datetimes in general.

I read a bit into the pandas 2.0.0 change on returning views vs. copies but to my (limited) understanding this should not be affected by any of the 2.0.0 changes.

Could anyone help me understand what's happening here under the hood? I like using .loc over assigning just with []-brackets and I feel like it's not as intuitive as I had hoped.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
Archernar
  • 33
  • 6
  • I don't see why you need to use `loc` here in the first place - If you want to convert a column to datetime, why not simply use `df["dt"] = pd.to_datetime(df["date-as-string"])`? If you want to convert to string, why not use `df["back-to-string"] = df["dt"].dt.strftime("%Y-%m-%d")`? – FObersteiner Apr 27 '23 at 18:12
  • @FObersteiner that's exactly what I am doing as a workaround for the issue. According to [pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html), the recommended way of doing it is using .loc though. Also there is no reason for brackets alone to work but .loc behaving differently, at least none that I would understand, which is the point of my question. – Archernar Apr 28 '23 at 06:52
  • I don't think this is a work-around; to the contrary it's the pandas-normal way to create or change whole Series / df columns. You use `loc` if you want to modify *part* of a Series/column for instance, or a part of a df that is not an individual column. – FObersteiner Apr 28 '23 at 08:04
  • As I understand the documentation of pandas, using brackets is the most intuitive way when coming from python in general and numpy specifically while the specialized methods (`loc`, `at`, `iloc` etc) are still to be preferred wherever possible; this stems from the introductory note to their indexing section I linked to above and is mentioned at a few places in the rest of the guide. In any case, the two methods should be equal, which they are not. My questions was mostly about the why and if that is even intended behaviour. Other working methods existing does not really answer that. – Archernar Apr 28 '23 at 09:16

1 Answers1

1

See also Does .loc[:, ['A','B']] assignment allow to change the dtype of the columns? - .loc tries to cast back to the original type. There's a note in the docs on this, that might be hard to find:

When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.

Anyways, that's why it keeps being datetime64[ns]. For demonstration, if you change the datetime format to something that pandas' parser doesn't accept, "upcasting" fails and the dtype is kept:

import pandas as pd

not_yet_datetime_df = pd.DataFrame([["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06"]]).T
datetime_df = not_yet_datetime_df.astype("datetime64[ns]")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].dt.strftime("%d. asdf %m. xx %Y")
print(datetime_df.dtypes) 
# 0    object
# dtype: object

In general, I would still argue that in this case, the most basic [ ] gets you safely where you want to. If you create a new column, you're neither slicing nor selecting nor indexing. Same goes for if you replace a column (e.g. string with datetime dtype Series; ok you're selecting something here...). So I don't see the point of using loc here in the first place.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • I generally never use `[ ]` unless assigning new columns because it sometimes behaves differently than `.loc` and i don't want to worry about that potentially subtle behaviour; also `.loc` gives more control over what I am doing in general and allows slicing on both axes while the brackets alone do not. That is why I had the impression I could just always use `.loc` and be done with it; turns out I can not. Thanks alot for the answer. – Archernar Apr 28 '23 at 14:25
  • @Archernar Actually, I was surprised by this behavior of `loc`. I would have expected it to work as you did. Also, don't get me wrong, `loc` is great for selecting and modifying specific parts of dataframes / series. It's just that I find it kind of bulky (and hard to read) when it comes to adding or replacing whole columns. I used `polars` lately, and I think the methods there are much more concise. So maybe I'm just biased ;-) – FObersteiner Apr 28 '23 at 14:49