1

Hi whoever is reading this! I am attempting to use df['ride_duration'] = df['ride_duration'].replace(r'^\s+$',"00:00:00", regex=True) to remove whitespaces/blanks from the screenshot attached below. Thing is, for some reason this is not working even though this is what I found from looking around on Stack Overflow. I Also tried: df['ride_duration'] = df['ride_duration'].replace("","00:00:00) The Result unfortunately does not change the column at all. Could someone please point me in the right direction of why this is not working?

Sample code:

fmtymd = '%Y/%m/%d'
 df = pd.read_csv(csvfilelocation, sep=',')
 df['scheduled departure time'] =  pd.to_datetime(df['scheduled departure time'], format = fmtymd)
 df['epoch_arrival'] = pd.to_datetime(df['epoch_arrival'], format = fmtymd)
 df['latetime'] = (df['epoch_arrival'] - df['scheduled departure time']).where(df['OTP'] == False)
 df['latetime'] = df['latetime'].replace("","00:00:00")
 df['latetime'] = pd.to_timedelta(df['latetime'])
 df['latetime'] = (df['latetime'] / np.timedelta64(1, 'm')).astype(int)
 df.to_csv(csvfilelocation, index=False, float_format='%.0f')

CSV Sample: https://ufile.io/qtkxb

scarecrow
  • 84
  • 13
  • Maybe your code is working but Microsoft Excel replace 00:00:00 with empty values. Excel often decide the context of the values and replace them automatically. I recommend you to open your csv file with another software like Atom, Sublime Text or Visual Studio Code to get the true values of your file. – piratefache Jan 29 '18 at 17:46
  • @MichaelCara Just checked in LibreCalc and it is still a blank – scarecrow Jan 29 '18 at 17:48
  • mmh okay, instead of your very restricted regex, try with this one : df['ride_duration'] = df['ride_duration'].replace(r'\s+',"00:00:00", regex=True) If its not working, try with one of the tools I suggested you. Even a simple nano. Cheers :) – piratefache Jan 29 '18 at 17:51
  • what type is df['ride_duration']? Your replace looks like it would only work on `str` rather than any type derived from `datetime` – Ron Kalian Jan 29 '18 at 17:54
  • @RonKalian its a timedelta. Thing is, this code was working fine for a year and now it stopped working which is interesting. – scarecrow Jan 29 '18 at 17:57
  • @scarecrow this is why the replace() won't work. To display a timedelta column in your preferred format, look here: https://stackoverflow.com/questions/13409682/how-can-i-format-timedelta-for-display (there may be other postings about this). – Ron Kalian Jan 29 '18 at 18:03
  • @RonKalian Im not trying to change how it is displayed. All i care for is to not have blanks/whitespaces. – scarecrow Jan 29 '18 at 18:16
  • @MichaelCara I tried that and no cigar. FYI: ride_duration timedelta64[ns] – scarecrow Jan 29 '18 at 18:17
  • Okay, can you please give us a working code and a sample of your data? Edit your post. I will try to find a solution. – piratefache Jan 29 '18 at 18:23
  • @scarecrow elimination of blanks and whitespaces is a timedelta formatting operation. – Ron Kalian Jan 29 '18 at 18:32
  • @MichaelCara Added. FYI to get the "raw" file just delete: epoch_arrival ride_start ride_duration – scarecrow Jan 29 '18 at 18:40

2 Answers2

2

It may behoove you to read a regexp tutorial.

This regexp is looking for one or more characters of whitespace and that's it.

df['ride_duration'].replace(r'^\s+$',"00:00:00", regex=True)

Since you have now commented that this is a timedelta column we would need to use some string formatting for that. Please see updated solution below.

df['ride_duration'].astype(str).replace(r"\s+","",regex=True)

And then to get that back into a timedelta you can do:

pd.to_timedelta(df['ride_duration'].astype(str).replace(r"\s+","",regex=True))
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
  • Tried this and unfortunately it does not work in my case. FYI its a ride_duration timedelta64[ns] – scarecrow Jan 29 '18 at 18:15
  • Ah that would make a difference thank you for the update. Please see edited answer. – mechanical_meat Jan 29 '18 at 19:07
  • 1
    Yes, very interesting! The pandas API does indeed evolve at a fast clip, and your prior methodology may in fact be one of the casualties of that evolution. Ah, I found a simpler way to do the same. Let me update the answer again. – mechanical_meat Jan 29 '18 at 19:42
  • deleted old comment by accident but what I said was that it is interesting that it works in 2.7 but not in python 3 – scarecrow Jan 29 '18 at 19:50
  • Ah, I tested in 3.6.4; I'm thinking we need to move to 3 anyway so it's unfortunate that it's a bit more code however I do think that it makes some sense that a timedelta wouldn't necessarily support string functions. – mechanical_meat Jan 29 '18 at 20:00
1

I made an example for you. I have the following dataset:

Date, Name
, John
26.05.16, David
27.05.16, Rose
, Yolo
29.05.16, Marie
30.05.16, Mark

And this is my code in 3 steps:

import pandas as pd

# Date format
fmtymd = '%d.%m.%y'

# Load csv file
df = pd.read_csv('./sample-data.csv', sep=',')

print(df)

Result:

       Date    Name
0       NaN    John
1  26.05.16   David
2  27.05.16    Rose
3       NaN    Yolo
4  29.05.16   Marie
5  30.05.16    Mark

Now I want to set the date format to the column Date:

df['Date'] =  pd.to_datetime(df['Date'], format = fmtymd)
print(df)

Result:

        Date    Name
0        NaT    John
1 2016-05-26   David
2 2016-05-27    Rose
3        NaT    Yolo
4 2016-05-29   Marie
5 2016-05-30    Mark

This is an important step. As you can see, the empty strings are converted to "NaT" (Not-a-Timestamp). This is the reason why you can replace "" with "00:00:00"

Now I want to replace NaT with "00:00:00":

df['Date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '00:00:00' for d in df['Date']]

print(df)

Result:

         Date    Name
0    00:00:00    John
1  2016-05-26   David
2  2016-05-27    Rose
3    00:00:00    Yolo
4  2016-05-29   Marie
5  2016-05-30    Mark

This is just an example, I know that the date format are different (For the column Date). You need to set the same dateformat for your new column.

Complete code:

import pandas as pd

fmtymd = '%d.%m.%y'
df = pd.read_csv('./sample-data.csv', sep=',')

print(df)

df['Date'] =  pd.to_datetime(df['Date'], format = fmtymd)
print(df)

df['Date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '00:00:00' for d in df['Date']]

print(df)
piratefache
  • 1,308
  • 11
  • 17