0

I have a really messy file where there are datetime that I need to read and use it as an index. (I am adding this to clarify how it looks my data). My messy file were the datetime are located look like:

31.01.2016 13:59:13 31.01.2016 13:59:13 31.01.2016 14:39:20 31.01.2016 14:39:20 31.01.2016 15:19:27 31.01.2016 15:19:27 31.01.2016 15:59:34 31.01.2016 15:59:34 31.01.2016 16:39:41 31.01.2016 16:39:41 31.01.2016 17:19:48 31.01.2016 17:19:48 31.01.2016 17:59:55 31.01.2016 17:59:55 31.01.2016 18:40:02 

The code I used for reading the file is the next (It is a really messy file):

df = pd.melt(pd.read_csv(file, delimiter='\t', skiprows=3, index_col=0, decimal=",").rename(columns={"Unnamed: 1":"Depth [m]"}), id_vars=["Depth [m]"], var_name="Date", value_name="Temperature").set_index("Date"))

The result that I have is a DataFrame. This DataFrame has a column which should be converted to DateTime (Date) and afterwards, I will use it as an index. When I am reading the DataFrame and try to convert the dates to a DateTime is not working. In this step, the Date is an "object". My Data looks like this:

                  Date  Depth [m]  Temperature
0  01.01.2016 00:32:09        0.0        26.59
1  01.01.2016 01:12:16        0.0        26.67
2  01.01.2016 01:52:23        0.0        26.45
3  01.01.2016 02:32:30        0.0        26.41
4  01.01.2016 03:12:37        0.0        26.43

When I apply the pd.to_datetime() with "coerce" occurs the next:

df["Date] = pd.to_datetime(df["Date], errors="coerce", format= "%d.%m.%Y %H:%M:%S" )
DatetimeIndex: 10001662 entries, 2016-01-01 00:32:09 to 2015-12-31 23:52:02

This can work but I get a lot of NAT values within

When I apply the pd.to_datetime() with "ignore" and check the info of the DF occurs the next:

df["Date] = pd.to_datetime(df["Date], errors="ignore", format= "%d.%m.%Y %H:%M:%S" )

Stills and object

Index: 10001662 entries, 01.01.2016 00:32:09 to 31.12.2015 23:52:02

When I apply the pd.to_datetime() with "raise" it is not working.

Is there any idea about what I am doing incorrectly?

Geomario
  • 152
  • 1
  • 12
  • Can you provide the code used to read the dataframe? This issue can be solved while reading, by providing the right date parser – vvk24 Jul 14 '20 at 14:35
  • It is a messy file that needs some additional routines. I already updated the entry. – Geomario Jul 14 '20 at 14:41
  • well, inside the `pd.read_csv` provide another attribute `parse_dates = True, infer_datetime_format = True`. This should automatically find the columns with the datetime string and convert to datetime object – vvk24 Jul 14 '20 at 14:45
  • I tried that already and is not working. It reads the data as if we were in the 70s – Geomario Jul 14 '20 at 14:57
  • DatetimeIndex: 10001662 entries, 1970-01-01 00:00:00 to 1970-01-01 00:00:00.010001661 – Geomario Jul 14 '20 at 14:57

1 Answers1

2

Finally I got it. I did not apply the format to the date. I used this :

df["Date"] = pd.to_datetime(df["Date"], errors="coerce", dayfirst=True )

I got what I wanted, even with repeated values inside the data frame. I have to drop them.

df = df.reset_index()
df = df.drop_duplicates(keep="first")

All is ready.

Geomario
  • 152
  • 1
  • 12