0

I have the fallowing dataFrame:

    Timestamp           real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   17FEB20:24:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051

How to convert the columns to datetime64?

There're 2 things that is making this challengeable form me:

  1. The column Timestamp, index 4 has the value: 17FEB20:24:00:00, which seems not to be a valid date-time (although it was output by a SAS program...).
  2. The column real time don't fallow a pattern and seems it cannot be matched through a date_parser.

This is what I've tried to address the first column (Timestamp):

data['Timestamp'] = pd.to_datetime(
    data['Timestamp'],
    format='%d%b%y:%H:%M:%S')

But due the value of the index 4 (17FEB20:24:00:00) I get: ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match). If I remove this line, it does work, but I have to find a way to address it, as my dataset have of thousands of lines and I cannot simply ignore them. Perhaps there's a way to convert it to zero hours of the next day?

Here's a snippet code to create the dataFrame sample as above to to gain some time working on the answer (if you need):

data = pd.DataFrame({
    'Timestamp':[
        '17FEB20:23:59:50',
        '17FEB20:23:59:55',
        '17FEB20:23:59:57',
        '17FEB20:23:59:57',
        '17FEB20:24:00:00',
        '18FEB20:00:00:00',
        '18FEB20:00:00:02',
        '18FEB20:00:00:02'],
    'real time': [
        '0.003',
        '0.003',
        '0.012',
        '02:54.8',
        '0.03',
        '0',
        '54.211',
        '0.051',
        ]})

Appreciate your help!

Bruno Ambrozio
  • 402
  • 3
  • 18
  • Try this post: https://stackoverflow.com/a/52689127/6067379 Explains how to replace 24:00 with 00:00 appropriately. – merit_2 Mar 20 '20 at 20:21
  • _which seems not to be a valid date-time (although it was output by a SAS program)_ [This Wikipedia article](https://en.wikipedia.org/wiki/24-hour_clock#Midnight_00:00_and_24:00) might help clarify things. – AMC Mar 20 '20 at 21:57
  • Also, do you know anything about the data in `real_time` ? – AMC Mar 20 '20 at 22:11
  • @AMC `real time` is a delta, which represents time consumed by some process. Can be some milliseconds, seconds, minutes, hours, days etc.. – Bruno Ambrozio Mar 21 '20 at 09:13
  • Couldn't you just prepend enough zeros and colons to the smaller time values so they all look like h:m:s.mmmm values? Do you have an where the hours part is greater than 23? Does python time functions handle those? – Tom Mar 21 '20 at 14:56

2 Answers2

1

If your data is not too big, you might want to consider looping through the dataframe. You can do something like this.

for index, row in data.iterrows():
    if row['Timestamp'][8:10] == '24':
        date = (pd.to_datetime(row['Timestamp'][:7]).date() + pd.DateOffset(1)).strftime('%d%b%y').upper()
        data.loc[index, 'Timestamp'] = date + ':00:00:00'

This is the result.

        Timestamp      real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   18FEB20:00:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051
dzakyputra
  • 682
  • 4
  • 16
  • The data is very big, but I could iterate only over the dates that needs to be tread. Thanks for your code! Very handy. Any idea how to address the second column? Thus I can mark your answer as correct. Thanks again! – Bruno Ambrozio Mar 21 '20 at 08:43
  • @BrunoAmbrozio well, it is very difficult if you do not have the standard type of what the column `real time` is (you said it can be miliseconds, seconds, minutes, hours, etc). But from the data you provide, everything looks the same except the 02:54.8, can you tell me what does it mean ? – dzakyputra Mar 21 '20 at 16:09
  • `HH:MM:SS.sss` where `HH` is hours, `MM` is minutes, `SS` seconds, and `sss` milliseconds. – Bruno Ambrozio Mar 21 '20 at 17:36
0

Here's how I addressed it:

  • For the column Timestamp, I've used this reply (Thanks @merit_2 for sharing it in the first comment).
  • For the column real time, I parse using some conditions.

Here's the code:

import os
import pandas as pd
from datetime import timedelta

# Parsing "real time" column:

## Apply mask '.000' to the microseconds
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]

## apply mask over all '00:00:00.000'
placeholders = {
    1: '00:00:00.00',
    2: '00:00:00.0',
    3: '00:00:00.',
    4: '00:00:00',
    5: '00:00:0',
    6: '00:00:',
    7: '00:00',
    8: '00:0',
    9: '00:',
    10:'00',
    11:'0'}

for cond_len in placeholders:
    condition = data['real time'].str.len() == cond_len
    data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']

# Parsing "Timestamp" column:
selrow = data['Timestamp'].str.contains('24:00')
data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)

# Convert to columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')

# check results:
display(data)
display(data.dtypes)

Here's the output:

    Timestamp           real time
0   2020-02-17 23:59:50 1900-01-01 00:00:00.003
1   2020-02-17 23:59:55 1900-01-01 00:00:00.003
2   2020-02-17 23:59:57 1900-01-01 00:00:00.012
3   2020-02-17 23:59:57 1900-01-01 00:02:54.800
4   2020-02-18 00:00:00 1900-01-01 00:00:00.030
5   2020-02-18 00:00:00 1900-01-01 00:00:00.000
6   2020-02-18 00:00:02 1900-01-01 00:00:54.211
7   2020-02-18 00:00:02 1900-01-01 00:00:00.051

Timestamp    datetime64[ns]
real time    datetime64[ns]

Perhaps there's a clever way to do that, but for now it suits.

Bruno Ambrozio
  • 402
  • 3
  • 18
  • Note: better to replace the line `df[timing_col] = pd.to_datetime(df[timing_col], format='%H:%M:%S.%f')` to: `df[timing_col] = pd.to_timedelta(df[timing_col])`, because the column is actually a delta, not a date. It helps to execute calculation functions like `sum` later on. – Bruno Ambrozio Mar 24 '20 at 12:54