0

I work with a variety of instruments, and one is particularly troublesome in that the exported data is in XLS or XLSX format with multiple pages, and multiple columns. I only want some pages and some columns, I have achieved reading this into pandas already.

I want to convert time (see below) into a decimal, in hours. This would be from an initial time (in the time stamp data) at the top of the column so timedelta is probably a more correct value, in hours. I am only concerned about this column. How to convert an entire column of data from one format, to another?

date/time (absolute time) timestamped format YYYY-MM-DD TT:MM:SS

I have found quite a few answers but they don't seem to apply to this particular case, mostly focusing on individual cells or manually entered small data sets. My thousands of data files each have as many as 500,000 lines so something more automated is preferred. There is no upper limit to the number of hours.

What might be part of the same question (someone asked me) is this is already in a Pandas dataframe, should it be converted before or after being read in?

This might seem an amateur-ish question, and it is. I've avoided code writing for years, now I have to learn to data-wrangle for my job and it's frustrating so go easy on me.

Going about it the usual way by trying to adapt most of the solutions I found to a column, I get errors

**This is the code which works

import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

from datetime import datetime  # not used

import time  #  not used

import numpy as np  # Not used


loc1 = r"path\file.xls"
pd.read_excel(loc1)   

filename=Path(loc1).stem  
str_1=filename

df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)

***I NEED A CODE TO CONVERT DATESTAMPS TO HOURS (decimal) most likely a form of timedelta***

df.plot(x='Relative Time(h:min:s.ms)',y='Voltage(V)', color='blue')
plt.xlabel("relative time")   #  This is a specific value
plt.ylabel("voltage (V)")  
plt.title(str_1)        # filename is used in each sample as a graph              title
plt.show()  

Image of relevent information (already described above) enter image description here

enter image description here

Zero Code
  • 11
  • 4
  • Can you provide an example of your dataframe? a few lines only. And also what would be the desired output for these lines. – Wilian Dec 10 '21 at 11:46
  • @Wilian After looking at it again, only absolute time will work. Relative time is the time for each step/cycle. Cannot format comments so (these are columns): print(df)Record Index Status 1 Rest ... 2 Rest ... 3 CC_Chg ... 4 CC_Chg ... 5 CC_Chg ... Relative Time(h:min:s.ms) 0:00:00.000 0:00:05.000 0:00:00.000 0:00:00.200 0:00:00.300 Absolute Time 2019-10-21 17:22:06 2019-10-21 17:22:11 2019-10-21 17:22:11 2019-10-21 17:22:12 2019-10-21 17:22:12 – Zero Code Dec 11 '21 at 15:06
  • For this, please edit your post. – Wilian Dec 11 '21 at 19:48
  • Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Pierre D Dec 15 '21 at 13:13
  • @PierreD Having looked at the minimal reproducible example, I have already done that i think. The remaining code is plotting the data that I don't have as it is not useful in current form (time stamp). I have no troubles with the graphs yet, but need to convert the timestamp values to hours. The answer you provided to an entire column? My larger files have upwards of 400,000 rows. I've seen this (and many other) solutions but I don't know how to apply it that way. It most certainly works for individual values. – Zero Code Dec 15 '21 at 15:52
  • That's not a reproducible example. We don't know what your `file.xls` contains. In other words, there is no way to reproduce your issue. Please try to provide a _small example_ of a `DataFrame` and what the corresponding desired output should be. – Pierre D Dec 16 '21 at 15:25
  • Thx for editing the question, but it is still very confusing. First, please don't paste images of code or data -- this is not reproducible. You say: "...(existing) answers don't seem to apply to this particular case, mostly focusing on individual cells or manually entered small data sets". Why is that so? If a solution (like the one provided in my answer below) works on a small Series, why can't you apply it to millions of cells? That solution is vectorized and very fast. What errors are you seeing? Make an example of the problematic input, if any. – Pierre D Dec 17 '21 at 13:56
  • ...and the fact that the plot "looks terrible" is a different question altogether. – Pierre D Dec 17 '21 at 13:58
  • @PierreD ..... maybe I was unclear in my question. My question is not "why", my question is "how?". My question this entire time has been "how?". I have found loads of helpful information, all different. "How" do I enter this code in such a way that it treats data through the entire column? I have no idea. I will leave the images as whatever I say seems unclear no matter "how" I say it. I was asked "What does the exce data look like".... it's either post an image or resort to shadow puppets or something. – Zero Code Dec 17 '21 at 14:00
  • I added an addendum with my best guess of 'how' to adapt the answer to your specific case. – Pierre D Dec 17 '21 at 14:17

1 Answers1

0

You should provide a minimal reproducible example, to help understand what exactly are the issues you are facing.

Setup

Reading between the lines, here is a setup that hopefully exemplifies the kind of data you have:

vals = pd.Series([
    '2019-10-21 17:22:06',  # absolute date
    '2019-10-21 23:22:06.236',  # absolute date, with milliseconds
    '2019-10-21 12:00:00.236145',  # absolute date, with microseconds
    '5:10:10',  # timedelta
    '40:10:10.123',  # timedelta, with milliseconds
    '345:10:10.123456',  # timedelta, with microseconds
])

Solution

Now, we can use two great tools that Pandas offers to quickly convert string series into Timestamps (pd.to_datetime) and Timedelta (pd.to_timedelta), for absolute date-times and durations, respectively.

In both cases, we use errors='coerce' to convert what is convertible, and leave the rest to NaN.

origin = pd.Timestamp('2019-01-01 00:00:00')  # origin for absolute dates

a = pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce') - origin
b = pd.to_timedelta(vals, errors='coerce')
tdelta = a.where(~a.isna(), b)
hours = tdelta.dt.total_seconds() / 3600

With the above:

>>> hours
0    7049.368333
1    7055.368399
2    7044.000066
3       5.169444
4      40.169479
5     345.169479
dtype: float64

Explanation

Let's examine some of the pieces above. a handles absolute date-times. Before subtraction of origin to obtain a Timedelta, it is still a Series of Timestamps:

>>> pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')
0   2019-10-21 17:22:06.000000
1   2019-10-21 23:22:06.236000
2   2019-10-21 12:00:00.236145
3                          NaT
4                          NaT
5                          NaT
dtype: datetime64[ns]

b handles values that are already expressed as durations:

>>> b
0                       NaT
1                       NaT
2                       NaT
3           0 days 05:10:10
4    1 days 16:10:10.123000
5   14 days 09:10:10.123456
dtype: timedelta64[ns]

tdelta is the merge of the non-NaN values of a and b:

>>> tdelta
0          293 days 17:22:06
1   293 days 23:22:06.236000
2   293 days 12:00:00.236145
3            0 days 05:10:10
4     1 days 16:10:10.123000
5    14 days 09:10:10.123456
dtype: timedelta64[ns]

Of course, you can change your origin to be any particular date of reference.

Addendum

After clarifying comments, it seems that the main issue is how to adapt the solution above (or any similar existing example) to their specific problem.

Using the names seen in the images of the edited question, I would suggest:

# (...)
# df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)

# note: if df['Absolute Time'] is still of dtypes str, then do this:
# (adapt format as needed; hard to be sure from the image)
df['Absolute Time'] = pd.to_datetime(
    df['Absolute Time'],
    format='%m.%d.%Y %H:%M:%S.%f',
    errors='coerce')

# origin of time; this may have to be taken over multiple sheets
# if all experiments share an absolute origin
origin = df['Absolute Time'].min()

df['Time in hours'] = (df['Absolute Time'] - origin).dt.total_seconds() / 3600
Pierre D
  • 24,012
  • 7
  • 60
  • 96
  • 1
    It won't let me upvote (I am not an active user here). THANK YOU for the help, this worked! Now, I have other problems that I will work on, but it kind of revolved around converting time. The "terrible" comment is very real, as many of these graphs will be presented to clients or collaborators so it looks bad when the presenter cannot make a simple graph. I have avoided writing code for years, but it has caught up with me. Thank you so much!! (Formerly I did this the slow way in Origin) – Zero Code Dec 28 '21 at 08:22
  • Glad it helped. For matplotlib timedelta axis formatting, there are a number of answers on SO. E.g. [this one](https://stackoverflow.com/a/17834289/758174) or [this one](https://stackoverflow.com/a/49849072/758174). – Pierre D Dec 28 '21 at 13:38