2

I'm using/learning Pandas to load a csv style dataset where I have a time column that can be used as index. The data is sampled roughly at 100Hz. Here is a simplified snippet of the data:

Time (sec) Col_A Col_B Col_C 0.0100 14.175 -29.97 -22.68 0.0200 13.905 -29.835 -22.68 0.0300 12.257 -29.32 -22.67 ... ... 1259.98 -0.405 2.205 3.825 1259.99 -0.495 2.115 3.735

There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.

Here is what I've figured out

import panda as pd
df = pd.DataFrame(my_data)                    # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1')  # the date in origin is just a space-holder

My dataset doesn't include the date. How to avoid setting a fake date like I did above? It feels wrong, and also is quite annoying when I plot the data against time.

I know there are ways to remove date from the datatime object like here. But my goal is to select some rows that are in a certain time range, which means I need to use pd.date_range(). This function does not seem to work without date.

It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.

F.S.
  • 1,175
  • 2
  • 14
  • 34
  • Could you iterate through something like: `range(0, df['Time (sec)'].max(), 200)`? – elPastor Jul 18 '18 at 23:07
  • @pshep123 yes i could... I've actually implemented the said features by manipulations in the original numpy array. Yet after realizing what panda can do, i thought my data is a perfect candidate to use panda because it's nicely structured and has a time index, so I decided to give it a try. I thought it'd be a piece of cake... – F.S. Jul 19 '18 at 01:12

1 Answers1

2

I don't see why you need to use datetime64 objects for this. Your time column is an number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.

Sample Data

import pandas as pd
import numpy as np

np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0,1200,0.01),
                   'Col_A': np.random.randint(1,100,120000),
                   'Col_B': np.random.randint(1,10,120000)})

Select Data between 100 and 200 seconds.

df[df.Time.between(100,200)]

Outputs:

         Time  Col_A  Col_B
10000  100.00     75      9
10001  100.01     23      7
...
19999  199.99     39      7
20000  200.00     25      2

Plotting against time

#First 100 rows just for illustration
df[0:100].plot(x='Time')

enter image description here


Convert to timedelta64

If you really wanted to, you could convert the column to a timedelta64[ns]

df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')

print(df.head())
#             Time  Col_A  Col_B
#0        00:00:00     67      6
#1 00:00:00.010000     93      1
#2 00:00:00.020000     99      3
#3 00:00:00.030000     18      2
#4 00:00:00.040000     84      3

df.dtypes
#Time     timedelta64[ns]
#Col_A              int32
#Col_B              int32
#dtype: object
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • Ah ha! I didn't know this trick `df[df.Time.between(100,200)]`. This is my first time using pandas. Thanks! Your answer is very nicely presented. – F.S. Jul 19 '18 at 14:55