Python Pandas: Data Downsampling

Question

My data looks like this:

TEST
2012-05-01 00:00:00.203 OFF 0
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00:33.203 ON 1
2012-05-01 00:00:44.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
2012-05-02 00:00:11.203 OFF 0
2012-05-02 00:00:22.203 OFF 0
2012-05-02 00:00:33.203 ON 1
2012-05-02 00:00:44.203 ON 1
2012-05-02 00:00:55.203 OFF 0

Ultimately, I want to be able to downsample data like this to individual days, using, mean, min, max -values, for example. I cannot get it to work for my data and get this error:

TypeError: unhashable type: 'list'

Perhaps it has something to do with the date format in the data frame since an index line looks like this:

[datetime.datetime(2012, 5, 1, 0, 0, 0, 203000)]   OFF  0

Can anyone help. My code so far is this:

import time
import dateutil.parser
from pandas import *
from pandas.core.datetools import *



t0 = time.clock()

filename = "testdata.dat"

index = []
data = []

with open(filename) as f:
    for line in f:
        if not line.startswith('TEST'):
            line_content =  line.split(' ')

            mydatetime =  dateutil.parser.parse(line_content[0] +  " " + line_content[1])

            del line_content[0] # delete the date
            del line_content[0] # delete the time so that only values remain

            index_row = [mydatetime]
            data_row = []
            for item in line_content:
                data_row.append(item)

            index.append(index_row)
            data.append(data_row)


df = DataFrame(data, index = index)
print df.head()
print df.tail()

print
date_from =  index[0] # first datetime entry in data frame
print date_from
date_to =  index[len(index)-1] #last datetime entry in date frame
print date_to

print date_to[0] - date_from[0]
dayly= DateRange(date_from[0], date_to[0], offset=datetools.DateOffset())
print dayly

grouped = df.groupby(dayly.asof)
#print grouped.mean()
#df2 = df.groupby(daily.asof).agg({'2':np_mean})


time2 = time.clock() - t0
print time2

Dear user1412286, please provide error output to get efficient help — Alexis, May 30 '12 at 13:16

0xc0de · Answer 1 · 2012-05-30T14:08:25.270

0

I do not have any experience with pandas but from what I can make out from your code,

df = DataFrame(data, index = index)

and the error, it seems that the index is not supposed to be a mutable object like python lists. Maybe this will work:

df = DataFrame(data, index = tuple(index))

Also it doesn't seem obvious that your index_row & data_row are lists themselves & you are appending them in index & data lists.

edited May 30 '12 at 14:08

answered May 30 '12 at 14:03

0xc0de

8,028
5
49
75

No, that does not work.Originally, all columns were in one list "data" which was then converted to a dataframe. Then, the date and time were displayed correctly. But for the downsampling to work they need to be in the index. – user1412286 May 31 '12 at 05:47

score 0 · Accepted Answer · answered May 30 '12 at 14:41

0

You'd better leave all the date-time interpolation to pandas and just feed it with a clean input stream. Then you can separate fields using read_fwf (for fixed-width formatted lines). For example:

import pandas
import StringIO

buf = StringIO.StringIO()
buf.write(''.join(line
    for line in open('f.txt')
    if not line.startswith('TEST')))
buf.seek(0)

df = pandas.read_fwf(buf, [(0, 24), (24, 27), (27, 30)],
        index_col=0, names=['switch', 'value'])
print df

Output:

                        switch  value
2012-05-01 00:00:00.203    OFF      0
2012-05-01 00:00:11.203    OFF      0
2012-05-01 00:00:22.203     ON      1
2012-05-01 00:00:33.203     ON      1
2012-05-01 00:00:44.203    OFF      0
2012-05-02 00:00:00.203    OFF      0
2012-05-02 00:00:11.203    OFF      0
2012-05-02 00:00:22.203    OFF      0
2012-05-02 00:00:33.203     ON      1
2012-05-02 00:00:44.203     ON      1
2012-05-02 00:00:55.203    OFF      0

answered May 30 '12 at 14:41

lbolla

5,387
1
22
35

The data does not necessarily always come with the same number of columns and I wanted to avoid, if possible, having to manually adjust the code every time a new file is read. – user1412286 May 31 '12 at 05:47
You don't need to. Just use `read_table` or `read_csv` or `read_fwf`, depending on what format you expect. If the files you are receiving have no format, then I hardly see a way to automate the parsing! – lbolla May 31 '12 at 07:31
They do have a format in that the timestamp is always there but the number of data columns can vary. So far I haven't been able to read the timestamp correctly with read_csv, probably because there is a space between date and time and thus no distinction to the other columns. Or let me more specific: I have been able to read the time stamp correctly by creating a list out of each line and then appending it to another list, but I haven't managed to get the timestamps as the index of the dataframe. – user1412286 May 31 '12 at 08:18
Again, `read_table` can be used, for example: `pandas.read_table(buf, sep=' ', index_col=[0,1], header=None)` will create a table with multiple columns and a `multiindex` made of 2 levels: first level the year-month-day, second level the time. If you wish, you can then merge the multiindex in a normal index (for example: `df.index = ['%s %s' % (a, b) for a, b in zip(df.index.get_level_values(0), df.index.get_level_values(1))]`.) – lbolla May 31 '12 at 08:25
Great. If you can approve the answer, then, so it can be closed. – lbolla May 31 '12 at 09:28
One more thing: the grouping with multiindex works. But then I tried your conversion to one index (which also works) and I cannot get the downsampling to work `date_from = df.index[0] # first datetime entry in data frame print date_from date_to = df.index[len(df.index)-1] #last datetime entry in date frame print date_to dayly= pandas.DateRange(date_from, date_to, offset=pandas.datetools.DateOffset()) print dayly grouped = roupby(dayly.asof)` but I get "TypeError_ can't compare datetime.datetime to str – user1412286 May 31 '12 at 09:38
That's because the index is a list of strings, and it should be a list of datetime objects (the snippet to "reconstruct" the index was an example). You can fix it with: `df.index = map(dateutil.parser.parse, df.index)` and then date-ranges are: `pandas.DateRange(df.index[0], df.index[-1])`. – lbolla May 31 '12 at 11:16

Python Pandas: Data Downsampling

2 Answers2