
I put decimate in the title, but I am not sure that is exactly what I mean. Here is the full description of the issue. I have a dataframe that contains data from several subjects. What I want to do is to analyze data that is X number of days apart. The idea is that I only want to consider data that was collected every, say, 4th day from a subject. The catch here is that the data were collected in parallel for the subjects, so I can't just take every 4th day across subjects but rather need to do the decimation/downsampling/whatever for each subject. The two key columns in the dataframe are "subject" and "session_timestamp". In the latter, the date and time are formatted as in this example: 2017-11-10 16:30:47. Is there a good way to accomplish this in python?

Edit: The first commenters asked for a more concrete example of the dataframe with some example data. Here is a toy dataframe that is similar to what I have and should be easy to work with. The code below creates a dataframe with 4 columns: subjectID, date, score1 and score2. Note that a subject can have more than one entry for a given date (basically, these are neural recordings; each row of the dataframe represents one neuron, and we can record more than one neuron per subject).

import pandas as pd
import numpy as np
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["date"] = np.random.randint(20, size=200)#random list of "dates" from 0 to 19
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement

What I want to do is filter for the data (score1 and score2) that was collected at least 4 days apart for each subject. The code could be extremely simple and take the first day a subject has an entry and every 4th day after that. But a better solution would take the first day, then the next day that is more than 3 days later, then the one that is more than 3 days after that, and so on (not every subject has daily samples, so a rigid "every 4th day" rule would not be so elegant). All data collected on the allowed days should be included; for example, all data with day code 0 (if that is the first day for a subject) should be included.
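
To make the selection rule concrete, here is a rough sketch of what I mean, using the toy `ab` above (`pick_days`, `kept_rows` and `filtered` are just placeholder names, and with real timestamps the difference would be a Timedelta rather than an integer); I assume pandas has a cleaner way:

def pick_days(days, gap=4):
    # keep the first day, then each later day that is at least `gap` days after the last kept one
    kept = []
    for d in sorted(days.unique()):
        if not kept or d - kept[-1] >= gap:
            kept.append(d)
    return kept

kept_rows = []
for subj, grp in ab.groupby("subjectID"):
    good_days = pick_days(grp["date"])
    kept_rows.append(grp[grp["date"].isin(good_days)])

filtered = pd.concat(kept_rows)  # all rows (all neurons) from the kept days, per subject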

dc_neuro
    This doesn't sound hard per se, but what to do depends entirely on how you have structured the data. A good Stack Overflow question should include a (small) set of sample data and your attempt at solving this in code so we can tell in more detail where you are stuck. – tripleee Jul 02 '18 at 10:03
  • It would be much clearer if you show some records of your original dataframe and some records as example of what you want to get as result. – Marco Jul 02 '18 at 10:18
  • Did you mean you want a data frame subset that still contains all subjects, but has no more than one sample per 4 days? Or a subset that has only one subject and one sample / 4 days? Either way, this might be an interesting filter to try to code - and will likely require getting a list of dates with the correct spread first and using that as the selector for the final subset. – Leo K Jul 02 '18 at 11:54
  • @Leo K. I edited the post and hopefully it is clear. But maybe I can step through an example. Let's say I have subject 1 and subject 1 has data from dates 1 through 10. I want to take all data from subject 1 on days 1, 5 and 9. Now let's say subject 2 also has data on dates 1 through 10; I want all of his data on dates 1, 5 and 9 as well. Now let's say subject 3 has data on dates 20-30. For him I want day 20, day 24 and day 28. – dc_neuro Jul 02 '18 at 12:15
  • @Leo K. The examples above are for a simple case. The ideal code should deal with a more complex scenario like a subject 4 who has data on days 1, 10 and 20. In his case, I want to keep all of the data because each entry is at least 4 days apart and all data from the same day should be kept. Does that make a bit more sense? – dc_neuro Jul 02 '18 at 12:31

2 Answers


I believe what you may be looking for is a way to sub-sample the population of your training examples. For this you may want to use some of the imbalanced-learning methods, e.g. ADASYN, SMOTE, Tomek links, random under-/over-sampling, etc. (the Wikipedia article on oversampling and undersampling in data analysis gives a decent overview). There is a convenient implementation in the imbalanced-learn package.
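
For example, a minimal random under-sampling sketch with imbalanced-learn could look roughly like this (the feature/label split below is purely illustrative, reusing the toy dataframe from the question, and older package versions name the method `fit_sample` instead of `fit_resample`):

# Illustrative only: under-sample so that every class (here, subjectID) is
# equally represented; this balances classes, it does not enforce day spacing.
from imblearn.under_sampling import RandomUnderSampler

X = ab[["score1", "score2"]]   # illustrative feature columns
y = ab["subjectID"]            # illustrative label column

rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)  # older imbalanced-learn versions: fit_sample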

sophros

First create a dataframe (with random data):

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4
ab["day_number"] = np.random.randint(50, size=200)#random list of "dates" from 0 to 50
ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d)) #to simulate real dates
ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject
ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement

min_day = ab.real_date.min()
ab = ab.groupby(['subjectID', 'real_date']).sum()  # because some subjects have more than 1 score each day

print(ab.head(10))

                      day_number  score1  score2
subjectID real_date                             
0         2018-01-01           0     306     273
          2018-01-04           3      32      60
          2018-01-05           4      61     135
          2018-01-08          21     477     393
          2018-01-09           8      22     341
          2018-01-10           9     137      30
          2018-01-11          30     281     674
          2018-01-14          13     183     396
          2018-01-15          14      41     337
          2018-01-16          15      83      50

Then fill the days that have no data with the data of the next existing day:

df = ab.reset_index(level='subjectID').groupby('subjectID').resample('D').mean()  # complete missing dates with NaN
df = df.drop(columns='subjectID')
df = df.groupby(level='subjectID').fillna(method='bfill')  # fill the NaNs with the next non-NaN value
df = df.apply(pd.to_numeric, downcast='integer')  # just to have ints, easier to read

print(df.head(10))

                      day_number  score1  score2
subjectID real_date                             
0         2018-01-01           0     306     273
          2018-01-02           3      32      60
          2018-01-03           3      32      60
          2018-01-04           3      32      60
          2018-01-05           4      61     135
          2018-01-06          21     477     393
          2018-01-07          21     477     393
          2018-01-08          21     477     393
          2018-01-09           8      22     341
          2018-01-10           9     137      30

Next, resample (group) into periods of 4 days:

res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first()  # group into 4-day periods and keep only the first value
res = res.drop(columns='subjectID')
print(res.head(10))

                      day_number  score1  score2
subjectID real_date                             
0         2018-01-01           0     306     273
          2018-01-05           4      61     135
          2018-01-09           8      22     341
          2018-01-13          13     183     396
          2018-01-17          18      91      46
          2018-01-21          20      76     333
          2018-01-25          48     131     212
          2018-01-29          29      92      81
          2018-02-02          32     172      55
          2018-02-06          72      98     246

Finally, reset the index and take care of the case where there is a gap of more than 4 days with no data:

res = res.reset_index('real_date', drop=True)  # the real_date has no meaning anymore
res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=d))  # rebuild the real_date from the day_number
res = res.drop(columns='day_number')
res = res.set_index('real_date', append=True)
res = res.groupby(level=['subjectID', 'real_date']).first()  # regroup periods with no data for more than 4 days

print(res.head(10))

                      score1  score2
subjectID real_date                 
0         2018-01-01     306     273
          2018-01-05      61     135
          2018-01-09      22     341
          2018-01-14     183     396
          2018-01-19      91      46
          2018-01-21      76     333
          2018-01-30      92      81
          2018-02-02     172      55
          2018-02-10      40     218
          2018-02-15     110     112

It is a little complicated, but I think this is the best way to do it. I have no idea about the efficiency, but it doesn't seem so bad.

Zuma
  • This is a great start and has already given me clues about how to do it. I just wonder about the line `ab = ab.groupby(['subjectID', 'real_date']).sum()`. This will take the sum of the two values on that date, yes? Is there a way around this? In my application, more than one independent sample can be taken on each day and I want to preserve that. This might complicate the resample steps as well... – dc_neuro Jul 04 '18 at 07:53
  • Ok, then instead do `df = ab.groupby(['subjectID', 'real_date']).sum()` so you don't modify your initial dataframe, and build the `res` dataframe that I described from `df`. Once you have done this, you can use the index of `res` to filter `ab`, like so: `ab = ab.set_index(['subjectID', 'real_date'])` followed by `ab = ab[ab.index.isin(res.index)]` (laid out as a block below). – Zuma Jul 04 '18 at 08:27
  • Thanks! It works well for me and I accepted the answer. Tried to upvote as well but I am too new to stack overflow for it to count. I am newish to python and some of the steps are still a little mysterious but it works in practice. Thanks again. – dc_neuro Jul 06 '18 at 08:00
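
Laid out as a block, the adjustment from the last exchange (aggregate into a separate frame, build `res` from it exactly as in the answer, then keep every per-neuron row that falls on a selected day) is roughly:

df = ab.groupby(['subjectID', 'real_date']).sum()  # aggregate a copy; the original ab keeps one row per neuron per day

# ... build `res` from df with the steps shown in the answer ...

ab = ab.set_index(['subjectID', 'real_date'])
ab = ab[ab.index.isin(res.index)]  # all rows on the kept (subjectID, real_date) pairs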