
I've got a dataframe of the form:

         time     value   label
0  2020-01-01 -0.556014    high
1  2020-01-02  0.185451    high
2  2020-01-03 -0.401111  medium
3  2020-01-04  0.436111  medium
4  2020-01-05  0.412933    high
5  2020-01-06  0.636421    high
6  2020-01-07  1.168237    high
7  2020-01-08  1.205073    high
8  2020-01-09  0.798674    high
9  2020-01-10  0.174116    high

And I'd like to populate a list of dataframes where each dataframe is built when the string in the column label changes. So the first dataframe would be:

         time     value   label
0  2020-01-01 -0.556014    high
1  2020-01-02  0.185451    high

The second dataframe would be:

         time     value   label
2  2020-01-03 -0.401111  medium
3  2020-01-04  0.436111  medium

And so on. And the desired list would be [df, df, ...]. If you think that a dict would be a more appropriate container I wouldn't mind that at all.

There's a similar post named split data frame pandas if sequence of column value change, but that only handles changes in numeric values. I've made a few attempts but keep running into indexing problems when comparing a row value for label with the previous value. So any suggestions would be great!

Here's a reproducible snippet:

# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random

# settings
observations = 100
np.random.seed(5)
value = np.random.uniform(low=-1, high=1, size=observations).tolist()
time = pd.date_range('2020', freq='D', periods=observations).strftime('%Y-%m-%d').tolist()

df = pd.DataFrame({'time': time,
                   'value': value})
df['value'] = df['value'].cumsum()

def classify(e):
    if e > 0.75: return 'high'
    if e > 0.25: return 'medium'
    if e >= 0: return 'low'

df['label1'] = [(elem-df['value'].min())/(df['value'].max()-df['value'].min()) for elem in df['value']]
df['label'] = [classify(elem) for elem in df['label1']]
df = df.drop(columns='label1')
df
vestland
  • Suppose you have three rows. The first row is high, the second is low, and the third is high. Is it important to you that the first and third rows go into different dataframes? If the answer is no, you can use a groupby here, and it will be much simpler. – Nick ODell Oct 01 '20 at 22:44
  • @NickODell No, the answer is not no. And yes, that would be much simpler. – vestland Oct 01 '20 at 22:53

1 Answer


I would create a column that increments on each change in label, then group by that column. The separate dataframes can then be collected into a list:

df['group'] = df['label'].ne(df['label'].shift()).cumsum()
dfs = [data for _, data in df.groupby('group')]

dfs will be a list of dataframes like so:

[         time     value label  group
 0  2020-01-01 -0.556014  high      1
 1  2020-01-02  0.185451  high      1,
          time     value   label  group
 2  2020-01-03 -0.401111  medium      2
 3  2020-01-04  0.436111  medium      2,
          time     value label  group
 4  2020-01-05  0.412933  high      3
 5  2020-01-06  0.636421  high      3
 6  2020-01-07  1.168237  high      3
 7  2020-01-08  1.205073  high      3
 8  2020-01-09  0.798674  high      3
 9  2020-01-10  0.174116  high      3]
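Since the question mentions that a dict would also be acceptable, the same grouping can be packed into a dict keyed by the group number instead of a list. A minimal sketch on a reduced version of the sample data (the group keys 1, 2, 3 come from the cumsum and are an assumption of this sketch, not part of the original post):

```python
import pandas as pd

# Reduced version of the sample data from the question
df = pd.DataFrame({
    'time': pd.date_range('2020-01-01', periods=10, freq='D'),
    'label': ['high'] * 2 + ['medium'] * 2 + ['high'] * 6,
})

# Increment a counter each time the label differs from the previous row
df['group'] = df['label'].ne(df['label'].shift()).cumsum()

# One dict entry per consecutive run of identical labels,
# keyed by the group number (1, 2, 3, ...)
dfs = dict(tuple(df.groupby('group')))

print(dfs[1])  # the first run of 'high' rows
```

Lookup by group number (`dfs[2]` for the 'medium' run) is then direct, which may be more convenient than positional indexing into a list.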
B. Bogart