Finding clusers or coherent values in a dataset with Python/pandas

Question

i am prettys new to Python and trying to do an event analysis. I have two datasets: One with events and one with stockdata. Now i need to construct equally weighted portfolios and 'refresh' the portfolio construction every month. Therfor I need consistent data (i guess). I mean for every date i need stock prices for all stocks which are in this analysis. No I wanted to filter the data in this way that it shows me the biggest 'cluster' where i have data for all stocks for a certain period of time. As alternative show me all stocks and the period of time which have data. I hope you guys understand my explanation.

import pandas_datareader as pdr
import pandas as pd
import numpy as np
from sklearn import linear_model
import scipy.stats as st
 d = {'Date': ['1.02.2019', '2.02.2019', '3.02.2019', '4.02.2019', '5.02.2019'], 
     'a': [3.6, 3.4, 4.1, 4.2, 4.3], 
     'b': ['',2.4, 2.5, 2.6, 2.5], 
     'c': [2.5, 2.4,'',2.5, 2.5], 
     'd': [2.3, 2.4, 2.4, 2.5, '']}

df = pd.DataFrame(data=d)
df.set_index('Date')

In this case it should give me either a,b from 2.5.2019 to 5.02.2019 or a,d from 1.05.2019 to 4.5.2019 or something like this.

Is there a forumla for this in pandas?

Thx in advance

score 0 · Answer 1 · answered May 17 '20 at 15:30

You could do something like this, where the date ranges between the nulls are returned:

d = {'Date': ['1.02.2019', '2.02.2019', '3.02.2019', '4.02.2019', '5.02.2019', '6.02.2019', '7.02.2019', '8.02.2019', '9.02.2019', '10.02.2019'], 
     'a': [3.6, 3.4, 4.1, 4.2, 4.3, 3.6, 3.4, 4.1, 4.2, 4.3], 
     'b': ['',2.4, 2.5, 2.6, 2.5, '',2.4, 2.5, 2.6, 2.5], 
     'c': [2.5, 2.4,'',2.5, 2.5, 2.5, 2.4,'',2.5, 2.5], 
     'd': [2.3, 2.4, 2.4, 2.5, '', 2.3, 2.4, 2.4, 2.5, '']}

df = pd.DataFrame(data=d)
df = df.replace('', np.nan)
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by=['Date'], inplace=True)
df['Date'] = df['Date'].astype(str)

stocks = ['a', 'b', 'c', 'd']
for stock in stocks:
    print(f'Stock {stock}')
    nan_indexes = df[df[stock].isna()].index
    for i, value in enumerate(nan_indexes):
        if value == df.shape[0]-1:
            print(f'end index = {df.iloc[value-1]["Date"]}')
            continue
        if i==0:
            print(f'start index = {df.iloc[value+1]["Date"]}')
        else:
            print(f'end index = {df.iloc[value-1]["Date"]}')
            print(f'start index = {df.iloc[value+1]["Date"]}')
        if i==len(nan_indexes)-1:
            print(f'end index = {df.iloc[df.shape[0]-1]["Date"]}')

Output:

Stock a
Stock b
start index = 2019-02-02
end index = 2019-05-02
start index = 2019-07-02
end index = 2019-10-02
Stock c
start index = 2019-04-02
end index = 2019-07-02
start index = 2019-09-02
end index = 2019-10-02
Stock d
start index = 2019-06-02
end index = 2019-09-02

Finding clusers or coherent values in a dataset with Python/pandas

1 Answers1