How to filter stocks in a large dataset - python

Question

I have a large dataset of over 20,000 stocks from 1964-2018. (It's CRSP data I got from my university). I now want to apply the following filter technique according to Amihud (2002): 1. include all stocks that have a price greater than $5 at end of year t-1 2. include all stocks that have data for at least 200 days at end of year t-1 3. the stocks have information about market capitalization at end of year t-1

I'm quite stuck on this since I've never worked with such a large dataset. Any suggestions where I can find ideas on how to solve this problem? Many thanks.

I already tried to filter on a monthly basis. I created new dataframe that included those stocks whose prices where above $5 in december. Now I got stuck. The graph shows the number of stocks over time before and after applying the first filter. dataframe with filter

#of stocks over time

df['month'] = pd.DatetimeIndex(df.index).month
df2= df[(df.month == 12) & (df.prc >= 5)]

EDIT:

I created a sample dataframe that looks like my original dataframe

import pandas as pd
import numpy as np
df1 = pd.DataFrame( { 'date': ['2010-05-12', '2010-05-13', '2010-05-13', 
               '2011-11-13', '2011-11-14', '2011-03-30', '2011-12-01', 
               '2011-12-02', '2011-12-01', '2011-12-02'],
               "stock" : ["stock_1", "stock_1", "stock_2", "stock_3", 
               "stock_3", "stock_3", 'stock_1', 'stock_1', 'stock_2', 
               'stock_2'] , 
               "price" : [100, 102, 300, 51, 49, 45, 101, 104, 301, 299],
               'volume':[1000, 1020, np.nan, 510, 490, 450, 1010, 1040, 
               np.nan, 2990],
              'return':[0.01, 0.03, 0.02, np.nan, 0.02, -0.04, -0.08, 
               -0.01, np.nan, -0.01] } )

df1 = df1.set_index(pd.DatetimeIndex(df1['date']))
pivot_df = df1.pivot_table(index=[df1.index, 'stock'], values=['price', 
'vol', 'ret'])

The resulting dataframe is basically panel data. I want to to check whether each stock has return and volume data (not NaN) each day. Then I want to remove all stocks that have return and volume data for less than 200 days in a given year. Since the original dataframe contains nearly 20,000 stocks from 1964 - 2018 I want to do this in an efficient way.

How does a single datapoint look like? What is df? I propose to reduce complexity of your problem by scaling down your problem: First analyse the first month (or even Day) of the first chart if it meets your first requirement. Then the second and third. **Then** I would start with more data. — Manuel, May 13 '19 at 19:06
Can you post a small subset of the actual df? It will make it much easier to respond. — Jack Fleeting, May 14 '19 at 16:05
Sorry for not being clear enough, still learning here. I posted an updated version as an answer — Sebastian, May 18 '19 at 19:04

How to filter stocks in a large dataset - python

0 Answers0