0

I am working with pandas and I wish to sample 2 stocks from each trade date and store as part of the dataset the average "Stock_Change" and the average "Vol_Change" for the given day in question based on the sample taken (in this case, 2 stocks per day). The actual data is much larger spanning years and hundreds of names. My sample will be of 100 names, I just use 2 for the purposes of this question.

Sample data set:

In [3]:

df
​
Out[3]:
        Date Symbol  Stock_Change  Vol_Change
0   1/1/2008      A         -0.05        0.07
1   1/1/2008      B         -0.06        0.17
2   1/1/2008      C         -0.05        0.07
3   1/1/2008      D          0.05        0.13
4   1/1/2008      E         -0.03       -0.10
5   1/2/2008      A          0.03       -0.17
6   1/2/2008      B          0.08        0.34
7   1/2/2008      C          0.03        0.17
8   1/2/2008      D          0.06        0.24
9   1/2/2008      E          0.02        0.16
10  1/3/2008      A          0.02        0.05
11  1/3/2008      B          0.01        0.39
12  1/3/2008      C          0.05       -0.17
13  1/3/2008      D         -0.01        0.37
14  1/3/2008      E         -0.06        0.23
15  1/4/2008      A          0.03        0.31
16  1/4/2008      B         -0.07        0.16
17  1/4/2008      C         -0.06        0.29
18  1/4/2008      D          0.00        0.09
19  1/4/2008      E          0.00       -0.02
20  1/5/2008      A          0.04       -0.04
21  1/5/2008      B         -0.06        0.16
22  1/5/2008      C         -0.08        0.07
23  1/5/2008      D          0.09        0.16
24  1/5/2008      E          0.06        0.18
25  1/6/2008      A          0.00        0.22
26  1/6/2008      B          0.08       -0.13
27  1/6/2008      C          0.07        0.18
28  1/6/2008      D          0.03        0.32
29  1/6/2008      E          0.01        0.29
30  1/7/2008      A         -0.08       -0.10
31  1/7/2008      B         -0.09        0.23
32  1/7/2008      C         -0.09        0.26
33  1/7/2008      D          0.02       -0.01
34  1/7/2008      E         -0.05        0.11
35  1/8/2008      A         -0.02        0.36
36  1/8/2008      B          0.03        0.17
37  1/8/2008      C          0.00       -0.05
38  1/8/2008      D          0.08       -0.13
39  1/8/2008      E          0.07        0.18

One other point, the samples can not contain the same security more than once (sample without replacement). My guess is that this a good R question but I don't know the last thing about R . .

I have no idea of even how to start this question.

thanks in advance for any help.

Edit by OP

I tried this but don't seem to be able to get it to work on a the group-by dataframe (grouped by Symbol and Date):

In [35]:

import numpy as np
import pandas as pd
from random import sample
​
# create random index
​
rindex =  np.array(sample(range(len(df)), 10))
​
# get 10 random rows from df
dfr = df.ix[rindex]
In [36]:

dfr
Out[36]:
        Date Symbol Stock_Change Vol_Change
6   1/2/2008      B           8%        34%
1   1/2/2008      B          -6%        17%
37  1/3/2008      C           0%        -5%
25  1/1/2008      A           0%        22%
3   1/4/2008      D           5%        13%
12  1/3/2008      C           5%       -17%
10  1/1/2008      A           2%         5%
2   1/3/2008      C          -5%         7%
26  1/2/2008      B           8%       -13%
17  1/3/2008      C          -6%        29%

OP Edit #2

As I read the question I realize that I may not have been very clear. What I want to do is sample the data many times (call it X) for each day and in essence end up with X times "# of dates" as my new dataset. This may not look like it makes sense with the data i am showing but my actual data has 500 names and 2 years (2x365 = 730) of dates and I wish to sample 50 random names for each day for a total of 50 x 730 = 36500 data points.

first attempt gave this:
In [10]:

# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, sample_size=3):
    symbols = group.Symbol.values
    symbols_selected = np.random.choice(symbols, size=sample_size, replace=False)
    return group.loc[group.Symbol.isin(symbols_selected)]
​
df.groupby(['Date']).apply(get_subsample).reset_index(drop=True)
​
Out[10]:
        Date Symbol Stock_Change Vol_Change
0   1/1/2008      A          -5%         7%
1   1/1/2008      A           3%       -17%
2   1/1/2008      A           2%         5%
3   1/1/2008      A           3%        31%
4   1/1/2008      A           4%        -4%
5   1/1/2008      A           0%        22%
6   1/1/2008      A          -8%       -10%
7   1/1/2008      A          -2%        36%
8   1/2/2008      B          -6%        17%
9   1/2/2008      B           8%        34%
10  1/2/2008      B           1%        39%
11  1/2/2008      B          -7%        16%
12  1/2/2008      B          -6%        16%
13  1/2/2008      B           8%       -13%
14  1/2/2008      B          -9%        23%
15  1/2/2008      B           3%        17%
16  1/3/2008      C          -5%         7%
17  1/3/2008      C           3%        17%
18  1/3/2008      C           5%       -17%
19  1/3/2008      C          -6%        29%
20  1/3/2008      C          -8%         7%
21  1/3/2008      C           7%        18%
22  1/3/2008      C          -9%        26%
23  1/3/2008      C           0%        -5%
24  1/4/2008      D           5%        13%
25  1/4/2008      D           6%        24%
26  1/4/2008      D          -1%        37%
27  1/4/2008      D           0%         9%
28  1/4/2008      D           9%        16%
29  1/4/2008      D           3%        32%
30  1/4/2008      D           2%        -1%
31  1/4/2008      D           8%       -13%
32  1/5/2008      E          -3%       -10%
33  1/5/2008      E           2%        16%
34  1/5/2008      E          -6%        23%
35  1/5/2008      E           0%        -2%
36  1/5/2008      E           6%        18%
37  1/5/2008      E           1%        29%
38  1/5/2008      E          -5%        11%
39  1/5/2008      E           7%        18%
John
  • 531
  • 1
  • 8
  • 23
  • Have you tried [`sample`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html#pandas.DataFrame.sample)? – EdChum Jul 09 '15 at 08:02

1 Answers1

1
import pandas as pd
import numpy as np

# replicate your data structure
# ==============================
np.random.seed(0)
dates = pd.date_range('2008-01-01', periods=100, freq='B')
symbols = 'A B C D E'.split()
multi_index = pd.MultiIndex.from_product([dates, symbols], names=['Date', 'Symbol'])
stock_change = np.random.randn(500)
vol_change = np.random.randn(500)
df = pd.DataFrame({'Stock_Change': stock_change, 'Vol_Change': vol_change}, index=multi_index).reset_index()



# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, X=100, sample_size=3):
    frame = pd.DataFrame(columns=['sample_{}'.format(x) for x in range(1,X+1)])
    for col in frame.columns.values:
        frame[col] = group.loc[group.Symbol.isin(np.random.choice(symbols, size=sample_size, replace=False)), ['Stock_Change', 'Vol_Change']].mean()
    return frame.mean(axis=1)


result = df.groupby(['Date']).apply(get_subsample)

    Out[169]: 
            Stock_Change  Vol_Change
Date                                
2008-01-01        1.3937      0.2005
2008-01-02        0.0406     -0.7280
2008-01-03        0.6073     -0.2699
2008-01-04        0.2310      0.7415
2008-01-07        0.0718     -0.7269
2008-01-08        0.3808     -0.0584
2008-01-09       -0.5595     -0.2968
2008-01-10        0.3919     -0.2741
2008-01-11       -0.4856      0.0386
2008-01-14       -0.4700     -0.4090
...                  ...         ...
2008-05-06        0.1510      0.1628
2008-05-07       -0.1452      0.2824
2008-05-08       -0.4626      0.2173
2008-05-09       -0.2984      0.6324
2008-05-12       -0.3817      0.7698
2008-05-13        0.5796     -0.4318
2008-05-14        0.2875      0.0067
2008-05-15        0.0269      0.3559
2008-05-16        0.7374      0.1065
2008-05-19       -0.4428     -0.2014

[100 rows x 2 columns]
Jianxun Li
  • 24,004
  • 10
  • 58
  • 76
  • Hi Jianxun, this is close to what I need, what i really need to have is the same thing but instead of having one sample of 3 stocks per day I need to have "X" samples of three stocks per day and have the average of the "X" samples be displayed. In essence, each day will consist of "X" random groups of 3 stocks and the data will be the average change in stock price and the average vol change for the "X" number of portfolios. I don't need to have each sample displayed (too much data to store) just the average of that days "X" portfolios. Thanks again :-) – John Jul 09 '15 at 10:49
  • for some reason I get all of the "A" stock on 1/1/2008 and all of the "B" stock on 1/2/2008 etc . . . there is no selection and no average, just the same stock for one day and a different stock the next (but each day has all the instances of the stock in question . .all put on one day. I put a sample in my OP – John Jul 09 '15 at 11:06
  • @John I will work on your first comment shortly. For the second one, that's because your sample data has all symbol `A` in `1/1/2008`. Look again that sample data at the very beginning of your post. – Jianxun Li Jul 09 '15 at 11:32
  • the only thing i can think of is that the my date is showing up as an object and when I run your code the date shows up as datetime64[ns] – John Jul 09 '15 at 11:45
  • @John no, I mean please give a look at your data at row 0, 5, 10. all these records are 1/1/2008 and for symbol A. – Jianxun Li Jul 09 '15 at 11:47
  • @John I've just updated the code. Let me know whether that's what you want. – Jianxun Li Jul 09 '15 at 11:56
  • Close, just want one number for each day that is the average of the sample for stock change and vol change, no need to all of the individual sample data to be saved (as I have millions of rows) – John Jul 09 '15 at 12:15
  • just missing symbols = group.Symbol.values Once I added that it worked great. Thanks a million!!!! – John Jul 09 '15 at 12:28
  • how would I have the number os stocks selected be random as well (within a range of 5-100) – John Jul 09 '15 at 14:33
  • @John Instead of setting `sample_size=3`, create a variable inside the `for` loop. for example, `sample_size=np.random.randint(5,100)` – Jianxun Li Jul 09 '15 at 14:35