I am working with pandas and I wish to sample 2 stocks from each trade date and store as part of the dataset the average "Stock_Change" and the average "Vol_Change" for the given day in question based on the sample taken (in this case, 2 stocks per day). The actual data is much larger spanning years and hundreds of names. My sample will be of 100 names, I just use 2 for the purposes of this question.
Sample data set:
In [3]:
df
Out[3]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -0.05 0.07
1 1/1/2008 B -0.06 0.17
2 1/1/2008 C -0.05 0.07
3 1/1/2008 D 0.05 0.13
4 1/1/2008 E -0.03 -0.10
5 1/2/2008 A 0.03 -0.17
6 1/2/2008 B 0.08 0.34
7 1/2/2008 C 0.03 0.17
8 1/2/2008 D 0.06 0.24
9 1/2/2008 E 0.02 0.16
10 1/3/2008 A 0.02 0.05
11 1/3/2008 B 0.01 0.39
12 1/3/2008 C 0.05 -0.17
13 1/3/2008 D -0.01 0.37
14 1/3/2008 E -0.06 0.23
15 1/4/2008 A 0.03 0.31
16 1/4/2008 B -0.07 0.16
17 1/4/2008 C -0.06 0.29
18 1/4/2008 D 0.00 0.09
19 1/4/2008 E 0.00 -0.02
20 1/5/2008 A 0.04 -0.04
21 1/5/2008 B -0.06 0.16
22 1/5/2008 C -0.08 0.07
23 1/5/2008 D 0.09 0.16
24 1/5/2008 E 0.06 0.18
25 1/6/2008 A 0.00 0.22
26 1/6/2008 B 0.08 -0.13
27 1/6/2008 C 0.07 0.18
28 1/6/2008 D 0.03 0.32
29 1/6/2008 E 0.01 0.29
30 1/7/2008 A -0.08 -0.10
31 1/7/2008 B -0.09 0.23
32 1/7/2008 C -0.09 0.26
33 1/7/2008 D 0.02 -0.01
34 1/7/2008 E -0.05 0.11
35 1/8/2008 A -0.02 0.36
36 1/8/2008 B 0.03 0.17
37 1/8/2008 C 0.00 -0.05
38 1/8/2008 D 0.08 -0.13
39 1/8/2008 E 0.07 0.18
One other point, the samples can not contain the same security more than once (sample without replacement). My guess is that this a good R question but I don't know the last thing about R . .
I have no idea of even how to start this question.
thanks in advance for any help.
Edit by OP
I tried this but don't seem to be able to get it to work on a the group-by dataframe (grouped by Symbol and Date):
In [35]:
import numpy as np
import pandas as pd
from random import sample
# create random index
rindex = np.array(sample(range(len(df)), 10))
# get 10 random rows from df
dfr = df.ix[rindex]
In [36]:
dfr
Out[36]:
Date Symbol Stock_Change Vol_Change
6 1/2/2008 B 8% 34%
1 1/2/2008 B -6% 17%
37 1/3/2008 C 0% -5%
25 1/1/2008 A 0% 22%
3 1/4/2008 D 5% 13%
12 1/3/2008 C 5% -17%
10 1/1/2008 A 2% 5%
2 1/3/2008 C -5% 7%
26 1/2/2008 B 8% -13%
17 1/3/2008 C -6% 29%
OP Edit #2
As I read the question I realize that I may not have been very clear. What I want to do is sample the data many times (call it X) for each day and in essence end up with X times "# of dates" as my new dataset. This may not look like it makes sense with the data i am showing but my actual data has 500 names and 2 years (2x365 = 730) of dates and I wish to sample 50 random names for each day for a total of 50 x 730 = 36500 data points.
first attempt gave this:
In [10]:
# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, sample_size=3):
symbols = group.Symbol.values
symbols_selected = np.random.choice(symbols, size=sample_size, replace=False)
return group.loc[group.Symbol.isin(symbols_selected)]
df.groupby(['Date']).apply(get_subsample).reset_index(drop=True)
Out[10]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -5% 7%
1 1/1/2008 A 3% -17%
2 1/1/2008 A 2% 5%
3 1/1/2008 A 3% 31%
4 1/1/2008 A 4% -4%
5 1/1/2008 A 0% 22%
6 1/1/2008 A -8% -10%
7 1/1/2008 A -2% 36%
8 1/2/2008 B -6% 17%
9 1/2/2008 B 8% 34%
10 1/2/2008 B 1% 39%
11 1/2/2008 B -7% 16%
12 1/2/2008 B -6% 16%
13 1/2/2008 B 8% -13%
14 1/2/2008 B -9% 23%
15 1/2/2008 B 3% 17%
16 1/3/2008 C -5% 7%
17 1/3/2008 C 3% 17%
18 1/3/2008 C 5% -17%
19 1/3/2008 C -6% 29%
20 1/3/2008 C -8% 7%
21 1/3/2008 C 7% 18%
22 1/3/2008 C -9% 26%
23 1/3/2008 C 0% -5%
24 1/4/2008 D 5% 13%
25 1/4/2008 D 6% 24%
26 1/4/2008 D -1% 37%
27 1/4/2008 D 0% 9%
28 1/4/2008 D 9% 16%
29 1/4/2008 D 3% 32%
30 1/4/2008 D 2% -1%
31 1/4/2008 D 8% -13%
32 1/5/2008 E -3% -10%
33 1/5/2008 E 2% 16%
34 1/5/2008 E -6% 23%
35 1/5/2008 E 0% -2%
36 1/5/2008 E 6% 18%
37 1/5/2008 E 1% 29%
38 1/5/2008 E -5% 11%
39 1/5/2008 E 7% 18%