1

Could someone give me a quick/clear lesson on grabbing specific data points in the multiindex dataframe below? I've been looking at tutorials all day, but none have been very helpful. This should be simple for someone who knows Pandas.

How can I do the following:

  1. Extract the 'close' of 'AAPL' on the last date of the dataframe

  2. If the 'close' is > 'open' of 'AAPL' on a specific date, extract all the data for 'AAPL' and add to a new dataframe

  3. Add a new column for each symbol (AAPL, FB) which is labeled 'range' and is the 'high'-'low' for each day

.

symbol      AAPL                                                FB
ohlcv       open    high    low     close   adj     volume      open    high    low     close   adj     volume
      Date                                              
2018-09-17  222.15  222.95  217.27  217.88  217.88  37195100    161.92  162.06  159.77  160.58  160.58  21005300
2018-09-18  217.79  221.85  217.12  218.24  218.24  31571700    159.39  161.76  158.87  160.30  160.30  22465200
2018-09-19  218.50  219.62  215.30  218.37  218.37  27123800    160.08  163.44  159.48  163.06  163.06  19629000
2018-09-20  220.24  222.28  219.15  220.03  220.03  26460800    164.50  166.45  164.47  166.02  166.02  18824200
2018-09-21  220.78  221.36  217.29  217.66  217.66  96246748    166.64  167.25  162.81  162.93  162.93  25956794

Here is a dictionary of the Dataframe as one of the comments below requested,

df = pd.DataFrame({('AAPL', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.3,
  pd.Timestamp('2018-01-03 00:00:00'): 170.27,
  pd.Timestamp('2018-01-04 00:00:00'): 171.07,
  pd.Timestamp('2018-01-05 00:00:00'): 173.01,
  pd.Timestamp('2018-01-08 00:00:00'): 172.37},
 ('AAPL', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.26,
  pd.Timestamp('2018-01-03 00:00:00'): 172.23,
  pd.Timestamp('2018-01-04 00:00:00'): 173.03,
  pd.Timestamp('2018-01-05 00:00:00'): 175.0,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.3,
  pd.Timestamp('2018-01-03 00:00:00'): 174.55,
  pd.Timestamp('2018-01-04 00:00:00'): 173.47,
  pd.Timestamp('2018-01-05 00:00:00'): 175.37,
  pd.Timestamp('2018-01-08 00:00:00'): 175.61},
 ('AAPL', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 169.26,
  pd.Timestamp('2018-01-03 00:00:00'): 171.96,
  pd.Timestamp('2018-01-04 00:00:00'): 172.08,
  pd.Timestamp('2018-01-05 00:00:00'): 173.05,
  pd.Timestamp('2018-01-08 00:00:00'): 173.93},
 ('AAPL', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.16,
  pd.Timestamp('2018-01-03 00:00:00'): 172.53,
  pd.Timestamp('2018-01-04 00:00:00'): 172.54,
  pd.Timestamp('2018-01-05 00:00:00'): 173.44,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 25555900,
  pd.Timestamp('2018-01-03 00:00:00'): 29517900,
  pd.Timestamp('2018-01-04 00:00:00'): 22434600,
  pd.Timestamp('2018-01-05 00:00:00'): 23660000,
  pd.Timestamp('2018-01-08 00:00:00'): 20567800},
 ('FB', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.58,
  pd.Timestamp('2018-01-03 00:00:00'): 184.78,
  pd.Timestamp('2018-01-04 00:00:00'): 186.21,
  pd.Timestamp('2018-01-05 00:00:00'): 186.9,
  pd.Timestamp('2018-01-08 00:00:00'): 188.9},
 ('FB', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.55,
  pd.Timestamp('2018-01-03 00:00:00'): 181.33,
  pd.Timestamp('2018-01-04 00:00:00'): 184.1,
  pd.Timestamp('2018-01-05 00:00:00'): 184.93,
  pd.Timestamp('2018-01-08 00:00:00'): 186.33},
 ('FB', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.68,
  pd.Timestamp('2018-01-03 00:00:00'): 181.88,
  pd.Timestamp('2018-01-04 00:00:00'): 184.9,
  pd.Timestamp('2018-01-05 00:00:00'): 185.59,
  pd.Timestamp('2018-01-08 00:00:00'): 187.2},
 ('FB', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 18151900,
  pd.Timestamp('2018-01-03 00:00:00'): 16886600,
  pd.Timestamp('2018-01-04 00:00:00'): 13880900,
  pd.Timestamp('2018-01-05 00:00:00'): 13574500,
  pd.Timestamp('2018-01-08 00:00:00'): 17994700}})
rafaelc
  • 57,686
  • 15
  • 58
  • 82
sslack88
  • 1,403
  • 3
  • 10
  • 15

2 Answers2

0

You can access columns from multiindex directly by indexing. Since you haven't posted your dataframe code, you can use these snippets to try if they work:

  1. df[('AAPL', 'close')] will give you the 'close' column of 'AAPL'. You can sort this column by date to extract the last date.

    df.sort_values('Date', ascending=False).head(1)[('AAPL', 'close')]
    
  2. To compare and extract all 'AAPL' data, you can do something like:

    df[df[('AAPL', 'close')] > df[('AAPL', 'open')]]['AAPL']
    

    Add the date as well in the filter condition.

  3. There might be a more optimal way, but this might still work:

    for col in df.columns.levels[0]:
        df[(col, 'range')] = df[(col, 'high')] - df[(col, 'low')]
    

You can add your date conditions just like we do in a normal dataframe.

panktijk
  • 1,574
  • 8
  • 10
  • I am having trouble with the second part of your answer on #1 above. I can't sort on 'Date' using the code above. It doesn't seem to recognize 'Date'. – sslack88 Sep 24 '18 at 12:22
  • So from the code you posted, it seems 'Date' is your index. You can access it like @RafaelC has mentioned. Or if you want to sort and extract then `df.sort_index(ascending=False).head(1)[('AAPL', 'close')]` – panktijk Sep 24 '18 at 16:57
0

IIUC,

  1. Extract the 'close' of 'AAPL' on the last date of the dataframe

Just get the maximum date by doing df.index.max() and choose AAPL/close

df.loc[df.index.max(), ('AAPL', 'close')]
  1. If the 'close' is > 'open' of 'AAPL' on a specific date, extract all the data for 'AAPL' and add to a new dataframe

Basically, if you filter using a mask, a data frame is returned. So no need to "append to other data frame".

mask = df.loc[:, ('AAPL', 'open')] > df.loc[:, ('AAPL', 'close')]
df.loc[mask[mask].index, ('AAPL')]
  1. Add a new column for each symbol (AAPL, FB) which is labeled 'range' and is the 'high'-'low' for each day

You can achieve this by simply selecting the columns (ticker, info) where ticker will be AAPL, FB, ... and info will be high, close, ... and then join.

r = df.loc[:, [('AAPL', 'high'), ('FB', 'high')]].sub(df.loc[:, [('AAPL', 'low'), ('FB', 'low')]].values).rename(columns={"high": "range"})
df = df.join(r).sort_index(1)

Notice that you are working with MultiIndex columns. That makes all operations more complicated to write in code. You might consider changing to a single-index column with a new column called ticker' and values asAAPL, FB etc.

For example, using stack+reset_index, you get

df2 = df.stack(level=0).reset_index().rename(columns={'level_0': 'date', 'level_1': 'ticker'}).sort_values('ticker')

    date    ticker  adj_close   close   high    low     open    range   volume
0   2018-01-02  AAPL    170.30  172.26  172.30  169.26  170.16  3.04    25555900
2   2018-01-03  AAPL    170.27  172.23  174.55  171.96  172.53  2.59    29517900
4   2018-01-04  AAPL    171.07  173.03  173.47  172.08  172.54  1.39    22434600
6   2018-01-05  AAPL    173.01  175.00  175.37  173.05  173.44  2.32    23660000
8   2018-01-08  AAPL    172.37  174.35  175.61  173.93  174.35  1.68    20567800
1   2018-01-02  FB      181.42  181.42  181.58  177.55  177.68  4.03    18151900
3   2018-01-03  FB      184.67  184.67  184.78  181.33  181.88  3.45    16886600
5   2018-01-04  FB      184.33  184.33  186.21  184.10  184.90  2.11    13880900
7   2018-01-05  FB      186.85  186.85  186.90  184.93  185.59  1.97    13574500
9   2018-01-08  FB      188.28  188.28  188.90  186.33  187.20  2.57    17994700

Then, for example, to calculate the range, it is much simpler:

df2['range2'] = df2['high'] - df2['low']
rafaelc
  • 57,686
  • 15
  • 58
  • 82