How to extract specific data from multiindex dataframe (Yahoo stock data)

Question

Could someone give me a quick/clear lesson on grabbing specific data points in the multiindex dataframe below? I've been looking at tutorials all day, but none have been very helpful. This should be simple for someone who knows Pandas.

How can I do the following:

Extract the 'close' of 'AAPL' on the last date of the dataframe
If the 'close' is > 'open' of 'AAPL' on a specific date, extract all the data for 'AAPL' and add to a new dataframe
Add a new column for each symbol (AAPL, FB) which is labeled 'range' and is the 'high'-'low' for each day

.

symbol      AAPL                                                FB
ohlcv       open    high    low     close   adj     volume      open    high    low     close   adj     volume
      Date                                              
2018-09-17  222.15  222.95  217.27  217.88  217.88  37195100    161.92  162.06  159.77  160.58  160.58  21005300
2018-09-18  217.79  221.85  217.12  218.24  218.24  31571700    159.39  161.76  158.87  160.30  160.30  22465200
2018-09-19  218.50  219.62  215.30  218.37  218.37  27123800    160.08  163.44  159.48  163.06  163.06  19629000
2018-09-20  220.24  222.28  219.15  220.03  220.03  26460800    164.50  166.45  164.47  166.02  166.02  18824200
2018-09-21  220.78  221.36  217.29  217.66  217.66  96246748    166.64  167.25  162.81  162.93  162.93  25956794

Here is a dictionary of the Dataframe as one of the comments below requested,

df = pd.DataFrame({('AAPL', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.3,
  pd.Timestamp('2018-01-03 00:00:00'): 170.27,
  pd.Timestamp('2018-01-04 00:00:00'): 171.07,
  pd.Timestamp('2018-01-05 00:00:00'): 173.01,
  pd.Timestamp('2018-01-08 00:00:00'): 172.37},
 ('AAPL', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.26,
  pd.Timestamp('2018-01-03 00:00:00'): 172.23,
  pd.Timestamp('2018-01-04 00:00:00'): 173.03,
  pd.Timestamp('2018-01-05 00:00:00'): 175.0,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 172.3,
  pd.Timestamp('2018-01-03 00:00:00'): 174.55,
  pd.Timestamp('2018-01-04 00:00:00'): 173.47,
  pd.Timestamp('2018-01-05 00:00:00'): 175.37,
  pd.Timestamp('2018-01-08 00:00:00'): 175.61},
 ('AAPL', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 169.26,
  pd.Timestamp('2018-01-03 00:00:00'): 171.96,
  pd.Timestamp('2018-01-04 00:00:00'): 172.08,
  pd.Timestamp('2018-01-05 00:00:00'): 173.05,
  pd.Timestamp('2018-01-08 00:00:00'): 173.93},
 ('AAPL', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 170.16,
  pd.Timestamp('2018-01-03 00:00:00'): 172.53,
  pd.Timestamp('2018-01-04 00:00:00'): 172.54,
  pd.Timestamp('2018-01-05 00:00:00'): 173.44,
  pd.Timestamp('2018-01-08 00:00:00'): 174.35},
 ('AAPL', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 25555900,
  pd.Timestamp('2018-01-03 00:00:00'): 29517900,
  pd.Timestamp('2018-01-04 00:00:00'): 22434600,
  pd.Timestamp('2018-01-05 00:00:00'): 23660000,
  pd.Timestamp('2018-01-08 00:00:00'): 20567800},
 ('FB', 'adj_close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'close'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.42,
  pd.Timestamp('2018-01-03 00:00:00'): 184.67,
  pd.Timestamp('2018-01-04 00:00:00'): 184.33,
  pd.Timestamp('2018-01-05 00:00:00'): 186.85,
  pd.Timestamp('2018-01-08 00:00:00'): 188.28},
 ('FB', 'high'): {
  pd.Timestamp('2018-01-02 00:00:00'): 181.58,
  pd.Timestamp('2018-01-03 00:00:00'): 184.78,
  pd.Timestamp('2018-01-04 00:00:00'): 186.21,
  pd.Timestamp('2018-01-05 00:00:00'): 186.9,
  pd.Timestamp('2018-01-08 00:00:00'): 188.9},
 ('FB', 'low'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.55,
  pd.Timestamp('2018-01-03 00:00:00'): 181.33,
  pd.Timestamp('2018-01-04 00:00:00'): 184.1,
  pd.Timestamp('2018-01-05 00:00:00'): 184.93,
  pd.Timestamp('2018-01-08 00:00:00'): 186.33},
 ('FB', 'open'): {
  pd.Timestamp('2018-01-02 00:00:00'): 177.68,
  pd.Timestamp('2018-01-03 00:00:00'): 181.88,
  pd.Timestamp('2018-01-04 00:00:00'): 184.9,
  pd.Timestamp('2018-01-05 00:00:00'): 185.59,
  pd.Timestamp('2018-01-08 00:00:00'): 187.2},
 ('FB', 'volume'): {
  pd.Timestamp('2018-01-02 00:00:00'): 18151900,
  pd.Timestamp('2018-01-03 00:00:00'): 16886600,
  pd.Timestamp('2018-01-04 00:00:00'): 13880900,
  pd.Timestamp('2018-01-05 00:00:00'): 13574500,
  pd.Timestamp('2018-01-08 00:00:00'): 17994700}})

Can you post your `df.head(5).to_dict()` to make it easier to reproduce ? — rafaelc, Sep 23 '18 at 21:23
I added the dictionary of the dataframe in the original post. Can you have a look and see if you can help me answer the original three questions? — sslack88, Sep 24 '18 at 12:20

score 0 · Answer 1 · answered Sep 23 '18 at 22:50

You can access columns from multiindex directly by indexing. Since you haven't posted your dataframe code, you can use these snippets to try if they work:

df[('AAPL', 'close')] will give you the 'close' column of 'AAPL'. You can sort this column by date to extract the last date.
```
df.sort_values('Date', ascending=False).head(1)[('AAPL', 'close')]
```
To compare and extract all 'AAPL' data, you can do something like:
```
df[df[('AAPL', 'close')] > df[('AAPL', 'open')]]['AAPL']
```
Add the date as well in the filter condition.

There might be a more optimal way, but this might still work:

for col in df.columns.levels[0]:
    df[(col, 'range')] = df[(col, 'high')] - df[(col, 'low')]

You can add your date conditions just like we do in a normal dataframe.

I am having trouble with the second part of your answer on #1 above. I can't sort on 'Date' using the code above. It doesn't seem to recognize 'Date'. — sslack88, Sep 24 '18 at 12:22
So from the code you posted, it seems 'Date' is your index. You can access it like @RafaelC has mentioned. Or if you want to sort and extract then `df.sort_index(ascending=False).head(1)[('AAPL', 'close')]` — panktijk, Sep 24 '18 at 16:57

rafaelc · Answer 2 · 2018-09-24T13:40:22.130

IIUC,

Extract the 'close' of 'AAPL' on the last date of the dataframe

Just get the maximum date by doing df.index.max() and choose AAPL/close

df.loc[df.index.max(), ('AAPL', 'close')]

If the 'close' is > 'open' of 'AAPL' on a specific date, extract all the data for 'AAPL' and add to a new dataframe

Basically, if you filter using a mask, a data frame is returned. So no need to "append to other data frame".

mask = df.loc[:, ('AAPL', 'open')] > df.loc[:, ('AAPL', 'close')]
df.loc[mask[mask].index, ('AAPL')]

Add a new column for each symbol (AAPL, FB) which is labeled 'range' and is the 'high'-'low' for each day

You can achieve this by simply selecting the columns (ticker, info) where ticker will be AAPL, FB, ... and info will be high, close, ... and then join.

r = df.loc[:, [('AAPL', 'high'), ('FB', 'high')]].sub(df.loc[:, [('AAPL', 'low'), ('FB', 'low')]].values).rename(columns={"high": "range"})
df = df.join(r).sort_index(1)

Notice that you are working with MultiIndex columns. That makes all operations more complicated to write in code. You might consider changing to a single-index column with a new column called ticker' and values asAAPL, FB etc.

For example, using stack+reset_index, you get

df2 = df.stack(level=0).reset_index().rename(columns={'level_0': 'date', 'level_1': 'ticker'}).sort_values('ticker')

    date    ticker  adj_close   close   high    low     open    range   volume
0   2018-01-02  AAPL    170.30  172.26  172.30  169.26  170.16  3.04    25555900
2   2018-01-03  AAPL    170.27  172.23  174.55  171.96  172.53  2.59    29517900
4   2018-01-04  AAPL    171.07  173.03  173.47  172.08  172.54  1.39    22434600
6   2018-01-05  AAPL    173.01  175.00  175.37  173.05  173.44  2.32    23660000
8   2018-01-08  AAPL    172.37  174.35  175.61  173.93  174.35  1.68    20567800
1   2018-01-02  FB      181.42  181.42  181.58  177.55  177.68  4.03    18151900
3   2018-01-03  FB      184.67  184.67  184.78  181.33  181.88  3.45    16886600
5   2018-01-04  FB      184.33  184.33  186.21  184.10  184.90  2.11    13880900
7   2018-01-05  FB      186.85  186.85  186.90  184.93  185.59  1.97    13574500
9   2018-01-08  FB      188.28  188.28  188.90  186.33  187.20  2.57    17994700

Then, for example, to calculate the range, it is much simpler:

df2['range2'] = df2['high'] - df2['low']

How to extract specific data from multiindex dataframe (Yahoo stock data)

2 Answers2