
I have downloaded one year's worth of S&P 500 stock data using the Python package yfinance as follows (keeping only the opening price for each of the 500 firms):

import pandas as pd
import yfinance as yf
import numpy as np
source=pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
df = pd.DataFrame(source[0])
tickers_symbols=df['Symbol'].values.tolist()
GICS_sectors = df['GICS Sector'].values.tolist()
frames = []
for t, s in zip(tickers_symbols, GICS_sectors):
    tmp = yf.download(t, period='1y', progress=False)
    tmp.reset_index(inplace=True)
    tmp['Ticker'] = t
    tmp['GICS'] = s
    frames.append(tmp)
# DataFrame.append was removed in pandas 2.0; collect the frames and concat once
data = pd.concat(frames, ignore_index=True)
##KEEP ONLY OPENING PRICE##
data=data.drop(["Close", "High", "Low", "Adj Close", "Volume"], axis=1)

Now, I need to split this large dataset into smaller datasets according to each company's GICS sector. To do this, I stored the (GICS, DataFrame) pairs in a dict object (as suggested here), so that I can then call each smaller dataset by simply typing dataset_list['GICS sector'].

dataset_list = dict(tuple(data.groupby('GICS')))
print(dataset_list)
##SPLIT DATASET BY GICS SECTOR AND REMOVE GICS COLUMN##
for sector, dataset in dataset_list.items():
    long_dataset=data.drop(columns='GICS', axis=1)
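For what it's worth, the dict(tuple(groupby)) pattern itself behaves as expected; on toy data (the tickers, dates, and sectors below are made up for illustration), each sector key maps to its own sub-DataFrame:

```python
import pandas as pd

# Toy rows standing in for the real (Date, Open, Ticker, GICS) data.
data = pd.DataFrame({
    "Date": ["2021-09-13", "2021-09-13", "2021-09-14", "2021-09-14"],
    "Open": [10.0, 20.0, 11.0, 21.0],
    "Ticker": ["AAA", "BBB", "AAA", "BBB"],
    "GICS": ["Utilities", "Energy", "Utilities", "Energy"],
})

# groupby yields (key, group) pairs; dict(tuple(...)) turns them into
# a mapping from sector name to that sector's sub-DataFrame.
dataset_list = dict(tuple(data.groupby("GICS")))
print(sorted(dataset_list))                        # ['Energy', 'Utilities']
print(dataset_list["Energy"]["Ticker"].unique())   # ['BBB']
```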

However, I am having trouble with the subsequent steps. Indeed, as I run a loop to transform each dataset from long to wide and save it as a .csv file, it correctly creates 11 files (as many as there should be datasets), but the data in each file is exactly the same.

##CONVERT EACH DATASET FROM LONG TO WIDE##
for sector, dataset in dataset_list.items():
    final_datasets=long_dataset.pivot_table(index="Date", columns="Ticker", values="Open")    
    final_datasets.to_csv(str(sector)+' DataFrame.csv', index=True, sep=',')  

I think there is a problem with the loop I wrote, but I am not sure how to fix it. Each loop above should transform every dataset in the dataset_list object, i.e. I should be able to retrieve a DataFrame via final_datasets['GICS sector'], but only one DataFrame is produced.

Any help is much appreciated.

1 Answer


You're looping over dataset_list.items(), but you call .pivot_table only on long_dataset, which is the same DataFrame on every iteration, so every CSV file ends up with identical data.

Try:

import numpy as np
import pandas as pd
import yfinance as yf

source = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
)
df = pd.DataFrame(source[0])
tickers_symbols = df["Symbol"].values.tolist()
GICS_sectors = df["GICS Sector"].values.tolist()
frames = []
for t, s in zip(tickers_symbols, GICS_sectors):
    print(t, s)
    tmp = yf.download(
        t, period="2d", progress=False
    )  # <-- increase the period here
    tmp.reset_index(inplace=True)
    tmp["Ticker"] = t
    tmp["GICS"] = s
    frames.append(tmp)
# DataFrame.append was removed in pandas 2.0; concat once after the loop
data = pd.concat(frames, ignore_index=True)

##KEEP ONLY OPENING PRICE##
data = data.drop(["Close", "High", "Low", "Adj Close", "Volume"], axis=1)

for sector, dataset in data.groupby("GICS"):
    final_dataset = dataset.pivot_table(
        index="Date", columns="Ticker", values="Open"
    )
    print(final_dataset)
    print()
    final_dataset.to_csv(
        f"{sector.replace(' ', '_')}_DataFrame.csv", index=True, sep=","
    )

Creates many CSV files, for example:

-rw-r--r-- 1 root root  68 sep 13 16:49 Communication_Services_DataFrame.csv
-rw-r--r-- 1 root root  69 sep 13 16:49 Consumer_Discretionary_DataFrame.csv
-rw-r--r-- 1 root root 138 sep 13 16:49 Health_Care_DataFrame.csv
-rw-r--r-- 1 root root  68 sep 13 16:49 Industrials_DataFrame.csv
-rw-r--r-- 1 root root 126 sep 13 16:49 Information_Technology_DataFrame.csv
-rw-r--r-- 1 root root  55 sep 13 16:49 Utilities_DataFrame.csv

...and so on.
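To load one of the generated files back into wide form later, pd.read_csv with index_col="Date" restores the Date index; a round-trip sketch on made-up data (the filename and tickers are just for illustration):

```python
import pandas as pd

# A small wide frame shaped like the pivot_table output above.
wide = pd.DataFrame(
    {"AAA": [10.0, 11.0], "BBB": [20.0, 19.0]},
    index=pd.to_datetime(["2021-09-13", "2021-09-14"]),
)
wide.index.name = "Date"

# Write it the same way the loop does, then read it back.
wide.to_csv("Utilities_DataFrame.csv", index=True, sep=",")
back = pd.read_csv("Utilities_DataFrame.csv", index_col="Date", parse_dates=True)
print(back)
```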
Andrej Kesely
  • That is a very nice solution! However, what if I then have to loop over all the datasets again? If, for instance, I have to compute correlations within the smaller datasets, running the loop returns an error since the variable "GICS" is no longer there. Is there a way around this? – Ernest East Sep 13 '21 at 17:25
  • @ErnestEast All data you have in dataframe `data`. So you can do `data.groupby('GICS').corr()` for example. Or inside the for-loop: `dataset.corr()` – Andrej Kesely Sep 13 '21 at 17:27
  • Yes, but the data I need to compute correlations on is contained in final_dataset. If I run data.groupby('GICS').corr() I get: Communication Services Open 1.0 Consumer Discretionary Open 1.0 Consumer Staples Open 1.0 Energy Open 1.0 Financials Open 1.0 Health Care Open 1.0 Industrials Open 1.0 Information Technology Open 1.0 Materials Open 1.0 Real Estate Open 1.0 Utilities Open 1.0 where the data is still in the long format and therefore meaningless. – Ernest East Sep 13 '21 at 18:03
  • Sorry for the messy notation, I don't know how to insert code into comments – Ernest East Sep 13 '21 at 18:04
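Following up on the comment thread: one way to get meaningful correlations is to pivot each group to wide format first and only then call .corr(), so the correlation is computed between tickers rather than on a single long "Open" column. A minimal sketch on made-up data (one sector, two tickers):

```python
import pandas as pd

# Toy long-format data; real data would come from yf.download as above.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2021-09-13", "2021-09-14", "2021-09-15"] * 2),
    "Open": [10.0, 11.0, 12.0, 20.0, 19.0, 18.0],
    "Ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "GICS": ["Utilities"] * 6,
})

for sector, dataset in data.groupby("GICS"):
    # Pivot to wide format: one column per ticker, one row per date.
    wide = dataset.pivot_table(index="Date", columns="Ticker", values="Open")
    # Correlations between tickers within this sector.
    corr = wide.corr()
    print(sector)
    print(corr)
```

Here AAA rises while BBB falls over the same dates, so the sketch prints a correlation of -1 between the two tickers.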