I have a pandas dataframe with metadata on a bunch of text documents:
import pandas as pd

meta_df = pd.read_csv(
    "./mdenny_copy_early2015/Metadata/Metadata/Bill_Metadata_1993-2014.csv",
    low_memory=False,
    parse_dates=['time'],
    infer_datetime_format=True,
    dayfirst=True,
)
For each row in it, there is a JSON file with the full tokenized text. The filename of each JSON file is the index of the row it corresponds to. I can't load all the JSON files into memory at once, but I was able to put them in a dask dataframe using dask.dataframe.from_delayed. This puts each document into its own partition:
import json
from pathlib import Path

import dask.dataframe
from dask import delayed


def doc_paths():
    p = Path('./mdenny_copy_early2015/bills/POS_Tagged_Bills_1993_2014/')
    return p.glob("*.anno")

paths = list(doc_paths())

def load_doc(path):
    # Each .anno file holds one tokenized document; the numeric part of the
    # filename is the metadata index it corresponds to.
    with open(str(path.resolve())) as f:
        doc = json.load(f)
    id_ = int(path.stem[4:])
    sentences = [s['tokens'] for s in doc['sentences']]
    print(id_)
    # One-row DataFrame, so each document ends up in its own partition.
    return pd.DataFrame({
        'sentences': [sentences]
    }, index=[id_])

dfs = list(map(delayed(load_doc, pure=True), paths))
df = dask.dataframe.from_delayed(dfs)
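As far as I understand from_delayed, each of those one-row frames becomes its own partition, so:

# One partition per delayed one-row DataFrame, i.e. one per document file.
assert df.npartitions == len(paths)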
I can then join them together in order to get the sentences associated with the metadata:
joined_df = df.join(meta_df)
However, if I do something like joined_df[joined_df['author'] == 'Saul'].compute(), it will load all the files into memory. Is there a way I can set it up so it only reads the files it needs to? It seems like this should be possible, since it already has all the metadata in memory and can find the IDs it needs from that and look them up on disk.
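To make that concrete, the behaviour I'm hoping for looks roughly like this manual version (the 'author' == 'Saul' filter is just the example from above, and the wanted_* names are only for this sketch), where the metadata is filtered in memory first and only the matching files are opened:

# Filter the in-memory metadata first to find the document IDs I actually need...
wanted_ids = set(meta_df[meta_df['author'] == 'Saul'].index)

# ...then build delayed loads only for the files whose ID survived the filter.
wanted_paths = [p for p in paths if int(p.stem[4:]) in wanted_ids]
subset = dask.dataframe.from_delayed(
    list(map(delayed(load_doc, pure=True), wanted_paths))
)

# Only the needed files get read when this computes.
result = subset.join(meta_df).compute()

But I'd rather not rebuild the dask dataframe by hand for every filter, so I'm hoping there's a way to get this kind of pruning through the joined dataframe itself.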