
I have a pandas dataframe with metadata on a bunch of text documents:

import pandas as pd

# Load the per-document metadata; the 'time' column holds day-first dates.
meta_df = pd.read_csv(
    "./mdenny_copy_early2015/Metadata/Metadata/Bill_Metadata_1993-2014.csv",
    low_memory=False,
    parse_dates=['time'],
    infer_datetime_format=True,
    dayfirst=True,
)

For each row, there is a JSON file with the full tokenized text; the filename of the JSON file encodes the index of the row it corresponds to. I can't load all the JSON files into memory at once, but I was able to put them in a dask dataframe using dask.dataframe.from_delayed, which puts each document into its own partition:

import json
from pathlib import Path

import dask.dataframe
from dask import delayed

def doc_paths():
    p = Path('./mdenny_copy_early2015/bills/POS_Tagged_Bills_1993_2014/')
    return p.glob("*.anno")

paths = list(doc_paths())

def load_doc(path):
    # Parse one annotated document into a one-row frame, indexed by the ID
    # encoded in the filename after a four-character prefix.
    with open(str(path.resolve())) as f:
        doc = json.load(f)
    id_ = int(path.stem[4:])
    sentences = [s['tokens'] for s in doc['sentences']]
    return pd.DataFrame({
        'sentences': [sentences]
    }, index=[id_])

# One delayed task per document, so each becomes its own partition.
dfs = list(map(delayed(load_doc, pure=True), paths))
df = dask.dataframe.from_delayed(dfs)

I can then join the two dataframes to get the sentences associated with each row of metadata:

joined_df = df.join(meta_df)

However, if I do something like joined_df[joined_df['author'] == 'Saul'].compute(), it loads all of the files into memory. Is there a way to set this up so that it only reads the files it needs? It seems like that should be possible, since all the metadata is already in memory: dask could find the IDs it needs from the metadata and then look up only those documents on disk.
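To make the access pattern concrete, here is a rough sketch of what I'm effectively after, done by hand with dask.delayed: filter the in-memory metadata first, then build delayed loads only for the matching IDs. (id_to_path is a hypothetical helper, and the 'bill' filename prefix is a guess, since load_doc strips four characters with path.stem[4:].)

import dask

def id_to_path(id_):
    # Hypothetical inverse of the `int(path.stem[4:])` parsing in load_doc.
    base = Path('./mdenny_copy_early2015/bills/POS_Tagged_Bills_1993_2014/')
    return base / 'bill{}.anno'.format(id_)

# Filter on the in-memory metadata first...
wanted_ids = meta_df.index[meta_df['author'] == 'Saul']

# ...then build and execute delayed loads only for the matching documents.
delayed_docs = [delayed(load_doc, pure=True)(id_to_path(i)) for i in wanted_ids]
docs = pd.concat(dask.compute(*delayed_docs))
joined = docs.join(meta_df)

This works for one hard-coded filter, but I'd rather have the dask dataframe do the equivalent pruning itself.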

saul.shanabrook
  • How many partitions? How precisely was the dataframe constructed? Are you joining along an index or another column forcing a shuffle? Have you considered using Dask.bag? Have you considered solving your problem with just Dask.delayed? If so how would you do it? It will be easier to answer your question if you are able to produce an [mcve](http://stackoverflow.com/help/mcve) – MRocklin Nov 08 '16 at 02:21
  • @MRocklin I added some more explanation and my example code. Is that sufficient? – saul.shanabrook Nov 08 '16 at 02:54
  • This doc might help http://dask.pydata.org/en/latest/dataframe-design.html#partitions . You might want to add division information to your dask.dataframe and ensure that you are joining along the index. – MRocklin Nov 08 '16 at 13:19

0 Answers