
I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed Dask dataframe indexed by title, on a 450 GB, 16-core GCP instance. CirrusSearch dumps come as a single JSON Lines formatted file. The English Wikipedia dump contains 5M records and is 12 GB compressed and 90+ GB expanded. An important detail is that the records are not completely flat.

The simplest way to do this would be:

import json
import dask
from dask import bag as db, dataframe as ddf
from toolz import curried as tz
from toolz.curried import operator as op

blocksize = 2**24
npartitions = 'auto'
parquetopts = dict(engine='fastparquet', object_encoding='json')

lang = 'en'
wiki = 'wiki'
date = 20180625
path = './'

source = f'{path}{lang}{wiki}-{date}-cirrussearch-content.json'

(
 db
 .read_text(source, blocksize=blocksize)
 .map(json.loads)
 .filter(tz.flip(op.contains, 'title'))
 .to_dataframe()
 .set_index('title', npartitions=npartitions)
 .to_parquet(f'{lang}{wiki}-{date}-cirrussearch.pq', **parquetopts)
)

The first problem is that with the default scheduler this utilizes only one core. That problem can be avoided by explicitly using either the distributed or the multiprocessing scheduler.
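For example, a minimal sketch of the distributed variant (the worker and thread counts are just illustrative values for a 16-core machine, not anything taken from the dump or from Dask defaults); the multiprocessing scheduler can likewise be selected through dask.config:

from dask.distributed import Client

# one single-threaded worker process per core, so JSON parsing is not GIL-bound
client = Client(n_workers=16, threads_per_worker=1)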

The bigger problem, with all the schedulers and settings I have tried, is memory usage. It appears that Dask tries to load the entire dataframe into memory when indexing. Even 450 GB of RAM is not enough for this.

  • How can I reduce the memory usage for this task?
  • How can I estimate the minimum memory required without resorting to trial and error?
  • Is there a better approach?
Daniel Mahler

1 Answer


Why is Dask using only one core?

The JSON parsing part of this is probably GIL-bound, so you want to use processes. However, by the time you finally compute something you are working with dataframes, which generally assume that computations release the GIL (as is common in Pandas), so Dask uses the threaded scheduler by default. If you are mostly bound by the GIL-heavy parsing stage, then you probably want the multiprocessing scheduler instead. This should solve your problem:

dask.config.set(scheduler='multiprocessing')
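If you would rather not change the global default, the same choice can be scoped to just this computation with a context manager (this is the standard dask.config API; the body below stands in for the pipeline from the question):

with dask.config.set(scheduler='multiprocessing'):
    # run the bag -> dataframe -> set_index -> to_parquet pipeline here
    ...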

How do I avoid memory use during the set_index phase?

Yes, the set_index computation requires the full dataset; this is a hard problem. If you're using the single-machine scheduler (which you appear to be), it should be using an out-of-core data structure for the sorting process, so I'm surprised that it's running out of memory.
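One thing worth trying explicitly (a sketch, not a guaranteed fix) is forcing the disk-backed shuffle during set_index, so intermediate partitions spill to local storage rather than piling up in RAM. Depending on the Dask version the keyword is spelled shuffle or shuffle_method, and 'disk' is normally already the default on the single-machine schedulers; source here is the dump path from the question:

import json
import dask.bag as db

df = (
    db.read_text(source, blocksize=2**24)
      .map(json.loads)
      .filter(lambda rec: 'title' in rec)
      .to_dataframe()
)

# 'disk' stages shuffle data in local files instead of holding every
# partition in memory; being explicit rules out an accidental in-memory shuffle
df = df.set_index('title', npartitions='auto', shuffle='disk')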

How can I estimate the minimum memory required without resorting to trial and error?

Unfortunately it's difficult to estimate the size of JSON-like data in memory in any language. This is much easier with flat schemas.
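That said, you can get a rough lower bound by sampling: parse a slice of the dump, keep the records that would pass the 'title' filter, measure their pandas footprint, and scale up to the ~5M records. This is only a sketch (source is the dump path from the question), and because memory_usage(deep=True) does not recurse into nested dicts and lists it will undercount the non-flat fields:

import itertools
import json
import pandas as pd

sample_lines = 20_000
with open(source) as f:
    records = [
        rec for rec in map(json.loads, itertools.islice(f, sample_lines))
        if 'title' in rec
    ]

sample_df = pd.DataFrame.from_records(records)
per_record = sample_df.memory_usage(deep=True).sum() / len(records)
print(f"~{per_record * 5_000_000 / 2**30:.1f} GiB estimated for the full dataframe")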

Is there a better approach?

This doesn't solve your core issue, but you might consider staging data in Parquet format before trying to sort everything. Then try doing dd.read_parquet(...).set_index(...).to_parquet(...) in isolation. This might help to isolate some costs.
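A rough sketch of that two-stage approach, with the options borrowed from the question (the intermediate and final paths are hypothetical names):

import json
import dask.bag as db
import dask.dataframe as dd

staged = 'enwiki-cirrussearch-staged.pq'   # hypothetical intermediate path
target = 'enwiki-cirrussearch-indexed.pq'  # hypothetical final path

# stage 1: parse and dump to Parquet, no shuffle involved
(
    db.read_text(source, blocksize=2**24)
      .map(json.loads)
      .filter(lambda rec: 'title' in rec)
      .to_dataframe()
      .to_parquet(staged, engine='fastparquet', object_encoding='json')
)

# stage 2: sort and index from the staged Parquet copy in isolation
(
    dd.read_parquet(staged, engine='fastparquet')
      .set_index('title')
      .to_parquet(target, engine='fastparquet', object_encoding='json')
)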

MRocklin