I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed dask dataframe indexed by title, on a 450G-RAM, 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines-formatted file. The English Wikipedia dump contains 5M records and is 12G compressed and 90+G expanded. An important detail is that the records are not completely flat.
The simplest way to do this would be
import json
import dask
from dask import bag as db, dataframe as ddf
from toolz import curried as tz
from toolz.curried import operator as op
blocksize = 2**24                 # 16 MiB of text per bag partition
npartitions = 'auto'
parquetopts = dict(engine='fastparquet', object_encoding='json')

lang = 'en'
wiki = 'wiki'
date = 20180625
path = './'
source = f'{path}{lang}{wiki}-{date}-cirrussearch-content.json'

(
    db
    .read_text(source, blocksize=blocksize)
    .map(json.loads)                              # one JSON record per line
    .filter(tz.flip(op.contains, 'title'))        # keep only records that have a 'title' field
    .to_dataframe()
    .set_index('title', npartitions=npartitions)  # sorts/shuffles everything by title
    .to_parquet(f'{lang}{wiki}-{date}-cirrussearch.pq', **parquetopts)
)
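Before launching the full job I sanity-check what the parsed records look like with something like the following (a quick sketch that reuses the `source` and `blocksize` values defined above):

import json
from dask import bag as db

# Peek at the first two parsed lines to confirm the record format
# and see which lines lack a 'title' field (hence the filter above).
sample = db.read_text(source, blocksize=blocksize).map(json.loads).take(2)
print([sorted(rec.keys()) for rec in sample])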
The first problem is that, with the default scheduler, this uses only one core. That can be avoided by explicitly using either the distributed or the multiprocessing scheduler.
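For example, something along these lines forces multi-core execution (a sketch only; the worker count and per-worker memory_limit are guesses for this 16-core / 450G machine, and the config call depends on the dask version):

from dask.distributed import Client

# Local distributed scheduler: one single-threaded worker per core,
# with a per-worker memory cap so workers spill to disk before exhausting RAM.
client = Client(n_workers=16, threads_per_worker=1, memory_limit='25GB')

# Alternatively, the multiprocessing scheduler without distributed:
# import dask
# dask.config.set(scheduler='processes')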
The bigger problem, with every scheduler and setting I have tried, is memory usage. It appears that dask tries to hold the entire dataframe in memory while building the index (set_index on an unsorted column triggers a full shuffle of the data). Even 450G of RAM is not enough for this.
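To get a feel for the numbers, one rough check (a sketch; it assumes the same bag pipeline as above and that partitions are roughly equal in size) is to compute a single partition and extrapolate:

import json
from dask import bag as db
from toolz import curried as tz
from toolz.curried import operator as op

# Rebuild the (unindexed) dataframe lazily, exactly as above.
df = (
    db.read_text(source, blocksize=blocksize)
    .map(json.loads)
    .filter(tz.flip(op.contains, 'title'))
    .to_dataframe()
)

# Materialize one partition as pandas and measure its deep memory use;
# multiplying by the partition count gives a crude lower bound on what
# a fully in-memory shuffle would have to hold.
part = df.get_partition(0).compute()
per_part = part.memory_usage(deep=True).sum()
print(per_part, per_part * df.npartitions)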
- How can I reduce the memory usage for this task?
- How can I estimate the minimum memory required without resorting to trial and error?
- Is there a better approach?