
I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux System Monitor, only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:

import dask.bag as db
import json

js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords, stop=stop).compute()

The basic premise is to extract the text entries from the JSON file and then perform some cleaning operations. This seems like something that can be parallelized, since each piece of text, and the cleaning of each text, is independent of all the others, so each one could be handed off to a separate processor. Is this an incorrect thought? Is there something I should be doing differently?

Thanks.

Billiam

1 Answer


The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.

>>> js.npartitions
1

If so, then you might consider reducing the blocksize to increase the number of partitions:

>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
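
If it helps, here is a rough sketch of how the pipeline from the question might look with an explicit blocksize (the path and the cleantext, tbl, exclusion, stopwords, and stop names are taken from the question; the 1 MB value is only an example to tune for your file):

import dask.bag as db
import json

# A smaller blocksize means more partitions, so more cores can work at once.
# 1e6 bytes (~1 MB) is only an example value; tune it to your file and core count.
js = (db.read_text('path/to/json', blocksize=1e6)
        .map(json.loads)
        .filter(lambda d: d['field'] == 'value'))

print(js.npartitions)  # should now be greater than 1 for a file larger than blocksize

result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords, stop=stop).compute()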
MRocklin
  • Give this man a medal! That looks like it was it: all the processors started firing up when I made the change! I had suspected something like this, since for other Dask data structures people say to check npartitions, but that isn't settable according to the read_text documentation; setting the blocksize makes sense, though. Thank you! – Billiam Oct 08 '17 at 00:24
  • You can also use the `repartition` method, but this isn't as efficient as doing it earlier with `read_text`. – MRocklin Oct 08 '17 at 00:37
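
For reference, a minimal sketch of the repartition approach mentioned in the comment above (the target partition count of 8 is only an example):

import dask.bag as db
import json

# Read as before, then redistribute the data across more partitions.
# This happens after the single-partition read, so it is less efficient
# than passing blocksize to read_text in the first place.
js = db.read_text('path/to/json').map(json.loads)
js = js.repartition(npartitions=8)  # example partition count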