
I'm trying to use dask bag to do a word count on 30 GB of JSON files, following the tutorial from the official site: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html

But it still doesn't work. My single machine has 32 GB of memory and an 8-core CPU.

My code is below. Even processing a 10 GB file doesn't work: it runs for a couple of hours without any output and then Jupyter crashes. I tried on both Ubuntu and Windows and got the same problem on both systems. So I wonder: can dask bag process data that doesn't fit in memory, or is my code incorrect?

The test data is from http://files.pushshift.io/reddit/comments/

import dask.bag as db
import json
b = db.read_text('D:\RC_2015-01\RC_2012-04')
records = b.map(json.loads)
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f
SharpLu

1 Answer


Try setting a blocksize in the 10MB range when reading from the single file to break it up a bit.

In [1]: import dask.bag as db

In [2]: b = db.read_text('RC_2012-04', blocksize=10000000)

In [3]: %time b.count().compute()
CPU times: user 1.22 s, sys: 56 ms, total: 1.27 s
Wall time: 20.4 s
Out[3]: 19044534

Also, as a warning, you create a bag records but then don't do anything with it: your frequency computation runs on the raw text bag b rather than on the parsed records. You might want to remove that line, or actually use it.
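Putting the two pieces of advice together, here is a minimal sketch of what a corrected pipeline could look like. It assumes the file has already been decompressed and that each JSON record keeps the comment text in a 'body' field (as the reddit comment dumps appear to); adjust the path and field name to match your data.

import dask.bag as db
import json

# Read the decompressed dump in ~10MB partitions so no single partition
# has to hold the whole file in memory.
b = db.read_text('RC_2012-04', blocksize=10000000)

# Parse each line into a dict, pull out the comment text ('body' is an
# assumption here), split it into words, and flatten into one bag of words.
records = b.map(json.loads)
words = records.map(lambda d: d.get('body', '').split()).concat()

# Count word frequencies and keep the ten most common.
result = words.frequencies().topk(10, key=lambda x: x[1])
print(result.compute())

Note that newer dask releases rename Bag.concat to Bag.flatten, so use whichever your installed version provides.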

MRocklin
  • I tried exactly the code you have given, but it still seems to fail with OverflowError: Python int too large to convert to C long – SharpLu Nov 02 '16 at 14:41
  • Did you unzip the bz2 file? – MRocklin Nov 02 '16 at 14:51
  • Yes, I'm 100% sure I decompressed it, but I still get the same error. I'm not sure whether the problem is the Windows system? I also tried on an Ubuntu virtual machine and hit the same problem. Can you tell me your experiment environment? – SharpLu Nov 02 '16 at 15:24