I'm trying to use dask bag to word-count 30GB of JSON files, following the tutorial from the official site as closely as I can: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html
It still doesn't work. My single machine has 32GB of memory and an 8-core CPU.
My code is below. Even processing a 10GB file doesn't work: the computation runs for a couple of hours without producing any output, and then the Jupyter kernel crashes. I tried it on both Ubuntu and Windows and hit the same problem on both systems. So I'm wondering: can dask bag actually process data that doesn't fit in memory, or is my code incorrect?
The test data comes from http://files.pushshift.io/reddit/comments/
import dask.bag as db
import json

# read the raw JSON-lines text (one Reddit comment per line)
b = db.read_text(r'D:\RC_2015-01\RC_2012-04')
# parse each line into a dict (note: records is not used below)
records = b.map(json.loads)
# split the raw text lines into words and take the 10 most frequent
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f
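
For reference, this is the pipeline I think the tutorial actually intends: parse each line as JSON, count only the words of the comment text, and read the file in smaller chunks so no single partition has to hold the whole thing. The 'body' field name and the blocksize value are just my guesses from the docs and the Reddit dump layout, and I used .flatten(), which I believe is the newer name for .concat():

import dask.bag as db
import json

# read in ~64MB partitions instead of one huge block (blocksize is my guess)
b = db.read_text(r'D:\RC_2015-01\RC_2012-04', blocksize=64 * 1024 * 1024)

# word-count only the comment text, assuming each record has a 'body' key
result = (b.map(json.loads)
           .map(lambda d: d['body'])
           .str.split()
           .flatten()          # turn the bag of word-lists into a bag of words
           .frequencies()
           .topk(10, lambda x: x[1]))

f = result.compute()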
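Separately, since my worry is about running out of memory, I also considered switching from the default threaded scheduler to a local dask.distributed cluster with an explicit memory limit per worker. This is only a sketch of what I had in mind; the worker count and memory limit are guesses for my 8-core / 32GB machine:

from dask.distributed import Client

# local cluster: 4 worker processes, 2 threads each, ~6GB memory budget per worker
client = Client(n_workers=4, threads_per_worker=2, memory_limit='6GB')

f = result.compute()   # now runs on the local cluster instead of the threaded scheduler

Would that actually let dask spill to disk instead of crashing the kernel, or is the problem elsewhere in my code?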