
I want to read a 28 GB CSV file and print the contents. However, my code:

import json
import sys
from datetime import datetime
from hashlib import md5

import dask.dataframe as dd
import dask.multiprocessing
import pandas as pd

from kyotocabinet import *


class IndexInKyoto:

    def hash_string(self, string):
        return md5(string.encode('utf-8')).hexdigest()

    def dbproc(self, db):
        db[self.hash_string(self.row)] = self.row

    def index_row(self, row):
        self.row = row
        DB.process(self.dbproc, "index.kch")

start_time = datetime.utcnow()
row_counter = 0
ob = IndexInKyoto()
df = dd.read_csv("/Users/aviralsrivastava/dev/levelsdb-learning/10gb.csv", blocksize=1000000)
df = df.compute(scheduler='processes')     # convert to pandas
df = df.to_dict(orient='records')
for row in df:
    ob.index_row(row)
print("Total time:")
print(datetime.utcnow() - start_time)

is not working. When I run the command htop I can see dask running but there is no output whatsoever. Nor is there any index.kch file created. I ran the same thing without using dask and it was running fine; I was using the Pandas streaming API (chunksize) but it was too slow, hence I want to use dask.

aviral sanjay

1 Answer

df = df.compute(scheduler='processes')     # convert to pandas

Do not do this!

You are loading the pieces in separate processes, and then transferring all the data to be stitched into a single data-frame in the main process. This will only add overhead to your processing, and create copies of the data in memory.
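For reference, the Pandas streaming reader mentioned below looks like this. This is a minimal sketch: it writes a tiny sample CSV so it is runnable as-is; with the real 28 GB file you would pass its path and a much larger chunksize instead.

```python
import pandas as pd

# Create a small sample CSV so the sketch is runnable; with a real 28 GB file
# you would pass its path instead (the filename here is a placeholder).
pd.DataFrame({"name": ["mdurant", "aviral"], "age": [24, 22]}).to_csv(
    "sample.csv", index=False)

# Stream the file in fixed-size chunks instead of loading it all at once;
# only one chunk is in memory at a time.
for chunk in pd.read_csv("sample.csv", chunksize=1):
    for row in chunk.itertuples(index=False):
        print(row)
```

This is the approach the asker found too slow, but it is the memory-safe baseline to compare against.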

If all you want to do is (for some reason) print every row to the console, then you would do perfectly well using the Pandas streaming CSV reader (pd.read_csv(chunksize=..)). You could run it using Dask's chunking and maybe get a speedup if you do the printing in the workers which read the data:

import dask
import dask.dataframe as dd

df = dd.read_csv(..)

# function to apply to each sub-dataframe
@dask.delayed
def print_a_block(d):
    for row in d:
        print(row)

dask.compute(*[print_a_block(d) for d in df.to_delayed()])

Note that iterating over a dataframe (for row in ...) actually gets you the column names; maybe you wanted iterrows, or maybe you actually wanted to process your data somehow.
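To illustrate that note with a small pandas-only sketch (the data here is made up): iterating a DataFrame directly yields its column labels, while iterrows yields (index, row) pairs.

```python
import pandas as pd

df = pd.DataFrame({"name": ["mdurant", "aviral"], "age": [24, 22]})

# Iterating the frame directly yields the column labels...
print(list(df))  # ['name', 'age']

# ...while iterrows yields (index, Series) pairs, one per row.
for idx, row in df.iterrows():
    print(idx, row["name"], row["age"])
```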

mdurant
  • I want to read each row and have the ability to extract a particular column's value given its index. For e.g., given the rows "mdurant, 24, Male" and "aviral, 22, Male", I want to read them and extract the first (0th) column. My aim is not to output the rows to the console but to write them into Kyotocabinet. Does it make sense? – aviral sanjay Jan 11 '19 at 15:07
    Sorry, no, I do not understand. You should change your question to reflect what you actually want. In my answer, you would put any processing you like into the delayed function. – mdurant Jan 11 '19 at 15:10
  • My answer still holds, please try to adapt to your situation. – mdurant Jan 11 '19 at 15:24
  • Sure, but could you answer how I can get the actual row, as I asked in the question before? I want to read the exact row but `itertuples` and `iterrows` give me extra stuff as well. – aviral sanjay Jan 11 '19 at 15:25
  • Now you are asking pandas-specific stuff, you should read their docs. – mdurant Jan 11 '19 at 15:52
  • But yes, the code does work on my sample data. It prints the column names. – mdurant Jan 11 '19 at 15:52
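For what the commenter seems to want (the first column's value of each row), note that inside the delayed function from the answer each partition is an ordinary pandas DataFrame, so positional column access works. A pandas-only sketch with made-up data:

```python
import pandas as pd

# Inside print_a_block, each partition d is an ordinary pandas DataFrame,
# so the usual pandas indexing applies.
d = pd.DataFrame([["mdurant", 24, "Male"], ["aviral", 22, "Male"]],
                 columns=["name", "age", "gender"])

# Extract the first (0th) column's value for every row.
first_col = d.iloc[:, 0].tolist()
print(first_col)  # ['mdurant', 'aviral']
```

Any per-row processing (such as writing into Kyotocabinet) would go where the extraction happens, inside the delayed function, so the work stays in the workers.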