I have a CSV with 17,850,209 rows, which is too large for Pandas to handle on its own, so I am trying to use Dask to operate on it. All of my code "works", but when I write a CSV to disk I don't get all of the 17,850,209 records. Instead I get N CSVs (where N = npartitions), each with just 50,000 records, for a total of 900,000 records in this case.

First, I read in the raw CSV and create the clean dataframe from the first 2 columns plus a row-creation timestamp:

import pandas as pd 
import numpy as np
import time as t 
import dask.dataframe as dd


my_dtypes = {
    'uid': object, 
    'state': object, 
    'var01': np.float64, 
    'var02': np.float64
    }

df_raw = pd.read_csv('/Users/me/input_data/stackoverflow_raw.csv', dtype = my_dtypes, sep=',') 

df_clean = pd.DataFrame(df_raw['uid'].str.strip().str.replace('{','').str.replace('}',''))

df_clean['state'] = pd.DataFrame(df_raw['state'].str.strip())

df_clean['rowcreatetimestamp'] = t.strftime("%Y-%m-%d %H:%M:%S")

Which gives me the following (correct) counts:

df_clean.count()
# uid                   17850209
# state                 17850209
# rowcreatetimestamp    17850209
# dtype: int64
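
For reference, the same read and cleanup can also be expressed lazily in Dask (a sketch only, assuming the same column layout; the counts above and everything below continue from the pandas dataframe):

# Lazy read of the raw CSV -- nothing is loaded until compute()/to_csv()
ddf_raw = dd.read_csv('/Users/me/input_data/stackoverflow_raw.csv', dtype=my_dtypes, sep=',')

# Same cleanup as above, but deferred
ddf_clean = ddf_raw[['uid', 'state']]
ddf_clean['uid'] = ddf_clean['uid'].str.strip().str.replace('{', '').str.replace('}', '')
ddf_clean['state'] = ddf_clean['state'].str.strip()
ddf_clean['rowcreatetimestamp'] = t.strftime("%Y-%m-%d %H:%M:%S")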

I then move it to Dask with a chunksize of 1,000,000 (which most of my team's machines can handle).

df_clean = dd.from_pandas(df_clean, chunksize=1000000) 

df_clean
# dd.DataFrame<from_pa..., npartitions=18, divisions=(0, 1000000, 2000000, ..., 17000000, 17850208)>

df_clean.compute()
# [17850209 rows x 3 columns]

df_clean.count().compute()
# uid                   17850209
# state                 17850209
# rowcreatetimestamp    17850209
# dtype: int64
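
The per-partition row counts can also be sanity-checked at this point (a sketch of the check; map_partitions(len) simply counts the rows in each partition):

df_clean.map_partitions(len).compute()
# should show 17 partitions of 1,000,000 rows plus a final partition of 850,209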

Yet when I do the first Dask operation it only "keeps" 900,000 rows of the dataframe and populates only 50,000 rows of the new column:

df_clean['var01'] = dd.from_array(np.where((df_raw['var01'] > 0), 1, 0))

df_clean.compute()
# [900000 rows x 4 columns]

df_clean.count().compute()
# uid                   900000
# state                 900000
# rowcreatetimestamp    900000
# var01                  50000
# dtype: int64
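
The 50,000 appears to come from the from_array call itself. Isolating that step (a sketch; if I'm reading the defaults right, dd.from_array chunks its input at 50,000 rows, so the new column arrives with a much finer partitioning than df_clean's 18 partitions):

var01_flag = dd.from_array(np.where(df_raw['var01'] > 0, 1, 0))
var01_flag.npartitions
# 358 if the default chunksize of 50,000 applies -- which does not line up with df_clean's divisions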

When I write the Dask dataframe to disk I get 18 CSVs with 50,000 records each. I've tried it both with the compute=True parameter and without it, and get the same results:

df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False, compute=True)

df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False)

When I write to a single file I get 900,000 records plus the header:

df_clean.compute().to_csv('/Users/me/input_data/stackoverflow_clean_one_file.csv', header=True, sep=',', index=False)

(in bash)

wc -l '/Users/me/input_data/stackoverflow_clean_one_file.csv' 
900001

And not only is the 900,000 record count wrong: when I open the CSV, only the first 50,000 rows have data for var01.

I've searched the latest documentation but haven't seen what I am missing in terms of outputting either the chunk files with all of the data or a single file with the correct number of rows.

TIA.

Frank B.

1 Answer

This line is a bit bizarre:

df_clean['var01'] = dd.from_array(np.where((df_raw['var01'] > 0), 1, 0))

You're mixing dask.dataframe, dask.array, and numpy all together. Even if this behavior is supported (which is uncertain), mixing lazy and concrete operations like this is likely to be very slow.

Instead I recommend using dd.Series.where:

df_clean['var01'] = df_raw.var01.where(df_raw.var01 > 0, 1)
df_clean['var01'] = df_raw.var01.where(df_raw.var01 < 0, 0)
MRocklin
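
Note that, as written, the second where assignment replaces the first. If the goal is exactly the 0/1 indicator that np.where(df_raw['var01'] > 0, 1, 0) produces, a single boolean cast does the same thing lazily (a sketch, assuming df_raw is loaded as a Dask dataframe alongside df_clean):

# 1 where var01 > 0, otherwise 0 (NaN compares False, so it also maps to 0)
df_clean['var01'] = (df_raw['var01'] > 0).astype('int64')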