I have a CSV with 17,850,209 rows, which is too large for Pandas to handle across my whole workflow, so I am trying to use Dask to operate on it. All of my code "works", but when I write the result to disk I don't get all 17,850,209 records. Instead I get N CSVs (where N = npartitions), each with just 50,000 records, for a total of, in this case, 900,000 records.
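For reference, this is roughly how I tally the output files (just a sketch for counting; it uses the same '/Users/me/input_data/stackoverflow_clean_*.csv' pattern as the to_csv call further down):
import glob
# count data rows (excluding the header) in each partition CSV that to_csv wrote
total = 0
for path in sorted(glob.glob('/Users/me/input_data/stackoverflow_clean_*.csv')):
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1  # subtract the header line
    print(path, n_rows)                 # each file shows 50,000
    total += n_rows
print(total)                            # 900,000 instead of 17,850,209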
First, I read in the raw CSV and create the clean dataframe from the first two columns plus a timestamp:
import pandas as pd
import numpy as np
import time as t
import dask.dataframe as dd
my_dtypes = {
    'uid': object,
    'state': object,
    'var01': np.float64,
    'var02': np.float64
}
df_raw = pd.read_csv('/Users/me/input_data/stackoverflow_raw.csv', dtype=my_dtypes, sep=',')
# strip whitespace and the surrounding curly braces from uid
df_clean = pd.DataFrame(df_raw['uid'].str.strip().str.replace('{','').str.replace('}',''))
df_clean['state'] = pd.DataFrame(df_raw['state'].str.strip())
df_clean['rowcreatetimestamp'] = t.strftime("%Y-%m-%d %H:%M:%S")
Which gives me the following (correct) counts:
df_clean.count()
# uid 17850209
# state 17850209
# rowcreatetimestamp 17850209
# dtype: int64
I then move it to Dask with a chunksize of 1,000,000 (which most of my team's machines can handle).
df_clean = dd.from_pandas(df_clean, chunksize=1000000)
df_clean
# dd.DataFrame<from_pa..., npartitions=18, divisions=(0, 1000000, 2000000, ..., 17000000, 17850208)>
df_clean.compute()
# [17850209 rows x 3 columns]
df_clean.count().compute()
# uid 17850209
# state 17850209
# rowcreatetimestamp 17850209
# dtype: int64
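As a further sanity check, this is how I look at the per-partition row counts right after the conversion (a sketch; map_partitions(len) returns one length per partition, and at this point I expect 18 values summing to 17,850,209):
# one row count per partition of the Dask dataframe
print(df_clean.map_partitions(len).compute())
print(df_clean.npartitions)  # 18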
Yet when I do the first Dask operation, it only "keeps" 900,000 rows of the dataframe and only fills 50,000 rows of the new column:
df_clean['var01'] = dd.from_array(np.where((df_raw['var01'] > 0), 1, 0))
df_clean.compute()
# [900000 rows x 4 columns]
df_clean.count().compute()
# uid 900000
# state 900000
# rowcreatetimestamp 900000
# var01 50000
# dtype: int64
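For comparison, this is the behavior I'm after; if I add the flag at the pandas stage instead (before the dd.from_pandas call above), the assignment is positional, and I assume it is equivalent to what the dd.from_array line is supposed to produce (sketch of the intent only):
# computed on the pandas frame, before converting to Dask,
# so the new column should cover every one of the 17,850,209 rows
df_clean['var01'] = np.where(df_raw['var01'] > 0, 1, 0)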
When I write the Dask dataframe to disk I get 18 CSVs with 50,000 records each. I've tried both with and without the compute=True parameter and get the same results:
df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False, compute=True)
df_clean.to_csv('/Users/me/input_data/stackoverflow_clean_*.csv', header=True, sep=',', index=False)
When I write to a single file I get 900,000 records plus the header:
df_clean.compute().to_csv('/Users/me/input_data/stackoverflow_clean_one_file.csv', header=True, sep=',', index=False)
(in bash)
wc -l '/Users/me/input_data/stackoverflow_clean_one_file.csv'
900001
And while the 900,000-record total is already wrong, when I open that CSV only the first 50,000 rows have data for var01.
I've searched the latest documentation but haven't seen what I'm missing, whether for outputting the chunked files with all of the data or a single file with the correct number of rows.
TIA.