
I am trying to read multiple CSV files, each around 15 GB, using dask's read_csv. While doing this, dask infers a particular column as float, but the column contains a few string values, so a later operation fails with an error saying it cannot convert string to float. I therefore passed the dtype=str argument to read all columns as string. Now I want to convert that particular column to numeric with errors='coerce', so that records containing strings become NaN and the rest are converted to float correctly. Can you please advise how this can be achieved using dask?

I have already tried an astype conversion:

import dask.dataframe as dd
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol > 0').compute()

This fails with:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------------------+--------+----------+
| Column                            | Found  | Expected |
+-----------------------------------+--------+----------+
| mycol                             | object | float64  |
+-----------------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- mycol
  ValueError("could not convert string to float: 'cliqz.com/tracking'")
Reproducible example:

import dask.dataframe as dd
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column appears as float64, but it contains a couple of dirty string values
search_df = df.query('count > 0').compute()  # this line raises the type conversion error
Edit: here is one possible solution, but is this optimal when using dask?

import dask.dataframe as dd
import pandas as pd

to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count > 0').compute()
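For reference, the same coercion could presumably also be applied after loading instead of through converters; an untested sketch using map_partitions, assuming the dirty column is named count:

import dask.dataframe as dd
import pandas as pd

df = dd.read_csv("mydata.csv", encoding='utf8', dtype=str)
# run pandas' to_numeric on each partition; unparseable values become NaN
df["count"] = df["count"].map_partitions(pd.to_numeric, errors="coerce",
                                         meta=("count", "f8"))
search_df = df.query('count > 0').compute()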
Karrtik Iyer

2 Answers


I had a similar problem and solved it using .where.

import numpy as np
import pandas
import dask.dataframe as ddf

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
# replace missing values with a sentinel, then cast the strings to unsigned int
p.where(~p.isna(), 999).astype("u4")

Or perhaps replacing the second line with:

p.where(p.str.isnumeric(), 999).astype("u4")

In my case, my DataFrame (or Series) was the result of other operations, so I couldn't apply this directly in read_csv.
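Applied to the question's setup, it might look something like this (an untested sketch, assuming the column is named count; note that str.isnumeric() is False for decimals like "1.5" and for negative numbers):

import dask.dataframe as dd

df = dd.read_csv("mydata.csv", dtype=str)
count = df["count"]
# keep digit-only values, blank out everything else, then cast to float
df["count"] = count.where(count.str.isnumeric(), None).astype("float64")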

gerrit

As of March 2020, dask.dataframe.to_numeric() has been implemented; it is documented in the Dask DataFrame API reference.

Here's a minimal example:

import pandas as pd
import dask.dataframe as dd

# create a dask dataframe from dummy data, including a number stored as a string
data = {'A': '1', 'B': 2, 'C': 3}
df = pd.DataFrame([data])
ddf = dd.from_pandas(df, npartitions=3)

# inspect dtypes
ddf.dtypes

> A    object
> B    int64
> C    int64
> dtype: object

# apply to_numeric method
ddf.A = dd.to_numeric(ddf.A)

# verify dtypes
ddf.dtypes

> A    int64
> B    int64
> C    int64
> dtype: object
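Since the question specifically asks for coercion, note that dd.to_numeric also accepts errors='coerce' as in pandas, so unparseable values can be turned into NaN the same way. A small sketch:

import pandas as pd
import dask.dataframe as dd

# dummy data with one unparseable value, mirroring the question
df = pd.DataFrame({"count": ["1", "2", "cliqz.com/tracking", "4"]})
ddf = dd.from_pandas(df, npartitions=2)

# errors='coerce' converts unparseable strings to NaN instead of raising
ddf["count"] = dd.to_numeric(ddf["count"], errors="coerce")
print(ddf.compute())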

rrpelgrim