df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

Imagine this dataframe. With pandas it is easy for me to select values of one column based on another column's values, like this:

df.loc[df["B"] == "three", "A"]

With dask, however, the same code on my dataframe does not give me a usable result:

df.loc[df["ActionGeo_Lat"] == "42#.5", "SQLDATE"]

After executing this line I receive the following output, which doesn't really help me:

[screenshot: Output after executing my code]

The problem I'm having is that every time I try to execute df.compute() I receive:

ValueError: could not convert string to float: '42#.5'

After cutting out some columns I found that the error is caused somewhere in the ActionGeo_Lat column. I would now like to manually edit the CSV file to fix the error, but I cannot find out in which row (i.e. on which date) it occurs.

Thanks for the help in advance!

Val

1 Answer


Looks like your underlying problem is with the loading/typing of your data. Here's an example showing that the same pandas syntax works without problems on a dask dataframe:

import pandas as pd
import numpy as np
import dask.dataframe as dd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
ddf = dd.from_pandas(df, npartitions=2)

print(df.loc[df['B'] == "three", "A"])
print(ddf.loc[ddf['B'] == "three", "A"].compute())

dask.dataframe is not a good tool for debugging CSV files, so your best bet is to use shell utilities to locate the offending lines, e.g. (the -n flag prints the matching line numbers):

grep -ain "42#.5" your_file_name_here.csv
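Alternatively, staying in Python, here is a minimal sketch of how you could locate the bad rows with pandas before handing the data to dask. It uses pd.to_numeric with errors="coerce", which turns unparseable values into NaN; a small in-memory DataFrame (with made-up dates) stands in for your real CSV, so adapt the column loading to your file:

```python
import pandas as pd

# Small stand-in for the real CSV; in practice use
# pd.read_csv(..., dtype=str) so nothing is parsed prematurely.
df = pd.DataFrame({
    "SQLDATE": ["20200101", "20200102", "20200103"],
    "ActionGeo_Lat": ["42.5", "42#.5", "41.0"],
})

# Coerce the column to numeric; anything that fails to parse becomes NaN.
lat = pd.to_numeric(df["ActionGeo_Lat"], errors="coerce")

# Rows that were non-empty but failed to parse are the offenders.
bad = df.loc[lat.isna() & df["ActionGeo_Lat"].notna(),
             ["SQLDATE", "ActionGeo_Lat"]]
print(bad)
```

This prints the SQLDATE of every row whose latitude cannot be converted to a float, which tells you exactly where to edit the file.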
SultanOrazbayev