
My CSV file contains numeric data, but some values include greater-than or less-than symbols, e.g. ">244". I want the column's data type to be float. When reading the file into pandas:

df = pd.read_csv('file.csv')

I get a warning:

Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.

I have checked this question: Pandas read_csv: low_memory and dtype options, and tried specifying the data type of the relevant column with:

df = pd.read_csv('file.csv', dtype={'column': 'float'})

However, this gives an error:

ValueError: could not convert string to float: '>244'

I have also tried

df = pd.read_csv('file.csv', dtype={'column': 'float'}, error_bad_lines=False)

However, this does not solve the problem; I get the same ValueError as above.

My problem appears to be that my data has a mixture of strings and floats. Can I ignore any rows containing strings in particular columns when reading in the data?

nf2
  • `>244` is not a float, use `df = pd.read_csv('file.csv', dtype={'column': 'str'})` – Ynjxsjmh Aug 01 '22 at 12:33
  • OK. To clarify, I don't want my datatype to be a string, can I ignore the strings when reading in the data? – nf2 Aug 01 '22 at 12:40

2 Answers


I found a workaround: first, read in the data as-is:

df = pd.read_csv('file.csv')

Then remove any rows whose value starts with '<' or '>':

df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']

Then convert to numeric with pd.to_numeric:

df['column'] = pd.to_numeric(df['column'])
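Putting the steps together, a minimal end-to-end sketch (using a small inline DataFrame in place of the CSV, for illustration only):

```python
import pandas as pd

# Illustrative stand-in for the mixed-type CSV column
df = pd.DataFrame({'column': [1.5, '>244', 3.0, '<10', '7']})

# Drop rows whose value starts with '<' or '>'. The .str accessor
# returns NaN for non-string entries, and NaN != '<' evaluates True,
# so already-numeric values are kept.
df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']

# Convert what remains to a numeric dtype
df['column'] = pd.to_numeric(df['column'])
```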
nf2

You can use:

df = pd.read_csv('file.csv', dtype={'column':'str'})

Then:

df['column'] = pd.to_numeric(df['column'], errors='coerce')
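For instance (with a small inline Series standing in for the column), errors='coerce' turns values such as '>244' into NaN, which can then be dropped if desired:

```python
import pandas as pd

# Illustrative data: plain numbers parse normally, '>244' becomes NaN
s = pd.Series(['1.5', '>244', '3'])
out = pd.to_numeric(s, errors='coerce')  # float64 with NaN at index 1
clean = out.dropna()                     # keeps only the valid floats
```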
Corralien