3

I have a fairly large pandas dataframe (11k rows and 20 columns). One column has a mixed data type, mostly numeric (float) with a handful of strings scattered throughout.

I subset this dataframe by querying other columns before performing some statistical analysis using the data in the mixed column (but can't do this if there's a string present). 99% of the time once subsetted this column is purely numeric, but rarely a string value will end up in the subset, which I need to trap.

What's the most efficient/pythonic way of looping through a Pandas mixed type column to check for strings (or conversely check whether the whole column is full of numeric values or not)?

If there is even a single string present in the column I want to raise an error, otherwise proceed.

jpp
  • 159,742
  • 34
  • 281
  • 339
Altycoder
  • 270
  • 3
  • 15
  • 1
    Can you simply try to cast it to a float? df['col_test'].astype(float) – emmet02 Mar 07 '18 at 11:46
  • yes, I'm already doing that and catching type and value errors I was just looking for an alternative way, but I may already have the solution. I was wondering if pandas had a built in method for example (although I can't find it in the docs) – Altycoder Mar 07 '18 at 11:56
  • plus I should add that casting will tell me there's a problem but not necessarily how many strings or their indexes which would be my next step potentially – Altycoder Mar 07 '18 at 11:57

2 Answers2

3

This is one way. I'm not sure it can be vectorised.

import pandas as pd

df = pd.DataFrame({'A': [1, None, 'hello', True, 'world', 'mystr', 34.11]})

df['stringy'] = [isinstance(x, str) for x in df.A]

#        A stringy
# 0      1   False
# 1   None   False
# 2  hello    True
# 3   True   False
# 4  world    True
# 5  mystr    True
# 6  34.11   False
jpp
  • 159,742
  • 34
  • 281
  • 339
0

Here's a different way. It converts the values of column A to numeric, but does not fail on errors: strings are replaced by NA. The notnull() is there to remove these NA.

df = df[pd.to_numeric(df.A, errors='coerce').notnull()]

However, if there were NAs in the column already, they too will be removed.

See also: Select row from a DataFrame based on the type of the object(i.e. str)

caram
  • 1,494
  • 13
  • 21