
I am writing a data quality script using pandas that checks certain conditions on each column.

At the moment I need to find the rows that don't have a decimal or a whole number in a particular column. I am able to find the numbers when they are whole numbers, but the methods I have seen so far, i.e. isdigit(), isnumeric(), isdecimal(), etc., fail to correctly identify a decimal number, e.g. 2.5, 0.1245.

Following is some sample code & data:

>>> df = pd.DataFrame([
    [np.nan, 'foo', 0],
    [1, '', 1],
    [-1.387326, np.nan, 2],
    [0.814772, ' baz', ' '],     
    ["a", '      ', 4],
    ["  ",  'foo qux ', '  '],         
], columns='A B C'.split(),dtype=str)

>>> df
           A         B   C
0        NaN       foo   0
1          1             1
2  -1.387326       NaN   2
3   0.814772       baz
4          a             4
5             foo qux

>>> df['A']
0          NaN
1            1
2    -1.387326
3     0.814772
4            a
5             
Name: A, dtype: object

The following methods all fail to identify the decimal numbers:

df['A'].fillna('').str.isdigit()
df['A'].fillna('').str.isnumeric()
df['A'].fillna('').str.isdecimal()

0    False
1     True
2    False
3    False
4    False
5    False
Name: A, dtype: bool
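For reference, the same gap shows up with plain Python strings — these methods test each character individually, and '.' is not a digit/numeric/decimal character, so any decimal string fails:

```python
# Character-based checks: every character must be a digit-like
# character, so the '.' in a decimal makes them all return False.
print("25".isdigit())     # True
print("2.5".isdigit())    # False
print("2.5".isnumeric())  # False
print("2.5".isdecimal())  # False
```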

So when I try the following, I only get one row:

>>> df[df['A'].fillna('').str.isdecimal()]
    A   B   C
1   1       1

NB: I am using dtype=str to get the data without pandas interpreting/changing the dtypes of the values. The actual data could have spaces in column A; I will trim that out using replace(). I have kept the code simple here so as not to confuse things.

stormfield

2 Answers


Use to_numeric with errors='coerce' to convert non-numeric values to NaN, then test with Series.notna:

print (pd.to_numeric(df['A'], errors='coerce').notna())
0    False
1     True
2     True
3     True
4    False
5    False
Name: A, dtype: bool

If you need True returned for missing values as well:

print (pd.to_numeric(df['A'], errors='coerce').notna() | df['A'].isna())
0     True
1     True
2     True
3     True
4    False
5    False
Name: A, dtype: bool
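To get back the rows that fail the check (the original goal), the same mask can be inverted for boolean indexing — a minimal sketch using the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, 'foo', 0],
    [1, '', 1],
    [-1.387326, np.nan, 2],
    [0.814772, ' baz', ' '],
    ["a", '      ', 4],
    ["  ", 'foo qux ', '  '],
], columns='A B C'.split(), dtype=str)

# True where column A parses as a whole number or decimal
mask = pd.to_numeric(df['A'], errors='coerce').notna()

# Invert the mask to keep only the offending rows
bad = df[~mask]
print(bad.index.tolist())  # [0, 4, 5]
```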

Another solution with a custom function:

def test_numeric(x):
    # float() accepts both whole numbers and decimals
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False

print (df['A'].apply(test_numeric))
0     True
1     True
2     True
3     True
4    False
5    False
Name: A, dtype: bool

print (df['A'].fillna('').apply(test_numeric))
0    False
1     True
2     True
3     True
4    False
5    False
Name: A, dtype: bool
jezrael

Alternatively, if you want to keep the string structure, you can test for a literal decimal point:

df['A'].str.contains('.', regex=False, na=False)

0    False
1    False
2     True
3     True
4    False
5    False
Name: A, dtype: bool

Note that without regex=False, the . is treated as a regex wildcard and matches any character. The only risk with this approach is that it also flags words containing a ., which is not what you want (and it misses whole numbers like 1).
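If the dot test is too loose, a stricter string-only check is a full-match regex for an optional sign, digits, and an optional fractional part — a sketch, assuming plain decimal notation (no exponents or thousands separators):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, '1', '-1.387326', '0.814772', 'a', '  '], name='A')

# fullmatch requires the whole string to match the pattern;
# na=False treats missing values as non-matches.
mask = s.str.fullmatch(r'-?\d+(?:\.\d+)?', na=False)
print(mask.tolist())  # [False, True, True, True, False, False]
```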

PV8