3

I notice that when an element of a column from a Pandas DataFrame has numeric substrings, the method isnumeric returns false.

For example:

row 1, column 1 has the following: 0002 0003 1289
row 2, column 1 has the following: 89060 324 123431132
row 3, column 1 has the following: 890GB 32A 34311TT
row 4, column 1 has the following: 82A 34311TT
row 4, column 1 has the following: 82A 34311TT 889 9999C

Clearly, the rows 1 and 2 are all numbers, but isnumeric returns false for rows 1 and 2.

I found a work-around the involves separating each substring into their own columns and then creating a boolean column for each to add the booleans together to reveal whether a row is all numeric or not. This, however, is tedious and my function doesn't look tidy. I also to not want to strip and replace the whitespace (to squeeze all the substrings into just one number) because I need to preserve the original substrings.

Does anyone know of a simpler solution/technique that will correctly tell me that these elements with one or more numeric sub strings is all numeric? My ultimate goal is to delete these numeric-only rows.

jpp
  • 159,742
  • 34
  • 281
  • 339
spacedustpi
  • 351
  • 5
  • 18

2 Answers2

2

I think need list comprehension with split with all for check all numeric strings:

mask = ~df['a'].apply(lambda x: all([s.isnumeric() for s in x.split()]))

mask = [not all([s.isnumeric() for s in x.split()]) for x in df['a']]

If want check if at least one numeric string use any:

mask = ~df['a'].apply(lambda x: any([s.isnumeric() for s in x.split()]))

mask = [not any([s.isnumeric() for s in x.split()]) for x in df['a']]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks, I am avoiding any stripping of whitespaces. I am not quite sure what you are doing there with the first one. Where does the s come from in the s.isnumeric? Because I am getting '''str' has no attribute 'isnumeric'. – spacedustpi Mar 27 '18 at 15:45
  • My code:mask = ~df['ORGNTR_ACCT_ID'].apply(lambda x: all([str.isnumeric() for s in x.split()]))... that is the column with the rows of substrings. – spacedustpi Mar 27 '18 at 15:49
  • @spacedustpi - You working with splitted strings, so need `s.isnumeric()` not `str.isnumeric()` – jezrael Mar 27 '18 at 16:14
  • @spacedustpi explanation - For each value first split it by whitespaces and for each splitted value check if numeric. It return list of boolean, so need `all` for check if all vslues are `True`. – jezrael Mar 27 '18 at 16:17
  • Thanks jezrael, this works in my python 3.6 environment! I actually just started a tutorial on masking at datacamp.com like two days ago, but didn't finish it yet. :) – spacedustpi Mar 28 '18 at 14:13
  • @spacedustpi - Glad can help! Happy coding! :) – jezrael Mar 28 '18 at 14:13
1

Here is one way using pd.Series.map, any with a generator expression, str.isdecimal and str.split.

import pandas as pd

df = pd.DataFrame({'col1': ['0002 0003 1289', '89060 324 123431132', '890GB 32A 34311TT',
                            '82A 34311TT', '82A 34311TT 889 9999C']})

df['numeric'] = df['col1'].map(lambda x: any(i.isdecimal() for i in x.split()))

Note that isdecimal is more strict than isdigit. But you may need to use str.isdigit or str.isnumeric in Python 2.7.

To remove such rows where result is False:

df = df[df['col1'].map(lambda x: any(i.isdecimal() for i in x.split()))]

Result

First part of logic:

                    col1 numeric
0         0002 0003 1289    True
1    89060 324 123431132    True
2      890GB 32A 34311TT   False
3            82A 34311TT   False
4  82A 34311TT 889 9999C    True
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Thanks, which version of python are you using? I get the error message: AttributeError: 'str' object has no attribute 'isdecimal'. – spacedustpi Mar 27 '18 at 15:59
  • @spacedustpi, good point. I have 3.6, so you may have to use `str.isnumeric` or `str.isdigit`. – jpp Mar 27 '18 at 16:01
  • So this code runs in my conda python 3.6 environment, however, why is index=4 coming up true? Clearly it is not numeric. Only substring '889' is numeric. – spacedustpi Mar 28 '18 at 14:06
  • So you need to replace `any` with `all` for this logic. – jpp Mar 28 '18 at 14:07
  • 1
    You beat me to it. That is what just I did (duh) and it worked, thanks! – spacedustpi Mar 28 '18 at 14:14