2

I am trying to remove rows from a dataframe where the first sequence of letters in the Ref column are equal to the Product column.

For example, for the input:

+---------+---------------+
| Product | Provision Ref |
+---------+---------------+
| DVX     | DVX9251       |
+---------+---------------+
| CDV     | 22CDV95       |
+---------+---------------+
| TV      | TV12369       |
+---------+---------------+
| TV      | 992TV15       |
+---------+---------------+

Desired output:

+---------+---------------+
| Product | Provision Ref |
+---------+---------------+
| CDV     | 22CDV95       |
+---------+---------------+
| TV      | 992TV15       |
+---------+---------------+

I have tried both of the following pieces of code but they are not working

df = df.loc[df['Provision Ref'].str[0:df['Product'].map(len)] != df['Product']]
df = df.loc[df['Provision Ref'].str[0:int(df['Product'].map(len))] != df['Product']]
smci
  • 32,567
  • 20
  • 113
  • 146
Zabman
  • 117
  • 1
  • 12

2 Answers2

3

Try this:

filtered = df[df.groupby('Product', sort=False).apply(lambda g: g['Provision Ref'].str.startswith(g['Product'].iloc[0])).tolist()]

Output:

>>> filtered
  Product Provision Ref
0     DVX       DVX9251
2      TV       TV12369

More readable but less efficient:

filtered = df[df.apply(lambda x: x['Provision Ref'].startswith(x['Product']), axis=1)]

Another method, probably more efficient if the items of Product have few unique lengths (e.g. most are either 2, 3 or 4 chars long, etc.):

filtered = df[df.groupby(df['Product'].str.len(), sort=False).apply(lambda x: x['Provision Ref'].str[:len(x['Product'].iloc[0])] == x['Product']).tolist()]
1

We can use a row-wise .apply(), because df['Provision Ref'].str.startswith(df['Product']) isn't vectorized like that (as @anarchy wrote).

df[~df.apply(lambda row: row['Provision Ref'].startswith(row['Product']), axis=1)]

  Product Provision Ref
1     CDV       22CDV95
3      TV       992TV15
smci
  • 32,567
  • 20
  • 113
  • 146