
I have a DataFrame which is quite big; a snippet is shown below.


SrNo | Merchant   | Revenue | Currency
1    | UBER SR    | 123     | INR
2    | UBER (SR)  | 123     | INR
3    | SR UBER    | 123     | INR
4    | ZOMATO SR  | 123     | INR
5    | ZOMATOSR   | 123     | INR
6    | 12FLIPAKRT | 123     | INR
7    | FLIPKART   | 123     | INR

My Output should look like:

SrNo | Merchant   | Revenue | Currency | Merchant_Flag
1    | UBER SR    | 123     | INR      | UBER
2    | UBER (SR)  | 123     | INR      | UBER
3    | SR UBER    | 123     | INR      | UBER
4    | ZOMATO SR  | 123     | INR      | ZOMATO
5    | ZOMATOSR   | 123     | INR      | ZOMATO
6    | 12FLIPAKRT | 123     | INR      | FLIPKART
7    | FLIPKART   | 123     | INR      | FLIPKART

Explanation: I want to add an additional column whose value depends on the Merchant column, i.e. if the Merchant value contains UBER, Merchant_Flag should be UBER, and likewise for ZOMATO and FLIPKART.

My dataset is huge. I tried using re.search and then .replace with if/else conditions, but it is giving me performance issues. Another approach I tried was .loc, along the lines of df.loc[df['columnname'].str.contains('')], but I am not sure how to proceed. Can someone help with this?

  • what are you using, `pyspark` or `pandas`, and how do you suppose spelling errors should be handled? e.g. `12FLIPAKRT` won't be picked up. – Umar.H Jun 28 '21 at 13:35

2 Answers


If you know what merchants could exist, then you could use something like this:

merchants = ['UBER', 'ZOMATO', 'FLIPKART']

# Flag every row whose Merchant string contains one of the known names
for merchant in merchants:
    df.loc[df["Merchant"].str.contains(merchant), "Merchant_Flag"] = merchant
    @AnukritiSingh Glad it did. If you could, please pick my answer as correct. It would mean a lot to me. :) – A.M. Ducu Jun 28 '21 at 15:45

I would use the pandas.DataFrame.applymap method. As a rule of thumb, it is better in terms of performance to use built-in methods for iterative operations over an object than to write a plain for/while loop in Python. This way you also avoid the overhead of calling the pandas.DataFrame.loc accessor multiple times.

import re        # regex matching
import pandas as pd

df = pd.DataFrame({"Merchant": ["UBER SR", "SR UBER", "ZOMATO SR", "12FLIPKART"]})

# Use applymap to apply the regex search to every cell of the dataframe
df["Merchant_Flag"] = df.applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))

(Note: .group(0) returns the entire text matched by re.search.)

Output:

     Merchant Merchant_Flag
0     UBER SR          UBER
1     SR UBER          UBER
2   ZOMATO SR        ZOMATO
3  12FLIPKART      FLIPKART
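
Note that the example frame above only has the Merchant column. If your real DataFrame has other columns as well, applymap over the whole frame would also hit the non-string cells, so select the Merchant column first (as also pointed out in the comments below), e.g.:

df["Merchant_Flag"] = df[["Merchant"]].applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))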

EDIT

In case you don't know all the merchants' names, you will need to modify the function passed to applymap a little:

import re        # regex matching
import pandas as pd

# Function definition
def match_merchant(elem, reg):
    # Try to match one of the known merchants
    m = re.search(reg, elem)
    if m is not None:
        # Return the matched merchant name
        return m.group(0)
    else:
        # Specify here exactly what you want to return when there is no match: None, an empty string...
        return None

df = pd.DataFrame({"Merchant": ["UBER SR",
                                "SR UBER",
                                "ZOMATO SR",
                                "12FLIPKART",
                                "CABIFY"]})

# Merchants you want to match
regex = r"UBER|ZOMATO|FLIPKART"

# Use applymap to iterate over the whole dataframe
df["Merchant_Flag"] = df.applymap(lambda x: match_merchant(x, regex))

Output:

     Merchant Merchant_Flag
0     UBER SR          UBER
1     SR UBER          UBER
2   ZOMATO SR        ZOMATO
3  12FLIPKART      FLIPKART
4      CABIFY          None
  • Thanks for the information. However, it is giving me an error as mentioned below: `AttributeError: 'NoneType' object has no attribute 'group'`. Can you please guide me on this? – Anukriti Singh Jun 28 '21 at 15:42
  • Sure, could you tell me exactly where that error is occurring? Is it with my code or when using your DataFrame? Perhaps I should add that, given that you have more than one column, you should first select the column `Merchant` before applying `.applymap`, like this: `df["Merchant_Flag"] = df[["Merchant"]].applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))` – Jay Jun 28 '21 at 15:50
  • If I just run your code as it is, it works fine. But when I am using it on my DF, it is giving me this error. – Anukriti Singh Jun 28 '21 at 15:58
  • If I just run your code as it is, it works fine. But when I am using it on my DF, it is giving me this error. Yes, I have used it on a particular column only. I am using the below code: `df = pd.read_csv("myfile.csv")` and `df[["Merchant_Name"]].applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0))`. Error: `pandas\_libs\lib.pyx in pandas._libs.lib.map_infer() in (x) ----> 1 df2["Merchant_Flag"] = df2[["Merchant_Name"]].applymap(lambda x: re.search(r"UBER|ZOMATO|FLIPKART", x).group(0)) AttributeError: 'NoneType' object has no attribute 'group'` – Anukriti Singh Jun 28 '21 at 16:04
  • This means that `re.search` did not match any of the merchants in the regular expression. Do you have any other merchants? If so, you should include them all in the `r"UBER|ZOMATO|FLIPKART"` regex. If you don't know all the merchants, the function passed to `applymap` should be modified to handle the special case when no match occurs; is that your case? – Jay Jun 29 '21 at 07:17
  • Yes I am aware of the merchants. It worked for me. Thanks for the help. Appreciated – Anukriti Singh Jun 30 '21 at 09:04
  • In case of pyspark, applymap is not useful. What can you suggest for big datasets in pyspark. – Anukriti Singh Jun 30 '21 at 09:36
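
For the pyspark question in the last comment, a rough sketch of an equivalent approach (assuming a Spark DataFrame named sdf with the same Merchant column) is Spark's built-in regexp_extract, which returns the matched text, or an empty string for rows with no match:

from pyspark.sql import functions as F

# Group index 0 returns the whole match; rows with no merchant name get ""
sdf = sdf.withColumn(
    "Merchant_Flag",
    F.regexp_extract(F.col("Merchant"), r"UBER|ZOMATO|FLIPKART", 0),
)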