
I have some unusual data. I have a dictionary of keywords (keys) and labels (values), and I want to find those keywords ONLY when they appear at the start and/or end of the text, not in the middle of a sentence. Below is a simple DataFrame that reproduces the problem, along with the Python code I have tried so far. How do I restrict the search to the start or end of the text? My current code matches substrings anywhere in the text.

Code:

import pandas as pd

d = {'apple corp': 'Company', 'app': 'Application'}  # keyword -> label
l1 = [1, 2, 3, 4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()
df

Original Dataframe:

id   text 
1    The word Apple is commonly confused with Apple Corp which is a business         
2    Apple Corp is a business they make computers                                    
3    Apple Corp also writes App                                                      
4    The Apple Corp also writes App                                                  

Code Tried out:

def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df

Using x = (i for i in d if i.startswith(k) in k) raises TypeError: 'in <string>' requires string as left operand, not bool.

Using x = (i for i in d if i.startswith(k) == True in k) returns empty values.

Using x = (i.startswith(k) for i in d if i in k) raises TypeError: sequence item 0: expected str instance, NoneType found.
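
The three attempts above appear to share one cause: `i.startswith(k)` asks whether the short key starts with the long text, and the resulting bool is then fed to `in` or `d.get`. In the second attempt, `a == True in k` is a chained comparison (`a == True and True in k`), so it short-circuits to False and yields nothing. A minimal sketch of the difference, using one sample row (the names `wrong`, `starts`, `ends`, and `error` are illustrative only):

```python
d = {'apple corp': 'Company', 'app': 'Application'}
k = "apple corp also writes app"

# Wrong direction: does the key start with the whole text? Always False here.
wrong = [i.startswith(k) for i in d]

# A bool on the left of `in <str>` is what raises the first TypeError.
try:
    False in k
except TypeError as exc:
    error = str(exc)

# d.get(False) returns None, which is what ';'.join() later rejects.
lookup_of_bool = d.get(False)

# Right direction: does the text start (or end) with the key?
starts = [d[i] for i in d if k.startswith(i)]
ends = [d[i] for i in d if k.endswith(i)]
```

Note that `starts` contains both labels, because "apple corp" also begins with "app"; plain `startswith` does not respect word boundaries.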

Results from the code above, which creates the new field 'text_value':

id   text                                                                            text_value
1    The word Apple is commonly confused with Apple Corp which is a business         Company;Application
2    Apple Corp is a business they make computers                                    Company;Application
3    Apple Corp also writes App                                                      Company;Application
4    The Apple Corp also writes App                                                  Company;Application

Desired final output:

id   text                                                                            text_value
1    The word Apple is commonly confused with Apple Corp which is a business         NaN
2    Apple Corp is a business they make computers                                    Company
3    Apple Corp also writes App                                                      Company;Application
4    The Apple Corp also writes App                                                  Application
sharp
  • Aren't your actual output and desired output the same ? – Benoit Drogou Aug 27 '19 at 19:37
  • No. Its not. I will add original DataFrame to make it less confusing. – sharp Aug 27 '19 at 19:40
  • Why id 2 has "Application"? It ends with "Computer", not "app". – harvpan Aug 27 '19 at 19:43
  • @BenoitDrogou. Yes, please see the text_value from these section. They are different. I am basically trying to show what I have tried and what didn't work. Good catch for application id 2. My typo. I fixed it – sharp Aug 27 '19 at 19:44
  • The biggest complication here is that `apple corp` is two words, meaning you can't easily define the first "value". – ALollz Aug 27 '19 at 19:49
  • @ALollz. Yes. I struggled with that. I could have used 'split' function that just ran the search based on the key value. I have dictionary with two words so I agree, it make it bit more complicated. That's why I had to post it here. Run out of tricks – sharp Aug 27 '19 at 19:52

2 Answers

1

You need a matcher function that accepts a flag, and then call it twice: once to get the startswith result and once to get the endswith result.

def matcher(s, flag="start"):
    if flag=="start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None

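# One helper column per position: label of the key found at the start / at the end
df['st'] = df['text'].apply(matcher)
df['ed'] = df['text'].apply(matcher, flag="end")
# Join the non-missing labels of each row with ';'
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]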

The text_value column looks like:

0                       
1                Company
2    Company;Application
3            Application
Name: text_value, dtype: object
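
As a possible refinement (not part of the answer above), the same idea can be sketched with regular expressions so that keys only match on word boundaries and rows without a match get a missing value instead of an empty string. The function name `edge_labels` is hypothetical, and the sketch assumes the dictionary keys are lowercase:

```python
import re

d = {'apple corp': 'Company', 'app': 'Application'}

def edge_labels(text, mapping=d):
    """Return ';'-joined labels for keys at the start/end of text, or None."""
    # Try longer keys first so 'apple corp' wins over its prefix 'app'.
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    # Key at the very start, followed by a word boundary.
    start = re.match(rf"(?:{alternation})\b", text, flags=re.IGNORECASE)
    # Key at the very end, preceded by a word boundary.
    end = re.search(rf"\b(?:{alternation})$", text, flags=re.IGNORECASE)
    # Assumption: mapping keys are lowercase, so lowercase the matched text.
    labels = [mapping[m.group(0).lower()] for m in (start, end) if m]
    return ";".join(labels) if labels else None

texts = [
    "the word apple is commonly confused with apple corp which is a business",
    "apple corp is a business they make computers",
    "apple corp also writes app",
    "the apple corp also writes app",
]
results = [edge_labels(t) for t in texts]
```

With the DataFrame from the question, this could be applied as `df['text_value'] = df['text'].map(edge_labels)`, which leaves a missing value (None) in the first row.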
harvpan
0
joined = "|".join(d.keys())

pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'

get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')

# regex=True is required in recent pandas versions when the replacement is a callable
df.text.str.replace(pat, get, regex=True)


0                       
1                Company
2    Company;Application
3    Company;Application
Name: text, dtype: object
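
The pattern above comes without an explanation, so here is a hedged walkthrough of how it seems to work, using the standard `re` module on the lowercased texts instead of pandas. The pattern has three alternatives: a key at the start (optionally preceded by "the"), a key at the end, and a catch-all `.*` that consumes whatever is left so it gets replaced by an empty string. The `texts` list and the `only_end` example are illustrative additions:

```python
import re

d = {'apple corp': 'Company', 'app': 'Application'}
joined = "|".join(d.keys())
pat = r'(?i)^(?:the\s*)?(' + joined + r')\b.*?|.*\b(' + joined + r')$|.*'

def get(m):
    # group(1): key found at the start; group(2): key found at the end.
    # A group that did not take part in the match is None.
    return d.get(m.group(1), "") + (';' + d.get(m.group(2), "") if m.group(2) else '')

texts = [
    "the word apple is commonly confused with apple corp which is a business",
    "apple corp is a business they make computers",
    "apple corp also writes app",
    "the apple corp also writes app",
]
# Each text is consumed piece by piece; every piece is replaced by get(match).
results = [re.sub(pat, get, t) for t in texts]

# A text that matches only at the end keeps the leading ';' separator.
only_end = re.sub(pat, get, "it also writes app")
```

Because of the optional `(?:the\s*)?` prefix, the last row is labeled "Company;Application" rather than the "Application" requested in the question, and an end-only match keeps a leading ";" that could be removed with `.strip(';')`.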
Onyambu