Stemming words in NLP

Question

Can anyone tell me which stemmer is the best? Also, I have a text and I only want to stem the words that are in a list, leaving the rest of the tokens as they are. Below is my code.

Text:swot del swot analys 2013 strengths weak brand nam valu at $ 7 .', '5 bil produc custom environ record compet in merg and acquisit direct sel busy model commod ( comput hardw ) produc poor custom serv low invest in r & d weak pat portfolio too few retail loc low differenty opportun threats expand serv and enterpr solv busy obtain mor pat through acquisit strengthen their pres in emerg market tablet market grow grow demand for smartphon and tablet profit margin declin on hardw produc slow grow rat of the laptop market intens competit strengths brand nam .', 'del has a very strong brand reput for qual produc .', 'compet in merg and acquisit .', 'ov the last fiv year del has spent $ 13 bil for success merg and acquisit , which brought pat , new cap , asset and skil to the busy .', 'direct sel busy model .', 'it wil prov hard for del to compet in such market or at least fight back the lost market shar .', 'intens competit .', 'the company fac intens competit in al it busy seg .', 'intens competit .', 'the company fac intens competit in al it busy seg .', 'it compet in term of pric , qual , brand , technolog , reput , distribut and rang of produc , with ac , appl , hp , ibm , lenovo and toshib .']

The stemmer has stemmed every word, so the words have lost their meaning.

The word list is ['force', 'speciality', 'durability', 'military_posture', 'long_suit', 'intensity', 'metier', 'military_strength', 'strong_suit', 'strength', 'forte', 'enduringness', 'effectiveness', 'strong_point', 'specialty', 'posture', 'persuasiveness', 'potency', 'military_capability', 'forcefulness', 'intensity_level']

The code is:

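    import re

    import bs4
    import mechanize
    import nltk
    from nltk import stem
    from readability.readability import Document

    # Fetch the page; `url` is set earlier (not shown).
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Chrome')]
    html = br.open(url).read()
    titles = br.title()

    # Extract the readable article body and title.
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    soup = bs4.BeautifulSoup(readable_article)
    Final_Article = soup.text
    #final.append(titles)
    #final.append(url)
    #final.append(Final_Article)

    # Strip the markup and tokenize.
    raw = nltk.clean_html(html)
    cleaned = re.sub(r'& ?(ld|rd)quo ?[;\]]', '\"', raw)
    tokens = nltk.wordpunct_tokenize(cleaned)

    # Stem the word list, then the tokens.
    # words() is a helper defined elsewhere (not shown); presumably it returns the word list above.
    lancaster = stem.lancaster.LancasterStemmer()
    word = words('strength')
    Words = [lancaster.stem(e) for e in word]
    t = [lancaster.stem(t) for t in tokens if t in Words]
    text = nltk.Text(t)
    find = ' '.join(str(e) for e in Words)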

Please help

score 0 · Answer 1 · answered Jul 24 '14 at 18:10

Your question is more opinion-based, I guess. Every stemmer implements some well-established stemming algorithm. Personally, I prefer the Porter stemming algorithm because of its simplicity and fundamental nature. You can read more about it here: Porter Stemming Algorithm (with implementation)
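As for stemming only the words in your list, here is a minimal sketch of one way to do it. The function name `stem_selected` and its parameters are hypothetical, and it assumes NLTK's `PorterStemmer` unless you pass in another stemmer. The key point is to use a conditional expression that returns the original token when it does not match, rather than an `if` filter, which drops the unmatched tokens.

```python
def stem_selected(tokens, targets, stem=None):
    """Stem only the tokens that match a target word; keep the rest as-is.

    A token matches when its stem equals the stem of any word in `targets`,
    so inflected forms of a listed word can be caught as well.
    """
    if stem is None:
        # Assumption: NLTK is installed; any callable str -> str works instead.
        from nltk.stem import PorterStemmer
        stem = PorterStemmer().stem

    target_stems = {stem(word) for word in targets}
    result = []
    for token in tokens:
        stemmed = stem(token)
        # A conditional keeps unmatched tokens; an `if` filter in a list
        # comprehension would drop them instead.
        result.append(stemmed if stemmed in target_stems else token)
    return result
```

With your variables, something like `stem_selected(tokens, word)` should return a list of the same length as `tokens`, where only the tokens related to your word list are stemmed. If you would rather match tokens exactly against the list, compare `token` itself against `set(targets)` instead of comparing stems.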
