I am trying to split hashtags that are made up of several concatenated words, such as:
#goodmorning #everythingIsGood
The problem arises in cases where the individual words are not capitalized. I am using a list of common words, and it appears that the segmentation depends on the position of the searched word in the list. For example, for
#everythingisgood
I would get the following two outputs:
everything is good    ### when 'everything' comes before 'every' in the list
every thing is good   ### when 'every' comes before 'everything' in the list
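I suspect this happens because the regex alternation simply takes the first alternative that matches, regardless of length; a quick check outside my actual code seems to confirm it:

import re
# shorter word listed first -> the shorter alternative is matched
print(re.findall('every|everything', 'everythingisgood'))    # ['every']
# longer word listed first -> the longer alternative is matched
print(re.findall('everything|every', 'everythingisgood'))    # ['everything']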
Here is a small piece of code used for testing:
import re
wordList = 'awe some awesome because day every everything good is morning nice thing'.split()
wordList_ = '|'.join(wordList)
def splitFunction(word):
    # Find maximal runs built from dictionary words, then split each run again
    for wordSequence in re.findall('(?:' + wordList_ + ')+', word):
        print('We want to split:', wordSequence)
        for w in re.findall(wordList_, wordSequence):
            print(w)

for wordSeq in 'goodmorning! awesomeday becauseeverything isgood'.split():
    splitFunction(wordSeq)
Any help would be greatly appreciated.
EDIT: Do you think that always taking the longest possible word could work?
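This is roughly what I have in mind (just sorting the word list longest-first before building the pattern, everything else unchanged):

import re

wordList = 'awe some awesome because day every everything good is morning nice thing'.split()
# Sort longest-first so that e.g. 'everything' is tried before 'every'
wordList_ = '|'.join(sorted(wordList, key=len, reverse=True))

def splitFunction(word):
    for wordSequence in re.findall('(?:' + wordList_ + ')+', word):
        print('We want to split:', wordSequence)
        for w in re.findall(wordList_, wordSequence):
            print(w)

for wordSeq in 'goodmorning! awesomeday becauseeverything isgood'.split():
    splitFunction(wordSeq)

I would expect this to prefer 'everything' over 'every' no matter where they sit in the list, but I am not sure whether always taking the longest match is correct in general.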