Hashtag segmentation in Python

Question

I am trying to split a hashtag made up of several words into its individual words, for example:

#goodmorning #everythingIsGood

The problem arises when the individual words are not capitalized. I am using a list of common words, and the segmentation appears to depend on the order in which the words appear in that list. For example, for

#everythingisgood

I would get the following two outputs:

everything is good ### when everything appears first
every thing is good ### when every appears first

Here is a small piece of code used for testing:

import re

wordList = 'awe some awesome because day every everything good is morning nice thing'.split()
wordList_ = '|'.join(wordList)

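def splitFunction(word):
    # Find each run of consecutive known words, then split the run into words.
    for wordSequence in re.findall('(?:' + wordList_ + ')+', word):
        print('We want to split:', wordSequence)
        # The alternation tries words in list order, so 'every' wins over 'everything'.
        for match in re.findall(wordList_, wordSequence):
            print(match)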

for wordSeq in 'goodmorning! awesomeday becauseeverything isgood'.split():
    splitFunction(wordSeq)

Any help would be greatly appreciated.

EDIT: Could always taking the longest possible word work?
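
A minimal sketch of that longest-word idea, assuming it is enough to sort the word list by length before building the alternation, so the regex tries `everything` before `every`. The names `longest_first`, `pattern` and `split_longest_first` are made up for this illustration and are not part of the test code above.

```python
import re

word_list = 'awe some awesome because day every everything good is morning nice thing'.split()

# Sort longest first so the alternation prefers 'everything' over 'every'.
longest_first = sorted(word_list, key=len, reverse=True)
pattern = re.compile('|'.join(map(re.escape, longest_first)))

def split_longest_first(hashtag):
    """Greedy split: at each position, take the longest listed word that matches."""
    # Assumption: a leading '#' is dropped and matching is case-insensitive via lower().
    return pattern.findall(hashtag.lstrip('#').lower())

print(split_longest_first('#everythingisgood'))  # ['everything', 'is', 'good']
print(split_longest_first('#awesomeday'))        # ['awesome', 'day']
```

This is still a greedy choice: `findall` silently skips any letters that match no word, so a long match taken early can leave a meaningless remainder later in the string.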

Well, the issue is that both _everything_ and _every_ _thing_ are valid words. You can always select the longer one, but which word was actually meant can only be concluded from the context. At least I do not see another way. — Nils, Aug 15 '19 at 07:38
What do you want to do with a word like `beverything`, where the two words `bevery` and `everything` both exist? Do you want to return only the first word, or both? — Dmitrii Sidenko, Aug 15 '19 at 07:51
The main problem is that, by taking the first word encountered, you may later be left with a fragment of a word, or just a bunch of letters that wouldn't mean anything. I do agree with the first comment, so maybe returning all possible solutions would make more sense. — patri, Aug 15 '19 at 10:39
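
Returning all possible solutions, as the last comment suggests, could be sketched roughly like this. It assumes a plain recursive search over the word list, where a segmentation counts only if it consumes the whole string. The function name `all_segmentations` is hypothetical, chosen for this example.

```python
def all_segmentations(text, words):
    """Return every way to write text as a concatenation of entries in words."""
    if not text:
        return [[]]  # exactly one way to segment the empty string: no words
    results = []
    for word in words:
        if text.startswith(word):
            # Keep this word only if the remainder can also be fully segmented.
            for rest in all_segmentations(text[len(word):], words):
                results.append([word] + rest)
    return results

word_list = 'awe some awesome because day every everything good is morning nice thing'.split()

for segmentation in all_segmentations('everythingisgood', word_list):
    print(' '.join(segmentation))
# every thing is good
# everything is good
```

Dead ends such as a leftover fragment simply produce no result, so a string that cannot be fully covered returns an empty list. The search can become slow on long inputs with many overlapping words; caching results per suffix would be one way to address that.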
