
I am trying to split a term containing a hashtag of multiple words, such as "#I-am-great" or "#awesome-dayofmylife".
The output I am looking for is:

 I am great
 awesome day of my life

All I could achieve is:

 >>> import re
 >>> name = "big #awesome-dayofmylife because #iamgreat"
 >>> name = re.sub(r'#([^\s]+)', r'\1', name)
 >>> print name
 big awesome-dayofmylife because iamgreat

If I am asked whether I have a list of possible words, the answer is 'No', so any guidance on that would be great. Any NLP experts?

fscore
    Would you split `#something` as `some thing` or `something`? – devnull Dec 11 '13 at 10:07
  • You can't split joined words without knowing they are just that; words. You need a dictionary. – qstebom Dec 11 '13 at 10:08
  • @devnull doesn't matter but that's a good question – fscore Dec 11 '13 at 10:09
  • @qstebom do you know any online API or dictionary of words that I can use to parse and split? – fscore Dec 11 '13 at 10:10
  • @devnull how would I proceed.. any suggestions? – fscore Dec 11 '13 at 10:11
  • http://stackoverflow.com/questions/11039178/is-there-any-free-online-dictionary-api-json-xml-with-multiple-languages-to-ch – qstebom Dec 11 '13 at 10:11
  • One more comment: it's going to be hard to parse that information, because languages are ambiguous. For instance, you would need a grammar-aware parser. Consider the string 'somethingsunclear'. How would you split it? – qstebom Dec 11 '13 at 10:13
  • @qstebom Even I have the same question hence I am confused as how to proceed with this. – fscore Dec 11 '13 at 10:14
  • If the hashtags are split by a delimiter then it would be easy. Without it becomes very complex. :) – qstebom Dec 11 '13 at 10:16
  • @qstebom I don't think the hashtag is a concern here because I can go through the entire sentence to find the hashtag and only check that word for multiple dictionary words – fscore Dec 11 '13 at 10:18
  • I have one idea.. You could store a list of common words (e.g. http://www-personal.umich.edu/~jlawler/wordlist) and then just do a lookup. Then just do a longest-match against the list. – qstebom Dec 11 '13 at 10:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/42948/discussion-between-fscore-and-qstebom) – fscore Dec 11 '13 at 11:57

2 Answers


All the commentators above are correct of course: A hashtag without spaces or other clear separators between the words (especially in English) is often ambiguous and cannot be parsed correctly in all cases.

However, the idea of the word list is rather simple to implement and might yield useful (albeit sometimes wrong) results nevertheless, so I implemented a quick version of that:

import re

# The known words, joined with | into a regex alternation ("one of these words").
# Note: the regex engine tries the alternatives left to right, so order matters.
wordList = '''awesome day of my life because i am great something some
thing things unclear sun clear'''.split()

wordOr = '|'.join(wordList)

def splitHashTag(hashTag):
  for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
    print ':', wordSequence
    for word in re.findall(wordOr, wordSequence):
      print word,
    print

for hashTag in '''awesome-dayofmylife iamgreat something
somethingsunclear'''.split():
  print '###', hashTag
  splitHashTag(hashTag)

This prints:

### awesome-dayofmylife
: awesome
awesome
: dayofmylife
day of my life
### iamgreat
: iamgreat
i am great
### something
: something
something
### somethingsunclear
: somethingsunclear
something sun clear

And as you see it falls into the trap qstebom has set for it ;-)

EDIT:

Some explanations of the code above:

The variable wordOr contains all the words joined by the pipe symbol (|), which in regular expressions means "one of these words".

The first findall is given a pattern meaning "a sequence of one or more of these words", so it matches strings like "dayofmylife". findall finds all such sequences, and I iterate over them (for wordSequence in …). For each word sequence I then find every single word in it (also using findall) and print it.
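One caveat worth noting (my addition, not part of the original answer): because the alternation is tried left to right, a shorter word that appears earlier in wordList can shadow a longer one. Sorting the word list by length, longest first, before joining avoids that. The word list below is a hypothetical illustration:

```python
import re

# Hypothetical word list where "thing" precedes "things"
words = 'thing things some something'.split()

unsorted_or = '|'.join(words)
sorted_or = '|'.join(sorted(words, key=len, reverse=True))

print(re.findall(unsorted_or, 'things'))  # ['thing'] -- the trailing "s" is lost
print(re.findall(sorted_or, 'things'))    # ['things']
```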

Alfe

The problem can be broken down to several steps:

  1. Populate a list with English words.
  2. Split the sentence into terms delimited by whitespace.
  3. Treat terms starting with '#' as hashtags.
  4. For each hashtag, find its words by longest match against the list of words.

Here is one solution using this approach:

# Returns a list of common English terms (words)
def initialize_words():
    with open(r'C:\wordlist.txt') as f: # a file containing common English words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]


def parse_sentence(sentence, wordlist):
    new_sentence = "" # output    
    terms = sentence.split(' ')    
    for term in terms:
        if term.startswith('#'): # this is a hashtag, parse it
            new_sentence += parse_tag(term, wordlist)
        else: # Just append the word
            new_sentence += term
        new_sentence += " "

    return new_sentence 


def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)    
        while word is not None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word): # the whole remaining tag was matched
                break
            tag = tag[len(word):] # drop the matched prefix and continue
            word = find_word(tag, wordlist)
    return " ".join(words)


def find_word(token, wordlist):
    # longest match: try the whole token first, then shorter prefixes
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None 


wordlist = initialize_words()
sentence = "big #awesome-dayofmylife because #iamgreat"
parse_sentence(sentence, wordlist)

In an interactive session, this returns:

'big awe some day of my life because i am great '

You will have to remove the trailing space, but that's easy. :)
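For completeness (a small addition of mine): rstrip() with no arguments removes trailing whitespace, so one call takes care of it. The string below is copied from the output above:

```python
parsed = 'big awe some day of my life because i am great '
print(parsed.rstrip())  # trailing space removed
```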

I got the wordlist from http://www-personal.umich.edu/~jlawler/wordlist.
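One optional tweak (my suggestion, not part of the original answer): find_word tests token[:i] in wordlist up to len(token) times per tag, and membership tests on a list scan every entry. Loading the words into a set makes each test O(1) on average. A sketch, where the path is just an example:

```python
# Hypothetical variant of initialize_words using a set for fast lookups;
# adjust the path to wherever your word list lives.
def initialize_words(path='wordlist.txt'):
    with open(path) as f:
        # a set gives constant-time membership tests in find_word
        return set(word.rstrip('\n') for word in f)
```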

Pedro Castilho
qstebom