I am looking for new ideas for two features I am implementing.

1.) Text segmentation feature:

Ex: 
    User Query:                   Resolved Query:
    -----------                   ---------------
    It has lotsofwordstogether    It has lots of words together

I am currently using a plain recursive or dynamic-programming (DP) solution based on unigram probabilities.
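
For reference, the DP variant can be sketched roughly as follows. This is only a minimal illustration, not the exact code in use: the `UNIGRAM_COUNTS` table, the unknown-word penalty in `word_logprob`, and the `MAX_WORD_LEN` cap are all made-up assumptions, and a real system would load counts from a large corpus.

```python
import math

# Hypothetical toy unigram counts (assumption); real counts would come from a corpus.
UNIGRAM_COUNTS = {
    "it": 50, "has": 40, "lots": 10, "of": 80, "words": 12, "together": 8,
    "lot": 9, "word": 15, "to": 90, "get": 20, "her": 18, "so": 30,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 20  # assumed cap on word length; bounds the inner loop


def word_logprob(word):
    """Log-probability of a word under the unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        # Crude penalty for unknown strings (assumption): harsher for longer ones.
        return -math.log(TOTAL * 10 ** len(word))
    return math.log(count / TOTAL)


def segment(text):
    """Split text into the most probable word sequence (Viterbi-style DP)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logprob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(" ".join(segment("lotsofwordstogether")))
```

With the toy counts above this prints `lots of words together`. Capping the word length keeps the cost at roughly O(n * MAX_WORD_LEN) dictionary lookups per unspaced token.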

2.) Kind of collocation:

Ex:
    User Query:                    Resolved Query:
    -----------                    ---------------
    I like t shirts in Wal mart    I like t-shirts in Walmart

I have no clue how to do this. The only idea I have so far is to tokenise the sentence and combine tokens that are not meaningful on their own with the previous or next token, forming candidate words that can be checked against the unigrams.
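
A rough sketch of that token-merging idea, to make it concrete. The `VOCAB` set and the `joiners` tuple are hypothetical placeholders, and the greedy "merge whenever the joined form is known" rule is a simplification; comparing unigram probabilities of the merged and unmerged forms would be more robust.

```python
# Hypothetical vocabulary (assumption); in practice this would be the unigram table.
VOCAB = {"i", "like", "in", "shirts", "mart", "t-shirts", "walmart"}


def merge_tokens(query, vocab=VOCAB, joiners=("", "-")):
    """Greedily join adjacent tokens when the joined form is a known word."""
    tokens = query.split()
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        if i + 1 < len(tokens):
            # Try gluing the token to its right neighbour, with and without a hyphen.
            for joiner in joiners:
                candidate = tokens[i] + joiner + tokens[i + 1]
                if candidate.lower() in vocab:
                    merged = candidate
                    break
        if merged is not None:
            out.append(merged)
            i += 2  # both tokens were consumed
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(merge_tokens("I like t shirts in Wal mart"))
```

With this toy vocabulary the call prints `I like t-shirts in Walmart`. It is a single left-to-right pass, so the cost is linear in the number of tokens.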

These solutions are too slow for my requirements (especially the first one). I want to use both features together, and I am looking for better ideas.

starkk92

2 Answers


I guess the standard approaches involve letter n-grams.

So 'wal mart' would become the trigrams 'wal', 'alm', 'lma', 'mar', 'art' (ignoring the space).
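
A minimal sketch of what that could look like. Dropping spaces and punctuation before extracting the n-grams is an assumption inferred from the example above, and the Jaccard comparison is just one possible way to match n-gram sets; `char_ngrams` and `jaccard` are illustrative names.

```python
def char_ngrams(text, n=3):
    """Letter n-grams of text after dropping spaces and punctuation (assumption)."""
    letters = "".join(ch for ch in text.lower() if ch.isalnum())
    return [letters[i:i + n] for i in range(len(letters) - n + 1)]


def jaccard(a, b, n=3):
    """Overlap between the n-gram sets of two strings, from 0.0 to 1.0."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


print(char_ngrams("wal mart"))           # ['wal', 'alm', 'lma', 'mar', 'art']
print(jaccard("wal mart", "Walmart"))    # 1.0
print(jaccard("t shirts", "t-shirts"))   # 1.0
```

Because the separators are ignored, 'wal mart' and 'Walmart' produce the same n-gram set, so a query can be matched against an index of known terms keyed by their n-grams.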

Has QUIT--Anony-Mousse
  • I am new to this field. Can you point me to a book or online source on this? How can 't shirt' be resolved to 'tshirt' using letter n-grams? – starkk92 Feb 08 '17 at 16:43

For problem 1), finding word boundaries, you could use existing algorithms for tokenising East-Asian languages. They usually involve applying Hidden Markov models:

http://dev.datasift.com/blog/using-japanese-tokenization-generate-more-accurate-insight

https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html

I can also think of applying the CKY algorithm (used for parsing context-free grammars), especially if you can find a dictionary that provides syllable segmentation and a syllable inventory.

Problem 2), I think, is just an instance of spelling correction. Just treat the spaces as you'd treat any other character.
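
As a rough illustration of treating spaces like any other character, here is a sketch of single-edit candidate generation where the alphabet includes the space and the hyphen. The `VOCAB` set and the tie-breaking rule are made-up assumptions; a real corrector would rank candidates by frequency.

```python
import string

# The alphabet includes ' ' and '-' so they are edited like any other character.
ALPHABET = string.ascii_lowercase + " -"

# Hypothetical set of known terms (assumption).
VOCAB = {"walmart", "t-shirts"}


def edits1(s):
    """All strings one edit (delete, replace, insert, transpose) away from s."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    return set(deletes + replaces + inserts + transposes)


def correct(phrase, vocab=VOCAB):
    """Return a known term within one edit of phrase, else the phrase unchanged."""
    phrase = phrase.lower()
    if phrase in vocab:
        return phrase
    candidates = sorted(edits1(phrase) & vocab)  # alphabetical tie-break (assumption)
    return candidates[0] if candidates else phrase


print(correct("wal mart"))   # walmart
print(correct("t shirts"))   # t-shirts
```

Here 'wal mart' reaches 'walmart' by deleting the space, and 't shirts' reaches 't-shirts' by replacing the space with a hyphen, so both cases fall out of the ordinary edit operations.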

I'd post more links but I don't have enough reputation.

These aren't easy problems, good luck!

Julio