
Background (TL;DR, provided for the sake of completeness)

Seeking advice on an optimal solution to an odd requirement. I'm a (literature) student in my fourth year of college, self-taught in programming. I'm competent enough with Python that I won't have trouble implementing solutions I find (most of the time) and building on them, but because I'm still a newbie, I'm seeking advice on the best ways I might tackle this peculiar problem.

I'm already using NLTK, though differently from the examples in the NLTK book. I rely on a lot of it, particularly WordNet, so that material is not foreign to me. I've read most of the NLTK book.

I'm working with fragmentary, atomic language. Users input words and sentence fragments; WordNet is used to find connections between the inputs and to generate new words and sentences/fragments. My question is about turning an uninflected word from WordNet (a synset) into something that makes sense contextually.

The problem: how to inflect the result in a grammatically sensible way? Without any kind of grammatical processing, the results are just a list of dictionary-searchable words, with no agreement between them. The first step is for my application to stem/pluralize/conjugate/inflect root words according to context. (The "root words" I'm speaking of are synsets from WordNet and/or their human-readable equivalents.)
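
To make that concrete, here's roughly what I'm getting back from NLTK's WordNet interface (method names per current NLTK; older releases expose some of these as plain attributes):

    from nltk.corpus import wordnet as wn

    # A synset is an uninflected concept; its lemma names are the
    # dictionary-searchable "root words" I'm talking about.
    synset = wn.synsets('depart', pos=wn.VERB)[0]
    print(synset)                # e.g. Synset('depart.v.01')
    print(synset.lemma_names())  # uninflected lemmas, e.g. ['depart', 'take_leave', 'quit']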

Example scenario

Let's assume we have a chunk of a poem, to which users are adding new inputs. The new results need to be inflected in a grammatically sensible way.

The river bears no empty bottles, sandwich papers,   
Silk handkerchiefs, cardboard boxes, cigarette ends  
Or other testimony of summer nights. The sprites

Let's say now it needs to print one of four possible next words/synsets: ['departure', 'to have', 'blue', 'quick']. It seems to me that 'blue' should be discarded; 'The sprites blue' seems grammatically odd/unlikely. Each of the remaining candidates, though, could work once sensibly inflected:

If it picks 'to have', the result could be sensibly inflected as 'had', 'have', 'having', 'will have', 'would have', etc. (but not 'has'). (The resulting line would be something like 'The sprites have', and the sensibly inflected result will provide better context for future results ...)

I'd like for 'departure' to be a valid possibility in this case; while 'The sprites departure' doesn't make sense (the input is 'sprites', not the possessive 'sprites''), 'The sprites departed' (or another verb conjugation) would.

'The sprites quick' seemingly wouldn't make sense, but something like 'The sprites quickly [...]' or 'The sprites quicken' could, so 'quick' is also a candidate for sensible inflection.
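
As a rough sanity check on these intuitions, WordNet itself can report which parts of speech each candidate can take, and derivationally related forms can bridge 'departure' to 'depart' (a quick sketch; 'to have' is reduced to its lemma 'have' here):

    from nltk.corpus import wordnet as wn

    for word in ['departure', 'have', 'blue', 'quick']:
        # Every part of speech WordNet knows for this lemma:
        # 'n' noun, 'v' verb, 'a'/'s' adjective, 'r' adverb.
        print(word, sorted(set(s.pos() for s in wn.synsets(word))))

    # Derivationally related forms bridge noun <-> verb, e.g. 'departure' -> 'depart':
    for lemma in wn.synsets('departure')[0].lemmas():
        print(lemma.name(), [d.name() for d in lemma.derivationally_related_forms()])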

Breaking down the tasks

  1. Tag part of speech, plurality, tense, etc. of the original inputs. Taking note of this could help in selecting from the several possibilities (e.g., choosing between had/have/having could be more directed than random if a user had input 'having' rather than some other tense). I've heard the Stanford POS tagger is good, and it has an interface in NLTK (see the first sketch after this list). I'm not sure how to handle tense detection here.
  2. Consider context in order to rule out grammatically peculiar possibilities. Look at the last couple of words and their part-of-speech tags (and tense?), as well as any sentence boundaries, and drop whatever wouldn't make sense. After 'The sprites' we don't want another article (or determiner, as far as I can tell), nor an adjective, but an adverb or a verb could work. Comparing the current sequence against sequences in tagged corpora (and/or Markov chains?), or consulting grammar-checking functions, could provide a solution for this (see the second sketch below).
  3. Select a word from the remaining possibilities (those that could be inflected sensibly). This isn't something I need an answer for -- I've got my methods for this. Let's say it's randomly selected.
  4. Transform the selected word as needed. If the information from #1 can be folded in (for example, perhaps the "pluralize" flag was set to True), do so. If there are several possibilities (e.g., the picked word is a verb, but a few tenses are possible), select randomly. Regardless, I'm going to need to morph/inflect the word (see the third sketch below).
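
For step 1, a minimal sketch using NLTK's default tagger (the Stanford tagger can be swapped in through NLTK's wrapper). Note that the Penn Treebank tags already encode some tense and number: VBZ is third-person singular present, VBD past, VBG gerund, NNS plural noun, and so on:

    import nltk

    line = "The river bears no empty bottles, sandwich papers,"
    tagged = nltk.pos_tag(nltk.word_tokenize(line))
    # A list of (word, tag) pairs with Penn Treebank tags,
    # e.g. ('The', 'DT'), ('river', 'NN'), ('bottles', 'NNS'), ...
    print(tagged)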
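
For step 2, one cheap version of the "sequences in tagged corpora" idea: count tag-to-tag transitions in a tagged corpus, then drop candidates whose tag hardly ever follows the current one. A sketch over the Brown corpus (the tagset argument is per current NLTK):

    import nltk
    from nltk.corpus import brown

    # Count tag -> next-tag transitions across the Brown corpus.
    tags = (tag for word, tag in brown.tagged_words(tagset='universal'))
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(tags))

    # 'The sprites' ends in a NOUN; how often does each candidate tag follow one?
    for candidate_tag in ['VERB', 'ADV', 'ADJ', 'DET']:
        print(candidate_tag, round(cfd['NOUN'].freq(candidate_tag), 3))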
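
For step 4, NLTK itself mostly goes the other direction (WordNet's morphy() maps inflected forms back to base forms), so generation needs a separate morphology library. One candidate, purely my assumption and not something I've settled on, is the pattern.en module:

    # pip install pattern  (pattern.en is a separate library, not part of NLTK)
    from pattern.en import conjugate, lexeme, pluralize, PAST

    print(lexeme('have'))                    # all surface forms: ['have', 'has', 'having', 'had']
    print(conjugate('quicken', tense=PAST))  # 'quickened'
    print(pluralize('sprite'))               # 'sprites'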

I'm looking for advice on the soundness of this routine, as well as suggestions for steps to add. Ways to break these steps down further would also be helpful. Finally, I'm looking for suggestions on which tool might best accomplish each task.

floer32
  • Have you considered using an n-gram language model rather than a part-of-speech tagger? They are often used in NLP systems for machine translation or speech-to-text when you want to programmatically pick fluent text from a set of options, given some previous history. – Rob Neuhaus Dec 17 '11 at 00:07
  • @rrenaud n-gram _language_ models are valuable in MT when combined with _alignment_ models (the simplest is the statistical noisy-channel model) to select word order. In this case, I suspect the parse tree from nltk or the Stanford POS tagger would be a bit more valuable. Michael, it might be worth taking a look at some simple word alignment models, which when combined with a decent language model will correctly handle tense/pluralization a decent percentage of the time (say around 80-90%). – Alex Churchill Dec 17 '11 at 01:18
  • @eowl Should I move/duplicate the question? I'm not sure what the protocol is ... An NLP stack site is such an exciting idea :) – floer32 Dec 17 '11 at 19:14

1 Answer


I think that the comment above about an n-gram language model fits your requirements better than parsing and tagging. Parsers and taggers (unless modified) will suffer from the lack of right context for the target word (i.e., you don't have the rest of the sentence available at query time). Language models, on the other hand, handle the past (left context) efficiently, especially for windows of up to 5 words. The problem with n-grams is that they don't model long-distance dependencies (anything more than n words apart).

NLTK has a language model: http://nltk.googlecode.com/svn/trunk/doc/api/nltk.model.ngram-pysrc.html . A tag lexicon may help you smooth the model more.

The steps as I see them:

  1. Get a set of words from the users.
  2. Create a larger set of all possible inflections of the words.
  3. Ask the model which inflected word is most probable.
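
A rough sketch of step 3, with a plain conditional frequency distribution standing in for the language model (any corpus will do; an unseen history gives all-zero counts, which is exactly where smoothing helps):

    import nltk
    from nltk.corpus import gutenberg

    # Bigram counts from a poetry text as a stand-in language model.
    words = [w.lower() for w in gutenberg.words('whitman-leaves.txt')]
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

    history = 'sprites'                            # last word of the line so far
    candidates = ['have', 'has', 'had', 'having']  # step 2's inflected set
    print(max(candidates, key=lambda w: cfd[history][w]))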

cyborg
  • Okay, I'd been suspecting that n-grams would be most appropriate for small-window predictions like this, but wanted confirmation. That sounds good. Now, I know smoothing will help with zero-frequency words, but I'll probably have a lot of them. A possible solution: make an alternative corpus to my normal-words corpus, where all the words have been replaced with tags (like POS), and when I hit zero-frequency words, back off to an n-gram model based on the 'POS corpus'? So if the exact sequence is unknown, perhaps a likely _grammatical_ structure could be chosen. Any thoughts? – floer32 Dec 17 '11 at 19:10
  • How do you know the part of speech of an unknown word? Word shape features might help a bit (*ly is adverb, *ies is likely a verb), but I wouldn't worry too much about it yet. Build a prototype first, worry about potential problems later. – Rob Neuhaus Dec 17 '11 at 20:28
  • There are really good English lexicons out there. What kind of words do you think will be missing? Proper nouns? Old English? – cyborg Dec 17 '11 at 21:00
  • Well, I figured I might make my own n-gram corpus from the texts of famous poems (I was going to grab the top couple hundred files in the "poetry" section on Project Gutenberg) rather than a prose corpus. I figured users might put in words that create a sequence not found in the corpus. Words that are unknown to my corpus might yet be known to a POS tagger, so they could be reduced to POS status if necessary. In any case, I'll build a prototype that ignores these cases first -- good point! – floer32 Dec 17 '11 at 21:44