
Hey guys, I'm quite confused about how to do an empirical analysis of a stemming algorithm, for example the Lancaster and Porter stemmers, because, unlike sorting algorithms, they don't have an obvious time efficiency to compare.

What I tried was importing both of them from NLTK, timing each with a timer in Python, and normalizing the results over 1000 runs, but I'm not sure whether that is what it means to do an empirical analysis of a word stemming algorithm, or whether it is something completely different.

user3646742

1 Answer


Empirical analysis just means analyzing (usually evaluating) your 'stuff' (algorithm, theory, application, code) against some data, as opposed to a theoretical/logical evaluation. What kind of data you use, and what kind of stuff you evaluate, can vary. Since the time it takes to stem, say, 1000 words is easily measurable, it is indeed something you can evaluate empirically. Another way would be to evaluate the quality of the output (which, I'm guessing, is what you want to do). You can do this when you have some data (say, a list of words and their stems or lemmas): run your stemmer/lemmatizer over the words and see how many times it gets it right. There is something in WordNet that may help you, since the synsets in WordNet carry lemma information, which you could also interpret as the stem. There is a difference between stemming and lemmatization, but some googling or searching here on SO will explain that in more detail.
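For the timing side, a minimal sketch might look like the following. The helper names `time_stemmer` and `compare_nltk_stemmers` are made up for illustration; the only NLTK pieces assumed are `PorterStemmer` and `LancasterStemmer` from `nltk.stem`, and the word list is whatever you choose to feed in.

```python
import timeit


def time_stemmer(stem, words, repeats=5):
    """Return the best wall-clock time (in seconds) for stemming every word once.

    `stem` is any callable that maps a word to its stem. Taking the minimum
    over several repeats reduces noise from other processes.
    """
    def run():
        return [stem(w) for w in words]

    return min(timeit.repeat(run, number=1, repeat=repeats))


def compare_nltk_stemmers(words, repeats=5):
    """Time NLTK's Porter and Lancaster stemmers on the same word list.

    NLTK is imported here so the generic helper above works without it.
    """
    from nltk.stem import LancasterStemmer, PorterStemmer

    stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer()}
    return {
        name: time_stemmer(stemmer.stem, words, repeats)
        for name, stemmer in stemmers.items()
    }
```

Calling `compare_nltk_stemmers(my_words)` would give one timing per stemmer on identical input, which is roughly what the question describes doing by hand. The absolute numbers depend on the machine, so only the comparison between stemmers is meaningful.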

Something like the following may help you:

>>> from nltk.corpus import wordnet as wn
>>> d = wn.synsets('dogs')[0]
>>> d.lemmas()
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), 
Lemma('dog.n.01.Canis_familiaris')]

It takes the first synset for the word 'dogs' (there are usually multiple synsets for a given word), and .lemmas() gives you access to its lemmas. So now you have a word-lemma pair. Run your stemmer/lemmatizer on the input word, count how many times it gets the stem/lemma right, do this for every stemmer you have, and there you are. You will have to look into it in a bit more detail (which synset and which lemma to take, for example), but hopefully this helps you on your way.
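The counting step could be sketched roughly as below. All three function names are hypothetical, and taking the first lemma of the first synset is just one simple assumption about "which synset and which lemma"; the WordNet lookup needs the corpus to be installed (for example via `nltk.download('wordnet')`).

```python
def first_lemma_name(word):
    """Return the name of the first lemma of the first WordNet synset, or None.

    Assumption: the first synset/lemma is a good enough reference form.
    """
    from nltk.corpus import wordnet as wn  # needs the 'wordnet' corpus

    synsets = wn.synsets(word)
    if not synsets:
        return None
    return synsets[0].lemmas()[0].name()


def build_gold_pairs(words, lookup=first_lemma_name):
    """Build (word, reference) pairs, skipping words the lookup cannot resolve."""
    pairs = []
    for word in words:
        reference = lookup(word)
        if reference is not None:
            pairs.append((word, reference.lower()))
    return pairs


def accuracy(stem, pairs):
    """Fraction of pairs where the stemmer's output equals the reference form."""
    if not pairs:
        return 0.0
    hits = sum(1 for word, reference in pairs if stem(word) == reference)
    return hits / len(pairs)
```

With NLTK installed you could then compare, say, `accuracy(PorterStemmer().stem, pairs)` against `accuracy(LancasterStemmer().stem, pairs)`. Keep in mind that a stem is often not a dictionary word, so exact matching against lemmas will penalize aggressive stemmers; that is a limitation of this scoring choice rather than of the stemmers.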

Igor