Empirical analysis just means evaluating your 'stuff' (algorithm, theory, application, code) against some data, as opposed to evaluating it theoretically/logically. What kind of data you use (and what aspect you evaluate) can vary. The time it takes to stem, say, 1000 words is easy to measure, so that is straightforward to evaluate empirically. Another option is to evaluate the quality of the output (which, I'm guessing, is what you want to do): take some gold-standard data (say, a list of words and their stems or lemmas), run your stemmer/lemmatizer over it, and count how many times it got it right.
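For the timing side, something like this would do (a rough sketch; PorterStemmer is just a stand-in for whichever stemmer you want to test, and the word list is made up):
>>> import timeit
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> words = ['dogs', 'running', 'flies', 'happily'] * 250  # ~1000 words, purely illustrative
>>> timeit.timeit(lambda: [stemmer.stem(w) for w in words], number=10)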
There is something in wordnet that may help you: the synsets in wordnet carry lemma information, which you could also treat as the stem (there is a difference between stemming and lemmatization, but some googling or searching here on SO will explain that in more detail).
Something like the following should get you started:
>>> from nltk.corpus import wordnet as wn
>>> d = wn.synsets('dogs')[0]
>>> d.lemmas()
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'),
Lemma('dog.n.01.Canis_familiaris')]
This takes the first synset for the word 'dogs' (there are usually multiple synsets for a given word), and .lemmas() gives you access to its lemmas. So now you have a word-lemma pair. Run your stemmer/lemmatizer on the input word, count how many times it gets the stem/lemma right, do this for every stemmer you have, and there you are.
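To turn that into numbers, a loop along these lines would do (a rough sketch: it naively takes the first synset's first lemma as the gold answer and uses NLTK's WordNetLemmatizer as the candidate being evaluated; swap in your own stemmers and a proper test set):
>>> from nltk.corpus import wordnet as wn
>>> from nltk.stem import WordNetLemmatizer
>>> test_words = ['dogs', 'churches', 'running']  # your own test set goes here
>>> lemmatizer = WordNetLemmatizer()
>>> correct = 0
>>> for word in test_words:
...     synsets = wn.synsets(word)
...     if not synsets:
...         continue
...     gold = synsets[0].lemmas()[0].name()  # naive choice: first synset, first lemma
...     if lemmatizer.lemmatize(word) == gold:
...         correct += 1
...
>>> correct / len(test_words)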
You will have to look into it in a bit more detail (which synset and which lemma to take, for example), but hopefully this helps you on the way.