
I have a set of informal documents (a couple of thousand) that I want to apply topic modeling (MALLET) to. The problem is that the documents contain a considerable number of misspelled words. Most are intentional: short forms and local lingo such as `'juz' -> 'just'` and `'alr' -> 'already'`. Several variations of these exist, owing to different authors' peculiar writing styles.

After feeding them to MALLET, I was somewhat bothered to find that one of the generated topics is actually a set of misspelled stopwords. I believe these words are mostly used in a small subset of documents from the same author, which is presumably why MALLET picked them up.

My question is: should I spell-check and correct these misspelled words, perhaps saving the corrected text somewhere, before running further tasks on the documents? I suppose this would mean I need to manually verify the corrections before committing them, right? What would be the most "efficient" way to do this?

Or should I simply ignore these misspelled words?

goh

2 Answers


I don't think we can answer that without knowing the impact of the misspelled words, or of miscorrected ones, on the outcome of your topic modelling. So if you could give more information, that would be good.

However, I would have thought you'd want to correct them, at least where the correction clearly matches the original author's intent.

The Archetypal Paul
  • @Paul, for instance I have one topic with the set of words {'juz' (just), 'tt' (that), 'oso' (also), 'alrdy' (already), 'frm' (from), 'wan' (want), ...} – goh Nov 25 '10 at 15:28
  • That's not my question. If you don't correct, what's the impact on your topic modelling. If you do, what's the impact? – The Archetypal Paul Nov 25 '10 at 22:27
  • @Paul, at the moment (when I do not correct), I see a number of topics consisting of different variations and short-hands of the same words. They are noise, some are actually stopwords, and they affect the way I'm reading the topics. I'm having some trouble interpreting the topics, actually. As for correcting them, I have no idea, as I have not done so; I believe I would need to spellcheck and correct them manually. – goh Nov 26 '10 at 09:32
  • OK, then it seems you need to spellcheck them. What would be the reason not to? – The Archetypal Paul Nov 26 '10 at 09:34
  • Hmmmm, I am wondering what would be the best way to handle this task. Do I use a spell checker to identify potential misspelt words, then verify and commit the corrections manually, saving them separately from the original text? Or is there a better way to do it... – goh Nov 26 '10 at 10:16
  • There must be a spellchecker with an API you can use; a minimal sketch of that kind of workflow follows below. – The Archetypal Paul Nov 26 '10 at 10:59
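For illustration, here is a minimal sketch of that two-step workflow, assuming plain-text documents and the third-party `pyspellchecker` package (the package choice and file layout are my assumptions, not part of the discussion above): first flag unknown tokens for manual review, then apply the hand-verified corrections and save the normalized copies separately from the originals.

```python
import re
from pathlib import Path

from spellchecker import SpellChecker  # assumed third-party package: pip install pyspellchecker

TOKEN = re.compile(r"[a-z']+")

def unknown_tokens(corpus_dir):
    """Collect tokens the spellchecker does not recognize, for manual review."""
    spell = SpellChecker()
    tokens = set()
    for path in Path(corpus_dir).glob("*.txt"):
        tokens.update(TOKEN.findall(path.read_text(encoding="utf-8").lower()))
    return spell.unknown(tokens)

def normalize(corpus_dir, out_dir, corrections):
    """Apply a hand-verified {misspelling: correction} map; keep originals intact."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for path in Path(corpus_dir).glob("*.txt"):
        words = TOKEN.findall(path.read_text(encoding="utf-8").lower())
        fixed = " ".join(corrections.get(w, w) for w in words)
        (Path(out_dir) / path.name).write_text(fixed, encoding="utf-8")

# Hypothetical mapping, verified by hand before committing (examples from the question).
corrections = {"juz": "just", "tt": "that", "oso": "also",
               "alrdy": "already", "frm": "from", "wan": "want"}
```

Keeping the `corrections` dictionary explicit has the side benefit that it doubles as the manually verified record the question asks about.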

What do you do with stopwords at the moment? If you are doing topic modelling then it would make sense to filter them out. If so, why don't you filter out these terms too?
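For example, a minimal pre-filtering sketch, with an illustrative variant list taken from the comments above (as an aside, MALLET's import step can also take an additional stoplist file via `--extra-stopwords`, so the same variant list could be supplied there directly):

```python
# Minimal sketch: treat the misspelled variants as extra stopwords and strip
# them before the text reaches MALLET. The variant list here is illustrative.
STANDARD_STOPWORDS = {"just", "that", "also", "already", "from", "want"}
MISSPELLED_VARIANTS = {"juz", "tt", "oso", "alrdy", "frm", "wan"}
STOPLIST = STANDARD_STOPWORDS | MISSPELLED_VARIANTS

def strip_stopwords(text):
    """Drop stoplisted tokens from whitespace-tokenized, lowercased text."""
    return " ".join(w for w in text.lower().split() if w not in STOPLIST)

print(strip_stopwords("juz finished tt report oso emailed it alrdy"))
# -> "finished report emailed it"
```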

[Edit in response to reply]

There is some research on handling stopwords within LDA in a more principled way. Two papers spring to mind:

  1. Term Weighting Schemes for Latent Dirichlet Allocation
  2. Rethinking LDA: Why Priors Matter.

[1] uses a term weighting scheme, which apparently helps in a predictive task they set up. [2] uses an asymmetric prior over the document-topic distributions, which apparently leads to a few topics that absorb the stopwords and other words common to the entire corpus.
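To make the distinction concrete, here is a sketch in common LDA notation (the symbols are my gloss, not either paper's exact notation):

```latex
% Asymmetric Dirichlet prior over each document's topic proportions,
% symmetric Dirichlet prior over each topic's word distribution:
\theta_d \sim \operatorname{Dirichlet}(\alpha \mathbf{m}), \qquad
\phi_k \sim \operatorname{Dirichlet}(\beta \mathbf{u})
```

Here `\mathbf{m}` is a learned, non-uniform base measure and `\mathbf{u}` is uniform; the non-uniform `\mathbf{m}` is what allows a small number of topics to be active in every document and soak up corpus-wide high-frequency words such as stopwords.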

It seems to me that automatically inferring stopwords and other non-topic words within LDA is still an open research question.

Stompchicken
  • I am already using a stoplist. But I was wondering if there's a better way to solve this, given that I'd have to look through these documents and add the different misspelt words myself. – goh Nov 25 '10 at 15:17