How to avoid error in textmatrix function in R's LSA package

Question

I'm taking part in this Kaggle competition and I'm wondering if anyone has any familiarity with the textmatrix function from the LSA package in R.

Basically, the textmatrix function accepts a directory as an argument and creates a textmatrix from all text files found in that directory.

Unfortunately, the textmatrix function will throw an error when it comes across a text file that contains zero terms (this can happen if you filter using stop words, for example).

Does anyone know of a simple way to have textmatrix ignore files that end up with zero terms? Or of a relatively quick way to identify and remove these files?

TIA!

score 1 · Answer 1 · answered Mar 27 '13 at 15:03

I don't know how to make it ignore empty files. A sort-of workaround that I have used is to add a word that was not yet in the corpus to every file.
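A minimal sketch of that workaround in base R, in case it helps. The function name `add_placeholder_token` and the token `zzplaceholderzz` are made up for illustration. The token should be lowercase, purely alphabetic, absent from the corpus, and not in your stop word list, so that it presumably survives whatever filtering textmatrix applies. Note that this modifies the files in place, so run it on a copy of the corpus.

```r
# Append a placeholder token to every file in a directory so that no file
# ends up with zero terms. Both the function name and the default token
# are hypothetical; choose a token that does not occur in your corpus.
add_placeholder_token <- function(dir, token = "zzplaceholderzz") {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip subdirectories
  for (f in files) {
    # append = TRUE adds the token on its own line at the end of the file
    cat("\n", token, "\n", sep = "", file = f, append = TRUE)
  }
  invisible(files)
}

# Usage (assumes the lsa package is installed and "corpus_copy" is a
# directory of text files; both are assumptions, not from the question):
# add_placeholder_token("corpus_copy")
# tm <- lsa::textmatrix("corpus_copy", stopwords = my_stopwords)
```

Since textmatrix returns terms as rows, it should also be possible to drop the placeholder row afterwards with something like `tm[rownames(tm) != "zzplaceholderzz", , drop = FALSE]`, though I have not checked how the rest of the lsa functions treat the resulting all-zero documents.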

Advantages:

Every file will have at least one word, so textmatrix does not fail.
The same word appears in every file, so it will not affect the relevance of individual documents.
You know that the number of words according to the textmatrix is one more than the number of words in the original documents.

Disadvantage:

Each file becomes a bit more similar to all the others, because they all share one word.

(Note: there may be disadvantages that I haven't thought of.)
