
I'm taking part in this Kaggle competition and I'm wondering whether anyone is familiar with the textmatrix function from the lsa package in R.

Basically, textmatrix accepts a directory as an argument and creates a textmatrix from all the text files found in that directory.

Unfortunately, textmatrix throws an error when it comes across a text file that contains zero terms (this can happen after stop-word filtering, for example, when a file consists only of stop words).

Does anyone know of a simple way to have textmatrix ignore files that end up with zero terms? Or of a relatively quick way to identify and remove these files?

TIA!

cchamberlain
user141146

1 Answer


I don't know how to make textmatrix ignore empty files. A workaround of sorts that I have used is to add one word that does not yet occur anywhere in the corpus to every file.

Advantages:

  • every file will have at least one word, so that textmatrix does not fail
  • the same word in every file will not affect the relevance of individual documents
  • you know that the number of words textmatrix reports for each document is exactly one more than the number of words it found in the original document

Disadvantage:

  • each file becomes a bit similar to all the others, because they all share one word.

(Note: there may be disadvantages that I haven't thought of.)
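For what it's worth, here is a minimal sketch of that workaround in R. The helper name add_sentinel, the directory names and the sentinel word "zzsentinelzz" are all made up for illustration; the only real requirement is that the sentinel does not already occur in your corpus or in your stop-word list. The sketch writes padded copies to a second directory rather than touching the originals.

```r
# Hypothetical helper (not part of lsa): copy every file from src_dir to
# dst_dir and append one sentinel word, so no document ends up with zero terms.
add_sentinel <- function(src_dir, dst_dir, sentinel = "zzsentinelzz") {
  dir.create(dst_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-directories
  for (f in files) {
    txt <- readLines(f, warn = FALSE)
    # The sentinel goes on its own line at the end of the copy.
    writeLines(c(txt, sentinel), file.path(dst_dir, basename(f)))
  }
  invisible(file.path(dst_dir, basename(files)))
}

# Small demonstration with throwaway files.
src <- file.path(tempdir(), "corpus_src")
dst <- file.path(tempdir(), "corpus_padded")
dir.create(src, showWarnings = FALSE)
writeLines("the and of", file.path(src, "only_stopwords.txt"))
writeLines("latent semantic analysis", file.path(src, "normal.txt"))

padded <- add_sentinel(src, dst)

# With the lsa package installed, the padded directory is then passed on
# as usual (assumption: you filter with the stop-word list shipped with lsa):
# library(lsa)
# data(stopwords_en)
# tm <- textmatrix(dst, stopwords = stopwords_en)
```

Keeping the originals untouched makes it easy to go back if the sentinel turns out to cause problems later in the analysis.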

Ben Companjen