
I am testing text2vec. There are only two files in a directory (1.txt and 2.txt, both very small, about 20 KB each), and I want to measure their similarity. I do not understand why text2vec reports 54 documents.

> library(stringr)
> library(NLP)
> library(tm)
> library(text2vec)


> filedir="F:\\0 R\\similarity test\\corpus"
> prep_fun = function(x) {
+     x %>% 
+     # make text lower case
+     str_to_lower %>% 
+     # remove non-alphanumeric symbols
+     str_replace_all("[^[:alnum:]]", " ") %>% 
+     # collapse multiple spaces
+     str_replace_all("\\s+", " ")
+ }
> allfile=idir(filedir)
> #files=list.files(path=filedir, full.names=TRUE)
> #allfile=ifiles(files)
> it=itoken(allfile, preprocessor=prep_fun, progressbar=FALSE)
> stopwrd=stopwords("en")
> v=create_vocabulary(it, stopwords=stopwrd)
> v
Number of docs: 54 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
          term term_count doc_count
  1:     house          2         2
  2: 224161072          2         2
  3:  suggests          2         2
  4:   remains          2         2
  5: published          2         2
 ---                               
338:      year         14         6
339:       nep         16        12
340:      will         16        10
341:   chinese         20        12
342:     malay         20        10
> 

When I export the data to CSV, I find that the documents have been given new names:

1.txt_1
1.txt_2
1.txt_3
1.txt_4
...

If I use the commented-out alternative instead:

files=list.files(path=filedir, full.names=T)
allfile=ifiles(files)

it still reports 54 documents.

Similarity measures are also computed between all of these documents, and most of them are 0.

Please let me know whether this is the expected behaviour.

What I want is a single similarity measure between 1.txt and 2.txt, and an output matrix that contains only the measure for these two files.

NelsonGon
Dylan
  • Are there hidden files? Perhaps your file browser does not show all files. – jogo Jul 01 '19 at 11:21
  • Just checked. No hidden files in that folder. Another test shows that if I put 10 files in a dir, it says Number of docs: 304 . New file names are similar, called 1.txt_1, 1.txt_2...1.txt_11...2.txt_1, 2.txt_2... – Dylan Jul 01 '19 at 12:40

1 Answer


text2vec treats each line of each file as a separate document, which is why two files produce 54 documents with ids such as 1.txt_1. In your case I suggest passing a different reader function to idir/ifiles. The reader should read the whole file and collapse its lines into a single string, for example reader = function(x) paste(readLines(x), collapse = ' ').
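
To make the suggestion concrete, here is a minimal sketch of how such a reader could be wired into the pipeline from the question. It is only an illustration: read_whole_file is a hypothetical helper name, the two throwaway files stand in for 1.txt and 2.txt, and the preprocessing is deliberately simpler than prep_fun. Naming the returned string with the file name assumes that text2vec uses the names of the reader's result as document ids, as its ifiles documentation describes.

```r
library(text2vec)

# Hypothetical helper: return the whole file as ONE document.
# The name on the returned string is assumed to become the document id.
read_whole_file <- function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = " ")
  setNames(text, basename(path))
}

# Two small throwaway files standing in for 1.txt and 2.txt
corpus_dir <- tempfile("corpus_")
dir.create(corpus_dir)
writeLines(c("The house remains", "published this year"),
           file.path(corpus_dir, "1.txt"))
writeLines(c("The house was published", "last year"),
           file.path(corpus_dir, "2.txt"))

files <- list.files(corpus_dir, full.names = TRUE)

# Each file now yields one document instead of one document per line
it <- itoken(ifiles(files, reader = read_whole_file),
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))

# Cosine similarity between the rows of dtm: one row and column per file
sim <- sim2(dtm, method = "cosine", norm = "l2")
print(dim(sim))
```

With one document per file, the document-term matrix has two rows, so sim is a 2 x 2 matrix whose off-diagonal entry is the single similarity between the two files. The same reader can be passed to idir(filedir, reader = read_whole_file) if you prefer to point at the directory.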

Dmitriy Selivanov
  • It is not every day that the creator of a package answers a question about it. Thank you for your work. – John Coleman Jul 01 '19 at 16:52
  • @Dmitriy Selivanov I am very glad that the author of text2vec answered this question so quickly and perfectly. Thanks a million. – Dylan Jul 02 '19 at 05:17