R: How to get the file name with quanteda's char_segment

Question

I am using char_segment from the quanteda library to split a single file into multiple documents separated by a pattern. The command works well and is easy to use. (I tried str_match and strsplit, but without success.)

Unfortunately, I am unable to keep the file name as a variable, which is key to my next analysis.

Example of my commands:

library(quanteda)
library(readtext)  ## readtext() comes from the readtext package
doc <- readtext(paste0("PATH/*.docx"))
View(doc)

docc <- char_segment(doc$text, pattern = ",", remove_pattern = TRUE)

Any suggestions or other ways to split the documents are welcome.

Nicolás Velasquez · Accepted Answer · 2018-07-06T21:59:38.277

Simply get the list of your docx files first; this gives you the names of the files. Then run the char_segment function on each of them with lapply, a for loop, or purrr::map().

The following code assumes that your target documents are stored in a directory called "docx" within your working directory.

library(quanteda)
library(readtext)  ## Remember to include in your posts the libraries required to replicate the code.


list_of_docx <- list.files(path = "./docx", ## Looks inside the ./docx directory
                       full.names = TRUE,   ## retrieves the full path to the documents
                       pattern = "[.]docx$", ## retrieves all documents whose names end in ".docx"
                       ignore.case = TRUE)  ## ignores the letter case of the document's names

Preparing the for loop

df_docx <- data.frame() ## Create an empty dataframe to store your data

for (d in seq_along(list_of_docx)) {  ## Iterate over the indices of the list of document paths
    temp_object <- readtext(list_of_docx[d])
    temp_segmented_object <- char_segment(temp_object$text, pattern = ",", remove_pattern = TRUE)
    temp_df <- as.data.frame(temp_segmented_object)
    colnames(temp_df) <- "segments"
    temp_df$title <- as.character(list_of_docx[d])  ## Create a variable with the title of the source document
    temp_df <- temp_df[, c("title", "segments")]
    df_docx <- rbind(df_docx, temp_df) ## Append each dataframe to the previously created empty dataframe
    rm(temp_df, temp_object, temp_segmented_object)  ## Clean up the temporary objects
}


head(df_docx)
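
The same idea can be sketched with lapply instead of a for loop. This is only an illustration under stated assumptions: the `texts` vector and the `segment_one` helper are hypothetical, and base R's strsplit() stands in for char_segment() so the sketch runs without any docx files or extra packages. In real use you would read each file with readtext() and call char_segment() inside the helper.

```r
## Hypothetical in-memory stand-in for the texts read from ./docx;
## the names play the role of the file paths.
texts <- c("a.docx" = "first part, second part",
           "b.docx" = "only part")

## Hypothetical helper: split one text and keep the source name next to each segment.
segment_one <- function(name, text) {
  ## strsplit() is a base R stand-in for char_segment(); swap in the real call as needed
  segments <- trimws(strsplit(text, ",", fixed = TRUE)[[1]])
  data.frame(title = name, segments = segments, stringsAsFactors = FALSE)
}

## Build one data frame per document, then stack them into a single data frame
pieces <- lapply(names(texts), function(n) segment_one(n, texts[[n]]))
df_segments <- do.call(rbind, pieces)

print(df_segments)
```

Collecting the pieces in a list and binding them once with do.call(rbind, ...) avoids growing the data frame on every iteration, which can matter when there are many documents.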

Sorry, I am a beginner in R. Could you give an example of adding the file name with lapply or a loop? Thanks in advance – Rodrigo B, Jul 06 '18 at 17:03
My file contains multiple documents, so the number of doc_id rows is not the same as the number of documents I get when I separate them with the pattern. – Rodrigo B, Jul 06 '18 at 17:57
Hello Rodrigo. I updated the answer to be more thorough. Please check it. – Nicolás Velasquez, Jul 06 '18 at 22:00
Wow! It works great!! Thanks for your time and effort with your example! – Rodrigo B, Jul 07 '18 at 04:10

score 0 · Answer 2 · answered Jul 06 '18 at 00:55

You should already have the names of the Word files:

require(readtext)
data_dir <- system.file("extdata/", package = "readtext")
readtext(paste0(data_dir, "/word/*"))

readtext object consisting of 6 documents and 0 docvars.    
# data.frame [6 × 2]
  doc_id                                 text                
  <chr>                                  <chr>               
1 21Parti_Socialiste_SUMMARY_2004.doc    "\"[pic]\nRésu\"..."
2 21vivant2004.doc                       "\"http://www\"..." 
3 21VLD2004.doc                          "\"http://www\"..." 
4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..." 
5 UK_2015_EccentricParty.docx            "\"The Eccent\"..." 
6 UK_2015_LoonyParty.docx                "\"The Offici\"..."

They are passed to quanteda's downstream objects as document names.

Kohei Watanabe

In your example, where do I set the pattern to separate documents? This is my problem: I cannot get the names of the files when I separate the documents. – Rodrigo B Jul 06 '18 at 16:59
I don't have problems getting file names from – Rodrigo B Jul 06 '18 at 17:47
I don't have problems getting file names with readtext, but this column (file id) disappears when I separate the document with char_segment – Rodrigo B Jul 06 '18 at 17:49
try `docnames()` on a corpus, tokens, or DFM. – Kohei Watanabe Jul 08 '18 at 01:50

score 0 · Answer 3 · answered Jul 06 '18 at 18:05

Example when I read the text. This is my problem when I separate the documents by ###*

When I use char_segment

Rodrigo B