Converting a text corpus of character strings to a character vector before using the stringi package

Question

I have a corpus containing two text files that I imported as:

temp = list.files(pattern = ".txt")  
mydata = lapply(temp, read.delim, sep ="\t", quote = "")  
mydata

The output class was list, but I converted it to character:

class(mydata)  
list  
mydata <- as.character(mydata)

The texts are now of class character:

class(mydata)    
[1] "character"

But they still seem to be stored as single character strings, because the output first shows:

[[1]]ï..We.give.the.observer.as.much.time.as.he.wants.to.make.his.response..we.simply.increase.the.number.of.alternative.stimuli.among.which.he.must.

(The line above is just an example from one of the texts.) It then prints the actual texts as they are, with each sentence on a separate line, e.g.:

ï..this.is.just.a.bunch.of.crab.to.analyse. 
1  I need to understand how this R package works.                                                                                                                                                                                                                                                                                                                                                                        
2  lexical diversity needs to be analysed for two texts for now.                                                                                                                                                                                                                                                                                                                                                           
3  In this document I am typing each sentence on a separate line.

I need these texts converted to a character vector for the next step of the analysis, which is to convert them to ASCII with the stringi package in R, e.g.:

stri_enc_toascii(mydata)

This function only converts a character vector to ASCII encoding. So the question is:

How do I convert a corpus of character strings to a character vector?

P.S: I have already reviewed all other questions in StackOverflow to avoid a duplicate question. Thanks for your help!

Thanks guys for your help! I simply used as.vector() to convert the character string to a character vector:

as.vector(mydata)
is.vector(mydata)
TRUE

But the main problem remains: I wanted a character vector as input for the stringi package, so that stri_enc_toascii(mydata) would convert mydata to ASCII encoding (check here), but the encoding still shows "unknown". Is there any straightforward way to convert an "unknown" encoding to "ascii"?

Please format your code appropriately, see https://stackoverflow.com/help/formatting — jay.sf, May 26 '18 at 13:20
Parts of your question do not make sense. If `mydata` is the result of a call to `lapply`, then `class(mydata)` should return "list", not "character". Furthermore, `read.delim` is designed to read tables and is the wrong function for reading a non-tabular text file. — Ryan C. Thompson, May 26 '18 at 13:25
Yes the result was initially **list** but I changed to character with as.character(). Before, I have imported text files with the 'read.delim' and specifying the pattern as .txt and I could do a bunch of work with it. Please let me know of any better method of reading a whole corpus of text files if you know. Please note that I'm trying to use the qdap package in R. — Maryam Nasseri, May 26 '18 at 13:34

score 0 · Answer 1 · answered May 26 '18 at 15:38

The question isn't very clear, but it sounds like you want to convert a vector of strings to ASCII and then flatten it into a single string:

library(stringi)

string1 <- "Here's a random phrase."          # English, ASCII
string2 <- ".هنا عبارة عشوائية هناائية"     # Arabic, not ASCII
string3 <- "여기에 임의의 문구가 있습니다."      # Korean, not ASCII

strings <- c(string1, string2, string3)       # as a vector of strings of length 3

ascii_strings <- stri_enc_toascii(strings)    # convert to ASCII; non-ASCII code points become the substitute character (0x1A)

stri_flatten(ascii_strings)           # as a flat, single element string

# other options....
stri_c(ascii_strings, collapse = " ")          # one string, elements separated by spaces
Reduce(paste, ascii_strings)                   # base::Reduce() / purrr::reduce() with paste() will do the same
stringr::str_c(ascii_strings, collapse = " ")  # stringr::str_c() wraps stringi::stri_c(); collapse is needed to flatten
stringr::str_flatten(ascii_strings)            # stringr::str_flatten() wraps stringi::stri_flatten()
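
Applied to the situation in the question, a minimal sketch might look like the following. It assumes the corpus is a folder of plain UTF-8 text files with one sentence per line; the directory `corpus_sketch`, the file names and the sample sentences are made up for illustration. It reads the files with `readLines()` instead of `read.delim()`, because `read.delim()` treats the first line as a table header, which is likely why the first sentence shows up as a dotted column name in the question's output. Note also that base R's `Encoding()` reports "unknown" for any pure ASCII string, so "unknown" after the conversion does not by itself indicate a failure.

```r
library(stringi)

# Hypothetical corpus: two small UTF-8 files in a temporary directory
corpus_dir <- file.path(tempdir(), "corpus_sketch")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(enc2utf8("I need to understand how this R package works."),
           file.path(corpus_dir, "text1.txt"), useBytes = TRUE)
writeLines(enc2utf8("Lexical diversity: na\u00efve caf\u00e9 test."),
           file.path(corpus_dir, "text2.txt"), useBytes = TRUE)

# "\\.txt$" is a regular expression matching file names that end in ".txt"
files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)

# readLines() returns a character vector with one element per line;
# unlist() combines the per-file vectors into a single character vector
lines_per_file <- lapply(files, readLines, encoding = "UTF-8")
mydata <- unlist(lines_per_file)

is.character(mydata)                    # check that this is a character vector

# Replace every non-ASCII code point with the substitute character (0x1A)
ascii_data <- stri_enc_toascii(mydata)

all(stri_enc_isascii(ascii_data))       # check that only ASCII remains
Encoding(ascii_data)                    # pure ASCII strings are never marked
```

If the non-ASCII characters should be approximated rather than replaced (for example "é" becoming "e"), transliteration would be a different technique from the one shown here.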
