I have a corpus containing two text files that I imported as:

temp = list.files(pattern = ".txt")  
mydata = lapply(temp, read.delim, sep ="\t", quote = "")  
mydata

The output class was list, but I converted it to character with:

class(mydata)  
list  
mydata <- as.character(mydata)

the texts are of the character class:

class(mydata)    
[1] "character"  

But they seem to be character strings, since the output first shows:

[[1]]ï..We.give.the.observer.as.much.time.as.he.wants.to.make.his.response..we.simply.increase.the.number.of.alternative.stimuli.among.which.he.must.

(The line above is just an example from one of the texts.) It then prints the actual texts as they are, with each sentence on a separate line, e.g.:

ï..this.is.just.a.bunch.of.crab.to.analyse.
1  I need to understand how this R package works.
2  lexical diversity needs to be analysed for two texts for now.
3  In this document I am typing each sentence on a separate line.

I need these texts as a character vector for the next step of the analysis, which is to convert them to ASCII with the stringi package in R, e.g.:

stri_enc_toascii(mydata) 

This function only converts character vectors to ASCII encoding. So the question is:

How do I convert a corpus of character strings to a character vector?

P.S: I have already reviewed all other questions in StackOverflow to avoid a duplicate question. Thanks for your help!


Thanks guys for your help! I simply used as.vector() to convert the character string to a character vector:

as.vector(mydata)
is.vector(mydata)
TRUE

But the main problem remains: I wanted a character vector as input for the stringi function stri_enc_toascii(mydata), to convert mydata to ASCII encoding, but the encoding still shows as unknown. Is there any straightforward way to convert an "unknown" encoding to "ascii"?

Samuel Liew
  • Please format your code appropriately, see https://stackoverflow.com/help/formatting – jay.sf May 26 '18 at 13:20
  • Parts of your question do not make sense. If `mydata` is the result of a call to `lapply`, then `class(mydata)` should return "list", not "character". Furthermore, `read.delim` is designed to read tables and is the wrong function for reading a non-tabular text file. – Ryan C. Thompson May 26 '18 at 13:25
  • Yes, the result was initially **list**, but I changed it to character with as.character(). I have previously imported text files with 'read.delim', specifying the pattern as .txt, and I could do a bunch of work with them. Please let me know of any better method of reading a whole corpus of text files if you know one. Please note that I'm trying to use the qdap package in R. – Maryam Nasseri May 26 '18 at 13:34
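
On the request for a better way to read the files: a minimal sketch, assuming the files are plain UTF-8 text sitting in one directory. `read_text()` and `read_corpus()` are hypothetical helper names made up for illustration, not functions from qdap or stringi. `readLines()` returns one element per line, so no header row is inferred; the `ï..` prefix in the output above is most likely a byte-order mark that `read.delim` turned into a column name, which the `"UTF-8-BOM"` connection encoding should drop.

```r
# Hypothetical helper: read one file as a character vector, one element per line.
read_text <- function(path) {
  # Assumption: the file is UTF-8; "UTF-8-BOM" also strips a leading BOM if present.
  con <- file(path, encoding = "UTF-8-BOM")
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Hypothetical helper: read every .txt file in a directory into a named list.
read_corpus <- function(dir = ".") {
  # "\\.txt$" matches a literal ".txt" at the end of the file name.
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  texts <- lapply(files, read_text)
  names(texts) <- basename(files)
  texts
}

texts <- read_corpus()                          # list: one character vector per file
all_lines <- unlist(texts, use.names = FALSE)   # one flat character vector of lines
```

Each element of `texts` is already a character vector, so it can be passed to `stri_enc_toascii()` directly, without going through `as.character()` on the whole list.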

1 Answer

The question isn't very clear, but it sounds like you want to flatten a vector of strings that are also converted to ASCII:

library(stringi)

string1 <- "Here's a random phrase."          # English, ASCII
string2 <- ".هنا عبارة عشوائية هناائية"     # Arabic, not ASCII
string3 <- "여기에 임의의 문구가 있습니다."      # Korean, not ASCII

strings <- c(string1, string2, string3)       # as a vector of strings of length 3

ascii_strings <- stri_enc_toascii(strings)    # convert to ASCII

stri_flatten(ascii_strings)           # as a flat, single element string

# other options....
stri_c(ascii_strings, collapse = " ") # as a flat, single element string
Reduce(paste, ascii_strings)          # base::Reduce() / purrr::reduce() with paste() will do the same
stringr::str_c(ascii_strings, collapse = " ")       # stringr::str_c() just wraps stringi::stri_c()
stringr::str_flatten(ascii_strings, collapse = " ") # stringr::str_flatten() just wraps stringi::stri_flatten()
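
On the follow-up about the encoding still showing as unknown: as far as I know, `Encoding()` only reports a declared mark, and pure-ASCII strings are never marked in R, so "unknown" is the expected result after a successful conversion rather than a sign of failure. A minimal sketch, with sample strings made up for illustration, that checks the content with `stri_enc_isascii()` instead:

```r
library(stringi)

# Made-up sample data: the second element contains a non-ASCII character.
x <- c("plain ASCII text", "caf\u00e9")

ascii_x <- stri_enc_toascii(x)   # non-ASCII code points become the SUBSTITUTE character (0x1A)

stri_enc_isascii(ascii_x)        # checks the actual bytes of each element
Encoding(ascii_x)                # only reports a declared mark; ASCII strings carry none

# If the accented letters should be kept in readable form rather than substituted,
# an ICU transliteration is one option (a different technique from stri_enc_toascii):
stri_trans_general(x, "Latin-ASCII")
```

So the thing to test is whether `stri_enc_isascii()` returns `TRUE` for every element, not what `Encoding()` prints.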
knapply