Mapping unicode characters to language in R

Question

I'm extracting data from a .pdf file written in Tamil (an Indian regional language). The text extracted in R comes out as junk characters instead of readable Unicode text, and I can't map it back to the text as it appears in the PDF. Here is the code:

library(tm)
library(pdftools)
library(qdapRegex)
library(stringr)
library(textreadr)

if(!require("ghit")){
  install.packages("ghit")
}
# Run only one of the two install calls below.
# on 64-bit Windows
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))

library(tabulizer)  # provides extract_tables()
text <- extract_tables("D:/first.pdf")
text[[1]][, 2][3]

This gives me junk characters like:

"Â«Ã®Ã¹Â£Ã±Â¢Â«Ã°Ã¬Â¢Â¬Ã¬  , Ã¢Ã´Â¢Ã¬Â£Ã±Â¢ÃºÂ¢ Â«Ã³Â£ Ì"

I also tried converting the text with stringi:

library(stringi)
stri_trans_toupper("ÃªÂ¶Ã³Â®", locale = "Tamil")

But that did not work. Any suggestions would be appreciated.

Thanks.

score 2 · Answer 1 · answered Sep 16 '17 at 13:32

If the text was extracted correctly and the only problem is the encoding, the iconv function should work. Here is an example with text encoded in "cp932" (an encoding used for Japanese).

# text file written in cp932
x <- readLines("test-cp932.txt", encoding="utf-8")  

x
## [1] "\x82\xa0\x82肪\x82Ƃ\xa4"
# this is garbled because the file has been read
# in a wrong encoding

iconv(x, "cp932", "utf-8")
## [1] "ありがとう"
# this means 'thank you'

If this does not work, your text may have been corrupted during parsing.
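
One common form of such corruption is double encoding: UTF-8 bytes are misread as Latin-1 and then re-encoded as UTF-8, which typically produces runs of "Â" and "Ã" like those in the question. The following is a minimal sketch of reversing that single step. It uses a made-up string, not the text from the PDF, and whether it applies to the Tamil output is an assumption. If the PDF relies on a legacy font encoding, a further mapping would still be needed afterwards.

```r
# Hypothetical example string; not taken from the PDF in the question.
original <- "caf\u00e9"                      # "café", stored as UTF-8

# Simulate the damage: treat the UTF-8 bytes as Latin-1, then re-encode.
misread <- rawToChar(charToRaw(original))
Encoding(misread) <- "latin1"
garbled <- enc2utf8(misread)                 # displays as "cafÃ©"

# Reverse it: convert back to Latin-1 bytes, then declare them as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

# Compare bytes rather than printed text, since printing depends on the locale.
identical(charToRaw(repaired), charToRaw(original))
```

If the repaired string is still unreadable, the remaining characters can be inspected with charToRaw to see which byte values need mapping.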

Another possibility is to convert your strings to raw bytes with charToRaw and reconstruct the original text from a byte-to-character mapping. The bytes look like this:

charToRaw(x)
##  [1] 82 a0 82 e8 82 aa 82 c6 82 a4
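
The bytes listed above can also be turned back into text without the original file. This is a minimal sketch, assuming the platform's iconv recognises the name "CP932" (the same encoding as in the example):

```r
# Bytes copied from the charToRaw() output above.
bytes <- as.raw(c(0x82, 0xa0, 0x82, 0xe8, 0x82, 0xaa, 0x82, 0xc6, 0x82, 0xa4))

# rawToChar() keeps the bytes unchanged; iconv() then interprets them.
garbled <- rawToChar(bytes)
decoded <- iconv(garbled, from = "CP932", to = "UTF-8")

decoded
```

Working from raw bytes this way makes the example reproducible, because the result no longer depends on how the file was read.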

Kota Mori

The text which I get after parsing data from .pdf file "text[[1]][,5][2] [1] "-.M/S Ã³Â£Ã±Â¢Ã³Â£Ã¼Â¢ Ã£Ã¼ÃÂ¢Â¢ÃªÂ¦ÃºÂ¢\r(Rep by its\rS.aÃµÃ©Â¢Ã¨Ã¬Â¢Ã³Â£Ã±Ã¹Â¢),V.Ã¿Ã¹Â¤ÃµÂ£ÃªÃ¹Â¢\r,V.Ã°Â£Ã´Â«Ã¨Â£Ã°Â£Ã´Â¢" And after using iconv(text[[1]][,5][2], "cp932", "utf-8") "-.M/S ï¾ƒï½³ï¾‚ï½£ï¾ƒï½±ï¾‚ï½¢ï¾ƒï½³ï¾‚ï½£ï¾ƒï½¼ï¾‚ï½¢ ï¾ƒï½£ï¾ƒï½¼ï¾ƒï½ï¾‚ï½¢ï¾‚ï½¢ï¾ƒï½ªï¾‚ï½¦ï¾ƒï½ºï¾‚ï½¢\r(Rep by its\rS.aï¾ƒï½µï¾ƒï½©ï¾‚ï½¢ï¾ƒï½¨ï¾ƒï½¬ï¾‚ï½¢ï¾ƒï½³ï¾‚ï½£ï¾ƒï½±ï¾ƒï½¹ï¾‚ï½¢),V.ï¾ƒï½¿ï¾ƒï½¹ï¾‚ï½¤ï¾ƒï½µï¾‚ï½£ï¾ƒï½ªï¾ƒï½¹ï¾‚ï½¢\r,V.ï¾ƒï½°ï¾‚ï½£ï¾ƒï½´ï¾‚ï½«ï¾ƒï½¨ï¾‚ï½£ï¾ƒï½°ï¾‚ï½£ï¾ƒï½´ï¾‚ï½¢"" – Andre_k Sep 18 '17 at 04:22
Definitely not "cp932". I used it in the example only because it is the one local encoding I am familiar with. You can search the web for the encoding your text is likely to be written in; I don't know which encodings are commonly used for Tamil. – Kota Mori Sep 18 '17 at 10:16
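
When the source encoding is unknown, one option is to try several candidates and compare the results by eye. The sketch below is an illustration only: `try_encodings` is a hypothetical helper, not part of any package, and the candidate names are assumptions that should be checked against `iconvlist()` on your system.

```r
# Hypothetical helper: convert x from each candidate encoding to UTF-8.
# iconv() returns NA when the input is not valid in the given encoding.
try_encodings <- function(x, candidates) {
  vapply(candidates,
         function(enc) iconv(x, from = enc, to = "UTF-8"),
         character(1))
}

# Demo with the cp932 bytes from the answer's example.
x <- rawToChar(as.raw(c(0x82, 0xa0, 0x82, 0xe8, 0x82, 0xaa, 0x82, 0xc6, 0x82, 0xa4)))
results <- try_encodings(x, c("CP932", "latin1"))
results
```

A conversion that succeeds is not necessarily the right one, so the output still has to be checked against the text visible in the PDF.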
