Can't write/read a string text extracted from a PDF

Question

I extracted the whole text from a PDF and saved it in a variable "CCR". I can print it and the text shows up fine. But when I try to read its lines or save it to a txt file, I get blank output or nothing is saved. Any ideas?

Example of what I see when I print my variable (works fine):

"Chapter 9 - Digital Transformation"

I'm using the Tika server to extract the text.

txt_CCR = open(r"C:\Users\guerr\OneDrive\Documentos\PYTHON\TXT_FILES\CCR.txt", "w")

txt_CCR.write(CCR)
txt_CCR.close()

I get this error when I try to write to a file:

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-23-94a2126671fc> in <module>()
      1 txt_CCR = open(r'C:\Users\guerr\OneDrive\Documentos\PYTHON\TXT_FILES\CCR.txt', 'w')
----> 2 txt_CCR.write(CCR)
      3 txt_CCR.close()

~\Anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20 
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 95944: character maps to <undefined>

It's an encoding issue. Your `CCR` variable is holding chars in a different encoding than what you are trying to write in. Detect the encoding using `chardet` and `open` the txt file using the correct encoding (e.g. `txt_CCR = open(r'path', 'w', encoding=correct_encoding)`) - Endyd, May 29 '19 at 18:58
My friend said the same today. Thanks for suggesting chardet, I guess I'll use it a lot in the future. Thanks! - Vitor Vito, May 30 '19 at 18:00
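
The suggestion in the comments could look roughly like the sketch below. It assumes UTF-8 is an acceptable encoding for the output file, which is an assumption on my part and not something confirmed in the thread. The sample string and the temp-directory path are hypothetical stand-ins for the real Tika output and the real CCR.txt location. Since `CCR` is already a Python `str` and `chardet` works on bytes, choosing the output encoding explicitly may be enough here without any detection step.

```python
import os
import tempfile

# Stand-in for the text returned by Tika (hypothetical sample).
# U+F0B7 is the character named in the traceback.
CCR = "Chapter 9 - Digital Transformation\n\uf0b7 first bullet\n"

# cp1252 has no mapping for U+F0B7, which is why the original
# write() failed when open() fell back to the platform default.
try:
    CCR.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Hypothetical output path; substitute the real location of CCR.txt.
out_path = os.path.join(tempfile.gettempdir(), "CCR.txt")

# Passing encoding explicitly avoids relying on the platform default.
with open(out_path, "w", encoding="utf-8") as txt_CCR:
    txt_CCR.write(CCR)

# Read the file back with the same encoding and split it into lines.
with open(out_path, "r", encoding="utf-8") as txt_CCR:
    lines = txt_CCR.read().splitlines()
```

Using `with` also closes the file even if `write()` raises, so the explicit `close()` call is no longer needed.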
