Here's what I'm doing: I'm web crawling a website, for my personal use, to copy the text of a book's chapters into text files, which I then transform into a PDF with another program and automatically put in my cloud. Everything is fine until this happens: special characters are not copied correctly. For example, the apostrophe (’) shows up as \xe2\x80\x99 in the text file and the dash (–) shows up as \xe2\x80\x93. I used this (Python 3):

    for text in soup.find_all('p'):
        texta = text.text
        f.write(str(str(texta).encode("utf-8")))
        f.write('\n')

I did that because I originally had a bug when reading those characters that just stopped my program, so I encoded everything to UTF-8 and then turned it back into a string with Python's str() method.
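
To make the symptom concrete, here is a minimal example of what that encode-then-str() combination produces in Python 3 (the sample string below is just an illustration, not text from the site):

    # str() on a bytes object returns its repr, escape sequences and all,
    # which is exactly what ends up written to the text file
    s = '\u2019 \u2013'                # a curly apostrophe and an en dash
    print(str(s.encode('utf-8')))      # prints: b'\xe2\x80\x99 \xe2\x80\x93'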

I will post the whole code in case anyone has a better solution to my problem. Here's the part that crawls the website from page 1 to max_pages; you can change the argument of the final crawl_ATG() call to get more or fewer chapters of the book:

    import requests
    from bs4 import BeautifulSoup

    def crawl_ATG(max_pages):
        page = 1
        while page <= max_pages:
            x = page
            url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
            source = requests.get(url)
            chapter = source.content
            soup = BeautifulSoup(chapter.decode('utf-8', 'ignore'), 'html.parser')
            f = open('atg_chapter' + str(x) + '.txt', 'w+')
            for text in soup.find_all('p'):
                texta = text.text
                f.write(str(str(texta).encode("utf-8")))
                f.write('\n')
            f.close()
            page += 1

    crawl_ATG(10)

I will deal with cleaning up the useless first lines that get copied into each file once I have a solution to this problem. Thank you.

Seraf
  • Are you using Python 2 or 3? It matters. Read [the Python Unicode howto.](https://docs.python.org/release/3.2/howto/unicode.html) – Bob Dylan Nov 17 '15 at 16:55
  • I'm using Python 3, thank you for the link, I will dig into it carefully. @BobDylan – Seraf Nov 17 '15 at 17:37
  • It matters whether you are using Python 2 or 3 because str() means something completely different in each. You need to edit the question to say which, so people can help you. – Bob Dylan Nov 17 '15 at 17:43

3 Answers


The easiest way I found to fix this problem is to pass encoding="utf-8" to the open() function:

    with open('file.txt', 'w', encoding='utf-8') as file:
        file.write('ñoño')
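
Applied to the loop from the question, the same idea would look roughly like this (a sketch reusing the question's soup and chapter counter x, untested against the actual site):

    # Open the file in text mode with an explicit UTF-8 encoding and write the
    # paragraph text directly -- no manual encode()/str() round trip needed
    with open('atg_chapter' + str(x) + '.txt', 'w', encoding='utf-8') as f:
        for text in soup.find_all('p'):
            f.write(text.text)
            f.write('\n')
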
Ignacio Ambía

The only error I can spot is:

    str(texta).encode("utf-8")

In it, you are forcing a conversion to str and encoding it. It should be replaced with:

    texta.encode("utf-8")

EDIT:

The error stems from the server not sending the correct encoding for the page, so requests assumes 'ISO-8859-1'. As noted in this bug report, that is a deliberate decision.

Luckily, the chardet library correctly detects the 'utf-8' encoding, so you can do:

    source.encoding = source.apparent_encoding
    chapter = source.text

Then there is no need to manually decode the text in chapter, since requests uses the detected encoding to decode the content for you.
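
Put together with the rest of the fetch, that would look something like this (a sketch; the URL and names are taken from the question):

    import requests
    from bs4 import BeautifulSoup

    source = requests.get('http://www.wuxiaworld.com/atg-index/atg-chapter-1/')
    # Replace the assumed ISO-8859-1 with the encoding chardet actually detected
    source.encoding = source.apparent_encoding
    chapter = source.text            # decoded with the detected encoding
    soup = BeautifulSoup(chapter, 'html.parser')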

memoselyk
  • It was redundant, thank you for that, but it still doesn't write special characters to my text file :( – Seraf Nov 17 '15 at 17:22

For some reason, you (wrongly) end up with UTF-8 encoded data inside a Python 3 string. The real cause is probably the double handling of the decoding: requests' response.content is the raw bytes of the page, and BeautifulSoup can work out the encoding from those bytes itself, so you should not decode it manually but pass it in directly:

    url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
    source = requests.get(url)
    chapter = source.content
    soup = BeautifulSoup(chapter, 'html.parser')

If that is not enough, that is, if you still get ’ and – (Unicode u'\u2019' and u'\u2013') displayed as \xe2\x80\x99 and \xe2\x80\x93, it could be caused by the HTML page not correctly declaring its encoding. In that case you should first encode the text back to a byte string with the latin1 encoding, and then decode it as utf8:

    chapter = source.text.encode('latin1', 'ignore').decode('utf8', 'ignore')
    soup = BeautifulSoup(chapter, 'html.parser')

Demonstration:

    t = u'\xe2\x80\x99 \xe2\x80\x93'
    t = t.encode('latin1').decode('utf8')

Displays: u'\u2019 \u2013'

    print(t)

Displays: ’ –
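
For completeness, here is a rough sketch of how that round trip could slot into the crawl loop from the question (same names as in the question; only needed if the page really does mis-declare its encoding):

    source = requests.get(url)
    # Undo the wrong latin1 decoding, then decode the raw bytes as UTF-8
    chapter = source.text.encode('latin1', 'ignore').decode('utf8', 'ignore')
    soup = BeautifulSoup(chapter, 'html.parser')
    with open('atg_chapter' + str(x) + '.txt', 'w', encoding='utf-8') as f:
        for p in soup.find_all('p'):
            f.write(p.text + '\n')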

Serge Ballesta
  • It didn't work, saying there is no encode for bytes content. I looked at some websites on the internet and found that each line of my text document is a binary literal. I will just leave my code as is and make a converter from binary literal to string – Seraf Nov 17 '15 at 20:40