Here's what I'm doing: I'm crawling a website for my personal use to copy the chapters of a book into text files, which another program then converts to PDF automatically so I can put them in my cloud. Everything works until this happens: special characters are not copied correctly. For example, the apostrophe (’) shows up as \xe2\x80\x99 in the text file and the dash (–) shows up as \xe2\x80\x93. I used this (Python 3):
for text in soup.find_all('p'):
    texta = text.text
    f.write(str(str(texta).encode("utf-8")))
    f.write('\n')
I did this because the program crashed with an error when it reached those characters, so I encoded everything to UTF-8 and converted the result back to a string with Python's str().
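To make the symptom reproducible without the crawler, here is a minimal sketch of what that write line does to a single string. The sample text is made up for illustration; it only contains the two characters mentioned above (U+2019, the right single quotation mark, and U+2013, the en dash).

```python
# Made-up sample containing U+2019 (right single quote) and U+2013 (en dash).
sample = "It\u2019s here \u2013 now"

# encode() returns a bytes object; str() on bytes returns its repr,
# i.e. the b'...' literal with escaped bytes, not the decoded text.
encoded = sample.encode("utf-8")
as_written = str(encoded)

print(as_written)  # b'It\xe2\x80\x99s here \xe2\x80\x93 now'
```

So the escape sequences in the file appear to come from `str()` being applied to a bytes object, rather than from the page itself.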
I'll post the whole code in case anyone has a better solution to my problem. Here's the part that crawls the website from page 1 to max_pages; you can change the argument of the crawl_ATG(10) call at the bottom to get more or fewer chapters of the book:
import requests
from bs4 import BeautifulSoup


def crawl_ATG(max_pages):
    page = 1
    while page <= max_pages:
        x = page
        url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
        source = requests.get(url)
        chapter = source.content
        soup = BeautifulSoup(chapter.decode('utf-8', 'ignore'), 'html.parser')
        f = open('atg_chapter' + str(x) + '.txt', 'w+')
        for text in soup.find_all('p'):
            texta = text.text
            # This is the line that produces the \xe2\x80\x99 output.
            f.write(str(str(texta).encode("utf-8")))
            f.write('\n')
        f.close()
        page += 1


crawl_ATG(10)
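A possible variant of the write step, as a sketch only: open the file with an explicit `encoding="utf-8"` and write the `str` directly, instead of encoding it and calling `str()` on the result. This assumes the original crash was a `UnicodeEncodeError` caused by the platform's default file encoding, which has not been confirmed here. The helper name `write_paragraphs`, the file name, and the sample paragraphs are made up for illustration.

```python
import os
import tempfile


def write_paragraphs(path, paragraphs):
    """Hypothetical helper: write each paragraph on its own line as UTF-8."""
    # An explicit encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        for paragraph in paragraphs:
            f.write(paragraph)
            f.write("\n")


# Made-up paragraphs standing in for text.text from soup.find_all('p').
paragraphs = ["It\u2019s here \u2013 now", "Second paragraph"]

path = os.path.join(tempfile.mkdtemp(), "atg_chapter_sample.txt")
write_paragraphs(path, paragraphs)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read().splitlines()
```

The `with` block also closes the file automatically, so no separate close call is needed.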
I will clean up the useless first lines that get copied once I have a solution to this problem. Thank you.