
I have a problem with Turkish characters in Python 3.5.

You can see the issue in the screenshots. How can I fix this?

My code is below. In the last two lines, print(blink1.text) produces garbled characters, but print("çÇğĞıİuÜoÖşŞ") prints correctly, even though both contain the same kinds of characters.

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.ensonhaber.com/son-dakika")
soup = BeautifulSoup(r.text)
for tag in soup.find_all("ul",attrs={"class":"ui-list"}):
    for link1 in tag.find_all('li'):
        for link2 in link1.find_all('a',href=True):
            print("www.ensonhaber.com" + link2['href'])
            print("\n")
            print(link2['title'])
        for link3 in link1.find_all('span',attrs={"class":"spot"}):
            # summary section: print(link3.text)
            print("\n")      
            rbodysite = "http://www.ensonhaber.com"+link2['href']
            rbody = requests.get(rbodysite)
            soupbody = BeautifulSoup(rbody.text)
            for btag in soupbody.find_all("article",attrs={"class":""}):
                for blink1 in btag.find_all("p"):
                    print(blink1.text)
                    print("çÇğĞıİuÜoÖşŞ")

My output:

Hangi Åehirde çekildiÄi bilinmeyen videoda bir çocuk, ailesiyle yolculuk yaparken gördüÄü trafik polisinin üÅüdüÄünü düÅünerek gözyaÅlarına boÄuldu. Trafik polisi, yanına gelen çocuÄu "Ben üÅümüyorum" diyerek teselli etti.
çÇğĞıİuÜoÖşŞ

[Screenshot: Python code]

[Screenshot: character issue]

Caleb Kleveter
zer03

1 Answer


The problem is almost certainly a wrong code page. Python is code-page agnostic, and neither print nor BeautifulSoup is going to fix it for you.

The site seems to serve all pages in UTF-8, so I think your terminal is using something else. I don't know which character sets contain ı, but the positions of the corrupted characters and their values suggest Windows-1254. You need to convert the text (for example with iconv), but first you need to read the <meta charset= tag, because the page won't always be UTF-8. On the output side, you also need to know your terminal's encoding, which is harder to determine.
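As a rough sketch of the two steps described above (not a definitive fix for this site), the conversion can be done in Python itself instead of calling iconv. The helper names decode_html and safe_for_terminal are hypothetical and are not part of requests or BeautifulSoup. The sketch assumes the page declares its charset in a <meta> tag and falls back to UTF-8 otherwise.

```python
import re
import sys

# Looks for charset=... inside a <meta> tag in the raw bytes.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes using the charset declared in a <meta> tag.

    Hypothetical helper: falls back to `default` when no charset is
    declared, or when the declared one is unknown or does not fit the bytes.
    """
    match = _META_CHARSET.search(raw[:4096])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Bad declaration: decode with the default and mark broken bytes.
        return raw.decode(default, errors="replace")


def safe_for_terminal(text):
    """Replace characters that the terminal's encoding cannot represent."""
    encoding = sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)
```

With requests, this would be fed the raw bytes, for example decode_html(rbody.content), rather than rbody.text, which has already been decoded with whatever encoding requests guessed. BeautifulSoup also accepts bytes directly and attempts its own encoding detection, which may be enough on its own.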

Joshua