
I have created a block of code that scrapes property-listing information from a Polish website.

import csv
from urllib.request import urlopen as Open
from urllib.request import Request
from bs4 import BeautifulSoup as soup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
results = "https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page=1"
req = Request(url=results, headers=headers) 
html = Open(req).read()

page_soup = soup(html, "html.parser")
pager_items = page_soup.find("div", {"class": "after-offers clearfix"}).find("ul", {"class": "pager"}).findAll("li")
total_pages = int(pager_items[4].text)

offer_list = []
offer_list.append(["Price", 
    "Location", 
    "Forma własności",
    "Liczba pięter",
    "Liczba pokoi",
    "Materiał budynku",
    "Ogrzewanie",
    "Okna",
    "Piętro",
    "Powierzchnia",
    "Rodzaj zabudowy",
    "Rok budowy",
    "Rynek",
    "Stan wykończenia",
    "Link"])

for page in range(1, 2):  # limited to the first page for testing; use total_pages + 1 for a full run
    print(page)
    results = "https://www.otodom.pl/sprzedaz/mieszkanie/?nrAdsPerPage=72&search%5Border%5D=created_at_first%3Adesc&page="+str(page)
    #print(results)

    req = Request(url=results, headers=headers) 
    html = Open(req).read()

    page_soup = soup(html, "html.parser")

    listings = page_soup.findAll("article",{"data-featured-name":"listing_no_promo"})
    #print(len(listings))

    for i in listings:
        listing = i.a.get("href")
        req = Request(url=listing, headers=headers) 
        html = Open(req).read()

        page_soup = soup(html, "html.parser")

        # get location

        location = page_soup.find("a", {"href":"#map"}).text.split("}")[2]

        # get price

        price = page_soup.find("div", {"class":"css-1vr19r7"}).text.replace(" ","").replace("zł","")

        # get property features

        container = page_soup.find("section", {"class":"section-overview"}).findNext("div").ul.findAll("li")

        features = []

        for feature in ["Forma własności",
    "Liczba pięter",
    "Liczba pokoi",
    "Materiał budynku",
    "Ogrzewanie",
    "Okna",
    "Piętro",
    "Powierzchnia",
    "Rodzaj zabudowy",
    "Rok budowy",
    "Rynek",
    "Stan wykończenia"
                       ]:
            for contain in container:
                if feature in contain.text:
                    features.append(contain.text.split(":")[1].replace(" m²",""))
                    break
            else:  # if we didn't break
                features.append("N/A")


        offer = [price, location, *features, listing]
        offer_list.append(offer)

with open('filename.csv', 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(offer_list)

print("data saved")

I have gotten to the stage where the file is saved; however, the Polish characters get destroyed, e.g. Åódź, łódzkie.

Is there a way to either convert the Polish characters to pure Latin, e.g. ó to o, or keep them unchanged?
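A minimal sketch of the "convert to pure Latin" option, assuming only the standard Polish diacritics occur (note that `unicodedata.normalize("NFKD")` alone would not help here, because ł, U+0142, has no Unicode decomposition and would simply be dropped):

```python
# Map each Polish diacritic to its plain-Latin counterpart.
# str.maketrans builds a one-to-one character translation table.
PL_TO_LATIN = str.maketrans(
    "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ",
    "acelnoszzACELNOSZZ",
)

def to_plain_latin(text):
    """Replace Polish diacritics, e.g. 'Łódź' -> 'Lodz'."""
    return text.translate(PL_TO_LATIN)

print(to_plain_latin("Łódź, łódzkie"))  # -> Lodz, lodzkie
```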

goldsilvy
    Are you sure they are destroyed? It looks like you tried to load the UTF8 file using the wrong encoding. Anything outside the US-ASCII range takes two or more bytes in UTF8, and the first one would look like `Å` if you used Latin1 instead of UTF8 when reading the file – Panagiotis Kanavos Dec 04 '19 at 09:35
  • Your own question proves this - StackOverflow like all web sites returns text using UTF8 encoding, and yet, your sample text isn't mangled – Panagiotis Kanavos Dec 04 '19 at 09:36
  • I have gotten the idea for utf-8 from another post on SO, it seemed to have solved the problem there. On my first week of python though, so could have misunderstood something... – goldsilvy Dec 04 '19 at 09:39
  • There's no problem. You used the wrong codepage to read the file. If I try to read the UTF8 bytes for `Forma własności` with the codepage 1252, I get `Forma wÅ‚asnoÅ›ci`. That's because `ł` and all characters in the Latin1 range above 0x7F are encoded using *two* bytes. The first one here is `0xC5`, which in Latin 1 is the value for `Å` – Panagiotis Kanavos Dec 04 '19 at 09:51
  • Check the [Wikipedia article on UTF8](https://en.wikipedia.org/wiki/UTF-8) to see how the characters are actually represented. – Panagiotis Kanavos Dec 04 '19 at 09:53
  • Thanks for all the replies, much appreciated. When I opened the file straight from Excel it loaded incorrect encoding. By opening a blank Excel and then importing csv and setting the encoding from there, changed it to utf-8 and it reads everything correctly now. Many thanks once again! – goldsilvy Dec 04 '19 at 09:59
  • You should have said `Excel` from the start. There are far better ways to handle this, and it's not *Excel's* fault. When you double click on a text file Excel has no idea what's in there. If the file has byte order marks, it will try to load it using the UTF encoding specified by those marks. For UTF8 though, the standard way is to *not* use BOM. In that case Excel will try to import the text using your machine's locale settings. When you import the data explicitly though, you tell Excel what codepage to use – Panagiotis Kanavos Dec 04 '19 at 10:04
  • An even *better* solution though is to use a library that generates `xlsx` files directly, like `openpyxl` or `xlsxwriter`. An `xlsx` file is a zip package containing XML files in a well-defined format. You could create them directly, but it's a lot easier to use a library for this. The result is a lot smaller than the uncompressed CSV file too – Panagiotis Kanavos Dec 04 '19 at 10:05
  • PS: Excel can read data from web pages directly since at least Excel 2007. You could use it for quick tests, to explore the structure of a page etc – Panagiotis Kanavos Dec 04 '19 at 10:10
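The round-trip described in the comments can be reproduced directly, and writing the CSV with a byte order mark (`encoding='utf-8-sig'`) lets a double-clicked file open correctly in Excel without the manual import step. A sketch, reusing the filename from the question:

```python
import csv

# Reproduce the mojibake: the UTF-8 bytes for 'ł' (0xC5 0x82)
# decoded with codepage 1252 become 'Å' + '‚'.
mangled = "Forma własności".encode("utf-8").decode("cp1252")
print(mangled)  # -> Forma wÅ‚asnoÅ›ci

# Writing with a UTF-8 BOM ('utf-8-sig') lets Excel detect the
# encoding when the file is opened by double-clicking.
rows = [["Price", "Location"], ["500 000", "Łódź, łódzkie"]]
with open("filename.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```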

0 Answers