Getting text with accented characters using Python and Selenium

Question

I wrote a scraping script with Python and Selenium. It scrapes data from a Spanish-language website:

for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes <td> tags from line
    print tds[0].text  # FIRST PRINT
    if len(tds)%2 == 0:  # takes data from lines with even quantity of cells only
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
    print data  # SECOND PRINT

The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o". What's the reason for this?

could you show the original string, and the data in tds please? — tglaria, Dec 02 '15 at 13:26

score 3 · Answer 1 · answered Dec 02 '15 at 11:25

You are mixing encodings:

u'' # unicode string
b'' # byte string

The text property of tds[0] is a byte string, which is encoding-agnostic, while the second print operates on unicode strings, so the two kinds of string get mixed.
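
To make the distinction between text and byte strings concrete, here is a minimal sketch that is not part of the original answer. It assumes Python 3, where text is `str` and bytes are `bytes` (in Python 2 the corresponding types are `unicode` and `str`), and the names `text`, `raw`, and `row` are purely illustrative. Whichever type `.text` returns, the sketch also shows that the escaped form seen in the second print spells the same characters as the readable one, which is how an escaped representation of a list element looks.

```python
# Minimal sketch for Python 3; variable names are illustrative.
text = "Data de Distribui\u00e7\u00e3o"   # text string (unicode in Python 2)
raw = text.encode("utf-8")                 # byte string (str in Python 2)

# The \u escapes are just another spelling of the accented characters.
assert text == "Data de Distribuição"

# Decoding with the codec used for encoding restores the original text.
assert raw.decode("utf-8") == text

# ascii() escapes non-ASCII characters, similar to what Python 2 shows
# when a list containing unicode strings is printed.
row = [text]
escaped = ascii(row)
print(escaped)
```

Printing a single element gives the readable text, while the escaped representation only affects how the value is displayed, not the stored data.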

Josi

score 0 · Answer 2 · answered Jul 27 '21 at 10:15

To work with accented characters, we first have to encode or decode them before using them:

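# -*- coding: utf-8 -*-
accent_char = b"\xc3\xb4\xc3\xa2"   # UTF-8 encoded bytes for "ôâ"
name = accent_char.decode('utf-8')  # decode the bytes into a text string
print(name)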

The code above decodes the UTF-8 bytes into a text string.
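
As a hedged sketch of the same idea in Python 3, where only `bytes` objects have a `decode` method: the helper `to_text` below is a hypothetical name, not part of Selenium or the standard library, and it assumes that any bytes it receives are UTF-8 unless another encoding is passed.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


accent_bytes = b"\xc3\xb4\xc3\xa2"        # UTF-8 bytes for the characters "ôâ"
name = to_text(accent_bytes)              # decoded into a 2-character text string
already_text = to_text("\u00f4\u00e2")    # text passes through unchanged

print(len(accent_bytes), len(name))       # byte count vs. character count
```

Checking the type before decoding avoids calling `decode` on a value that is already text, which would raise an `AttributeError` in Python 3.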

Akash zawar
