I wrote a scraping script with Python and Selenium. It scrapes data from a Spanish-language website:

for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes <td> tags from line
    print tds[0].text  # FIRST PRINT
    if len(tds)%2 == 0:  # takes data from lines with even quantity of cells only
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
    print data  # SECOND PRINT

The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o". What's the reason for this?

iacob
Alexander Yudkin

2 Answers


You are mixing encodings:

u'' # unicode string
b'' # byte string

The text property of tds[0] is a byte string, which is encoding-agnostic, while the second print operates on unicode strings, so the two string types end up mixed.
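Whichever type the text property returns, the escapes in the second print come from printing the list itself: a list displays each element through its repr, and on Python 2 the repr of a string escapes non-ASCII characters. The stored text is unchanged. A minimal sketch of one way to get readable output, assuming the cells hold either unicode text or UTF-8 bytes; `to_text`, `format_rows`, and the sample `data` are hypothetical names and values for illustration, not part of the original script:

```python
from __future__ import print_function, unicode_literals


def to_text(value, encoding="utf-8"):
    """Return value as unicode text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def format_rows(rows):
    """Join the cells of each row into one readable line of text."""
    return [" | ".join(to_text(cell) for cell in row) for row in rows]


# Hypothetical stand-in for the scraped `data` list: the first cell is unicode
# text, the second is text of the same kind stored as UTF-8 bytes.
data = [["Data de Distribui\u00e7\u00e3o", "S\u00e3o Paulo".encode("utf-8")]]

# Each entry of `lines` is plain text, so printing an entry shows the accented
# characters themselves rather than their escaped repr.
lines = format_rows(data)
```

Looping over `lines` and printing each entry, instead of printing `data` directly, should then show the accented characters as in the first print.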

Josi

To work with accented characters in Python 2, first decode the byte string to unicode (or encode it again when writing it out):

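# -*- coding: utf-8 -*-
accent_char = "ôâ"  # Python 2 byte string (UTF-8 bytes if the file is saved as UTF-8)
name = accent_char.decode('utf-8')  # decode the bytes to a unicode string
print(name)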

The code above decodes the characters under Python 2. In Python 3, str has no decode method, so only bytes objects can be decoded.
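The same decode step can be sketched in a form that behaves the same on Python 2 and Python 3 by starting from an explicit bytes literal. This assumes the input really is UTF-8; `accent_bytes` and `round_trip` are illustrative names:

```python
# UTF-8 bytes for the two characters "ôâ", written as escapes so the example
# does not depend on how the source file itself is encoded.
accent_bytes = b"\xc3\xb4\xc3\xa2"

# bytes -> unicode text: four bytes become two characters.
name = accent_bytes.decode("utf-8")

# unicode text -> bytes: encoding again restores the original byte string.
round_trip = name.encode("utf-8")
```

Decoding once at the boundary where the bytes arrive, and encoding only when writing them out, keeps the rest of the script working with a single string type.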