
I've been trying to write a script that scrapes the list of usernames from the comments section of a given YouTube video and writes those usernames to a .csv file.

Here's the script:

from selenium import webdriver
import time
import csv
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
driver=webdriver.Chrome()
driver.get('https://www.youtube.com/watch?v=VIDEOURL')
time.sleep(5)
driver.execute_script("window.scrollTo(0, 500)")
time.sleep(3)
html = driver.find_element_by_tag_name('html')
html.send_keys(Keys.END)
time.sleep(5)
scroll_time = 40
for num in range(0, scroll_time):
    html.send_keys(Keys.PAGE_DOWN)
for elem in driver.find_elements_by_xpath('//span[@class="style-scope ytd-comment-renderer"]'):
    print(elem.text)
    with open('usernames.csv', 'w') as f:
        p = csv.writer(f)
        p.writerows(str(elem.text));

It keeps throwing this error for line 19:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u30b9' in position 0: character maps to <undefined>

I'd read on here that this may have something to do with how the Windows console deals with Unicode, and I saw a potential solution that involved downloading and installing a Unicode library package, but that didn't help either.

Could anyone help me figure out what I'm doing wrong?

PS. I'm using the latest version of Python (3.7).

Much appreciated, Sergej.

snakecharmerb
sergej.k
  • Unrelated, but you'll end up with only one name. You get lots of elements, and for each one you reopen the file with `'w'` (which deletes the old contents) and write something into it that is then deleted the next time round. Use `'a'` or, even better, open the file once, write everything, then close it. That way it's much faster and won't have to open the file a bazillion times to write some names. – Patrick Artner Oct 05 '18 at 05:58
  • Heya @PatrickArtner, much appreciated. I changed that part and put up a new test video with three comments left by me, and the script worked, although not flawlessly. It separated each character in the username with a comma and saved them as individual fields in the csv. I can't seem to fix that without inspecting what kind of data gets returned (it appears to be normal text in the console), but I did find a workaround. I'm now almost certain that this has something to do with how the data from Python is encoded and written to the csv. – sergej.k Oct 05 '18 at 13:07

1 Answer


Python 3 str values need to be encoded as bytes when written to disk. If no encoding is specified for the file, Python will use the platform default. In this case, the default encoding is unable to encode '\u30b9', and so raises a UnicodeEncodeError.
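
To see the failure in isolation, here is a minimal sketch that encodes the character from the traceback in the question, '\u30b9', by hand. It assumes the platform default is cp1252, which is common on Western-locale Windows installs but is only a guess about the asker's machine; the variable names are illustrative.

```python
# Assumption: cp1252 stands in for the Windows default encoding.
name = "\u30b9"  # the katakana character from the traceback in the question

try:
    name.encode("cp1252")
    failed = False
except UnicodeEncodeError as exc:
    # Same error class the csv write raised in the question.
    failed = True
    print("cp1252 cannot encode it:", exc.reason)

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)
```

The same thing happens implicitly whenever a str is written to a file opened in text mode: the file object encodes it with whatever encoding it was opened with.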

The solution is to specify the encoding as UTF-8 when opening the file:

with open('usernames.csv', 'w', encoding='utf-8') as f:
    p = csv.writer(f)
    ...

Since UTF-8 isn't your platform's default encoding, you'll also need to specify the encoding when reading the file back, whether in Python code or in applications like Excel.
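
As a rough sketch of the full round trip, the following writes a few names and reads them back with the same encoding. The usernames list is hypothetical and stands in for the scraped elem.text values, and the file goes into a temporary directory purely so the example doesn't overwrite anything. It also opens the file once and passes each name as a one-item list, since handing a bare string to the writer makes csv treat each character as a separate item (the behaviour described in the comments on the question).

```python
import csv
import os
import tempfile

# Hypothetical usernames standing in for the scraped elem.text values.
usernames = ["sergej.k", "\u30b9\u30ba\u30ad", "Zo\u00eb"]

# Temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "usernames.csv")

# Open the file once; newline="" is what the csv module docs recommend.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    for name in usernames:
        writer.writerow([name])  # a one-item list keeps the name in one cell

# Read it back, naming the same encoding explicitly.
with open(path, encoding="utf-8", newline="") as f:
    read_back = [row[0] for row in csv.reader(f)]
```

If the second open() omitted encoding="utf-8", Python would fall back to the platform default again and the non-ASCII names could come back garbled or fail to decode.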

Windows supports a modified version of UTF-8, named "utf-8-sig" in Python. This encoding inserts a three-byte byte order mark (BOM) at the start of a file to identify the file's encoding to Windows applications which might otherwise attempt to decode using an 8-bit encoding. If the file will be used exclusively on Windows machines then it may be worth using this encoding instead.

with open('usernames.csv', 'w', encoding='utf-8-sig') as f:
    p = csv.writer(f)
    ...
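
To make the difference concrete, this small sketch writes one line with "utf-8-sig" and then inspects the raw bytes. The marker at the start of the file is the UTF-8 byte order mark, which the standard library exposes as codecs.BOM_UTF8. The temporary path and variable names are illustrative only.

```python
import codecs
import os
import tempfile

# Temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "usernames.csv")

with open(path, "w", encoding="utf-8-sig", newline="") as f:
    f.write("\u30b9\n")

# Look at the raw bytes: the file begins with the three-byte BOM.
with open(path, "rb") as f:
    raw = f.read()
starts_with_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding with "utf-8-sig" strips the marker again.
with open(path, encoding="utf-8-sig", newline="") as f:
    text = f.read()
```

Reading such a file with plain "utf-8" would leave the marker in place as a leading '\ufeff' character, so it is safest to read with the same encoding that was used to write.
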
snakecharmerb