I need some guidance, please. I'm using the following code:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")
i = 0
schools = []

for school in reqSoup:
    x = reqSoup.find_all("a", {"class" : "school-name"})
    while i < len(x):
        for name in x:
            y = x[i].get_text()
            i += 1
            schools.append(y)

with open('usnwr_schools.csv', 'wb') as f:
    writer = csv.writer(f)
        for y in schools:
        writer.writerow([y])

My problem is that the em-dashes are showing up as utf-8 in the resulting CSV file. I've tried several different things to fix it, but nothing seems to work (including attempting to use regex to get rid of it, as well as trying the .translate method that I found in a StackOverflow question from a few years ago).

What am I missing? I'd like the csv results to just include the text, minus the dashes.

I'm using Python 3.5, and am fairly new to Python.

kknight
  • How do you *expect* the em-dashes to show up? Unicode is an abstract enumeration of characters; a file is a sequence of bytes. UTF-8 is the default method for encoding a Unicode character as one or more bytes. If you want to remove the em-dashes or replace them with something else, you need to do it yourself; that isn't the encoder's job. – chepner Sep 20 '16 at 17:44
  • **All** your data is showing up as UTF-8 (apparently that's the preferred encoding for your locale, you didn't set an `encoding` when you opened the file). What did you want to show up instead? The rest of your text is still UTF-8 (even if the text could also be encoded in, say ASCII). – Martijn Pieters Sep 20 '16 at 17:46
  • Note that the `csv` module is just writing data in a specific format. You pass the data to the writer you want written. This means that this is not a `csv` module problem; it appears you want to pass in different data instead, so perhaps your question should be how you could limit the data to only contain ASCII characters (presumably that's what you wanted, just a-z, A-Z, 0-9 and basic punctuation). – Martijn Pieters Sep 20 '16 at 17:48
  • Yes: that's precisely what I want. My apologies if my question was confusing. I would like the final CSV data to just include text and basic punctuation, and haven't found any guidance on how to do this. – kknight Sep 20 '16 at 17:54
  • You would have to replace everything you do not want yourself (my answer) or just use a whitelist of allowed codepoints and replace the others with an empty string. – janbrohl Sep 20 '16 at 17:56
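To make the point in the comments above concrete, here is a minimal sketch. The sample string is one of the school names from the page, and nothing in it is specific to the `csv` module. It shows that encoding only turns characters into bytes and never removes any of them:

```python
# The em dash is one Unicode character (U+2014); the page also puts a
# zero-width space (U+200B) right after it.
name = "University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor"

# Encoding maps every character to one or more bytes; nothing is dropped.
encoded = name.encode("utf-8")
print(encoded)

# Decoding with the same codec gives back the identical text.
round_trip = encoded.decode("utf-8")
print(round_trip == name)

# A stricter codec does not strip the dash either; it refuses to encode it.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

So if the dash should not end up in the file, it has to be replaced or removed in the string before it is written.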

2 Answers

To remove the dashes, try y.replace("—", "-").replace("–", "-") (the first call turns an em dash into a hyphen, the second does the same for an en dash).

If you only want ASCII-codepoints you can remove everything else with

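import string
# string.printable is printable ASCII and already includes string.whitespace
whitelist = set(string.printable)
def clean(s):
    # keep only the characters found in the whitelist
    return "".join(c for c in s if c in whitelist)

(This gives mostly reasonable results, and only for pure-English text.)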
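One thing to watch with a pure whitelist: it deletes the dash outright, so the words on either side run together. A small sketch, assuming you want a hyphen where the dash was (the helper name `to_ascii` and the constant `ALLOWED` are made up for this example):

```python
import string

ALLOWED = set(string.printable)  # printable ASCII, whitespace included


def clean(s):
    """Drop every character that is not printable ASCII."""
    return "".join(c for c in s if c in ALLOWED)


def to_ascii(s):
    """Swap em/en dashes for hyphens first, then drop whatever is left."""
    s = s.replace("\N{EM DASH}", "-").replace("\N{EN DASH}", "-")
    return clean(s)


name = "University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor"
print(clean(name))     # University of MichiganAnn Arbor
print(to_ascii(name))  # University of Michigan-Ann Arbor
```

Replacing first and filtering second keeps the names readable; the filter then only has to deal with invisible characters such as the zero-width space.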

Btw try using

open('usnwr_schools.csv', 'w', newline='', encoding='utf-8') # or whatever encoding you like

because in Python 3 csv.writer expects a file opened in text mode, not in binary mode as it did in Python 2 (you opened yours in binary mode, "wb").
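A minimal sketch of that difference, using a temporary directory and two made-up rows instead of the scraped data. On Python 3 the binary-mode attempt should fail with a TypeError as soon as a row is written:

```python
import csv
import os
import tempfile

rows = [["University of Michigan-Ann Arbor"], ["Virginia Tech"]]
path = os.path.join(tempfile.mkdtemp(), "schools.csv")

# Text mode with newline='' is what the csv module expects in Python 3.
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Binary mode is rejected because csv.writer produces str, not bytes.
try:
    with open(path + ".bin", "wb") as f:
        csv.writer(f).writerow(rows[0])
    binary_mode_ok = True
except TypeError:
    binary_mode_ok = False

# Read the text-mode file back to confirm the rows survived unchanged.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(binary_mode_ok, read_back)
```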

janbrohl
  • Thank you so much: I like the whitelist approach (I tried using replace, had no luck). Yet, I'm still getting the same results, which look like this: b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor' – kknight Sep 20 '16 at 18:24

Learn to embrace Unicode...the world isn't ASCII anymore.

Assuming you are on Windows and viewing the .CSV with Excel or Notepad, use the following line on Python 3. With only this change (and fixing the indentation of your post), you will even be able to view the non-ASCII characters correctly. Notepad and Excel like a UTF-8 BOM signature at the start of the file, which utf-8-sig provides.

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:

If you read the file in another Python script, make sure to open it with the following. The value you posted, b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor', is what the data looks like when it is read in binary mode ('rb').

with open('usnwr_schools.csv', encoding='utf-8-sig') as f:

If on Linux, you can use utf8 instead of utf-8-sig.
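For the curious, a short sketch of what utf-8-sig actually adds. It works on an in-memory string (one of the school names, minus the zero-width space), so no file is needed:

```python
text = "Purdue University\N{EM DASH}West Lafayette"

with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

# utf-8-sig writes the three-byte signature EF BB BF before the data.
print(with_bom[:3])
print(with_bom[3:] == without_bom)

# Decoding with utf-8-sig strips the signature again ...
decoded_sig = with_bom.decode("utf-8-sig")
# ... while plain utf-8 leaves it in place as the character U+FEFF.
decoded_plain = with_bom.decode("utf-8")

print(decoded_sig == text)
print(decoded_plain[0] == "\ufeff")
```

That leftover U+FEFF is why a file written with utf-8-sig should also be read with utf-8-sig.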

As an aside, you can replace your loops with the following. A single find_all call already returns every matching link, so there is no need to loop over reqSoup as well (doing so writes the whole list once per top-level node):

with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        writer.writerow([item.get_text()])

Reading it back:

with open('usnwr_schools.csv',encoding='utf-8-sig') as f:
    print(f.read())

Output:

Massachusetts Institute of Technology
Stanford University
University of California—​Berkeley
California Institute of Technology
Carnegie Mellon University
University of Michigan—​Ann Arbor
Georgia Institute of Technology
University of Illinois—​Urbana-​Champaign
Purdue University—​West Lafayette
University of Texas—​Austin (Cockrell)
Texas A&M; University—​College Station (Look)
Cornell University
University of Southern California (Viterbi)
Columbia University (Fu Foundation)
University of California—​Los Angeles (Samueli)
University of California—​San Diego (Jacobs)
Princeton University
Northwestern University (McCormick)
University of Pennsylvania
Johns Hopkins University (Whiting)
Virginia Tech
University of California—​Santa Barbara
Harvard University
University of Maryland—​College Park (Clark)
University of Washington

If you still want ASCII-only output, this will do it:

import requests
import bs4
import csv

results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

# map both dashes to a hyphen and delete the zero-width space
replacements = {ord('\N{EN DASH}'):'-',
                ord('\N{EM DASH}'):'-',
                ord('\N{ZERO WIDTH SPACE}'):None}

reqSoup = bs4.BeautifulSoup(results.text, "html.parser")

with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    # one find_all call returns every matching link; no outer loop needed
    for item in reqSoup.find_all("a", {"class" : "school-name"}):
        y = item.get_text()
        writer.writerow([y.translate(replacements)])

with open('usnwr_schools.csv',encoding='ascii') as f:
    print(f.read())
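
To try the translation table without hitting the site, here is a small offline sketch. The two sample names are copied from the output above; building the table with str.maketrans is just another way to spell the ord(...) dictionary, not a different technique:

```python
# str.maketrans accepts single-character keys, so ord() is not needed.
replacements = str.maketrans({
    "\N{EN DASH}": "-",
    "\N{EM DASH}": "-",
    "\N{ZERO WIDTH SPACE}": None,  # None deletes the character
})

names = [
    "University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor",
    "University of Illinois\N{EM DASH}\N{ZERO WIDTH SPACE}Urbana-\N{ZERO WIDTH SPACE}Champaign",
]

cleaned = [n.translate(replacements) for n in names]

# encode('ascii') raises UnicodeEncodeError if anything non-ASCII is left,
# which is the same check the file opened with encoding='ascii' performs.
for n in cleaned:
    n.encode("ascii")

print(cleaned)
```

If the page ever contains other non-ASCII characters (accented letters, curly quotes), the write with encoding='ascii' will raise rather than silently drop them, so the table may need more entries.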
Mark Tolonen