So, I am doing a course on Python 3, and in the scraping section we have an assignment to scrape the http://quotes.toscrape.com/ website and get the text, author, and link to the author's bio for all of the quotes, including the ones on the "next" pages. I have done this, but every time I move to a new page I get an extra row containing the column headers, which I had intended to write only once.

import requests
from bs4 import BeautifulSoup
import csv
from time import sleep

base_url = "http://quotes.toscrape.com"
url = "/page/1"

f = open("scraping_project_final.csv", "w")
f.truncate()
f.close()

while url:

    with open("scraping_project_final.csv", "a") as file:
        csv_writer = csv.writer(file)
        csv_writer.writerow(["text", "name", "url"])

        response = requests.get(f"{base_url}{url}")
        print(f"Scraping {base_url}{url}")
        soup = BeautifulSoup(response.text, "html.parser")
        quotes = soup.find_all(class_="quote")

        for quote in quotes:
            txt = quote.find(class_="text").get_text()
            author = quote.find(class_="author").get_text()
            link = quote.find("a")["href"]
            csv_writer.writerow([txt, author, link])

        next_page = soup.find(class_="next")
        url = next_page.find("a")["href"] if next_page else None
    # sleep(2)

So, the issue that I have is that the initial writerow actually writes a header row on each iteration of the loop. How do I avoid this? I would like to continue with this approach and not use DictWriter if possible. I have added an image below showing the CSV output. You can see that after ten rows, there is a row with just: text, name, url.

CSV Output

Matija

  • Set a flag before your while loop, `first_page = True`, then wrap `csv_writer.writerow(["text", "name", "url"])` in an `if first_page:` block and set `first_page = False` afterwards. – abdusco Jun 27 '20 at 06:19
  • Thanks, post it as an answer and I shall accept it. – Matija Jun 27 '20 at 06:26

3 Answers

Open the file only once, write the headers once, then loop on the pages. For example:

with open('scraping_project_final.csv', 'w', encoding='utf-8-sig', newline='') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(['text', 'name', 'url'])

    while url:

        response = requests.get(f'{base_url}{url}')
        ...

There is no need to re-open the file for each page, and no need to truncate it first.

Note: `utf-8-sig` is the best encoding for files that will be opened in Excel, since it handles Unicode characters, and `newline=''` is the documented way to open files passed to `csv.writer`.
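The restructured loop can be sketched end to end. In this sketch the page fetching is replaced by a hypothetical `pages` list so it runs offline; in the real script those rows would come from requests and BeautifulSoup, but the file handling is exactly the open-once pattern above.

```python
import csv

# Placeholder data standing in for the scraped quotes, one inner list per page.
pages = [
    [["Quote one", "Author A", "/author/a"]],
    [["Quote two", "Author B", "/author/b"]],
]

# Open the file once, write the header once, then loop over the pages.
with open("scraping_project_final.csv", "w", encoding="utf-8-sig", newline="") as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(["text", "name", "url"])
    for rows in pages:  # one iteration per scraped page
        for row in rows:
            csv_writer.writerow(row)
```

The resulting file contains the header exactly once, followed by all rows from every page.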

Mark Tolonen

Set a flag before your while loop and write the headers only if you haven't already, then flip the flag:

# ...
first_page = True
while url:
    with open("scraping_project_final.csv", "a") as file:
        csv_writer = csv.writer(file)
        if first_page:
            csv_writer.writerow(["text", "name", "url"])
            first_page = False
        # ...
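A self-contained sketch of this flag approach, with the page fetching replaced by a hypothetical `pages` list so it runs offline; the file name and row values are placeholders:

```python
import csv

# Placeholder data standing in for the scraped quotes, one inner list per page.
pages = [
    [["Quote one", "Author A", "/author/a"]],
    [["Quote two", "Author B", "/author/b"]],
]

open("flag_demo.csv", "w").close()  # start with an empty file, as in the question

first_page = True  # becomes False once the header has been written
for rows in pages:  # one iteration per page, mirroring the while-url loop
    with open("flag_demo.csv", "a", newline="") as file:
        csv_writer = csv.writer(file)
        if first_page:
            csv_writer.writerow(["text", "name", "url"])
            first_page = False
        for row in rows:
            csv_writer.writerow(row)
```

Even though the file is re-opened in append mode on every page, the flag ensures the header row appears only once.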
abdusco

Investigate using `csv.DictWriter` to write your CSV, and do not write the headers manually: https://docs.python.org/3/library/csv.html (scroll down to DictWriter).

This works because `DictWriter` relies on the header to map your values to columns; all you need to do is tell it what your fieldnames are and call `writeheader()` once.

Obviously, if you are just looping, place the write-header line outside of the loop so it only runs once, as suggested by the people above. That should be the easiest way to solve your problem.
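A minimal sketch of the `DictWriter` approach; the file name and row values here are placeholders, not the scraped data:

```python
import csv

# Placeholder rows standing in for the scraped quotes.
rows = [
    {"text": "Quote one", "name": "Author A", "url": "/author/a"},
    {"text": "Quote two", "name": "Author B", "url": "/author/b"},
]

with open("dictwriter_demo.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "name", "url"])
    writer.writeheader()  # header written exactly once
    for row in rows:
        writer.writerow(row)  # fieldnames decide the column order
```

Because `writeheader()` is a separate call, it naturally sits outside any loop, so the repeated-header problem cannot occur.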