
I am trying to scrape a table that spans multiple pages and export it to a CSV file. Only one line of data seems to get exported, and it is jumbled up.

I have looked on the web and tried many iterations, and I am very frustrated now. As you can tell from the code, I am a novice at coding!

import bs4 as bs
import urllib.request
import pandas as pd
import csv


max_page_num = 14
max_page_dig = 1 # number of digits in the page number


with open('result.csv',"w") as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):  
    page_num = (max_page_dig - len(str(i))) * "0" +str(i) #gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)

    url = urllib.request.urlopen(source).read()    

    soup = bs.BeautifulSoup(url,'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')

    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        #final = row.strip("\n")
        #final = row.replace("\n","")
        with open('result.csv', 'a') as f:
            f.write(row)

It seems that when I write to the CSV it overwrites the previous rows. It also pastes everything on one line, and the player's name is concatenated with the school name. Thanks for any and all help.

Herman L

1 Answer


I think the problem is with your inner for loop. Try rewriting it as

with open('result.csv', 'a', newline='') as f:
    writer = csv.writer(f)  # f.write() can't take a list; csv.writer formats each row
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        writer.writerow(row)

and see if it works.
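Even with the file opened once, `i.text` keeps the newlines and surrounding whitespace inside each cell, which is what jumbles the name together with the school. A minimal sketch of the row-extraction step using `get_text(strip=True)` and the stdlib `csv` module — the two-row table below is illustrative stand-in data, not the real page, and `html.parser` is used here instead of `lxml` just to avoid the extra dependency:

```python
import csv
import bs4 as bs

# Illustrative stand-in for one page of the real table (not the actual site data)
html = """
<table>
  <tr><td>
      Player A
      </td><td>NY</td><td>Duke</td></tr>
  <tr><td>Player B</td><td>CA</td><td>UCLA</td></tr>
</table>
"""

soup = bs.BeautifulSoup(html, 'html.parser')
table_rows = soup.find('table').find_all('tr')

with open('result.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for tr in table_rows:
        # get_text(strip=True) drops the embedded newlines that jumble the cells
        row = [td.get_text(strip=True) for td in tr.find_all('td')]
        writer.writerow(row)
```

Because `csv.writer` handles the commas and line endings, each `<tr>` lands on its own line of the file with clean cell values.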

More generally, this can probably be done more simply by using pandas. Try changing your for loop to:

for i in range(0, max_page_num):  
   page_num = ...
   source = ....
   df = pd.read_html(source)
   df.to_csv('results.csv', header=False, index=False, mode='a') #'a' should append each table to the csv file, instead of overwriting it.
Jack Fleeting
  • Hi Jack (fellow NY'er). Thanks so much for the effort. Using your first correction method it does indeed download them all, but unfortunately puts them all on one line with \n's. With the second method I got the error 'AttributeError: 'list' object has no attribute 'to_csv''. – Herman L Oct 29 '19 at 00:49
    @HermanL - Small World! Anyway, `df = pd.read_html(source)` may result in more than one table, depending on the page - that's the 'list' the error is referring to. To see how many tables you snagged, add `print(len(df))`. Try writing to csv like `df[0].to_csv(...)`. – Jack Fleeting Oct 29 '19 at 01:37
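Putting the comment exchange together: `pd.read_html` returns a *list* of DataFrames, one per `<table>` it finds, so an element has to be indexed out before calling `.to_csv`. A minimal sketch on an inline table — the HTML below is illustrative, not the real page, and `io.StringIO` wraps it because `read_html` expects a URL, file, or file-like object:

```python
import io
import pandas as pd

# Illustrative stand-in for one results page (not the actual site data)
html = """
<table>
  <tr><th>Name</th><th>State</th></tr>
  <tr><td>Player A</td><td>NY</td></tr>
  <tr><td>Player B</td><td>CA</td></tr>
</table>
"""

tables = pd.read_html(io.StringIO(html))  # returns a LIST of DataFrames
print(len(tables))                        # shows how many tables were snagged

df = tables[0]                            # index out the one table you want
df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' appends per page
```

In the original loop, the same `tables = pd.read_html(source)` / `tables[0].to_csv(..., mode='a')` pattern would append each page's table to the file instead of overwriting it.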