
I'm making some progress with web scraping, but I still need help with a few operations:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

for tr in soup.select('.col-md-4 tbody tr'):

Within the class col-md-4 I know there are 3 tables. I want to generate a CSV where each row has three values: first name, last name, and, as the last value, the header name of the table the row came from:

first name, last name, header table

Any help would be appreciated.

zalexhp
  • See if this help, https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe – sushanth Jun 01 '20 at 11:46
  • Thanks for the link but this is using pandas and I would like to use beautifulsoup. – zalexhp Jun 01 '20 at 12:16

3 Answers


This is what I have done on my own:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Name the CSV after the last segment of the URL
filename = url.rsplit('/', 1)[1] + '.csv'

tables = soup.select('.col-md-4 table')
rows = []

# Each selected element is a whole table; split its text into cells
for table in tables:
    cells = table.get_text(strip=True, separator='|').split('|')
    rows.append(cells)

# Build the DataFrame and write the CSV once, after the loop,
# instead of rewriting the file on every iteration
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)
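To see what `get_text` with a separator produces for one table, here is a minimal, self-contained sketch. The HTML snippet is hypothetical, shaped the way the tables on that page appear to be (the header name and player names are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical table mimicking the assumed page structure:
# one header cell, then one "Surname, Firstname" cell per row.
html = """
<table>
  <thead><tr><th>Porters</th></tr></thead>
  <tbody>
    <tr><td>Garcia, Juan</td></tr>
    <tr><td>Lopez, Marc</td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html, 'html.parser')

# strip=True drops whitespace-only strings, so only real cell
# text ends up in the list
cells = table.get_text(strip=True, separator='|').split('|')
print(cells)  # ['Porters', 'Garcia, Juan', 'Lopez, Marc']
```

Note that the header lands in the same flat list as the player names, which is why the resulting CSV mixes headers and rows.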

Thanks,

zalexhp

This might work:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []

for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)

result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])
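Here is a self-contained illustration of the `stripped_strings` approach above, using a hypothetical table snippet in place of the live page (header and names are invented):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table shaped like the ones assumed on the page
html = """<table>
  <thead><tr><th>Porters</th></tr></thead>
  <tbody>
    <tr><td>Garcia, Juan</td></tr>
    <tr><td>Lopez, Marc</td></tr>
  </tbody>
</table>"""
table = BeautifulSoup(html, 'html.parser')

# The first stripped string is the header; the rest are
# "Surname, Firstname" cells
cleaned = list(table.stripped_strings)
header, names = cleaned[0], cleaned[1:]
data = [name.split(', ') + [header] for name in names]

result = pd.DataFrame.from_records(data, columns=['surname', 'name', 'table'])
print(result)
```

Splitting on `', '` assumes every cell follows the "Surname, Firstname" pattern; a cell without a comma would produce a two-element row and break the three-column frame.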
Milan Cermak
  • thanks for the help. I have pasted the code on visual studio but I have an error SyntaxError: 'return' outside function – zalexhp Jun 01 '20 at 14:02
  • I've edited the answer, you'll have the desired result in the `result` variable. – Milan Cermak Jun 01 '20 at 14:34
  • Hi Milan I appreciate your support, I have tried the code again and I still get an issue. Exception has occurred: TypeError 'generator' object is not subscriptable File "plantillasfcf.py", line 30, in header, names = cleaned[0], cleaned[1:] – zalexhp Jun 01 '20 at 19:59
  • Sorry. I've edited the answer - the output of `stripped_strings` needs to be wrapped in a `list`. Try again? – Milan Cermak Jun 01 '20 at 20:16

You need to first iterate through each table you want to scrape, then for each table, get its header and rows of data. For each row of data, you want to parse out the First Name and Last Name (along with the header of the table).

Here's a verbose working example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):

    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]

    t = []  # This list will contain the rows of data for this table

    # Iterate through rows in this table
    for row in rows:

        # Split by comma (last_name, first_name)
        split = row.split(",")

        last_name = split[0].strip()
        first_name = split[1].strip()

        # Create the row of data
        t.append([first_name, last_name, header])

    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])

    # Append to list of DataFrames
    out.append(df)

# Write to CSVs...
out[0].to_csv("first_table.csv", index=False)  # etc...
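If you'd rather have one CSV covering all three tables instead of one file per table, the per-table DataFrames can be concatenated first. A sketch, using hypothetical stand-in DataFrames in place of the scraped ones:

```python
import pandas as pd

# Stand-ins for the per-table DataFrames collected in `out` above
# (rows and table names are invented for illustration)
out = [
    pd.DataFrame([["Juan", "Garcia", "Porters"]],
                 columns=["first_name", "last_name", "table_name"]),
    pd.DataFrame([["Marc", "Lopez", "Defenses"]],
                 columns=["first_name", "last_name", "table_name"]),
]

# One DataFrame, one CSV; ignore_index gives a fresh 0..n-1 index
combined = pd.concat(out, ignore_index=True)
combined.to_csv("all_tables.csv", index=False)
print(combined)
```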

Whenever you're web scraping, I highly recommend using strip() on all of the text you parse to make sure you don't have superfluous spaces in your data.

I hope this helps!

twhitcomb