
What I want to do is to take the following website: https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html

and pick the year of execution, follow the Last Statement link, and retrieve the statement. Perhaps I would create two dictionaries, both keyed by execution number.

Afterwards, I would classify the statements by length, and also "flag" the cases where the statement was refused or simply not given.

Finally, everything would be compiled into a SQLite database, and I would display a graph showing how many statements, clustered by type, were given each year.

Beautiful Soup seems to be the path to follow; I'm already having trouble with just printing the year of execution. Of course, I'm not ultimately interested in printing the years of execution, but it seems like a good way to check whether my code is at least locating the tags I want.

tags = soup('td')  # shortcut for soup.find_all('td')
for tag in tags:
    print(tag.get('href', None))

Why does the previous code only print None?

Thanks in advance.

Vin
  • Try the Selenium library, it's more powerful. It lets you interact with the webpage (e.g. click links, enter values, wait for elements to load, etc.). – jun Nov 17 '20 at 04:38

1 Answer

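To answer the direct question first: soup('td') is shorthand for soup.find_all('td'), and <td> cells don't carry an href attribute themselves, so tag.get('href', None) returns None for every cell. The links sit on the <a> anchors inside the cells, so you'd select those instead:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html").text
soup = BeautifulSoup(page, "html.parser")

# the href lives on the <a> anchor inside each cell, not on the <td> itself
for anchor in soup.select("td a"):
    print(anchor.get("href"))

That said, for this particular page there's a simpler route.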

Use pandas to get and manipulate the table. The links are static; by that I mean they can easily be recreated from the offender's first and last name.

Then you can use requests and BeautifulSoup to scrape each offender's last statement; many of them are quite moving.

Here's how:

import requests
import pandas as pd

def clean(first_and_last_name: list) -> str:
    # the name is space-stripped and lowercased first, so the suffixes
    # must be matched in that normalized form (",jr.", not ", Jr.")
    name = "".join(first_and_last_name).replace(" ", "").lower()
    return name.replace(",jr.", "").replace(",sr.", "").replace("'", "")


base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

# read_html() returns a list of DataFrames, one per table found on the page
dfs = pd.read_html(response.text, flavor="bs4")
df = pd.concat(dfs)
df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

df.to_csv("offenders.csv", index=False)

This gets you:

(screenshot of the resulting offenders.csv table, with the generated Offender Information and Last Statement URL columns)

EDIT:

I actually went ahead and added the code that fetches all offenders' last statements.

import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")
# the last statement sits in the sixth <p> of the page's right-hand content block
statement_xpath = '//*[@id="content_right"]/p[6]/text()'


def clean(first_and_last_name: list) -> str:
    # the name is space-stripped and lowercased first, so the suffixes
    # must be matched in that normalized form (",jr.", not ", Jr.")
    name = "".join(first_and_last_name).replace(" ", "").lower()
    return name.replace(",jr.", "").replace(",sr.", "").replace("'", "")


def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    # xpath() returns a list; fall back to an empty string if nothing matched
    text = next(iter(statement), "")
    return " ".join(text.split())


dfs = pd.read_html(response.text, flavor="bs4")
df = pd.concat(dfs)

df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

offender_data = list(
    zip(
        df["First Name"],
        df["Last Name"],
        df["Last Statement URL"],
    )
)

statements = []
for item in offender_data:
    *names, url = item
    print(f"Fetching statement for {' '.join(names)}...")
    statements.append(get_last_statement(statement_url=url))
    time.sleep(random.randint(1, 4))  # be polite to the server

df["Last Statement"] = statements
df.to_csv("offenders_data.csv", index=False)

This will take a couple of minutes, because the code "sleeps" for anywhere from 1 to 4 seconds between requests so the server doesn't get abused.

Once it's done, you'll end up with a .csv file with all the offenders' data and their statements, where one was given.
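Since your end goal is a SQLite database and a per-year graph of statement types, here's a minimal sketch of that follow-up step. It assumes the offenders_data.csv produced above and that the table's date column is named Date; the year slicing follows your "last four characters of the date" idea, and the 200-character cutoff and the statements table name are arbitrary choices on my part:

import sqlite3

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("offenders_data.csv")

# year of execution: the last four characters of the date
df["Year"] = df["Date"].astype(str).str[-4:].astype(int)


def classify(statement) -> int:
    # 0 = not given, 1 = short, 2 = long; the 200-character cutoff is arbitrary
    if not isinstance(statement, str) or not statement.strip():
        return 0
    return 1 if len(statement) < 200 else 2


df["Statement Type"] = df["Last Statement"].apply(classify)

# "statements" is a placeholder table name; to_sql() creates the table for you
conn = sqlite3.connect("offenders.db")
df.to_sql("statements", conn, if_exists="replace", index=False)
conn.close()

# count statements per year, split by type, and draw a stacked bar chart
counts = df.groupby(["Year", "Statement Type"]).size().unstack(fill_value=0)
counts.rename(columns={0: "Not given", 1: "Short", 2: "Long"}, inplace=True)
counts.plot(kind="bar", stacked=True)
plt.xlabel("Year of execution")
plt.ylabel("Number of statements")
plt.tight_layout()
plt.show()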

baduker
  • Thanks baduker, for taking the time to give such a detailed answer. Still, what I'm looking for is to use the execution number as a key for a SQL DB, and add the YEAR of execution (I'm guessing by reading the last four characters of the date), the entire last statement (which is why I need to be able to click on that link), and a value for that statement (0 for not given, 1 for short, 2 for long). Also, would there be any problem with installing miniconda instead of full Anaconda, in order to install pandas? – Javier Lopez Nov 17 '20 at 12:53
  • That I cannot help you with. You have to make those decisions related to your local dev environment. As for the DB, well, tabular data should be rather easy to parse the way you want. – baduker Nov 17 '20 at 14:29
  • I understand. Then again, I attempted to install Anaconda (and pandas by default), but when I ran a very simple script to "print" the table in the command line, I got this message: `import pandas as pd` → `ModuleNotFoundError: No module named 'pandas'`. Any suggestions? – Javier Lopez Nov 17 '20 at 22:05
  • Well, basically it means that you don't have `pandas` installed in whatever environment you're running the code in. – baduker Nov 18 '20 at 08:45
  • It seems that I managed to install it, @baduker. However, I'm having a new glitch. The command line seems to run just fine, but then a traceback takes place and I'm asked to install html5lib, which I did: Requirement already satisfied: html5lib in c:\users... (1.1); Requirement already satisfied: webencodings in c:\users... \python39\lib\site-packages (from html5lib) (0.5.1); Requirement already satisfied: six>=1.9 in c:\users... \python39\lib\site-packages (from html5lib) (1.15.0). But when I re-run the code, the last line says: ImportError: html5lib not found, please install it. – Javier Lopez Nov 18 '20 at 19:31
  • Comments are not the best place to debug code. Open a new question and we'll see what can be done. – baduker Nov 18 '20 at 20:51
  • AMAZING, @baduker. It created the csv just fine. Now I'll figure out how to convert it into a SQL table, drop the columns I'm not interested in, erase the day and month from the dates, and add a column to classify the statements. Afterwards, I'll just have to create a stratified yearly histogram. THANKS A LOT... By the way, I found the solution to the html5lib issue: I uninstalled all Python-related programs, reinstalled the latest version, and the traceback changed into "AttributeError: module 'html5lib.treebuilders' has no attribute '_base'", which was easy to fix. – Javier Lopez Nov 21 '20 at 02:25