3

While learning Rust, I am trying to build a simple web scraper. My aim is to scrape https://news.ycombinator.com/ and get the title, hyperlink, votes and username of each story. I am using the external libraries reqwest and scraper for this, and I wrote a program that scrapes each story's link from that site.

Cargo.toml

[package]
name = "stackoverflow_scraper"
version = "0.1.0"
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
scraper = "0.12.0"
reqwest = "0.11.2"
tokio = { version = "1", features = ["full"] }
futures = "0.3.13"

src/main.rs

use scraper::{Html, Selector};
use reqwest;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://news.ycombinator.com/";
    let html = reqwest::get(url).await?.text().await?;
    let fragment = Html::parse_fragment(html.as_str());
    let selector = Selector::parse("a.storylink").unwrap();

    for element in fragment.select(&selector) {
        println!("{:?}",element.value().attr("href").unwrap());
        // todo println!("Title");
        // todo println!("Votes");
        // todo println!("User");
    }

    Ok(())
}

How do I get the corresponding title, votes and username for each link?

Jason
Eka

2 Answers

4

The items on the front page are stored in a table with class .itemlist.

As each item is made up of three consecutive <tr> elements, you'll have to iterate over them in chunks of three. I opted to first gather all the nodes.

The first row contains the:

  • Title
  • Domain

The second row contains the:

  • Points
  • Author
  • Post age

The third row is a spacer that should be ignored.

Note:

  • Posts created within the last hour seemingly do not display any points, so this needs to be handled accordingly.
  • Advertisements do not contain a username.
  • The last two table rows, tr.morespace and the tr containing a.morelink, should be ignored. This is why I opted to first .collect() the nodes and then use .chunks_exact().
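The chunking described in these notes can be tried in isolation. The following is a minimal sketch using only the standard library; group_rows is a hypothetical helper and the row labels are made-up stand-ins for the collected tr nodes, not real page data. It illustrates that chunks_exact(3) yields only complete groups of three and leaves any trailing rows out.

```rust
/// Groups rows into (title row, subtext row) pairs. The spacer row of each
/// chunk is dropped, as are trailing rows that do not fill a chunk of three.
fn group_rows<'a>(rows: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    rows.chunks_exact(3)
        .map(|chunk| (chunk[0], chunk[1]))
        .collect()
}

fn main() {
    // Hypothetical stand-ins: two items of three rows each, followed by
    // the two trailing rows that should be ignored.
    let rows = [
        "title A", "subtext A", "spacer",
        "title B", "subtext B", "spacer",
        "morespace", "morelink",
    ];

    println!("items: {:?}", group_rows(&rows));
    // The rows that did not fit into a complete chunk.
    println!("ignored: {:?}", rows.chunks_exact(3).remainder());
}
```

Plain .chunks(3) would instead yield a final, shorter chunk containing the two trailing rows, and indexing into it as if it had three entries would not be safe.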

use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "https://news.ycombinator.com/";
    let html = reqwest::get(url).await?.text().await?;
    let fragment = Html::parse_fragment(html.as_str());

    // Every row of the item table; three consecutive rows form one item.
    let selector_items = Selector::parse(".itemlist tr").unwrap();

    let selector_title = Selector::parse("a.storylink").unwrap();
    let selector_score = Selector::parse("span.score").unwrap();
    let selector_user = Selector::parse("a.hnuser").unwrap();

    // Collect first so that the rows can be sliced into chunks.
    let nodes = fragment.select(&selector_items).collect::<Vec<_>>();

    let list = nodes
        // Trailing rows that do not fill a chunk of three are skipped.
        .chunks_exact(3)
        .map(|rows| {
            // First row: title and link.
            let title_elem = rows[0].select(&selector_title).next().unwrap();
            let title_text = title_elem.text().next().unwrap();
            let title_href = title_elem.value().attr("href").unwrap();

            // Second row: the score may be missing, so fall back to a default.
            let score_text = rows[1]
                .select(&selector_score)
                .next()
                .and_then(|n| n.text().next())
                .unwrap_or("0 points");

            // Second row: the user may be missing as well.
            let user_text = rows[1]
                .select(&selector_user)
                .next()
                .and_then(|n| n.text().next())
                .unwrap_or("Unknown user");

            [title_text, title_href, score_text, user_text]
        })
        .collect::<Vec<_>>();

    println!("links: {:#?}", list);

    Ok(())
}

That should net you the following list:

[
    [
        "Docker for Mac M1 RC",
        "https://docs.docker.com/docker-for-mac/apple-m1/",
        "327 points",
        "mikkelam",
    ],
    [
        "A Mind Is Born – A 256 byte demo for the Commodore 64 (2017)",
        "https://linusakesson.net/scene/a-mind-is-born/",
        "226 points",
        "matthewsinclair",
    ],
    [
        "Show HN: Video Game in a Font",
        "https://www.coderelay.io/fontemon.html",
        "416 points",
        "ghub-mmulet",
    ],
    ...
]
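The scores in this list are still strings such as "327 points". If numeric values are needed, they can be parsed in a further step. The sketch below uses only the standard library; the helper name parse_points is hypothetical, and it assumes the score text starts with the number, followed by whitespace.

```rust
/// Parses score text such as "327 points" into a number.
/// Returns None when the text does not start with a valid integer.
fn parse_points(score_text: &str) -> Option<u32> {
    score_text.split_whitespace().next()?.parse().ok()
}

fn main() {
    // Sample inputs; the last one stands in for unexpected text.
    for text in &["327 points", "1 point", "0 points", "no score"] {
        println!("{:?} -> {:?}", text, parse_points(text));
    }
}
```

Returning an Option keeps the "no score" case explicit, so the caller can decide whether to substitute a default such as 0.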

Alternatively, there is an API available that one can use:

Jason
3

This is more of a selectors question, and the answer depends on the HTML of the site being scraped. In this case, it's easy to get the title, but harder to get the points and user. Since the selector you're using selects the link, which contains both the href and the title, you can get the title using the .text() method:

let title = element.text().collect::<Vec<_>>();

where element is the same element used for the href.
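Collecting into a Vec keeps the text pieces separate. If the title is wanted as a single string, an iterator of string slices can be collected into a String instead. The sketch below is a stand-in using only the standard library: title_from_parts is a hypothetical helper, and the array of pieces merely imitates what such an iterator might yield.

```rust
/// Joins the pieces yielded by an iterator of string slices into one
/// String and trims surrounding whitespace.
fn title_from_parts<'a>(parts: impl Iterator<Item = &'a str>) -> String {
    parts.collect::<String>().trim().to_string()
}

fn main() {
    // Made-up pieces standing in for the text nodes of a link element.
    let parts = ["Show HN: ", "Video Game in a Font"];
    let title = title_from_parts(parts.iter().copied());
    println!("{}", title);
}
```

Assuming element.text() yields &str items, it could be passed directly in place of parts.iter().copied().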

To get the other values, however, it is easier to change the first selector and work from there. On news.ycombinator.com, the title and link of a news item are in a tr element with the .athing class, while the votes and user are in the next tr element, which doesn't have a class (making it harder to select). It is therefore best to select "table.itemlist tr.athing" and iterate over those results. From each element found, you can subselect the "a.storylink" element, then separately get the following tr element and subselect the points and user elements from it:

// Additional import needed at the top of the file
use scraper::ElementRef;

let select_item = Selector::parse("table.itemlist tr.athing").unwrap();
let select_link = Selector::parse("a.storylink").unwrap();
let select_score = Selector::parse("span.score").unwrap();

for element in fragment.select(&select_item) {
    // Get the link element that contains the href and title
    let link_el = element.select(&select_link).next().unwrap();
    println!("{:?}", link_el.value().attr("href").unwrap());

    // Get the next tr element that follows the first, with score and user.
    // find_map skips any non-element nodes (such as text) between the rows.
    let details_el = element.next_siblings().find_map(ElementRef::wrap).unwrap();
    // Get the score element from within the second row element;
    // not every item has a score, so avoid unwrapping here
    if let Some(score) = details_el.select(&select_score).next() {
        println!("{:?}", score.text().collect::<Vec<_>>());
    }
}

This only shows how to get the href and score. I'll leave it to you to get the user from details_el.
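Stepping from one row to the following one depends on what the parser places between them: the node directly after a row may be a whitespace text node rather than the next tr. The sketch below models that situation with the standard library only; the Node enum and next_element are hypothetical stand-ins, not part of the scraper API.

```rust
/// Simplified stand-in for a parsed node: whitespace text or an element.
#[derive(Debug)]
enum Node {
    Whitespace,
    Element(String),
}

/// Returns the name of the first element after position `index`,
/// skipping over any whitespace nodes in between.
fn next_element(siblings: &[Node], index: usize) -> Option<&str> {
    siblings.iter().skip(index + 1).find_map(|node| match node {
        Node::Element(name) => Some(name.as_str()),
        Node::Whitespace => None,
    })
}

fn main() {
    // Made-up sibling list: a title row, whitespace, then the details row.
    let siblings = vec![
        Node::Element("title row".to_string()),
        Node::Whitespace,
        Node::Element("details row".to_string()),
    ];
    println!("{:?}", next_element(&siblings, 0));
}
```

Returning an Option also covers the last row, which has no following element.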

transistor