Rvest and loops

Question

I am trying to scrape some information from the following website: https://www.evaluation.it/aziende/bilanci-aziende. I am not able to write a loop that does it automatically for each firm.

I would like to select all firms in the tab called "Italia", download all the balance sheet information (from 2017 to 2021), and add a column with the name of the firm.

This code works well:

library(rvest)
library(dplyr)

link <- "https://www.evaluation.it/aziende/bilanci-aziende/a2a/"
page <- read_html(link)

azienda <- page %>%
  html_nodes(".big_text1:nth-child(1) i") %>%
  html_text()

voce <- page %>%
  html_nodes(".text-left") %>%
  html_text()

bil_2017 <- page %>%
  html_nodes(".text-right :nth-child(2)") %>%
  html_text()
bil_2017 <- bil_2017[-2]

bil_2018 <- page %>%
  html_nodes(".text-right :nth-child(3)") %>%
  html_text()
bil_2018 <- bil_2018[-3]

bil_2019 <- page %>%
  html_nodes(".text-right :nth-child(4)") %>%
  html_text()
bil_2019 <- bil_2019[-4]

bil_2020 <- page %>%
  html_nodes(".text-right :nth-child(5)") %>%
  html_text()
bil_2020 <- bil_2020[-5]

bil_2021 <- page %>%
  html_nodes(".text-right :nth-child(6)") %>%
  html_text()
bil_2021 <- bil_2021[-6]

bilancio <- data.frame(voce, bil_2017, bil_2018, bil_2019, bil_2020, bil_2021,
                       stringsAsFactors = FALSE)

bilancio$azienda <- azienda

However, as you can see, it only covers the first firm.

Can you help me write a loop or a function to get the data for each firm?

At the end I want a dataset for each firm and a dataset with all firms appended.

Thanks for your help!

score 0 · Answer 1 · answered Nov 12 '22 at 12:52

Here's how one might approach this with the tidyverse.

library(rvest)
library(purrr)
library(dplyr)
library(tidyr)

urls_ita <- read_html("https://www.evaluation.it/aziende/bilanci-aziende/") %>%
  html_elements("#tab-two1 > div.listCompany > a") %>%
  html_attr("href")

get_table <- function(url){
  html <- read_html(url)
  # company name
  company <- html_element(html, "div.container > div > div:nth-child(1) > i") %>% 
    html_text()
  # section headings inside the table (bold labels on otherwise empty rows)
  sections <- html_elements(html, "td.text-left > b") %>% 
    html_text()
  html_element(html, "div.container > div > table") %>% 
    html_table() %>% 
    # add company column
    mutate(Company = company, .before = everything()) %>% 
    # remove section rows
    filter(!`Voci di bilancio` %in% sections)
}

# collect and combine tables for first 5 URLs
table_ita <- map_df(urls_ita[1:5], get_table) %>% 
  # move each feature to a separate column by first pivoting longer to remove yearly cols ..
  pivot_longer(cols = starts_with("2"), names_to = "year", values_to = "value") %>% 
  # .. and then back to wide to extract features from "Voci di bilancio"
  pivot_wider(names_from = "Voci di bilancio")
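
The code above stops at the first 5 URLs. When running it over every firm, a single page that fails to load would abort the whole `map_df()` call. A minimal sketch of one way to guard against that, assuming purrr's `possibly()` wrapper and a short pause between requests; `fake_get_table()` and the example.com URLs are hypothetical stand-ins so that the snippet runs without network access:

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for get_table(): fails on one input, no network access
fake_get_table <- function(url) {
  if (grepl("broken", url)) stop("could not read page")
  data.frame(Company = basename(url), value = "1,000", stringsAsFactors = FALSE)
}

# possibly() returns NULL instead of raising an error
safe_get_table <- possibly(fake_get_table, otherwise = NULL)

# Placeholder URLs, not taken from the site
urls <- c(
  "https://example.com/a/",
  "https://example.com/broken/",
  "https://example.com/b/"
)

tables <- map(urls, function(u) {
  Sys.sleep(0.1)  # short pause between requests; pick a delay that suits the site
  safe_get_table(u)
})

# bind_rows() skips the NULL left by the failed URL
combined <- bind_rows(tables)
```

With the real scraper, the same pattern would be `possibly(get_table, otherwise = NULL)` mapped over `urls_ita`, followed by the pivoting steps shown above.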

A glimpse of table_ita; note that all columns are still of type chr:

glimpse(table_ita)
#> Rows: 25
#> Columns: 24
#> $ Company                                    <chr> "A2A", "A2A", "A2A", "A2A",…
#> $ year                                       <chr> "2017", "2018", "2019", "20…
#> $ `Totale attivita' non correnti`            <chr> "6885,000", "7251,000", "76…
#> $ `Totale attivita' correnti`                <chr> "3064,000", "3082,000", "31…
#> $ `Totale attivo`                            <chr> "9949,000", "10333,000", "1…
#> $ `Totale passivita' non correnti`           <chr> "4593,000", "4088,000", "44…
#> $ `Totale passivita' correnti`               <chr> "2343,000", "2722,000", "26…
#> $ `Totale patrimonio netto`                  <chr> "3013,000", "3523,000", "36…
#> $ `Totale Passivo`                           <chr> "9949,000", "10333,000", "1…
#> $ `Totale ricavi`                            <chr> "5796,000", "6494,000", "73…
#> $ EBITDA                                     <chr> "1199,000", "1231,000", "12…
#> $ `Risultato operativo`                      <chr> "710,000", "588,000", "687,…
#> $ `Utile ante imposte`                       <chr> "576,000", "490,000", "581,…
#> $ `Risultato netto di competenza del Gruppo` <chr> "293,000", "344,000", "389,…
#> $ `Return on assets %`                       <chr> "7,136", "5,691", "6,406", …
#> $ `Return on investments %`                  <chr> "11,380", "8,984", "10,096"…
#> $ `Return on equity %`                       <chr> "10,181", "10,973", "11,827…
#> $ `Ebitda margin %`                          <chr> "20,687", "18,956", "16,849…
#> $ `Ebit margin %`                            <chr> "12,250", "9,055", "9,380",…
#> $ `E margin %`                               <chr> "5,055", "5,297", "5,311", …
#> $ `Posizione finanziaria netta comunicata`   <chr> "3226,000", "3022,000", "31…
#> $ `Debt/Ebitda`                              <chr> "2,691", "2,455", "2,556", …
#> $ `Debt to Equity`                           <chr> "1,071", "0,858", "0,864", …
#> $ `Tax Rate`                                 <chr> "33,333", "32,041", "32,530…

Created on 2022-11-12 with reprex v2.0.2
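
Since every column comes back as chr, and the question also asks for one dataset per firm, a possible follow-up step is sketched below. It assumes that the comma in values such as "6885,000" is a decimal mark (the ratio columns like "7,136" suggest so) and uses `readr::parse_number()` with a matching locale, then base `split()` to get one table per company. The `demo` tibble is a toy stand-in for `table_ita`; the "Firm B" rows are invented for illustration only.

```r
library(dplyr)
library(readr)

# Toy stand-in for table_ita. The A2A strings are copied from the output above;
# "Firm B" and its values are made up purely for illustration.
demo <- tibble(
  Company = c("A2A", "A2A", "Firm B"),
  year = c("2017", "2018", "2017"),
  `Totale attivo` = c("9949,000", "10333,000", "100,000"),
  `Return on assets %` = c("7,136", "5,691", "1,500")
)

# Assumption: "," is the decimal mark and "." (if it ever appears) groups thousands
it_locale <- locale(decimal_mark = ",", grouping_mark = ".")

demo_num <- demo %>%
  mutate(
    year = as.integer(year),
    # convert every column except the identifiers to numeric
    across(-c(Company, year), ~ parse_number(.x, locale = it_locale))
  )

# One data frame per firm, in a list named by company
by_firm <- split(demo_num, demo_num$Company)
```

Applied to the real data, `demo` would be replaced by `table_ita`, and `by_firm[["A2A"]]` would then hold the table for A2A alone while `demo_num` remains the appended dataset.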
