I have run into a problem scraping an HTML table with nested column headers.

The table is from the immigration department of Hong Kong.

A screenshot is shown here:

[Image: screenshot of the table]

I tried to do it with rvest, but the result is messy.

library(rvest)
library(tidyverse)  # also attaches dplyr and stringr

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

css_selector <- "body > section:nth-child(7) > div > div > div > div > table"
immiTable <- url_data %>%
  read_html() %>%
  html_element(css = css_selector) %>%
  html_table()
immiTable

[Image: screenshot of the output]

My goal is to extract the first row (i.e. Airport) and plot it as a pie chart, and also to build a data frame of the whole table and save it to Excel.

I have found that teaching material on scraping tables with nested headers and unnesting them is rather scarce, so I would appreciate your guidance. Thank you very much for your help.

ronzenith

2 Answers

Here is a way. The headers format complicates things but the code below works. It extracts the entire table, not just the first row.

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
  library(ggplot2)
})

url_data <- "https://www.immd.gov.hk/eng/stat_20220901.html"

page <- url_data %>% read_html()

page %>%
  html_elements("[headers='Arrival']") %>%
  html_text() %>%
  paste("Arrival", .) -> col_names
page %>%
  html_elements("[headers='Departure']") %>%
  html_text() %>%
  paste("Departure", .) %>%
  c(col_names, .) -> col_names
page %>%
  html_elements("[headers='Control_Point']") %>%
  html_text() -> row_names
page %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  gsub(",", "", ., fixed = TRUE) %>%  # strip every thousands separator, not just the first
  as.numeric() %>%
  matrix(nrow = length(row_names), byrow = TRUE) %>%
  as.data.frame() %>%
  setNames(col_names) %>%
  `row.names<-`(row_names) -> final

final
#>                                Arrival Hong Kong Residents
#> Airport                                               4258
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          333
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                          3404
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 7995
#>                                Arrival Mainland Visitors Arrival Other Visitors
#> Airport                                             1488                    422
#> Express Rail Link West Kowloon                         0                      0
#> Hung Hom                                               0                      0
#> Lo Wu                                                  0                      0
#> Lok Ma Chau Spur Line                                  0                      0
#> Heung Yuen Wai                                         0                      0
#> Hong Kong-Zhuhai-Macao Bridge                         28                     39
#> Lok Ma Chau                                            0                      0
#> Man Kam To                                             0                      0
#> Sha Tau Kok                                            0                      0
#> Shenzhen Bay                                         348                     37
#> China Ferry Terminal                                   0                      0
#> Harbour Control                                        0                      0
#> Kai Tak Cruise Terminal                                0                      0
#> Macau Ferry Terminal                                   0                      0
#> Total                                               1864                    498
#>                                Arrival Total Departure Hong Kong Residents
#> Airport                                 6168                          3775
#> Express Rail Link West Kowloon             0                             0
#> Hung Hom                                   0                             0
#> Lo Wu                                      0                             0
#> Lok Ma Chau Spur Line                      0                             0
#> Heung Yuen Wai                             0                             0
#> Hong Kong-Zhuhai-Macao Bridge            400                           243
#> Lok Ma Chau                                0                             0
#> Man Kam To                                 0                             0
#> Sha Tau Kok                                0                             0
#> Shenzhen Bay                            3789                          1301
#> China Ferry Terminal                       0                             0
#> Harbour Control                            0                             0
#> Kai Tak Cruise Terminal                    0                             0
#> Macau Ferry Terminal                       0                             0
#> Total                                  10357                          5319
#>                                Departure Mainland Visitors
#> Airport                                               1154
#> Express Rail Link West Kowloon                           0
#> Hung Hom                                                 0
#> Lo Wu                                                    0
#> Lok Ma Chau Spur Line                                    0
#> Heung Yuen Wai                                           0
#> Hong Kong-Zhuhai-Macao Bridge                          194
#> Lok Ma Chau                                              0
#> Man Kam To                                               0
#> Sha Tau Kok                                              0
#> Shenzhen Bay                                           524
#> China Ferry Terminal                                     0
#> Harbour Control                                          0
#> Kai Tak Cruise Terminal                                  0
#> Macau Ferry Terminal                                     0
#> Total                                                 1872
#>                                Departure Other Visitors Departure Total
#> Airport                                             315            5244
#> Express Rail Link West Kowloon                        0               0
#> Hung Hom                                              0               0
#> Lo Wu                                                 0               0
#> Lok Ma Chau Spur Line                                 0               0
#> Heung Yuen Wai                                        0               0
#> Hong Kong-Zhuhai-Macao Bridge                        15             452
#> Lok Ma Chau                                           0               0
#> Man Kam To                                            0               0
#> Sha Tau Kok                                           0               0
#> Shenzhen Bay                                         28            1853
#> China Ferry Terminal                                  0               0
#> Harbour Control                                       0               0
#> Kai Tak Cruise Terminal                               0               0
#> Macau Ferry Terminal                                  0               0
#> Total                                               358            7549

Created on 2022-09-18 with reprex v2.0.2
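
The question also asks to save the table to Excel. A minimal sketch of one way to do that, assuming the writexl package is installed (the sketch falls back to a CSV file otherwise). The small `final` below is only a stand-in for the scraped data frame, with two rows copied from the output above; the file name and the `out` and `path` variables are made up for illustration.

```r
# Stand-in for the scraped `final` data frame (two rows and two columns
# copied from the printed output above); replace with the real object.
final <- data.frame(
  `Arrival Total`   = c(6168, 0),
  `Departure Total` = c(5244, 0),
  row.names   = c("Airport", "Express Rail Link West Kowloon"),
  check.names = FALSE
)

# Move the row names into an ordinary column so they end up in the file.
out <- cbind(Control_Point = row.names(final), final)
row.names(out) <- NULL

# Hypothetical output location; a temporary directory keeps the sketch tidy.
path <- file.path(tempdir(), "immd_stat_20220901.xlsx")

if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(out, path)           # writes a single-sheet .xlsx file
} else {
  path <- sub("\\.xlsx$", ".csv", path)    # fallback: CSV, which Excel can open
  write.csv(out, path, row.names = FALSE)
}
```

Copying the row names into a `Control_Point` column first keeps the control point labels in the saved file regardless of which branch runs.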


To draw the pie chart with ggplot2, plot a stacked bar chart and then switch to polar coordinates.

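# First row of the table (Airport), kept as a one-row data frame
Airport <- final[1, ]
Airport %>%
  t() %>%                                  # one row per category, column named "Airport"
  as.data.frame() %>%
  mutate(`Arrival/Departure` = row.names(.)) %>%
  ggplot(aes("", Airport, fill = `Arrival/Departure`)) +
  geom_col(width = 1) +
  scale_fill_manual(values = RColorBrewer::brewer.pal(n = 8, name = "Spectral")) +
  coord_polar(theta = "y", start = 0) +    # polar coordinates turn the stacked bar into a pie
  theme_void()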


Rui Barradas
  • You saved my life @Rui Barradas ! I'd like to know the trick, that is, how can your syntax get the first four rows of headers removed? It seems that what you did is, firstly, scrape the numbers, secondly, add the row and col names. Am I correct? – ronzenith Sep 18 '22 at 15:38
  • Maybe my question should be in this way: how to scrape the table without extracting the column names altogether? @ruibarradas – ronzenith Sep 18 '22 at 15:46
  • @ronzenith The difference is in which elements are extracted: for the numbers it's `class='hRight'`; for the column headers it's `headers='Arrival'` and `headers='Departure'`; and for the row names it's `headers='Control_Point'`. – Rui Barradas Sep 18 '22 at 16:15
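
To illustrate the point in the comments above, here is a small sketch of attribute selectors on a toy table. The HTML is an invented, simplified stand-in that only mimics the attributes mentioned (`headers` and `class='hRight'`); it is not the real page markup, and the variable names are made up.

```r
library(rvest)

# Invented stand-in for the page: sub-headers carry headers='Arrival',
# row labels carry headers='Control_Point', and numbers carry class='hRight'.
html <- minimal_html('
<table>
  <thead>
    <tr><th id="Control_Point">Control Point</th><th id="Arrival">Arrival</th></tr>
    <tr><th></th><th headers="Arrival">Total</th></tr>
  </thead>
  <tbody>
    <tr><td headers="Control_Point">Airport</td><td class="hRight">4,258</td></tr>
    <tr><td headers="Control_Point">Lo Wu</td><td class="hRight">0</td></tr>
  </tbody>
</table>')

# Each selector picks one kind of cell, so the header rows never
# get mixed in with the numbers.
col_names <- html %>% html_elements("[headers='Arrival']") %>% html_text()
row_names <- html %>% html_elements("[headers='Control_Point']") %>% html_text()
values <- html %>%
  html_elements("[class='hRight']") %>%
  html_text() %>%
  gsub(",", "", ., fixed = TRUE) %>%   # drop thousands separators
  as.numeric()
```

Because the numbers, the column labels and the row labels are selected separately, the multi-row header never has to be parsed as part of the table body.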

An alternative would be to select the tbody rows, filtering out the hidden items by attribute, then add in the headers later.

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>%
  html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
# "Control_Point" plus every prefix/column combination, e.g. "arr_Total"
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname())

df <- map_dfr(
  rows,
  function(x) {
    x %>%
      html_elements("td[headers]") %>%  # only cells with a headers attribute
      set_names(headers) %>%
      html_text()
  }
) %>%
  # remove all thousands separators, then convert every column but the first
  mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer()))

Or somewhat condensed,

library(rvest)
library(tidyverse)

rows <- read_html("https://www.immd.gov.hk/eng/stat_20220901.html") %>%
  html_elements(".table-passengerTrafficStat tbody tr")
prefixes <- c("arr", "dep")
cols <- c("Hong Kong Residents", "Mainland Visitors", "Other Visitors", "Total")
headers <- c("Control_Point", crossing(prefixes, cols) %>% unite("headers", 1:2, remove = TRUE) %>% unlist() %>% unname())

df <- map_dfr(rows, ~ set_names(.x %>% html_elements("td[headers]") %>% html_text(), headers)) %>%
  mutate(across(c(-1), ~ str_replace_all(.x, ",", "") %>% as.integer()))
QHarr
  • Thank you very much for your help. Reading your code closely, I don't quite understand the purpose of the extra function `function(x) { x %>% html_elements("td[headers]") %>% set_names(headers) %>% html_text() }`. (1) What does `x` mean here? (2) Why does the function run without my having to call it and tell R what `x` is? @qharr – ronzenith Oct 01 '22 at 17:16
  • I want to apply the extraction row by row, so I need to call a function on each row, in this case with [map_dfr](https://purrr.tidyverse.org/reference/map.html). `x` is what is passed from the left (the list of rows) to the right (the function applied to each row), so `x` is a single row. The function is called by map_dfr. – QHarr Oct 03 '22 at 04:35
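
A small sketch of what the last comment describes, using plain character vectors in place of the scraped rows. The data and the `name_row` helper are made up for illustration; it assumes purrr and dplyr are installed, since `map_dfr()` relies on dplyr to bind the rows.

```r
library(purrr)

# Made-up stand-ins for the scraped rows: one character vector per table row.
rows <- list(
  c("Airport", "6168"),
  c("Lo Wu", "0")
)
headers <- c("Control_Point", "arr_Total")

# A named helper: `x` is whatever single element map_dfr() hands over.
name_row <- function(x) set_names(x, headers)

# map_dfr() calls name_row() once per element of `rows`, passing that
# element as `x`, then binds the named results into one data frame.
df <- map_dfr(rows, name_row)

# The same thing written with an anonymous function, as in the answer.
df_anon <- map_dfr(rows, function(x) set_names(x, headers))
```

The function is never called by hand: `map_dfr()` does the calling, once for each row, which is why `x` does not need to be defined anywhere else.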