I'm trying to scrape the WHOLE 'In more languages' table on Wikidata pages, e.g. https://www.wikidata.org/wiki/Q3044

I have tried 2 approaches in R:

library(rvest)
url <- "https://www.wikidata.org/wiki/Q3044"
pg <- url %>% read_html

pg <- pg %>% 
  html_nodes(".wikibase-entitytermsforlanguagelistview") %>%
  html_table()

table <- pg[[1]]

But this returns only the English part (1 row).

I have also tried:

library(tidywikidatar)
tw_get_label(id = c("Q3044"),language = "nl")

But this returns only one label. I would like all the labels, plus everything in the 'Also known as' column on Wikidata.

Any help would be much appreciated!

2 Answers

What an excellent question. You're only getting the first row of the table because that's all the page initially loads with; JavaScript running in the background fills in the rest of the table after the page loads. You can see this happen if you reload the page and watch closely - I've included a gif below to show this. Since read_html doesn't run any of that JavaScript, all it gets is the original page.

Gif showing page refresh with only the first row initially loaded

However, all this means is that we need to look for a different URL that's sourcing the full table. Using Chrome's developer tools we learn that the table is coming from https://www.wikidata.org/wiki/Special:EntityData/Q3044.json and that's the page we actually want to scrape. If we download that using jsonlite we don't get the table exactly, but we can reassemble it using some dplyr tools. Here's a snippet of code that does that:


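# Download the entity as a nested list
wiki_data <- jsonlite::read_json("https://www.wikidata.org/wiki/Special:EntityData/Q3044.json")
table_data <- wiki_data$entities$Q3044

library(dplyr)
# Labels and descriptions: one entry per language
label_col <- bind_rows(table_data$labels) %>% rename(label=value)
desc_col <- bind_rows(table_data$descriptions) %>% rename(description=value)
# Aliases: several entries per language, collapsed into one string each
alias_col <- bind_rows(table_data$aliases) %>% 
  rename(alias=value) %>%
  group_by(language) %>%
  summarise(alias=paste(alias, collapse = ", "))

# Join on the language code; languages without an alias get NA
full_table <- label_col %>%
  left_join(desc_col, by = "language") %>%
  left_join(alias_col, by = "language")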

The first few rows of the output are shown below:

> full_table
# A tibble: 157 x 4
   language label                         description                                        alias
   <chr>    <chr>                         <chr>                                              <chr>
 1 fr       Charlemagne                   empereur d'Occident et roi des Francs              Char~
 2 en       Charlemagne                   King of the Franks, King of Italy, and Holy Roman~ Karo~
 3 it       Carlo Magno                   re dei Franchi e dei Longobardi e primo imperator~ NA   
 4 ilo      Karlomagno                    Ari dagiti Pranko ken Lombardo ken Emperador ti N~ NA   
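
To make the reshaping step easier to follow, here is a minimal base R sketch of the same idea on a tiny hand-written list. The mock_entity object is an assumption: it only imitates the nesting of wiki_data$entities$Q3044 as used above (labels hold one entry per language, aliases hold a list of entries per language) and is not real API output.

```r
# Hand-written stand-in for wiki_data$entities$Q3044 (hypothetical, not real API output)
mock_entity <- list(
  labels = list(
    en = list(language = "en", value = "Charlemagne"),
    it = list(language = "it", value = "Carlo Magno")
  ),
  aliases = list(
    en = list(
      list(language = "en", value = "Karolus Magnus"),
      list(language = "en", value = "Charles the Great")
    )
  )
)

# Labels: one entry per language, so pull the value out of each entry
labels <- vapply(mock_entity$labels, function(x) x$value, character(1))

# Aliases: each language holds several entries, so collapse them into one string
aliases <- vapply(
  mock_entity$aliases,
  function(entries) {
    paste(vapply(entries, function(x) x$value, character(1)), collapse = ", ")
  },
  character(1)
)

# Look aliases up by language code; languages without aliases become NA
result <- data.frame(
  language = names(labels),
  label = unname(labels),
  alias = unname(aliases[names(labels)]),
  stringsAsFactors = FALSE
)
print(result)
```

The lookup by language code plays the same role as the left joins in the answer: every language that has a label keeps its row, whether or not it has any aliases.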
Dubukay

This can also be achieved with tidywikidatar, as both labels and aliases are included in the response to tw_get().

You can get both labels and aliases for a given language by passing the relevant language code as the language parameter, or, as mentioned in the documentation, pass "all_available" if you want labels and aliases in all available languages. See the reprex below for reference:

library("tidywikidatar")
item_df <- tw_get(id = c("Q3044"),
                  language = "all_available")

item_df %>%
  dplyr::filter(stringr::str_starts(string = property, pattern = "label"))
#> # A tibble: 158 × 4
#>    id    property  value                         rank 
#>    <chr> <chr>     <chr>                         <chr>
#>  1 Q3044 label_fr  Charlemagne                   <NA> 
#>  2 Q3044 label_en  Charlemagne                   <NA> 
#>  3 Q3044 label_it  Carlo Magno                   <NA> 
#>  4 Q3044 label_ilo Karlomagno                    <NA> 
#>  5 Q3044 label_af  Karel die Grote               <NA> 
#>  6 Q3044 label_gsw Karl dr Gross                 <NA> 
#>  7 Q3044 label_an  Carlos Magno                  <NA> 
#>  8 Q3044 label_ang Carl sē Micel Francena Cyning <NA> 
#>  9 Q3044 label_ar  شارلمان                       <NA> 
#> 10 Q3044 label_arz شارلمان                       <NA> 
#> # … with 148 more rows

item_df %>%
  dplyr::filter(stringr::str_starts(string = property, pattern = "alias"))
#> # A tibble: 55 × 4
#>    id    property value                                rank 
#>    <chr> <chr>    <chr>                                <chr>
#>  1 Q3044 alias_en Karolus Magnus                       <NA> 
#>  2 Q3044 alias_en Charles the Great                    <NA> 
#>  3 Q3044 alias_en Emperor Charlemagne                  <NA> 
#>  4 Q3044 alias_en Karl the Great                       <NA> 
#>  5 Q3044 alias_en Carolus Magnus                       <NA> 
#>  6 Q3044 alias_en King of the Franks Charles the Great <NA> 
#>  7 Q3044 alias_en King of the Franks Charlemagne       <NA> 
#>  8 Q3044 alias_en Charlemagne the Franc                <NA> 
#>  9 Q3044 alias_en Charles I                            <NA> 
#> 10 Q3044 alias_fr Charles Ier                          <NA> 
#> # … with 45 more rows

item_df %>%
  dplyr::filter(stringr::str_starts(string = property, pattern = "label")|stringr::str_starts(string = property, pattern = "alias"))
#> # A tibble: 213 × 4
#>    id    property  value                         rank 
#>    <chr> <chr>     <chr>                         <chr>
#>  1 Q3044 label_fr  Charlemagne                   <NA> 
#>  2 Q3044 label_en  Charlemagne                   <NA> 
#>  3 Q3044 label_it  Carlo Magno                   <NA> 
#>  4 Q3044 label_ilo Karlomagno                    <NA> 
#>  5 Q3044 label_af  Karel die Grote               <NA> 
#>  6 Q3044 label_gsw Karl dr Gross                 <NA> 
#>  7 Q3044 label_an  Carlos Magno                  <NA> 
#>  8 Q3044 label_ang Carl sē Micel Francena Cyning <NA> 
#>  9 Q3044 label_ar  شارلمان                       <NA> 
#> 10 Q3044 label_arz شارلمان                       <NA> 
#> # … with 203 more rows

Created on 2022-06-23 by the reprex package (v2.0.1)
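
If you want one row per language with all aliases in a single cell, as in the 'Also known as' column, the alias rows can be collapsed by language code. The following is a minimal base R sketch, not part of tidywikidatar: it uses a small hand-written data frame that only assumes the id, property and value columns seen in the output above, so it runs without a network call. With real data you would use the item_df returned by tw_get() instead.

```r
# Hand-written stand-in for the tw_get() result (hypothetical sample rows)
item_df <- data.frame(
  id = "Q3044",
  property = c("label_en", "alias_en", "alias_en", "alias_fr"),
  value = c("Charlemagne", "Karolus Magnus", "Charles the Great", "Charles Ier"),
  stringsAsFactors = FALSE
)

# Keep only alias rows and strip the "alias_" prefix to get the language code
alias_rows <- item_df[startsWith(item_df$property, "alias_"), ]
alias_rows$language <- sub("^alias_", "", alias_rows$property)

# Collapse all aliases of a language into one comma-separated string
aliases_by_language <- aggregate(
  value ~ language,
  data = alias_rows,
  FUN = paste,
  collapse = ", "
)
print(aliases_by_language)
```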

giocomai