Scraping HTML tables into R data frames

Question

I need to extract information from the website https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1: the name of each club, the address of its Transfermarkt profile page, and the name of the stadium from the team's profile. This is my first time extracting data from websites, so any help is appreciated. So far I have written this code:

library(rvest)
theurl <- "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)

Which parameters exactly do you need? The table "CLUBS OF THE PREMIER LEAGUE 17/18"? — Henry Navarro, Nov 14 '17 at 15:55
If you pipe html_nodes(file, "table") %>% html_nodes("a"), you can see the hrefs; then it's a matter of regex. — Elio Diaz, Nov 14 '17 at 16:12
I need a table with the club names, the clubs' profile URLs, and the stadium from each team's profile. — Kim, Nov 14 '17 at 16:21

score 0 · Accepted Answer · answered Nov 14 '17 at 16:12

As @Henry Navarro has pointed out, it is not clear exactly which nodes you need. Finding the right nodes is a time-consuming task, so you need to specify which ones you want. You can use Selectorgadget for this purpose.

The following is a quick example of how you might generate the list of team pages that you will then have to loop through with rvest to extract information. I think the main function you have been missing so far is html_attr(); see, e.g., this answer. Of course, you will still have to find the nodes on these pages that hold the stadium and other details.

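# `file` is the page parsed with read_html() in the question
file %>%
  html_nodes("table") %>%
  { .[4] } %>%                  # keep only the fourth table (the clubs table)
  html_nodes("a") %>%
  html_attr("href") %>%         # pull the link target of every <a> node
  { .[grep("/startseite/verein", ., fixed = TRUE)] } %>%  # club profile links only
  unique() %>%
  { paste0("https://www.transfermarkt.co.uk", .) }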

# [1] "https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2017"               
# [2] "https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2017"          
# [3] "https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985/saison_id/2017"        
# [4] "https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2017" 
#...
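
Since the question also asks for the club names, here is a minimal sketch of how names and links might be collected together in one data frame. It runs on a small made-up HTML snippet rather than the live page, so the markup, the `page` object, and the `clubs` data frame are assumptions for illustration; the real table may be structured differently.

```r
library(rvest)

# Made-up HTML standing in for the clubs table; the real markup may differ.
page <- read_html('
<table>
  <tr><td><a href="/fc-chelsea/startseite/verein/631/saison_id/2017">Chelsea FC</a></td></tr>
  <tr><td><a href="/manchester-city/startseite/verein/281/saison_id/2017">Manchester City</a></td></tr>
  <tr><td><a href="/some/other/link">Not a club</a></td></tr>
</table>')

links <- html_nodes(page, "table a")
hrefs <- html_attr(links, "href")        # link targets
labels <- html_text(links, trim = TRUE)  # visible link text

# Keep club profile links that have visible text (assumption: links that
# wrap only an image carry no text and can be skipped).
keep <- grepl("/startseite/verein", hrefs, fixed = TRUE) & nzchar(labels)

clubs <- data.frame(
  club = labels[keep],
  url = paste0("https://www.transfermarkt.co.uk", hrefs[keep]),
  stringsAsFactors = FALSE
)
clubs <- clubs[!duplicated(clubs$url), ]  # one row per club page
```

On the real page you would start from `file` instead of the literal HTML and adjust the selector to the table you need.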

That's exactly what I meant, thanks. How can I find the stadium for each of these clubs? — Kim, Nov 14 '17 at 16:34
Open one of the "startseite" pages you scraped in your browser, then use Selectorgadget on it to identify the node you want to extract, i.e., the one holding the stadium information. The page structure should be the same for all the other clubs (maybe the name of the club needs to be considered). With this information you can write a loop over all the pages you need to scrape. But be thoughtful and respect the website's conditions: scraping many pages creates high traffic in a short time, so consider adding a pause in your script to reduce it. — Manuel Bickel, Nov 14 '17 at 16:41
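
A rough sketch of such a loop with a pause is shown below. The selector `.stadium-name`, the helper names `get_stadium()` and `scrape_stadiums()`, and the fake pages are all hypothetical; the real selector has to be found with Selectorgadget on an actual club page. The demo runs on literal HTML strings so that it generates no traffic.

```r
library(rvest)

# Hypothetical selector: replace with the one Selectorgadget gives you.
stadium_css <- ".stadium-name"

# Extract the stadium text from one parsed page; NA if the node is missing.
get_stadium <- function(page, css) {
  html_text(html_node(page, css), trim = TRUE)
}

# Visit each source in turn, pausing between requests to keep traffic low.
scrape_stadiums <- function(sources, css, pause = 2) {
  stadiums <- character(length(sources))
  for (i in seq_along(sources)) {
    page <- read_html(sources[[i]])  # accepts a URL, a file path, or literal HTML
    stadiums[i] <- get_stadium(page, css)
    Sys.sleep(pause)
  }
  stadiums
}

# Made-up pages standing in for club profile pages.
fake_pages <- c(
  '<html><body><span class="stadium-name">Stamford Bridge</span></body></html>',
  '<html><body><span class="stadium-name">Etihad Stadium</span></body></html>',
  '<html><body><p>no stadium here</p></body></html>'
)
stadiums <- scrape_stadiums(fake_pages, stadium_css, pause = 0)
```

For real use you would pass the vector of club URLs instead of `fake_pages` and keep a pause of a few seconds.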
