I need to extract information from the website https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1: for each club, its name, the address of its Transfermarkt profile page, and the name of its stadium (taken from the team's profile). This is my first attempt at extracting data from websites, so any help is appreciated. So far I have written this code:

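library(rvest)

# Download and parse the league overview page
theurl <- "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
file <- read_html(theurl)

# Collect all tables, then convert the fourth one to a data frame
tables <- html_nodes(file, "table")
table1 <- html_table(tables[[4]], fill = TRUE)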
Kim

1 Answer

As @Henry Navarro has pointed out, it is not clear exactly which nodes you need. Finding the right nodes is time-consuming, so you should specify which ones you want. Selectorgadget can help you identify them.

Below is a quick example of how you might generate the list of team pages that you will then have to loop through with rvest to extract the information. I think the main function you have been missing so far is html_attr(); see, e.g., this answer. You will still have to find the nodes on those pages that hold the stadium and other details.

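# `file` is the parsed page from the question: file <- read_html(theurl)
file %>%
  html_nodes("table") %>%
  { .[4] } %>%                     # the fourth table, as in the question
  html_nodes("a") %>%
  html_attr("href") %>%
  { .[grep("/startseite/verein", ., fixed = TRUE)] } %>%   # keep club profile links
  unique() %>%
  { paste0("https://www.transfermarkt.co.uk", .) }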

# [1] "https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2017"               
# [2] "https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2017"          
# [3] "https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985/saison_id/2017"        
# [4] "https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2017" 
#...
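
To keep each club's name next to its profile link, one option is to read the link text and the href from the same nodes. The sketch below runs on a small inline HTML fragment that only imitates the link structure shown above; the markup, the club labels and the crest image are assumptions, not the real page. On the real page, `file` from the question would take the place of `doc`, and the selector may need adjusting.

```r
library(rvest)

# Hypothetical fragment imitating the club links in the league table;
# the real markup on the site may differ.
doc <- read_html(paste0(
  '<table>',
  '<tr><td><a href="/fc-chelsea/startseite/verein/631/saison_id/2017"><img src="crest.png"/></a></td>',
  '<td><a href="/fc-chelsea/startseite/verein/631/saison_id/2017">Chelsea</a></td></tr>',
  '<tr><td><a href="/manchester-city/startseite/verein/281/saison_id/2017">Man City</a></td></tr>',
  '<tr><td><a href="/premier-league/startseite/wettbewerb/GB1">Premier League</a></td></tr>',
  '</table>'
))

links <- html_nodes(doc, "table a")

# Link text and href come from the same nodes, so the rows stay aligned.
clubs <- data.frame(
  club = html_text(links, trim = TRUE),
  url  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

# Drop links without text (e.g. image-only links), keep club profile links,
# remove duplicates, and make the URLs absolute.
clubs <- clubs[nzchar(clubs$club), ]
clubs <- clubs[grepl("/startseite/verein", clubs$url, fixed = TRUE), ]
clubs <- clubs[!duplicated(clubs$url), ]
clubs$url <- paste0("https://www.transfermarkt.co.uk", clubs$url)
rownames(clubs) <- NULL
clubs
```

The resulting data frame has one row per club, with columns `club` and `url`, and `clubs$url` can then be used as the list of pages to visit.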
Manuel Bickel
  • That's exactly what I meant, thanks. How can I find out the stadium for each of these clubs? – Kim Nov 14 '17 at 16:34
  • Open one of the "startseite" pages you scraped in your browser, then use Selectorgadget on it to identify the node you want to extract, i.e., the one that holds the stadium information. The structure should be the same for all the other clubs (the club name may need to be taken into account). With this information you can write a loop over all the pages you need to scrape. Be considerate and respect the website's conditions: scraping many pages generates a lot of traffic in a short time, so you might add a pause to your script to reduce it. – Manuel Bickel Nov 14 '17 at 16:41
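
The loop with a pause described in the last comment could look roughly like the sketch below. The helper names `extract_first_text()` and `scrape_stadiums()`, the `read_page` parameter, the selector `span.stadium` and the example pages are all made up for illustration; the real selector has to be found with Selectorgadget on an actual club page. The demonstration runs offline on inline HTML so that no traffic is generated while trying it out.

```r
library(rvest)

# Text of the first node matching `selector`, or NA when nothing matches,
# so that one unusual page does not stop the whole loop.
extract_first_text <- function(page, selector) {
  nodes <- html_nodes(page, selector)
  if (length(nodes) == 0) return(NA_character_)
  html_text(nodes[[1]], trim = TRUE)
}

# Visit each URL in turn and pause between requests to keep traffic low.
# `read_page` defaults to read_html(); it is a parameter only so that the
# loop can be tried offline (an assumption of this sketch).
scrape_stadiums <- function(urls, selector, pause = 5, read_page = read_html) {
  stadium <- character(length(urls))
  for (i in seq_along(urls)) {
    page <- read_page(urls[i])
    stadium[i] <- extract_first_text(page, selector)
    if (i < length(urls)) Sys.sleep(pause)   # no pause needed after the last page
  }
  data.frame(url = urls, stadium = stadium, stringsAsFactors = FALSE)
}

# Offline demonstration with made-up pages and a made-up selector.
fake_pages <- c(
  "https://example.com/club-a" = '<div><span class="stadium">Example Park</span></div>',
  "https://example.com/club-b" = '<div><p>No stadium listed</p></div>'
)
fake_reader <- function(url) read_html(fake_pages[[url]])

result <- scrape_stadiums(names(fake_pages), "span.stadium",
                          pause = 0, read_page = fake_reader)
result
```

For the real site, the call would pass the scraped club URLs and the selector you identified, keep the default `read_page`, and use a pause of a few seconds.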