I am a newbie in R, so there is probably already an answer to the following question, but I haven't found a solution matching the issue I am facing. I am trying to get tables from a number of webpages; there should be around 5,200 of them. I have imported one in order to format it, but I need to automate the process to get them all. Here's the url:

  http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C

I have tried to find a way to get all the tables by doing:

  url <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", ., sep="" )

but I receive an error message saying that

  .

cannot be found. In any case, I also don't understand how to automate the formatting process once I have retrieved the tables. Any hint?
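A side note on the error: `.` is the magrittr placeholder and only has a meaning inside a `%>%` pipe, so outside a pipe R cannot find an object called `.`. A minimal sketch of the vectorized call, with placeholder codes standing in for the real product codes:

  codes <- c("C0195C", "C0196C")   # placeholder codes
  urls <- paste0("http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=", codes)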

Il Forna

1 Answer


Here's how you would do it for one product:

library(rvest)
library(dplyr)

url <- "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C"
h <- read_html(url)
# html_table() returns a list of every table on the page
tab <- html_table(h, fill=TRUE) %>% 
  as_tibble(.name_repair = "universal")
tab
# # A tibble: 37 x 1
#    ...1$Componente $Unidades $`Valor por 100… $`Desvio padrão` $`Valor Mínimo` $`Valor Máximo` $`Número de dad…
#    <chr>           <chr>     <chr>            <chr>            <chr>           <chr>           <chr>           
#   1 Energia         kJ        578              -                -               -               -               
#   2 Energia         kcal      136              -                -               -               -               
#   3 Umidade         g         65,5             -                -               -               -               
#   4 Carboidrato to… g         33,3             -                -               -               -               
#   5 Carboidrato di… g         32,5             -                -               -               -               
#   6 Proteína        g         0,60             -                -               -               -               
#   7 Lipídios        g         0,26             -                -               -               -               
#   8 Fibra alimentar g         0,84             -                -               -               -               
#   9 Álcool          g         0,00             -                -               -               -               
#   10 Cinzas          g         0,39             -                -               -               -               
#   # … with 27 more rows, and 2 more variables: $Referências <chr>, $`Tipo de dados` <chr>

If you wanted to scrape all the codes and get all of the tables, you could do that with the following. First, we can set up a loop to scrape all of the links. By investigating the source, you would find, as you did, that all of the product codes have "cod_produto" in the href attribute. You could use an xpath selector to keep only those a tags containing that string. You're basically looping over every page until you get to one that doesn't have any links. This gives you 5203 links.

library(glue)

all_links <- NULL
links <- "init"
i <- 1
# keep requesting result pages until one comes back with no product links
while(length(links) > 0){
  url <- glue("http://www.tbca.net.br/base-dados/composicao_alimentos.php?pagina={i}&atuald=3")
  h <- read_html(url)
  # keep only the <a> tags whose href contains "cod_produto"
  links <- h %>% html_nodes(xpath = "//a[contains(@href,'cod_produto')]") %>% html_attr("href") %>% unique()
  all_links <- c(all_links, links)
  i <- i+1
}
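As a quick sanity check after the loop finishes (the count below is the one reported above; it may differ if the database has been updated since):

length(unique(all_links))
# [1] 5203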

EDIT

Next, we can follow each link and pull the table out of it, storing the table in the list called tabs. In answer to the question about how to get the name of the product in the data, there are two easy things to do. The first is to make the table into a data frame and then make a variable (I called it code) in the data frame that has the code name. The second is to set the list names to be the product code. The answer below has been edited to do both things.

all_links <- unique(all_links)
tabs <- vector(mode="list", length=length(all_links))
for(i in seq_along(all_links)){
  url <- glue("http://www.tbca.net.br/base-dados/{all_links[i]}")
  # pull the product code out of the url (everything after the "=")
  code <- gsub(".*=(.*)$", "\\1", url)
  h <- read_html(url)
  # take the first table on the page
  tmp <- html_table(h, fill=TRUE)[[1]]
  tmp <- as.data.frame(tmp)
  # record the product code both as a column and as the list name
  tmp$code <- code
  tabs[[i]] <- tmp
  names(tabs)[i] <- code
}
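If you then want everything in a single data frame rather than a list, a minimal sketch (assuming the per-product tables share the same column layout; `bind_rows()` will fill any columns that differ with `NA`):

library(dplyr)

# stack all per-product tables; the code column identifies each product
all_tabs <- bind_rows(tabs)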

DaveArmstrong
  • First of all, thank you very much for the help: the routine works perfectly and is very clear. The only issue is that the tables do not indicate the product code, so I don't know which code each table refers to. In any case, that is shown on each webpage: should I add it through the xpath selector? Would it fit with the rest of the routine? Thanks again, in the meantime. – Il Forna Jan 26 '21 at 22:26
  • @IlForna I edited the answer to include ways to put a variable that is the product code in the resulting table and to make the list names the product codes. If you had something else in mind, let me know. – DaveArmstrong Jan 26 '21 at 22:58
  • This is perfect, thank you very much. I still have to work out how this last bit of code works, but it definitely works. – Il Forna Jan 27 '21 at 07:45
  • @IlForna the main piece is this: `code <- gsub(".*=(.*)$", "\\1", url)`, where I scrape the code out of the url. The `gsub()` function looks in the third argument (`url`) for the first argument (a regular expression) and replaces the match with the second argument. The regular expression searches for anything followed by an equal sign and captures everything after that equal sign. The `"\\1"` says to replace the entire matched string with the captured text, which in this case is just the product code. That's how the code gets scraped. – DaveArmstrong Jan 27 '21 at 10:14
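For illustration, applying that extraction to the example url from the question:

gsub(".*=(.*)$", "\\1",
     "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C")
# [1] "C0195C"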