Here's how you would do it for one product:
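library(rvest)
library(dplyr)
url <- "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C"
h <- read_html(url)
# html_table() returns a list of tables, so the tibble below has a single data-frame column
tab <- html_table(h, fill=TRUE) %>%
  as_tibble(.name_repair = "universal")
tab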
# # A tibble: 37 x 1
# ...1$Componente $Unidades $`Valor por 100… $`Desvio padrão` $`Valor Mínimo` $`Valor Máximo` $`Número de dad…
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Energia kJ 578 - - - -
# 2 Energia kcal 136 - - - -
# 3 Umidade g 65,5 - - - -
# 4 Carboidrato to… g 33,3 - - - -
# 5 Carboidrato di… g 32,5 - - - -
# 6 Proteína g 0,60 - - - -
# 7 Lipídios g 0,26 - - - -
# 8 Fibra alimentar g 0,84 - - - -
# 9 Álcool g 0,00 - - - -
# 10 Cinzas g 0,39 - - - -
# # … with 27 more rows, and 2 more variables: $Referências <chr>, $`Tipo de dados` <chr>
If you wanted to scrape all the codes and get all of the tables, you could do that with the following. First, we can set up a loop to scrape all of the links. By investigating the source, you would find, as you did, that every product link has "cod_produto" in its href attribute, so you could use an XPath selector to keep only those a tags whose href contains that string. The loop requests successive pages until it reaches one that doesn't have any such links. This gives you 5203 links.
library(rvest)
library(glue)
all_links <- NULL
links <- "init"  # placeholder so the loop body runs at least once
i <- 1
while(length(links) > 0){
  url <- glue("http://www.tbca.net.br/base-dados/composicao_alimentos.php?pagina={i}&atuald=3")
  h <- read_html(url)
  # keep only hrefs that contain "cod_produto"
  links <- h %>%
    html_nodes(xpath = "//a[contains(@href,'cod_produto')]") %>%
    html_attr("href") %>%
    unique()
  all_links <- c(all_links, links)
  i <- i+1
}
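To see what the XPath filter and the stopping condition are doing without hitting the site, here is a small offline sketch. The HTML fragment is made up for illustration (the real markup on the site may differ), and the code C0001X is a hypothetical placeholder; only C0195C comes from the example above.

```r
library(rvest)

# Hypothetical page fragment; the real pages may be structured differently.
html <- '<html><body>
  <a href="int_composicao_estatistica.php?cod_produto=C0195C">Food A</a>
  <a href="int_composicao_estatistica.php?cod_produto=C0195C">Food A (repeat)</a>
  <a href="int_composicao_estatistica.php?cod_produto=C0001X">Food B</a>
  <a href="composicao_alimentos.php?pagina=2">Next page</a>
</body></html>'

page <- read_html(html)

# Keep only a tags whose href contains "cod_produto", then drop duplicates.
demo_links <- page %>%
  html_nodes(xpath = "//a[contains(@href,'cod_produto')]") %>%
  html_attr("href") %>%
  unique()
demo_links

# A page with no matching links gives a zero-length vector,
# which is the condition that ends the while loop.
empty <- read_html("<html><body><p>No results</p></body></html>") %>%
  html_nodes(xpath = "//a[contains(@href,'cod_produto')]") %>%
  html_attr("href")
length(empty)
```

The pagination link is dropped because its href does not contain the string, and the repeated product link is collapsed by unique().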
EDIT
Next, we can follow each link and pull the table out of it, storing the table in the list called tabs. In answer to the question about how to get the name of the product in the data, there are two easy things to do. The first is to make the table into a data frame and then make a variable (I called it code) in the data frame that holds the product code. The second is to set the list names to be the product code. The code below has been edited to do both things.
all_links <- unique(all_links)
tabs <- vector(mode="list", length=length(all_links))
# seq_along() is safe even if all_links is empty (1:0 would loop twice)
for(i in seq_along(all_links)){
  url <- glue("http://www.tbca.net.br/base-dados/{all_links[i]}")
  # everything after the last "=" in the URL is the product code
  code <- gsub(".*=(.*)$", "\\1", url)
  h <- read_html(url)
  tmp <- html_table(h, fill=TRUE)[[1]]
  tmp <- as.data.frame(tmp)
  tmp$code <- code
  tabs[[i]] <- tmp
  names(tabs)[i] <- code
}
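Once tabs is filled, you will probably want one long data frame rather than a list. The following is a minimal sketch of that step using toy stand-ins for two scraped tables, so it runs offline. The column name Valor, the code C0001X and its values are made up for illustration; it also assumes every table has the same columns, which I have not checked for all 5203 products.

```r
# The same regex as in the loop: keep everything after the last "=".
code_demo <- gsub(".*=(.*)$", "\\1",
                  "http://www.tbca.net.br/base-dados/int_composicao_estatistica.php?cod_produto=C0195C")
code_demo

# Toy stand-ins for the scraped tables (hypothetical, shortened columns).
tabs_demo <- list(
  C0195C = data.frame(Componente = c("Energia", "Umidade"),
                      Unidades   = c("kcal", "g"),
                      Valor      = c("136", "65,5"),
                      code       = "C0195C"),
  C0001X = data.frame(Componente = c("Energia", "Umidade"),
                      Unidades   = c("kcal", "g"),
                      Valor      = c("570", "1,2"),
                      code       = "C0001X")
)

# Stack the tables; the code column says which product each row came from.
all_tabs <- do.call(rbind, tabs_demo)
rownames(all_tabs) <- NULL

# The values come in as text with decimal commas, so convert them to numbers.
all_tabs$valor_num <- as.numeric(sub(",", ".", all_tabs$Valor, fixed = TRUE))
all_tabs
```

Because each table already carries its code, nothing is lost when the list names are dropped by the row-bind. On the real data, cells containing "-" would turn into NA (with a warning) under as.numeric().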