
I am new to web scraping. I am trying to scrape a table with the following code, but I am unable to get it. The source of the data is:

https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1


library(XML)

url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])

I copied this approach from Web scraping of key stats in Yahoo! Finance with R, where it is applied to Yahoo Finance data.

In my opinion the table I want is the twelfth one, i.e. readHTMLTable(tableNodes[[12]]), but when I pass tableNodes[[12]] it always gives me an error:

Error in do.call(data.frame, c(x, alis)) : 
  variable names are limited to 10000 bytes

Please suggest a way to extract the table, and to combine the data from the other tabs as well (Fundamental, Technical and Performance).

Mudassar
1 Answer


This data is returned dynamically as JSON. In R (which behaves differently from Python's requests here), the response comes back as HTML, from which you can extract a given page's results as JSON. Each page includes the info for all the tabs plus 50 records. The first page reports the total record count, so you can calculate the total number of pages to loop over to get all results, combining them into a final dataframe as you go; in each new POST request you set the pn param of the XHR POST body to the desired page number. There are two required headers.

It is probably a good idea to write a function that accepts a page number and returns that page's JSON as a dataframe, then apply it via a tidyverse package to handle the looping and the combining of results into the final dataframe.
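For instance, a minimal sketch of that pattern, where get_page() is a hypothetical stand-in for the real request/parse logic shown further down:

library(purrr)
library(dplyr)

# Hypothetical helper: the real version would POST with the given page
# number and parse the returned JSON (see get_data() below).
get_page <- function(pn) {
  data.frame(page = pn, rows_on_page = 50) # placeholder one-row result
}

num_pages <- 3 # in practice derived from totalCount on page 1
all_pages <- map_dfr(seq_len(num_pages), get_page) # loop + row-bind in one call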

library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # page number; alter this in a loop over all pages (50 rows per page)
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text() # the JSON payload sits in a <p> node of the returned HTML
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)
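A quick, purely illustrative sanity check on what came back:

str(page1_data, max.level = 1) # inspect the returned columns
total_rows                     # total records reported by the endpoint
num_pages                      # pages needed at 50 records per page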

Below is my current attempt at combining the pages, on which I would welcome feedback. It keeps all of the returned columns, for all pages, so I have to handle missing columns and differing column order, as well as one column that is itself a data.frame (viewData). Since far more columns are returned than are visible on the page, you could instead subset the returned columns with a mask covering just the columns present in the tabs (see the sketch after the code below).

library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # page number; alter this in a loop over all pages (50 rows per page)
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

# Fetch one page of results. Page 1 is returned as the raw JSON string so
# that totalCount can be read from it; later pages come back as dataframes.
get_data <- function(page_number){
  data['pn'] = page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if(page_number==1){ return(s) }
  else{return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))}
}

# Flatten the nested viewData data.frame column into ordinary columns,
# putting its fields first.
clean_df <- function(df){
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}

initial_data <- get_data(1) # page 1 as the raw JSON string
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])')[1, 2])))
total_rows <- str_match(initial_data, '"totalCount":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)

# Remaining pages arrive as dataframes; clean each one the same way
dfs <- map(.x = 2:num_pages,
           .f = ~clean_df(get_data(.)))

# Bind page 1 and the rest, tolerating missing columns and differing order
r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)
write_csv(r, 'data.csv')
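And a sketch of the column-mask idea mentioned above; the names in visible_cols are placeholders, not the endpoint's confirmed field names:

# Keep only the columns of interest. These names are illustrative; substitute
# the actual fields shown in the screener tabs.
visible_cols <- c("name", "last", "eq_market_cap", "pair_change_percent")
r_small <- r[, intersect(visible_cols, names(r)), with = FALSE] # data.table column subset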
QHarr