
In my first post, I would like to share my pet project. I am building a machine learning algorithm that can assign buy/sell/hold positions to securities. The first step of this project is to build the data frame that contains the securities' basic information as well as relevant predictive indicators. I am using rvest to scrape data from two different websites that provide stock information. Below is my code:

library(rvest)

#load all variables of interest
for(i in 1:nrow(stockdata)){
  #price: pull the last-sale figure from the Nasdaq quote page
  url <- paste0('https://www.nasdaq.com/symbol/', tolower(stockdata[,1][i]))
  html <- read_html(url)
  #select the text I want
  Price <- html_nodes(html, '#qwidget_lastsale')
  stockdata$Price[i] <- html_text(Price)

  #price change percentage: pull the change figure from the Finviz snapshot table
  url <- paste0('https://finviz.com/quote.ashx?t=', stockdata[,1][i])
  html <- read_html(url)
  #select the text I want
  change <- html_nodes(html, '.table-dark-row:nth-child(12) .snapshot-td2:nth-child(12) b')
  stockdata$PriceChange[i] <- html_text(change)
}

I have truncated the code, but the above works and pulls the data. Unfortunately, the process is horrifically slow: I have many more variables to pull, and each one slows it down further. My knowledge of vectorization is decent, but I am not sure how to apply it here. Any tips on making this execute faster, or general advice on speedier iteration, would be greatly appreciated.

SoloMatt
  • The first step in speeding up any piece of code is to identify the part that is slow. You need to [profile your code](http://adv-r.had.co.nz/Profiling.html) (see the profiling sketch after these comments). – Gregor Thomas Nov 13 '18 at 15:38
  • The other piece of general good practice is to do as much as you can once, up-front, rather than inside a loop. For example, instead of putting `url <- paste0(...)` inside the loop, create the URLs outside the loop: `price_url <- paste0('https://www.nasdaq.com/symbol/', tolower(stockdata[,1]))`, and then use `price_url[i]` inside the loop (see the sketch after these comments). Using `data.table` instead of data frames can also give a good speed-up in general. – Gregor Thomas Nov 13 '18 at 15:42
  • Your loop makes two calls out to the internet for each symbol. To speed this up, you need to find a method to pass multiple symbols to each `read_html` statement and/or find a site that contains all the information you are requesting. Nasdaq should have an API, which would be faster than web access. Be sure not to violate the terms of service of the websites you are scraping. – Dave2e Nov 13 '18 at 15:57
  • @Gregor This doesn't work. I need something like a list that stores the 26 different URLs so that I can iterate over it. This method puts all of the symbols at the end of a single URL in a vector. – SoloMatt Feb 22 '19 at 01:39
  • You can iterate over a vector. `vec = c("http://google.com", "http://stackoverflow.com")` is a vector with 2 URLs. Here is an example of iterating over it: `for (i in seq_along(vec)) print(vec[i])`. You probably need a list to store your *results*, but you certainly do not need a list to store 26 different URLs - a vector works just fine for that (see the sketch after these comments). – Gregor Thomas Feb 22 '19 at 01:43
  • You could use parallel computation to parallelize your loop (see the `doParallel` and `parallel` R packages, and the sketch after these comments). – Emmanuel Hamel Apr 07 '23 at 22:48
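
Following the profiling advice in the first comment, here is a minimal sketch using base R's `Rprof`; the output file name and the `scrape_prices()` wrapper are hypothetical stand-ins for the loop in the question.

#wrap the scraping loop in a (hypothetical) function so it shows up in the profile
Rprof("scrape.prof")                 #start profiling to an arbitrary output file
scrape_prices(stockdata)             #hypothetical wrapper around the loop above
Rprof(NULL)                          #stop profiling
summaryRprof("scrape.prof")$by.self  #time per call, most expensive first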
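
Here is a minimal sketch of the vectorized set-up the comments describe, assuming the same `stockdata` data frame and CSS selectors as in the question, and that each selector matches exactly one node per page; `scrape_text()` is a hypothetical helper.

library(rvest)

#build all URLs up-front with one vectorized paste0 call;
#these are plain character vectors, one element per symbol
symbols    <- stockdata[,1]
price_url  <- paste0('https://www.nasdaq.com/symbol/', tolower(symbols))
change_url <- paste0('https://finviz.com/quote.ashx?t=', symbols)

#helper: fetch one page and extract the text at a CSS selector
scrape_text <- function(url, selector) {
  html_text(html_nodes(read_html(url), selector))
}

#the network calls still happen one per URL, but all set-up work is done once
stockdata$Price       <- vapply(price_url, scrape_text, character(1),
                                selector = '#qwidget_lastsale')
stockdata$PriceChange <- vapply(change_url, scrape_text, character(1),
                                selector = '.table-dark-row:nth-child(12) .snapshot-td2:nth-child(12) b')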
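
And a minimal sketch of the parallel approach from the last comment, using the `doParallel` and `foreach` packages. The worker count is an arbitrary choice, and `price_url` is the vector from the sketch above; note that parallel requests hit the sites harder, so check their terms of service first.

library(doParallel)

cl <- makeCluster(4)     #start a small cluster; 4 workers is arbitrary
registerDoParallel(cl)

#fetch the pages in parallel; each worker loads rvest itself
prices <- foreach(u = price_url, .combine = c, .packages = 'rvest') %dopar% {
  html_text(html_nodes(read_html(u), '#qwidget_lastsale'))
}
stopCluster(cl)

stockdata$Price <- prices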

0 Answers