2

Does anyone have experience scraping data from the Yahoo! Finance key statistics page with R? I am familiar with scraping data directly from HTML using read_html(), html_nodes(), and html_text() from the rvest package. However, the MSFT key stats page is more complicated, and I am not sure whether the stats are delivered in XHR, JS, or Doc. I am guessing the data is stored as JSON. If anyone knows a good way to extract and parse the data from this page with R, please answer. Thanks in advance!

Alternatively, if there is a more convenient way to extract these metrics via quantmod or Quandl, please let me know; that would be an extremely good solution!

tonykuoyj
  • As an alternative, you can look into the `getFinancials()` and `viewFinancials()` methods in `quantmod`. They use data from Google Finance, though, and other `src` parameters are not implemented yet. – R.S. Oct 25 '16 at 17:43
  • With `docl = htmlParse('http://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT')` you can see a section `(function (root) { /* -- Data -- */` where the data apparently lives. For example, `"beta":{"raw":1.39107,"fmt":"1.39"}`. Good luck! – Robert Oct 25 '16 at 17:46
  • Thanks @Robert, I also found another document in XHR, [Y! Finance Stats](https://query2.finance.yahoo.com/v10/finance/quoteSummary/MSFT?formatted=true&crumb=loFaprfreJS&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com), which stores clean JSON for the metrics! Thanks a lot, I will share the parsing script later on. – tonykuoyj Oct 27 '16 at 05:13
  • Check out [these answers](http://stackoverflow.com/questions/2614767/using-r-to-analyze-balance-sheets-and-income-statements/15975391#comment64534601_15975391). – hvollmeier Oct 27 '16 at 06:51
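
For reference, here is a minimal sketch of flattening the `raw`/`fmt` JSON structure mentioned in the comments above. It assumes the jsonlite package is installed. Apart from the `beta` entry quoted above, the sample JSON is hypothetical: the `defaultKeyStatistics` wrapper and the `exampleMetric` field are placeholders, and the real response may be nested differently.

```r
library(jsonlite)

# Hypothetical sample shaped like the snippet quoted in the comments.
# Only the "beta" entry comes from the thread; the rest is a placeholder.
sample_json <- '{
  "defaultKeyStatistics": {
    "beta":          {"raw": 1.39107, "fmt": "1.39"},
    "exampleMetric": {"raw": 1000,    "fmt": "1k"}
  }
}'

stats <- fromJSON(sample_json)$defaultKeyStatistics

# Each metric is a small list holding a numeric "raw" and a display "fmt" value.
raw_values <- vapply(stats, function(x) as.numeric(x$raw), numeric(1))
fmt_values <- vapply(stats, function(x) as.character(x$fmt), character(1))

key_stats <- data.frame(metric = names(stats),
                        raw = unname(raw_values),
                        fmt = unname(fmt_values),
                        stringsAsFactors = FALSE)
print(key_stats)
```

Working from `raw` rather than `fmt` avoids having to parse suffixes such as "B" or "M" afterwards.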

3 Answers

7

I know this is an older thread, but I used it to scrape the Yahoo analyst tables, so I figured I would share.

# Scrape the analyst tables from Yahoo! Finance
library(XML)

symbol <- "HD"
# Build the URL from the symbol, in both the path and the query string
url <- paste0("https://finance.yahoo.com/quote/", symbol, "/analysts?p=", symbol)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

# Tables are selected by position, so this depends on the page layout
earningEstimates <- readHTMLTable(tableNodes[[1]])
revenueEstimates <- readHTMLTable(tableNodes[[2]])
earningHistory <- readHTMLTable(tableNodes[[3]])
epsTrend <- readHTMLTable(tableNodes[[4]])
epsRevisions <- readHTMLTable(tableNodes[[5]])
growthEst <- readHTMLTable(tableNodes[[6]])

Cheers, Sody
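
Since the tables above are picked by position, one option is to check the table count first and collect the results in a named list. This is only a sketch under assumptions: the inline HTML is a hypothetical stand-in for the downloaded page, and the two names are assumed to match the order of tables on the page.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: two tiny tables.
page_text <- '<html><body>
<table><tr><th>Estimate</th><th>Value</th></tr><tr><td>Avg</td><td>1.0</td></tr></table>
<table><tr><th>Estimate</th><th>Value</th></tr><tr><td>Avg</td><td>2.0</td></tr></table>
</body></html>'

html <- htmlParse(page_text, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")

# Names are assumed to follow the order in which tables appear on the page.
tableNames <- c("earningEstimates", "revenueEstimates")

# Fail early if the page has fewer tables than expected.
stopifnot(length(tableNodes) >= length(tableNames))

tables <- lapply(seq_along(tableNames),
                 function(i) readHTMLTable(tableNodes[[i]]))
names(tables) <- tableNames
str(tables)
```

Individual tables are then available as `tables$earningEstimates` and so on.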

Aaron Soderstrom
4

I gave up on Excel a long time ago. R is definitely the way to go for things like this.

library(XML)

stocks <- c("AXP","BA","CAT","CSCO")

for (s in stocks) {
      url <- paste0("http://finviz.com/quote.ashx?t=", s)
      webpage <- readLines(url)
      html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
      tableNodes <- getNodeSet(html, "//table")

      # ASSIGN TO STOCK NAMED DFS
      assign(s, readHTMLTable(tableNodes[[9]], 
                header= c("data1", "data2", "data3", "data4", "data5", "data6",
                          "data7", "data8", "data9", "data10", "data11", "data12")))

      # ADD COLUMN TO IDENTIFY STOCK 
      df <- get(s)
      df['stock'] <- s
      assign(s, df)
}

# COMBINE ALL STOCK DATA 
stockdatalist <- mget(stocks)
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), seq_len(ncol(stockdata) - 1))]

# SAVE TO CSV
write.table(stockdata, "C:/Users/your_path_here/Desktop/MyData.csv", sep=",", 
            row.names=FALSE, col.names=FALSE)

# REMOVE TEMP OBJECTS
rm(df, stockdatalist)
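
The same combine step can also be written without `assign()` and `get()`, by building a list of data frames and binding them once. This is a minimal sketch, not the code above: `fetch_stock_table()` is a hypothetical stand-in for the scraping step, and it returns placeholder values so the example runs without network access.

```r
# Hypothetical stand-in for the scraping step; returns placeholder values.
fetch_stock_table <- function(symbol) {
  data.frame(metric = c("metric1", "metric2"),
             value = c("n/a", "n/a"),
             stringsAsFactors = FALSE)
}

stocks <- c("AXP", "BA", "CAT", "CSCO")

# Build one data frame per symbol, with the ticker as the first column.
per_stock <- lapply(stocks, function(s) {
  cbind(stock = s, fetch_stock_table(s), stringsAsFactors = FALSE)
})

# Stack all per-stock data frames into one.
stockdata <- do.call(rbind, per_stock)
print(stockdata)
```

Because the ticker is added as the first column up front, no column reordering or temporary-object cleanup is needed afterwards.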
2

When I use the methods shown here with the XML library, I get a warning: `Warning in readLines(page) : incomplete final line found on 'https://finance.yahoo.com/quote/DIS/key-statistics?p=DIS'`
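
That warning only means the content does not end with a newline; the lines are still read. If you want to keep `readLines()`, a minimal sketch of silencing it with the `warn` argument looks like this. The temp file is a stand-in for the web page so the example runs offline.

```r
# Stand-in for a web response: a temp file whose last line has no newline.
tmp <- tempfile(fileext = ".html")
cat("<html>\n<body></body>\n</html>", file = tmp)

# readLines(tmp) would warn about an incomplete final line;
# warn = FALSE suppresses that warning and still returns every line.
lines <- readLines(tmp, warn = FALSE)
length(lines)
```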

We can use rvest and xml2 for a cleaner approach. This example demonstrates how to pull a key statistic from the Yahoo! Finance key-statistics page. Here I want to obtain the float of an equity. I don't believe float is available from quantmod, but some of the other key stats values are. You'll have to reference the list.

library(xml2)
library(rvest)

getFloat <- function(stock) {
    url <- paste0("https://finance.yahoo.com/quote/", stock, "/key-statistics?p=", stock)
    tables <- read_html(url) %>%
        html_nodes("table") %>%
        html_table()
    # Float is read by position (table 3, row 4, column 2), so this depends on the page layout
    float <- as.character(tables[[3]][[4, 2]])
    # The last character is the magnitude suffix (k, M, or B), if any
    last <- substr(float, nchar(float), nchar(float))
    # Strip letters and convert what remains to a number
    float <- as.numeric(gsub("[a-zA-Z]", "", float))
    if (last == "k") {
        float <- float * 1000
    } else if (last == "M") {
        float <- float * 1000000
    } else if (last == "B") {
        float <- float * 1000000000
    }
    return(float)
}
getFloat("DIS")

[1] 1.81e+09

That's a lot of shares of Disney available.
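
The suffix handling inside getFloat() can also be pulled out into a small vectorized helper, which is easier to test on its own. This is a sketch: `parse_abbrev()` is a hypothetical name, and it only handles the same k, M, and B suffixes as above.

```r
# Convert strings such as "1.81B" or "250M" to numbers.
# parse_abbrev() is a hypothetical helper; it handles k, M, and B only.
parse_abbrev <- function(x) {
  multipliers <- c(k = 1e3, M = 1e6, B = 1e9)
  # The last character is the suffix, if the value has one.
  suffix <- substr(x, nchar(x), nchar(x))
  # Keep only digits, the decimal point, and a minus sign.
  number <- as.numeric(gsub("[^0-9.-]", "", x))
  # Values without a known suffix are left unscaled.
  scale <- ifelse(suffix %in% names(multipliers), multipliers[suffix], 1)
  number * unname(scale)
}

parse_abbrev(c("1.81B", "250M", "12k", "42"))
```

The table lookup and the parsing are then separate, so a change in the page layout only affects the lookup.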

kraggle