
I'm currently working on a project that involves querying yahoo-finance for many different ticker symbols. The bottleneck is acquiring the data from Yahoo, so I was wondering whether there is a way to speed this up.

If I used multiple machines to query and then aggregated the data, would that help? I only have one physical machine; how might I go about doing that?

Thanks!

EDIT: Currently, I'm using Node.js, yahoo-finance, and Q.deferred to ask Yahoo for historical data. I create one promise per ticker, wait for all of them with Q.all(), and then persist the data once every promise is fulfilled.

    var data = [];
    tickers = ["goog", "aapl", ...];
    ...
    Q.all(_.map(tickers, function(symbol) { 
        return getYahooPromise(symbol);
     }))
    .done( function() { persistData(data) });

getYahooPromise retrieves data for the ticker symbol and pushes it into the data array. Once all promises are resolved, the data is persisted in a MySQL database.

SECOND EDIT: More code:

    var sequentialCalls = [];

    // Queue one persistYahooChunk call per chunk of tickers.
    for ( var i = 0; i < tickers.length / chunkSize; i++ ) {
        sequentialCalls.push( persistYahooChunk );
    }
    // Close the MySQL connection after the last chunk.
    sequentialCalls.push( function(callback) {
        connection.end();
        callback();
    });

    async.series( sequentialCalls );



    exports.persistYahooChunk = function(callback) {
        console.log("Starting yahoo query");
        var currentTickers = tickers.slice(currentTickerIndex, currentTickerIndex + chunkSize);

        return yahooFinance.historical( {
            symbols: currentTickers,
            from: "2015-01-28",
            to: "2015-02-05"
        }).then( function(result) {
            console.log("Query " + currentTickerIndex + "/" + tickers.length + " completed");
            currentTickerIndex += chunkSize;
            // A symbol is marked valid if any quotes came back for it.
            var toPersist = _.map(result, function(quotes, symbol) {
                return [symbol, quotes.length !== 0];
            });

            var query = "INSERT INTO `ticker` (`symbol`, `valid`) VALUES ?";
            connection.query(query, [toPersist], function(err, result) {
                if (err) {
                    console.log(err);
                }
                //console.log(result);

                callback();
            });
        });
    };

ZenPylon
  • Are you executing 1 query per ticker or a single query for all the tickers? Also, show some code. – Bruno Apr 14 '15 at 16:44
  • Ah, just edited - one query per ticker. If that's the cause of a slow-down, is there a way I can atomize the queries so that if an operation goes bad I don't lose all the data? – ZenPylon Apr 14 '15 at 16:47

1 Answer


The bottleneck is that you are issuing one query per ticker.

Depending on the data you need to pull, if you could do a single query that includes all your tickers it would be much faster.

Here is an example that gets the current prices for a list of tickers with a single query:

http://finance.yahoo.com/webservice/v1/symbols/A,B,C,D,E/quote?format=json
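To combine this with chunking, here is a minimal sketch that splits the ticker list into groups and builds one URL per group, following the URL pattern above. The helper names `chunk` and `buildQuoteUrl` are hypothetical, and how many symbols the endpoint accepts per request is an assumption you would need to check yourself.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Build one quote URL for a whole group of symbols, using the
// URL pattern quoted above. The per-request symbol limit (if any)
// is not known here and is an assumption to verify.
function buildQuoteUrl(symbols) {
  const list = symbols.map(function (s) {
    return encodeURIComponent(s);
  }).join(",");
  return "http://finance.yahoo.com/webservice/v1/symbols/" + list + "/quote?format=json";
}

// Five tickers in chunks of two give three requests instead of five.
const tickers = ["goog", "aapl", "msft", "ibm", "ge"];
const urls = chunk(tickers, 2).map(buildQuoteUrl);
console.log(urls);
```

The fewer, larger requests you can make, the less per-request latency you pay, so it is worth trying the largest chunk size the service tolerates.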

Bruno
  • Is there a way I can store parts of the query, so that it's not all-or-nothing? Or would I have to dive into yahoo-finance code to do so? – ZenPylon Apr 14 '15 at 16:53
  • I'm not sure I understand what you mean or what you want to achieve. – Bruno Apr 14 '15 at 16:54
  • For example, if the internet connection dies during the query, I will lose all the data. Taking a look at my current implementation, it doesn't fix this issue, but I wonder if there's a way to do it. – ZenPylon Apr 14 '15 at 16:56
  • There is a way for sure but there is no magic...if you need a fail-safe mechanism you would need to implement it yourself since Yahoo only serves a response to some request. :\ Maybe provide a manual way to re-do the query in case of failure – Bruno Apr 14 '15 at 17:01
  • Ok, so I implemented this method using "chunking", so I'm grabbing stock tickers (i.e. I'm querying) in chunks of 10, 20, whatever. But it doesn't seem any faster. Does Yahoo limit the download speed? – ZenPylon Apr 15 '15 at 00:28
  • How many tickers do you have, and do you have benchmarks with timings to back this up? Can you post all your code, including the throttling, so we have a good understanding of your whole setup? – Bruno Apr 15 '15 at 00:56
  • I've posted the code... Embarrassingly, I'm going through all the combinations of 2-letter tickers and storing which ones are valid and which aren't. I'm not familiar with throttling. Thanks so much for your help. – ZenPylon Apr 15 '15 at 01:11
  • All the permutations of 2 letters come to 676 tickers to query. It's a strange method, but the volume is not too big. My guess is you can pull all the price data with 2 queries and maybe even 1. I would recommend creating a new question with your code update and with the `javascript` tag; hopefully someone can help you. – Bruno Apr 15 '15 at 13:46
  • @ajdecker1022 For a list of valid US tickers (including ADRs), consider using `ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqtraded.txt`. It lists ~8k securities. For a description of the files, see `http://www.nasdaqtrader.com/trader.aspx?id=symboldirdefs`. – George Apr 18 '15 at 03:21
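
On the all-or-nothing concern raised in the comments above: one possible approach is to persist each chunk as soon as it arrives and record the symbols of any chunk that fails, so they can be retried later. The sketch below assumes a Node.js version with `async`/`await`; `persistInChunks`, `fetchChunk`, and `persistChunk` are hypothetical names, and the stubs stand in for the real Yahoo request and MySQL insert.

```javascript
// fetchChunk(symbols) and persistChunk(data) are hypothetical helpers
// supplied by the caller; both are assumed to return promises.
async function persistInChunks(tickers, chunkSize, fetchChunk, persistChunk) {
  const failed = [];
  let saved = 0;
  for (let i = 0; i < tickers.length; i += chunkSize) {
    const symbols = tickers.slice(i, i + chunkSize);
    try {
      const data = await fetchChunk(symbols);
      await persistChunk(data); // stored before the next request starts
      saved += symbols.length;
    } catch (err) {
      // Only this chunk is lost; remember its symbols for a retry.
      failed.push(...symbols);
    }
  }
  return { saved, failed };
}

// Stubs for illustration only: any chunk containing "zz" fails.
const stubFetch = async function (symbols) {
  if (symbols.includes("zz")) {
    throw new Error("simulated network failure");
  }
  return symbols.map(function (s) { return [s, true]; });
};
const stored = [];
const stubPersist = async function (rows) {
  stored.push(...rows);
};

persistInChunks(["aa", "ab", "zz", "ac", "ad"], 2, stubFetch, stubPersist)
  .then(function (result) {
    console.log(result); // result.failed holds the symbols to retry
  });
```

A failed chunk costs at most `chunkSize` symbols, and passing `result.failed` back into `persistInChunks` gives a simple manual retry.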