I have a .csv file containing the transaction IDs of nearly 1 million transactions associated with a bitcoin wallet (both sent and received), which I read into RStudio as a tibble. I am now trying to add a column listing the fee for each transaction, which has to be retrieved with a separate API call per transaction.

For example, to get the fee for the txid 73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f, I have to open: https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f and read the data there directly.
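
From R, that single value can be read with a one-liner (a minimal sketch; scan() fetches the URL and parses the plain-text response as a number):

fee <- scan("https://blockchain.info/q/txfee/73336c8b2f8bbf9c4165de515765463d6e835a9f3f87bf822d8bcb23c074ae7f", quiet = TRUE)
fee  # single numeric value returned by the endpoint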

This is my current code to record fees for the first 500 transactions:

library(readr)

# Read the first 500 transactions (the file has no header row)
tx <- read_csv("transactions.csv", col_names = c("txid", "amount"), n_max = 500)

tx$fee <- 0
for (i in 1:nrow(tx)) {
    # One blocking HTTP request per transaction
    tx$fee[i] <- scan(paste0("https://blockchain.info/q/txfee/", tx$txid[i]), quiet = TRUE)
}
write_csv(tx, "tx_with_fees.csv")

Clearly, my biggest bottleneck is the time taken by the HTTP request itself; the method used to read the data hardly seems to matter (I tried curl, GET and scan). With the above code, it takes around 0.4 seconds to record the fee for each transaction.

What I did next was simply to open 5 instances of RStudio and run the code on a different block of 100 rows in each instance. This way I was able to process each row in 0.1 seconds on average. That's a 4x speed-up, but I am sure there are more efficient ways to parallelise than opening multiple instances of RStudio.

What would be the best way to do that?
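
Edit: from skimming the docs of the base ‘parallel’ package, I imagine something like the sketch below might replace the five RStudio instances — the cluster size of 5 and the NA-on-error fallback are guesses on my part, not tested:

library(parallel)

# Fetch one fee; return NA instead of aborting the whole run on a failed request
get_fee <- function(txid) {
    tryCatch(
        scan(paste0("https://blockchain.info/q/txfee/", txid), quiet = TRUE),
        error = function(e) NA_real_
    )
}

cl <- makeCluster(5)  # one worker per former RStudio instance
tx$fee <- unlist(parLapply(cl, tx$txid, get_fee))
stopCluster(cl)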

  • It’s unclear what the question is: since you already tagged the question with [parallel], did you try using the ‘parallel’ R package? Why isn’t it suitable for your purposes? — Anyway, parallelising the site access is probably *not a good solution*: you are likely running afoul of their user agreement and might be blocked/IP-banned. Check the user agreement of the website carefully regarding web scraping. Rather than *accelerating* access, it’s likely that you will have to *rate-limit* yourself by making access artificially *slower*. – Konrad Rudolph Apr 22 '21 at 10:39
  • @KonradRudolph no I did not. I am unable to figure out how to use parallel processing. I used the tag hoping someone familiar with it could give me some pointers. – Baheej Anwar Apr 22 '21 at 11:17
  • @KonradRudolph their ToS states "You shall not make requests to the API that are, in our sole discretion, excessive". "Excessive" is pretty vaguely worded, and I don't think a few hundred requests a second come off as excessive (in context, my parallel instances send ~10 requests per second). To cause something like a DoS attack, you might need to flood their servers with hundreds of thousands of requests? – Baheej Anwar Apr 22 '21 at 11:25
  • Worth considering that they may throttle your download speed if the same address is making too many requests – user438383 Apr 22 '21 at 11:31
  • @BaheejAnwar I agree that “excessive” is unfortunately vague, but tens of requests per second could easily fall under this, because you’re not the only user of their service. – Konrad Rudolph Apr 22 '21 at 12:35
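
Edit 2: following the rate-limiting suggestion in the comments, here is a minimal sketch of the opposite approach — deliberately pacing sequential requests. The 0.25-second delay is an arbitrary guess, not a documented limit:

for (i in 1:nrow(tx)) {
    Sys.sleep(0.25)  # arbitrary pause between requests; adjust to what the site tolerates
    tx$fee[i] <- scan(paste0("https://blockchain.info/q/txfee/", tx$txid[i]), quiet = TRUE)
}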
