1

I am trying to work out how to define a custom for loop in R, or if that's even possible.

Examples

A couple of things that would be nice to have are

Is it possible to define a new kind of for loop in R (and if so, how), or is this an inherent limitation of the language and hence not something that can be done?

Use cases

Here's a random example of how for_each_with_index could simplify finicky arithmetic.

Suppose we want to scrape the 36th to the 55th article from a website and assign the output to a position in a list. This works well:

library(rvest)
library(dplyr)
articles <- vector(mode = "list", length = 20)
for (i in 36:55) {
  paste0("Scraping article ", i) %>% print
  # Offset the index so that article 36 lands in slot 1 of the list
  articles[[i - 35]] <- read_html(paste0("http://afr.herokuapp.com/articles/", i)) %>%
    html_nodes("p") %>% html_text %>% paste0(collapse = "\n")
}

But we see some finicky arithmetic (36:55, i - 35 etc) that could theoretically be abstracted away through for_each_with_index enumerating over each element of the articles object, like so:

# NOT ACTUAL R CODE

library(rvest)
library(dplyr)
articles <- vector(mode = "list", length = 20)
for_each_with_index(articles, i) {
  paste0("Scraping article ", i) %>% print
  articles[[i]] <- read_html(paste0("http://afr.herokuapp.com/articles/", i + 35)) %>%
    html_nodes("p") %>% html_text %>% paste0(collapse = "\n")
}

By using for_each_with_index, we avoided the fiddly arithmetic. This example is very simple, but when the complexity goes up a few notches (various conditionals, nested loops and so on), the code gets much harder to follow and these seemingly small improvements in clarity matter more.

stevec
  • @chinsoon12 I have updated the question with an example. Sorry it's a long one. I hope it makes sense – stevec Dec 24 '19 at 01:47
  • See `lapply(36:55, function (i) {read_html(paste0(...))})` – Cole Dec 24 '19 at 02:02
  • I fail to see how `i + 35` is less finicky than `i - 35` – Hong Ooi Dec 24 '19 at 02:23
  • The essential difference is not between i + 35 and i - 35, but between using a for loop and not. For loops are highly inefficient in R, involving lots of unnecessary copying. They work just fine for looping over a small number of strings, but if you have to do computation on hundreds of thousands of records, for loops will kill your performance. – BigFinger Dec 24 '19 at 03:30
  • @BigFinger is wrong: for loops are not particularly inefficient. Like the other control constructs (if, while, etc.) they are internally function calls, just with special rules in the parser to construct the call. If you want different semantics than for, you can define your own function, but you can't change the syntax of the language, so it will need to be done with an infix operator (like Martin Morgan's answer) or a regular function call. – user2554330 Dec 24 '19 at 10:49

4 Answers

3

The foreach package provides one model

res = foreach(i = 1:3) %do% {
    sqrt(i)
}

This uses R's %any% construct, an infix operator that can be defined by the user, so you could write your own, for example:

`%with_index%` <- function(lhs, rhs) {
    ## implement ...
    Map(function(i) {
        list(i, rhs(lhs[[i]]))
    }, seq_along(lhs))
}

1:10 %with_index% sqrt

The foreach package has also defined the foreach() function to set up the left-hand side. %do% has to be written in such a way that the implementation works for a relatively general rhs, and this is not a trivial task.

Implementing for_each() %with_index% {} would probably be quite interesting, and very educational.
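As a starting point for that exercise, here is a minimal sketch of one possible design. It is not how the foreach package works, and the names for_each, index and value are hypothetical choices made for this illustration. The operator captures the braces unevaluated with substitute() and evaluates them once per element, in a fresh environment whose parent is the caller.

```r
# Hypothetical helper: tag the input so the operator can recognise it.
for_each <- function(x) structure(list(values = x), class = "for_each")

`%with_index%` <- function(lhs, rhs) {
  stopifnot(inherits(lhs, "for_each"))
  body <- substitute(rhs)      # capture the { ... } block without evaluating it
  caller <- parent.frame()     # where free variables in the block are looked up
  lapply(seq_along(lhs$values), function(i) {
    # One fresh environment per iteration, so `index` and `value`
    # do not leak into the caller's workspace.
    env <- new.env(parent = caller)
    env$index <- i
    env$value <- lhs$values[[i]]
    eval(body, env)
  })
}

res <- for_each(c(7, 8, 9)) %with_index% {
  index * value
}
```

Unlike a real for loop, assignments made inside the block stay in the per-iteration environment; the results come back as a list, in the same spirit as %do%.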

Martin Morgan
1
  1. It is best to avoid for loops in R, especially for your main computation. Looping in R is achieved with functions like lapply, sapply, mapply, tapply. These are flexible and can be customized by passing your own functions into them.
  2. Have a look at the try() function, which you can wrap your code in. With silent = TRUE the error message is suppressed; in either case try() catches the error, returns an object of class "try-error", and lets the rest of the code carry on.
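A minimal sketch of the try() suggestion, using made-up inputs rather than the scraping code: one element fails, the others still produce results, and the failures can be found afterwards by checking for the "try-error" class.

```r
# Illustrative inputs only: the second element makes sqrt() fail.
inputs <- list(4, "oops", 9)

# try() catches the error, so lapply() finishes all three iterations.
results <- lapply(inputs, function(x) try(sqrt(x), silent = TRUE))

# Flag the iterations that failed.
failed <- vapply(results, inherits, logical(1), what = "try-error")
failed
```

tryCatch() is the more flexible alternative when a fallback value (for example NA) should be returned instead of an error object.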

Thanks for posting an example. The solution by @HubertL is the right approach. An index is not needed in this case. If you really want to pass the index to lapply instead of the actual page number, this can be easily done:

my_scraper <- function(article_id) {
  paste0("Scraping article ", article_id) %>% print
  # article_id is the list position (1 to 20); the page number is offset by 35
  read_html(paste0("http://afr.herokuapp.com/articles/", article_id + 35)) %>%
    html_nodes("p") %>%
    html_text %>%
    paste0(collapse = "\n")
}

articles <- lapply(1:20, my_scraper)
BigFinger
  • how long does it take to perform a million iterations, `system.time( for (i in 1:1000000) {} )` ? What about `lapply(1:1000000, function(i) {})` ? – Martin Morgan Dec 24 '19 at 03:46
  • Yes, that's a great question. If you put it like this, there should be no benefit to using lapply over the for loop. I need to revise my answer. In this example, read_html operates on one element of the list at a time, so there doesn't seem to be a way to optimize the code. – BigFinger Dec 24 '19 at 04:09
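The comparison proposed in the comments above can be run as follows. This is only a rough sketch: the timings depend on the machine and the R version, so no expected numbers are shown, and the name timings is just an illustrative choice.

```r
n <- 1e6

# An empty for loop over a million indices.
loop_time <- system.time(for (i in seq_len(n)) {})

# The same number of iterations through lapply() with a do-nothing function.
lapply_time <- system.time(invisible(lapply(seq_len(n), function(i) NULL)))

# Elapsed wall-clock seconds for each variant.
timings <- c(for_loop = loop_time[["elapsed"]], lapply = lapply_time[["elapsed"]])
print(timings)
```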
1

You could do it with this function:

for_with_index <- function(var, index, seq, expr) {
  env <- parent.frame() # This is where evaluation takes place
  for (i in seq_along(seq)) {
    assign(as.character(substitute(index)), i, envir = env)
    # [[ ]] extracts the element itself, so lists behave as in a normal for loop
    assign(as.character(substitute(var)), seq[[i]], envir = env)
    eval(substitute(expr), envir = env)
  }
}

for_with_index(i, j, 7:9, cat("Entry ", j, " is ", i, "\n"))
#> Entry  1  is  7 
#> Entry  2  is  8 
#> Entry  3  is  9

If you want to use for-like syntax, it's a little harder, because you can't modify the parser. However, after parsing, for loops are just function calls, so you can still do it if you can figure out where to put the index in the call. One way might be to write it like this:

for (i in {7:9;j}) 
  cat("Entry ", j, " is ", i, "\n")

That's legal syntax, but in the standard loop it wouldn't work, because {7:9;j} evaluates the same as j, which isn't what you want. But you can write your own for loop function to handle it:

`for` <- function(var, seq, expr) { 
  env <- parent.frame()
  seq <- substitute(seq)
  if (is.call(seq) && seq[[1]] == "{" && length(seq) == 3) {
    index2 <- seq[[3]]
    seq <- eval(seq[[2]], env)
    for (index in seq_along(seq)) {
      assign(as.character(substitute(var)), seq[[index]], envir = env)
      assign(as.character(index2), index, envir = env)
      eval(substitute(expr), envir = env)
    }
  } else {
    seq <- eval(seq, env)
    oldfor <- substitute(for (var in seq) expr, 
                         list(var = substitute(var), 
                              seq = seq, 
                              expr = substitute(expr)))
    oldfor[[1]] <- base::`for`
    eval(oldfor, env)
  }
}

for (i in 7:9) 
  print(i)
#> [1] 7
#> [1] 8
#> [1] 9

for (i in {7:9; j}) 
  cat("Entry ", j, " is ", i, "\n")
#> Entry  1  is  7 
#> Entry  2  is  8 
#> Entry  3  is  9
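As a quick check of the claim above that a parsed for loop is an ordinary function call, the following sketch inspects a quoted loop and then calls `for` in prefix form. The first statement is an added precaution, not part of the answer: it removes a global override such as the one defined above, so that the built-in loop is used again.

```r
# Drop any user-defined override of `for` from the global environment.
if (exists("for", envir = globalenv(), inherits = FALSE)) {
  rm("for", envir = globalenv())
}

loop <- quote(for (i in 1:3) print(i))
class(loop)                            # "for"
identical(loop[[1]], as.name("for"))   # TRUE: the head of the call is the symbol `for`
length(loop)                           # 4: function, variable, sequence, body

# The same function can be called in prefix form.
total <- 0
`for`(k, 1:3, total <- total + k)
total
```

Because the override shadows the built-in for every later loop in the session, removing it again once it is no longer needed is probably a good habit.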
user2554330
0

Expanding on @Cole's comment, and as mentioned by @BigFinger in their answer, you should "always" think of lapply when you need a for loop:

library(rvest)
library(dplyr)

my_scraper <- function(article_id) {
  paste0("Scraping article ", article_id) %>% print
  read_html(paste0("http://afr.herokuapp.com/articles/", article_id)) %>%
    html_nodes("p") %>%
    html_text %>%
    paste0(collapse = "\n")
}

articles <- lapply(36:55, my_scraper)

lapply() builds a list so you don't have to initialize it.

lapply is not easy to use at first, but it is very convenient. If you like the tidyverse, you can also have a look at purrr::map().
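If the position in the list is wanted alongside the page number, one common base R pattern is to iterate over seq_along() and look the value up inside the function. The sketch below uses no network access; page_ids and labels are illustrative names, not part of the code above.

```r
page_ids <- 36:55

# Iterate over positions 1..20; look up the matching page number inside.
labels <- lapply(seq_along(page_ids), function(i) {
  paste0("slot ", i, " holds article ", page_ids[[i]])
})

labels[[1]]
length(labels)
```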

HubertL