I am trying to optimize a simple R code I wrote on two aspects:

1) For loops

2) Writing data into my PostgreSQL database

For 1), I have read that for loops should be avoided at all costs and that lapply is recommended instead, but I am not clear on how to translate my code below using lapply.

For 2), what I do below works, but I am not sure it is the most efficient way (for example, appending inside the loop versus rbinding all data into one R data frame and then loading that whole data frame into my PostgreSQL database).

EDIT: I updated my code with a reproducible example below.

library(rvest)  # read_html, html_nodes, html_text
library(DBI)    # dbWriteTable; 'con' is an open PostgreSQL connection

for (i in 1:100){
   search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
   download.file(search, destfile = 'scrape.html', quiet = TRUE)
   url <- read_html('scrape.html')
   github_title <- url %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
   github_link <- url %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
   df <- data.frame(github_title, github_link)
   colnames(df) <- c("title", "link")
   dbWriteTable(con, "my_database", df, append = TRUE, row.names = FALSE)
   cat(i)
}

Thanks a lot for all your inputs!

ML_Enthousiast
  • Ah! The myth continues: *I know `for` loops should be avoided at all cost and it's recommended to use `lapply`* – Parfait Nov 05 '19 at 19:00
  • So far, the for loop was "visually" the most comprehensible to me. The thing is that I have to do some regex for each of the elements (a, b, c, d), so I don't know how to perform these operations within lapply, for example. – ML_Enthousiast Nov 05 '19 at 19:13
  • Thanks for your feedback, I edited the code. I am trying to progress in R, so I am browsing many GitHub repositories. – ML_Enthousiast Nov 06 '19 at 09:26

1 Answer


First of all, it is a myth that should be thoroughly debunked that lapply is in any way faster than equivalent code using a for loop. This was fixed years ago, and a properly written for loop should be just as fast as (or faster than) the equivalent lapply.
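
A minimal sketch to illustrate the point: the two versions below do the same work, and as long as the for loop pre-allocates its output, their timings are typically comparable.

# Pre-allocated for loop vs. lapply doing the same work.
square_for <- function(n) {
  out <- numeric(n)                     # pre-allocate instead of growing the vector
  for (i in seq_len(n)) out[i] <- i^2
  out
}
square_lapply <- function(n) {
  unlist(lapply(seq_len(n), function(i) i^2))
}
# Compare e.g. with: system.time(square_for(1e6)); system.time(square_lapply(1e6))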

I will illustrate this using a for loop, as you seem to find it more intuitive. Do note, however, that I work mostly in T-SQL, so some conversion might be necessary.

n <- 1e5
outputDat <- vector('list', n)  # pre-allocate a list: one data.frame per iteration
for (i in seq_len(n)){
  # element_a ... element_d are the scraped vectors (a, b, c, d) from the question
  id <- element_a[i]
  location <- element_b[i]
  language <- element_c[i]
  date_creation <- element_d[i]
  df <- data.frame(id, location, language, date_creation)
  colnames(df) <- c("id", "location", "language", "date_creation")
  outputDat[[i]] <- df
}
## Combine data.frames
outputDat <- do.call('rbind', outputDat)
#Write the combined data.frame into the database.
##dbBegin(con)   #<= might speed up might not.
dbWriteTable(con, "my_database", outputDat, append = TRUE, row.names = FALSE)
##dbCommit(con)  #<= might speed up might not.
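
Applied to the scraping loop from the question, a rough sketch of the same pattern (reusing the selectors from the question and assuming `con` is an open DBI connection) could look like this:

library(rvest)
library(DBI)

n <- 100
pages <- vector('list', n)
for (i in seq_len(n)){
  search <- paste0("https://github.com/search?o=desc&p=", i, "&q=R&type=Repositories")
  page <- read_html(search)
  github_title <- page %>% html_nodes(xpath = "//div[@class='mt-n1']") %>% html_text()
  github_link <- page %>% html_nodes(xpath = "//div[@class='mt-n1']//@href") %>% html_text()
  pages[[i]] <- data.frame(title = github_title, link = github_link, stringsAsFactors = FALSE)
}
# One single write instead of one write per iteration.
dbWriteTable(con, "my_database", do.call('rbind', pages), append = TRUE, row.names = FALSE)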

Using Transact-SQL, you could alternatively combine everything into a single INSERT INTO statement. Here I'll deviate and use apply to iterate over the rows, as it is much more readable in this case. A for loop would once again be just as fast if done properly.

# Create the VALUES part of the statement: each row of outputDat becomes one ('...','...') tuple.
statement <- paste0("('", apply(outputDat, 1, paste0, collapse = "','"), "')", collapse = ",\n") #\n can be removed, but makes printing nicer.
##Optional: Print a bit of the statement
# cat(substr(statement, 1, 2000))

##dbBegin(con)   #<= might speed up might not.
statement <- paste0(
"
/*
  SET NOCOUNT ON seems to be necessary in the DBI API.
  It seems to react to 'n rows affected' messages.
  Note this only affects this method, not the one using dbWriteTable.
*/
--SET NOCOUNT ON
INSERT INTO [my table] values ", statement)
dbExecute(con, statement)
##dbCommit(con)   #<= might speed up might not.

Note, as I comment in the code, this might simply fail to upload the table properly, as the DBI package seems to sometimes fail this kind of transaction if it results in one or more messages about n rows affected.

Last but not least, once the statement is built, it can be copied and pasted from R into any GUI that directly accesses the database, using for example writeLines(statement, 'clipboard'), or it can be written to a text file (a file is more stable if your data contains many rows). In rare outlier cases this last resort can be faster, if for whatever reason DBI or alternative R packages seem to run overly slow. As this seems to be somewhat of a personal project, this might be sufficient for your use.
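
A minimal sketch of that last resort (the file name here is just an example):

# Dump the generated statement so it can be run manually in a database GUI.
writeLines(statement, "insert_statement.sql")   # a file is safer than the clipboard for large data
# On Windows, the clipboard route mentioned above would be: writeLines(statement, "clipboard")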

Oliver
  • Thanks a lot Oliver! So from your example I understand that it is not recommended to "append" the new rows to the database at each iteration? In that case, if we store `outputDat` within the R environment and this list gets very heavy (many gigabytes, for example), I might encounter issues, no? – ML_Enthousiast Nov 06 '19 at 09:42
  • 1) Indeed, you are correct. The reason is that `R` copies the `data.frame` each time you use `rbind`, so while `rbind` is running there will be 3 `data.frames` in memory: `outputDat`, a copy of `outputDat`, and the `data.frame` that you are appending to `outputDat` (and this is the best case). – Oliver Nov 06 '19 at 10:32
  • 2) You might encounter memory problems if your data.frame is larger than what your computer can handle. One way to get around this is to send and clear your list of `data.frames` at intervals (every 500? 1000? it depends on size). Most likely you can get around it by using `data.table` instead of `data.frame`, which avoids copying objects in most cases. I'd suggest trying it first, noting whether there are memory issues, and then finding a solution. – Oliver Nov 06 '19 at 10:39
  • Thanks, it's very clear. Just to be sure I see the full picture, what is the main disadvantage of doing `dbWriteTable` directly within the for loop? – ML_Enthousiast Nov 06 '19 at 17:01
  • Sure thing, glad I could help. Without diving into the source code, there would be 2 obvious disadvantages. The main disadvantage is overhead. Every time you call `dbWriteTable` it has to call external processes to access the database connection. This overhead is two-fold: first, there is overhead as it has to process every function contained in `dbWriteTable` directly in `R`; secondly, it likely sends a statement similar to my last example. In most SQL databases the engine is optimized to perform this task, but a slight overhead will be present for every execute. – Oliver Nov 06 '19 at 17:10
  • Most likely the overhead of processing the data to fit in the SQL database properly is the biggest one, and this is likely in the `R` (or `C`) department. – Oliver Nov 06 '19 at 17:12
  • Amazing. Thanks a ton! – ML_Enthousiast Nov 06 '19 at 17:21
  • My bad... just did it! – ML_Enthousiast Feb 03 '20 at 08:32