I have a package that scrapes data from the internet and displays its content based on the function call. Recently I got a message from CRAN that the data becomes stale when the binary build is installed (the download was triggered from utils.R, so it ran once while the package was being built).

For the past few days, I've tried the following without success:

  • A global variable assigned with <<-, but it generates a CRAN note (Note: no visible binding for global variable), and I also went through a few answers which advised against the approach.
  • Creating a new environment and adding the downloaded object to it, but it never worked out since I couldn't access the object in other functions. Ref: Where to create package environment variables?

These are the current package files: https://github.com/amrrs/tiobeindexr/tree/master/R

Tried solution:

zzz.r file:

.onLoad <- function (libname, pkgname)
{

  assign("newEnv", new.env(hash = TRUE, parent = parent.frame()))

  newEnv$.all_tablesx789  <- rvest::html_table(xml2::read_html('https://www.tiobe.com/tiobe-index/'))


}

One of the functions in the core code:

hall_of_fame <- function() {

  #check_data()

  #.GlobalEnv$.all_tablesx789 <- check_data()

  newEnv$.all_tablesx789[[4]]

}

The package builds fine, but the object is not found. Error below:

Error in hall_of_fame() : object 'newEnv' not found

I have only a couple of days to save my package on CRAN, and I hope I've provided enough information to save this question from being downvoted.

Thanks!

amrrs
  • Create an environment in your package. Have a function that downloads the data and writes it to that environment. Then call that function in .onLoad. – Thomas Sep 02 '18 at 14:25
  • @Thomas Thanks for the comment. It helped me using hrbrmstr's logic to solve the problem. – amrrs Sep 03 '18 at 10:26
  • Please don't do this. The ability to load your package shouldn't depend on an internet connection and a particular site being up. It should really only update when the user requests it. I'd recommend an `update_mypkg_data()` function with perhaps a package startup message advising the user to run it. – Hugh Sep 03 '18 at 10:46
  • @Hugh But the package itself can function only when it's connected to the internet otherwise in earlier case, the data the user got was not the right / new one – amrrs Sep 03 '18 at 11:04

2 Answers

Consider adding memoise as a dependency so you get in-session caching for free with a minimal dependency chain, then use a package environment and (just for fun) an active binding.

Create new env (you can stick this in, say, aaa.R):

.pkgenv <- new.env(parent=emptyenv())

Now (say, in zzz.R), set up one function that does the table grabbing:

.get_tiboe_tables <- function(url) {
  message("Delete this since it's just to show caching works") # delete this
  content <- xml2::read_html(url)
  rvest::html_table(content)
}

And "memoise" it (again, in zzz.R):

get_tiboe_tables <- memoise::memoise(.get_tiboe_tables)

Now, create an active binding which will let us access the tables like a variable (i.e. w/o the ()). It's more "fun" than necessary (again, in zzz.R):

makeActiveBinding(
  sym = "all_tables",
  fun = function() get_tiboe_tables('https://www.tiobe.com/tiobe-index/'),
  env = .pkgenv
)

Now, get the value like this (notice we get the "loading" message as it "primes" the cache):

str(.pkgenv$all_tables, 1)
## Delete this since it's just to show caching works ** the loading msg
## List of 4
##  $ :'data.frame':    20 obs. of  6 variables:
##  $ :'data.frame':    30 obs. of  3 variables:
##  $ :'data.frame':    15 obs. of  8 variables:
##  $ :'data.frame':    15 obs. of  2 variables:

On subsequent calls there is no loading message since it's retrieving the cached value:

str(.pkgenv$all_tables, 1)
## List of 4
##  $ :'data.frame':    20 obs. of  6 variables:
##  $ :'data.frame':    30 obs. of  3 variables:
##  $ :'data.frame':    15 obs. of  8 variables:
##  $ :'data.frame':    15 obs. of  2 variables:

In the next R session it will refresh the tables. That way, there's fresh data without abusing the site. You can also set the file load order with the Collate field in DESCRIPTION instead of relying on sorted file names (aaa.R, zzz.R).

Note that you can export the active binding as well and your users can then use it like a variable instead of calling it like a function.
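To see the active-binding mechanics in isolation, here is a minimal base-R sketch that needs neither memoise nor a network connection. The names `.demo_env`, `fake_fetch`, and `cached_fetch` are hypothetical stand-ins, and the hand-rolled cache only approximates what `memoise::memoise()` does for the real download function.

```r
# Hypothetical environment, playing the role of .pkgenv above.
.demo_env <- new.env(parent = emptyenv())

# Hypothetical stand-in for the real download; counts how often it runs.
fetch_count <- 0L
fake_fetch <- function() {
  fetch_count <<- fetch_count + 1L
  list(data.frame(language = c("R", "Python"), stringsAsFactors = FALSE))
}

# Hand-rolled in-session cache, standing in for memoise::memoise().
cache <- new.env(parent = emptyenv())
cached_fetch <- function() {
  if (is.null(cache$value)) cache$value <- fake_fetch()
  cache$value
}

# Reading .demo_env$all_tables now calls cached_fetch() behind the scenes.
makeActiveBinding("all_tables", cached_fetch, .demo_env)

first  <- .demo_env$all_tables  # runs fake_fetch() and fills the cache
second <- .demo_env$all_tables  # served from the cache, no second fetch
```

Because the cache lives only in memory, a fresh R session starts empty and the first access fetches again.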

hrbrmstr
  • Thank you sir for this amazing reply. While I'm still trying to understand the code, could you please tell me how the tables would be refreshed in each session? Because when I load the function in a new session, `message("Delete this since it's just to show caching works")` is not getting printed – amrrs Sep 03 '18 at 04:02
  • What are you defining as a new session? https://rud.is/dl/tiboe-so-ss.png – hrbrmstr Sep 03 '18 at 10:08
  • I tried to restart my R session in Rstudio and reloaded the package. – amrrs Sep 03 '18 at 10:19

Actually, I took a slightly different approach from the above answer. It follows Thomas' comment; the reason is that I didn't want to add memoise as a dependency, so I tried an alternative.

Creating a new environment in aaa.R:

.pkgenv <- new.env(parent=emptyenv())

Loading the tables into the environment using .onAttach() in zzz.R:

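.onAttach <- function(libname, pkgname) {

  packageStartupMessage("Downloading TIOBE Index Data using your Internet...")

  # Store the downloaded tables in the package environment;
  # on failure, report the problem instead of aborting the attach.
  tryCatch({
    .pkgenv$.get_tiboe_tables <- rvest::html_table(xml2::read_html("https://www.tiobe.com/tiobe-index/"))
  },
  error = function(e){
    packageStartupMessage("Downloading TIOBE Index data failed!")
    packageStartupMessage("Error Message:")
    # e is a condition object, so extract its text before printing
    packageStartupMessage(conditionMessage(e))
    return(NA)
  })

}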

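My earlier mistake seems to have been creating the new environment inside .onLoad() itself, where assign() made it a local variable of that function call, so it was gone by the time other functions looked for it.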
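For comparison, here is a rough sketch of the variant suggested in the comments: a package-level environment plus a function that fills it on request, so that loading the package does not depend on the site being reachable. `update_tiobe_data()`, `.download_tables()`, the `tables` slot, and the injectable `fetch` argument are all hypothetical names chosen for illustration, not part of the actual package; the `fake` downloader exists only so the logic can be tried offline.

```r
# Package-level environment (would live in aaa.R).
.pkgenv <- new.env(parent = emptyenv())

# Default downloader; the rvest/xml2 calls run only when it is invoked.
.download_tables <- function() {
  rvest::html_table(xml2::read_html("https://www.tiobe.com/tiobe-index/"))
}

# Hypothetical helper: (re)fill the cache on request.
# `fetch` is injectable so the logic can be exercised without a network.
update_tiobe_data <- function(fetch = .download_tables) {
  .pkgenv$tables <- tryCatch(fetch(), error = function(e) {
    warning("Downloading TIOBE Index data failed: ", conditionMessage(e))
    NULL
  })
  invisible(!is.null(.pkgenv$tables))
}

# Accessor: download lazily on first use, then read from the environment.
hall_of_fame <- function(fetch = .download_tables) {
  if (is.null(.pkgenv$tables)) update_tiobe_data(fetch)
  .pkgenv$tables[[4]]
}

# Offline usage with a fake downloader returning four placeholder tables.
fake <- function() {
  list("t1", "t2", "t3", data.frame(language = "C", stringsAsFactors = FALSE))
}
update_tiobe_data(fake)
result <- hall_of_fame()
```

Whether to call `update_tiobe_data()` from `.onAttach()` or leave it to the user is the design choice debated in the comments; the sketch works either way because `hall_of_fame()` downloads on first use if the cache is still empty.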

amrrs