
Data:

I have a shiny dashboard application and my dataset is around 600 MB in size. It swells by 100 MB every month. My data resides locally in MySQL.

MenuItems:

I have 6 - 7 sidebar menuItems on my dashboard and each of them has 10 - 12 different outputs - charts and tables. Each of these tabs has 3 - 6 inputs such as selectizeInput, slider, date range, etc. to filter the data.

Data subsets:

Since I cannot load all the data into memory, for every menuItem I create a subset of the data based on the date range, keeping the default range to just 2 - 3 days from the system date.

For example:

df1 <- reactive({ df[df$date >= input$dateinput[1] & df$date <= input$dateinput[2], ] })

The above gets the data for my first menu item, and depending on the selectInput or other inputs I filter the data further. For example, if I have a selectInput for Gender (male and female), then I further subset df1 to:

df2 <- reactive({
       if (is.null(input$Gender)) {
          df1()
       } else if (input$Gender == "Male") {
          df1()[df1()$Gender == "Male", ]
       }
})

If I have more than one input, I subset df1 further and pass the values on to df2. df2 becomes the reactive dataset for all the charts and tables in that menuItem.
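A condensed sketch of this pattern, with the per-menuItem filters collapsed into one plain helper function (the Region filter and all names here are made up for illustration):

```r
# Sketch: all filters for one menuItem collapsed into a single helper.
# NULL means "input not set, don't filter". Column and input names are
# illustrative, not the real ones.
apply_filters <- function(d, gender = NULL, region = NULL) {
  if (!is.null(gender)) d <- d[d$Gender %in% gender, ]
  if (!is.null(region)) d <- d[d$Region %in% region, ]
  d
}

# Inside server(), one reactive subset then feeds every chart and table:
# df2 <- reactive({ apply_filters(df1(), input$Gender, input$Region) })
```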

The more menuItems I create, the more subsets I need to suit the filters and analysis.

I face two problems:

  1. On older machines, the app does not load at all, and
  2. On newer machines, the app loads very slowly, sometimes taking 5 - 6 minutes.

After the first data load, the charts and tables get rendered faster on reactive changes.

To counter this, I have tried moving all common and repetitive parameters and libraries to global.R.
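For reference, a minimal sketch of what such a global.R can hold (the connection details and object names are placeholders, not my real setup). Everything here runs once per R process and is visible to both ui.R and server.R, so per-session work is reduced:

```r
# global.R -- evaluated once per R process, shared by ui.R and server.R.
library(shiny)
library(shinydashboard)
library(DBI)
library(RMySQL)

# Repeated parameters defined once instead of inside each output.
date_default <- c(Sys.Date() - 3, Sys.Date())  # the 2 - 3 day window

# One connection per process (credentials are placeholders).
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "mydb", host = "localhost",
                 user = "shiny", password = Sys.getenv("MYSQL_PWD"))
```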

I have two questions:

1. Are there any basic hygiene factors that one needs to keep in mind when mining data in R, especially through shiny? (Mining in R itself is extremely fast.)

2. I have read about parallel processing, but almost all the examples talk about distributing a single heavy calculation. Can parallel processing be used to distribute the subsetting of the data, or the preparation of charts and tables?

Please note, I am a researcher and not a programmer, but I have recently learnt to use shiny and to host applications in the cloud or locally.

Guidance on this will be very helpful for many novice users of R like me.

Apricot
  • Really interesting question, I think I will face this problem in a few months with my dashboard. Are you sending a request to your SQL database every time, or are you loading it into an RData file once the app is running? Also, using `df1[df1$Gender == "Male",]` to subset your data is very slow; have you tried the `filter` function from the `dplyr` package? It is much faster when you have a large dataset. – Christophe D. Jun 29 '16 at 07:34
  • You need to figure out if the slowness is due to database access or to R functions. Here are some ideas that might help: http://stackoverflow.com/questions/21484115/code-profiling-for-shiny-app When you say that the app does not load on older machines, try to figure out if it's due to a memory limitation, the browser version, etc. Shiny makes heavy use of Javascript, which may not work on older browsers. – Xiongbing Jin Jun 29 '16 at 18:23
  • Another interesting point related to this question, to which I don't know the answer myself: if you have multiple reactive values all containing big (sub)datasets, like df1 and df2 in this example, are those data frames kept in memory at all times? I would guess that they are. If they are, it might be good practice to have a single df <- reactive({...}) that always returns the 'current' dataset you are working on. – Joris Gillis Jun 30 '16 at 08:16
  • @Joris Gillis Yes, the datasets are in memory. I tried delaying the loading of the dataset, but it didn't work. I tried having one single reactive set, but since each menu item has multiple filters, I necessarily create subsets which will work within a menu item. For example, my first menuItem is Toplines, where I render only the summary of the data, and this has date as a filter. My next menuItem is Tabulate, where I allow the users to tabulate the data based on the filters; now in this, if I don't allow a date filter, then all the filters applied in menuItem 1 get carried into menuItem 2. – Apricot Jun 30 '16 at 12:03
  • @Apricot It depends on how your shiny app is reacting. Do you have filter controls and a 'Compute' button on each tab? If so, you can check in the current dataset reactive which tab is open, wrap the data processing for each tab into a separate function and call the appropriate function. Another point: I would try to push the filtering and data preparation steps down into the database as much as possible, especially if you are converting dates from string to POSIXct or Date objects in R. Converting dates is quite an expensive operation, so you should minimise its use. – Joris Gillis Jul 01 '16 at 07:46
  • @JorisGillis My app stores dates as character in MySQL, and I use the as.Date function to convert. On pushing the filtering into the database: currently I fetch one large dataset for a menuItem and manipulate it throughout that page. I will certainly give this a try, many thanks. – Apricot Jul 01 '16 at 08:26

1 Answer


This is a very interesting question and deserves a proper answer rather than just comments. I would like to relate my experience and thoughts. I built a commercial R + shiny application with Shiny Server Pro, using databases and loads of other tricks.

Delayed UI loading time
My app takes over 30s to load, i.e. to give control back to the user.

The issue

Shiny is a single-page application, so a complex app with loads of tabs and data loaded at start-up to populate menus, selectors etc. is affected from the initial loading time onwards.

UI possible mitigations

  • Use dynamic UI components (wisely) to add complexity after start up. For example a particular menu may start very simply with few elements, then add more elements at a later stage.
  • Joe Cheng proposed insertUI and removeUI when my app was almost finished, so I didn't get around to using them, but they too could contribute to a simpler page at start-up.
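A minimal sketch of the first idea, assuming a selectize menu whose (possibly long) list of choices is filled in server-side after start-up; `get_product_names()` is a hypothetical helper:

```r
library(shiny)

ui <- fluidPage(
  # Ship an empty selector: the page builds faster because the
  # (potentially huge) choice list is not embedded in the initial HTML.
  selectizeInput("product", "Product", choices = NULL)
)

server <- function(input, output, session) {
  # Fill the choices once the session is running; server = TRUE keeps
  # the list on the server and sends items to the browser on demand.
  updateSelectizeInput(session, "product",
                       choices = get_product_names(),  # hypothetical helper
                       server = TRUE)
}

shinyApp(ui, server)
```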

Use of database

My app used MonetDB and later PostgreSQL. The performance of MonetDB was good, but I had a multi-user conflict (a complex issue that I cannot detail here) which forced me to move to PostgreSQL as an alternative. PostgreSQL was fine, but it took a dramatic time to start because of its cache warming up. The design required loading lots of data into the DB at start-up: bad design.

RDBMS delays possible mitigations

I think I tried most tricks with varying success.

  • Limit RDBMS usage. As I decided from the start to use data.table to speed up data manipulation without being constrained by copying, I was also using fread for any kind of csv reading. At the time fwrite (also from data.table) wasn't even on the horizon, otherwise it would merit serious consideration.
  • App re-design. The app architecture has a lot to do with how intensively the RDBMS is used. I'm convinced that time can be saved by a design that takes into account R + shiny (mainly R) limitations.
  • Now MonetDB has R functions embedded in the code, so it should be even faster than before. It certainly deserves a good look. On the other hand, the multi-user features should be thoroughly tested: most R database code does not take into account being used in a multi-user environment such as the one offered by shiny. Maybe RStudio should be doing something more about this. Honestly, they have already started, with the experimental introduction of connection pools, and that is great.
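To illustrate the last two points, a sketch that combines a connection pool (from the experimental pool package) with pushing the date filter down into the database, so that only a few days of rows ever reach R. The table and column names are assumptions, and the query is interpolated safely with DBI::sqlInterpolate:

```r
library(shiny)
library(pool)
library(DBI)

# One pool per R process, shared across sessions (typically in global.R).
pool <- dbPool(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
               user = "shiny", password = Sys.getenv("MYSQL_PWD"))
onStop(function() poolClose(pool))

server <- function(input, output, session) {
  df1 <- reactive({
    # The WHERE clause runs inside the database: only the requested
    # 2 - 3 day window is transferred to R, and no as.Date() conversion
    # of the whole table is needed on the R side.
    query <- sqlInterpolate(pool,
      "SELECT * FROM measurements WHERE date BETWEEN ?start AND ?end",
      start = as.character(input$daterange[1]),
      end   = as.character(input$daterange[2]))
    dbGetQuery(pool, query)
  })
}
```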

Excessive use of reactivity

I think it is great to play with an advanced framework like shiny, and reactivity is a lot of fun to learn. On the other hand over a wide and complex application things can easily get out of hand.

Excessive reactivity possible mitigations

  • Debugging each function gives a precise idea of how many times a particular shiny function is called, and any reactive function is usually called more than once. Of course all this burns CPU time, and it needs at least to be kept under control.
  • Constructs like observeEvent now have parameters like ignoreInit: a wise use of these parameters can save at least one void cycle at initialisation time.
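A sketch of that second point (the input names are made up): without `ignoreInit = TRUE` the observer below would also fire once while the app is initialising, wasting a void cycle.

```r
library(shiny)

server <- function(input, output, session) {
  # Fires only on genuine clicks of the button, not on the initial
  # void cycle that runs when the app starts up.
  observeEvent(input$recompute, {
    message("Recomputing with gender filter: ", input$Gender)
  }, ignoreInit = TRUE)
}
```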

In my experience we have only scratched the surface of what is possible to do with shiny. On the other hand there is a limit due to the single-process nature of R. With Shiny Server Pro it is possible to envisage using load balancers and spreading multiple users across different servers. But to get into these territories we would need some kind of messaging system across the various instances. Already now I see the need for that in complex Shiny Server Pro applications (e.g. when there is a need to manage different classes of users, but at the same time to communicate between them). However, this is out of scope for this SO question.

Enzo