
I made a Shiny app to search one big data frame, and I thought of using stringi. However, when I run the app I get a warning that empty search patterns are not supported. With this example app I can ignore the warning (though it keeps spamming the console), but with my big data frame the app slows everything down, and the only way I can stop it is to terminate the R session.

## app.R ##
require(shiny)
require(stringi)
require(dplyr)
require(DT)
require(treemap)  # provides the GNI2014 dataset loaded below

ui <- fluidPage(textInput("searchall", label =  "Search"),
            dataTableOutput("tableSearch"))

server <- function(input, output, session) {
  data(GNI2014)
  output$tableSearch <- DT::renderDataTable(datatable(
  GNI2014 %>% filter(
      if (!is.null(input$searchall))
        stri_detect_fixed(str = country , pattern = input$searchall)
    ),
    options = list(sDom  = '<"top">lrt<"bottom">ip')
  ))
}

shinyApp(ui, server)

When I run this app I get flooded with the following warning:

Warning in stri_detect_fixed(str = country, pattern = input$searchall) : empty search patterns are not supported

What would be the best approach to avoid this warning and the slowdown that comes with it?

magasr

1 Answer

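You don't need stringi for this. A fast way to query the data is to convert it to a data.table with a key on country and use grepl() to subset the rows. Unlike stri_detect_fixed(), grepl() returns TRUE for every row when the pattern is an empty string and does not warn, so an empty search box simply shows the whole table.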

Example using the GNI2014 data from the treemap package.

library(treemap)
library(data.table)
data(GNI2014)
gni2014table <- data.table(GNI2014)
setkey(gni2014table,"country")
searchText <- "berm"
gni2014table[grepl(searchText,gni2014table$country,ignore.case=TRUE),]

searchText <- "United"
gni2014table[grepl(searchText,gni2014table$country,ignore.case=TRUE),]

...and the output.

> library(treemap)
> library(data.table)
> data(GNI2014)
> gni2014table <- data.table(GNI2014)
> setkey(gni2014table,"country")
> searchText <- "berm"
> gni2014table[grepl(searchText,gni2014table$country,ignore.case=TRUE),]
   iso3 country     continent population    GNI
1:  BMU Bermuda North America      67837 106140
> 
> searchText <- "United"
> gni2014table[grepl(searchText,gni2014table$country,ignore.case=TRUE),]
   iso3              country     continent population   GNI
1:  ARE United Arab Emirates          Asia    4798491 44600
2:  GBR       United Kingdom        Europe   62262000 43430
3:  USA        United States North America  313973000 55200
>

Returning only the column that you want to populate the field on the UI looks like this.

searchText <- "United Arab"
gni2014table[grepl(searchText,gni2014table$country,ignore.case=TRUE),country]
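To wire this into the Shiny app, it can help to make the empty-input case explicit rather than rely on how the matching function treats an empty pattern. The following is a minimal sketch in base R, not the only way to do it: `filter_countries()` is a hypothetical helper name, and `gni_sample` is a four-row stand-in built from the rows shown in the output above rather than the full GNI2014 data.

```r
# Stand-in data: iso3 and country values copied from the output shown above.
gni_sample <- data.frame(
  iso3 = c("BMU", "ARE", "GBR", "USA"),
  country = c("Bermuda", "United Arab Emirates",
              "United Kingdom", "United States"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row when the search string is NULL or
# empty; otherwise return the rows whose country matches, ignoring case.
filter_countries <- function(df, search_text) {
  if (is.null(search_text) || !nzchar(search_text)) {
    return(df)
  }
  df[grepl(search_text, df$country, ignore.case = TRUE), , drop = FALSE]
}

all_rows  <- filter_countries(gni_sample, "")      # empty search box
berm_rows <- filter_countries(gni_sample, "berm")  # one match
```

Inside the server function this would be called as something like `filter_countries(GNI2014, input$searchall)`. The same guard placed in front of stri_detect_fixed() means the function is never called with an empty pattern, so the warning is not raised. Note that grepl() treats the search text as a regular expression, so input such as an unbalanced "(" would need escaping or separate handling.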

UPDATE 20 Dec 2017: Added code to run microbenchmarks. In the first test case the mean time for grepl() is about 20 microseconds lower than for stri_detect_fixed(), and in the second case stri_detect_fixed() is about 60 microseconds faster than grepl(). Each mean is taken over 100 iterations of the request.

library(treemap)
library(data.table)
library(microbenchmark)
data(GNI2014)
gni2014table <- data.table(GNI2014)
setkey(gni2014table,"country")
searchText <- "berm"

microbenchmark(gni2014table[grepl(searchText,gni2014table$country,
                                  ignore.case=TRUE),])

searchText <- "United Arab"
microbenchmark(gni2014table[grepl(searchText,gni2014table$country,
                                  ignore.case=TRUE),country])

library(stringi)
searchText <- "berm"

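# stri_detect_fixed() takes the strings to search first, then the pattern
microbenchmark(gni2014table[stri_detect_fixed(gni2014table$country,
                              searchText,
                              case_insensitive=TRUE),])

searchText <- "United Arab"

# select only country so the call is comparable to the grepl() version
microbenchmark(gni2014table[stri_detect_fixed(gni2014table$country,
                            searchText,case_insensitive=TRUE),country])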

You'll have to run the code yourself to reproduce the benchmarks, because the output of microbenchmark() doesn't display easily on SO.

That said, a summarized version of the timings is:

searchText      Function             Mean (microseconds)
-------------   -------------------- -------------------
berm            grepl                526.2545
United Arab     grepl                583.1789
berm            stri_detect_fixed    545.8772
United Arab     stri_detect_fixed    524.1132
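As a quick check on the size of these gaps, the differences can be computed directly from the means reported in the table. This is only arithmetic on the four numbers above, not a new benchmark; the vector names are made up for illustration.

```r
# Mean timings in microseconds, copied from the table above.
means_us <- c(
  grepl_berm   = 526.2545,
  grepl_united = 583.1789,
  stri_berm    = 545.8772,
  stri_united  = 524.1132
)

# "berm": how much slower stri_detect_fixed was than grepl
diff_berm <- means_us[["stri_berm"]] - means_us[["grepl_berm"]]

# "United Arab": how much slower grepl was than stri_detect_fixed
diff_united <- means_us[["grepl_united"]] - means_us[["stri_united"]]

round(c(berm = diff_berm, united = diff_united), 1)
# berm 19.6, united 59.1 (microseconds)
```

Both gaps are well under a tenth of a millisecond per call, so on a dataset of this size the choice between the two functions should not be noticeable to a user.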
Len Greski
  • Thanks for the answer. I thought stringi would be the preferable option, considering I have a big data frame with a biggish chunk of text in each field, and given the findings here: https://stackoverflow.com/q/24257850/3967488 – magasr Dec 20 '17 at 07:47
  • @magasr the URL you posted includes conflicting timings, where one answer shows grepl() is faster than stri_detect_fixed(), and the other indicates the reverse. I will update my post to include microbenchmarks for your actual problem, and you'll see that as I coded a solution, one search is faster with grepl(), but the other is faster with stri_detect_fixed(). That said, the results for 100 iterations of each call are within 60 milliseconds of each other, or virtually indistinguishable on an individual call basis. – Len Greski Dec 20 '17 at 12:13
  • edit: "...results for 100 iterations are within 60 microseconds of each other". – Len Greski Dec 20 '17 at 12:29