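I have a dataframe of ~20,0000 observations and am focusing on one column that contains abstracts from scientific journals. I am trying to pull plant species names out of these abstracts, so I wanted to use this function:

```r
library(stringr)

# For each string in search.col, return the unique matches of pat as one comma-separated string
find.all.matches <- function(search.col, pat) {
  captured <- str_match_all(search.col, pattern = pat)
  trimmed  <- lapply(captured, str_trim)
  # Keep lower-case letters only (note: this also removes spaces inside a match)
  cleaned  <- lapply(trimmed, function(x) gsub("[^a-z]", "", x))
  # lapply rather than sapply, so the result is always a list and never a simplified matrix
  uniq     <- lapply(cleaned, unique)
  found.col <- unlist(lapply(uniq, toString))
  return(found.col)
}
```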

I also have a dataframe of all recognized plant species, which has 1496575 obs. of 1 variable.

I built a regex pattern from this list:

```r
WFO_list <- WFO_keywords_l
WFO_list[length(WFO_list)] <- paste0(WFO_list[length(WFO_list)],"[^a-z]")
WFO_list[1] <- paste0("[^a-z]",WFO_list[1])
WFO_pat <- paste(WFO_list,collapse="[^a-z]|[^a-z]")
```
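A side note on the pattern itself, as a minimal sketch with made-up strings. Because `[^a-z]` has to consume a real character on each side, a name at the very start or end of a string cannot match, and the unescaped `.` in `var.` matches any character. One possible alternative, assuming a PCRE-style engine (base R with `perl = TRUE`), is to escape the names and use lookarounds. The sample data and the helper `escape_regex` are hypothetical, and this does not by itself avoid the pattern-size limit.

```r
# Hypothetical sample strings: names at the start, at the end, and a near-miss ("varx")
x <- c("tetraria compar was found",
       "we found tetraria compar",
       " carex viridula varx viridula ")
names_l <- c("tetraria compar", "carex viridula var. viridula")

# Pattern built the same way as WFO_pat in the question
pat_orig <- paste0("[^a-z]", paste(names_l, collapse = "[^a-z]|[^a-z]"), "[^a-z]")
orig_hits <- grepl(pat_orig, x)
orig_hits  # FALSE FALSE TRUE: misses both real names, accepts the near-miss

# Hypothetical helper: backslash-escape regex metacharacters in a literal string
escape_regex <- function(s) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", s)

# Lookarounds assert "no letter here" without consuming a character
pat_new <- paste0("(?<![a-z])(?:",
                  paste(escape_regex(names_l), collapse = "|"),
                  ")(?![a-z])")
new_hits <- grepl(pat_new, x, perl = TRUE)
new_hits   # TRUE TRUE FALSE
```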

I then ran this line to get the matches:

```r
WFO_capture <- find.all.matches(search.col = all_data$title_l,
                                pat = WFO_pat)
```

I received this error:

```
Error in stri_match_all_regex(string, pattern, omit_no_match = TRUE, opts_regex = opts(pattern)) :
Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG, context=`[^a-z]schoenoxiphium ecklonii var. ecklonii[^a-z]|[^a-z]cyperus violifolia[^a-z]|[^a-z]carex viridula var. viridula[^a-z]|[^a-z]mariscus phleoides[^a-z]|[^a-z]tetraria compar[^a-z]|[^a-z]fimbristylis schulzii[^a-z]|[^a-z]scirpus orbicephala[^a-z]|[^a-z]trichophorum bracteatum[^a-z]|[^a-z]scirpus uniflorum[^a-z]|[^a-z]blysmopsis exilis[^a-z]|[^a-z]carex arcatica f. taldycola
```

I have used this function before with much smaller datasets, so I think the very large species list is what trips it up. Is there any way to overcome this? Any help is greatly appreciated!

For reference:

```
> head(WFOspecies)
                          scientificName
1: Schoenoxiphium ecklonii var. ecklonii
2:                    Cyperus violifolia
3:          Carex viridula var. viridula
4:                    Mariscus phleoides
5:                       Tetraria compar
6:                 Fimbristylis schulzii
```
  • Hi Melissa! How long is `WFO_keywords_l`? Is that the dataframe? If so, you can imagine why it might throw that error, considering the dataframe is 1496575 rows long. – Mark Aug 23 '23 at 04:47
  • I don't think regex matching is the best approach here. I would write a function to tokenize each abstract into n-grams using the [tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) package and then filter the resulting data frame for occurrences of the scientific names. – neilfws Aug 23 '23 at 04:57
  • It's hard to say without a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), but perhaps try `stringi::stri_extract_all_regex(all_data$title_l, WFO_pat)` or `regmatches(all_data$title_l, gregexpr(WFO_pat, all_data$title_l, perl=TRUE))`. If you still get the U_REGEX_PATTERN_TOO_BIG error, you should probably try a different approach (as suggested by neilfws). – jared_mamrot Aug 23 '23 at 05:21
  • @Mark, yes, the data frame is huge, so I might have to find another way to do this. – Melissa Duda Aug 23 '23 at 15:10
  • @MelissaDuda probably the way forward is this: https://stackoverflow.com/questions/76958078/regex-error-pattern-exceeds-limits-on-size-or-complexity#comment135667415_76958078 – Mark Aug 24 '23 at 06:12
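
The n-gram idea from neilfws's comment can be sketched in base R without building any regex alternation. This is only an illustration under stated assumptions: the data are toy stand-ins for `all_data$title_l` and `WFO_keywords_l`, the helper names (`normalise`, `ngrams`, `find_species`) are hypothetical, names are assumed to be 2 to 4 words long, and punctuation is stripped on both sides, so "var." is compared as "var". `tidytext::unnest_tokens()` with `token = "ngrams"` could do the tokenising step instead.

```r
# Toy data (assumed); the real inputs would be the abstracts and the species list
abstracts <- c(
  "we sampled carex viridula var. viridula and tetraria compar in wetlands.",
  "no plant names here",
  "Mariscus phleoides was common; mariscus phleoides seedlings were rare"
)
species <- c("schoenoxiphium ecklonii var. ecklonii",
             "carex viridula var. viridula",
             "mariscus phleoides",
             "tetraria compar")

# Lower-case, then turn every run of non-letters into a single space
normalise <- function(x) trimws(gsub("[^a-z]+", " ", tolower(x)))

# All word n-grams of length n_min..n_max from a vector of words
ngrams <- function(words, n_min = 2, n_max = 4) {
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) next
    starts <- seq_len(length(words) - n + 1)
    out <- c(out, vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = " ")
    }, character(1)))
  }
  out
}

# For each text, keep the n-grams that appear in the normalised species list
find_species <- function(texts, species, n_min = 2, n_max = 4) {
  lookup <- unique(normalise(species))
  vapply(texts, function(txt) {
    words <- strsplit(normalise(txt), " ", fixed = TRUE)[[1]]
    grams <- unique(ngrams(words, n_min, n_max))
    toString(grams[grams %in% lookup])
  }, character(1), USE.NAMES = FALSE)
}

found <- find_species(abstracts, species)
```

Exact lookups with `%in%` replace the single huge pattern, so the regex size limit should no longer apply. Note that the returned names are in normalised form, and that a shorter name contained in a longer one (for example a species inside a variety name) would also be reported if it is in the list.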

0 Answers