I have a dataframe of ~20,0000 observations, and I am focused specifically on a column that contains abstracts from scientific journals. I am attempting to pull plant species names out of these abstracts, so I wanted to use this function to do so:
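library(stringr)

# For each element of search.col, return a comma-separated string of the unique matches of pat
find.all.matches <- function(search.col, pat) {
  captured <- str_match_all(search.col, pattern = pat)
  trimmed <- lapply(captured, str_trim)
  # Strip everything except lowercase letters (this also removes spaces inside names)
  cleaned <- lapply(trimmed, function(x) gsub("[^a-z]", "", x))
  # lapply rather than sapply, so the result is always a list and never simplified to a matrix
  uniq <- lapply(cleaned, unique)
  found.col <- unlist(lapply(uniq, toString))
  return(found.col)
}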
I also have a dataframe of all recognized plant species, which is 1496575 obs. of 1 variable.
I created a pattern for this dataframe...
WFO_list <- WFO_keywords_l
WFO_list[length(WFO_list)] <- paste0(WFO_list[length(WFO_list)],"[^a-z]")
WFO_list[1] <- paste0("[^a-z]",WFO_list[1])
WFO_pat <- paste(WFO_list,collapse="[^a-z]|[^a-z]")
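To make the shape of the resulting pattern concrete, here is a minimal sketch that runs the same three steps on a hypothetical three-name vector (`toy_names` is a stand-in for `WFO_keywords_l`, not the real data). Every name contributes its own text plus two `[^a-z]` delimiters and a `|`, so with about 1.5 million names the full pattern would likely run to tens of millions of characters, assuming names average around 20 characters.

```r
# Hypothetical stand-in for WFO_keywords_l: three lowercased species names
toy_names <- c("cyperus violifolia", "tetraria compar", "carex viridula")

# Same construction as above, applied to the toy vector
toy_list <- toy_names
toy_list[length(toy_list)] <- paste0(toy_list[length(toy_list)], "[^a-z]")
toy_list[1] <- paste0("[^a-z]", toy_list[1])
toy_pat <- paste(toy_list, collapse = "[^a-z]|[^a-z]")

print(toy_pat)
# "[^a-z]cyperus violifolia[^a-z]|[^a-z]tetraria compar[^a-z]|[^a-z]carex viridula[^a-z]"

# Pattern length grows linearly with the number of names
print(nchar(toy_pat))
```

Because each alternative consumes a non-letter character on both sides, a name at the very start or end of a string would not be matched by this pattern.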
I then ran this line to achieve the desired result....
WFO_capture <- find.all.matches(search.col = all_data$title_l,
pat = WFO_pat)
I received an error...
Error in stri_match_all_regex(string, pattern, omit_no_match = TRUE, opts_regex = opts(pattern)) :
Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG, context=`[^a-z]schoenoxiphium ecklonii var. ecklonii[^a-z]|[^a-z]cyperus violifolia[^a-z]|[^a-z]carex viridula var. viridula[^a-z]|[^a-z]mariscus phleoides[^a-z]|[^a-z]tetraria compar[^a-z]|[^a-z]fimbristylis schulzii[^a-z]|[^a-z]scirpus orbicephala[^a-z]|[^a-z]trichophorum bracteatum[^a-z]|[^a-z]scirpus uniflorum[^a-z]|[^a-z]blysmopsis exilis[^a-z]|[^a-z]carex arcatica f. taldycola
I have used this function before with much smaller datasets, so I think the sheer size of the species list is what is tripping it up. Is there any way to overcome this limit? Any help is greatly appreciated!
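One direction that might get around the size limit, sketched here under assumptions rather than tested on the full data, is to split the name list into chunks and build one smaller pattern per chunk. The helper name `find_matches_chunked` and the `chunk.size` default are hypothetical, and `\\b` word boundaries are swapped in for the `[^a-z]` delimiters so that names at the start or end of a string can still match. With about 1.5 million names this means many passes over the text column, so it may be slow.

```r
library(stringr)

# Hypothetical helper: match keywords in chunks so that no single regex becomes too large.
# chunk.size is an arbitrary guess and would need tuning against the real data.
find_matches_chunked <- function(search.col, keywords, chunk.size = 1000) {
  # Assign each keyword to a chunk of at most chunk.size names
  chunk.id <- ceiling(seq_along(keywords) / chunk.size)
  chunks <- split(keywords, chunk.id)

  # One (initially empty) result slot per element of search.col
  found <- vector("list", length(search.col))

  for (chunk in chunks) {
    # "." in names such as "var." is left unescaped, as in the original pattern,
    # so it matches any single character
    pat <- paste0("\\b(?:", paste(chunk, collapse = "|"), ")\\b")
    hits <- str_extract_all(search.col, pat)
    # Merge this chunk's hits into the running results, dropping duplicates
    found <- Map(function(old, new) unique(c(old, new)), found, hits)
  }

  # Collapse each set of hits into one comma-separated string ("" when nothing matched)
  vapply(found, toString, character(1))
}

# Toy usage with made-up text and a four-name keyword list
toy_text <- c("we studied cyperus violifolia and tetraria compar.",
              "no plants here",
              "carex viridula var. viridula was common")
toy_keywords <- c("cyperus violifolia", "tetraria compar",
                  "carex viridula var. viridula", "mariscus phleoides")
toy_result <- find_matches_chunked(toy_text, toy_keywords, chunk.size = 2)
print(toy_result)
```

Unlike `find.all.matches`, this sketch keeps the spaces inside matched names, and a name that ends in a period (for example an abbreviation in final position) may not satisfy the trailing `\\b`.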
For reference
> head(WFOspecies)
scientificName
1: Schoenoxiphium ecklonii var. ecklonii
2: Cyperus violifolia
3: Carex viridula var. viridula
4: Mariscus phleoides
5: Tetraria compar
6: Fimbristylis schulzii