Hi everyone!
I'm attempting to systematically extract data from a textbook (PDF). Because this task doesn't easily translate to a reproducible example, I'm providing 2 pages from the book as an example here. These two pages contain a list of species' scientific names (genus species) and a series of 2-character codes. I would like to extract all species' scientific names and their code(s) from the 2-page example provided.
Here's an example of what I would like to extract (species = green, code = blue):
So far, I've been able to recover the scientific names pretty reliably, but the codes are not extracting as I would like:
library(pdftools)
library(tidyverse)
plants <- pdf_text("World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf") %>%
  str_split("\n") # split each page into lines: result is a list with one element per page (689 in the full book)
species_full <- list()
taxa_full <- list()
use_full <- list()
for (i in seq_along(plants)) {
  # for loop to search for species names across all subsetted pages
  species_full[[i]] <- plants[[i]] %>%
    str_extract("[A-Z]+[a-z]+ [a-z]+\\b") # extracting words with upper and lower case letters between margins and abbr. words
  use_full[[i]] <- plants[[i]] %>%
    str_extract("(?<=\\|).+(?=\\|)") %>% # extracting use codes
    str_split("\n") %>%
    str_extract_all("[A-Z]+[A-Z]")
}
species_full_df <- species_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  drop_na() %>%
  rename(species = ".") %>%
  filter(!species %in% c("Checklist of", "Database developed")) # removing artifacts from page headers
use_full_df <- use_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  rename(code = ".") %>%
  filter(!code == "<NA>") %>%
  as.data.frame()
From this code, I obtain the following in species_full_df:
> head(species_full_df)
species
1 Encephalartos cupidus
2 Encephalartos cycadifolius
3 Encephalartos eugene
4 Encephalartos friderici
5 Encephalartos heenanii
6 Cycas apoa
(Note that the order is not preserved, but most of the species names are there)
I obtain these results from use_full_df:
> head(use_full_df)
code
1 RBG
2 EU
3 EU
4 MA
5 ME
6 ME
The issue: the extraction is grabbing 3-character codes (I would like to extract only the 2-character usage codes), and it returns only a single code per row (many species have more than one code).
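To make the target concrete, here is a minimal sketch of the behaviour I'm after. The sample lines and their `Genus species Author | XX | YY |` layout are invented for illustration and may not match what pdf_text() actually returns for these pages. The idea is that str_extract_all() keeps every match on a line rather than just the first, and that word boundaries around `[A-Z]{2}` exclude longer runs such as "RBG".

```r
library(stringr)

# Hypothetical sample lines (an assumption, not the real page text).
lines <- c(
  "Encephalartos cupidus R.A.Dyer | EU | ME |",
  "Cycas apoa K.D.Hill | HF | MA | ME |",
  "Checklist of Useful Plant Species RBG Kew"
)

# First "Genus species" pair on each line (NA when a line has none).
species <- str_extract(lines, "\\b[A-Z][a-z]+ [a-z]+\\b")

# Every standalone run of exactly two capital letters on each line;
# the \\b on both sides rejects three-letter runs such as "RBG".
codes <- str_extract_all(lines, "\\b[A-Z]{2}\\b")

# One row per line, with all codes for that line collapsed into one string.
result <- data.frame(
  species = species,
  codes = vapply(codes, paste, character(1), collapse = " "),
  stringsAsFactors = FALSE
)

# Drop lines that carry no codes at all (e.g. page headers).
result <- result[lengths(codes) > 0, ]
print(result)
```

Because species and codes come from the same line here, each name stays paired with its own codes. One caveat: two-letter author abbreviations such as "DC." would also match this pattern, so the search may need to be restricted to the text after the first `|`.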
Could you advise how to improve this process? Presumably my use of regular expressions is abhorrent.
Thank you in advance!
-Alex.