{edited} Hi everyone!

I'm attempting to systematically extract data from a textbook (PDF). Because this task doesn't easily translate into a reproducible example, I'm providing 2 pages from the book as an example here. These two pages contain a list of species' scientific names (genus species) and a series of 2-character codes. I would like to extract all species' scientific names and their code(s) from the 2-page example provided.

Here's an example of what I would like to extract (species = green, code = blue):

Example of data I would like to extract

So far, I've been able to recover the scientific names pretty reliably, but the codes are not extracting as I would like:

library(pdftools)
library(tidyverse)

plants <- pdf_text("World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf") %>% 
  str_split("\n") # splitting up the document by pages: result is a list of length = # pages (689)

species_full <- list()
taxa_full <- list()
use_full <- list()

for(i in 1:length(plants)){ 
  # for loop to search for species names across all subsetted pages
  species_full[[i]] <- plants[[i]] %>%
    str_extract("[A-Z]+[a-z]+ [a-z]+\\b") # extracting words with upper and lower case letters between margins and abbr. words
  
  use_full[[i]] <- plants[[i]] %>%
    str_extract("(?<=\\|).+(?=\\|)") %>% # extracting use codes
    str_split("\n") %>%
    str_extract_all("[A-Z]+[A-Z]")
  
}

species_full_df <- species_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  drop_na() %>%
  rename(species = ".") %>%
  filter(!species %in% c("Checklist of", "Database developed")) # removing artifacts from page headers

use_full_df <- use_full %>% 
  unlist() %>% # unlisting
  as.data.frame() %>%
  rename(code = ".") %>%
  filter(!code == "<NA>") %>%
  as.data.frame()

From this code, I obtain the following in species_full_df:

> head(species_full_df)
                     species
1      Encephalartos cupidus
2 Encephalartos cycadifolius
3       Encephalartos eugene
4    Encephalartos friderici
5     Encephalartos heenanii
6                 Cycas apoa

(Note that the order is not preserved, but most of the species names are there)

I obtain these results from use_full_df:

> head(use_full_df)
  code
1  RBG
2   EU
3   EU
4   MA
5   ME
6   ME

The issue: the extraction is grabbing 3-character codes (I would only like to extract the 2-character usage codes), and it returns only a single code per row (many species have more than one code).

Could you advise how to improve this process? Presumably my use of regular expressions is abhorrent.

Thank you in advance!

-Alex.

  • 2
    Hello Alex. Welcome to SO. It would be much easier for us to help you if you offered just a few (~1 to 3) pages, rather than the whole book. Furthermore, please explain -probably with an image- which part of the text blocks is the one you are interested in. As most of us are not botanists, we might not understand what different codes or conventions mean to you. As this is a very time-consuming process, if you offered an example of a successful and a not-successful extraction it would also help. – Nicolás Velasquez Jul 21 '21 at 18:06
  • 1
    @NicolásVelásquez I apologize for not making this question simpler. I understand that the question was vague and the task is very time-consuming. I've edited the question to focus on only 2 pages from the text, to include a diagram of the text I'm interested in extracting, and to clarify successful and unsuccessful extractions. Please let me know if there is anything else I can do to simplify the question. – J. Alex Baecher Jul 21 '21 at 18:47
  • 1
    `str_extract_all("[A-Z]{2}")` *might* help a little bit with getting only two-letter codes – Ben Bolker Jul 21 '21 at 18:52
  • Thank you, @BenBolker ! Your solution fixed the issue of grabbing items other than the 2-letter codes! (Hopefully it holds up for the larger extraction). If you have any suggestions about preserving species' codes which have more than 1 code, I'd love to hear it! As always, thank you for your help. – J. Alex Baecher Jul 21 '21 at 18:59
  • If I change the bit in the for loop to only:
    ```
    for(i in 1:length(plants)){
      species_full[[i]] <- plants[[i]] %>% str_extract("[A-Z]+[a-z]+ [a-z]+\\b")
      use_full[[i]] <- plants[[i]] %>% str_extract_all("[A-Z]{2}")
    }
    ```
    It returns a list which preserves multiple codes:
    ```
    > use_full[[1]][[15]]
    [[1]][[15]]
    [1] "ME" "EU" "HF" "MA" "ME" "PO" "SU"
    ```
    Ideally, this list could be transferred to a `data.frame` with columns for each of the 10 possible codes, and each row would have 1's and NA's corresponding to each species' codes. – J. Alex Baecher Jul 21 '21 at 19:10
  • 1
    does `dplyr::bind_rows()` do what you want? – Ben Bolker Jul 21 '21 at 19:15
  • @BenBolker, unfortunately `dplyr::bind_rows()` doesn't appear to work: "Error: Can't recycle `..4` (size 0) to match `..7` (size 2)." But, if I use `purrr::flatten()`, I get this: > use_full %>% + flatten() %>% + tail() [[1]] [1] "MA" "ME" "ME" [[2]] character(0) [[3]] [1] "ME" "ME" [[4]] character(0) [[5]] character(0) [[6]] character(0) But, I still can't manage to convert this into a `data.frame`. – J. Alex Baecher Jul 21 '21 at 20:17
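
A minimal sketch of one way to handle both points raised in these comments: matching only standalone two-letter codes, and turning a list that contains empty elements into a data frame of indicators. The `lines` vector is made-up toy input, not text from the PDF, and the pattern `\\b[A-Z]{2}\\b` assumes that codes are always separated from other letters and digits by spaces or punctuation. It uses TRUE/FALSE rather than 1/NA.

```r
library(stringr)

# Toy lines (hypothetical, for illustration only)
lines <- c(
  "12345 | ME HF | RBG",
  "no codes on this line",
  "67890 | EU | RBG"
)

# Word boundaries keep "RBG" from matching at all; a bare "[A-Z]{2}"
# would still pick up "RB" from it.
codes <- str_extract_all(lines, "\\b[A-Z]{2}\\b")

# One row per line, one column per code seen; TRUE where the code occurs.
# Lines without codes (character(0)) simply become all-FALSE rows.
all_codes <- sort(unique(unlist(codes)))
indicator <- do.call(rbind, lapply(codes, function(x) all_codes %in% x))
colnames(indicator) <- all_codes

result <- data.frame(line = seq_along(lines), indicator)
result
```

Building the matrix row by row sidesteps the recycling error from `dplyr::bind_rows()`, because every element is first mapped to a logical vector of the same length.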

1 Answer

I would tackle it a different way. First, I would rely on the tabulizer package, which does a marvelous job of parsing PDF columns into a single stream of text lines. Then, instead of looping over lines, I would turn the raw lines into a tibble/data.frame so the transformations can be vectorized.

library(tabulizer)
library(splitstackshape)
library(tidyverse)

text_plants <- tabulizer::extract_text(file = "World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf")

df_plants <- 
  read.delim(file = textConnection(text_plants), header = FALSE) %>% as_tibble() %>% # as_tibble() is optional, but it makes it much easier to explore the result of read.delim() and the mutations that follow.
  filter(grepl("^\\s?(World.Checklist.of.Useful.Plant|m.diazgranados@kew.org|Page *\\d+ of \\d+|\\s*$)", V1) == FALSE) %>% # Optional. Removes header, footer, and blank lines.
  mutate(V1 = trimws(V1), 
         is_metadata = grepl('^\\s?\\d+.*[|]', V1), # Flags metadata lines (digits first, then a "|"), which always sit directly below a plant line.
         is_plant = lead(is_metadata), # Flags plant-name lines, which seem to always sit directly above a metadata line.
         plant_metadata = if_else(is_plant == TRUE, true = trimws(lead(V1)), false = NA_character_)) %>% # Moves the metadata text onto the same row as its plant, in a separate column.
  filter(is_plant == TRUE) %>% # Removes all lines not listing a plant.
  rename(plant = V1) %>% 
  mutate(usage_codes = str_extract(string = plant_metadata, pattern = "(?<=\\|).+(?=\\|)") %>% trimws()) %>% # Extracts the "usage codes".
  select(plant, usage_codes) %>% 
  splitstackshape::cSplit(splitCols = "usage_codes", sep = " ", direction = "long") %>% # Splits the usage codes into a tidy (long) table with plants as ID.
  filter(!is.na(usage_codes)) %>% 
  mutate(exists = TRUE) %>%
  pivot_wider(id_cols = plant, names_from = usage_codes, values_from = exists, values_fill = FALSE) # Pivots the tidy table into a wide format.

df_plants
# A tibble: 114 x 10
   plant                      ME    HF    PO    SU    EU    GS    MA    IF    AF   
   <chr>                      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Cycas apoa K.D.Hill        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Cycas circinalis L.        TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE
 3 Cycas inermis Lour.        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Cycas media R.Br.          TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Cycas micronesica K.D.Hill TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Cycas pectinata Buch.-Ham. TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 7 Cycas revoluta Thunb.      TRUE  TRUE  FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE
 8 Cycas rumphii Miq.         TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  FALSE
 9 Cycas siamensis Miq.       TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
10 Cycas taiwaniana Carruth.  FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
# … with 104 more rows
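
The central trick above, pairing each plant line with the metadata line directly below it via `lead()`, may be easier to follow on a toy input. The sketch below uses four made-up lines rather than the PDF text, so the exact line layout is an assumption. It also swaps `splitstackshape::cSplit()` for `tidyr::separate_rows()` to stay within the tidyverse, which is a substitution and not part of the code above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy lines (hypothetical): each plant name is followed by a metadata line
raw <- tibble(V1 = c(
  "Cycas apoa K.D.Hill",
  "123456 | ME | 1, 2",
  "Cycas circinalis L.",
  "234567 | ME HF PO SU | 3"
))

paired <- raw %>%
  mutate(
    is_metadata = grepl("^\\s?\\d+.*[|]", V1),          # digits first, then a "|"
    is_plant    = lead(is_metadata, default = FALSE),  # line above a metadata line
    metadata    = lead(V1)                             # text of the next line
  ) %>%
  filter(is_plant) %>%
  transmute(
    plant = V1,
    usage_codes = str_trim(str_extract(metadata, "(?<=\\|).+(?=\\|)"))
  )

# Long format (one row per plant/code), then wide TRUE/FALSE indicators
wide <- paired %>%
  separate_rows(usage_codes, sep = "\\s+") %>%
  mutate(exists = TRUE) %>%
  pivot_wider(names_from = usage_codes, values_from = exists, values_fill = FALSE)

wide
```

Because the pairing relies only on line order, it breaks if a plant name wraps onto two lines or a metadata line is missing, so it is worth spot-checking the result against the PDF.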
Nicolás Velasquez
  • 1
    Wow! You've blown my expectations out of the water! Thank you so much for this detailed solution, and the accuracy is one-to-one with the textbook! I'm so very grateful... – J. Alex Baecher Jul 22 '21 at 02:05