I extracted a table from a PDF using pdftools in R. The table in the PDF has multi-line text in its columns. I replaced runs of two or more whitespace characters with "|" to make the columns easier to split. The problem is that, because of the multi-line cells and the way the table is laid out in the PDF, the data comes in out of order. The original looks like this:

[Image: the original table in the PDF]

The data that I extracted looks like this:

    scale_definitions <- c("", "                                        to lack passion                        easily annoyed", 
"      Excitable", "                                        to lack a sense of urgency             emotionally volatile", 
"", "                                        naive                                  mistrustful", 
"      Skeptical", "                                        gullible                               cynical", 
"", "                                        overly confident                       too conservative", 
"      Cautious", "                                        to make risky decisions                risk averse", 
"", "                                        to avoid conflict                      aloof and remote", 
"      Reserved", "                                        too sensitive                          indifferent to others' feelings", 
"", "                                        unengaged                              uncooperative", 
"      Leisurely", "                                        self-absorbed                          stubborn", 
"", "                                        unduly modest                          arrogant", 
"      Bold", "                                        self-doubting                          entitled and self-promoting", 
"", "                                        over controlled                        charming and fun", 
"      Mischievous", "                                        inflexible                             careless about commitments", 
"", "                                        repressed                              dramatic", 
"      Colorful", "                                        apathetic                              noisy", 
"", "                                        too tactical                           impractical", 
"      Imaginative", "                                        to lack vision                         eccentric", 
"", "                                        careless about details                 perfectionistic", 
"      Diligent", "                                        easily distracted                      micromanaging", 
"", "                                        possibly insubordinate                 respectful and deferential", 
"      Dutiful", "                                        too independent                        eager to please"
)

library(stringr)  # str_replace_all(); stringr also re-exports %>%

scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")

How do I best put this into a data frame?

user1828605

1 Answer

Unfortunately, a reprex would be too complex, so here is a description of how you can achieve a structured df:

I am afraid you have to use pdftools::pdf_data() instead of pdftools::pdf_text().

This way you get a list with one df per page. Each df has one row per word on the page, along with its exact position (plus its width and height, IIRC). With this in hand you can write a parser to accomplish your task. It will be a bit of work, but it is the only way I know to solve this sort of problem.
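To give a rough idea of what such a parser could look like, here is a minimal sketch in base R. It does not read a real PDF: `words` is a hand-made stand-in for one page of `pdftools::pdf_data()` output (one row per word with `x`, `y` and `text`), and both the coordinates and the column boundaries passed to `cut()` are made-up assumptions that you would have to read off your own data.

```r
# Hypothetical stand-in for one page of pdftools::pdf_data() output:
# one row per word, with its position on the page (values are invented).
words <- data.frame(
  x    = c(150, 162, 185, 300, 330, 40, 150, 162, 185, 300, 355),
  y    = c(100, 100, 100, 100, 100, 106, 112, 112, 112, 112, 112),
  text = c("to", "lack", "passion", "easily", "annoyed",
           "Excitable",
           "to", "lack", "urgency", "emotionally", "volatile"),
  stringsAsFactors = FALSE
)

# Assign each word to a table column by its x position.
# The break points are assumptions; inspect the real x values to choose them.
words$column <- cut(words$x, breaks = c(0, 100, 250, Inf),
                    labels = c("scale", "low", "high"))

# Put the words into reading order, then paste together all words that
# share the same line (y) and the same table column.
words <- words[order(words$y, words$x), ]
cells <- aggregate(text ~ y + column, data = words, FUN = paste, collapse = " ")
cells <- cells[order(cells$y, cells$column), ]

print(cells)
```

From there, the scale name (which sits on its own line between the two text lines) still has to be attached to its neighbouring lines, for example by grouping on ranges of `y`.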

update:

I found a readr function that helps in your case, since we can assume fixed widths (in characters, as counted by nchar()) for the column positions:

library(tidyverse)

# use the raw lines here, i.e. before runs of spaces were replaced with "|"
scale_definitions %>%
    # parse into columns by width, which implicitly fixes each start position
    # (with readr >= 2.0, pass I(scale_definitions) so the vector is read as literal data)
    readr::read_fwf(fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
    # build group ID from row number (empty lines are skipped, leaving 3 rows per scale)
    dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
    # group by that ID
    dplyr::group_by(grp) %>%
    # impute missing value in col 1
    tidyr::fill(col1, .direction = "downup") %>%
    # remove groupings to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # remove auxiliary variable
    dplyr::select(-grp) %>%
    # convert to long format (safer for removing NAs)
    tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
    # remove NAs
    dplyr::filter(!is.na(vals))

# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
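
If you want to check where the widths come from, a small base R sketch like the following can help. The three `raw_lines` are rebuilt with `sprintf()` to imitate the layout of the data in the question; the 40 leading spaces and the 39-character second field are my reading of that layout, so treat them as assumptions and run the same check on your real lines.

```r
# Rebuild three lines in the assumed layout: 40 leading spaces,
# a second field padded to 39 characters, then the third field.
raw_lines <- c(
  sprintf("%40s%-39s%s", "", "to lack passion", "easily annoyed"),
  "      Excitable",
  sprintf("%40s%-39s%s", "", "to lack a sense of urgency", "emotionally volatile")
)

# Find where each field starts: a field is a run of words separated
# by single spaces, so wider gaps end the match.
starts <- lapply(raw_lines, function(l) {
  as.integer(gregexpr("\\S+(?: \\S+)*", l, perl = TRUE)[[1]])
})
field_starts <- sort(unique(unlist(starts)))
print(field_starts)

# Cut every line at the positions implied by the widths 39, 40, 40
# (characters 1-39, 40-79 and 80-119) and trim the padding.
fields <- lapply(raw_lines, function(l) {
  trimws(substring(l, c(1, 40, 80), c(39, 79, 119)))
})
print(fields)
```

Each start position should fall inside exactly one of the three ranges. If it does not for your real data, the widths passed to `fwf_widths()` need adjusting.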
DPH
  • oh wow! that's actually more trouble than it's worth. Seems like it's going to be really difficult to automate this sort of task. Thanks for the suggestion, @DPH – user1828605 Aug 25 '21 at 20:10
  • @user1828605 the first time around it sure is some work. If you need to do it once for a few pages, it might be easier to use an online converter to Excel and adjust any issues with the line breaks... if you need to process multiple PDFs and/or multiple pages and/or need to run this task frequently, then the time investment might be worthwhile – DPH Aug 25 '21 at 20:16
  • @user1828605 have a look at my updated answer => given the fixed positions of the columns in your example, the task might be relatively easy to solve with a specific readr function – DPH Aug 30 '21 at 03:09
  • +2 for your thoughtfulness and diligence. You had already given me a pointer and you still went out of your way to help me get a direction. That's awesome and shows your pedagogical inclination. You're a good teacher and an empathetic one at that. Thank you for taking your valuable time for this. This is what SO is all about - helping each other. – user1828605 Aug 30 '21 at 18:45
  • I've marked it as answer even though I haven't yet tried this. This is more than I had asked for and deserves the kudos. – user1828605 Aug 30 '21 at 18:46