1

I'm converting a massive collection of pdfs into a single massive csv.

A typical pdf looks like this:

enter image description here

When I use pdftools to convert the page into a single text string I get this:

enter image description here

When I use the cat() function on the page's string I get this:

enter image description here

My question is: how does the cat() function know where each column begins? How can it cleanly parse through the pdf string? I'm using R code that puts a ';' whenever there is more than two white spaces in the string, but I'm coming up with a bunch of problems.

First problem was that the 'sexo' variable only has one space between it and the 'c.identidad' variable. My solution was to use str_replace() to add spaces in front of and the 'sexo' variables ('VAR' and 'MUJ').

The second problem I found was that some observations had only one space between the 'domicilio electoral' and 'circumscripcion' variable. I figured out that 'circumscripcion' was essentially the same as the 'comuna' in the header. So my solution was to str_locate() to identify what the 'comuna' was and then use str_replace() to add spaces to the front and back of all instances where the string matched the 'comuna'.

The third problem was that sometimes the 'domicilio electoral' variable of an observation would also include the 'comuna'. This means that my code to switch the string into pdf (any white space with a length greater than 2 gets converted into a ';') would provide an extra variable for any observation that included a 'comuna' string more than once.

Once the third problem came up I began to think that there must be a simpler way to parse through this pdf document. If cat() can cleanly parse through the pdf string, then there has to be something similar to help me parse through it in an ordered way. Any help, guidance, links, or ideas would be greatly appreciated.


The pdfs I'm using are public and can be found at:http://web.servel.cl/padronAuditado.html

Below is the final version of my code. It has about a 6% obvious error rate. I measured error by counting observations that had the 'mesa' variable as blank.

require(stringr)
require(dplyr)
require(pdftools)

pdf_to_csv2 <- function(doc){

 text <- pdf_text(doc)

 COMUNA_LOCATE = str_locate(text[1], "COMUNA:.+PAGINA")

 COMUNA = text[1] %>% str_sub(COMUNA_LOCATE[1], COMUNA_LOCATE[2]) %>% 
   str_replace('COMUNA:', '') %>%
   str_replace('PAGINA', '') %>%
   str_trim('both')

 cleaned_text <- text %>% str_replace_all(' VAR ', '   VAR   ') %>%
   str_replace_all(' MUJ ', '   MUJ   ') %>%
   str_replace_all(paste(COMUNA, '                       '), paste('    ', COMUNA, '    ')) %>%
   str_replace('IDENTIDAD SEXO', 'IDENTIDAD   SEXO') %>%
   str_replace_all('\r\n', 'END_LINE') %>%
   str_replace_all('\\s\\s+', ';') %>%
   str_replace_all('END_LINE', '\n')

 for(i in c(1:length(cleaned_text))) {write(cleaned_text[i], paste0(doc, '_', i))}

 }
beemyfriend
  • 85
  • 1
  • 11
  • For your `cat()` question: `cat()` has no idea about "columns". The string has `\r\n`'s in it and those are CR+LF (carriage return + line feed or MS-DOS line endings). `cat()` spits out the raw text and the console knows to move to the next line. The other two should be separate questions. – hrbrmstr Sep 10 '16 at 11:04
  • The website it not accesible anymore. – Emmanuel Hamel Sep 26 '22 at 21:05

0 Answers0