I need to obtain the names of a large set of PDF files (36,000 files), but only the names, without loading the contents of the files. The end result should be a data frame like this:

[image: desired data frame]

Link to 21 example files: https://drive.google.com/drive/folders/1zUKyVJFICq4Q69zs48wqFNq1UPDvCgbf?usp=sharing

I am using this code:

#set directory 
library(pdftools)
library(tm)

files=list.files(pattern = "pdf$")
files

all=lapply(files, pdf_text)
lapply(all, length) 
x=Corpus(URISource(files), readerControl = list(reader = readPDF))
x

class(x) #character

DAT_FINAL <- data.frame(text = sapply(x, as.character), stringsAsFactors = T)
DAT_FINAL

The goal is to have a data frame, because I need to compare the numeric names against an Excel file to find the numbers that are missing among the documents.

Update:

[image: updated file list; some names combine digits and letters, such as 10A.pdf, 17A.pdf, 21ABV.pdf]

1 Answer

A possible solution (instead of /tmp/PDFS/, use the path to the directory where your PDFs are stored):

library(tidyverse)

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  mutate(number = str_extract(pdfs, "^\\d+"), .before = pdfs)

#>    number   pdfs
#> 1       1  1.pdf
#> 2      10 10.pdf
#> 3      12 12.pdf
#> 4      13 13.pdf
#> 5      14 14.pdf
#> 6      15 15.pdf
#> 7      16 16.pdf
#> 8      17 17.pdf
#> 9      18 18.pdf
#> 10     19 19.pdf
#> 11      2  2.pdf
#> 12     20 20.pdf
#> 13     21 21.pdf
#> 14     22 22.pdf
#> 15     23 23.pdf
#> 16      3  3.pdf
#> 17      4  4.pdf
#> 18      5  5.pdf
#> 19      6  6.pdf
#> 20      8  8.pdf
#> 21      9  9.pdf
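Note that list.files() returns names in alphabetical order, which is why 10.pdf comes before 2.pdf above. If a numeric order is preferred, one option is to convert the extracted number to an integer and sort on it. The following is only a minimal base R sketch: the file names are hypothetical stand-ins for the result of list.files(), and it assumes every name starts with digits.

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("1.pdf", "10.pdf", "12.pdf", "2.pdf", "20.pdf", "3.pdf")

# Extract the leading digits of each name and convert them to integers.
# regmatches() silently drops names without a match, so this assumes
# that every file name starts with at least one digit.
number <- as.integer(regmatches(pdfs, regexpr("^[0-9]+", pdfs)))

# Build the data frame and sort it by the numeric value
files_df <- data.frame(number = number, pdfs = pdfs, stringsAsFactors = FALSE)
files_df <- files_df[order(files_df$number), ]

files_df
```

Only the file names are touched here; none of the PDFs is opened, so this should scale to tens of thousands of files.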

Or using tidyr::extract:

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  extract(pdfs, into = "number", "(\\d+)\\.pdf", remove = FALSE, convert = TRUE) %>% 
  select(number, pdfs)

EDIT

To answer a further question of the OP (see comments below):

library(tidyverse)

data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
  mutate(number = str_extract(pdfs, ".*(?=\\.pdf)"), .before = pdfs)

#>    number      pdfs
#> 1       1     1.pdf
#> 2      10    10.pdf
#> 3     10A   10A.pdf
#> 4      12    12.pdf
#> 5      13    13.pdf
#> 6      14    14.pdf
#> 7      15    15.pdf
#> 8      16    16.pdf
#> 9      17    17.pdf
#> 10    17A   17A.pdf
#> 11     18    18.pdf
#> 12     19    19.pdf
#> 13      2     2.pdf
#> 14     20    20.pdf
#> 15     21    21.pdf
#> 16  21ABV 21ABV.pdf
#> 17     22    22.pdf
#> 18     23    23.pdf
#> 19      3     3.pdf
#> 20      4     4.pdf
#> 21      5     5.pdf
#> 22      6     6.pdf
#> 23      8     8.pdf
#> 24      9     9.pdf
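
The question also mentions comparing these names against an Excel file to find the missing documents. That step could be sketched with setdiff(), as below. This is an illustration under assumptions, not part of the original answer: the vector `expected` is a hypothetical stand-in for the ID column of the spreadsheet (which could be read with, for example, readxl::read_excel()), and `pdfs` stands in for the result of list.files().

```r
# Hypothetical inputs: 'expected' stands in for the ID column of the
# Excel file, 'pdfs' for the result of list.files("/tmp/PDFS/")
expected <- c("1", "2", "3", "4", "10", "10A", "11")
pdfs <- c("1.pdf", "2.pdf", "4.pdf", "10.pdf", "10A.pdf")

# Strip the extension to get the full name (digits and letters)
found <- sub("\\.pdf$", "", pdfs)

# IDs listed in the spreadsheet that have no matching PDF
missing_ids <- setdiff(expected, found)

# PDFs whose name does not appear in the spreadsheet
extra_ids <- setdiff(found, expected)

missing_ids
extra_ids
```

Both sides are compared as character strings, so names such as 10A are handled the same way as purely numeric ones. If the spreadsheet stores the IDs as numbers, they would need to be converted with as.character() first.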
PaulS
  • I have added new files in an update, because some file names contain both digits and letters, like this: 10A.pdf, 17A.pdf, 21ABV.pdf. I need to extract the whole name. – Miguel Angel Acosta Chinchilla Aug 29 '22 at 12:42
  • It is not clear what you want; please elaborate a bit more. – PaulS Aug 29 '22 at 12:48
  • I am trying to get the full name, digits and letters together, in a separate column, since only the digits appear in the column _number_. Many files have names made up of digits and letters. I have added these files to the Drive folder. #> number pdfs #> 1 1 1.pdf #> 2 10A 10A.pdf #> 3 21ABV 21ABV.pdf – Miguel Angel Acosta Chinchilla Aug 29 '22 at 13:16
  • OK, @MiguelAngelAcostaChinchilla, just try this: `data.frame(pdfs = list.files("/tmp/PDFS/")) %>% mutate(number = str_extract(pdfs, ".*(?=\\.pdf)"), .before = pdfs)` – PaulS Aug 29 '22 at 13:26
  • I have meanwhile updated my answer. – PaulS Aug 29 '22 at 13:29