Extract PDF data with varying white space as separation

Question

I'm trying to extract data from these PDFs.

I'm running into a problem where location names with multiple words ("Northern Island", for example) are being split across different columns.

The "sep" argument of "read.table" seems to only treat a single space as a delimiter. Ideally, I'd like any run of more than one space to act as a delimiter. Is this possible?


library(pdftools)

# Path to the local PDF (a file path, despite the variable name)
url <- "C:/Users/files/PSSS Weekly Bulletin - W1 2019 (Dec 31-Jan 06).pdf"

# Convert the PDF to text (one string per page)
txt <- pdf_text(url)

# Get the working directory
wd <- getwd()

# Write the text to a file in the working directory
file_name <- file.path(wd, "temp.txt")
write(txt, file = file_name, sep = "\t")

# Convert to a table. The data starts at line 25 and runs for 25 lines.
# P.S.: I've tried this code with and without the "sep" argument. No change.
dtaPCF <- read.table(file_name, skip = 24, nrows = 25, fill = TRUE, header = TRUE)

# Here is the text that I'd like to read with read.table. Ideally, I'd like to keep the headers, but it's not a dealbreaker if that doesn't work.


Country/Area     No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI

American Samoa   0          0             0%          0    0          0    0   0

Cook Islands     13         11            85%         0    3          3    0   0

FSM              4          3             75%         0    21         74   0   3

Fiji             0          0             0%          0    0          0    0   0

French Polynesia 31         16            52%         3    9          11   3   3

Guam             0          0             0%          0    0          0    0   0

Kiribati         7          7             100%        0    172        609  22  0

Marshall Islands 2          2             100%        0    4          0    2   0

N Mariana Is     7          7             100%        4    13         60   17  0

Nauru            0          0             0%          0    0          0    0   0

New Caledonia    0          0             0%          0    0          0    0   0

New Zealand      0          0             0%          0    0          0    0   0

Niue             0          0             0%          0    0          0    0   0

PNG              0          0             0%          0    0          0    0   0

Palau            0          0             0%          0    0          0    0   0

Pitcairn Islands 1          1             100%        0    0          0    0   0

Samoa            13         6             46%         0    262        606  18  4

Solomon Islands  13         4             31%         0    75         59   4   1

Tokelau          2          2             100%        0    2          9    0   0

Tonga            11         11            100%        0    17         73   0   0

Tuvalu           0          0             0%          0    0          0    0   0

Vanuatu          11         7             64%         0    49         171  0   1

Wallis & Futuna  0          0             0%          0    0          0    0   0

Can you post a sample of just the relevant text portion? Also, try `data.table::fread` . It's pretty smart with detecting the right delimiter — Rohit, Jul 17 '19 at 11:42
@Rohit I've added some sample of the text. I tried using fread, which produced a marginally better result, but did not solve the sep problem. — deetseeker, Jul 18 '19 at 03:13
Try `read.fwf()` or `readr::read_fwf()` instead. Your data seems to be fixed width, so one of them should work. You'll have to play around with it a bit to get the right output — Rohit, Jul 18 '19 at 05:47
@Rohit thanks for the suggestion. I used read_fwf() and got it to work. — deetseeker, Jul 18 '19 at 07:31

score 0 · Answer 1 · answered Jul 18 '19 at 07:33

Here is the code I ended up using. I opened the text file in Notepad to find the maximum character width of each column and passed those widths to fwf_widths().

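library(readr)

# Column widths were measured by hand from the text file.
# The final width of 1 fits the single-digit DLI values in this bulletin.
dtaPCF <- read_fwf(file_name,
                   col_positions = fwf_widths(
                     c(17, 11, 14, 12, 5, 11, 5, 4, 1),
                     c("Country/Area", "No. sites", "No. reported",
                       "% reported", "AFR", "Diarrhoea", "ILI", "PF", "DLI")),
                   skip = 47,     # lines to skip before the first data row
                   n_max = 23,    # number of data rows to read
                   trim_ws = TRUE)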

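If measuring the widths by hand is not practical, one possible alternative is to split each line with a regular expression. Splitting on two or more spaces would not be enough for this table, because rows such as "French Polynesia 31" have only a single space between the name and the first number. The sketch below uses base R only and assumes that every column after the first starts with a digit, which holds for the sample in the question but may not hold for other bulletins. The names `lines`, `header`, `fields`, and `dta` are illustrative and not part of the original code.

```r
# A few lines copied from the sample table in the question.
lines <- c(
  "Country/Area      No. sites  No. reported  % reported  AFR  Diarrhoea  ILI  PF  DLI",
  "American Samoa   0          0             0%          0    0          0    0   0",
  "Cook Islands     13         11            85%         0    3          3    0   0",
  "French Polynesia 31         16            52%         3    9          11   3   3"
)

# Header fields are separated by at least two spaces
# ("No. sites" itself contains a single space).
header <- strsplit(trimws(lines[1]), " {2,}")[[1]]

# Data rows: split on whitespace that is immediately followed by a digit.
# ASSUMPTION: no country name contains a digit and every value starts with one.
fields <- strsplit(trimws(lines[-1]), "\\s+(?=\\d)", perl = TRUE)

# Stack the rows into a character matrix, then convert to a data frame.
dta <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(dta) <- header

# Convert the count columns to integers; keep the name and percentage as text.
num_cols <- setdiff(names(dta), c("Country/Area", "% reported"))
dta[num_cols] <- lapply(dta[num_cols], as.integer)
```

In practice `lines` would probably come from something like readLines(file_name), restricted to the rows of the table. The fixed-width approach above is likely the safer choice if a future bulletin has empty cells, since a missing value would shift the fields in a regex split.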