I'm trying to create a data frame from the following PDF:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I call tab1, the result has only one column:
[,1]
[1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
[2,] "AS OF JUNE 29, 2020 AT 3:00 PM"
[3,] "POSITIVE CASE STATUS OTHER TESTS"
[4,] "TOTAL"
[5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
[6,] "TOTAL 495 16 519 97 805"
[7,] "ADIRONDACK 0 0 0 75 0"
[8,] "ALBION 0 0 0 0 2"
[9,] "ALTONA 0 0 0 0 1"
I would like to split each row into its individual columns to create a data frame. For example, row 7 should become: Facility ("Adirondack"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0). I thought the most efficient way would be to split each row of tab1 on spaces, but that doesn't work because some facility names contain multiple words, so the split would produce the wrong number of columns for those rows. Does anyone have an idea for a solution? Thanks for the help!
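One possible approach, sketched here under some assumptions: since every data row ends in exactly five whole numbers, a regular expression can anchor on those five trailing numbers and treat everything before them as the facility name, however many words it has. The sample vector below copies rows from the output above, plus one made-up multi-word row ("EXAMPLE FACILITY") added only to illustrate the problem case. With the real data, the input would presumably be something like tab1[[1]][, 1], assuming extract_tables returned a list whose first element is the one-column matrix shown. Counts containing thousands separators (e.g. "1,234") would not match this pattern and would need extra handling.

```r
# Sample rows; the last one is a hypothetical multi-word facility, not real data.
rows <- c(
  "POSITIVE CASE STATUS OTHER TESTS",
  "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE",
  "TOTAL 495 16 519 97 805",
  "ADIRONDACK 0 0 0 75 0",
  "ALBION 0 0 0 0 2",
  "EXAMPLE FACILITY 1 0 1 0 3"
)
# With the real output, this would likely be: rows <- tab1[[1]][, 1]  (assumption)

# Facility name (any text), then exactly five whitespace-separated integers at the end.
pattern <- "^(.+?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)$"

# Keep only the rows that look like data; title and header rows do not match.
is_data <- grepl(pattern, rows, perl = TRUE)
data_rows <- rows[is_data]

# Each list element holds the full match followed by the six captured groups.
parts <- regmatches(data_rows, regexec(pattern, data_rows, perl = TRUE))
mat <- do.call(rbind, parts)

df <- data.frame(
  Facility  = mat[, 2],
  Recovered = as.integer(mat[, 3]),
  Deceased  = as.integer(mat[, 4]),
  Positive  = as.integer(mat[, 5]),
  Pending   = as.integer(mat[, 6]),
  Negative  = as.integer(mat[, 7]),
  stringsAsFactors = FALSE
)

print(df)
```

The "TOTAL" row matches the pattern too, so it ends up in the data frame; it can be dropped afterwards with df[df$Facility != "TOTAL", ] if only the per-facility rows are wanted.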