I have crime data in PDF format. The tables in the PDFs are scanned copies rather than properly formatted tables. I am trying to use the tabulizer package to extract the tables, but I keep running into the following errors:
Error in is.factor(x) : object 'crimes98' not found
Error in .jcall(pageIterator, "Ljava/lang/Object;", "next") :
java.lang.IndexOutOfBoundsException: Page number does not exist
Here is a link to one of the PDFs, which is available online: https://ncrb.gov.in/sites/default/files/crime_in_india_table_additional_table_chapter_reports/TABLE-8-DISTRICT-WISE%20INCIDENCE%20OF%20COGNIZABLE%20CRIMES%20%28IPC%29%20DURING%201998-1998.pdf
The code that I have tried:
library(tabulizer)
# Path to the PDF file
NCRB98 <- "crimes98.pdf"
# Extract tables from the PDF
extract_tables(crimes98)
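The first error names `crimes98`, which is never defined as an object; only `NCRB98` holds the file name. The following is a minimal sketch of the same call with the path variable passed instead, assuming `crimes98.pdf` sits in the working directory and tabulizer is installed:

```r
# Path to the PDF (assumed to be in the current working directory)
NCRB98 <- "crimes98.pdf"

# Pass the variable that holds the path, not the bare name crimes98
if (file.exists(NCRB98)) {
  tables <- tabulizer::extract_tables(NCRB98)
  str(tables)
} else {
  message("File not found: ", normalizePath(NCRB98, mustWork = FALSE))
}
```

The `file.exists()` guard is only there to separate a wrong path from a genuine extraction problem.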
# Path to the PDF file
NCRB98 <- "crimes98.pdf"
# Define a function to extract tables from a PDF and save them as CSV files
extract_tables_to_csv <- function(pdf_file, output_dir = getwd()) {
  # Read the PDF and extract tables using tabulizer
  tables <- extract_tables(pdf_file, pages = "all", output = "data.frame", method = "stream")
  # Save each table as its own CSV file
  for (i in seq_along(tables)) {
    csv_file <- file.path(output_dir, paste0("table_", i, ".csv"))
    write.csv(tables[[i]], csv_file, row.names = FALSE)
  }
  cat("Tables extracted and saved as CSV files in:", output_dir, "\n")
}
# Call the function to extract tables from the PDF and save them as CSV files
extract_tables_to_csv(NCRB98, output_dir = "E:/NCRB Data/2001_2012/PDFs")
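The second error may come from `pages = "all"`: as far as I can tell from the tabulizer documentation, `pages` expects an integer vector, and its default of `NULL` processes every page. A sketch under that assumption follows; the helper name `extract_pdf_tables_to_csv` and its arguments are hypothetical, not part of tabulizer:

```r
# Hypothetical helper: extract every table from a PDF and write one CSV per table.
extract_pdf_tables_to_csv <- function(pdf_file, output_dir = getwd(), pages = NULL) {
  # Fail early with a clear message if the path is wrong
  if (!file.exists(pdf_file)) {
    stop("PDF not found: ", pdf_file)
  }
  # Create the output directory if it does not exist yet
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
  # pages = NULL (the tabulizer default) is assumed to mean "all pages"
  tables <- tabulizer::extract_tables(
    pdf_file, pages = pages, output = "data.frame", method = "stream"
  )
  # One CSV path per extracted table; empty if nothing was extracted
  csv_files <- file.path(output_dir, sprintf("table_%d.csv", seq_along(tables)))
  for (i in seq_along(tables)) {
    write.csv(tables[[i]], csv_files[i], row.names = FALSE)
  }
  # Return the written paths invisibly so the caller can inspect them
  invisible(csv_files)
}
```

It would be called the same way as above, for example `extract_pdf_tables_to_csv(NCRB98, output_dir = "E:/NCRB Data/2001_2012/PDFs")`, and returning the file paths makes it easy to check how many tables were actually found.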