Extracting scanned tables from a PDF in RStudio using tabulizer

Question

I have crime data in PDF format. The tables in the PDF are scanned images rather than properly formatted (text-based) tables. I am trying to use the tabulizer package to extract the tables from the PDF, but I keep running into the following errors:

Error in is.factor(x) : object 'crimes98' not found

Error in .jcall(pageIterator, "Ljava/lang/Object;", "next") : 
java.lang.IndexOutOfBoundsException: Page number does not exist

Here is a link to one of the PDFs, which is available online: https://ncrb.gov.in/sites/default/files/crime_in_india_table_additional_table_chapter_reports/TABLE-8-DISTRICT-WISE%20INCIDENCE%20OF%20COGNIZABLE%20CRIMES%20%28IPC%29%20DURING%201998-1998.pdf

The code I have tried:

library(tabulizer)

# Path to the PDF
NCRB98 <- "crimes98.pdf"
# Extract tables from the PDF
extract_tables(crimes98)

# Load the PDF file
NCRB98 <- "crimes98.pdf"

# Define a function to extract tables from a PDF and save them as CSV files
extract_tables_to_csv <- function(NCRB98, output_dir = getwd()) {
  # Read the PDF and extract tables using tabulizer
  tables <- extract_tables(NCRB98, pages = "all", output = "data.frame", method = "stream")
  
  # Save tables as CSV files
  for (i in seq_along(tables)) {
    csv_file <- file.path(output_dir, paste0("table_", i, ".csv"))
    write.csv(tables[[i]], csv_file, row.names = FALSE)
  }
  
  cat("Tables extracted and saved as CSV files in:", output_dir, "\n")
}

# Call the function to extract tables from the PDF and save them as CSV files
extract_tables_to_csv(NCRB98, output_dir = "E:/NCRB Data/2001_2012/PDFs")
