I need to extract data from a PDF published at a URL. The PDF is a scanned image rather than text, so I ran OCR on it with the tesseract package, but a few of the lines were not recognized.
The code:
library(rvest)
library(dplyr)
library(stringr)   # needed for str_replace()
library(pdftools)
library(tesseract)
url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the link in the first list item and take its href attribute
  html_nodes("#viewTable li:nth-child(1) a") %>% html_attr("href") %>%
  # drop the leading "../.." (fixed() so the dots are matched literally, not as regex wildcards)
  str_replace(fixed("../.."), "")
str(links)
# site root, used to turn the relative link into an absolute URL
base_url <- 'https://www.hindustancopper.com'
# combine the base URL with the relative PDF link
event_url <- paste0(base_url, links)
event_url
# the PDF is a scanned copy with no text layer, so render page 1 to a PNG with pdftools and OCR it with tesseract
pdf_convert(event_url,
pages = 1,
dpi = 850,
filenames = "page1.png")
# what does the data look like
text <- ocr("page1.png")
cat(text)
The actual output lists the products and their prices as:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.
The expected output should be:
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc
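To make the difference concrete, here is a minimal sketch that compares the two listings. The vectors are typed in by hand from the lines quoted above (the elided "...etc." rows are left out), and the names `actual`, `expected`, and `split_line` are made up for this illustration; they are not part of the script above.

```r
# Lines as quoted above, typed in by hand (not produced by ocr() here).
actual <- c(
  "CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567",
  "CONTINUOUS CAST COPPER WIRE ROD NS 439678",
  "CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056"
)
expected <- c(
  "CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567",
  "CATHODE FULL 434122",
  "CONTINUOUS CAST COPPER WIRE ROD NS 439678",
  "CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056"
)

# Hypothetical helper: split each line into a product name and the
# trailing run of digits, which is assumed to be the price.
split_line <- function(x) {
  data.frame(
    product = sub("\\s+[0-9]+$", "", x),
    price   = as.integer(sub("^.*\\s([0-9]+)$", "\\1", x)),
    stringsAsFactors = FALSE
  )
}

act_df <- split_line(actual)
exp_df <- split_line(expected)

# Products that the OCR output dropped entirely.
missing_products <- setdiff(exp_df$product, act_df$product)

# Products that were read, but with a different price.
both <- merge(exp_df, act_df, by = "product",
              suffixes = c("_expected", "_actual"))
misread <- both[both$price_expected != both$price_actual, ]

print(missing_products)
print(misread)
```

So there are two separate problems in the sample: one whole line is skipped, and one price is read with a digit missing.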
I have tried changing the value of the dpi argument several times, but that did not help much. Thanks in advance!