I have to pull data from a PDF hosted at a URL. The PDF is a scanned image rather than text, so I used the tesseract package for OCR, but a few of the lines were not recognized.

The code:

library(rvest)
library(dplyr)
library(stringr)   # provides str_replace() and fixed()
library(pdftools)
library(tesseract)

url <- "https://www.hindustancopper.com/Page/PriceCircular"
links <- url %>%
  # read the HTML of the page
  read_html() %>%
  # select the first link in the table and extract its href
  html_nodes("#viewTable li:nth-child(1) a") %>%
  html_attr("href") %>%
  # strip the relative-path prefix; fixed() matches "../.." literally, not as a regex
  str_replace(fixed("../.."), "")
str(links)

# build the absolute URL of the PDF from the base URL and the relative link
base_url <- 'https://www.hindustancopper.com'
event_url <- paste0(base_url, links)
event_url

# the PDF is a scanned image with no text layer, so render page 1 to PNG with pdftools
pdf_convert(event_url, 
            pages = 1, 
            dpi = 850, 
            filenames = "page1.png")
# run OCR on the rendered page and inspect the recognized text
text <- ocr("page1.png")
cat(text)

The actual output lists the products and their prices as follows (one digit is dropped from the first price and the CATHODE FULL line is missing entirely):

CONTINUOUS CAST COPPER WIRE ROD 11 MM 44567 
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc.

The expected output should be:

CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056...etc

I have tried changing the value of the dpi argument several times, but that did not help much. Thanks in advance!

Ami
  • Have you tried with different PSM? – nguyenq Apr 06 '20 at 14:51
  • PSM is already inbuilt in this function. I do not think that any of the functions used provide any option to declare psm. Refer to the following URL: https://rdrr.io/github/hansthompson/pdfHarvester/src/R/Tesseract.R – Ami Apr 07 '20 at 06:26
  • You need to be able to try another page segmentation mode, as it could capture the region that the current PSM misses. I don't understand why it is fixed to -psm 7, which treats the image as a single text line; that would not work well for a multi-line text image. https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc – nguyenq Apr 07 '20 at 22:30
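
On the PSM question raised in these comments: the tesseract R package loaded in the question accepts engine parameters through tesseract(options = ...), so a different page segmentation mode can probably be tried without leaving R. The following is a minimal sketch, assuming the installed package version honours the tessedit_pageseg_mode parameter and that page1.png from the question exists in the working directory. ocr_options is a hypothetical helper written for this illustration, not part of the package.

```r
# Hypothetical helper: build the named list of Tesseract parameters.
ocr_options <- function(psm = 6, whitelist = NULL) {
  opts <- list(tessedit_pageseg_mode = psm)
  if (!is.null(whitelist)) opts$tessedit_char_whitelist <- whitelist
  opts
}

opts <- ocr_options(psm = 6)  # 6 = assume a single uniform block of text

# Only run OCR when the package and the rendered page are both available.
if (requireNamespace("tesseract", quietly = TRUE) && file.exists("page1.png")) {
  engine <- tesseract::tesseract(language = "eng", options = opts)
  cat(tesseract::ocr("page1.png", engine = engine))
}
```

If lines are still missed, other PSM values (for example 4 or 11) can be passed to the same helper; which one works best depends on the layout of the scan.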

1 Answer

I used Ubuntu 18.04 and tesseract 5.0.0-alpha-647-g4a00 for the commands below.

I downloaded one of the sample PDFs referenced by your code:

https://www.hindustancopper.com/Upload/Reports/0-637189269505122500-AnnualReport.pdf

Then I converted it to PNG with this command:

pdftoppm -png -singlefile 0-637189269505122500-AnnualReport.pdf report

Then, using GIMP, I rotated the image so that the text is level.

Then I ran this tesseract command to extract the text:

tesseract report.png stdout -l eng --oem 3 --psm 6 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-/ "

Here is the result:

HINDUSTAN COPPER LIMITED
A GOVT. OF INDIA ENTERPRISE
kK
Registered Head Office
Tamra Bhavan
1 Ashutosh Chowdhury Avenue
Kolkata - 700019
Ref: HCL/HO/MKTG/Cu-P/ 2019-2020
Date : 02-MAR-20
Sub: Basic Price of Cathodes and CC Rods for the month of MAR 2020.
The Basic Price of Copper Cathodes and CC Copper Rods for the month of MAR 2020 are as follows:
Basic Price Ex-Works /
Ex.Godown basis Rs. / MT
CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567
CATHODE FULL 434122
CONTINUOUS CAST COPPER WIRE ROD NS 439678
CONTINUOUS CAST COPPER WIRE ROD 16 MM 443056
COPPER CATHODE CUT 437856
CONTINUOUS CAST COPPER WIRE ROD 8 MM 440078
CONTINUOUS CAST COPPER WIRE ROD 19.6 MM 444546
CONTINUOUS CAST COPPER WIRE ROD 12.5 MM 441567
Note : Monthly LME CSP Avg. : 5686.45 Monthly Avg. Exchange Rate : 71.59
The price ruling on the date of delivery will be applicable. irrespective of the date of making financial arrangements i.e.
advance payment/opening of letter of credit. GST other statutory levies will be extra as applicable.
For purchase against usance Letter of Credit the interest rate chargeable shall be 10 per annum for the credit
period up to 90/60/30 days.
Customers may note that the price and interest rate is subject to change without prior notice. The price and interest rate
ruling on the date of delivery will be applicable irrespective of the date of their making financial arrangements. All bank
charges of negotiating bank will be borne by us.
ADD YAS
Zl Bl rTeri68
S Parashar
DGM Commercial
us2018
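
Once the OCR text is complete, the product and price lines can be pulled out of it with a regular expression. The sketch below is illustrative only: it assumes every price line consists of an upper-case product name followed by a six-digit price, as in the output above, and it uses a few lines copied from that output as sample input. The names ocr_lines, pattern, and prices are made up for this example.

```r
# Sample lines copied from the OCR result above.
ocr_lines <- c(
  "Basic Price Ex-Works /",
  "CONTINUOUS CAST COPPER WIRE ROD 11 MM 441567",
  "CATHODE FULL 434122",
  "CONTINUOUS CAST COPPER WIRE ROD 19.6 MM 444546",
  "Note : Monthly LME CSP Avg. : 5686.45 Monthly Avg. Exchange Rate : 71.59"
)

# Upper-case product name (letters, digits, dots, spaces), then a six-digit price.
pattern <- "^([A-Z][A-Z0-9. ]+?)\\s+(\\d{6})\\s*$"

# regexec() returns the full match plus both capture groups for matching lines
# and an empty result for everything else.
hits <- regmatches(ocr_lines, regexec(pattern, ocr_lines, perl = TRUE))
hits <- hits[lengths(hits) == 3]

prices <- data.frame(
  product = vapply(hits, `[`, character(1), 2),
  price   = as.integer(vapply(hits, `[`, character(1), 3)),
  stringsAsFactors = FALSE
)
print(prices)
```

Requiring exactly six digits also acts as a sanity check: a misread price such as the five-digit 44567 from the question would fail to match and could be flagged instead of silently accepted.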
  • Thanks for the reply. I have used another package magick to rotate and read the image and it has worked. Thanks again. – Ami Apr 20 '20 at 04:15
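
The exact magick calls used are not shown in this thread. One plausible version of that approach, offered only as a sketch, straightens the scan with image_deskew() and then passes it to image_ocr(), which relies on the tesseract package. deskew_and_ocr is a hypothetical helper, the threshold value is simply the function's documented default, and page1.png is assumed to be the file rendered in the question.

```r
# Hypothetical helper: straighten a scanned page, then OCR it.
deskew_and_ocr <- function(path, threshold = 40) {
  img <- magick::image_read(path)
  # image_deskew() estimates the skew angle and rotates the page to compensate;
  # image_rotate(img, degrees) could be used instead when the angle is known.
  img <- magick::image_deskew(img, threshold = threshold)
  magick::image_ocr(img, language = "eng")
}

# Only run when both packages and the rendered page are available.
if (requireNamespace("magick", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE) &&
    file.exists("page1.png")) {
  cat(deskew_and_ocr("page1.png"))
}
```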