Camelot scraping issue for Non English (Tamil) PDF

Question

Python Camelot works like a charm on English text, but it does not scrape Tamil words properly: the output contains characters that are close to the originals but partly garbled. I would like to understand what the issue is and how Camelot captures non-English data.

Work done so far: I am trying to scrape data from a PDF published by the Tamil Nadu Election Commission. Sample single page data here. For example, the word பெயர் is getting scraped as ெபயர்.
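To make the difference visible to readers who cannot read Tamil, the two strings can be listed code point by code point. This is a minimal sketch that assumes the intended word is பெயர் (as given in the expected header further down in this question) and that the scraped string is stored exactly as it displays above. The helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(word):
    """Return one 'U+XXXX NAME' string per code point in word."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in word]

# Assumed intended word: பெயர்
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"
# Word as it appears in the exported CSV: ெபயர்
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"

for label, word in (("expected", expected), ("scraped", scraped)):
    print(label)
    for line in describe(word):
        print("   " + line)
```

Under these assumptions both strings contain the same five code points. The only difference is that TAMIL VOWEL SIGN E (U+0BC6) comes before TAMIL LETTER PA (U+0BAA) in the scraped text, which matches the order in which the glyphs are drawn on the page rather than the order in which Unicode stores them.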

Reference: The CSV output for just the first table is shown below:

"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"

Code used for scraping:

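# coding: utf8
import camelot

# Read every page of the PDF and collect all detected tables
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

# Report how many tables were found, then export them as CSV
print("No of tables", tables.n)
tables.export('ariyalur.csv', f='csv')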

Addition / edit for clarity, as pointed out by @tripleee, for non-Tamil users. This is the header of the table.

The expected output is: வ.எண் பெயர்‌ பாலினம்‌ பெயர்‌ கட்சி வாக்குகள்‌ % முடிவு

But the output that came out is: "வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம் %",""
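Part of the mismatch looks like an ordering problem: the Tamil vowel signs that are drawn to the left of a consonant (U+0BC6, U+0BC7, U+0BC8) appear before that consonant in the scraped text, whereas Unicode stores them after it. The following is only a post-processing sketch under that assumption. `reorder_prefixed_vowels` is a hypothetical helper, not part of Camelot, and it has not been checked against the full PDF.

```python
import re
import unicodedata

# Vowel signs drawn to the LEFT of a consonant but stored AFTER it in Unicode:
# U+0BC6 (E), U+0BC7 (EE), U+0BC8 (AI).
PREFIXED_VOWEL_SIGNS = "\u0bc6\u0bc7\u0bc8"
# Tamil consonants occupy the range U+0B95..U+0BB9.
VISUAL_ORDER = re.compile("([" + PREFIXED_VOWEL_SIGNS + "])([\u0b95-\u0bb9])")

def reorder_prefixed_vowels(text):
    """Move a left-side vowel sign behind the consonant that follows it.

    NFC normalisation then merges pairs such as U+0BC6 + U+0BBE into the
    single two-part vowel sign U+0BCA.
    """
    swapped = VISUAL_ORDER.sub(r"\2\1", text)
    return unicodedata.normalize("NFC", swapped)

# Cells taken from the scraped CSV, written as escapes to avoid copy errors.
scraped_cells = [
    "\u0bc6\u0baa\u0baf\u0bb0\u0bcd",  # ெபயர்
    "\u0bc6\u0ba4\u0bbe\u0bc8\u0b95",  # ெதாைக
    "\u0b95\u0b9f\u0bcd\u0b9a",        # கட்ச
]
for cell in scraped_cells:
    print(cell, "->", reorder_prefixed_vowels(cell))
```

This repairs ெபயர் into பெயர், but it leaves கட்ச unchanged even though the expected header says கட்சி. Some vowel signs seem to be missing from the extracted text altogether, and reordering cannot bring those back.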

For those of us who can't read Tamil, can you supply the expected result as text, too? The only difference to my untrained eye is an unwanted joiner as the second character (assuming this is a left-to-right script). Could you please also spell out the Unicode code points, as these strings turn out to be hard to copy-paste correctly (at least on my iPhone)? — tripleee, Mar 13 '22 at 12:45
Are the emoji characters in your result an example of corruption, or part of the expected output? Perhaps you could provide the expected output for, say, the first couple of rows? (Probably trim down the sample to just a few rows, too.) — tripleee, Mar 13 '22 at 12:48
Attempting to copy/paste the problematic word out of the example PDF exhibits the same problem. I'm guessing the PDF extraction is actually fine, and the problem might be missing Tamil glyph support in some fonts, or in the font rendering engine. — tripleee, Mar 13 '22 at 12:54
Hi @tripleee, I find it offensive when I am not able to update it immediately when someone is trying to help me out. Please give me 30 minutes. I am not in front of my system. — sibi kanagaraj, Mar 13 '22 at 13:18
Sorry if I have somehow offended you. I'm afraid I don't precisely understand what I did wrong. Did you dislike my edit? — tripleee, Mar 13 '22 at 13:53
@tripleee No, no. What I meant was that I was not in a position to reply to you immediately. I blamed myself ;) — sibi kanagaraj, Mar 13 '22 at 14:13
@KJ What could be done to get it right? I am using Ubuntu 20 and viewing the outputs in either LibreOffice Calc or Gedit. — sibi kanagaraj, Mar 13 '22 at 14:26
@tripleee I have updated the question. Thank you for formatting it. — sibi kanagaraj, Mar 13 '22 at 14:27
