Python Camelot works like a charm on English text, but it does not scrape Tamil words properly. The output contains more or less junk characters that are close to, but not the same as, the expected characters. I would like to understand what the issue is and how Camelot captures non-English data.
Work done so far: I am trying to scrape data from a PDF from the Tamil Nadu Election Commission. Sample single-page data here. For example, the word பெயர் is getting scraped as ெபயர்.
Reference: the CSV output for just the first table is shown below:
"வ.
எண்.","ெபயர்","பானம்","தந்ைத /கணவர்
ெபயர்","கட்ச","ெபற்ற
வாக்கள்","சதவதம்
%",""
"1","இந்தராேதவ.ப","ெபண்","பழனச்சாம ஆர்","நா.த.க.","144","2.97","ைவப்த்
ெதாைக
இழப்"
"2","கீதா.வ","ெபண்","ேகாப ேஜா","அ.இ.அ.த..க","1355","27.97","ேதால்வ"
"3","சவகாம.ம","ெபண்","மேகஸ்வரன் ேக
ஆர்","ப.ேஜ.ப","341","7.04","ைவப்த்
ெதாைக
இழப்"
"4","ெசல்லம்மாள்.ஆ","ெபண்","ஆகம்","ேயட்ைச
ேவட்பாளர்","184","3.80","ைவப்த்
ெதாைக
இழப்"
"5","பாமத.","ெபண்","மார்","ேயட்ைச
ேவட்பாளர்","31","0.64","ைவப்த்
ெதாைக
இழப்"
"6","ஜனா ராண.வ","ெபண்","வஸ்வநாதன் எம்","த..க","2790","57.59","ெவற்ற"
Code used for scraping:
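# coding: utf8
import camelot

# Read every page of the PDF (default flavor: lattice)
tables = camelot.read_pdf('2.pdf', encoding='utf-8', pages='1-end')

# Report how many tables were detected
x = tables.n
print("No of tables", x)

# Write each detected table to its own CSV file
tables.export('ariyalur.csv', f='csv')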
Addition / Edit for clarity as pointed out by @tripleee
For non-Tamil readers, this is the header of the table.
The expected output is:
வ.எண் பெயர் பாலினம் பெயர் கட்சி வாக்குகள் % முடிவு
But the actual output is:
"வ.எண்.","ெபயர்","பானம்","தந்ைத /கணவர் ெபயர்","கட்ச","ெபற்ற வாக்கள்","சதவதம்
%",""
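To make the difference visible to readers who cannot read Tamil, here is a small sketch that lists the Unicode code points of one expected word and its scraped counterpart. It uses only the standard library; the two strings are my reading of the word பெயர் and its scraped form above, written as escapes, and the helper name `describe` is just for illustration. It only shows what differs and is not a fix for Camelot.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Assumed from the header above: the expected word and what Camelot returned.
expected = "\u0baa\u0bc6\u0baf\u0bb0\u0bcd"  # consonant first, then vowel sign
scraped = "\u0bc6\u0baa\u0baf\u0bb0\u0bcd"   # vowel sign first, then consonant

for label, word in (("expected", expected), ("scraped", scraped)):
    print(label)
    for code_point, name in describe(word):
        print("   ", code_point, name)

# True when both strings hold the same characters in a different order
print("same characters, different order:",
      expected != scraped and sorted(expected) == sorted(scraped))
```

For this word the characters are the same and only the position of the vowel sign changes. Other cells look different: for example பாலினம் comes out as பானம், so some characters seem to be dropped entirely rather than reordered.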