I am trying to extract a table from a PDF. tabula correctly identifies the table, but the table data is extracted as unusual Unicode characters rather than readable string data.

from tabula import read_pdf
df = read_pdf('https://watermark.silverchair.com/fsab153.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAs0wggLJBgkqhkiG9w0BBwagggK6MIICtgIBADCCAq8GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMOXfntjWl9L87SyaXAgEQgIICgMSxXbyEzl4Y3sDeaGncgcE9V93d46LWUAnMiKz0KtHAKJA1HpPuefZZzrhJlD_hNUzK9C4uWwF1EfAbe0aWG3c_sFLetD5kqOWXzuGARvCRWOvmAEKpgtx0Desj5MY9lH7Zp7XxbfLBLScOIK6X_qEZ3Low6GkQfm1iBCbVHUg9ueKxLaYghX--uHPqmx43RZHk8bAjoDdMDT9lPsVXqlZJkmS2UT6T3uzC1jPTz3eON93C5CaEpW4lG_zvzMMltlZZm04Zz1vWd7WsXa_Gvc1gwO1AwUNcBxrRrr7Af5U02SPMaFF8dL0cOqrpw24LPzrg8ibtBq9yKidnCM-B2z74goz41kzv2KNZoPYQLj5XYlbyTknoE-MDo6cq_tGMw7igxbsrKUbGzSGILZ-bDQAVTyGKlU1QudNbZd4lDOe36kdr6dlhWHe7aK6vQgczTOYvQ0v1G5HwouxwTO0WPVpxawld76AZLhathmV4fMmNAYFpZDOytT4YAZEj-jjkPvzJ7HeA_-7ifmtwqLiOSILbLuJgEhLQ5frm9YXSn3crSInflJEsMm6Bs8pE_5H8vdex2tXzL6ZmHiDkDMdB_YM8iOhJGdMfZWsCJ0TcrtZyWZv5t-M1NzhLutplX-mYInE1sXZSTLHcOD0YDhEeMPNJhdGvISG_IbwDfH9OKuGQ0x8UCoe2DPVKOd53PYghKf2Bk8q7tILs3WeHgItnvRbkevjYS287gh_5052TKJJbC8dYxkVlHn-JCsbaMfn_SlYSaWjOfVxvSHKsVlFj5ry-cfScH8ai1bra8LASgwg4y_vpNeeDiA0CwZaPy2l_TF1O_yFsaKItyDkCMJXqhjI', pages=3)[0]
df['Unnamed: 0']

screenshot

What is the correct way to extract the data in UTF-8 or ASCII?

Edit: Something on my system (Debian) is able to interpret these codes (see below), so the question is: how do I get this information out?

screenshot with whole DataFrame

zoof
  • All characters in the column `year` belong to [*Private Use Area* (Basic Multilingual Plane) block](https://en.wikipedia.org/wiki/Private_Use_Areas) `U+E000..U+F8FF`… – JosefZ Jan 05 '23 at 20:50
  • I updated question with an additional screenshot. – zoof Jan 06 '23 at 01:33
  • Using Excel was a good idea but didn't work. I'll give OCR a try -- even with some edits, better than manually entering it. – zoof Jan 06 '23 at 17:16
  • `pdftotext` seems to do ok but requires more work than I'd like. OCR was a disaster, at least without significant work tuning parameters. – zoof Jan 06 '23 at 18:53
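
The Private Use Area observation from the comments can be checked directly by listing the code points that fall in `U+E000..U+F8FF`. The following is only a minimal sketch: the helper name `private_use_chars` and the sample string are hypothetical and stand in for a cell taken from the extracted table.

```python
def private_use_chars(text):
    """Return the sorted, distinct BMP Private Use Area characters in text."""
    return sorted({ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF})


# Hypothetical cell content; in practice this would be a value from the DataFrame.
sample = '\uf6dc\uf641\uf640\uf63d'

found = private_use_chars(sample)
codes = [f'U+{ord(ch):04X}' for ch in found]
print(codes)
```

Listing the distinct code points this way shows how many different glyphs need to be identified before a translation table can be written.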

1 Answer

After trying various suggestions from the comments, I ended up creating a dictionary that maps the Private Use Area characters to the digits they represent. I wrote the extracted table to a CSV file and applied the map to get readable data.

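import pandas as pd

# Map each Private Use Area character used by the PDF's font to its ASCII digit.
utf_map = {'\uf639':'0', '\uf6dc':'1', '\uf63a':'2', '\uf63b':'3', '\uf63c':'4',
           '\uf63d':'5', '\uf63e':'6', '\uf63f':'7', '\uf640':'8', '\uf641':'9'}

# Read the CSV that was written from the extracted table.
with open('cod_catch.csv', encoding='utf-8') as f:
    string = f.read()

# Drop spaces, translate mapped characters, and keep everything else as is.
new_string = ''
for ch in string:
    if ch == ' ':
        pass
    elif ch in utf_map:
        new_string += utf_map[ch]
    else:
        new_string += ch

with open('cod_catch_translated.csv', 'w', encoding='utf-8') as f:
    f.write(new_string)

cod_catch = pd.read_csv('cod_catch_translated.csv')

print(cod_catch)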
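
The same mapping can probably be applied to the DataFrame directly with `str.translate`, which avoids the round trip through a CSV file. This is only a sketch and not what I actually ran: the helper `translate_frame` and the small sample frame are hypothetical, and column headers are left untranslated.

```python
import pandas as pd

utf_map = {'\uf639': '0', '\uf6dc': '1', '\uf63a': '2', '\uf63b': '3', '\uf63c': '4',
           '\uf63d': '5', '\uf63e': '6', '\uf63f': '7', '\uf640': '8', '\uf641': '9'}

# Translation table: mapped characters become digits, spaces are deleted
# (matching the character loop above).
table = str.maketrans({**utf_map, ' ': None})


def translate_frame(df):
    """Translate every string cell in df; non-string cells are left unchanged."""
    return df.apply(
        lambda col: col.map(lambda v: v.translate(table) if isinstance(v, str) else v)
    )


# Hypothetical stand-in for the table returned by read_pdf.
raw = pd.DataFrame({
    'year': ['\uf6dc\uf641\uf640\uf63d', '\uf63a\uf639\uf639\uf639'],
    'catch': ['\uf6dc \uf63a\uf63b', None],
})

result = translate_frame(raw)
print(result)
```

The translated columns still hold strings, so a numeric conversion such as `pd.to_numeric` would be needed afterwards, whereas `pd.read_csv` in the version above infers numeric types on its own.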

Many thanks for all the suggestions!

zoof