2

I want to take a PDF file as input and produce a CSV file as output, so that all the textual data in the PDF ends up in the CSV file. I don't understand how to do this. I've tried but couldn't get it to work, so I'd appreciate help as soon as possible.

What I've done is use a library called tabula-py, which converts PDF to CSV. It does create a CSV file, but none of the contents of the PDF are copied into it.

Here's the code:

from tabula import convert_into,read_pdf
import tabula
df = tabula.read_pdf("crimestory.pdf", spreadsheet=True, 
                     pages='all',output_format="csv")
df.to_csv('crimestoryy.csv', index=False)

The output should be a CSV file containing the data. What I am getting is a blank CSV file.

Selcuk
cerebral_assassin

2 Answers

2

I found an answer to this question on my own. To tackle the issue, I converted the PDF file to a text file, and then converted that text file to a CSV file. Here's my code.

conversion.py

import csv
import os.path

import pdftotext

save_path = "/home/mayureshk/PycharmProjects/NLP/"
completeName_in = os.path.join(save_path, 'crimestory' + '.txt')
completeName_out = os.path.join(save_path, 'crimestoryycsv' + '.csv')

# Load your PDF
with open("crimestory.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file (the same path that is read back below).
with open(completeName_in, 'w') as f:
    f.write("\n\n".join(pdf))

# Read the text file line by line, split each line on commas,
# and write every line out as one CSV row.
with open(completeName_in, newline='') as file1, \
        open(completeName_out, 'w', newline='') as file2:
    In_text = csv.reader(file1, delimiter=',')
    out_csv = csv.writer(file2)
    out_csv.writerows(In_text)
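To see what the text-to-CSV step does: `csv.reader` with `delimiter=','` splits each line of the text at commas, so a prose sentence becomes a row with one cell per comma-separated fragment, and a line without commas becomes a single-cell row. Here is a minimal sketch of that behavior using only the standard library. The sample text and the helper names `text_to_csv_rows` and `rows_to_csv_text` are made up for illustration and are not part of the script above.

```python
import csv
import io


def text_to_csv_rows(text, delimiter=","):
    """Split each line of plain text on `delimiter`, the way csv.reader does."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # csv.reader yields an empty list for blank lines; drop those here.
    return [row for row in reader if row]


def rows_to_csv_text(rows):
    """Serialize rows back to CSV text with csv.writer."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for text extracted from a PDF.
sample = "It was a dark night, and the rain fell.\n\nNo commas here\n"
rows = text_to_csv_rows(sample)
csv_text = rows_to_csv_text(rows)
```

The result is a CSV of text lines, not a reconstructed table, which is probably why this route works for a story-like PDF that contains no tables for tabula to detect.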
Kalana
cerebral_assassin
1

Try this; I hope it works.

import tabula

# convert PDF into CSV
tabula.convert_into("crimestory.pdf", "crimestory.csv", output_format="csv", pages='all')

or

df = tabula.read_pdf("crimestory.pdf", encoding='utf-8', spreadsheet=True, pages='all')
df.to_csv('crimestory.csv', encoding='utf-8')

or

from tabula import read_pdf
df = read_pdf("crimestory.pdf")
df
#make sure df displays your pdf contents in the output

from tabula import convert_into
convert_into("crimestory.pdf", "crimestory.csv", output_format="csv")
!cat crimestory.csv
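Because the conversion can finish without an error and still leave an empty file (as reported in the comments), it may help to check the output before relying on it. The following is a minimal sketch using only the standard library; `csv_has_data` is a hypothetical helper, not part of tabula-py, and the temporary files stand in for a real `crimestory.csv`.

```python
import csv
import os
import tempfile


def csv_has_data(path):
    """Return True if the CSV at `path` has at least one non-blank cell."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if any(cell.strip() for cell in row):
                return True
    return False


# Demonstration with two throwaway files: one blank, one with content.
with tempfile.TemporaryDirectory() as tmp:
    blank_path = os.path.join(tmp, "blank.csv")
    filled_path = os.path.join(tmp, "filled.csv")

    with open(blank_path, "w", newline="") as f:
        f.write("\n")
    with open(filled_path, "w", newline="") as f:
        csv.writer(f).writerow(["suspect", "location"])

    blank_result = csv_has_data(blank_path)
    filled_result = csv_has_data(filled_path)
```

If the check returns False for the file tabula wrote, the PDF most likely contains no tables that tabula can detect, and a plain text extraction may be the better route.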
Kalana
  • The first one again produces an empty CSV file, whereas the second one raises AttributeError: 'NoneType' object has no attribute 'to_csv'. @Kalana Eranda Jayasuriya – cerebral_assassin Sep 23 '19 at 05:55
  • Can you rename the output file to something else and run the first code again? – Kalana Sep 23 '19 at 05:59
  • I did, but the CSV file still has no contents. It's a blank CSV file. – cerebral_assassin Sep 23 '19 at 06:07
  • If that didn't work, change `pages='all'` to `pages='1-'` and run it again. – Kalana Sep 23 '19 at 06:08
  • I have added another code sample to my answer; please check it. – Kalana Sep 23 '19 at 06:18
  • The code works, but it prints some INFO messages. Here's the output on my terminal: INFO: OpenType Layout tables used in font Times New Roman,Italic are not implemented in PDFBox and will be ignored Sep 23, 2019 11:55:59 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 INFO: OpenType Layout tables used in font Arial are not implemented in PDFBox and will be ignored Sep 23, 2019 11:55:59 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 INFO: OpenType Layout tables used in font Times New Roman,Italic are not implemented in PDFBox and will be ignored – cerebral_assassin Sep 23 '19 at 06:27
  • Was `crimestory.csv` generated when you used the third code sample? – Kalana Sep 23 '19 at 06:45
  • It generated a blank CSV file, so as a workaround I tried converting the PDF to a text file and then the text file to a CSV file, and it worked! – cerebral_assassin Sep 23 '19 at 06:55
  • Post that solution here as an answer. It will help a lot of developers who have faced the same issue. – Kalana Sep 23 '19 at 07:08