
I'm writing a Python program to extract tables from Excel sheets and PDFs. Currently, I'm using a different library for each file type: xlrd for Excel sheets and pdfminer for PDFs.

I'm wondering whether there is a generic approach to extracting tables from any type of file (xls, pdf, csv, Word, etc.). Since I'm planning to expand the list of supported file types, writing a separate function for each file type would be cumbersome.

P.S. I came across PETL while looking for solutions, but I could not find any Excel/PDF extraction examples and could not fully understand the documentation. Would PETL fulfill my requirement? If so, I would really appreciate an example. Thank you.

Parag
  • Excel and CSV files should be readable by most libraries; I know pandas and numpy support it. Extracting tables from Word documents, and particularly from PDFs, is trickier, since in theory you can have an entire textbook with tons of tables and need to parse out specific ones. If you already have a working technique for each format, you can write one function that delegates to the correct one based on the file extension, but I very much doubt you will find a pre-made "one size fits all" solution that includes PDF scraping. – Tadhg McDonald-Jensen Jul 08 '20 at 17:01
  • There is no generic approach since there is no generic table. All of them are specific, and require specific coding. – alfadog67 Jul 08 '20 at 17:03
  • @Parag I haven't used it, but from a quick look at the documentation it seems to be optimized for loading large CSV files and expressing them in different formats without having to reload the file multiple times. I'm pretty sure that isn't the kind of functionality you need. – Tadhg McDonald-Jensen Jul 08 '20 at 19:08
  • @TadhgMcDonald-Jensen, I accidentally deleted my previous comment thanking you, so thanks again for taking the time to help me. – Parag Jul 08 '20 at 19:12
  • You say: "I'm planning to expand the list of supported file types." May I ask why? What is your application that requires such extensive support for file types? Frequently, a program that just tells the user their garbage input isn't allowed is better than one that tries to parse the garbage. – Tadhg McDonald-Jensen Jul 08 '20 at 19:13
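
The delegation idea from the first comment (one entry point that picks an extractor by file extension) could be sketched roughly as below. This is only an illustration under assumptions: the names `register`, `extract_tables`, `extract_csv`, `extract_xls` and `extract_pdf` are hypothetical, a "table" is assumed to be a list of rows of strings, and the Excel and PDF extractors are left as stubs where the existing xlrd and pdfminer code would be plugged in. Only the CSV branch is implemented, using the standard library.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# Assumption for this sketch: a table is a list of rows, each row a list of cells.
Table = List[List[str]]
Extractor = Callable[[Path], List[Table]]

# Hypothetical registry mapping a lowercase file extension to its extractor.
_EXTRACTORS: Dict[str, Extractor] = {}


def register(*extensions: str) -> Callable[[Extractor], Extractor]:
    """Decorator that associates one or more extensions with an extractor."""
    def decorator(func: Extractor) -> Extractor:
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".csv")
def extract_csv(path: Path) -> List[Table]:
    # A CSV file holds exactly one table.
    with path.open(newline="") as handle:
        return [list(csv.reader(handle))]


@register(".xls")
def extract_xls(path: Path) -> List[Table]:
    # Stub: the existing xlrd-based extraction would go here.
    raise NotImplementedError("plug in the existing xlrd code")


@register(".pdf")
def extract_pdf(path: Path) -> List[Table]:
    # Stub: the existing pdfminer-based extraction would go here.
    raise NotImplementedError("plug in the existing pdfminer code")


def extract_tables(path) -> List[Table]:
    """Single entry point: delegate to the extractor registered for the extension."""
    path = Path(path)
    try:
        extractor = _EXTRACTORS[path.suffix.lower()]
    except KeyError:
        # Rejecting unknown input outright, as the last comment suggests.
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return extractor(path)
```

With this layout, supporting a new format means adding one decorated function rather than touching the callers; the format-specific parsing still has to be written per file type, as the comments point out.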

0 Answers