I need to transfer some PDF table content to Excel. I used the PyMuPDF module to write the PDF content to a .txt file, which is easier to work with, and that part worked.

Here is the PDF content that I need to put into the Excel cells.

As you can see in the .txt file, I was able to transfer each column and row of the PDF. They are displayed sequentially.

- I need some way to read the .txt lines sequentially so I can put each line into a cell of an .xlsx file.

- I need some way to set up triggers that start and stop the sequential reading, and to discard lines I don't want. For example: start reading after a specific word, and stop reading when another word is reached. These documents have headers and other irrelevant information that is also transcribed into the .txt file, so I need to ignore part of its content and gather only the useful information for the .xlsx cells.
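The two requirements above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for these particular documents: the names `useful_lines` and `write_column`, the marker words, and the file names are all hypothetical. It also assumes the third-party `openpyxl` package for writing the workbook, since xlrd reads spreadsheets rather than writing them.

```python
from typing import Iterable, Iterator


def useful_lines(lines: Iterable[str], start_marker: str, stop_marker: str,
                 skip: Iterable[str] = ()) -> Iterator[str]:
    """Yield stripped lines found after start_marker and before stop_marker."""
    skip_set = set(skip)
    reading = False
    for raw in lines:
        line = raw.strip()  # also drops a trailing "\n" or "\r\n"
        if not reading:
            # Stay idle until the start trigger appears; the trigger line
            # itself is not yielded.
            reading = start_marker in line
            continue
        if stop_marker in line:
            break  # stop trigger reached, ignore the rest of the file
        if not line or line in skip_set:
            continue  # throw away empty lines and known unwanted lines
        yield line


def write_column(values: Iterable[str], xlsx_path: str) -> None:
    """Write each value into column A of a new workbook, one row per value."""
    from openpyxl import Workbook  # assumption: pip install openpyxl

    workbook = Workbook()
    sheet = workbook.active
    for row, value in enumerate(values, start=1):
        sheet.cell(row=row, column=1, value=value)
    workbook.save(xlsx_path)
```

With placeholder names, usage might look like `with open("nota.txt", encoding="utf-8") as f: write_column(useful_lines(f, "START WORD", "STOP WORD"), "nota.xlsx")`. Matching the triggers with `in` rather than `==` is a design choice so that a trigger still fires when the word shares a line with other text.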

*I'm using the xlrd library and would like to know how I can work with it here. (optional)

I don't know if it is a problem, but when I counted the lines with a loop, it returned only 15. The document has 568 lines in total. It appeared to count only the completely empty ones.

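    count = 0
    # Iterate over the open file object. Looping over nome_arquivo_nota
    # (the file name string) yields its characters, not the file's lines.
    with open(nome_arquivo_nota, 'r') as arquivo:
        for line in arquivo:
            count += 1
    print(count)

My first attempt looped over nome_arquivo_nota itself and printed 15.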

  • 15 lines, but your text file shows more. What are the EOLs of your text file? Perhaps open it in an editor such as Notepad++ and click on "Show All Characters": is it CR, CR LF, or only LF? Or does a Python expert know what makes Python think that there is a new line? – BitLauncher Jun 26 '20 at 20:27
  • @BitLauncher The EOL at the end of each line shows "CR LF". I don't know what it means. – Matheus Oliveira Jun 28 '20 at 21:59
  • Here is an introduction to CR and LF: https://en.wikipedia.org/wiki/Newline – BitLauncher Jun 30 '20 at 18:59
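
As a quick check of the line-ending question raised in these comments, the sketch below writes a small temporary file with CR LF endings and counts its lines. It is only an illustration: the helper `count_lines` and the sample content are made up. It suggests that CR LF endings on their own would not make Python's text mode miss lines, since text mode treats `\r\n` as a single line break by default.

```python
import os
import tempfile


def count_lines(path: str) -> int:
    """Count lines by iterating over the open file object, not the path."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Write three lines with Windows-style CR LF endings, byte for byte.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\r\nsecond\r\nthird\r\n")

line_total = count_lines(tmp.name)
os.remove(tmp.name)  # clean up the temporary file
print(line_total)
```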
