I'm using Python 3 on an Ubuntu machine. I have a PDF file, "Ativos_Fevereiro_2018_servidores.pdf", with 6,041 pages. The file is here: https://drive.google.com/file/d/1P8kF0gUOVls6sOGed4R0C2PlVF5RFtU6/view?usp=sharing
Each page has two lines of text at the top, followed by a table with a header and two columns. Each table has 36 rows, fewer on the last page.
At the bottom of each page, after the table, there is one more line of text.
I want to create a CSV from this PDF using only the tables on each page, ignoring the text before and after them.
To avoid Java memory errors, I thought I'd split the file into groups of 300 pages. I did so with tabula-py:
    import tabula
    import pandas as pd
    dfs = []
    for i in range(1,6041, 300):
        if i != 1:
            i = i + 1
        i2 = i + 300
        if i2 > 6041:
            i2 = 6041
        print(i)
        print(i2)
        try:
            df = tabula.read_pdf("Ativos_Fevereiro_2018.pdf", encoding='latin-1', spreadsheet=True, pages='i-i2', header=0)
            dfs.append(df)
            print('Page ', len(df), ' parsed.')
        except:
            print('Error on page: ', i)
    output = pd.concat(dfs)
    output.to_csv('servidores_rj_ativos_fev_18.csv', encoding='utf-8', index=False)
But the page range I built is wrong. Every group fails, and this is the output:
1
301
Error: Syntax error in page range specification
Error on page: 1
302
602
...
Error: Syntax error in page range specification
Error on page: 5702
6002
6041
Error: Syntax error in page range specification
Error on page: 6002
Traceback (most recent call last):
  File "roboseguranca_pdftocsv.py", line 26, in <module>
    output = pd.concat(dfs)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/home/reinaldo/Documentos/Code/intercept/seguranca/lib/python3.6/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate
Please, how can I correct the range error?
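One thing that stands out in the loop is that `pages='i-i2'` is a plain string literal, so the text "i-i2" is what gets passed along, not the values of `i` and `i2`. That would explain the "Syntax error in page range specification" on every group, and the empty `dfs` list that makes `pd.concat` fail. As a rough sketch of what the 300-page grouping seems to be aiming for, the range strings could be built like this. The helper name `page_ranges` is made up for illustration, and the block only builds the strings; it does not call tabula.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 'start-end' strings covering pages 1..total_pages with no overlap.

    Hypothetical helper, not part of tabula-py.
    """
    for start in range(1, total_pages + 1, chunk_size):
        # Clamp the last group so it never runs past the final page.
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


ranges = list(page_ranges(6041, 300))
print(ranges[0], ranges[1], ranges[-1])  # 1-300 301-600 6001-6041
```

Each of these strings could then be passed as the `pages` argument of `tabula.read_pdf` in place of the literal `'i-i2'`, assuming the rest of the call stays as it is. Note that this grouping also avoids the one-page overlap between consecutive groups (301 and 302, 602 and 603, and so on) that the original arithmetic produces.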